Unverified commit 734d8991, authored by Jeff Rasley and committed via GitHub (parent b652395e).
---
# Refer to the following link for an explanation of each parameter:
# http://releases.llvm.org/8.0.0/tools/clang/docs/ClangFormatStyleOptions.html
Language: Cpp
# BasedOnStyle: Google
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
AlignEscapedNewlines: Left
AlignOperands: true
AlignTrailingComments: true
AllowAllParametersOfDeclarationOnNextLine: false
AllowShortBlocksOnASingleLine: true
AllowShortCaseLabelsOnASingleLine: true
AllowShortFunctionsOnASingleLine: All
AllowShortIfStatementsOnASingleLine: true
AllowShortLoopsOnASingleLine: true
# This is deprecated
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: true
AlwaysBreakTemplateDeclarations: Yes
BinPackArguments: false
BinPackParameters: false
BraceWrapping:
  AfterClass: false
  AfterControlStatement: false
  AfterEnum: false
  AfterFunction: false
  AfterNamespace: false
  AfterObjCDeclaration: false
  AfterStruct: false
  AfterUnion: false
  AfterExternBlock: false
  BeforeCatch: false
  BeforeElse: false
  IndentBraces: false
  # Disable the splits below; otherwise they just add to the vertical length of source files!
  SplitEmptyFunction: false
  SplitEmptyRecord: false
  SplitEmptyNamespace: false
BreakBeforeBinaryOperators: None
BreakBeforeBraces: WebKit
BreakBeforeInheritanceComma: false
BreakInheritanceList: BeforeColon
BreakBeforeTernaryOperators: true
BreakConstructorInitializersBeforeComma: false
BreakConstructorInitializers: BeforeColon
BreakAfterJavaFieldAnnotations: false
BreakStringLiterals: true
ColumnLimit: 100
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: true
# Keep the two values below equal to `IndentWidth` so everything stays uniform
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
ExperimentalAutoDetectBinPacking: false
FixNamespaceComments: true
ForEachMacros:
  - foreach
  - Q_FOREACH
  - BOOST_FOREACH
IncludeBlocks: Preserve
IncludeCategories:
  - Regex: '^<ext/.*\.h>'
    Priority: 2
  - Regex: '^<.*\.h>'
    Priority: 1
  - Regex: '^<.*'
    Priority: 2
  - Regex: '.*'
    Priority: 3
IncludeIsMainRegex: '([-_](test|unittest))?$'
IndentCaseLabels: true
IndentPPDirectives: None
IndentWidth: 4
IndentWrappedFunctionNames: false
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Never
ObjCBlockIndentWidth: 4
ObjCSpaceAfterProperty: false
ObjCSpaceBeforeProtocolList: true
PenaltyBreakAssignment: 4
PenaltyBreakBeforeFirstCallParameter: 1
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
PenaltyBreakString: 1000
PenaltyBreakTemplateDeclaration: 10
PenaltyExcessCharacter: 1000000
PenaltyReturnTypeOnItsOwnLine: 200
PointerAlignment: Left
RawStringFormats:
  - Language: Cpp
    Delimiters:
      - cc
      - CC
      - cpp
      - Cpp
      - CPP
      - 'c++'
      - 'C++'
    CanonicalDelimiter: ''
  - Language: TextProto
    Delimiters:
      - pb
      - PB
      - proto
      - PROTO
    EnclosingFunctions:
      - EqualsProto
      - EquivToProto
      - PARSE_PARTIAL_TEXT_PROTO
      - PARSE_TEST_PROTO
      - PARSE_TEXT_PROTO
      - ParseTextOrDie
      - ParseTextProtoOrDie
    CanonicalDelimiter: ''
    BasedOnStyle: google
# Enabling comment reflow can mangle the formatting of doxygen comments!
ReflowComments: true
SortIncludes: true
SortUsingDeclarations: true
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCpp11BracedList: false
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatements
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 2
SpacesInAngles: false
SpacesInContainerLiterals: true
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp11
StatementMacros:
  - Q_UNUSED
  - QT_REQUIRE_VERSION
# Keep this consistent with `IndentWidth`, even for people who use tabs for indentation!
TabWidth: 4
UseTab: Never
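For reference, here is a small, purely illustrative C++ snippet (not part of the DeepSpeed sources) written the way the configuration above formats code: 4-space indents, a 100-column limit, left-aligned pointers, and WebKit-style braces.

// Hypothetical example only, illustrating a few of the options above.
class Accumulator {  // brace wrapping: class braces stay attached (all AfterX options are false)
public:
    int* total() { return &sum_; }  // PointerAlignment: Left; short functions on a single line

    void add(int value)
    {  // BreakBeforeBraces: WebKit still breaks before multi-line function-definition braces
        if (value > 0) { sum_ += value; }  // AllowShortBlocksOnASingleLine: true
    }

private:  // AccessModifierOffset: -4 pulls access specifiers back to the class column
    int sum_ = 0;
};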
@@ -17,3 +17,9 @@ repos:
   hooks:
   - id: yapf
     exclude: "examples/"
+- repo: https://gitlab.com/daverona/pre-commit-cpp
+  rev: 0.6.0
+  hooks:
+  - id: clang-format  # formatter of C/C++ code based on a style guide: LLVM, Google, Chromium, Mozilla, and WebKit available
+    args: []
-Subproject commit 274787a189b265814ed75dd5ddeae2dce026ea88
+Subproject commit 10b60d107f172d1331924e1dcb70f4da3ebf5dd1
@@ -18,8 +18,8 @@ except Exception as err:
     raise err
 try:
-    fused_lamb = importlib.import_module('fused_lamb_cuda')
-    print('deepspeed kernels successfully installed')
+    fused_lamb = importlib.import_module('deepspeed_lamb_cuda')
+    print('deepspeed fused lamb kernels successfully installed')
 except Exception as err:
     raise err
@@ -28,3 +28,9 @@ try:
     print("using old-style apex")
 except ImportError:
     print("using new-style apex")
+try:
+    ds_transformer = importlib.import_module('deepspeed_transformer_cuda')
+    print('deepspeed transformer kernels successfully installed')
+except Exception as err:
+    raise err
/* Copyright 2019 The Microsoft DeepSpeed Team */
#include <torch/extension.h>
// CUDA forward declaration
void fused_lamb_cuda(at::Tensor & p, at::Tensor & p_copy, at::Tensor & m, at::Tensor & v, at::Tensor & g,
float lr, float beta1, float beta2, float max_coeff, float min_coeff, float eps, float grad_scale, int step, int mode, int bias_correction, float decay,
at::Tensor & w_l2_i, at::Tensor & u_l2_i, at::Tensor & lamb_coeff_val );
#define CHECK_CUDA(x) AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) AT_ASSERTM(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
// C++ interface
at::Tensor lamb(at::Tensor & p, at::Tensor & p_copy, at::Tensor & m, at::Tensor & v, at::Tensor & g, float lr, float beta1, float beta2, float max_coeff, float min_coeff, float eps, float grad_scale, int step, int mode, int bias_correction, float decay) {
CHECK_INPUT(p);
if (p_copy.numel() > 0) CHECK_INPUT(p_copy);
CHECK_INPUT(m);
CHECK_INPUT(v);
CHECK_INPUT(g);
int64_t num_elem = p.numel();
AT_ASSERTM(m.numel() == num_elem, "number of elements in m and p tensors should be equal");
AT_ASSERTM(v.numel() == num_elem, "number of elements in v and p tensors should be equal");
AT_ASSERTM(g.numel() == num_elem, "number of elements in g and p tensors should be equal");
AT_ASSERTM(p_copy.numel() == num_elem || p_copy.numel() == 0, "number of elements in p_copy and p tensors should be equal, or p_copy should be empty");
// Intermediate buffer for the weight L2-norm reduction.
// Make sure that the threads per block is at least 512 during the kernel launch, otherwise the behaviour is unexpected.
at::Tensor w_l2_i = at::empty({512}, p.options().dtype(
    p.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : p.type().scalarType()));
// Intermediate buffer for the update L2-norm reduction.
// Make sure that the threads per block is at least 512 during the kernel launch, otherwise the behaviour is unexpected.
at::Tensor u_l2_i = at::empty({512}, p.options().dtype(
    p.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : p.type().scalarType()));
at::Tensor lamb_coeff_val = at::empty({1}, p.options().dtype(
    p.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : p.type().scalarType()));
fused_lamb_cuda(p, p_copy, m, v, g, lr, beta1, beta2, max_coeff, min_coeff, eps, grad_scale, step, mode, bias_correction, decay, w_l2_i, u_l2_i, lamb_coeff_val);
return lamb_coeff_val;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("lamb", &lamb, "Adam optimized CUDA implementation with LAMB.");
}
#pragma once
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif
#ifdef _WIN32
class Stopwatch {
private:
double m_total_time;
LARGE_INTEGER m_start_time;
public:
Stopwatch() { m_total_time = 0.0; }
~Stopwatch() {}
void Reset() { m_total_time = 0.0; }
void Start() { QueryPerformanceCounter(&m_start_time); }
void Restart()
{
m_total_time = 0.0;
QueryPerformanceCounter(&m_start_time);
}
void Stop()
{
LARGE_INTEGER frequency;
LARGE_INTEGER stop_time;
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&stop_time);
m_total_time +=
((double)(stop_time.QuadPart - m_start_time.QuadPart) / (double)frequency.QuadPart);
}
double GetTimeInSeconds() { return m_total_time; }
};
#else
class Stopwatch {
private:
double m_total_time;
struct timespec m_start_time;
bool m_is_started;
public:
Stopwatch()
{
m_total_time = 0.0;
m_is_started = false;
}
~Stopwatch() {}
void Reset() { m_total_time = 0.0; }
void Start()
{
clock_gettime(CLOCK_MONOTONIC, &m_start_time);
m_is_started = true;
}
void Restart()
{
m_total_time = 0.0;
clock_gettime(CLOCK_MONOTONIC, &m_start_time);
m_is_started = true;
}
void Stop()
{
if (m_is_started) {
m_is_started = false;
struct timespec end_time;
clock_gettime(CLOCK_MONOTONIC, &end_time);
m_total_time += (double)(end_time.tv_sec - m_start_time.tv_sec) +
(double)(end_time.tv_nsec - m_start_time.tv_nsec) / 1e9;
}
}
double GetTimeInSeconds()
{
if (m_is_started) {
Stop();
Start();
}
return m_total_time;
}
};
#endif
#ifndef __TIMER_H__
#define __TIMER_H__
#include <cuda_runtime.h>
#include <chrono>
#include "cuda.h"
class GPUTimer {
cudaEvent_t start, stop;
public:
GPUTimer()
{
cudaEventCreate(&start);
cudaEventCreate(&stop);
}
~GPUTimer()
{
cudaEventDestroy(start);
cudaEventDestroy(stop);
}
inline void Record() { cudaEventRecord(start); }
inline void Elapsed(float& time_elapsed)
{
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time_elapsed, start, stop);
}
};
class CPUTimer {
std::chrono::high_resolution_clock::time_point start;
public:
CPUTimer() : start(std::chrono::high_resolution_clock::now()) {}
inline void Reset() { start = std::chrono::high_resolution_clock::now(); }
inline float Elapsed()
{
auto temp = start;
start = std::chrono::high_resolution_clock::now();
return (float)(std::chrono::duration_cast<std::chrono::microseconds>(start - temp).count() /
1e3);
}
};
#endif
#pragma once
#include <ATen/cuda/CUDAContext.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <iostream>
#include <vector>
#include "cublas_v2.h"
#include "cuda.h"
#include "curand.h"
#include "gemm_test.h"
#define WARP_SIZE 32
#define CUDA_CHECK(callstr) \
{ \
cudaError_t error_code = callstr; \
if (error_code != cudaSuccess) { \
std::cerr << "CUDA error " << error_code << " at " << __FILE__ << ":" << __LINE__; \
assert(0); \
} \
}
#define CUDA_1D_KERNEL_LOOP(i, n) \
for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); i += blockDim.x * gridDim.x)
#define CUDA_2D_KERNEL_LOOP(i, n, j, m) \
for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); i += blockDim.x * gridDim.x) \
for (size_t j = blockIdx.y * blockDim.y + threadIdx.y; j < (m); j += blockDim.y * gridDim.y)
#define DS_CUDA_NUM_THREADS 512
#define DS_MAXIMUM_NUM_BLOCKS 4096
inline int DS_GET_BLOCKS(const int N)
{
return std::max(
std::min((N + DS_CUDA_NUM_THREADS - 1) / DS_CUDA_NUM_THREADS, DS_MAXIMUM_NUM_BLOCKS),
// Use at least 1 block, since CUDA does not allow empty block
1);
}
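DS_GET_BLOCKS and CUDA_1D_KERNEL_LOOP are the usual grid-stride-loop pairing: the helper clamps the grid to DS_MAXIMUM_NUM_BLOCKS, and the loop macro lets each thread cover the leftover elements. A minimal sketch of a kernel written against these macros (the kernel itself is hypothetical, not one of the launchers in this commit):

// Hypothetical element-wise kernel built on the helpers above.
template <typename T>
__global__ void scale_kernel(T* out, const T* in, T factor, size_t n)
{
    // Each thread processes i, i + blockDim.x * gridDim.x, ... so the result stays correct
    // even when DS_GET_BLOCKS has clamped the grid size.
    CUDA_1D_KERNEL_LOOP(i, n) { out[i] = in[i] * factor; }
}

template <typename T>
void launch_scale(T* out, const T* in, T factor, int n, cudaStream_t stream)
{
    scale_kernel<T><<<DS_GET_BLOCKS(n), DS_CUDA_NUM_THREADS, 0, stream>>>(out, in, factor, n);
}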
class Context {
public:
Context() : _workspace(nullptr), _seed(42), _curr_offset(0)
{
curandCreateGenerator(&_gen, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(_gen, 123);
if (cublasCreate(&_cublasHandle) != CUBLAS_STATUS_SUCCESS) {
auto message = std::string("Fail to create cublas handle.");
std::cerr << message << std::endl;
throw std::runtime_error(message);
}
}
virtual ~Context()
{
cublasDestroy(_cublasHandle);
cudaFree(_workspace);
}
static Context& Instance()
{
static Context _ctx;
return _ctx;
}
void GenWorkSpace(size_t size)
{
if (!_workspace) {
assert(_workspace == nullptr);
cudaMalloc(&_workspace, size);
} else if (_workSpaceSize != size) {
cudaFree(_workspace);
cudaMalloc(&_workspace, size);
}
_workSpaceSize = size;
}
void* GetWorkSpace() { return _workspace; }
curandGenerator_t& GetRandGenerator() { return _gen; }
cudaStream_t GetCurrentStream()
{
// get current pytorch stream.
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
return stream;
}
cublasHandle_t GetCublasHandle() { return _cublasHandle; }
std::pair<uint64_t, uint64_t> IncrementOffset(uint64_t offset_inc)
{
uint64_t offset = _curr_offset;
_curr_offset += offset_inc;
return std::pair<uint64_t, uint64_t>(_seed, offset);
}
void SetSeed(uint64_t new_seed) { _seed = new_seed; }
void TestGemmFP16(bool test_gemm, int batch_size, int seq_len, int head_num, int size_per_head)
{
// avoid rerun.
if (_gemm_algos.size() > 0) return;
if (test_gemm) {
cublasHandle_t handle = GetCublasHandle();
std::unique_ptr<GemmTest<__half>> test_qkv_fw(
new GemmTest<__half>(batch_size * seq_len, // M
head_num * size_per_head, // N
head_num * size_per_head, // K
CUBLAS_OP_T,
CUBLAS_OP_N,
handle));
std::unique_ptr<GemmTest<__half>> test_inter(
new GemmTest<__half>(batch_size * seq_len, // M
4 * head_num * size_per_head, // N
head_num * size_per_head, // K
CUBLAS_OP_T,
CUBLAS_OP_N,
handle));
std::unique_ptr<GemmTest<__half>> test_output(
new GemmTest<__half>(batch_size * seq_len, // M
head_num * size_per_head, // N
4 * head_num * size_per_head, // K
CUBLAS_OP_T,
CUBLAS_OP_N,
handle));
std::unique_ptr<StridedGemmTest<__half>> test_attn_scores(
new StridedGemmTest<__half>(batch_size * head_num, // batch
seq_len, // M
seq_len, // N
size_per_head, // K
CUBLAS_OP_T,
CUBLAS_OP_N,
handle));
std::unique_ptr<StridedGemmTest<__half>> test_attn_context(
new StridedGemmTest<__half>(batch_size * head_num, // batch
size_per_head, // M
seq_len, // N
seq_len, // K
CUBLAS_OP_N,
CUBLAS_OP_N,
handle));
_gemm_algos.push_back(test_qkv_fw->TestAlgo(100));
_gemm_algos.push_back(test_inter->TestAlgo(100));
_gemm_algos.push_back(test_output->TestAlgo(100));
_gemm_algos.push_back(test_attn_scores->TestAlgo(100));
_gemm_algos.push_back(test_attn_context->TestAlgo(100));
} else {
// Use default algo.
_gemm_algos.push_back(std::array<int, 3>({99, 99, 99}));
_gemm_algos.push_back(std::array<int, 3>({99, 99, 99}));
_gemm_algos.push_back(std::array<int, 3>({99, 99, 99}));
_gemm_algos.push_back(std::array<int, 3>({99, 99, 99}));
_gemm_algos.push_back(std::array<int, 3>({99, 99, 99}));
}
}
const std::vector<std::array<int, 3>>& GetGemmAlgos() const { return _gemm_algos; }
private:
curandGenerator_t _gen;
cublasHandle_t _cublasHandle;
void* _workspace;
uint64_t _seed;
uint64_t _curr_offset;
size_t _workSpaceSize;
std::vector<std::array<int, 3>> _gemm_algos;
};
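A minimal sketch of how the Context singleton above is typically consulted from an op; the wrapper function below is hypothetical and uses only what the class declares.

// Hypothetical caller of Context (the class defined directly above, in context.h).
void prepare_layer_scratch(size_t activation_bytes)
{
    // Reuse or reallocate the shared scratch buffer instead of allocating per call.
    Context::Instance().GenWorkSpace(activation_bytes);
    void* scratch = Context::Instance().GetWorkSpace();

    // Run on the stream PyTorch is currently using so the kernels stay ordered with the
    // framework's own work, and point cuBLAS at that same stream.
    cudaStream_t stream = Context::Instance().GetCurrentStream();
    cublasSetStream(Context::Instance().GetCublasHandle(), stream);

    (void)scratch;  // handed to the layer's Forward/Backward in real code
}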
#pragma once
#include <assert.h>
#include <cublas_v2.h>
#include <cuda.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <stdio.h>
int cublas_gemm_ex(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const float* A,
const float* B,
float* C,
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT);
int cublas_gemm_ex(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const __half* A,
const __half* B,
__half* C,
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP);
int cublas_strided_batched_gemm(cublasHandle_t handle,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const float* A,
const float* B,
float* C,
cublasOperation_t op_A,
cublasOperation_t op_B,
int stride_A,
int stride_B,
int stride_C,
int batch,
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT);
int cublas_strided_batched_gemm(cublasHandle_t handle,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const __half* A,
const __half* B,
__half* C,
cublasOperation_t op_A,
cublasOperation_t op_B,
int stride_A,
int stride_B,
int stride_C,
int batch,
cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP);
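A hedged sketch of calling the float overload declared above; the buffer names and sizes are made up, and the shape convention (m = output features, n = batch, k = input features, weights passed transposed) mirrors FeedForward<T>::Forward later in this commit.

// Hypothetical call site for the float overload of cublas_gemm_ex.
void example_gemm(cublasHandle_t handle,
                  const float* weights,
                  const float* input,
                  float* output,
                  int output_size,
                  int input_size,
                  int bsz)
{
    const float alpha = 1.0f;
    const float beta = 0.0f;
    cublas_gemm_ex(handle,
                   CUBLAS_OP_T,
                   CUBLAS_OP_N,
                   output_size,  // m
                   bsz,          // n
                   input_size,   // k
                   &alpha,
                   &beta,
                   weights,
                   input,
                   output,
                   CUBLAS_GEMM_DEFAULT);
}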
#pragma once
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include <stdlib.h>
#include <cooperative_groups.h>
#include <curand_kernel.h>
#include "context.h"
#include "cublas_wrappers.h"
#define MAX_THREADS 1024
#define THREADS 256
#define MAX_THREAD_STRIDE 32
#define TILE_DIM 32
// Maximum sequence-length support is based on the number of threads (2048) allowed in each
// block; with this MAX the limit is 8K. For higher sequence lengths a larger MAX is needed,
// e.g. 32 for 64K.
#define MAX_THREAD_ITERATIONS 8  // Maximum 8K
#define MAX_WARP_NUM 32
// Fused bias add with gelu activation
template <typename T>
void launch_bias_gelu(const T* input,
const T* bias,
T* output,
int intermediate_size,
int batch_size,
int sequence_length,
cudaStream_t stream);
template <typename T>
void launch_gelu(const T* input,
T* output,
int intermediate_size,
int batch_size,
int sequence_length,
cudaStream_t stream);
template <typename T>
void launch_d_gelu(T* d_output,
const T* input,
const T* bias,
int intermediate_size,
int batch_size,
int sequence_length,
cudaStream_t stream);
// Custom fused bias add with layer normalization
template <typename T>
void launch_bias_residual_layer_norm(T* vals,
const T* residual,
const T* gamma,
const T* beta,
float epsilon,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream,
bool preLayerNorm,
bool training = false,
T* vars = nullptr,
T* means = nullptr,
T* vals_hat = nullptr);
template <typename T>
void launch_bias_residual_layer_norm(T* vals,
const T* residual,
const T* gamma,
const T* beta,
float epsilon,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream,
bool preLayerNorm,
bool training = false,
T* vars = nullptr,
T* vals_hat = nullptr,
bool save_vals = false);
template <typename T>
void launch_layerNorm_backward_fused_add(const T* out_grad1,
const T* out_grad2,
const T* X_data,
const T* vars,
const T* means,
const T* gamma,
T* gamma_grad,
T* betta_grad,
T* inp_grad,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream[2]);
template <typename T>
void launch_layerNorm_backward_fused_add(const T* out_grad1,
const T* out_grad2,
const T* vals_hat,
const T* vars,
const T* gamma,
T* gamma_grad,
T* betta_grad,
T* inp_grad,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream[2],
bool invertible = false,
const T* betta = nullptr);
template <typename T>
void launch_layerNorm_backward(const T* out_grad,
const T* X_data,
const T* vars,
const T* means,
const T* gamma,
T* gamma_grad,
T* betta_grad,
T* inp_grad,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream[2]);
template <typename T>
void launch_layerNorm_backward(const T* out_grad,
const T* vals_hat,
const T* vars,
const T* gamma,
T* gamma_grad,
T* betta_grad,
T* inp_grad,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream[2],
bool invertible = false,
const T* betta = nullptr);
template <typename T>
void launch_layerNorm_backward_nreversible(const T* out_grad,
const T* vals,
const T* out_grad_trans,
const T* vals_trans,
const T* means,
const T* vars,
const T* gamma,
T* gamma_grad,
T* betta_grad,
T* inp_grad,
int batch_size,
int sequence_length,
int hidden_dim,
cudaStream_t stream[2]);
template <typename T>
void Transpose(const T* inp_mat, T* out_mat, int rows, int cols, cudaStream_t stream);
template <typename T>
void launch_attn_softmax_backward(T* out_grad,
const T* soft_inp,
int batch_size,
int heads,
int seq_length,
cudaStream_t stream);
template <typename T>
void launch_attn_softmax_backward_v2(T* out_grad,
const T* soft_inp,
int batch_size,
int heads,
int seq_length,
cudaStream_t stream);
// Custom softmax with scaling and attention mask addition
template <typename T>
void launch_attn_softmax(T* vals,
const T* attn_mask,
int batch_size,
int heads,
int sequence_length,
cudaStream_t stream);
template <typename T>
void launch_transform_0213(T* output,
const T* vals,
int batch_size,
int seq_length,
int hidden_dim,
int heads,
cudaStream_t stream);
// Custom bias add
template <typename T>
void launch_bias_add_transform_0213(T* outputs,
const T* vals,
const T* bias,
int batch_size,
int seq_length,
int hidden_dim,
int heads,
cudaStream_t stream,
int trans_count);
// 4D transform [0, 1, 2, 3] -> [0, 2, 1, 3]
template <typename T>
void launch_transform4d_0213(T* out,
const T* in,
int batch_size,
int heads,
int seq_length,
int hidden_dim,
cudaStream_t stream,
int trans_count);
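As a plain host-side illustration of the "[0, 1, 2, 3] -> [0, 2, 1, 3]" permutation named in the comment above (the shipped kernels do this on the GPU; the dimension meanings are an assumption, e.g. [batch, seq, heads, head_size] -> [batch, heads, seq, head_size]):

// Reference-only CPU version of the 0213 transform; not one of the launchers above.
template <typename T>
void transform_0213_reference(T* out, const T* in, int d0, int d1, int d2, int d3)
{
    for (int i = 0; i < d0; ++i)
        for (int j = 0; j < d1; ++j)
            for (int k = 0; k < d2; ++k)
                for (int l = 0; l < d3; ++l)
                    // in has shape [d0, d1, d2, d3]; out has shape [d0, d2, d1, d3]
                    out[((i * d2 + k) * d1 + j) * d3 + l] = in[((i * d1 + j) * d2 + k) * d3 + l];
}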
template <typename T>
void launch_dropout(T* vals,
const T* bias,
uint8_t* mask,
int batch,
int dim,
float ratio,
cudaStream_t stream);
template <typename T>
void launch_dropout(T* vals_out,
const T* vals,
uint8_t* mask,
int total_count,
int dim,
float ratio,
cudaStream_t stream,
bool bwd = false);
template <typename T>
void launch_dropout(T* out,
const T* vals,
const T* residual,
const T* bias,
uint8_t* mask,
int batch,
int dim,
float ratio,
cudaStream_t stream);
template <typename T>
void launch_dropout_grad(T* vals, uint8_t* mask, int total_count, float ratio, cudaStream_t stream);
template <typename T>
void launch_dropout_grad(T* vals_out,
const T* vals,
uint8_t* mask,
int total_count,
float ratio,
cudaStream_t stream);
template <typename T>
void launch_fuse_transpose_bias_kernel(const T* inp,
T* out,
int rows,
int cols,
cudaStream_t stream);
#pragma once
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
template <typename T>
class Dropout {
public:
struct Config {
float ratio;
uint32_t batch, dim;
bool training;
Config(float r, uint32_t batch, uint32_t dim)
: ratio(r), batch(batch), dim(dim), training(true)
{
}
float RATIO() const { return training ? ratio : 0.0; }
};
Dropout(const Config& config) : _config(config), _mask(nullptr) {}
virtual ~Dropout() {}
void Forward(int bsz, T* out, const T* vals, cudaStream_t stream, bool bwd = false)
{
launch_dropout<T>(
out, vals, _mask, bsz * _config.dim, _config.dim, _config.RATIO(), stream, bwd);
}
void ForwardWithBias(int bsz, T* vals, const T* bias, cudaStream_t stream)
{
launch_dropout<T>(vals, bias, _mask, bsz, _config.dim, _config.RATIO(), stream);
}
void ForwardWithBias(int bsz,
T* out,
const T* vals,
const T* residual,
const T* bias,
cudaStream_t stream)
{
launch_dropout<T>(
out, vals, residual, bias, _mask, bsz, _config.dim, _config.RATIO(), stream);
}
void Backward(int bsz, T* d_vals, cudaStream_t stream)
{
launch_dropout_grad<T>(d_vals, _mask, bsz * _config.dim, _config.RATIO(), stream);
}
void Backward(int bsz, T* d_vals_out, const T* d_vals, cudaStream_t stream)
{
launch_dropout_grad<T>(
d_vals_out, d_vals, _mask, bsz * _config.dim, _config.RATIO(), stream);
}
bool HasDropout() const { return _config.RATIO() > 0.0; }
void SetTrainingMode(bool training) { _config.training = training; }
void SetMask(uint8_t* mask)
{
if (!mask) { throw std::runtime_error("Dropout mask is null."); }
_mask = mask;
}
Config GetConfig() const { return _config; }
private:
uint8_t* _mask;
Config _config;
};
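A short sketch of driving the Dropout wrapper above; the caller is hypothetical, and the mask buffer is owned by the caller and handed in through SetMask (as BertTransformerLayer::SetIntermediateBuffers does later in this commit).

// Hypothetical use of Dropout<float>; assumes mask_buf has room for bsz * dim entries.
void run_dropout(float* out,
                 const float* in,
                 uint8_t* mask_buf,
                 int bsz,
                 int dim,
                 float ratio,
                 bool training,
                 cudaStream_t stream)
{
    Dropout<float> dropout(Dropout<float>::Config(ratio, bsz, dim));
    dropout.SetTrainingMode(training);  // RATIO() falls back to 0 outside of training
    dropout.SetMask(mask_buf);
    dropout.Forward(bsz, out, in, stream);
}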
#pragma once
#include <cuda_runtime_api.h>
#include <curand.h>
#include <memory>
#include <vector>
#include "cublas_v2.h"
#include "cuda.h"
#include "dropout.h"
#include "feed_forward.h"
#include "gelu.h"
#include "general_kernels.h"
#include "normalize_layer.h"
#include "softmax.h"
#include "strided_batch_gemm.h"
struct BertGemmAlgos {
int m_gemm_qkv_algo;
int m_gemm_inter_algo;
int m_gemm_output_algo;
int m_gemm_batch1_algo;
int m_gemm_batch2_algo;
BertGemmAlgos()
: m_gemm_qkv_algo(-1),
m_gemm_inter_algo(-1),
m_gemm_output_algo(-1),
m_gemm_batch1_algo(-1),
m_gemm_batch2_algo(-1)
{
}
};
template <typename T>
class BertTransformerLayer {
public:
BertTransformerLayer(int layer_id,
int batch_size,
int hidden_size,
int num_heads,
int intermediate_size,
int seq_length,
float attn_dropout_ratio,
float hidden_output_dropout_ratio,
bool pre_or_postLayerNorm,
const std::vector<std::array<int, 3>>& gemm_algos,
bool attn_dropout_checkpoint,
bool normalize_invertible,
bool gelu_checkpoint,
bool stochastic_mode);
virtual ~BertTransformerLayer();
void Forward(int bsz,
const T* input_ptr,
const T* input_mask_ptr,
const T* attn_qkvw_ptr,
const T* attn_qkvb_ptr,
const T* attn_ow_ptr,
const T* attn_ob_ptr,
const T* attn_nw_ptr,
const T* attn_nb_ptr,
const T* inter_w_ptr,
const T* inter_b_ptr,
const T* output_w_ptr,
const T* output_b_ptr,
const T* norm_w_ptr,
const T* norm_b_ptr,
T* out_ptr,
T* inp_norm_ptr,
T* q_tf_ptr,
T* k_tf_ptr,
T* v_tf_ptr,
T* softmax_output_ptr,
T* ctx_bufB_ptr,
T* attn_o_inp_ptr,
T* add_res_ptr,
T* ff1_inp_ptr,
T* gelu_inp_ptr,
T* ff2_inp_ptr);
void Backward(int bsz,
const T* grad_output_ptr,
const T* input_ptr,
const T* output_ptr,
const T* inp_norm_ptr,
const T* q_tf_ptr,
const T* k_tf_ptr,
const T* v_tf_ptr,
const T* softmax_output_ptr,
const T* ctx_bufB_ptr,
const T* attn_o_inp_ptr,
const T* add_res_ptr,
const T* ff1_inp_ptr,
const T* gelu_inp_ptr,
const T* ff2_inp_ptr,
const T* input_mask_ptr,
const T* attn_qkvw_ptr,
const T* attn_ow_ptr,
const T* attn_nw_ptr,
const T* attn_nb_ptr,
const T* inter_w_ptr,
const T* inter_b_ptr,
const T* output_w_ptr,
const T* norm_w_ptr,
const T* norm_b_ptr,
T* grad_input_ptr,
T* grad_attn_qkvw_ptr,
T* grad_attn_qkvb_ptr,
T* grad_attn_ow_ptr,
T* grad_attn_ob_ptr,
T* grad_attn_nw_ptr,
T* grad_attn_nb_ptr,
T* grad_inter_w_ptr,
T* grad_inter_b_ptr,
T* grad_output_w_ptr,
T* grad_output_b_ptr,
T* grad_norm_w_ptr,
T* grad_norm_b_ptr);
void SetIntermediateBuffers(uint8_t* attn_prob_dropout_mask_ptr,
uint8_t* attn_output_dropout_mask_ptr,
uint8_t* layer_output_dropout_mask_ptr);
inline int GetBatchSize() const { return _batch_size; }
inline int GetNumHeads() const { return _heads; }
inline int GetSeqLength() const { return _seq_length; }
inline int GetHiddenSize() const { return _hidden_size; }
void SetTrainingMode(bool training);
private:
void Initialize();
size_t getWorkspaceSize(int maxBatchSize) const;
// Params
int _layer_id;
int _batch_size;
int _hidden_size;
int _heads;
int _size_per_head;
int _intermediate_size;
int _seq_length;
bool _pre_or_postLayerNorm;
cublasHandle_t _cublasHandle;
cudaStream_t _stream;
// layers
FeedForward<T> _qkv_linear;
FeedForward<T> _attn_out_linear;
Normalize_Layer<T> _norm_layer2;
Normalize_Layer<T> _norm_layer3;
Normalize_Layer<T>* _last_normalize;
FeedForward<T> _ff1, _ff2;
Softmax<T> _softmax;
Gelu<T> _gelu;
Dropout<T> _attn_prob_dropout;
Dropout<T> _attn_output_dropout;
Dropout<T> _layer_output_dropout;
StridedBatchGemm<T> _attn_scores;
StridedBatchGemm<T> _attn_context;
bool _training;
// Memory saving flags
bool _attn_dropout_checkpoint;
bool _normalize_invertible;
bool _gelu_checkpoint;
// High-performance flags
bool _stochastic_mode;
};
#ifndef __FEEDFORWARD_H__
#define __FEEDFORWARD_H__
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include "custom_cuda_layers.h"
template <typename T>
class FeedForward {
public:
struct Config {
int batchSize, outputSize;
int inputSize;
std::array<int, 3> gemm_algos;
Config(int batch, int outputs, int inputs, const std::array<int, 3>& algos)
: batchSize(batch), outputSize(outputs), inputSize(inputs), gemm_algos(algos)
{
}
};
FeedForward(Config config) : config_(config) {}
~FeedForward() {}
void Forward(int bsz,
const T* input_ptr,
const T* weights,
T* out,
cublasHandle_t& _cublasHandle)
{
float alpha = T(1.);
float beta = T(0.);
cublas_gemm_ex(_cublasHandle,
CUBLAS_OP_T,
CUBLAS_OP_N,
config_.outputSize,
bsz,
config_.inputSize,
&alpha,
&beta,
weights,
input_ptr,
out,
cublasGemmAlgo_t(config_.gemm_algos[0]));
}
void Backward(int bsz,
const T* out_grad,
const T* input_ptr,
const T* weights,
T* weights_grad,
T* bias_grad,
cublasHandle_t& _cublasHandle,
cudaStream_t& stream,
T* inp_grad_out = nullptr,
T* out_grad_trans_out = nullptr)
{
float alpha = (T)1.0, beta = (T)0.0;
cublas_gemm_ex(_cublasHandle,
CUBLAS_OP_N,
CUBLAS_OP_T,
config_.inputSize,
config_.outputSize,
bsz,
&alpha,
&beta,
input_ptr,
out_grad,
weights_grad,
cublasGemmAlgo_t(config_.gemm_algos[1]));
cublas_gemm_ex(_cublasHandle,
CUBLAS_OP_N,
CUBLAS_OP_N,
config_.inputSize,
bsz,
config_.outputSize,
&alpha,
&beta,
weights,
out_grad,
inp_grad_out,
cublasGemmAlgo_t(config_.gemm_algos[2]));
launch_fuse_transpose_bias_kernel<T>(out_grad, bias_grad, bsz, config_.outputSize, stream);
}
private:
Config config_;
};
#endif
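A hedged sketch of constructing and running FeedForward as declared above; the caller and its argument names are hypothetical, and the {99, 99, 99} algorithm IDs are the same defaults Context falls back to when the GEMM test is skipped.

// Hypothetical caller for FeedForward<float>.
void run_feed_forward(const float* input,
                      const float* weights,
                      float* output,
                      int tokens,  // e.g. batch_size * seq_length
                      int input_size,
                      int output_size,
                      cublasHandle_t handle)
{
    std::array<int, 3> algos = {99, 99, 99};
    FeedForward<float> ff(FeedForward<float>::Config(tokens, output_size, input_size, algos));
    // Single GEMM: weights are passed transposed, producing one activation column per token
    // in cuBLAS's column-major view.
    ff.Forward(tokens, input, weights, output, handle);
}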
#pragma once
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include "custom_cuda_layers.h"
template <typename T>
class Gelu {
public:
struct Config {
uint32_t batch_size;
uint32_t seq_length;
uint32_t intermediate_size;
Config(uint32_t batch, uint32_t seq, uint32_t inter_size)
: batch_size(batch), seq_length(seq), intermediate_size(inter_size)
{
}
};
Gelu(const Config& config) : _config(config) {}
virtual ~Gelu() {}
void ForwardWithBiasAdd(int bsz,
const T* input_buf,
const T* bias,
T* output,
cudaStream_t stream)
{
launch_bias_gelu<T>(
input_buf, bias, output, _config.intermediate_size, bsz, _config.seq_length, stream);
}
void Backward(int bsz, T* d_output, const T* input_buf, const T* bias, cudaStream_t stream)
{
launch_d_gelu<T>(
d_output, input_buf, bias, _config.intermediate_size, bsz, _config.seq_length, stream);
}
private:
Config _config;
};
#pragma once
#include <cuda_fp16.h>
#include <cuda_profiler_api.h>
#include <array>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <limits>
#include <memory>
#include "StopWatch.h"
#include "cublas_wrappers.h"
template <typename T>
void check(T result, char const* const func, const char* const file, int const line)
{
if (result) {
std::cout << (std::string("CUDA runtime error: ") + file + ":" + std::to_string(line) + " \n");
}
}
#define check_cuda_error(val) check((val), #val, __FILE__, __LINE__)
template <typename T>
class GemmTest {
public:
GemmTest(int m, int n, int k, cublasOperation_t ta, cublasOperation_t tb, cublasHandle_t h)
: M(m), N(n), K(k), transa(ta), transb(tb), handle(h)
{
check_cuda_error(cudaMalloc((void**)&A, sizeof(T) * M * K));
check_cuda_error(cudaMalloc((void**)&B, sizeof(T) * K * N));
check_cuda_error(cudaMalloc((void**)&C, sizeof(T) * M * N));
}
~GemmTest()
{
check_cuda_error(cudaFree(A));
check_cuda_error(cudaFree(B));
check_cuda_error(cudaFree(C));
}
std::array<int, 3> TestAlgo(int loops)
{
float alpha = (T)1.0f;
float beta = (T)0.0f;
int algo_fw = Run(loops, [=](int algo) {
cublas_gemm_ex(handle,
CUBLAS_OP_T,
CUBLAS_OP_N,
N,
M,
K,
&alpha,
&beta,
B,
A,
C,
static_cast<cublasGemmAlgo_t>(algo));
});
int algo_bw1 = Run(loops, [=](int algo) {
cublas_gemm_ex(handle,
CUBLAS_OP_N,
CUBLAS_OP_T,
K,
N,
M,
&alpha,
&beta,
A,
C,
B,
static_cast<cublasGemmAlgo_t>(algo));
});
int algo_bw2 = Run(loops, [=](int algo) {
cublas_gemm_ex(handle,
CUBLAS_OP_N,
CUBLAS_OP_N,
K,
M,
N,
&alpha,
&beta,
B,
C,
A,
static_cast<cublasGemmAlgo_t>(algo));
});
return std::array<int, 3>({algo_fw, algo_bw1, algo_bw2});
}
template <typename Func>
int Run(int loops, Func f)
{
float fast_latency = std::numeric_limits<float>::max();
int fast_algo = 0;
for (int algo = (int)CUBLAS_GEMM_DEFAULT_TENSOR_OP;
algo <= (int)CUBLAS_GEMM_ALGO15_TENSOR_OP;
algo++) {
int warm_up = 5;
for (int i = 0; i < warm_up; ++i) f(algo);
cudaDeviceSynchronize();
Stopwatch timer;
timer.Restart();
for (int i = 0; i < loops; ++i) f(algo);
cudaDeviceSynchronize();
timer.Stop();
float avg_latency = (float)timer.GetTimeInSeconds() * 1000 / loops;
printf("algo-%d: %.3fms\n", algo, avg_latency);
if (avg_latency < fast_latency) {
fast_latency = avg_latency;
fast_algo = algo;
}
}
printf("fast_algo %d: %.3f ms\n", fast_algo, fast_latency);
return fast_algo;
}
private:
int M, N, K;
cublasHandle_t handle;
cublasOperation_t transa, transb;
T *A, *B, *C;
};
template <typename T>
class StridedGemmTest {
public:
StridedGemmTest(int b,
int m,
int n,
int k,
cublasOperation_t ta,
cublasOperation_t tb,
cublasHandle_t h)
: bsz(b), M(m), N(n), K(k), transa(ta), transb(tb), handle(h)
{
check_cuda_error(cudaMalloc((void**)&A, sizeof(T) * M * K * bsz));
check_cuda_error(cudaMalloc((void**)&B, sizeof(T) * K * N * bsz));
check_cuda_error(cudaMalloc((void**)&C, sizeof(T) * M * N * bsz));
}
~StridedGemmTest()
{
check_cuda_error(cudaFree(A));
check_cuda_error(cudaFree(B));
check_cuda_error(cudaFree(C));
}
std::array<int, 3> TestAlgo(int loops)
{
float alpha = (T)1.0f;
float beta = (T)0.0f;
int algo_fw = Run(loops, [=](int algo) {
int stride_a = M * K;
int stride_b = N * K;
int stride_c = M * N;
cublas_strided_batched_gemm(handle,
M,
N,
K,
&alpha,
&beta,
A,
B,
C,
transa,
transb,
stride_a,
stride_b,
stride_c,
bsz,
static_cast<cublasGemmAlgo_t>(algo));
});
int algo_bw1 = Run(loops, [=](int algo) {
int mb = (transa == CUBLAS_OP_T ? K : M);
int kb = (transa == CUBLAS_OP_T ? M : K);
int stride_a = mb * N;
int stride_b = N * kb;
int stride_c = M * K;
// B needs to be transposed.
cublasOperation_t op_b = (transb == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T);
// Calculate d_A.
cublas_strided_batched_gemm(handle,
mb,
kb,
N,
&alpha,
&beta,
(transa == CUBLAS_OP_T ? B : C),
(transa == CUBLAS_OP_T ? C : B),
A,
CUBLAS_OP_N,
op_b,
stride_a,
stride_b,
stride_c,
bsz,
static_cast<cublasGemmAlgo_t>(algo));
});
int algo_bw2 = Run(loops, [=](int algo) {
// A needs to be transposed.
cublasOperation_t op_a = (transa == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T);
int stride_a = M * K;
int stride_b = M * N;
int stride_c = N * K;
// Calculate d_B.
cublas_strided_batched_gemm(handle,
K,
N,
M,
&alpha,
&beta,
A,
C,
B,
op_a,
CUBLAS_OP_N,
stride_a,
stride_b,
stride_c,
bsz,
static_cast<cublasGemmAlgo_t>(algo));
});
return std::array<int, 3>({algo_fw, algo_bw1, algo_bw2});
}
template <typename Func>
int Run(int loops, Func f)
{
float fast_latency = std::numeric_limits<float>::max();
int fast_algo = 0;
for (int algo = (int)CUBLAS_GEMM_DEFAULT_TENSOR_OP;
algo <= (int)CUBLAS_GEMM_ALGO15_TENSOR_OP;
algo++) {
int warm_up = 5;
for (int i = 0; i < warm_up; ++i) f(algo);
cudaDeviceSynchronize();
Stopwatch timer;
timer.Restart();
for (int i = 0; i < loops; ++i) f(algo);
cudaDeviceSynchronize();
timer.Stop();
float avg_latency = (float)timer.GetTimeInSeconds() * 1000 / loops;
printf("algo-%d: %.3fms\n", algo, avg_latency);
if (avg_latency < fast_latency) {
fast_latency = avg_latency;
fast_algo = algo;
}
}
printf("fast_algo %d: %.3f ms\n", fast_algo, fast_latency);
return fast_algo;
}
private:
int bsz, M, N, K;
cublasHandle_t handle;
cublasOperation_t transa, transb;
T *A, *B, *C;
};
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits>
#include <cooperative_groups.h>
#include <curand_kernel.h>
#include "context.h"
#include "cublas_wrappers.h"
#define THREADS 256
#define TILE_DIM 32
#define minus_infinity (-1 * std::numeric_limits<float>::infinity())
#define FINAL_MASK 0xffffffff
template <typename T>
void launch_fused_add2(T* out,
const T* inp1,
const T* inp2,
int batch_size,
int seq_length,
int hidden_size,
cudaStream_t& stream);
template <typename T>
void launch_fused_add4(T* out,
const T* inp1,
const T* inp2,
const T* inp3,
const T* inp4,
int batch_size,
int seq_length,
int hidden_size,
cudaStream_t& stream);
template <typename T>
void launch_fused_add3(T* out,
const T* inp1,
const T* inp2,
const T* inp3,
int batch_size,
int seq_length,
int hidden_size,
cudaStream_t& stream);
#pragma once
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include <fstream>
#include "custom_cuda_layers.h"
using namespace std;
template <typename T>
class Normalize_Layer {
public:
struct Config {
uint32_t batchSize;
uint32_t seqLength;
uint32_t hiddenDim;
float epsilon;
bool training, save_vals;
bool allocateGrad;
bool useMean;
Config(uint32_t batch,
uint32_t seq,
uint32_t h,
bool training,
bool save_vals = true,
bool allocateGrad = true,
bool useMean = true)
: batchSize(batch),
seqLength(seq),
hiddenDim(h),
epsilon(1e-12),
training(training),
save_vals(save_vals),
allocateGrad(allocateGrad),
useMean(useMean)
{
}
};
Normalize_Layer(Config config) : config_(config), vars(nullptr), vals_hat(nullptr)
{
if (config_.training) {
cudaMalloc((void**)&vars, config_.batchSize * config_.seqLength * sizeof(T));
if (config_.useMean)
cudaMalloc((void**)&means, config_.batchSize * config_.seqLength * sizeof(T));
if (config_.save_vals)
cudaMalloc((void**)&vals_hat,
config_.batchSize * config_.seqLength * config_.hiddenDim * sizeof(T));
if (config_.allocateGrad)
cudaMalloc((void**)&inp_grad,
config_.batchSize * config_.seqLength * config_.hiddenDim * sizeof(T));
}
}
~Normalize_Layer()
{
if (config_.training) {
cudaFree(vars);
if (config_.useMean) cudaFree(means);
if (config_.save_vals) cudaFree(vals_hat);
if (config_.allocateGrad) cudaFree(inp_grad);
}
}
void ForwardCheckpoint(int bsz,
T* vals,
const T* residual,
const T* gamma,
const T* betta,
cudaStream_t& stream,
bool preLayerNorm = false)
{
launch_bias_residual_layer_norm(vals,
residual,
gamma,
betta,
config_.epsilon,
bsz,
config_.seqLength,
config_.hiddenDim,
stream,
preLayerNorm,
config_.training,
vars,
means,
vals_hat);
}
void Forward(int bsz,
T* vals,
const T* residual,
const T* gamma,
const T* betta,
cudaStream_t& stream,
bool preLayerNorm = false)
{
launch_bias_residual_layer_norm(vals,
residual,
gamma,
betta,
config_.epsilon,
bsz,
config_.seqLength,
config_.hiddenDim,
stream,
preLayerNorm,
config_.training,
vars,
vals_hat,
config_.save_vals);
}
void Backward(int bsz,
const T* out_grad,
const T* gamma,
T* gamma_grad,
T* betta_grad,
cudaStream_t stream[2],
T* inp_grad_out = nullptr,
const T* norm_in = nullptr)
{
launch_layerNorm_backward(out_grad,
norm_in,
vars,
means,
gamma,
gamma_grad,
betta_grad,
(config_.allocateGrad ? inp_grad : inp_grad_out),
bsz,
config_.seqLength,
config_.hiddenDim,
stream);
}
void Backward(int bsz,
const T* out_grad,
const T* gamma,
const T* betta,
T* gamma_grad,
T* betta_grad,
cudaStream_t stream[2],
T* inp_grad_out = nullptr,
const T* norm_out = nullptr)
{
launch_layerNorm_backward(out_grad,
(config_.save_vals ? vals_hat : norm_out),
vars,
gamma,
gamma_grad,
betta_grad,
(config_.allocateGrad ? inp_grad : inp_grad_out),
bsz,
config_.seqLength,
config_.hiddenDim,
stream,
config_.save_vals,
betta);
}
void BackwardFusedAdd(int bsz,
const T* out_grad1,
const T* out_grad2,
const T* gamma,
T* gamma_grad,
T* betta_grad,
cudaStream_t stream[2],
T* inp_grad_out = nullptr,
const T* norm_in = nullptr)
{
launch_layerNorm_backward_fused_add(out_grad1,
out_grad2,
norm_in,
vars,
means,
gamma,
gamma_grad,
betta_grad,
(config_.allocateGrad ? inp_grad : inp_grad_out),
bsz,
config_.seqLength,
config_.hiddenDim,
stream);
}
void BackwardFusedAdd(int bsz,
const T* out_grad1,
const T* out_grad2,
const T* gamma,
const T* betta,
T* gamma_grad,
T* betta_grad,
cudaStream_t stream[2],
T* inp_grad_out = nullptr,
const T* norm_out = nullptr)
{
launch_layerNorm_backward_fused_add(out_grad1,
out_grad2,
(config_.save_vals ? vals_hat : norm_out),
vars,
gamma,
gamma_grad,
betta_grad,
(config_.allocateGrad ? inp_grad : inp_grad_out),
bsz,
config_.seqLength,
config_.hiddenDim,
stream,
config_.save_vals,
betta);
}
inline T* GetInputGrad() const { return inp_grad; }
inline bool UseMean() const { return config_.useMean; }
private:
Config config_;
T* vars;
T* means;
T* vals_hat;
T* inp_grad;
};
#pragma once
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
#include "custom_cuda_layers.h"
#include <fstream>
using namespace std;
template <typename T>
class Softmax {
public:
struct Config {
size_t batchSize;
size_t heads;
size_t seq_length;
size_t prob_depth;
float temprature;
bool mem_alloc;
Config(size_t batch, size_t h, size_t seq, int prob_size = 0, bool mem_alloc = false)
: batchSize(batch),
heads(h),
seq_length(seq),
prob_depth(prob_size),
temprature(1.0),
mem_alloc(mem_alloc)
{
}
};
Softmax(Config config) : config_(config) {}
~Softmax() {}
void Forward(int bsz, T* vals, const T* attn_mask, cudaStream_t& stream)
{
launch_attn_softmax<T>(vals, attn_mask, bsz, config_.heads, config_.seq_length, stream);
}
void Backward(int bsz, T* out_grad, const T* soft_out, cudaStream_t stream)
{
launch_attn_softmax_backward_v2<T>(
out_grad, soft_out, bsz, config_.heads, config_.seq_length, stream);
}
inline int GetProbDepth() const { return config_.prob_depth; }
inline int GetBatchSize() const { return config_.batchSize; }
inline int GetNumHeads() const { return config_.heads; }
inline int GetSeqLength() const { return config_.seq_length; }
private:
Config config_;
};
#pragma once
#include <cuda.h>
#include <cuda_fp16.h>
#include <stdio.h>
template <typename T>
class StridedBatchGemm {
public:
struct Config {
int batch_size;
int m;
int n;
int k;
float alpha;
float beta;
cublasOperation_t op_A;
cublasOperation_t op_B;
std::array<int, 3> gemm_algos;
Config(int batch,
int mm,
int nn,
int kk,
float param_alpha,
float param_beta,
cublasOperation_t opA,
cublasOperation_t opB,
const std::array<int, 3>& algos)
: batch_size(batch),
m(mm),
n(nn),
k(kk),
alpha(param_alpha),
beta(param_beta),
op_A(opA),
op_B(opB),
gemm_algos(algos)
{
}
};
StridedBatchGemm(const Config& config) : _config(config) {}
virtual ~StridedBatchGemm() {}
void Forward(int bsz, T* output, const T* _buffer_a, const T* _buffer_b, cublasHandle_t handle)
{
int stride_a = _config.m * _config.k;
int stride_b = _config.n * _config.k;
int stride_c = _config.m * _config.n;
cublas_strided_batched_gemm(handle,
_config.m,
_config.n,
_config.k,
&_config.alpha,
&_config.beta,
_buffer_a,
_buffer_b,
output,
_config.op_A,
_config.op_B,
stride_a,
stride_b,
stride_c,
bsz,
cublasGemmAlgo_t(_config.gemm_algos[0]));
}
void ForwardPlusSave(T* output, const T* _buffer_a, const T* _buffer_b, cublasHandle_t handle)
{
int stride_a = _config.m * _config.k;
int stride_b = _config.n * _config.k;
int stride_c = _config.m * _config.n;
cublas_strided_batched_gemm(handle,
_config.m,
_config.n,
_config.k,
&_config.alpha,
&_config.beta,
_buffer_a,
_buffer_b,
output,
_config.op_A,
_config.op_B,
stride_a,
stride_b,
stride_c,
_config.batch_size,
cublasGemmAlgo_t(_config.gemm_algos[0]));
k_buf = _buffer_a;
q_buf = _buffer_b;
}
void Backward(int bsz,
const T* d_output,
const T* _buffer_a,
const T* _buffer_b,
cublasHandle_t handle,
T* inpGradA = nullptr,
T* inpGradB = nullptr)
{
int mb = (_config.op_A == CUBLAS_OP_T ? _config.k : _config.m);
int kb = (_config.op_A == CUBLAS_OP_T ? _config.m : _config.k);
int stride_a = mb * _config.n;
int stride_b = _config.n * kb;
int stride_c = _config.m * _config.k;
// B needs to be transposed.
cublasOperation_t op_b = (_config.op_B == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T);
// Calculate d_A.
cublas_strided_batched_gemm(handle,
mb,
kb,
_config.n,
&_config.alpha,
&_config.beta,
(_config.op_A == CUBLAS_OP_T ? _buffer_b : d_output),
(_config.op_A == CUBLAS_OP_T ? d_output : _buffer_b),
inpGradA,
CUBLAS_OP_N,
op_b,
stride_a,
stride_b,
stride_c,
bsz,
cublasGemmAlgo_t(_config.gemm_algos[1]));
// A needs to be transposed.
cublasOperation_t op_a = (_config.op_A == CUBLAS_OP_T ? CUBLAS_OP_N : CUBLAS_OP_T);
stride_a = _config.m * _config.k;
stride_b = _config.m * _config.n;
stride_c = _config.n * _config.k;
// Calculate d_B.
cublas_strided_batched_gemm(handle,
_config.k,
_config.n,
_config.m,
&_config.alpha,
&_config.beta,
_buffer_a,
d_output,
inpGradB,
op_a,
CUBLAS_OP_N,
stride_a,
stride_b,
stride_c,
bsz,
cublasGemmAlgo_t(_config.gemm_algos[2]));
}
inline int GetN() const { return _config.k; }
inline const T* GetBufferA() const { return k_buf; }
inline const T* GetBufferB() const { return q_buf; }
private:
Config _config;
const T* q_buf;
const T* k_buf;
};
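As an illustration of how this class gets configured for the attention-score GEMM, a hedged sketch mirroring the test_attn_scores shapes in context.h (batch = batch_size * heads, m = n = seq_len, k = size_per_head, ops T/N); the helper itself is hypothetical.

// Hypothetical factory for the attention-score batched GEMM.
template <typename T>
StridedBatchGemm<T> make_attn_score_gemm(int batch_size,
                                         int heads,
                                         int seq_len,
                                         int size_per_head,
                                         const std::array<int, 3>& gemm_algos)
{
    typename StridedBatchGemm<T>::Config cfg(batch_size * heads,  // independent GEMMs
                                             seq_len,             // m
                                             seq_len,             // n
                                             size_per_head,       // k
                                             1.0f,  // alpha (a real layer may fold 1/sqrt(k) in here)
                                             0.0f,  // beta
                                             CUBLAS_OP_T,
                                             CUBLAS_OP_N,
                                             gemm_algos);
    return StridedBatchGemm<T>(cfg);
}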
/* Taken from NVIDIA/apex commit 855808f3fc268e9715d613f3c2e56469d8c986d8 */
#include <ATen/ATen.h>
// Forward/backward compatibility hack around
// https://github.com/pytorch/pytorch/commit/3aeb78079bcd68282fe9117088e138b77318e288
// pending more future-proof guidance from upstream.
// struct TypeShim
// {
// const at::Type& payload;
// TypeShim(const at::Type& type) : payload(type) {}
// // Enable trivial conversion to a const at::Type& for pre-3aeb78
// operator const at::Type&(){ return payload; };
// // Enable dispatch switch statements to take *this directly for post-3aeb78
// //operator at::ScalarType(){ return payload.; };
// };
#define DISPATCH_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \
switch (TYPE) { \
case at::ScalarType::Float: { \
using scalar_t_##LEVEL = float; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Half: { \
using scalar_t_##LEVEL = at::Half; \
__VA_ARGS__; \
break; \
} \
default: AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
#define DISPATCH_DOUBLE_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \
switch (TYPE) { \
case at::ScalarType::Double: { \
using scalar_t_##LEVEL = double; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Float: { \
using scalar_t_##LEVEL = float; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Half: { \
using scalar_t_##LEVEL = at::Half; \
__VA_ARGS__; \
break; \
} \
default: AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
#define DISPATCH_DOUBLE_AND_FLOAT(TYPE, LEVEL, NAME, ...) \
switch (TYPE) { \
case at::ScalarType::Double: { \
using scalar_t_##LEVEL = double; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Float: { \
using scalar_t_##LEVEL = float; \
__VA_ARGS__; \
break; \
} \
default: AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
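These macros expand to a switch over the runtime scalar type and bind the chosen C++ type to scalar_t_<LEVEL> inside each case. A minimal hedged sketch of the calling pattern (the kernel and wrapper below are made up, not part of this commit):

// Hypothetical dispatch example; my_scale_kernel is assumed to exist elsewhere.
template <typename scalar_t>
void my_scale_kernel(scalar_t* data, float factor, int64_t n);

void scale_tensor(at::Tensor& t, float factor)
{
    DISPATCH_FLOAT_AND_HALF(t.scalar_type(), 0, "scale_tensor",
                            my_scale_kernel<scalar_t_0>(t.data_ptr<scalar_t_0>(), factor, t.numel()));
}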
template <typename T>
__device__ __forceinline__ T
reduce_block_into_lanes(T* x,
T val,
int lanes = 1,
bool share_result = false) // lanes is intended to be <= 32.
{
int tid = threadIdx.x + threadIdx.y * blockDim.x;
int blockSize = blockDim.x * blockDim.y; // blockSize is intended to be a multiple of 32.
if (blockSize >= 64) {
x[tid] = val;
__syncthreads();
}
#pragma unroll
for (int i = (blockSize >> 1); i >= 64; i >>= 1) {
if (tid < i) x[tid] = x[tid] + x[tid + i];
__syncthreads();
}
T final;
if (tid < 32) {
if (blockSize >= 64)
final = x[tid] + x[tid + 32];
else
final = val;
// __SYNCWARP();
#pragma unroll
for (int i = 16; i >= lanes; i >>= 1)
final = final + __shfl_down_sync(0xffffffff, final, i);
}
if (share_result) {
if (tid < lanes) x[tid] = final; // EpilogueOp
// Make sure the smem result is visible to all warps.
__syncthreads();
}
return final;
}
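Finally, a small hedged sketch of the calling pattern reduce_block_into_lanes expects: a shared buffer with one element per thread, every thread contributing a value, and (with the default lanes = 1) thread 0 ending up with the block-wide sum. The kernel is illustrative only.

// Hypothetical kernel: one block reduces `chunk` consecutive floats into out[blockIdx.x].
// Launch example: block_sum_kernel<<<num_chunks, 256, 256 * sizeof(float)>>>(in, out, chunk);
__global__ void block_sum_kernel(const float* in, float* out, int chunk)
{
    extern __shared__ float smem[];  // blockDim.x * blockDim.y elements

    float val = 0.f;
    for (int i = threadIdx.x; i < chunk; i += blockDim.x) val += in[blockIdx.x * chunk + i];

    float total = reduce_block_into_lanes(smem, val);  // block size must be a multiple of 32
    if (threadIdx.x == 0 && threadIdx.y == 0) out[blockIdx.x] = total;
}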