Unverified Commit 734d8991 authored by Jeff Rasley, committed by GitHub
parent b652395e
/* Copyright 2019 The Microsoft DeepSpeed Team */
#include <torch/extension.h>
// CUDA forward declaration
void fused_lamb_cuda(at::Tensor& p,
at::Tensor& p_copy,
at::Tensor& m,
at::Tensor& v,
at::Tensor& g,
float lr,
float beta1,
float beta2,
float max_coeff,
float min_coeff,
float eps,
float grad_scale,
int step,
int mode,
int bias_correction,
float decay,
at::Tensor& w_l2_i,
at::Tensor& u_l2_i,
at::Tensor& lamb_coeff_val);
#define CHECK_CUDA(x) AT_ASSERTM(x.type().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) AT_ASSERTM(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) \
CHECK_CUDA(x); \
CHECK_CONTIGUOUS(x)
// C++ interface
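// Performs one fused LAMB update of p (and of the optional reduced-precision copy
// p_copy) in place, and returns the computed LAMB coefficient as a 1-element tensor.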
at::Tensor lamb(at::Tensor& p,
at::Tensor& p_copy,
at::Tensor& m,
at::Tensor& v,
at::Tensor& g,
float lr,
float beta1,
float beta2,
float max_coeff,
float min_coeff,
float eps,
float grad_scale,
int step,
int mode,
int bias_correction,
float decay)
{
CHECK_INPUT(p);
if (p_copy.numel() > 0) CHECK_INPUT(p_copy);
CHECK_INPUT(m);
CHECK_INPUT(v);
CHECK_INPUT(g);
int64_t num_elem = p.numel();
AT_ASSERTM(m.numel() == num_elem, "number of elements in m and p tensors should be equal");
AT_ASSERTM(v.numel() == num_elem, "number of elements in v and p tensors should be equal");
AT_ASSERTM(g.numel() == num_elem, "number of elements in g and p tensors should be equal");
AT_ASSERTM(
p_copy.numel() == num_elem || p_copy.numel() == 0,
"number of elements in p_copy and p tensors should be equal, or p_copy should be empty");
// intermediate buffer for the weight L2 reduction
// make sure the kernel is launched with at least 512 threads per block,
// otherwise the behaviour is unexpected
at::Tensor w_l2_i = at::empty(
{512},
p.options().dtype(p.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float
: p.type().scalarType()));
// intermediate buffer for the update L2 reduction
// make sure the kernel is launched with at least 512 threads per block,
// otherwise the behaviour is unexpected
at::Tensor u_l2_i = at::empty(
{512},
p.options().dtype(p.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float
: p.type().scalarType()));
at::Tensor lamb_coeff_val = at::empty(
{1},
p.options().dtype(p.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float
: p.type().scalarType()));
fused_lamb_cuda(p,
p_copy,
m,
v,
g,
lr,
beta1,
beta2,
max_coeff,
min_coeff,
eps,
grad_scale,
step,
mode,
bias_correction,
decay,
w_l2_i,
u_l2_i,
lamb_coeff_val);
return lamb_coeff_val;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
m.def("lamb", &lamb, "Adam optimized CUDA implementation with LAMB.");
}
#include "cublas_wrappers.h"
int cublas_gemm_ex(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const float* A,
const float* B,
float* C,
cublasGemmAlgo_t algo)
{
cublasStatus_t status = cublasGemmEx(handle,
transa,
transb,
m,
n,
k,
(const void*)alpha,
(const void*)A,
CUDA_R_32F,
(transa == CUBLAS_OP_N) ? m : k,
(const void*)B,
CUDA_R_32F,
(transb == CUBLAS_OP_N) ? k : n,
(const void*)beta,
C,
CUDA_R_32F,
m,
CUDA_R_32F,
algo);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf(stderr, "!!!! kernel execution error.\n");
return EXIT_FAILURE;
}
return 0;
}
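// FP16 GEMM wrapper: same call pattern, but A, B and C are half precision
// (CUDA_R_16F) while accumulation stays in FP32.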
int cublas_gemm_ex(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const __half* A,
const __half* B,
__half* C,
cublasGemmAlgo_t algo)
{
cublasStatus_t status = cublasGemmEx(handle,
transa,
transb,
m,
n,
k,
(const void*)alpha,
(const void*)A,
CUDA_R_16F,
(transa == CUBLAS_OP_N) ? m : k,
(const void*)B,
CUDA_R_16F,
(transb == CUBLAS_OP_N) ? k : n,
(const void*)beta,
(void*)C,
CUDA_R_16F,
m,
CUDA_R_32F,
algo);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf(stderr, "!!!! kernel execution error.\n");
return EXIT_FAILURE;
}
return 0;
}
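// FP32 strided batched GEMM wrapper: launches a batch of independent GEMMs whose
// operands are laid out at fixed strides (stride_A, stride_B, stride_C) in memory.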
int cublas_strided_batched_gemm(cublasHandle_t handle,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const float* A,
const float* B,
float* C,
cublasOperation_t op_A,
cublasOperation_t op_B,
int stride_A,
int stride_B,
int stride_C,
int batch,
cublasGemmAlgo_t algo)
{
cublasStatus_t status = cublasGemmStridedBatchedEx(handle,
op_A,
op_B,
m,
n,
k,
alpha,
A,
CUDA_R_32F,
(op_A == CUBLAS_OP_N) ? m : k,
stride_A,
B,
CUDA_R_32F,
(op_B == CUBLAS_OP_N) ? k : n,
stride_B,
beta,
C,
CUDA_R_32F,
m,
stride_C,
batch,
CUDA_R_32F,
algo);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf(stderr, "!!!! kernel execution error.\n");
return EXIT_FAILURE;
}
return 0;
}
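// FP16 strided batched GEMM wrapper: half-precision operands with FP32 compute.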
int cublas_strided_batched_gemm(cublasHandle_t handle,
int m,
int n,
int k,
const float* alpha,
const float* beta,
const __half* A,
const __half* B,
__half* C,
cublasOperation_t op_A,
cublasOperation_t op_B,
int stride_A,
int stride_B,
int stride_C,
int batch,
cublasGemmAlgo_t algo)
{
cublasStatus_t status = cublasGemmStridedBatchedEx(handle,
op_A,
op_B,
m,
n,
k,
alpha,
A,
CUDA_R_16F,
(op_A == CUBLAS_OP_N) ? m : k,
stride_A,
B,
CUDA_R_16F,
(op_B == CUBLAS_OP_N) ? k : n,
stride_B,
beta,
C,
CUDA_R_16F,
m,
stride_C,
batch,
CUDA_R_32F,
algo);
if (status != CUBLAS_STATUS_SUCCESS) {
fprintf(stderr, "!!!! kernel execution error.\n");
return EXIT_FAILURE;
}
return 0;
}
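For reference, here is a minimal sketch of how the half-precision overload above might be called. The handle setup, device pointers, and algorithm choice are illustrative assumptions rather than part of this commit; cuBLAS is column-major, and the wrapper derives the leading dimensions from the transpose flags while accumulating in FP32.

#include <cublas_v2.h>
#include <cuda_fp16.h>
#include "cublas_wrappers.h"

// Hypothetical usage sketch: d_A, d_B, d_C are column-major device buffers of
// size m*k, k*n and m*n respectively.
int example_half_gemm(const __half* d_A, const __half* d_B, __half* d_C, int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Scalars stay in FP32 because the wrapper requests FP32 compute (CUDA_R_32F).
    const float alpha = 1.0f;
    const float beta = 0.0f;

    // C = A * B with no transposition; the wrapper returns 0 on success,
    // EXIT_FAILURE otherwise.
    int rc = cublas_gemm_ex(handle,
                            CUBLAS_OP_N,
                            CUBLAS_OP_N,
                            m,
                            n,
                            k,
                            &alpha,
                            &beta,
                            d_A,
                            d_B,
                            d_C,
                            CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cublasDestroy(handle);
    return rc;
}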
/* Taken from NVIDIA/apex commit 855808f3fc268e9715d613f3c2e56469d8c986d8 */
#include <ATen/ATen.h>
// Forward/backward compatibility hack around
// https://github.com/pytorch/pytorch/commit/3aeb78079bcd68282fe9117088e138b77318e288
// pending more future-proof guidance from upstream.
// struct TypeShim
// {
// const at::Type& payload;
// TypeShim(const at::Type& type) : payload(type) {}
// // Enable trivial conversion to a const at::Type& for pre-3aeb78
// operator const at::Type&(){ return payload; };
// // Enable dispatch switch statements to take *this directly for post-3aeb78
// //operator at::ScalarType(){ return payload.; };
// };
#define DISPATCH_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \
switch(TYPE) \
{ \
case at::ScalarType::Float: \
{ \
using scalar_t_##LEVEL = float; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Half: \
{ \
using scalar_t_##LEVEL = at::Half; \
__VA_ARGS__; \
break; \
} \
default: \
AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
#define DISPATCH_DOUBLE_FLOAT_AND_HALF(TYPE, LEVEL, NAME, ...) \
switch(TYPE) \
{ \
case at::ScalarType::Double: \
{ \
using scalar_t_##LEVEL = double; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Float: \
{ \
using scalar_t_##LEVEL = float; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Half: \
{ \
using scalar_t_##LEVEL = at::Half; \
__VA_ARGS__; \
break; \
} \
default: \
AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
#define DISPATCH_DOUBLE_AND_FLOAT(TYPE, LEVEL, NAME, ...) \
switch(TYPE) \
{ \
case at::ScalarType::Double: \
{ \
using scalar_t_##LEVEL = double; \
__VA_ARGS__; \
break; \
} \
case at::ScalarType::Float: \
{ \
using scalar_t_##LEVEL = float; \
__VA_ARGS__; \
break; \
} \
default: \
AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
}
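These macros are meant to wrap a kernel launch (or any templated call), binding the C++ type that corresponds to the runtime dtype to scalar_t_<LEVEL>. A minimal, hypothetical usage sketch follows; the kernel and launch configuration are illustrative assumptions, not part of this header.

#include <ATen/ATen.h>

// Hypothetical elementwise kernel; scalar_t is float or at::Half depending on dispatch.
template <typename scalar_t>
__global__ void scale_kernel(scalar_t* data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = static_cast<scalar_t>(static_cast<float>(data[idx]) * factor);
}

void scale_inplace(at::Tensor& t, float factor)
{
    const int n = static_cast<int>(t.numel());
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    // LEVEL (0 here) only suffixes the type alias, so nested dispatches do not collide.
    DISPATCH_FLOAT_AND_HALF(t.scalar_type(), 0, "scale_inplace",
        scale_kernel<scalar_t_0><<<blocks, threads>>>(t.data_ptr<scalar_t_0>(), factor, n););
}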
template<typename T>
__device__ __forceinline__ T reduce_block_into_lanes
(T *x,
T val,
int lanes=1,
bool share_result=false) // lanes is intended to be <= 32.
{
int tid = threadIdx.x + threadIdx.y*blockDim.x;
int blockSize = blockDim.x*blockDim.y; // blockSize is intended to be a multiple of 32.
if(blockSize >= 64)
{
x[tid] = val;
__syncthreads();
}
#pragma unroll
for(int i = (blockSize >> 1); i >= 64; i >>= 1)
{
if(tid < i)
x[tid] = x[tid] + x[tid+i];
__syncthreads();
}
T final;
if(tid < 32)
{
if(blockSize >= 64)
final = x[tid] + x[tid+32];
else
final = val;
// __SYNCWARP();
#pragma unroll
for(int i = 16; i >= lanes; i >>= 1)
final = final + __shfl_down_sync(0xffffffff, final, i);
}
if(share_result)
{
if(tid < lanes)
x[tid] = final; // EpilogueOp
// Make sure the smem result is visible to all warps.
__syncthreads();
}
return final;
}
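For context, a hypothetical reduction kernel showing how reduce_block_into_lanes is typically called. The kernel itself, the 512-element shared buffer, and the per-block output layout are illustrative assumptions that mirror how the 512-element LAMB L2-norm intermediates earlier in this commit are sized.

// Hypothetical kernel: each block produces one partial sum of squares; with
// lanes=1 only lane 0 of the first warp holds the final block value.
// Assumes blockDim.x * blockDim.y is a multiple of 32 and at least 64.
__global__ void partial_sum_of_squares(const float* in, float* block_out, int n)
{
    __shared__ float smem[512];  // must cover blockDim.x * blockDim.y elements

    const int tid = threadIdx.x + threadIdx.y * blockDim.x;
    const int block_size = blockDim.x * blockDim.y;
    const int stride = block_size * gridDim.x;

    float val = 0.f;
    for (int i = blockIdx.x * block_size + tid; i < n; i += stride) val += in[i] * in[i];

    const float block_sum = reduce_block_into_lanes(smem, val, 1, false);
    if (tid == 0) block_out[blockIdx.x] = block_sum;
}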
@@ -5,6 +5,8 @@ Copyright 2020 The Microsoft DeepSpeed Team
 from deepspeed.pt.deepspeed_light import DeepSpeedLight
 from deepspeed.pt.deepspeed_light import ADAM_OPTIMIZER, LAMB_OPTIMIZER
 from deepspeed.pt.deepspeed_lr_schedules import add_tuning_arguments
+from deepspeed.pt.deepspeed_cuda import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig
+from deepspeed.pt.deepspeed_config import DeepSpeedConfig
 import deepspeed.pt.deepspeed_checkpointing as checkpointing
...
@@ -58,7 +58,7 @@ class FusedLamb(torch.optim.Optimizer):
                  min_coeff=0.01,
                  amsgrad=False):
         global fused_lamb_cuda
-        fused_lamb_cuda = importlib.import_module("fused_lamb_cuda")
+        fused_lamb_cuda = importlib.import_module("deepspeed_lamb_cuda")
         if amsgrad:
             raise RuntimeError('FusedLamb does not support the AMSGrad variant.')
...
@@ -283,7 +283,9 @@ def main(args=None):
             "-u",
             "-m",
             "deepspeed.pt.deepspeed_launch",
-            "--world_info={}".format(world_info_base64)
+            "--world_info={}".format(world_info_base64),
+            "--master_addr={}".format(args.master_addr),
+            "--master_port={}".format(args.master_port)
         ]
         cmd = deepspeed_launch + [args.user_script] + args.user_args
     else:
...
@@ -17,6 +17,7 @@ DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024
 NVIDIA V100 GPUs**, compared with the best published result of 67 minutes on
 the same number and generation of GPUs.
-For a technical overview, see our [blog post](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
+* Brief overview, see our [press release](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
+* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/05/28/bert-record.html).
-**Code and tutorials are coming soon!**
+* Tutorial on how to reproduce our results, see our [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/).
+* The source code for our transformer kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).