Unverified Commit 31f46fee authored by Jeff Rasley, committed by GitHub

DeepSpeed JIT op + PyPI support (#496)


Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
parent 0ad4fd88
......@@ -10,6 +10,7 @@ build/
dist/
*.so
deepspeed.egg-info/
build.txt
# Website
docs/_site/
......@@ -23,3 +24,7 @@ docs/code-docs/build
# Testing data
tests/unit/saved_checkpoint/
# Dev/IDE data
.vscode
.theia
[submodule "third_party/apex"]
path = third_party/apex
url = https://github.com/NVIDIA/apex.git
[submodule "DeepSpeedExamples"]
path = DeepSpeedExamples
url = https://github.com/microsoft/DeepSpeedExamples
......
global-include *.cpp *.h *.cu *.tr *.cuh *.cc *.txt
[![Build Status](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_apis/build/status/microsoft.DeepSpeed?branchName=master)](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/latest?definitionId=1&branchName=master)
[![PyPI version](https://badge.fury.io/py/deepspeed.svg)](https://badge.fury.io/py/deepspeed)
[![Documentation Status](https://readthedocs.org/projects/deepspeed/badge/?version=latest)](https://deepspeed.readthedocs.io/en/latest/?badge=latest)
[![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![Docker Pulls](https://img.shields.io/docker/pulls/deepspeed/deepspeed)](https://hub.docker.com/r/deepspeed/deepspeed)
......@@ -31,29 +32,25 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)
# News
* [2020/09/10] [DeepSpeed: Extreme-scale model training for everyone](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)
* [2020/11/12] [Simplified install, JIT compiled ops, PyPI releases, and reduced dependencies](#installation)
* [2020/11/10] [Efficient and robust compressed training through progressive layer dropping](https://www.deepspeed.ai/news/2020/10/28/progressive-layer-dropping-news.html)
* [2020/09/10] [DeepSpeed v0.3: Extreme-scale model training for everyone](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)
* [Powering 10x longer sequences and 6x faster execution through DeepSpeed Sparse Attention](https://www.deepspeed.ai/news/2020/09/08/sparse-attention-news.html)
* [Training a trillion parameters with pipeline parallelism](https://www.deepspeed.ai/news/2020/09/08/pipeline-parallelism.html)
* [Up to 5x less communication and 3.4x faster training through 1-bit Adam](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-news.html)
* [10x bigger model training on a single GPU with ZeRO-Offload](https://www.deepspeed.ai/news/2020/09/08/ZeRO-Offload.html)
* [2020/08/07] [DeepSpeed Microsoft Research Webinar](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html) is now available on-demand
* [2020/07/24] [DeepSpeed Microsoft Research Webinar](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html) on August 6th, 2020
[![DeepSpeed webinar](docs/assets/images/webinar-aug2020.png)](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-Live.html)
* [2020/05/19] [ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/)
* [2020/05/19] [An Order-of-Magnitude Larger and Faster Training with ZeRO-2](https://www.deepspeed.ai/news/2020/05/18/zero-stage2.html)
* [2020/05/19] [The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
* [2020/02/13] [Turing-NLG: A 17-billion-parameter language model by Microsoft](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)
* [2020/02/13] [ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
# Table of Contents
| Section | Description |
| --------------------------------------- | ------------------------------------------- |
| [Why DeepSpeed?](#why-deepspeed) | DeepSpeed overview |
| [Features](#features) | DeepSpeed features |
| [Further Reading](#further-reading) | DeepSpeed documentation, tutorials, etc. |
| [Contributing](#contributing) | Instructions for contributing to DeepSpeed |
| [Publications](#publications) | DeepSpeed publications |
| [Install](#installation) | Installation details |
| [Features](#features) | Feature list and overview |
| [Further Reading](#further-reading) | Documentation, tutorials, etc. |
| [Contributing](#contributing) | Instructions for contributing |
| [Publications](#publications) | Publications related to DeepSpeed |
# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
......@@ -65,8 +62,32 @@ a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
# Features
# Installation
The quickest way to get started with DeepSpeed is via pip; this will install
the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA
versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer
to as our 'ops'. By default, all of these extensions/ops are built
just-in-time (JIT) using [torch's JIT C++ extension loader that relies on
ninja](https://pytorch.org/docs/stable/cpp_extension.html) to build and
dynamically link them at runtime.
```bash
pip install deepspeed
```
After installation, you can validate your install and see which extensions/ops
your machine is compatible with via the DeepSpeed environment report.
```bash
ds_report
```
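The `ds_report` command is a thin wrapper around the new `deepspeed.env_report` module (see `bin/ds_report` later in this diff), so the same report can also be produced from Python:

```python
# Equivalent to the ds_report command line tool: the entry point simply calls
# deepspeed.env_report.main(), which prints the op compatibility table and
# general environment info.
from deepspeed.env_report import main

main()
```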
If you would like to pre-install any of the DeepSpeed extensions/ops (instead
of JIT compiling) or install pre-compiled ops via PyPI, please see our [advanced
installation instructions](https://www.deepspeed.ai/tutorials/advanced-install/).
# Features
Below we provide a brief feature list; see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.
......
......@@ -43,7 +43,6 @@ jobs:
conda install -q --yes conda
conda install -q --yes pip
conda install -q --yes gxx_linux-64
if [[ $(cuda.version) != "10.2" ]]; then conda install --yes -c conda-forge cudatoolkit-dev=$(cuda.version) ; fi
echo "PATH=$PATH, LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
displayName: 'Setup environment python=$(python.version) pytorch=$(pytorch.version) cuda=$(cuda.version)'
......@@ -51,9 +50,8 @@ jobs:
- script: |
source activate $(conda_env)
pip install --progress-bar=off torch==$(pytorch.version) torchvision==$(torchvision.version)
#-f https://download.pytorch.org/whl/torch_stable.html
./install.sh --local_only
#python -I basic_install_test.py
pip install .[dev]
ds_report
displayName: 'Install DeepSpeed'
- script: |
......@@ -71,7 +69,8 @@ jobs:
- script: |
source activate $(conda_env)
pytest --durations=0 --forked --verbose -x tests/unit/
if [[ -d ./torch-extensions ]]; then rm -rf ./torch-extensions; fi
TORCH_EXTENSIONS_DIR=./torch-extensions pytest --durations=0 --forked --verbose -x tests/unit/
displayName: 'Unit tests'
# - script: |
......
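For context, a hedged sketch of how the `TORCH_EXTENSIONS_DIR` variable used in the test step above interacts with torch's JIT extension loader (the path is just the throwaway directory from the CI script):

```python
# Assumption: DeepSpeed ops are JIT-built on first use via torch.utils.cpp_extension,
# which caches build artifacts under TORCH_EXTENSIONS_DIR when that variable is set.
# The CI step above points it at a disposable ./torch-extensions directory.
import os

os.environ["TORCH_EXTENSIONS_DIR"] = "./torch-extensions"  # set before any op is built

import deepspeed  # ops built later at runtime will be cached under this directory
```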
import torch
import warnings
import importlib
GREEN = '\033[92m'
RED = '\033[91m'
YELLOW = '\033[93m'
END = '\033[0m'
SUCCESS = f"{GREEN} [SUCCESS] {END}"
WARNING = f"{YELLOW} [WARNING] {END}"
FAIL = f'{RED} [FAIL] {END}'
INFO = ' [INFO]'
try:
import deepspeed
print(f"{SUCCESS} deepspeed successfully imported.")
except ImportError as err:
raise err
print(f"{INFO} torch install path: {torch.__path__}")
print(f"{INFO} torch version: {torch.__version__}, torch.cuda: {torch.version.cuda}")
print(f"{INFO} deepspeed install path: {deepspeed.__path__}")
print(
f"{INFO} deepspeed info: {deepspeed.__version__}, {deepspeed.__git_hash__}, {deepspeed.__git_branch__}"
)
try:
apex_C = importlib.import_module('apex_C')
print(f"{SUCCESS} apex extensions successfully installed")
except Exception as err:
print(f'{WARNING} apex extensions are not installed')
try:
from apex.optimizers import FP16_Optimizer
print(f"{INFO} using old-style apex")
except ImportError:
print(f"{INFO} using new-style apex")
try:
importlib.import_module('deepspeed.ops.lamb.fused_lamb_cuda')
print(f'{SUCCESS} fused lamb successfully installed.')
except Exception as err:
print(f"{WARNING} fused lamb is NOT installed.")
try:
importlib.import_module('deepspeed.ops.transformer.transformer_cuda')
print(f'{SUCCESS} transformer kernels successfully installed.')
except Exception as err:
print(f'{WARNING} transformer kernels are NOT installed.')
try:
with warnings.catch_warnings():
warnings.simplefilter('ignore')
importlib.import_module('deepspeed.ops.sparse_attention.cpp_utils')
import triton
print(f'{SUCCESS} sparse attention successfully installed.')
except ImportError:
print(f'{WARNING} sparse attention is NOT installed.')
try:
importlib.import_module('deepspeed.ops.adam.cpu_adam_op')
print(f'{SUCCESS} cpu-adam (used by ZeRO-offload) successfully installed.')
except ImportError:
print(f'{WARNING} cpu-adam (used by ZeRO-offload) is NOT installed.')
#!/usr/bin/env python
from deepspeed.env_report import main
if __name__ == '__main__':
main()
/* Copyright 2020 The Microsoft DeepSpeed Team
Copyright NVIDIA/apex
This file is adapted from fused adam in NVIDIA/apex, commit a109f85
*/
#ifndef TORCH_CHECK
#define TORCH_CHECK AT_CHECK
#endif
#ifdef VERSION_GE_1_3
#define DATA_PTR data_ptr
#else
#define DATA_PTR data
#endif
......@@ -4,30 +4,15 @@
__global__ void param_update_kernel(const float* input, __half* output, int size)
{
const float4* input_cast = reinterpret_cast<const float4*>(input);
float2* output_cast = reinterpret_cast<float2*>(output);
int id = blockIdx.x * blockDim.x + threadIdx.x;
if (id < size) {
float4 data = input_cast[id];
float2 cast_data;
__half* output_h = reinterpret_cast<__half*>(&cast_data);
output_h[0] = (__half)data.x;
output_h[1] = (__half)data.y;
output_h[2] = (__half)data.z;
output_h[3] = (__half)data.w;
output_cast[id] = cast_data;
}
if (id < size) { output[id] = (__half)input[id]; }
}
void launch_param_update(const float* input, __half* output, int size, cudaStream_t stream)
{
int threads = 512;
int threads = 1024;
size /= 4;
dim3 grid_dim((size - 1) / threads + 1);
dim3 block_dim(threads);
......
#include <torch/extension.h>
void multi_tensor_adam_cuda(int chunk_size,
at::Tensor noop_flag,
std::vector<std::vector<at::Tensor>> tensor_lists,
const float lr,
const float beta1,
const float beta2,
const float epsilon,
const int step,
const int mode,
const int bias_correction,
const float weight_decay);
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
m.def("multi_tensor_adam",
&multi_tensor_adam_cuda,
"Compute and apply gradient update to parameters for Adam optimizer");
}
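To illustrate how this frontend is consumed, here is a hedged sketch of JIT-building the extension with torch's C++ extension loader, which is the mechanism the new op builders wrap (the module name and source paths are assumptions for illustration):

```python
# Hypothetical sketch: build the fused Adam extension at runtime with torch's JIT
# C++ extension loader; in DeepSpeed proper this is wrapped by FusedAdamBuilder().load().
from torch.utils.cpp_extension import load

fused_adam = load(
    name="fused_adam",  # module name (assumed)
    sources=[
        "csrc/adam/fused_adam_frontend.cpp",  # assumed repository layout
        "csrc/adam/multi_tensor_adam.cu",
    ],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)

# The resulting module exposes the binding declared above:
#   fused_adam.multi_tensor_adam(chunk_size, noop_flag, tensor_lists, lr, beta1,
#                                beta2, epsilon, step, mode, bias_correction,
#                                weight_decay)
```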
/* Copyright 2020 The Microsoft DeepSpeed Team
Copyright NVIDIA/apex
This file is adapted from fused adam in NVIDIA/apex, commit a109f85
*/
#include <ATen/ATen.h>
#include <ATen/AccumulateType.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>
// Another possibility:
// #include <torch/all.h>
#include <assert.h>
#include "multi_tensor_apply.cuh"
#include "type_shim.h"
#define BLOCK_SIZE 512
#define ILP 4
typedef enum {
ADAM_MODE_0 = 0, // L2 regularization mode
ADAM_MODE_1 = 1 // Decoupled weight decay mode(AdamW)
} adamMode_t;
using MATH_T = float;
template <typename T>
struct AdamFunctor {
__device__ __forceinline__ void operator()(int chunk_size,
volatile int* noop_gmem,
TensorListMetadata<4>& tl,
const float beta1,
const float beta2,
const float beta1_correction,
const float beta2_correction,
const float epsilon,
const float lr,
adamMode_t mode,
const float decay)
{
// I'd like this kernel to propagate infs/nans.
// if(*noop_gmem == 1)
// return;
int tensor_loc = tl.block_to_tensor[blockIdx.x];
// potentially used to pass in a list of scalars
// int tensor_num = tl.start_tensor_this_launch + tensor_loc;
int chunk_idx = tl.block_to_chunk[blockIdx.x];
int n = tl.sizes[tensor_loc];
T* g = (T*)tl.addresses[0][tensor_loc];
g += chunk_idx * chunk_size;
T* p = (T*)tl.addresses[1][tensor_loc];
p += chunk_idx * chunk_size;
T* m = (T*)tl.addresses[2][tensor_loc];
m += chunk_idx * chunk_size;
T* v = (T*)tl.addresses[3][tensor_loc];
v += chunk_idx * chunk_size;
n -= chunk_idx * chunk_size;
// see note in multi_tensor_scale_kernel.cu
for (int i_start = 0; i_start < n && i_start < chunk_size; i_start += blockDim.x * ILP) {
MATH_T r_g[ILP];
MATH_T r_p[ILP];
MATH_T r_m[ILP];
MATH_T r_v[ILP];
#pragma unroll
for (int ii = 0; ii < ILP; ii++) {
int i = i_start + threadIdx.x + ii * blockDim.x;
if (i < n && i < chunk_size) {
r_g[ii] = g[i];
r_p[ii] = p[i];
r_m[ii] = m[i];
r_v[ii] = v[i];
} else {
r_g[ii] = MATH_T(0);
r_p[ii] = MATH_T(0);
r_m[ii] = MATH_T(0);
r_v[ii] = MATH_T(0);
}
}
#pragma unroll
for (int ii = 0; ii < ILP; ii++) {
if (mode == ADAM_MODE_0) { // L2
r_g[ii] = r_g[ii] + (decay * r_p[ii]);
r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii];
r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii];
MATH_T next_m_unbiased = r_m[ii] / beta1_correction;
MATH_T next_v_unbiased = r_v[ii] / beta2_correction;
MATH_T denom = sqrtf(next_v_unbiased) + epsilon;
MATH_T update = next_m_unbiased / denom;
r_p[ii] = r_p[ii] - (lr * update);
} else { // weight decay
r_m[ii] = beta1 * r_m[ii] + (1 - beta1) * r_g[ii];
r_v[ii] = beta2 * r_v[ii] + (1 - beta2) * r_g[ii] * r_g[ii];
MATH_T next_m_unbiased = r_m[ii] / beta1_correction;
MATH_T next_v_unbiased = r_v[ii] / beta2_correction;
MATH_T denom = sqrtf(next_v_unbiased) + epsilon;
MATH_T update = (next_m_unbiased / denom) + (decay * r_p[ii]);
r_p[ii] = r_p[ii] - (lr * update);
}
}
#pragma unroll
for (int ii = 0; ii < ILP; ii++) {
int i = i_start + threadIdx.x + ii * blockDim.x;
if (i < n && i < chunk_size) {
p[i] = r_p[ii];
m[i] = r_m[ii];
v[i] = r_v[ii];
}
}
}
}
};
void multi_tensor_adam_cuda(int chunk_size,
at::Tensor noop_flag,
std::vector<std::vector<at::Tensor>> tensor_lists,
const float lr,
const float beta1,
const float beta2,
const float epsilon,
const int step,
const int mode,
const int bias_correction,
const float weight_decay)
{
using namespace at;
// Handle bias correction mode
float bias_correction1 = 1.0f, bias_correction2 = 1.0f;
if (bias_correction == 1) {
bias_correction1 = 1 - std::pow(beta1, step);
bias_correction2 = 1 - std::pow(beta2, step);
}
// Assume single type across p,g,m1,m2 now
DISPATCH_DOUBLE_FLOAT_AND_HALF(tensor_lists[0][0].scalar_type(),
0,
"adam",
multi_tensor_apply<4>(BLOCK_SIZE,
chunk_size,
noop_flag,
tensor_lists,
AdamFunctor<scalar_t_0>(),
beta1,
beta2,
bias_correction1,
bias_correction2,
epsilon,
lr,
(adamMode_t)mode,
weight_decay);)
AT_CUDA_CHECK(cudaGetLastError());
}
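For reference, the per-element update computed by `AdamFunctor` above, written out in standard Adam/AdamW notation (this is only a restatement of the kernel; `beta1_correction` and `beta2_correction` are `1 - beta^step` when bias correction is enabled, and 1 otherwise):

```latex
% ADAM_MODE_0 (L2 regularization): fold weight decay into the gradient first,
%   g_t <- g_t + decay * p_{t-1}
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^{\,t}), \qquad \hat{v}_t = v_t / (1-\beta_2^{\,t}) \\
p_t &= p_{t-1} - \mathrm{lr}\cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
  \qquad \text{(ADAM\_MODE\_0, after the L2 fold-in)} \\
p_t &= p_{t-1} - \mathrm{lr}\cdot\bigl(\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \mathrm{decay}\cdot p_{t-1}\bigr)
  \qquad \text{(ADAM\_MODE\_1, decoupled weight decay / AdamW)}
\end{aligned}
```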
/* Copyright 2020 The Microsoft DeepSpeed Team
Copyright NVIDIA/apex
This file is adapted from fused adam in NVIDIA/apex, commit a109f85
*/
#include <ATen/ATen.h>
#include <ATen/AccumulateType.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/Exceptions.h>
#include <c10/cuda/CUDAGuard.h>
#include "compat.h"
#include <assert.h>
// #include <iostream>
// This header is the one-stop shop for all your multi-tensor apply needs.
// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
constexpr int depth_to_max_tensors[5] = {110, 64, 48, 36, 30};
constexpr int depth_to_max_blocks[5] = {320, 320, 320, 320, 320};
template <int n>
struct TensorListMetadata {
void* addresses[n][depth_to_max_tensors[n - 1]];
int sizes[depth_to_max_tensors[n - 1]];
unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
int block_to_chunk[depth_to_max_blocks[n - 1]]; // I fear this needs to be a full int.
int start_tensor_this_launch;
};
template <typename T, typename U, typename... ArgTypes>
__global__ void multi_tensor_apply_kernel(int chunk_size,
volatile int* noop_flag,
T tl,
U callable,
ArgTypes... args)
{
// Hand the chunk information to the user-supplied functor to process however it likes.
callable(chunk_size, noop_flag, tl, args...);
}
template <int depth, typename T, typename... ArgTypes>
void multi_tensor_apply(int block_size,
int chunk_size,
const at::Tensor& noop_flag,
const std::vector<std::vector<at::Tensor>>& tensor_lists,
T callable,
ArgTypes... args)
{
TORCH_CHECK(tensor_lists.size() == depth, "tensor_lists.size() != depth");
int len0 = tensor_lists[0].size();
TORCH_CHECK(len0 > 0, "tensor_lists[0].size() is not > 0");
auto ref_device = tensor_lists[0][0].device();
TORCH_CHECK(ref_device.type() == at::kCUDA, "expected input to be on cuda");
for (int l = 0; l < tensor_lists.size(); l++) // No range-based for because I need indices
{
TORCH_CHECK(tensor_lists[l].size() == len0, "Size mismatch among tensor lists");
for (int t = 0; t < tensor_lists[l].size(); t++) {
// TODO: Print which tensor fails.
bool contiguous_memory = tensor_lists[l][t].is_contiguous();
#ifdef VERSION_GE_1_5
contiguous_memory = (contiguous_memory ||
tensor_lists[l][t].is_contiguous(at::MemoryFormat::ChannelsLast));
#endif
TORCH_CHECK(contiguous_memory, "A tensor was not contiguous.");
TORCH_CHECK(tensor_lists[l][t].device() == ref_device,
"A tensor was not on the same device as the first tensor");
TORCH_CHECK(tensor_lists[l][t].numel() == tensor_lists[0][t].numel(), "Size mismatch");
}
}
int ntensors = tensor_lists[0].size();
TensorListMetadata<depth> tl;
const at::cuda::OptionalCUDAGuard device_guard(device_of(tensor_lists[0][0]));
auto stream = at::cuda::getCurrentCUDAStream();
tl.start_tensor_this_launch = 0;
int loc_block_info = 0;
int loc_tensor_info = 0;
for (int t = 0; t < ntensors; t++) {
tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel();
for (int d = 0; d < depth; d++)
tl.addresses[d][loc_tensor_info] = tensor_lists[d][t].data_ptr();
loc_tensor_info++;
int chunks_this_tensor = (tensor_lists[0][t].numel() + chunk_size - 1) / chunk_size;
for (int chunk = 0; chunk < chunks_this_tensor; chunk++) {
// std::cout << chunks_this_tensor << std::endl;
tl.block_to_tensor[loc_block_info] = loc_tensor_info - 1;
tl.block_to_chunk[loc_block_info] = chunk;
loc_block_info++;
bool tensors_full = (loc_tensor_info == depth_to_max_tensors[depth - 1] &&
chunk == chunks_this_tensor - 1);
bool blocks_full = (loc_block_info == depth_to_max_blocks[depth - 1]);
bool last_chunk = (t == ntensors - 1 && chunk == chunks_this_tensor - 1);
if (tensors_full || blocks_full || last_chunk) {
// using accscalar_t = acc_type<scalar_t, true>;
multi_tensor_apply_kernel<<<loc_block_info, block_size, 0, stream>>>(
chunk_size, noop_flag.DATA_PTR<int>(), tl, callable, args...);
AT_CUDA_CHECK(cudaGetLastError());
// Reset. The control flow possibilities here make my brain hurt.
loc_block_info = 0;
if (chunk == chunks_this_tensor - 1) {
// std::cout << "Hit case 1 " << cond1 << " " << cond2 << " " << cond3 <<
// std::endl;
loc_tensor_info = 0;
tl.start_tensor_this_launch = t + 1;
} else {
// std::cout << "Hit case 2 " << cond1 << " " << cond2 << " " << cond3 <<
// std::endl;
tl.sizes[0] = tl.sizes[loc_tensor_info - 1];
for (int d = 0; d < depth; d++)
tl.addresses[d][0] = tl.addresses[d][loc_tensor_info - 1];
loc_tensor_info = 1;
tl.start_tensor_this_launch = t;
}
}
}
}
}
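To make the control flow above easier to follow, a hedged Python restatement of the chunk-scheduling loop (names and structure simplified; this is not part of the PR):

```python
# Simplified model of multi_tensor_apply's scheduling: tensors are split into
# fixed-size chunks, chunk metadata is packed into small buffers, and a kernel
# launch is issued whenever the buffers fill up or the last chunk is reached.
def schedule_chunks(tensor_sizes, chunk_size, max_tensors, max_blocks):
    launches = []                 # each launch is a list of (tensor_idx, chunk_idx)
    blocks = []                   # metadata for the pending launch
    tensors_in_launch = 0
    for t, numel in enumerate(tensor_sizes):
        tensors_in_launch += 1
        chunks = (numel + chunk_size - 1) // chunk_size
        for c in range(chunks):
            blocks.append((t, c))
            tensors_full = tensors_in_launch == max_tensors and c == chunks - 1
            blocks_full = len(blocks) == max_blocks
            last_chunk = t == len(tensor_sizes) - 1 and c == chunks - 1
            if tensors_full or blocks_full or last_chunk:
                launches.append(blocks)   # one multi_tensor_apply_kernel launch
                blocks = []
                # if the current tensor still has chunks left, it is carried over
                # into the next launch (slot 0 in the real metadata struct)
                tensors_in_launch = 0 if c == chunks - 1 else 1
    return launches

# e.g. schedule_chunks([100_000, 3_000], chunk_size=65_536,
#                      max_tensors=36, max_blocks=320)
```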
/*
Copyright 2020 The Microsoft DeepSpeed Team
Copyright NVIDIA/apex
This file is adapted from fused adam in NVIDIA/apex, commit a109f85
*/
#include <torch/csrc/utils/tensor_flatten.h>
#include <torch/extension.h>
// https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/tensor_flatten.h
at::Tensor flatten(std::vector<at::Tensor> tensors)
{
return torch::utils::flatten_dense_tensors(tensors);
}
std::vector<at::Tensor> unflatten(at::Tensor flat, std::vector<at::Tensor> tensors)
{
return torch::utils::unflatten_dense_tensors(flat, tensors);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
m.def("flatten", &flatten, "Flatten dense tensors");
m.def("unflatten", &unflatten, "Unflatten dense tensors");
}
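For reference, the same flatten/unflatten behavior is reachable from Python through torch's internal helpers; the sketch below uses private `torch._utils` functions, so treat it as illustrative rather than a supported API:

```python
# Illustrative only: torch._utils provides the dense-tensor flatten/unflatten
# helpers that the C++ extension above binds via torch::utils.
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

tensors = [torch.randn(3), torch.randn(2, 2)]
flat = _flatten_dense_tensors(tensors)               # one contiguous 1-D tensor
restored = _unflatten_dense_tensors(flat, tensors)   # tensors shaped like the originals
assert all(torch.equal(a, b) for a, b in zip(tensors, restored))
```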
......@@ -17,24 +17,19 @@ from .utils import log_dist
from .pipe import PipelineModule
try:
from .git_version_info import version, git_hash, git_branch
except ImportError:
version = "0.0.0+unknown"
git_hash = None
git_branch = None
from .git_version_info import version, git_hash, git_branch
def _parse_version(version_str):
'''Parse a version string and extract the major, minor, and patch versions.'''
import re
matched = re.search(r'^(\d+)\.(\d+)\.(\d+)', version_str)
return int(matched.group(1)), int(matched.group(2)), int(matched.group(3))
# Export version information
version, __version_tag__ = version.split('+')
__version_major__ = int(version.split('.')[0])
__version_minor__ = int(version.split('.')[1])
__version_patch__ = int(version.split('.')[2])
__version__ = '.'.join(
map(str,
[__version_major__,
__version_minor__,
__version_patch__]))
__version__ = f"{__version__}+{__version_tag__}"
__version__ = version
__version_major__, __version_minor__, __version_patch__ = _parse_version(__version__)
__git_hash__ = git_hash
__git_branch__ = git_branch
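As a quick illustration of the parsing above (the version string is hypothetical):

```python
# _parse_version extracts the numeric major/minor/patch components and ignores
# any local version suffix such as a git hash.
major, minor, patch = _parse_version("0.3.0+31f46fe")
assert (major, minor, patch) == (0, 3, 0)
```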
......
import torch
import deepspeed
import subprocess
from .ops.op_builder import ALL_OPS
from .git_version_info import installed_ops, torch_info
from .ops import __compatible_ops__ as compatible_ops
GREEN = '\033[92m'
RED = '\033[91m'
YELLOW = '\033[93m'
END = '\033[0m'
SUCCESS = f"{GREEN} [SUCCESS] {END}"
OKAY = f"{GREEN}[OKAY]{END}"
WARNING = f"{YELLOW}[WARNING]{END}"
FAIL = f'{RED}[FAIL]{END}'
INFO = '[INFO]'
color_len = len(GREEN) + len(END)
okay = f"{GREEN}[OKAY]{END}"
warning = f"{YELLOW}[WARNING]{END}"
def op_report():
max_dots = 23
max_dots2 = 11
h = ["op name", "installed", "compatible"]
print("-" * (max_dots + max_dots2 + len(h[0]) + len(h[1])))
print("DeepSpeed C++/CUDA extension op report")
print("-" * (max_dots + max_dots2 + len(h[0]) + len(h[1])))
print("NOTE: Ops not installed will be just-in-time (JIT) compiled at\n"
" runtime if needed. Op compatibility means that your system\n"
" meet the required dependencies to JIT install the op.")
print("-" * (max_dots + max_dots2 + len(h[0]) + len(h[1])))
print("JIT compiled ops requires ninja")
ninja_status = OKAY if ninja_installed() else FAIL
print('ninja', "." * (max_dots - 5), ninja_status)
print("-" * (max_dots + max_dots2 + len(h[0]) + len(h[1])))
print(h[0], "." * (max_dots - len(h[0])), h[1], "." * (max_dots2 - len(h[1])), h[2])
print("-" * (max_dots + max_dots2 + len(h[0]) + len(h[1])))
installed = f"{GREEN}[YES]{END}"
no = f"{YELLOW}[NO]{END}"
for op_name, builder in ALL_OPS.items():
dots = "." * (max_dots - len(op_name))
is_compatible = OKAY if builder.is_compatible() else no
is_installed = installed if installed_ops[op_name] else no
dots2 = '.' * ((len(h[1]) + (max_dots2 - len(h[1]))) -
(len(is_installed) - color_len))
print(op_name, dots, is_installed, dots2, is_compatible)
print("-" * (max_dots + max_dots2 + len(h[0]) + len(h[1])))
def ninja_installed():
try:
import ninja
except ImportError:
return False
return True
def nvcc_version():
import torch.utils.cpp_extension
cuda_home = torch.utils.cpp_extension.CUDA_HOME
try:
output = subprocess.check_output([cuda_home + "/bin/nvcc",
"-V"],
universal_newlines=True)
except FileNotFoundError:
return f"{RED} [FAIL] nvcc missing {END}"
output_split = output.split()
release_idx = output_split.index("release")
release = output_split[release_idx + 1].replace(',', '').split(".")
return ".".join(release)
def debug_report():
max_dots = 33
report = [
("torch install path",
torch.__path__),
("torch version",
torch.__version__),
("torch cuda version",
torch.version.cuda),
("nvcc version",
nvcc_version()),
("deepspeed install path",
deepspeed.__path__),
("deepspeed info",
f"{deepspeed.__version__}, {deepspeed.__git_hash__}, {deepspeed.__git_branch__}"
),
("deepspeed wheel compiled w.",
f"torch {torch_info['version']}, cuda {torch_info['cuda_version']}"),
]
print("DeepSpeed general environment info:")
for name, value in report:
print(name, "." * (max_dots - len(name)), value)
def main():
op_report()
debug_report()
if __name__ == "__main__":
main()
......@@ -3,12 +3,11 @@ try:
from .git_version_info_installed import *
except ModuleNotFoundError:
# Will be missing from checkouts that haven't been installed (e.g., readthedocs)
version = '0.3.0+[none]'
version = open('version.txt', 'r').read().strip()
git_hash = '[none]'
git_branch = '[none]'
installed_ops = {
'lamb': False,
'transformer': False,
'sparse-attn': False,
'cpu-adam': False
}
from .ops.op_builder import ALL_OPS
installed_ops = dict.fromkeys(ALL_OPS.keys(), False)
compatible_ops = dict.fromkeys(ALL_OPS.keys(), False)
torch_info = {'version': "0.0", "cuda_version": "0.0"}
from ..git_version_info import installed_ops as __installed_ops__
from . import adam
from . import lamb
from . import sparse_attention
from . import transformer
if __installed_ops__['sparse-attn']:
from . import sparse_attention
if __installed_ops__['cpu-adam']:
from . import adam
from ..git_version_info import compatible_ops as __compatible_ops__
from .cpu_adam import DeepSpeedCPUAdam
from .fused_adam import FusedAdam
......@@ -4,9 +4,9 @@ Copyright 2020 The Microsoft DeepSpeed Team
import math
import torch
import importlib
ds_opt_adam = None
import time
from pathlib import Path
from ..op_builder import CPUAdamBuilder
class DeepSpeedCPUAdam(torch.optim.Optimizer):
......@@ -67,15 +67,15 @@ class DeepSpeedCPUAdam(torch.optim.Optimizer):
self.opt_id = DeepSpeedCPUAdam.optimizer_id
DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
global ds_opt_adam
ds_opt_adam = importlib.import_module('deepspeed.ops.adam.cpu_adam_op')
ds_opt_adam.create_adam(self.opt_id,
lr,
betas[0],
betas[1],
eps,
weight_decay,
adamw_mode)
self.ds_opt_adam = CPUAdamBuilder().load()
self.ds_opt_adam.create_adam(self.opt_id,
lr,
betas[0],
betas[1],
eps,
weight_decay,
adamw_mode)
def __setstate__(self, state):
super(DeepSpeedCPUAdam, self).__setstate__(state)
......@@ -101,18 +101,20 @@ class DeepSpeedCPUAdam(torch.optim.Optimizer):
print(f'group {group_id} param {param_id} = {p.numel()}')
state['step'] = 0
# gradient momentums
state['exp_avg'] = torch.zeros_like(
p.data,
memory_format=torch.preserve_format)
state['exp_avg'] = torch.zeros_like(p.data,
dtype=p.dtype,
device='cpu')
#memory_format=torch.preserve_format)
# gradient variances
state['exp_avg_sq'] = torch.zeros_like(
p.data,
memory_format=torch.preserve_format)
state['exp_avg_sq'] = torch.zeros_like(p.data,
dtype=p.dtype,
device='cpu')
#memory_format=torch.preserve_format)
state['step'] += 1
if fp16_param_groups is not None:
ds_opt_adam.adam_update_copy(
self.ds_opt_adam.adam_update_copy(
self.opt_id,
state['step'],
group['lr'],
......@@ -122,11 +124,11 @@ class DeepSpeedCPUAdam(torch.optim.Optimizer):
state['exp_avg_sq'],
fp16_param_groups[group_id][param_id].data)
else:
ds_opt_adam.adam_update(self.opt_id,
state['step'],
group['lr'],
p.data,
p.grad.data,
state['exp_avg'],
state['exp_avg_sq'])
self.ds_opt_adam.adam_update(self.opt_id,
state['step'],
group['lr'],
p.data,
p.grad.data,
state['exp_avg'],
state['exp_avg_sq'])
return loss
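A minimal usage sketch for the optimizer above (the model and hyperparameters are placeholders; the op itself is JIT-built on first construction via `CPUAdamBuilder`):

```python
# Hedged example: DeepSpeedCPUAdam keeps the optimizer state and the update on
# the CPU, which is how ZeRO-Offload uses it. The model here is a placeholder.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(10, 10)   # parameters stay on CPU for this optimizer
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-3, weight_decay=0.01)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()                  # runs the JIT-built fused CPU Adam kernel
```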
'''
Copyright 2020 The Microsoft DeepSpeed Team
Copyright NVIDIA/apex
This file is adapted from fused adam in NVIDIA/apex, commit a109f85
'''
import torch
import importlib
from .multi_tensor_apply import MultiTensorApply
multi_tensor_applier = MultiTensorApply(2048 * 32)
from ..op_builder import FusedAdamBuilder
class FusedAdam(torch.optim.Optimizer):
"""Implements Adam algorithm.
Currently GPU-only.
This version of fused Adam implements 2 fusions.
* Fusion of the Adam update's elementwise operations
* A multi-tensor apply launch that batches the elementwise updates applied to all the model's parameters into one or a few kernel launches.
Adam was proposed in `Adam: A Method for Stochastic Optimization`_.
Arguments:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups.
lr (float, optional): learning rate. (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing
running averages of gradient and its square. (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve
numerical stability. (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional): whether to use the AMSGrad variant of this
algorithm from the paper `On the Convergence of Adam and Beyond`_
(default: False) NOT SUPPORTED in FusedAdam!
adam_w_mode (boolean, optional): Apply L2 regularization or weight decay;
True for decoupled weight decay (also known as AdamW). (default: True)
set_grad_none (bool, optional): whether set grad to None when zero_grad()
method is called. (default: True)
.. _Adam - A Method for Stochastic Optimization:
https://arxiv.org/abs/1412.6980
.. _On the Convergence of Adam and Beyond:
https://openreview.net/forum?id=ryQu7f-RZ
"""
def __init__(self,
params,
lr=1e-3,
bias_correction=True,
betas=(0.9,
0.999),
eps=1e-8,
adam_w_mode=True,
weight_decay=0.,
amsgrad=False,
set_grad_none=True):
if amsgrad:
raise RuntimeError('FusedAdam does not support the AMSGrad variant.')
defaults = dict(lr=lr,
bias_correction=bias_correction,
betas=betas,
eps=eps,
weight_decay=weight_decay)
super(FusedAdam, self).__init__(params, defaults)
self.adam_w_mode = 1 if adam_w_mode else 0
self.set_grad_none = set_grad_none
fused_adam_cuda = FusedAdamBuilder().load()
# Skip buffer
self._dummy_overflow_buf = torch.cuda.IntTensor([0])
self.multi_tensor_adam = fused_adam_cuda.multi_tensor_adam
def zero_grad(self):
if self.set_grad_none:
for group in self.param_groups:
for p in group['params']:
p.grad = None
else:
super(FusedAdam, self).zero_grad()
def step(self,
closure=None,
grads=None,
output_params=None,
scale=None,
grad_norms=None):
"""Performs a single optimization step.
Arguments:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
The remaining arguments are deprecated, and are only retained (for the moment) for error-checking purposes.
"""
if any(p is not None for p in [grads, output_params, scale, grad_norms]):
raise RuntimeError(
'FusedAdam has been updated. Simply initialize it identically to torch.optim.Adam, and call step() with no arguments.'
)
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
bias_correction = 1 if group['bias_correction'] else 0
beta1, beta2 = group['betas']
# assume same step across group now to simplify things
# per-parameter step can easily be supported by making it a tensor, or by passing a list into the kernel
if 'step' in group:
group['step'] += 1
else:
group['step'] = 1
# create lists for multi-tensor apply
g_16, p_16, m_16, v_16 = [], [], [], []
g_32, p_32, m_32, v_32 = [], [], [], []
for p in group['params']:
if p.grad is None:
continue
if p.grad.data.is_sparse:
raise RuntimeError(
'FusedAdam does not support sparse gradients, please consider SparseAdam instead'
)
state = self.state[p]
# State initialization
if len(state) == 0:
# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p.data)
# Exponential moving average of squared gradient values
state['exp_avg_sq'] = torch.zeros_like(p.data)
if p.dtype == torch.float16:
g_16.append(p.grad.data)
p_16.append(p.data)
m_16.append(state['exp_avg'])
v_16.append(state['exp_avg_sq'])
elif p.dtype == torch.float32:
g_32.append(p.grad.data)
p_32.append(p.data)
m_32.append(state['exp_avg'])
v_32.append(state['exp_avg_sq'])
else:
raise RuntimeError('FusedAdam only supports fp16 and fp32.')
if (len(g_16) > 0):
multi_tensor_applier(self.multi_tensor_adam,
self._dummy_overflow_buf,
[g_16,
p_16,
m_16,
v_16],
group['lr'],
beta1,
beta2,
group['eps'],
group['step'],
self.adam_w_mode,
bias_correction,
group['weight_decay'])
if (len(g_32) > 0):
multi_tensor_applier(self.multi_tensor_adam,
self._dummy_overflow_buf,
[g_32,
p_32,
m_32,
v_32],
group['lr'],
beta1,
beta2,
group['eps'],
group['step'],
self.adam_w_mode,
bias_correction,
group['weight_decay'])
return loss
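And a corresponding usage sketch for `FusedAdam` (GPU-only, as the docstring notes; the model and hyperparameters are placeholders):

```python
# Hedged example: FusedAdam expects CUDA parameters and JIT-builds its kernel
# via FusedAdamBuilder() when the optimizer is constructed.
import torch
from deepspeed.ops.adam import FusedAdam

model = torch.nn.Linear(10, 10).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3, adam_w_mode=True)

loss = model(torch.randn(4, 10, device='cuda')).sum()
loss.backward()
optimizer.step()   # batched multi-tensor kernel launch updates the fp32 params
```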