Commit 67ea635f authored by aiss

push dsv0.8.2 version

parent 1b2721ad
Pipeline #201 failed in 0 seconds
* @jeffra @samyam @tjruwase @ShadenSmith @conglongli @awan-10 @arashashari @cli99 @eltonzheng @minjiaz @RezaYazdaniAminabadi @niumanar
# This file is used to subscribe for notifications for PRs
# related to specific file paths, does not necessarily mean
# approval is required from these people before merging.
#
# Learn more about CODEOWNERS syntax here:
# https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
# top-level repo folders
/.github/ @jeffra @mrwyattii
/azure/ @jeffra @awan-10
/benchmarks/ @jeffra @awan-10 @mrwyattii @molly-smith
/bin/ @jeffra
/csrc/ @RezaYazdaniAminabadi @awan-10 @jeffra @cmikeh2 @arashb
/deepspeed/ @jeffra
/docker/ @jeffra @awan-10
/docs/ @jeffra @mrwyattii
/examples/ @jeffra @awan-10 @mrwyattii
/op_builder/ @jeffra @RezaYazdaniAminabadi @cmikeh2
/release/ @jeffra @mrwyattii
/requirements/ @jeffra @mrwyattii
/scripts/ @jeffra @awan-10
/tests/ @jeffra @mrwyattii @tjruwase
# deepspeed
/deepspeed/autotuning/ @cli99
/deepspeed/checkpoint/ @tjruwase
/deepspeed/comm/ @awan-10
/deepspeed/compression/ @yaozhewei @minjiaz @xiaoxiawu-microsoft @conglongli
/deepspeed/elasticity/ @jeffra @awan-10
/deepspeed/launcher/ @jeffra @awan-10
/deepspeed/module_inject/ @RezaYazdaniAminabadi @jeffra @mrwyattii @awan-10 @cmikeh2 @arashb
/deepspeed/moe/ @awan-10
/deepspeed/monitor/ @awan-10 @jeffra
/deepspeed/nebula/ @tjruwase @jeffra
/deepspeed/ops/ @RezaYazdaniAminabadi @jeffra @mrwyattii @awan-10 @cmikeh2 @arashb
/deepspeed/pipe/ @ShadenSmith @duli2012
/deepspeed/profiling/ @cli99
/deepspeed/utils/ @jeffra @tjruwase @awan-10
# inference
/deepspeed/inference/ @RezaYazdaniAminabadi @jeffra @mrwyattii @awan-10 @cmikeh2 @arashb
/deepspeed/model_implementations/ @RezaYazdaniAminabadi @jeffra @mrwyattii @awan-10 @cmikeh2 @arashb
# training
/deepspeed/runtime/ @jeffra @tjruwase
/deepspeed/runtime/activation_checkpointing/ @jeffra @tjruwase
/deepspeed/runtime/checkpoint_engine/ @tjruwase @jeffra
/deepspeed/runtime/comm/ @awan-10
/deepspeed/runtime/compression/ @awan-10 @conglongli
/deepspeed/runtime/data_pipeline/ @conglongli
/deepspeed/runtime/fp16/ @jeffra @tjruwase
/deepspeed/runtime/fp16/onebit/ @conglongli @awan-10
/deepspeed/runtime/pipe/ @ShadenSmith @duli2012
/deepspeed/runtime/swap_tensor/ @tjruwase @mrwyattii
/deepspeed/runtime/zero/ @jeffra @tjruwase @samyam @mrwyattii
include *.txt README.md
recursive-include requirements *.txt
recursive-include deepspeed *.cpp *.h *.cu *.hip *.tr *.cuh *.cc *.json
recursive-include csrc *.cpp *.h *.cu *.tr *.cuh *.cc
recursive-include op_builder *.py
recursive-include benchmarks *.py
recursive-include accelerator *.py
recursive-include deepspeed *.tr
recursive-exclude deepspeed/ops/csrc *.cpp *.h *.cu *.cuh *.cc
prune csrc
prune op_builder
prune accelerator
[![Build Status](https://github.com/microsoft/deepspeed/workflows/Build/badge.svg)](https://github.com/microsoft/DeepSpeed/actions)
[![PyPI version](https://badge.fury.io/py/deepspeed.svg)](https://pypi.org/project/deepspeed/)
[![Documentation Status](https://readthedocs.org/projects/deepspeed/badge/?version=latest)](https://deepspeed.readthedocs.io/en/latest/?badge=latest)
[![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![Downloads](https://pepy.tech/badge/deepspeed)](https://pepy.tech/project/deepspeed)
[![Build](https://badgen.net/badge/build/check-status/blue)](#build-pipeline-status)
<div align="center">
<img src="docs/assets/images/DeepSpeed_light.svg#gh-light-mode-only" width="400px">
<img src="docs/assets/images/DeepSpeed_dark_transparent.svg#gh-dark-mode-only" width="400px">
</div>
<!--
Remove until pypi issue is resolved: https://status.python.org/incidents/2jj696st6yn5
[![Downloads](https://pepy.tech/badge/deepspeed/month)](https://pepy.tech/project/deepspeed)
-->
## Latest News
### DeepSpeed is hiring, [come join us!](https://careers.microsoft.com/us/en/search-results?keywords=http:%2F%2Fdeepspeed.ai)
<b> DeepSpeed trained the world's most powerful language models ([MT-530B](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/), [BLOOM](https://huggingface.co/blog/bloom-megatron-deepspeed)); [learn how](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/).</b>
* [2023/02] [Automatic Tensor Parallelism: Enables tensor parallelism by default without providing an injection policy](https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/)
* [2022/12] [DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality](https://www.deepspeed.ai/2022/12/11/data-efficiency.html)
* [2022/11] [Stable Diffusion Image Generation under 1 second w. DeepSpeed MII](https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/benchmark/txt2img)
* [2022/10] [DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference](https://www.deepspeed.ai/2022/10/10/mii.html)
* [2022/09] [ZeRO-Inference: Democratizing massive model inference](https://www.deepspeed.ai/2022/09/09/zero-inference.html)
* [2022/07] [Azure and DeepSpeed empower easy-to-use and high-performance model training](https://azure.microsoft.com/en-us/blog/azure-empowers-easytouse-highperformance-and-hyperscale-model-training-using-deepspeed/)
* [2022/03/21] [Supporting efficient large model training on AMD Instinct GPUs with DeepSpeed](https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/)
* [2022/03/07] [Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/)
* [2022/01/19] [DeepSpeed: Advancing MoE inference and training to power next-generation AI scale](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/)
  * [Mixture of Experts (MoE) for NLG tutorial](https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/)
  * [Mixture of Experts (MoE) Inference tutorial](https://www.deepspeed.ai/tutorials/moe-inference-tutorial)
* [2021/11/15] [Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed](https://www.deepspeed.ai/news/2021/11/15/autotuning.html)
* [2021/10/11] [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
* Read more on how to [train large models with DeepSpeed](https://www.deepspeed.ai/tutorials/large-models-w-deepspeed/)
---
# Extreme Speed and Scale for DL Training and Inference
[DeepSpeed](https://www.deepspeed.ai/) is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can:
* Train/inference dense or sparse models with billions or trillions of parameters
* Achieve excellent system throughput and efficiently scale to thousands of GPUs
* Train/inference on resource-constrained GPU systems
* Achieve unprecedented low latency and high throughput for inference
* Achieve extreme compression for unparalleled inference latency and model size reduction at low cost
---
# DeepSpeed's three innovation pillars
<img src="docs/assets/images/3pillars.png" width="800px">
## DeepSpeed-Training
DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. These innovations, such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity, fall under the training pillar. Learn more: [DeepSpeed-Training](https://www.deepspeed.ai/training/)
## DeepSpeed-Inference
DeepSpeed brings together innovations in parallelism technology, such as tensor, pipeline, expert, and ZeRO-parallelism, and combines them with high-performance custom inference kernels, communication optimizations, and heterogeneous memory technologies to enable inference at an unprecedented scale while achieving unparalleled latency, throughput, and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: [DeepSpeed-Inference](https://www.deepspeed.ai/inference)
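As a minimal sketch of what this looks like from user code (the toy `Linear` module stands in for a real trained model; the `mp_size` and `dtype` values are illustrative, and a CUDA device is assumed):

```python
import torch
import deepspeed

# Stand-in for a real trained model (e.g. a HuggingFace transformer).
model = torch.nn.Linear(8, 8).half().cuda()

# init_inference wraps the model in DeepSpeed's inference engine.
engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.half)
out = engine(torch.randn(1, 8, dtype=torch.half, device='cuda'))
```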
## DeepSpeed-Compression
To further increase inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, state-of-the-art compression innovations such as ZeroQuant and XTC are included under the compression pillar. Learn more: [DeepSpeed-Compression](https://www.deepspeed.ai/compression)
---
<p align="center"><i><b>10x Larger Models</b></i></p>
<p align="center"><i><b>10x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>
# DeepSpeed Software Suite
DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU:
* Extreme scale: Using the current generation of GPU clusters with hundreds of devices, DeepSpeed's 3D parallelism can efficiently train deep learning models with trillions of parameters.
* Extremely memory efficient: With just a single GPU, DeepSpeed's ZeRO-Offload can train models with over 10B parameters, 10x bigger than the state of the art, democratizing multi-billion-parameter model training so that many deep learning scientists can explore bigger and better models.
* Extremely long sequence length: DeepSpeed's sparse attention powers an order-of-magnitude longer input sequence and obtains up to 6x faster execution compared with dense transformers.
* Extremely communication efficient: 3D parallelism improves communication efficiency, allowing users to train multi-billion-parameter models 2–7x faster on clusters with limited network bandwidth. 1-bit Adam, 0/1 Adam, and 1-bit LAMB reduce communication volume by up to 26x while achieving convergence efficiency similar to Adam/LAMB, allowing for scaling to different types of GPU clusters and networks.
## DeepSpeed Library
Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
The [DeepSpeed](https://github.com/microsoft/deepspeed) library (this repository) implements and packages the innovations and technologies of the DeepSpeed Training, Inference, and Compression pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of a multitude of features within a single training, inference, or compression pipeline. The DeepSpeed library is heavily adopted by the DL community and has been used to enable some of the most powerful models (see [DeepSpeed Adoption](#deepspeed-adoption)).
## Model Implementations for Inference (MII)
[Model Implementations for Inference (MII)](https://github.com/microsoft/deepspeed-mii) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out of the box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code, while achieving significant latency reduction compared to their vanilla open-sourced versions.
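For illustration, a deployment following MII's deploy-and-query pattern might look like the sketch below (the model and deployment names are placeholders):

```python
import mii

# Deploy a model behind a local inference endpoint (names illustrative).
mii.deploy(task="text-generation",
           model="gpt2",
           deployment_name="gpt2_deployment")

# Query the deployment.
generator = mii.mii_query_handle("gpt2_deployment")
result = generator.query({"query": ["DeepSpeed is"]})
print(result)
```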
## DeepSpeed on Azure
DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest method; the recommended path is through the AzureML [recipes](https://github.com/Azure/azureml-examples/tree/main/v1/python-sdk/workflows/train/deepspeed). The job submission and data preparation scripts are available [here](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/azureml). For more details on how to use DeepSpeed on Azure, please follow the [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/).
---
# DeepSpeed Adoption
DeepSpeed is an important part of Microsoft’s new
[AI at Scale](https://www.microsoft.com/en-us/research/project/ai-at-scale/)
initiative to enable next-generation AI capabilities at scale, where you can find more
information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale).
**_For further documentation, tutorials, and technical deep-dives please see [deepspeed.ai](https://www.deepspeed.ai/)!_**
# Table of Contents
| Section | Description |
| --------------------------------------- | ------------------------------------------- |
| [Why DeepSpeed?](#why-deepspeed) | DeepSpeed overview |
| [Install](#installation) | Installation details |
| [Features](#features) | Feature list and overview |
| [Further Reading](#further-reading) | Documentation, tutorials, etc. |
| [Contributing](#contributing) | Instructions for contributing |
| [Publications](#publications) | Publications related to DeepSpeed |
| [Videos](#videos) | Videos related to DeepSpeed |
# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
model scientists also need to set up the state-of-the-art training techniques
such as distributed training, mixed precision, gradient accumulation, and
checkpointing. Yet still, scientists may not achieve the desired system
performance and convergence rate. Large model sizes are even more challenging:
a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
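Concretely, a model is typically handed to DeepSpeed through `deepspeed.initialize`; a minimal sketch (the config values are illustrative, and the script is assumed to be launched with the `deepspeed` launcher):

```python
import torch
import deepspeed

model = torch.nn.Linear(10, 10)  # stand-in for a real model

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# The returned engine handles distributed data parallelism, mixed
# precision, and optimizer management behind a familiar API.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```

Training then proceeds through the engine: the forward pass runs through `model_engine(...)`, followed by `model_engine.backward(loss)` and `model_engine.step()`.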
DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you'd like to include your model, please submit a PR):
* [Megatron-Turing NLG (530B)](https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)
* [Jurassic-1 (178B)](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
* [BLOOM (176B)](https://huggingface.co/blog/bloom-megatron-deepspeed)
* [GLM (130B)](https://github.com/THUDM/GLM-130B)
* [YaLM (100B)](https://github.com/yandex/YaLM-100B)
* [GPT-NeoX (20B)](https://github.com/EleutherAI/gpt-neox)
* [AlexaTM (20B)](https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning)
* [Turing NLG (17B)](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)
* [METRO-LM (5.4B)](https://arxiv.org/pdf/2204.06644.pdf)
DeepSpeed has been integrated with several different popular open-source DL frameworks such as:
| | Documentation |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
<img src="docs/assets/images/transformers-light.png#gh-light-mode-only" width="250px"><img src="docs/assets/images/transformers-dark.png#gh-dark-mode-only" width="250px"> | [Transformers with DeepSpeed](https://huggingface.co/docs/transformers/main/main_classes/deepspeed) |
| <img src="docs/assets/images/accelerate-light.png#gh-light-mode-only" width="250px"><img src="docs/assets/images/accelerate-dark.png#gh-dark-mode-only" width="250px"> | [Accelerate with DeepSpeed](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) |
| <img src="docs/assets/images/lightning-light.svg#gh-light-mode-only" width="200px"><img src="docs/assets/images/lightning-dark.svg#gh-dark-mode-only" width="200px"> | [Lightning with DeepSpeed](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.strategies.DeepSpeedStrategy.html) |
| <img src="docs/assets/images/mosaicml.svg" width="200px"> | [MosaicML with DeepSpeed](https://docs.mosaicml.com/en/latest/trainer/using_the_trainer.html?highlight=deepspeed#deepspeed-integration) |
| <img src="docs/assets/images/determined.svg" width="225px"> | [Determined with DeepSpeed](https://docs.determined.ai/latest/training/apis-howto/deepspeed/overview.html) |
---
# Build Pipeline Status
| Description | Status |
| ----------- | ------ |
| NVIDIA | [![nv-torch12-p40](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch12-p40.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch12-p40.yml) [![nv-torch18-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch18-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch18-v100.yml) [![nv-torch-latest-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-latest-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-latest-v100.yml) [![nv-inference](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-inference.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-inference.yml) [![nv-nightly](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-nightly.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-nightly.yml) |
| AMD | [![amd](https://github.com/microsoft/DeepSpeed/actions/workflows/amd.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/amd.yml) |
| PyTorch Nightly | [![nv-torch-nightly-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-nightly-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-torch-nightly-v100.yml) |
| Integrations | [![nv-transformers-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-transformers-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-transformers-v100.yml) [![nv-lightning-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-lightning-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-lightning-v100.yml) [![nv-accelerate-v100](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-accelerate-v100.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/nv-accelerate-v100.yml) |
| Misc | [![Formatting](https://github.com/microsoft/DeepSpeed/actions/workflows/formatting.yml/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/formatting.yml) [![pages-build-deployment](https://github.com/microsoft/DeepSpeed/actions/workflows/pages/pages-build-deployment/badge.svg)](https://github.com/microsoft/DeepSpeed/actions/workflows/pages/pages-build-deployment) [![Documentation Status](https://readthedocs.org/projects/deepspeed/badge/?version=latest)](https://deepspeed.readthedocs.io/en/latest/?badge=latest)|
# Installation
By default, DeepSpeed's C++/CUDA ops are compiled just-in-time (JIT) using [torch's JIT C++ extension loader that relies on
ninja](https://pytorch.org/docs/stable/cpp_extension.html) to build and
dynamically link them at runtime.
## Requirements
* [PyTorch](https://pytorch.org/) must be installed _before_ installing DeepSpeed.
* For full feature support we recommend a version of PyTorch that is >= 1.8 and ideally the latest PyTorch stable release.
* A CUDA or ROCm compiler, such as [nvcc](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#introduction) or [hipcc](https://github.com/ROCm-Developer-Tools/HIPCC), is required to compile the C++/CUDA/HIP extensions (a quick sanity-check sketch follows this list).
* Specific GPUs we develop and test against are listed below. This doesn't mean your GPU will not work if it isn't listed; it just means that DeepSpeed is most well tested on the following:
* NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
* AMD: MI100 and MI200
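Before installing, a quick environment check can save a failed build. A minimal sketch (after installation, DeepSpeed's own `ds_report` command gives a fuller environment report):

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# PyTorch must be importable and, for GPU builds, a CUDA toolkit visible.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA_HOME:", CUDA_HOME)
```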
## PyPI
We regularly push releases to [PyPI](https://pypi.org/project/deepspeed/) and encourage users to install from there in most cases.
```bash
pip install deepspeed
```

If you would like to pre-install any of the DeepSpeed extensions/ops (instead
of JIT compiling) or install pre-compiled ops via PyPI please see our [advanced
installation instructions](https://www.deepspeed.ai/tutorials/advanced-install/).
## Windows
DeepSpeed has partial support for Windows. On Windows you can build a wheel with the following steps; currently only inference mode is supported.
1. Install PyTorch, such as PyTorch 1.8 + CUDA 11.1.
2. Install the Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools.
3. Launch a cmd console with Administrator privileges to create the required symlink folders.
4. Run `python setup.py bdist_wheel` to build the wheel in the `dist` folder.
# Features
Below we provide a brief feature list, see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [Distributed Training with Mixed Precision](https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision)
* 16-bit mixed precision
* Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
* Support for Custom Model Parallelism
* Integration with Megatron-LM
* [Pipeline Parallelism](https://www.deepspeed.ai/tutorials/pipeline/)
* 3D Parallelism
* [The Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) (see the config sketch after this list)
* Optimizer State and Gradient Partitioning
* Activation Partitioning
* Constant Buffer Optimization
* Contiguous Memory Optimization
* [ZeRO-Offload](https://www.deepspeed.ai/tutorials/zero-offload/)
* Leverage both CPU/GPU memory for model training
* Support 10B model training on a single GPU
* [Ultra-fast dense transformer kernels](https://www.deepspeed.ai/2020/05/18/bert-record.html)
* [Sparse attention](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html)
* Memory- and compute-efficient sparse kernels
* Support 10x longer sequences than dense
* Flexible support to different sparse structures
* [1-bit Adam](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html), [0/1 Adam](https://www.deepspeed.ai/tutorials/zero-one-adam/) and [1-bit LAMB](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* Custom communication collective
* Up to 26x communication volume saving
* [Additional Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#additional-memory-and-bandwidth-optimizations)
* Smart Gradient Accumulation
* Communication/Computation Overlap
* [Training Features](https://www.deepspeed.ai/features/#training-features)
* Simplified training API
* Gradient Clipping
* Automatic loss scaling with mixed precision
* [Training Optimizers](https://www.deepspeed.ai/features/#training-optimizers)
* Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
* Memory bandwidth optimized FP16 Optimizer
* Large Batch Training with LAMB Optimizer
* Memory efficient Training with ZeRO Optimizer
* CPU-Adam
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
* Stable and 3.3x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
* Complementary to many other DeepSpeed features
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
* [Mixture of Experts (MoE)](https://www.deepspeed.ai/tutorials/mixture-of-experts/)
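Many of the features above are enabled declaratively through DeepSpeed's JSON-style configuration. Below is a minimal sketch of how several of them map onto documented config keys (the values are illustrative, not tuned recommendations):

```python
# Illustrative DeepSpeed config: each key corresponds to a feature above.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,            # smart gradient accumulation
    "gradient_clipping": 1.0,                    # gradient clipping
    "fp16": {"enabled": True},                   # 16-bit mixed precision
    "zero_optimization": {
        "stage": 2,                              # optimizer state + gradient partitioning
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload to CPU memory
    },
}
```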
Please check out the [DeepSpeed-Training](https://www.deepspeed.ai/training), [DeepSpeed-Inference](https://www.deepspeed.ai/inference), and [DeepSpeed-Compression](https://www.deepspeed.ai/compression) pages for the full set of features offered under each of these three pillars.
# Further Reading
All DeepSpeed documentation, tutorials, and blogs can be found on our website: [deepspeed.ai](https://www.deepspeed.ai/)
| Article | Description |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](https://www.deepspeed.ai/features/) | DeepSpeed features |
| [Getting Started](https://www.deepspeed.ai/getting-started/) | First steps with DeepSpeed |
| [DeepSpeed JSON Configuration](https://www.deepspeed.ai/docs/config-json/) | Configuring DeepSpeed |
| [API Documentation](https://deepspeed.readthedocs.io/en/latest/) | Generated DeepSpeed API documentation |
| [CIFAR-10 Tutorial](https://www.deepspeed.ai/tutorials/cifar-10) | Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](https://www.deepspeed.ai/tutorials/megatron/) | Train GPT2 with DeepSpeed and Megatron-LM |
| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/) | Pre-train BERT with DeepSpeed |
| [Learning Rate Range Test Tutorial](https://www.deepspeed.ai/tutorials/lrrt/) | Faster training with large learning rates |
| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/one-cycle/) | SOTA learning schedule in DeepSpeed |
| [Tutorials](https://www.deepspeed.ai/tutorials/) | Index of all DeepSpeed tutorials |
| [Blogs](https://www.deepspeed.ai/posts/) | DeepSpeed blog posts |
# Contributing
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/).
# Publications
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054) and [In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20)](https://dl.acm.org/doi/10.5555/3433701.3433727).
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840) and [USENIX ATC 2021](https://www.usenix.org/conference/atc21/presentation/ren-jie).
5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888) and [ICML 2021](http://proceedings.mlr.press/v139/tang21a.html).
6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857) and [SC 2021](https://dl.acm.org/doi/abs/10.1145/3458817.3476205).
7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069) and [HiPC 2022](https://hipc.org/advance-program/).
8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084) and [NeurIPS 2022](https://openreview.net/forum?id=JpZ5du_Kdh).
9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. [arXiv:2202.06009](https://arxiv.org/abs/2202.06009).
10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [arXiv:2201.05596](https://arxiv.org/abs/2201.05596) and [ICML 2022](https://proceedings.mlr.press/v162/rajbhandari22a.html).
11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [arXiv:2201.11990](https://arxiv.org/abs/2201.11990).
12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. [arXiv:2206.01859](https://arxiv.org/abs/2206.01859) and [NeurIPS 2022](https://openreview.net/forum?id=xNeAhc2CNAl).
13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. [arXiv:2206.01861](https://arxiv.org/abs/2206.01861) and [NeurIPS 2022](https://openreview.net/forum?id=f-fVCElZ-G1).
14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. [arXiv:2207.00032](https://arxiv.org/abs/2207.00032) and [SC 2022](https://dl.acm.org/doi/abs/10.5555/3571885.3571946).
15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. [arXiv:2211.11586](https://arxiv.org/abs/2211.11586).
16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. [arXiv:2212.03597](https://arxiv.org/abs/2212.03597).
# Videos
1. DeepSpeed KDD 2020 Tutorial
* [ZeRO & Fastest BERT: Increasing the scale and speed of deep learning training in DeepSpeed](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-On-Demand.html).
3. [DeepSpeed on AzureML](https://youtu.be/yBVXR8G8Bg8)
4. Community Tutorials
* [DeepSpeed: All the tricks to scale to gigantic models (Mark Saroufim)](https://www.youtube.com/watch?v=pDGI668pNg0)
* [Turing-NLG, DeepSpeed and the ZeRO optimizer (Yannic Kilcher)](https://www.youtube.com/watch?v=tC01FRB0M7w)
* [Ultimate Guide To Scaling ML Models (The AI Epiphany)](https://www.youtube.com/watch?v=hc0u4avAkuM)
'''Copyright The Microsoft DeepSpeed Team'''
from .abstract_accelerator import DeepSpeedAccelerator
from .real_accelerator import get_accelerator, set_accelerator
'''Copyright The Microsoft DeepSpeed Team'''
import abc
from abc import ABC
class DeepSpeedAccelerator(ABC):
def __init__(self):
self._name = None
self._communication_backend_name = None
# Device APIs
@abc.abstractmethod
def device_name(self, device_index):
...
@abc.abstractmethod
def device(self, device_index):
...
@abc.abstractmethod
def set_device(self, device_index):
...
@abc.abstractmethod
def current_device(self):
...
@abc.abstractmethod
def current_device_name(self):
...
@abc.abstractmethod
def device_count(self):
...
@abc.abstractmethod
def synchronize(self, device_index=None):
...
# RNG APIs
@abc.abstractmethod
def random(self):
...
@abc.abstractmethod
def set_rng_state(self, new_state, device_index=None):
...
@abc.abstractmethod
def get_rng_state(self, device_index=None):
...
@abc.abstractmethod
def manual_seed(self, seed):
...
@abc.abstractmethod
def manual_seed_all(self, seed):
...
@abc.abstractmethod
def initial_seed(self, seed):
...
@abc.abstractmethod
def default_generator(self, device_index):
...
# Streams/Events
@property
@abc.abstractmethod
def Stream(self):
...
@abc.abstractmethod
def stream(self, stream):
...
@abc.abstractmethod
def current_stream(self, device_index=None):
...
@abc.abstractmethod
def default_stream(self, device_index=None):
...
@property
@abc.abstractmethod
def Event(self):
...
# Memory management
@abc.abstractmethod
def empty_cache(self):
...
@abc.abstractmethod
def memory_allocated(self, device_index=None):
...
@abc.abstractmethod
def max_memory_allocated(self, device_index=None):
...
@abc.abstractmethod
def reset_max_memory_allocated(self, device_index=None):
...
@abc.abstractmethod
def memory_cached(self, device_index=None):
...
@abc.abstractmethod
def max_memory_cached(self, device_index=None):
...
@abc.abstractmethod
def reset_max_memory_cached(self, device_index=None):
...
@abc.abstractmethod
def memory_stats(self, device_index=None):
...
@abc.abstractmethod
def reset_peak_memory_stats(self, device_index=None):
...
@abc.abstractmethod
def memory_reserved(self, device_index=None):
...
@abc.abstractmethod
def max_memory_reserved(self, device_index=None):
...
@abc.abstractmethod
def total_memory(self, device_index=None):
...
# Data types
@abc.abstractmethod
def is_bf16_supported(self):
...
@abc.abstractmethod
def is_fp16_supported(self):
...
# Misc
@abc.abstractmethod
def amp(self):
...
@abc.abstractmethod
def is_available(self):
...
@abc.abstractmethod
def range_push(self, msg):
...
@abc.abstractmethod
def range_pop(self):
...
@abc.abstractmethod
def lazy_call(self, callback):
...
@abc.abstractmethod
def communication_backend_name(self):
...
# Tensor operations
@property
@abc.abstractmethod
def BFloat16Tensor(self):
...
@property
@abc.abstractmethod
def ByteTensor(self):
...
@property
@abc.abstractmethod
def DoubleTensor(self):
...
@property
@abc.abstractmethod
def FloatTensor(self):
...
@property
@abc.abstractmethod
def HalfTensor(self):
...
@property
@abc.abstractmethod
def IntTensor(self):
...
@property
@abc.abstractmethod
def LongTensor(self):
...
@abc.abstractmethod
def pin_memory(self, tensor):
...
@abc.abstractmethod
def on_accelerator(self, tensor):
...
@abc.abstractmethod
def op_builder_dir(self):
...
# create an instance of op builder, specified by class_name
@abc.abstractmethod
def create_op_builder(self, class_name):
...
# return an op builder class, specified by class_name
@abc.abstractmethod
def get_op_builder(self, class_name):
...
@abc.abstractmethod
def build_extension(self):
...
'''Copyright The Microsoft DeepSpeed Team'''
import os
import pkgutil
import importlib
from .abstract_accelerator import DeepSpeedAccelerator
# During the setup stage torch may not be installed yet; passing on a missing
# torch import allows op-builder-related APIs to still be executed.
try:
import torch.cuda
except ImportError:
pass
class CUDA_Accelerator(DeepSpeedAccelerator):
def __init__(self):
self._name = 'cuda'
self._communication_backend_name = 'nccl'
# begin initialize for create_op_builder()
# put all valid class name <--> class type mapping into class_dict
op_builder_dir = self.op_builder_dir()
op_builder_module = importlib.import_module(op_builder_dir)
        for _, module_name, _ in pkgutil.iter_modules([os.path.dirname(op_builder_module.__file__)]):
            # avoid self references
            if module_name != 'all_ops' and module_name != 'builder':
                module = importlib.import_module("{}.{}".format(op_builder_dir, module_name))
                for member_name in module.__dir__():
                    # pick up concrete *Builder classes, skipping the abstract bases
                    if member_name.endswith('Builder') and member_name not in ("OpBuilder", "CUDAOpBuilder", "TorchCPUOpBuilder"):
                        if member_name not in self.class_dict:
                            self.class_dict[member_name] = getattr(module, member_name)
# end initialize for create_op_builder()
# Device APIs
    def device_name(self, device_index=None):
        if device_index is None:
            return 'cuda'
        return 'cuda:{}'.format(device_index)
def device(self, device_index=None):
return torch.cuda.device(device_index)
def set_device(self, device_index):
torch.cuda.set_device(device_index)
def current_device(self):
return torch.cuda.current_device()
def current_device_name(self):
return 'cuda:{}'.format(torch.cuda.current_device())
def device_count(self):
return torch.cuda.device_count()
def synchronize(self, device_index=None):
return torch.cuda.synchronize(device_index)
# RNG APIs
def random(self):
return torch.random
def set_rng_state(self, new_state, device_index=None):
if device_index is None:
return torch.cuda.set_rng_state(new_state)
return torch.cuda.set_rng_state(new_state, device_index)
def get_rng_state(self, device_index=None):
if device_index is None:
return torch.cuda.get_rng_state()
return torch.cuda.get_rng_state(device_index)
def manual_seed(self, seed):
return torch.cuda.manual_seed(seed)
def manual_seed_all(self, seed):
return torch.cuda.manual_seed_all(seed)
    def initial_seed(self, seed):
        # torch.cuda.initial_seed() takes no arguments; `seed` exists only to
        # match the DeepSpeedAccelerator interface.
        return torch.cuda.initial_seed()
def default_generator(self, device_index):
return torch.cuda.default_generators[device_index]
# Streams/Events
@property
def Stream(self):
return torch.cuda.Stream
def stream(self, stream):
return torch.cuda.stream(stream)
def current_stream(self, device_index=None):
return torch.cuda.current_stream(device_index)
def default_stream(self, device_index=None):
return torch.cuda.default_stream(device_index)
@property
def Event(self):
return torch.cuda.Event
# Memory management
def empty_cache(self):
return torch.cuda.empty_cache()
def memory_allocated(self, device_index=None):
return torch.cuda.memory_allocated(device_index)
def max_memory_allocated(self, device_index=None):
return torch.cuda.max_memory_allocated(device_index)
def reset_max_memory_allocated(self, device_index=None):
return torch.cuda.reset_max_memory_allocated(device_index)
def memory_cached(self, device_index=None):
return torch.cuda.memory_cached(device_index)
def max_memory_cached(self, device_index=None):
return torch.cuda.max_memory_cached(device_index)
def reset_max_memory_cached(self, device_index=None):
return torch.cuda.reset_max_memory_cached(device_index)
def memory_stats(self, device_index=None):
if hasattr(torch.cuda, 'memory_stats'):
return torch.cuda.memory_stats(device_index)
def reset_peak_memory_stats(self, device_index=None):
if hasattr(torch.cuda, 'reset_peak_memory_stats'):
return torch.cuda.reset_peak_memory_stats(device_index)
def memory_reserved(self, device_index=None):
if hasattr(torch.cuda, 'memory_reserved'):
return torch.cuda.memory_reserved(device_index)
def max_memory_reserved(self, device_index=None):
if hasattr(torch.cuda, 'max_memory_reserved'):
return torch.cuda.max_memory_reserved(device_index)
def total_memory(self, device_index=None):
return torch.cuda.get_device_properties(device_index).total_memory
# Data types
def is_bf16_supported(self):
return torch.cuda.is_bf16_supported()
    def is_fp16_supported(self):
        # FP16 is considered supported on compute capability 7.0 (Volta) and newer
        major, _ = torch.cuda.get_device_capability()
        return major >= 7
# Misc
def amp(self):
if hasattr(torch.cuda, 'amp'):
return torch.cuda.amp
return None
def is_available(self):
return torch.cuda.is_available()
def range_push(self, msg):
if hasattr(torch.cuda.nvtx, 'range_push'):
return torch.cuda.nvtx.range_push(msg)
def range_pop(self):
if hasattr(torch.cuda.nvtx, 'range_pop'):
return torch.cuda.nvtx.range_pop()
def lazy_call(self, callback):
return torch.cuda._lazy_call(callback)
def communication_backend_name(self):
return self._communication_backend_name
# Tensor operations
@property
def BFloat16Tensor(self):
return torch.cuda.BFloat16Tensor
@property
def ByteTensor(self):
return torch.cuda.ByteTensor
@property
def DoubleTensor(self):
return torch.cuda.DoubleTensor
@property
def FloatTensor(self):
return torch.cuda.FloatTensor
@property
def HalfTensor(self):
return torch.cuda.HalfTensor
@property
def IntTensor(self):
return torch.cuda.IntTensor
@property
def LongTensor(self):
return torch.cuda.LongTensor
def pin_memory(self, tensor):
return tensor.pin_memory()
    def on_accelerator(self, tensor):
        device_str = str(tensor.device)
        return device_str.startswith('cuda:')
def op_builder_dir(self):
try:
# during installation time op_builder is visible, otherwise return deepspeed.ops.op_builder
import op_builder # noqa: F401
return "op_builder"
except ImportError:
return "deepspeed.ops.op_builder"
# dict that holds class name <--> class type mapping i.e.
# 'AsyncIOBuilder': <class 'op_builder.async_io.AsyncIOBuilder'>
# this dict will be filled at init stage
class_dict = {}
# create an instance of op builder and return, name specified by class_name
def create_op_builder(self, class_name):
if class_name in self.class_dict:
return self.class_dict[class_name]()
else:
return None
# return an op builder class, name specified by class_name
def get_op_builder(self, class_name):
if class_name in self.class_dict:
return self.class_dict[class_name]
else:
return None
def build_extension(self):
from torch.utils.cpp_extension import BuildExtension
return BuildExtension
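# Usage sketch (illustrative, not part of this module): op builders are
# fetched through the accelerator abstraction rather than imported from
# CUDA-specific modules directly, e.g.
#
#   from deepspeed.accelerator import get_accelerator
#   builder = get_accelerator().create_op_builder('CPUAdamBuilder')
#   if builder is not None:
#       cpu_adam_op = builder.load()  # JIT-builds the extension if needed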
'''Copyright The Microsoft DeepSpeed Team'''
try:
from accelerator.abstract_accelerator import DeepSpeedAccelerator as dsa1
except ImportError as e:
dsa1 = None
try:
from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator as dsa2
except ImportError as e:
dsa2 = None
ds_accelerator = None
def _validate_accelerator(accel_obj):
    # abstract_accelerator has a different path at build time
    # (accelerator.abstract_accelerator) and at run time
    # (deepspeed.accelerator.abstract_accelerator), and an extension imports
    # the run-time abstract_accelerator/DeepSpeedAccelerator as its base
    # class, so we need to compare accel_obj against both base classes.
    # If accel_obj is an instance of DeepSpeedAccelerator from either path,
    # we consider it a conforming object.
    if not ((dsa1 is not None and isinstance(accel_obj, dsa1)) or
            (dsa2 is not None and isinstance(accel_obj, dsa2))):
        raise AssertionError(
            f'{accel_obj.__class__.__name__} accelerator is not subclass of DeepSpeedAccelerator'
        )
# TODO: turn off is_available test since this breaks tests
#assert accel_obj.is_available(), \
# f'{accel_obj.__class__.__name__} accelerator fails is_available() test'
def get_accelerator():
global ds_accelerator
if ds_accelerator is None:
try:
from intel_extension_for_deepspeed import XPU_Accelerator
except ImportError as e:
pass
else:
ds_accelerator = XPU_Accelerator()
_validate_accelerator(ds_accelerator)
return ds_accelerator
from .cuda_accelerator import CUDA_Accelerator
ds_accelerator = CUDA_Accelerator()
_validate_accelerator(ds_accelerator)
return ds_accelerator
def set_accelerator(accel_obj):
global ds_accelerator
_validate_accelerator(accel_obj)
ds_accelerator = accel_obj
'''
-----------[code] test_get.py -----------
from deepspeed.accelerator import get_accelerator
my_accelerator = get_accelerator()
print(f'{my_accelerator._name=}')
print(f'{my_accelerator._communication_backend_name=}')
print(f'{my_accelerator.HalfTensor().device=}')
print(f'{my_accelerator.total_memory()=}')
-----------[code] test_get.py -----------
---[output] python test_get.py---------
my_accelerator._name='cuda'
my_accelerator._communication_backend_name='nccl'
my_accelerator.HalfTensor().device=device(type='cuda', index=0)
my_accelerator.total_memory()=34089730048
---[output] python test_get.py---------
**************************************************************************
-----------[code] test_set.py -----------
from deepspeed.accelerator.cuda_accelerator import CUDA_Accelerator
cu_accel = CUDA_Accelerator()
print(f'{id(cu_accel)=}')
from deepspeed.accelerator import set_accelerator, get_accelerator
set_accelerator(cu_accel)
my_accelerator = get_accelerator()
print(f'{id(my_accelerator)=}')
print(f'{my_accelerator._name=}')
print(f'{my_accelerator._communication_backend_name=}')
print(f'{my_accelerator.HalfTensor().device=}')
print(f'{my_accelerator.total_memory()=}')
-----------[code] test_set.py -----------
---[output] python test_set.py---------
id(cu_accel)=139648165478304
id(my_accelerator)=139648165478304
my_accelerator._name='cuda'
my_accelerator._communication_backend_name='nccl'
my_accelerator.HalfTensor().device=device(type='cuda', index=0)
my_accelerator.total_memory()=34089730048
---[output] python test_set.py---------
'''
# Getting Started with DeepSpeed on Azure
The recommended and simplest method to try DeepSpeed on Azure is through [AzureML](https://azure.microsoft.com/en-us/services/machine-learning/). For more details, please see our [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/).
'''Copyright The Microsoft DeepSpeed Team'''
# Running Communication Benchmarks
To run benchmarks, there are two options:
1. Run a single communication operation:
For example, run with a single large message size:
<pre>
deepspeed all_reduce.py
</pre>
Scan across message sizes:
<pre>
deepspeed all_reduce.py --scan
</pre>
2. Run all available communication benchmarks:
<pre>
deepspeed run_all.py
</pre>
Like the individual benchmarks, `run_all.py` supports scanning arguments for the max message size, bw-unit, etc. Simply pass the desired arguments to `run_all.py` and they'll be propagated to each comm op.
<pre>
usage: ds_bench [-h] [--local_rank LOCAL_RANK] [--trials TRIALS] [--warmups WARMUPS] [--maxsize MAXSIZE] [--async-op] [--bw-unit {Gbps,GBps}] [--backend {nccl}] [--dist {deepspeed,torch}] [--scan] [--raw] [--all-reduce] [--all-gather] [--all-to-all]
[--pt2pt] [--broadcast] [--dtype DTYPE] [--mem-factor MEM_FACTOR] [--debug]
optional arguments:
-h, --help show this help message and exit
--local_rank LOCAL_RANK
--trials TRIALS Number of timed iterations
--warmups WARMUPS Number of warmup (non-timed) iterations
--maxsize MAXSIZE Max message size as a power of 2
--async-op Enables non-blocking communication
--bw-unit {Gbps,GBps}
--backend {nccl} Communication library to use
--dist {deepspeed,torch}
Distributed DL framework to use
--scan Enables scanning all message sizes
--raw Print the message size and latency without units
--all-reduce Run all_reduce
--all-gather Run all_gather
--all-to-all Run all_to_all
--pt2pt Run pt2pt
--broadcast Run broadcast
--dtype DTYPE PyTorch tensor dtype
--mem-factor MEM_FACTOR
Proportion of max available GPU memory to use for single-size evals
--debug Enables all_to_all debug prints
</pre>
Note that `ds_bench` is a pre-packaged wrapper around `run_all.py`. Users can pass the same arguments as well:
<pre>
<path to deepspeed>/bin/ds_bench --scan --trials=10
</pre>
Finally, users can choose specific communication operations to run in `run_all.py` or `ds_bench` by passing them as arguments (all operations are run by default). For example:
<pre>
deepspeed run_all.py --scan --all-reduce --all-to-all --broadcast
</pre>
# Adding Communication Benchmarks
To add new communication benchmarks, follow this general procedure (a sketch of the shared timing pattern follows this list):
1. Copy a similar benchmark file (e.g. to add `reduce_scatter`, copy `all_reduce.py` as a template)
2. Add a new bw formula in `utils.get_bw`, a new maximum tensor element formula in `utils.max_numel`, and a new arg in `utils.benchmark_parser`
3. Replace the comm op calls in the new file (e.g. with find-and-replace)
4. Find a good default `mem_factor` for use in the `run_<collective>_single()` function
5. Add new comm op to `run_all.py`
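Structurally, each `timed_<collective>` function follows the same warmup/sync/time pattern, sketched below as a simplified harness (not part of the suite; `comm_op` and `sync` stand in for the collective under test and the barrier helper):
<pre>
import time

def timed_op(comm_op, sync, trials=10, warmups=5):
    # Warm up to establish connections and amortize one-time costs
    for _ in range(warmups):
        comm_op()
    sync()
    # Time the trials and report the average duration
    start = time.perf_counter()
    for _ in range(trials):
        comm_op()
    sync()
    return (time.perf_counter() - start) / trials
</pre>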
'''Copyright The Microsoft DeepSpeed Team'''
'''Copyright The Microsoft DeepSpeed Team'''
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
from deepspeed.accelerator import get_accelerator
import time

import torch
# Run all_gather and print metrics
def timed_all_gather(input, output, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
sync_all()
# Warmups, establish connections, etc.
    for i in range(args.warmups):
        # use all_gather_base if available
        if args.dist == 'torch':
            if hasattr(torch.distributed, "_all_gather_base"):
                dist._all_gather_base(output, input, group=None, async_op=args.async_op)
            else:
                # fall back to regular all_gather on a list of chunks
                output_tensors = list(torch.chunk(output, dist.get_world_size()))
                dist.all_gather(output_tensors, input, group=None, async_op=args.async_op)
        elif args.dist == 'deepspeed':
            dist.allgather_fn(output, input, group=None, async_op=args.async_op)
sync_all()
# time the actual comm op trials times and average it
pre = time.perf_counter()
    for i in range(args.trials):
        # use all_gather_base if available
        if args.dist == 'torch':
            if hasattr(torch.distributed, "_all_gather_base"):
                dist._all_gather_base(output, input, group=None, async_op=args.async_op)
            else:
                output_tensors = list(torch.chunk(output, dist.get_world_size()))
                dist.all_gather(output_tensors, input, group=None, async_op=args.async_op)
        elif args.dist == 'deepspeed':
            dist.allgather_fn(output, input, group=None, async_op=args.async_op)
sync_all()
duration = time.perf_counter() - pre
# maintain and clean performance data
avg_duration = duration / args.trials
size = input.element_size() * input.nelement()
n = dist.get_world_size()
tput, busbw = get_bw('all_gather', size, avg_duration, args)
tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
desc = f'{input.nelement()}x{input.element_size()}'
if not args.raw:
size = convert_size(size)
print_rank_0(
f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")
def run_all_gather(local_rank, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
# Prepare benchmark header
print_header(args, 'all_gather')
global_rank = dist.get_rank()
world_size = dist.get_world_size()
if args.scan:
# Create list of message sizes
M_LIST = []
for x in (2**p for p in range(1, args.maxsize)):
M_LIST.append(x)
sync_all()
# loop over various tensor sizes
for M in M_LIST:
global_rank = dist.get_rank()
try:
mat = torch.ones(world_size,
M,
dtype=getattr(
torch,
args.dtype)).to(
get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
# Delete original mat to avoid OOM
del mat
get_accelerator().empty_cache()
output = torch.zeros(input.nelement() * world_size,
dtype=getattr(
torch,
args.dtype)).to(
get_accelerator().device_name(local_rank))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print('WARNING: Ran out of GPU memory. Exiting comm op.')
sync_all()
break
sync_all()
timed_all_gather(input, output, args)
else:
# all_gather_base saves memory
if (args.dist == 'torch'
and hasattr(torch.distributed,
"_all_gather_base")) or (args.dist == 'deepspeed'
and dist.has_allgather_base):
mem_factor = args.mem_factor + 0.2
else:
mem_factor = args.mem_factor
# Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
sync_all()
elements_per_gpu = max_numel(comm_op='all_gather',
dtype=getattr(torch,
args.dtype),
mem_factor=mem_factor,
local_rank=local_rank,
args=args)
try:
mat = torch.ones(elements_per_gpu,
dtype=getattr(torch,
args.dtype)).to(
get_accelerator().device_name(local_rank))
# multiply each GPU's tensor by the rank to ease debugging
input = ((mat.mul_(float(global_rank))).view(-1))
# Delete original mat to avoid OOM
del mat
get_accelerator().empty_cache()
output = torch.zeros(
elements_per_gpu * world_size,
dtype=getattr(torch,
args.dtype)).to(get_accelerator().device_name(local_rank))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print(
'WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!'
)
sync_all()
return
sync_all()
timed_all_gather(input, output, args)
if __name__ == "__main__":
args = benchmark_parser().parse_args()
rank = args.local_rank
init_processes(local_rank=rank, args=args)
run_all_gather(local_rank=rank, args=args)
'''Copyright The Microsoft DeepSpeed Team'''
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
from deepspeed.accelerator import get_accelerator
import time

import torch
def timed_all_reduce(input, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
sync_all()
# Warmups, establish connections, etc.
for i in range(args.warmups):
dist.all_reduce(input, async_op=args.async_op)
sync_all()
# time the actual comm op trials times and average it
pre = time.perf_counter()
for i in range(args.trials):
dist.all_reduce(input, async_op=args.async_op)
sync_all()
duration = time.perf_counter() - pre
# maintain and clean performance data
avg_duration = duration / args.trials
size = input.element_size() * input.nelement()
n = dist.get_world_size()
tput, busbw = get_bw('all_reduce', size, avg_duration, args)
tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
desc = f'{input.nelement()}x{input.element_size()}'
if not args.raw:
size = convert_size(size)
print_rank_0(
f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")
def run_all_reduce(local_rank, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
# Prepare benchmark header
print_header(args, 'all_reduce')
world_size = dist.get_world_size()
global_rank = dist.get_rank()
if args.scan:
M_LIST = []
for x in (2**p for p in range(1, args.maxsize)):
M_LIST.append(x)
sync_all()
# loop over various tensor sizes
for M in M_LIST:
global_rank = dist.get_rank()
try:
mat = torch.ones(world_size,
M,
dtype=getattr(
torch,
args.dtype)).to(
get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print('WARNING: Ran out of GPU memory. Exiting comm op.')
sync_all()
break
sync_all()
timed_all_reduce(input, args)
else:
# Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
# Don't need output tensor, so we double mem_factor
        elements_per_gpu = max_numel(comm_op='all_reduce',
                                     dtype=getattr(torch, args.dtype),
                                     mem_factor=args.mem_factor * 2,
                                     local_rank=local_rank,
                                     args=args)
try:
            mat = torch.ones(elements_per_gpu,
                             dtype=getattr(torch, args.dtype)).to(
                                 get_accelerator().device_name(local_rank))
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print(
'WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!'
)
sync_all()
return
sync_all()
timed_all_reduce(input, args)
if __name__ == "__main__":
args = benchmark_parser().parse_args()
rank = args.local_rank
init_processes(local_rank=rank, args=args)
run_all_reduce(local_rank=rank, args=args)
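# Note: dist.all_reduce operates in place on `input`, so no separate output
# buffer is allocated here; that is why run_all_reduce doubles mem_factor when
# sizing the single-shot message above.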
'''Copyright The Microsoft DeepSpeed Team'''
import torch  # used directly below; avoid relying on the star imports re-exporting it
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
from deepspeed.accelerator import get_accelerator
import time
def timed_all_to_all(input, output, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
sync_all()
# Warmups, establish connections, etc.
for i in range(args.warmups):
dist.all_to_all_single(output, input, async_op=args.async_op)
sync_all()
# time the actual comm op trials times and average it
pre = time.perf_counter()
for i in range(args.trials):
dist.all_to_all_single(output, input, async_op=args.async_op)
sync_all()
duration = time.perf_counter() - pre
# maintain and clean performance data
avg_duration = duration / args.trials
size = input.element_size() * input.nelement()
n = dist.get_world_size()
tput, busbw = get_bw('all_to_all', size, avg_duration, args)
tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
desc = f'{input.nelement()}x{input.element_size()}'
if not args.raw:
size = convert_size(size)
print_rank_0(
f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")
def run_all_to_all(local_rank, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
world_size = dist.get_world_size()
global_rank = dist.get_rank()
# Prepare benchmark header
print_header(args, 'all_to_all')
if args.scan:
        M_LIST = [2**p for p in range(1, args.maxsize)]
sync_all()
# loop over various tensor sizes
for M in M_LIST:
global_rank = dist.get_rank()
try:
                mat = torch.ones(world_size, M,
                                 dtype=getattr(torch, args.dtype)).to(
                                     get_accelerator().device_name(local_rank))
                assert mat.numel() % world_size == 0, f"tensor cannot be divided into {world_size} chunks"
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
output = (mat.clone().view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print('WARNING: Ran out of GPU memory. Exiting comm op.')
sync_all()
break
sync_all()
timed_all_to_all(input, output, args)
else:
# Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
        elements_per_gpu = max_numel(comm_op='all_to_all',
                                     dtype=getattr(torch, args.dtype),
                                     mem_factor=args.mem_factor,
                                     local_rank=local_rank,
                                     args=args)
try:
            mat = torch.ones(elements_per_gpu,
                             dtype=getattr(torch, args.dtype)).to(
                                 get_accelerator().device_name(local_rank))
            assert mat.numel() % world_size == 0, f"tensor with {mat.numel()} elements cannot be divided into {world_size} chunks"
input = ((mat.mul_(float(global_rank))).view(-1))
# Delete original mat to avoid OOM
del mat
get_accelerator().empty_cache()
            output = torch.zeros(elements_per_gpu,
                                 dtype=getattr(torch, args.dtype)).to(
                                     get_accelerator().device_name(local_rank))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print(
'WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!'
)
sync_all()
return
sync_all()
if args.debug:
for i in range(world_size):
if i == global_rank:
print(f"Before AllToAll Input List at rank {global_rank}: {input}")
dist.barrier()
timed_all_to_all(input, output, args)
if args.debug:
for i in range(world_size):
if i == global_rank:
print(f"AllToAll Results at rank {global_rank}: {output}")
dist.barrier()
if __name__ == "__main__":
args = benchmark_parser().parse_args()
rank = args.local_rank
init_processes(local_rank=rank, args=args)
run_all_to_all(local_rank=rank, args=args)
'''Copyright The Microsoft DeepSpeed Team'''
import torch
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
from deepspeed.accelerator import get_accelerator
import time
def timed_broadcast(input, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
sync_all()
# Warmups, establish connections, etc.
for i in range(args.warmups):
dist.broadcast(input, 0, async_op=args.async_op)
sync_all()
# time the actual comm op trials times and average it
pre = time.perf_counter()
for i in range(args.trials):
dist.broadcast(input, 0, async_op=args.async_op)
sync_all()
duration = time.perf_counter() - pre
# maintain and clean performance data
avg_duration = duration / args.trials
size = input.element_size() * input.nelement()
n = dist.get_world_size()
tput, busbw = get_bw('broadcast', size, avg_duration, args)
tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
desc = f'{input.nelement()}x{input.element_size()}'
if not args.raw:
size = convert_size(size)
print_rank_0(
f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")
def run_broadcast(local_rank, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
# Prepare benchmark header
print_header(args, 'broadcast')
world_size = dist.get_world_size()
global_rank = dist.get_rank()
if args.scan:
        M_LIST = [2**p for p in range(1, args.maxsize)]
sync_all()
# loop over various tensor sizes
for M in M_LIST:
global_rank = dist.get_rank()
try:
                mat = torch.ones(world_size, M,
                                 dtype=getattr(torch, args.dtype)).to(
                                     get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print('WARNING: Ran out of GPU memory. Exiting comm op.')
sync_all()
break
sync_all()
timed_broadcast(input, args)
else:
# Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
# Don't need output tensor, so we double mem_factor
        elements_per_gpu = max_numel(comm_op='broadcast',
                                     dtype=getattr(torch, args.dtype),
                                     mem_factor=args.mem_factor * 2,
                                     local_rank=local_rank,
                                     args=args)
try:
            mat = torch.ones(elements_per_gpu,
                             dtype=getattr(torch, args.dtype)).to(
                                 get_accelerator().device_name(local_rank))
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print(
'WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!'
)
sync_all()
return
sync_all()
timed_broadcast(input, args)
if __name__ == "__main__":
args = benchmark_parser().parse_args()
rank = args.local_rank
init_processes(local_rank=rank, args=args)
run_broadcast(local_rank=rank, args=args)
'''Copyright The Microsoft DeepSpeed Team'''
from deepspeed.accelerator import get_accelerator
DEFAULT_WARMUPS = 5
DEFAULT_TRIALS = 50
DEFAULT_TYPE = 'float'
DEFAULT_BACKEND = get_accelerator().communication_backend_name()
DEFAULT_UNIT = 'Gbps'
DEFAULT_DIST = 'deepspeed'
DEFAULT_MAXSIZE = 24
'''Copyright The Microsoft DeepSpeed Team'''
import torch  # used directly below; avoid relying on the star imports re-exporting it
from benchmarks.communication.utils import *
from benchmarks.communication.constants import *
from deepspeed.accelerator import get_accelerator
import time
def timed_pt2pt(input, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
sync_all()
# Warmups, establish connections, etc.
for i in range(args.warmups):
if dist.get_rank() == 0:
if args.async_op:
dist.isend(input, 1)
else:
dist.send(input, 1)
if dist.get_rank() == 1:
if args.async_op:
dist.irecv(input, src=0)
else:
dist.recv(input, src=0)
sync_all()
# time the actual comm op trials times and average it
pre = time.perf_counter()
for i in range(args.trials):
if dist.get_rank() == 0:
if args.async_op:
dist.isend(input, 1)
else:
dist.send(input, 1)
if dist.get_rank() == 1:
if args.async_op:
dist.irecv(input, src=0)
else:
dist.recv(input, src=0)
sync_all()
duration = time.perf_counter() - pre
# maintain and clean performance data
avg_duration = duration / args.trials
size = input.element_size() * input.nelement()
n = dist.get_world_size()
tput, busbw = get_bw('pt2pt', size, avg_duration, args)
tput_str, busbw_str, duration_str = get_metric_strings(args, tput, busbw, avg_duration)
desc = f'{input.nelement()}x{input.element_size()}'
if not args.raw:
size = convert_size(size)
print_rank_0(
f"{size:<20} {desc:25s} {duration_str:20s} {tput_str:20s} {busbw_str:20s}")
def run_pt2pt(local_rank, args):
if args.dist == 'torch':
import torch.distributed as dist
elif args.dist == 'deepspeed':
import deepspeed.comm as dist
# Prepare benchmark header
print_header(args, 'pt2pt')
global_rank = dist.get_rank()
world_size = dist.get_world_size()
if args.scan:
# Create list of message sizes
        M_LIST = [2**p for p in range(1, args.maxsize)]
sync_all()
# loop over various tensor sizes
for M in M_LIST:
global_rank = dist.get_rank()
try:
                mat = torch.ones(world_size, M,
                                 dtype=getattr(torch, args.dtype)).to(
                                     get_accelerator().device_name(local_rank))
sync_all()
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print('WARNING: Ran out of GPU memory. Exiting comm op.')
sync_all()
break
sync_all()
timed_pt2pt(input, args)
else:
# Send the biggest message size our GPUs can fit. If you're facing OOM errors, reduce the mem_factor
# Don't need output tensor, so double mem_factor
        elements_per_gpu = max_numel(comm_op='pt2pt',
                                     dtype=getattr(torch, args.dtype),
                                     mem_factor=args.mem_factor * 2,
                                     local_rank=local_rank,
                                     args=args)
try:
            mat = torch.ones(elements_per_gpu,
                             dtype=getattr(torch, args.dtype)).to(
                                 get_accelerator().device_name(local_rank))
input = ((mat.mul_(float(global_rank))).view(-1))
except RuntimeError as e:
if 'out of memory' in str(e):
if dist.get_rank() == 0:
print(
'WARNING: Ran out of GPU memory. Try to reduce the --mem-factor argument!'
)
sync_all()
return
sync_all()
timed_pt2pt(input, args)
if __name__ == "__main__":
args = benchmark_parser().parse_args()
rank = args.local_rank
init_processes(local_rank=rank, args=args)
run_pt2pt(local_rank=rank, args=args)
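# Note: only ranks 0 and 1 exchange data in timed_pt2pt; the remaining ranks
# just participate in the barriers inside sync_all(), and print_header()
# reports a fixed world size of 2 for this op. In async mode the isend/irecv
# handles are never waited on individually; the device-wide synchronize in
# sync_all() is what drains the outstanding sends and receives.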
'''Copyright The Microsoft DeepSpeed Team'''
from benchmarks.communication.utils import *
from benchmarks.communication.all_reduce import run_all_reduce
from benchmarks.communication.all_gather import run_all_gather
from benchmarks.communication.all_to_all import run_all_to_all
from benchmarks.communication.pt2pt import run_pt2pt
from benchmarks.communication.broadcast import run_broadcast
from benchmarks.communication.constants import *
# Entry point used when the benchmark suite is imported from another script
def main(args, rank):
init_processes(local_rank=rank, args=args)
ops_to_run = []
if args.all_reduce:
ops_to_run.append('all_reduce')
if args.all_gather:
ops_to_run.append('all_gather')
if args.broadcast:
ops_to_run.append('broadcast')
if args.pt2pt:
ops_to_run.append('pt2pt')
if args.all_to_all:
ops_to_run.append('all_to_all')
if len(ops_to_run) == 0:
ops_to_run = ['all_reduce', 'all_gather', 'all_to_all', 'broadcast', 'pt2pt']
for comm_op in ops_to_run:
if comm_op == 'all_reduce':
run_all_reduce(local_rank=rank, args=args)
if comm_op == 'all_gather':
run_all_gather(local_rank=rank, args=args)
if comm_op == 'all_to_all':
run_all_to_all(local_rank=rank, args=args)
if comm_op == 'pt2pt':
run_pt2pt(local_rank=rank, args=args)
if comm_op == 'broadcast':
run_broadcast(local_rank=rank, args=args)
# Entry point used when the benchmark suite is invoked directly
if __name__ == "__main__":
args = benchmark_parser().parse_args()
rank = args.local_rank
main(args, rank)
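# Programmatic use (a sketch; assumes a distributed launcher has already set
# LOCAL_RANK and the rendezvous environment variables):
#
#   import os
#   from benchmarks.communication.run_all import main
#   from benchmarks.communication.utils import benchmark_parser
#
#   args = benchmark_parser().parse_args(['--all-reduce', '--scan'])
#   args.local_rank = int(os.environ['LOCAL_RANK'])
#   main(args, args.local_rank)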
'''Copyright The Microsoft DeepSpeed Team'''
import torch
import os
import math
import argparse
from benchmarks.communication.constants import *
from deepspeed.accelerator import get_accelerator
# `dist` is bound lazily by init_torch_distributed / init_deepspeed_comm below,
# so the helpers in this file work with either torch.distributed or deepspeed.comm.
global dist
def init_torch_distributed(backend):
global dist
import torch.distributed as dist
torch.distributed.init_process_group(backend)
local_rank = int(os.environ['LOCAL_RANK'])
get_accelerator().set_device(local_rank)
def init_deepspeed_comm(backend):
global dist
import deepspeed
import deepspeed.comm as dist
deepspeed.init_distributed(dist_backend=backend)
local_rank = int(os.environ['LOCAL_RANK'])
get_accelerator().set_device(local_rank)
def init_processes(local_rank, args):
if args.dist == 'deepspeed':
init_deepspeed_comm(args.backend)
elif args.dist == 'torch':
init_torch_distributed(args.backend)
else:
        # dist was never initialized on this path, so print_rank_0 (which calls
        # dist.get_rank()) would raise a NameError; print plainly and fail.
        print(f"distributed framework {args.dist} not supported")
        exit(1)
def print_rank_0(message):
if dist.get_rank() == 0:
print(message)
def print_header(args, comm_op):
if comm_op == 'pt2pt':
world_size = 2
else:
world_size = dist.get_world_size()
tput = f'Throughput ({args.bw_unit})'
busbw = f'BusBW ({args.bw_unit})'
header = f"\n---- Performance of {comm_op} on {world_size} devices ---------------------------------------------------------\n"
duration_str = 'Duration'
if args.raw:
duration_str += ' (us)'
header += f"{'Size (Bytes)':20s} {'Description':25s} {duration_str:20s} {tput:20s} {busbw:20s}\n"
header += "----------------------------------------------------------------------------------------------------"
print_rank_0(header)
def get_bw(comm_op, size, duration, args):
n = dist.get_world_size()
tput = 0
busbw = 0
if comm_op == "all_to_all":
tput = (size / duration)
busbw = (size / duration) * ((n - 1) / n)
elif comm_op == "all_gather":
size *= n
tput = (size / duration)
busbw = (size / duration) * ((n - 1) / n)
elif comm_op == "all_reduce":
tput = (size * 2 / duration)
busbw = (size / duration) * (2 * (n - 1) / n)
elif comm_op == "pt2pt" or comm_op == "broadcast":
tput = (size / duration)
busbw = tput
else:
        print_rank_0(f"wrong comm_op specified: {comm_op}")
        # exit non-zero so launchers treat a bad op name as a failure
        exit(1)
if args.bw_unit == 'Gbps':
tput *= 8
busbw *= 8
return tput, busbw
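# Worked example (illustrative numbers, not measured): for an 8-GPU all_reduce
# of a 1 GB payload that takes 10 ms, the formulas above give
#
#   tput  = 2 * 1e9 / 10e-3            = 200 GB/s  (algorithm bandwidth)
#   busbw = (1e9 / 10e-3) * 2 * 7 / 8  = 175 GB/s  (bus bandwidth)
#
# and with --bw-unit Gbps both are multiplied by 8. The (n - 1) / n correction
# factors follow the NCCL-tests convention for ring algorithms.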
def get_metric_strings(args, tput, busbw, duration):
duration_ms = duration * 1e3
duration_us = duration * 1e6
    tput = f'{tput / 1e9:.3f}'
    busbw = f'{busbw / 1e9:.3f}'
if duration_us < 1e3 or args.raw:
duration = f'{duration_us:.3f}'
if not args.raw:
duration += ' us'
else:
duration = f'{duration_ms:.3f} ms'
return tput, busbw, duration
def sync_all():
get_accelerator().synchronize()
dist.barrier()
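# sync_all() first drains the local device stream(s) and then rendezvouses all
# ranks, so the perf_counter readings taken around the timed loops measure
# completed communication rather than just kernel-launch overhead.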
def max_numel(comm_op, dtype, mem_factor, local_rank, args):
dtype_size = _element_size(dtype)
max_memory_per_gpu = get_accelerator().total_memory(local_rank) * mem_factor
if comm_op == 'all_reduce' or comm_op == 'pt2pt' or comm_op == 'broadcast':
elements_per_gpu = int(max_memory_per_gpu // dtype_size)
elif comm_op == 'all_gather':
# all_gather performance is lower for non-powers of two, and the output buffer size scales with world size
# Therefore, divide by world size and round down to nearest power of 2
elements_per_gpu = int(max_memory_per_gpu // dtype_size // dist.get_world_size())
elements_per_gpu = int(pow(2, int(math.log(elements_per_gpu, 2))))
    elif comm_op == 'all_to_all':
        # The element count must be divisible by world_size, and all_to_all
        # performance is lower for non-powers of two, so round down to a power
        # of two like all_gather. Note that the power-of-two rounding only
        # preserves divisibility when the world size is itself a power of two;
        # otherwise the divisibility assert in the all_to_all benchmark can fire.
        elements_per_gpu = int(max_memory_per_gpu // dtype_size)
        elements_per_gpu = int(dist.get_world_size() *
                               round(elements_per_gpu / dist.get_world_size()))
        elements_per_gpu = int(pow(2, int(math.log(elements_per_gpu, 2))))
    else:
        print(f"Communication operation {comm_op} is not supported yet")
        exit(1)
return elements_per_gpu
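# Sizing example (illustrative numbers): with a 32 GB device and the default
# --mem-factor of 0.4, max_memory_per_gpu = 12.8e9 bytes. For all_reduce with
# dtype float (4 bytes) that allows 3.2e9 elements; for all_gather on 8 ranks
# the budget is first divided by the world size (4e8 elements) and then
# rounded down to the nearest power of two (2**28 = 268,435,456 elements).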
# Helper function to pretty-print message sizes
def convert_size(size_bytes):
if size_bytes == 0:
return "0B"
size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
i = int(math.floor(math.log(size_bytes, 1024)))
p = math.pow(1024, i)
s = round(size_bytes / p, 2)
return "%s %s" % (s, size_name[i])
# Copied from torch. Need to add the func here for old torch compatibility.
def _element_size(dtype):
"""
Returns the element size for a dtype, in bytes
"""
if not isinstance(dtype, torch.dtype):
raise RuntimeError(f'expected torch.dtype, but got {type(dtype)}')
if dtype.is_complex:
return torch.finfo(dtype).bits >> 2
elif dtype.is_floating_point:
return torch.finfo(dtype).bits >> 3
elif dtype == torch.bool:
# NOTE: torch.bool is not supported in torch.iinfo()
return 1
else:
return torch.iinfo(dtype).bits >> 3
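# For example, _element_size(torch.float16) == 2, _element_size(torch.float32)
# == 4, and _element_size(torch.complex64) == 8: finfo reports the bit width of
# one real component for complex dtypes, hence the >> 2 above.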
def benchmark_parser():
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
parser.add_argument("--trials",
type=int,
default=DEFAULT_TRIALS,
help='Number of timed iterations')
parser.add_argument("--warmups",
type=int,
default=DEFAULT_WARMUPS,
help='Number of warmup (non-timed) iterations')
parser.add_argument("--maxsize",
type=int,
default=24,
help='Max message size as a power of 2')
parser.add_argument("--async-op",
action="store_true",
help='Enables non-blocking communication')
parser.add_argument("--bw-unit",
type=str,
default=DEFAULT_UNIT,
choices=['Gbps',
'GBps'])
parser.add_argument("--backend",
type=str,
default=DEFAULT_BACKEND,
choices=['nccl',
'ccl'],
help='Communication library to use')
parser.add_argument("--dist",
type=str,
default=DEFAULT_DIST,
choices=['deepspeed',
'torch'],
help='Distributed DL framework to use')
parser.add_argument("--scan",
action="store_true",
help='Enables scanning all message sizes')
parser.add_argument("--raw",
action="store_true",
help='Print the message size and latency without units')
parser.add_argument("--all-reduce", action="store_true", help='Run all_reduce')
parser.add_argument("--all-gather", action="store_true", help='Run all_gather')
parser.add_argument("--all-to-all", action="store_true", help='Run all_to_all')
parser.add_argument("--pt2pt", action="store_true", help='Run pt2pt')
parser.add_argument("--broadcast", action="store_true", help='Run broadcast')
parser.add_argument("--dtype",
type=str,
default=DEFAULT_TYPE,
help='PyTorch tensor dtype')
parser.add_argument(
"--mem-factor",
type=float,
default=.4,
help='Proportion of max available GPU memory to use for single-size evals')
parser.add_argument("--debug",
action="store_true",
help='Enables all_to_all debug prints')
return parser
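# The defaults above resolve (via constants.py) to 50 trials, 5 warmups,
# dtype 'float', bandwidth unit 'Gbps', the 'deepspeed' comm wrapper, and the
# active accelerator's communication backend (e.g. 'nccl' on CUDA devices).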