Ported Megatron tutorial. (#30)

* Ported DeepSpeed overview. * Renamed subsection * Formatting table of contents * initial import of Megatron tutorial * Grammatical edits, formatting, and paths. * formatting and data download instructions * formatting tutorial * formatting tutorial * formatting tutorial * formatting tutorial * formatting tutorial * formatting tutorial * new perf chart * removing TODO * adding pointer to tutorial * edits * azure to low bandwidth Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>

Ported Megatron tutorial. (#30)
* Ported DeepSpeed overview. * Renamed subsection * Formatting table of contents * initial import of Megatron tutorial * Grammatical edits, formatting, and paths. * formatting and data download instructions * formatting tutorial * formatting tutorial * formatting tutorial * formatting tutorial * formatting tutorial * formatting tutorial * new perf chart * removing TODO * adding pointer to tutorial * edits * azure to low bandwidth Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
073da729 · Shaden Smith · GitHub · 766f030e · 073da729 · 073da729
Unverified Commit 073da729 authored Feb 06, 2020 by Shaden Smith Committed by GitHub Feb 06, 2020
4 changed files
--- a/README.md
+++ b/README.md
-# DeepSpeed
-
 [![Build Status](https://msdlserving.visualstudio.com/DeepScale/_apis/build/status/DeepSpeed%20GPU%20CI?branchName=master)](https://msdlserving.visualstudio.com/DeepScale/_build/latest?definitionId=36&branchName=master)
 [![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)

+DeepSpeed is a deep learning optimization library that makes distributed training easy,
+efficient, and effective.
+
+<p align="center"><i><b>10x Larger Models</b></i></p>
+<p align="center"><i><b>5x Faster Training</b></i></p>
+<p align="center"><i><b>Minimal Code Change</b></i></p>
+
+DeepSpeed can train DL models with over a hundred billion parameters on current generation of GPU clusters, while achieving over 5x in system performance compared to the state-of-art. Early adopters of DeepSpeed have already produced language model (LM) with over 17B parameters establishing new SOTA in the LM category.
+
+Below we provide a brief feature list, see our detailed [feature
+overview](#deepspeed-feature-overview) for descriptions and usage guide.
+
+* [Distributed Training with Mixed Precision](#distributed-training-with-mixed-precision)
+    * 16-bit mixed precision
+    * Single-GPU/Multi-GPU/Multi-Node
+* [Model Parallelism](#model-parallelism)
+    * Support for Custom Model Parallelism
+    * Integration with Megatron-LM
+* [Memory and Bandwidth Optimizations](#memory-and-bandwidth-optimizations)
+    * The Zero Redundancy Optimizer (ZeRO)
+    * Constant Buffer Optimization (CBO)
+    * Smart Gradient Accumulation
+* [Training Features](#training-features)
+    * Simplified training API
+    * Gradient Clipping
+    * Automatic loss scaling with mixed precision
+* [Training Optimizers](#training-optimizers)
+    * Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
+    * Memory bandwidth optimized FP16 Optimizer
+    * Large Batch Training with LAMB Optimizer
+    * Memory efficient Training with ZeRO Optimizer
+* [Training Agnostic Checkpointing](#training-agnostic-checkpointing)
+* [Advanced Parameter Search](#advanced-parameter-search)
+    * Learning Rate Range Test
+    * 1Cycle Learning Rate Schedule
+* [Simplified Data Loader](#simplified-data-loader)
+* [Performance Analysis and Debugging](#performance-analysis-and-debugging)
+
+
+
+## Table of Contents
+
+| Section                                 | Description                                 |
+| --------------------------------------- | ------------------------------------------- |
+| [Why DeepSpeed?](#why-deepspeed)        |  DeepSpeed overview                         |
+| [Installation](#installation)           |  Installation instructions                  |
+| [Feature Overview](#feature-overview)   |  Preview of DeepSpeed's features            |
+| [Testing](#testing)                     |  Instructions for testing DeepSpeed         |
+| [Contributing](#contributing)           |  Instructions for contributing to DeepSpeed |
+
+
+
+
+## Why DeepSpeed?
+Training advanced deep learning models is challenging. Beyond model design,
+model scientists also need to set up the state-of-the-art training techniques
+such as distributed training, mixed precision, gradient accumulation, and
+checkpointing. Yet still, scientists may not achieve the desired system
+performance and convergence rate. Large model sizes are even more challenging:
+a large model easily runs out of memory with pure data paralelism and it is
+difficult to use model parallelism. DeepSpeed addresses these challenges to
+accelerate model development *and* training.
+
+### Distributed, Effective, and Efficient Training with Ease
+The DeepSpeed API is a lightweight wrapper on [PyTorch](https://pytorch.org/). This
+means that you can use everything you love in PyTorch and without learning a new
+platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art
+training techniques, such as distributed training, mixed precision, gradient
+accumulation, and checkpoints so that you can focus on your model development. Most
+importantly, you can leverage the distinctive efficiency and effectiveness benefit of
+DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch
+models.
+
+### Speed
+DeepSpeed achieves high performance and fast convergence through a combination of
+efficiency optimizations on compute/communication/memory/IO and effectiveness
+optimizations on advanced hyperparameter tuning and optimizers. For example:
+
+* DeepSpeed trains BERT-large to parity in 14 hours using 64 GPUs (4 DGX-2 boxes) and in
+  3.7 hours using 256 GPUs (16 DGX-2 boxes).
+
+  **BERT-large Training Times**
+
+  | Devices       | Source    | Training Time (hours) |
+  | ------------- | --------- | ---------------------:|
+  | 64 TPUs       | Google    |                    96 |
+  | 64 V100 GPUs  | DeepSpeed |                **14** |
+  | 256 V100 GPUs | NVIDIA    |                   3.9 |
+  | 256 V100 GPUs | DeepSpeed |               **3.7** |
+
+  <!---*Read more*: [BERT tutorial](../../Tutorials/bert_pretraining/deepspeed_bert_training.md)-->
+
+  *BERT Tutorial*: Coming Soon
+
+* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than state-of-art, NVIDIA
+  Megatron on Azure GPUs.
+
+  *Read more*: [GPT tutorial](../../Tutorials/Megatron_GPT2/MegatronGPT2Tutorial.md)
+
+
+
+### Memory efficiency
+DeepSpeed provides memory-efficient data parallelism and enables training models without
+model parallelism. For example, DeepSpeed can train models with up to 6 billion parameters on
+NVIDIA V100 GPUs with 32GB of device memory. In comparison, existing frameworks (e.g.,
+PyTorch's Distributed Data Parallel) run out of memory with 1.5 billion parameter models.
+
+DeepSpeed reduces the training memory footprint through a novel solution called Zero
+Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are
+replicated across data-parallel processes, ZeRO partitions model states to save
+significant memory. The current implementation (stage 1 of ZeRO) reduces memory by up to
+4x relative to the state-of-art. You can read more about ZeRO in our [technical
+report](https://arxiv.org/abs/1910.02054).
+
+With this impressive memory reduction, early adopters of DeepSpeed have already produced language model (LM) with over 17B parameters called [Turing-NLG](link-to-turing-blog) establishing new SOTA in the LM category.
+
+### Scalability
+DeepSpeed supports efficient data parallelism, model parallelism, and their
+combination. ZeRO boosts the scaling capability and efficiency further.
+* DeepSpeed provides system support to run models up to 100 billion parameters,
+  10x larger than the state-of-art (8 billion NVIDIA GPT, 11 billion Google T5).
+* DeepSpeed can run large models more efficiently, up to 6x faster for models with
+  various sizes spanning 1.5B to 100B.
+
+  *Read more*: [technical report](https://arxiv.org/abs/1910.02054),
+  [GPT tutorial](../../Tutorials/Megatron_GPT2/MegatronGPT2Tutorial.md),
+  and [QANet tutorial](../../Tutorials/QANet/QANetTutorial.md).
+
+![DeepSpeed-vs-Megatron](./docs/figures/DeepSpeed-vs-Megatron.png)
+
+
+### Fast convergence for effectiveness
+DeepSpeed supports advanced hyperparameter tuning and large batch size
+optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
+effectiveness of model training and reduce the number of samples required to
+convergence to desired accuracy.
+
+*Read more*: [Tuning tutorial](../../Tutorials/1cycle/1Cycle.md), [QANet
+tutorial](../../Tutorials/QANet/QANetTutorial.md) and *BERT Tutorial*: Coming Soon
+<!---[BERT
+tutorial](../../Tutorials/BingBertSquad/BingBertSquadTutorial.md),
+-->

 ## Installation
+**TODO**
+
+
+## Feature Overview
+
+### Distributed Training with Mixed Precision
+
+#### Mixed Precision Training
+Enable 16-bit (FP16) training by in the `deepspeed_config` JSON.
+```json
+"fp16": {
+    "enabled": true,
+    "loss_scale": 0,
+    "loss_scale_window": 1000,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+}
+```
+
+#### Single-GPU, Multi-GPU, and Multi-Node Training
+Easily switch between single-GPU, single-node multi-GPU, or multi-node multi-GPU
+execution by specifying resources with a hostfile.
+```bash
+deepspeed --hostfile=<hostfile> \
+	<client_entry.py> <client args> \
+	--deepspeed --deepspeed_config ds_config.json
+```
+
+The script `<client_entry.py>` will execute on the resources specified in `<hostfile>`.
+
+**TODO: explain hostfile formatting**
+
+
+### Model Parallelism
+
+#### Support for Custom Model Parallelism
+DeepSpeed is supports all forms of model parallelism including tensor slicing based
+approaches such as the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), or a
+pipelined parallelism approach such as
+[PipeDream](https://github.com/msr-fiddle/pipedream) or
+[GPipe](https://github.com/kakaobrain/torchgpipe). It does so by only requiring the model
+parallelism framework to provide a *model parallelism unit* (`mpu`) that implements a few
+bookkeeping functionalities:
+
+```python
+mpu.get_model_parallel_rank()
+mpu.get_model_parallel_group()
+mpu.get_model_parallel_world_size()
+
+mpu.get_data_parallel_rank/group/world_size()
+mpu.get_data_parallel_group()
+mpu.get_data_parallel_world_size()
+```
+#### Integration with Megatron-LM
+**TODO: port tutorial to its own page**
+DeepSpeed is fully compatible with [Megatron](https://github.com/NVIDIA/Megatron-LM).
+Please see the [Megatron-LM tutorial](docs/tutorials/MegatronGPT2Tutorial.md) for details.
+
+
+
+### Memory and Bandwidth Optimizations
+
+#### The Zero Redundancy Optimizer (ZeRO)
+[ZeRO](https://arxiv.org/abs/1910.02054) is at the heart of DeepSpeed and
+enables large model training at a scale that is simply not possible with model
+parallelism alone. When enabled, ZeRO allows training models with
+over 6 billion parameters without any model parallelism, and up to 100 billion
+parameter models with model parallelism on current generation hardware.
+
+For more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054), [GPT
+tutorial](../../Tutorials/Megatron_GPT2/MegatronGPT2Tutorial.md) on integration with
+DeepSpeed. Additional tutorals including *BERT Tutorial*: Coming Soon.
+<!---[BERT
+tutorial](../../Tutorials/BingBertSquad/BingBertSquadTutorial.md),
+-->
+#### Constant Buffer Optimization (CBO)
+CBO enables high network and memory throughput while restricting memory usage to a
+constant size. For memory- and network-bound operations such as normalization or
+allreduce collectives, the performance depends on the size of the operand. Simply fusing
+all operands into a single large operand can enable great throughput at the expense of
+unnecessary memory overhead. CBO in DeepSpeed fuses smaller operands into approximately a
+pre-defined sized buffer large enough to achieve great performance without the
+unnecessary memory overhead.
+
+#### Smart Gradient Accumulation
+Gradient accumulation allows running larger batch size with limited memory by breaking an
+effective batch into several sequential micro-batches, and averaging the parameter
+gradients across these micro-batches. Furthermore, instead of averaging the gradients of
+each micro-batch across all GPUs, the gradients are averaged locally during each step of
+the sequence, and a single `allreduce` is done at the end of the sequence to produce the
+averaged gradients for the effective batch across all GPUs. This strategy significantly
+reduces the communication involved over the approach of averaging globally for each
+micro-batch, specially when the number of micro-batches per effective batch is large.
+
+
+### Training Features
+
+#### Simplified training API
+The DeepSpeed core API consists of just a handful of methods:
+* initialization: `initialize`
+* training: `backward` and `step`
+* argument parsing: `add_config_arguments`
+* checkpointing : `load_checkpoint` and `store_checkpoint`
+
+DeepSpeed supports all the features described in this document, via the use of these API,
+along with a `deepspeed_config` JSON file for enabling and disabling the features. Please
+see [core API doc](../../API/core_api/core_api.md) for more details.
+
+
+#### Gradient Clipping
+DeepSpeed handles gradient clipping under the hood based on the max gradient norm
+specified by the user. See [core API doc](../../API/core_api/core_api.md) for more
+details.
+
+#### Automatic loss scaling with mixed precision
+DeepSpeed internally handles loss scaling for mixed precision training. The parameters
+for loss scaling can be specified in the `deepspeed_config` JSON file. See [core API
+  doc](../../API/core_api/core_api.md) for more details.
+
+### Training Optimizers
+
+#### Fused Adam optimizer and arbitrary torch.optim.Optimizer
+With DeepSpeed, the user can choose to use a high performance implementation of ADAM from
+NVIDIA, or any training optimizer that extends torch's `torch.optim.Optimizer` class.
+
+#### Memory bandwidth optimized FP16 Optimizer
+Mixed precision training is handled by the DeepSpeed FP16 Optimizer. This optimizer not
+only handles FP16 training but is also highly efficient. The performance of weight update
+is primarily dominated by the memory bandwidth, and the achieved memory bandwidth is
+dependent on the size of the input operands. The FP16 Optimizer is designed to maximize
+the achievable memory bandwidth by merging all the parameters of the model into a single
+large buffer, and applying the weight updates in a single kernel, allowing it to achieve
+high memory bandwidth.
+
+#### Large Batch Training with LAMB Optimizer
+**TODO: port tutorial**
+DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Optimizer.
+For more details on LAMB, see the [BERT
+tutorial](../../Tutorials/BingBertSquad/BingBertSquadTutorial.md)  and the [LAMB
+paper](https://arxiv.org/pdf/1904.00962.pdf).
+
+#### Memory-Efficient Training with ZeRO Optimizer
+DeepSpeed can train models up with up to 6 billion parameters without parallelism, and
+models with up to 100 billion parameters with 16-way model parallelism. This leap in
+model size is possible though the memory efficiency achieved via the ZeRO Optimizer. For
+more details see [ZeRO paper](https://arxiv.org/abs/1910.02054) .
+
+
+
+### Training Agnostic Checkpointing
+**TODO: API documentation**
+DeepSpeed can simplify checkpointing for you regardless of whether you are using data
+parallel training, model parallel training, mixed-precision training, a mix of these
+three, or using the zero optimizer to enable larger model sizes. See the [getting
+started](../../Onboard/onboard/onboard.md) or [core API
+doc](../../API/core_api/core_api.md) for details.
+
+### Advanced parameter search
+DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for
+large batch scaling.
+
+#### Learning Rate Range Test
+Please refer to [Learning Rate Range Test](../../Tutorials/lrrt/lrrt.md).
+
+#### 1Cycle Learning Rate Schedule
+Please refer to [1Cycle Learning Rate Schedule](../../Tutorials/1cycle/1Cycle.md).
+
+
+### Simplified Data Loader
+DeepSpeed abstracts away data parallelism and model parallelism from the user when it
+comes to data loading. Users simply provide a PyTorch dataset, and DeepSpeed data loader
+can automatically handle batch creation appropriately.
+
+### Performance Analysis and Debugging
+For performance debugging, DeepSpeed can give you a detailed breakdown of the time spent
+in different parts of the training with by simply enabling it in the `deepspeed_config`
+file. See [core API doc](../../API/core_api/core_api.md).
+```json
+{
+  "wallclock_breakdwon": true
+}
+```
+


 ## Testing
@@ -13,15 +336,15 @@ DeepSpeed tracks two types of tests: unit tests and more costly model convergenc
 The model convergence tests train
 [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and measure
 end-to-end convergence and related metrics. Unit tests are found in `tests/unit/` and
-model convergence tests are found in `tests/model/`.
+the model convergence tests are found in `tests/model/`.

 ### Unit Tests
 [PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest can be
 installed from PyPI via `pip install pytest`. Simply invoke `pytest --forked` to run the
 unit tests:
-
-    pytest --forked tests/unit/
-
+```bash
+pytest --forked tests/unit/
+```
 You can also provide the `-v` flag to `pytest` to see additional information about the
 tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the
 `--forked` flag are required to test CUDA functionality in distributed tests.
@@ -30,33 +353,33 @@ tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) an
 To execute model tests, first [install DeepSpeed](#installation). The
 [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
 as part of this process. Next, execute the model test driver:
-
-    cd tests/model/
-    pytest run_sanity_check.py
-
+```bash
+cd tests/model/
+pytest run_sanity_check.py
+```
+Note that the `--forked` flag is not necessary for the model tests.


 ## Contributing
-
 DeepSpeed welcomes your contributions!

-### Prerequisites

+### Prerequisites
 DeepSpeed uses [pre-commit](https://pre-commit.com/) to ensure that formatting is
 consistent across DeepSpeed. First, ensure that `pre-commit` is installed from either
 installing DeepSpeed or `pip install pre-commit`. Next, the pre-commit hooks must be
 installed once before commits can be made:
-
-    pre-commit install
+```bash
+pre-commit install
+```

 Afterwards, our suite of formatting tests run automatically before each `git commit`. You
 can also run these manually:
-
-    pre-commit run --all-files
-
+```bash
+pre-commit run --all-files
+```

 ### Contributor License Agreement
-
 This project welcomes contributions and suggestions. Most contributions require you to
 agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
 actually do, grant us the rights to use your contribution. For details, visit
@@ -68,7 +391,6 @@ follow the instructions provided by the bot. You will only need to do this once
 all repos using our CLA.

 ### Code of Conduct
-
 This project has adopted the [Microsoft Open Source Code of
 Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
 [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact

--- a/docs/figures/DeepSpeed-vs-Megatron.png
+++ b/docs/figures/DeepSpeed-vs-Megatron.png
--- a/docs/figures/megatron-gpt2-perf-test.png
+++ b/docs/figures/megatron-gpt2-perf-test.png
--- a/docs/tutorials/MegatronGPT2Tutorial.md
+++ b/docs/tutorials/MegatronGPT2Tutorial.md
+# Tutorial: Megatron-LM GPT2 with DeepSpeed
+**TODO: these two links are broken (not yet implemented).**
+
+We advise you to first read through the guides for [Setup and
+Onboarding](../../Onboard/onboard/onboard.md) and [Model
+Training](../../Onboard/model_training/deepspeed_model_training.md).
+
+In this tutorial we will be adding DeepSpeed to Megatron-LM GPT2 model, which
+is a large, powerful transformer. Megatron-LM supports model-parallel and multi-node
+training. Please see the corresponding paper for more details: [Megatron-LM:
+Training Multi-Billion Parameter Language Models Using Model
+Parallelism](https://arxiv.org/abs/1909.08053).
+
+First, we discuss data and environment setup and how to train the GPT-2 model with the
+original Megatron-LM. Next, we proceed step-by-step in enabling this model to run with
+DeepSpeed. Finally, we demonstrate the **_performance gains_**, and **_memory footprint
+reduction_** from using DeepSpeed.
+
+## 1 Training GPT-2 with the Original Megatron-LM
+
+The original model code from
+[Megatron-LM](https://github.com/NVIDIA/Megatron-LM).  We've copied this repo
+under
+[DeepSpeedExamples/Megatron-LM/](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM)
+and made it available as a submodule. To download, execute:
+```bash
+git submodule update --init --recursive
+```
+
+### 1.1 Training Data Setup
+* Follow Megatron's [instructions](https://github.com/NVIDIA/Megatron-LM#collecting-gpt2-webtext-data)
+  to download the webtext data and place a symbolic link under `DeepSpeedExamples/Megatron-LM/data`:
+  * (*Microsoft*:) Raw and pre-processed data has already been downloaded on
+     all DLTS clusters: `/data/Megatron-LM/data/`. You can simply execute
+      `ln -s /data/Megatron-LM/data DeepSpeedExamples/Megatron-LM/`.
+
+### 1.2 Running Unmodified Megatron-LM GPT2 model
+
+* For a single GPU run:
+    - change `scripts/pretrain_gpt2.sh`, set its `--train-data` argument as `"webtext"`.
+    - run `bash scripts/pretrain_gpt2.sh`
+
+* For multiple GPUs and/or nodes run:
+    - change `scripts/pretrain_gpt2_model_parallel.sh`
+        - set its `--train-data` argument as `"webtext"`
+        - `GPUS_PER_NODE` indicates how many GPUs per node involved in the testing
+        - `NNODES` indicates how many nodes involved in the testing
+
+    - run `bash scripts/pretrain_gpt2_model_parallel.sh`
+
+
+## 2 Enabling DeepSpeed
+
+To use DeepSpeed we will modify three files :
+
+* `arguments.py` : Arguments configurations
+* `pretrain_gpt2.py` : Main entry point for training
+* `utils.py` : Checkpoints saving and loading utilities
+
+
+### 2.1 Argument Parsing
+The first step is to apply DeepSpeed is adding DeepSpeed arguments to
+Megatron-LM GPT2 model, using `deepspeed.add_config_arguments()` in
+`arguments.py`.
+
+```python
+def get_args():
+    """Parse all the args."""
+
+    parser = argparse.ArgumentParser(description='PyTorch BERT Model')
+    parser = add_model_config_args(parser)
+    parser = add_fp16_config_args(parser)
+    parser = add_training_args(parser)
+    parser = add_evaluation_args(parser)
+    parser = add_text_generate_args(parser)
+    parser = add_data_args(parser)
+
+    # Include DeepSpeed configuration arguments
+    parser = deepspeed.add_config_arguments(parser)
+```
+
+
+
+### 2.2 Initialization and Training
+We modify `pretrain.py` to enable training with DeepSpeed.
+
+#### 2.2.1 Initialization
+We use `deepspeed.initialize` to create `model_engine`, `optimizer` and LR
+`scheduler`. Below is its definition:
+```python
+def initialize(args,
+               model,
+               optimizer=None,
+               model_parameters=None,
+               training_data=None,
+               lr_scheduler=None,
+               mpu=None,
+               dist_init_required=True,
+               collate_fn=None):
+```
+
+For the Megatron-LM GPT2 model, we initialize DeepSpeed in its
+`setup_model_and_optimizer()` function as below, to pass the raw `model`,
+`optimizer`, `args`, `lr_scheduler` and `mpu`.
+```python
+def setup_model_and_optimizer(args):
+    """Setup model and optimizer."""
+
+    model = get_model(args)
+    optimizer = get_optimizer(model, args)
+    lr_scheduler = get_learning_rate_scheduler(optimizer, args)
+
+    if args.deepspeed:
+        import deepspeed
+
+        print_rank_0("DeepSpeed is enabled.")
+
+        model, optimizer, _, lr_scheduler = deepspeed.initialize(
+            model=model,
+            optimizer=optimizer,
+            args=args,
+            lr_scheduler=lr_scheduler,
+            mpu=mpu,
+            dist_init_required=False
+       )
+```
+
+
+Note that when FP16 is enabled, Megatron-LM GPT2 adds a wrapper to the `Adam`
+optimizer. DeepSpeed has its own FP16 Optimizer, so we need to pass the `Adam`
+optimizer to DeepSpeed directly without any wrapper. We return the unwrapped
+Adam optimizer from `get_optimizer()` when DeepSpeed is enabled.
+```python
+def get_optimizer(model, args):
+    """Setup the optimizer."""
+
+    ......
+
+    # Use Adam.
+    optimizer = Adam(param_groups,
+                     lr=args.lr, weight_decay=args.weight_decay)
+
+    if args.deepspeed:
+        # fp16 wrapper is not required for DeepSpeed.
+        return optimizer
+```
+
+#### 2.2.2 Using the Training API
+The `model` returned by `deepspeed.initialize` is the _DeepSpeed Model Engine_
+that we will use to train the model using the forward, backward and step API.
+
+
+##### Forward Propagation
+The forward propagation API is compatible to PyTorch and no change is required.
+
+
+##### Backward Propagation
+Backward propagation is done by calling `backward(loss)` directly on the model engine.
+
+```python
+    def backward_step(optimizer, model, lm_loss, args, timers):
+        """Backward step."""
+
+        # Total loss.
+        loss = lm_loss
+
+        # Backward pass.
+        if args.deepspeed:
+            model.backward(loss)
+        else:
+            optimizer.zero_grad()
+            if args.fp16:
+                optimizer.backward(loss, update_master_grads=False)
+            else:
+                loss.backward()
+```
+
+Zeroing the gradients is handled automatically by DeepSpeed after the weights
+have been updated using a mini-batch.
+
+Furthermore, DeepSpeed addresses distributed data parallel and FP16 under the
+hood, simplifying code in multiple places.
+
+(A) DeepSpeed also performs gradient averaging automatically at the gradient
+accumulation boundaries. So we skip the allreduce communication.
+
+   ```python
+        if args.deepspeed:
+            # DeepSpeed backward propagation already addressed all reduce communication.
+            # Reset the timer to avoid breaking timer logs below.
+            timers('allreduce').reset()
+        else:
+            torch.distributed.all_reduce(reduced_losses.data)
+            reduced_losses.data = reduced_losses.data / args.world_size
+            if not USE_TORCH_DDP:
+                timers('allreduce').start()
+                model.allreduce_params(reduce_after=False,
+                                       fp32_allreduce=args.fp32_allreduce)
+                timers('allreduce').stop()
+
+   ```
+
+(B) We also skip updating master gradients, since DeepSpeed addresses it internally.
+
+   ```python
+        # Update master gradients.
+        if not args.deepspeed:
+            if args.fp16:
+                optimizer.update_master_grads()
+
+            # Clipping gradients helps prevent the exploding gradient.
+            if args.clip_grad > 0:
+                if not args.fp16:
+                    mpu.clip_grad_norm(model.parameters(), args.clip_grad)
+                else:
+                    optimizer.clip_master_grads(args.clip_grad)
+
+        return lm_loss_reduced
+
+   ```
+
+##### Updating the Model Parameters
+The `step()` function in DeepSpeed engine updates the model parameters as well
+as the learning rate.
+
+```python
+     if args.deepspeed:
+         model.step()
+     else:
+         optimizer.step()
+
+         # Update learning rate.
+         if not (args.fp16 and optimizer.overflow):
+             lr_scheduler.step()
+         else:
+             skipped_iter = 1
+
+```
+
+
+
+##### Loss Scaling
+The GPT2 training script logs the loss scaling value during training. Inside,
+the DeepSpeed optimizer, this value is stored as `cur_scale` instead of
+`loss_scale` in Megatron's optimizer. Therefore, we appropriately replace it in
+the logging string.
+
+```python
+             if args.fp16:
+                 log_string += ' loss scale {:.1f} |'.format(
+                     optimizer.cur_scale if args.deepspeed else optimizer.loss_scale)
+
+```
+
+
+### 2.3 Checkpoints Saving & Loading
+
+DeepSpeed engine has flexible APIs for checkpoint saving and loading, to handle
+the states from both the client model and its own internal.
+
+```python
+def save_checkpoint(self, save_dir, tag, client_state={})
+def load_checkpoint(self, load_dir, tag)
+```
+
+Applying DeepSpeed needs to update utils.py in which Megatron-LM GPT2 saves and
+loads its checkpoints.
+
+A new function `save_ds_checkpoint()` is created as below for DeepSpeed, it
+collects the client model states and passes to DeepSpeed engine by calling
+`save_checkpoint()` of DeepSpeed.
+
+```python
+ def save_ds_checkpoint(iteration, model, args):
+     """Save a model checkpoint."""
+
+     sd = {}
+     sd['iteration'] = iteration
+     # rng states.
+     if not args.no_save_rng:
+         sd['random_rng_state'] = random.getstate()
+         sd['np_rng_state'] = np.random.get_state()
+         sd['torch_rng_state'] = torch.get_rng_state()
+         sd['cuda_rng_state'] = torch.cuda.get_rng_state()
+         sd['rng_tracker_states'] = mpu.get_cuda_rng_tracker().get_states()
+
+     model.save_checkpoint(args.save, iteration, client_state = sd)
+
+```
+
+In Megatron-LM GPT2 `save_checkpoint()` function, adds following lines to
+invoke the above function for DeepSpeed.
+
+```python
+ def save_checkpoint(iteration, model, optimizer,
+                     lr_scheduler, args):
+     """Save a model checkpoint."""
+     if args.deepspeed:
+         save_ds_checkpoint(iteration, model, args)
+     else:
+		......
+
+```
+
+In `load_checkpoint()` function, use DeepSpeed loading checkpoint API as below,
+and return the states for the client model.
+
+```python
+ def load_checkpoint(model, optimizer, lr_scheduler, args):
+     """Load a model checkpoint."""
+
+     iteration, release = get_checkpoint_iteration(args)
+
+     if args.deepspeed:
+         checkpoint_name, sd = model.load_checkpoint(args.load, iteration)
+
+         if checkpoint_name is None:
+             if mpu.get_data_parallel_rank() == 0:
+                 print("Unable to load checkpoint.")
+             return iteration
+     else:
+         ......
+
+```
+
+### 2.4 Train  scripts
+Assume webtext data was prepared in previous step, to start training
+Megatron-LM GPT2 model with DeepSpeed applied, execute the following command to
+start training.
+
+- Single GPU run
+  - run `bash scripts/ds_pretrain_gpt2.sh`
+- Multiple GPUs/Nodes run
+  - run `bash scripts/ds_pretrain_gpt2_model_parallel.sh`
+
+
+
+## 3 DeepSpeed Improvements over Megatron
+DeepSpeed enables training very large models effectively via the advanced [ZeRO
+optimizer](https://arxiv.org/abs/1910.02054v2). ZeRO significantly reduces the memory
+footprint for training large models which means large models can be trained with i) less
+model parallelism and ii) larger batch sizes. A lower model parallelism degree improves
+training efficiency by increasing the granularity of the computation such as the matrix
+multiplication where performance is directly related to the size of the matrices.
+Furthermore, less model parallelism also results in less communication between model
+parallel GPUs, which further boosts performance.  Larger batch size has a similar effect
+of increasing the computational granularity as well as reducing communication, also
+resulting in better performance. Therefore, using DeepSpeed with Megatron can be
+significantly faster than using Megatron without DeepSpeed.
+
+The observed performance improvements depend on several factors such as the memory per
+GPU, the local GPU interconnect (i.e., PCI-E vs NVLINK vs NVSwitch), the model size,
+inter node network interconnect, etc. Below, we show some of the performance improvements
+from using DeepSpeed over Megatron on a 16 GPU Low Bandwidth (40 Gbps) cluster and a 400 GPU DGX-2 High Bandwidth (800 Gbps) cluster.
+For details please see the [ZeRO Paper](https://arxiv.org/abs/1910.02054v2). We also
+present performance improvement on a 64 GPU cluster along with detailed configuration
+analysis to show where the improvements come from.
+
+![DeepSpeed-vs-Megatron](../figures/DeepSpeed-vs-Megatron.png)
+
+### 3.1 On Low Bandwidth GPU Cluster
+The figure above shows that training 1.5B parameter model with DeepSpeed is
+nearly 4x faster than without DeepSpeed on a cluster with 4 nodes, 4 GPU per
+node, and 16 GPUs total. These GPUs have 16GB of memory each, and PCI-E
+interconnects GPUs within a node, and 40 Gbps infiniband across nodes.
+
+The performance improvement comes from lower model parallelism degree and
+larger batch size as discussed earlier. Training 1.5B parameter model with
+Megatron alone requires 4-way model parallelism, and can only fit an effective
+batch size of 32 using all 16 GPUs. On the other hand, DeepSpeed does not
+require any model-parallelism to train this model, and can support and
+effective batch size of 128 without running out of memory, resulting in
+significantly higher performance.
+
+
+### 3.2 On High bandwidth DGX-2 GPU Cluster
+Each GPU on the DGX-2 cluster has 32 GB of memory, and GPUs inside a box is connected via
+the high-bandwidth NVSwitch. DGX-2 nodes are connected to each other via 800 Gbps (8 x 100Gbps) infiniband interconnect. As such, running a 1.5B model on DGX-2 requires less model
+parallelism, and the performance improvement from DeepSpeed for this model size is not
+significant. However, at larger model sizes, Megatron still requires significantly larger
+model parallelism degree, and can only run much smaller batch sizes than DeepSpeed.
+Therefore, as the model sizes get larger, DeepSpeed starts to significantly outperform
+Megatron.
+
+
+### 3.3 Performance Improvements with Configuration Details
+The figure below compares DeepSpeed with Megatron on a 64 GPU cluster with 4
+DGX-2 nodes. To give the readers a clear idea of source of the performance
+improvements, we also present the configuration table for both Megatron and
+DeepSpeed. It shows the smallest model parallelism degree and the largest batch
+size that can be used to train these models without running out of memory. As
+discussed above, the tables demonstrate that DeepSpeed can run with smaller
+achieve better performance than Megatron.
+
+![DeepSpeed Performance SpeedUp](../figures/megatron-gpt2-perf-test.png)
+
+**a ) Megatron-LM GPT2 Baseline**
+
+|      | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
+| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
+| 1.5B | 2                 | 32               | 64    | 512        | 48     | 1600        | 16              | 128.56        |
+| 4B   | 4                 | 16               | 64    | 128        | 64     | 2304        | 16              | 49.36         |
+| 8B   | 4                 | 16               | 64    | 128        | 72     | 3072        | 24              | 24.57         |
+| 20B  | 16                | 4                | 64    | 16         | 111    | 3808        | 32              | 3.42          |
+
+
+
+**b ) Megatron-LM GPT2 with DeepSpeed**
+
+|      | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
+| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
+| 1.5B | 1                 | 64               | 64    | 2048       | 48     | 1600        | 16              | 151.35        |
+| 4B   | 1                 | 64               | 64    | 512        | 64     | 2304        | 16              | 75.13         |
+| 8B   | 2                 | 32               | 64    | 512        | 72     | 3072        | 24              | 43.52         |
+| 20B  | 4                 | 16               | 64    | 128        | 111    | 3808        | 32              | 12.65         |