Unverified commit b84a1fa4 authored by Shaden Smith, committed by GitHub

Web edits (#147)

parent 4d735946
...@@ -72,7 +72,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example:
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state of the art, NVIDIA
Megatron, on Azure GPUs.
*Read more*: [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/)
...@@ -106,10 +106,10 @@ combination. ZeRO boosts the scaling capability and efficiency further.
significant performance gains compared to using model parallelism alone.
*Read more*: [technical report](https://arxiv.org/abs/1910.02054),
and [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/).
<!-- and [QANet tutorial](../../Tutorials/QANetTutorial.md). -->
![DeepSpeed-vs-Megatron](./docs/assets/images/DeepSpeed-vs-Megatron.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.</em>
</p>
...@@ -121,7 +121,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
*Read more*: [Tuning tutorial](https://www.deepspeed.ai/tutorials/1Cycle/),
<!---
and *BERT Tutorial*: Coming Soon.
...@@ -137,33 +137,33 @@ Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed.
## Features
Below we provide a brief feature list; see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* [Distributed Training with Mixed Precision](https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision)
    * 16-bit mixed precision
    * Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
    * Support for Custom Model Parallelism
    * Integration with Megatron-LM
* [Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#memory-and-bandwidth-optimizations)
    * The Zero Redundancy Optimizer (ZeRO)
    * Constant Buffer Optimization (CBO)
    * Smart Gradient Accumulation
* [Training Features](https://www.deepspeed.ai/features/#training-features)
    * Simplified training API
    * Gradient Clipping
    * Automatic loss scaling with mixed precision
* [Training Optimizers](https://www.deepspeed.ai/features/#training-optimizers)
    * Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
    * Memory bandwidth optimized FP16 Optimizer
    * Large Batch Training with LAMB Optimizer
    * Memory efficient Training with ZeRO Optimizer
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
    * Learning Rate Range Test
    * 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
# Getting Started
...@@ -171,9 +171,9 @@ overview](./docs/features.md) for descriptions and usage.
## Installation
* Please see our [Azure tutorial](https://www.deepspeed.ai/tutorials/azure/) to get started with DeepSpeed on Azure!
* If you're not on Azure, we recommend using our docker image via `docker pull deepspeed/deepspeed:latest` which contains a pre-installed version of DeepSpeed and all the necessary dependencies.
* If you want to install DeepSpeed manually, we provide an install script `install.sh` to help install on a local machine or across an entire cluster.
## Writing DeepSpeed Models
DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
...@@ -280,7 +280,7 @@ the `step` value is stored as part of the `client_sd`.
DeepSpeed features can be enabled, disabled, or configured using a config JSON
file that should be specified as `args.deepspeed_config`. A sample config file
is shown below. For a full set of features see [core API
doc](https://deepspeed.readthedocs.io/en/latest/).
```json
{
...@@ -412,13 +412,13 @@ as the hostname.
| Article                                                                        | Description                                  |
| ------------------------------------------------------------------------------ | -------------------------------------------- |
| [DeepSpeed Features](https://www.deepspeed.ai/features/)                        | DeepSpeed features                           |
| [DeepSpeed JSON Configuration](https://www.deepspeed.ai/docs/config_json/)      | Configuring DeepSpeed                        |
| [API Documentation](https://deepspeed.readthedocs.io/en/latest/)                | Generated DeepSpeed API documentation        |
| [CIFAR-10 Tutorial](https://www.deepspeed.ai/tutorials/CIFAR-10)                | Getting started with CIFAR-10 and DeepSpeed  |
| [Megatron-LM Tutorial](https://www.deepspeed.ai/tutorials/megatron/)            | Train GPT2 with DeepSpeed and Megatron-LM    |
| [Learning Rate Range Test Tutorial](https://www.deepspeed.ai/tutorials/lrrt/)   | Faster training with large learning rates    |
| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/1Cycle/)                   | SOTA learning schedule in DeepSpeed          |
../docs/_tutorials/azure.md
\ No newline at end of file
...@@ -33,25 +33,28 @@ collections:
defaults:
  - scope:
      path: ""
    values:
      layout: single
      author_profile: false
      read_time: false
      comments: false
      share: false
      related: false
      sneak_preview: false
      toc: true
      toc_label: "Contents"
      sidebar:
        nav: "lnav"
  - scope:
      path: "_pages"
    values:
      permalink: /docs/:title:output_ext
  - scope:
      path: ""
      type: posts
    values:
      layout: single
      share: true
timezone: America/Los_Angeles
breadcrumbs: true
...@@ -11,13 +11,33 @@ main:
    url: https://github.com/microsoft/DeepSpeed
lnav:
  - title: "Feature Overview"
    url: /features/
  - title: "Getting Started"
    url: /getting-started/
    children:
      - title: "Installation"
        url: /getting-started/#installation
      - title: "Writing Models"
        url: /getting-started/#writing-deepspeed-models
      - title: "Training"
        url: /getting-started/#training
      - title: "Launching"
        url: /getting-started/#launching-deepspeed-training
  - title: "Configuration"
    url: /docs/config_json/
  - title: "Tutorials"
    url: /tutorials/
    children:
      - title: "Getting Started on Azure"
        url: /tutorials/azure/
      - title: "CIFAR-10"
        url: /tutorials/cifar-10/
      - title: "Megatron-LM GPT2"
        url: /tutorials/megatron/
      - title: "1-Cycle Schedule"
        url: /tutorials/1Cycle/
      - title: "Learning Rate Range Test"
        url: /tutorials/lrrt/
  - title: "Contributing"
    url: /contributing/
---
title: "DeepSpeed Configuration JSON"
permalink: /docs/config_json.html
---
## REQUIRED DeepSpeed Config JSON Parameters
***train\_batch\_size***: [integer]
| Description | Example |
| ------------------------------------------------------------ | ------- |
| The effective training batch size. This is the number of data samples that leads to one step of model update. ***train\_batch\_size*** is aggregated by the batch size that a single GPU processes in one forward/backward pass (a.k.a., ***train\_step\_batch\_size***), the gradient accumulation steps (a.k.a., ***gradient\_accumulation\_steps***), and the number of GPUs. | `32` |
## OPTIONAL DeepSpeed Config JSON Parameters
### Batch Size Related Parameters
***train\_micro\_batch\_size\_per\_gpu***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ---------------------------- |
| Batch size to be processed by one GPU in one step (without gradient accumulation). When specified, ***gradient\_accumulation\_steps*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***gradient\_accumulation\_steps*** in the configuration JSON. | ***train\_batch\_size*** value |
***gradient\_accumulation\_steps***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Number of training steps to accumulate gradients before averaging and applying them. This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. When specified, ***train\_step\_batch\_size*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***train\_step\_batch\_size*** in the configuration JSON. | `1` |
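To make the relationship between these batch-size parameters concrete, here is a small sketch that checks the aggregation rule described above; the function and variable names simply mirror the JSON keys and are not part of DeepSpeed's API.
```python
# Sketch of the batch-size aggregation rule described above; illustrative only,
# not DeepSpeed's internal code.
def check_batch_config(train_batch_size, train_micro_batch_size_per_gpu,
                       gradient_accumulation_steps, num_gpus):
    """train_batch_size should equal micro batch * accumulation steps * GPUs."""
    effective = (train_micro_batch_size_per_gpu
                 * gradient_accumulation_steps
                 * num_gpus)
    assert effective == train_batch_size, \
        f"inconsistent batch configuration: {effective} != {train_batch_size}"

# Example: 32 = 4 samples per GPU x 2 accumulation steps x 4 GPUs
check_batch_config(train_batch_size=32,
                   train_micro_batch_size_per_gpu=4,
                   gradient_accumulation_steps=2,
                   num_gpus=4)
```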
### Optimizer Parameters
***optimizer***: [dictionary]
| Fields | Value | Example |
| ------ | ------------------------------------------------------------ | ------------------------------ |
| type | The optimizer name. DeepSpeed natively supports Adam and LAMB optimizers and will import other optimizers from [torch](https://pytorch.org/docs/stable/optim.html). | `"Adam"` |
| params | Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam)). | `{"lr": 0.001, "eps": 1e-8}` |
Example of ***optimizer***
```json
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.001,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
```
### Scheduler Parameters
***scheduler***: [dictionary]
| Fields | Value | Example |
| ------ | ------------------------------------------------------------ | ------------------------------ |
| type | The scheduler name. See [here](https://deepspeed.readthedocs.io/en/latest/deepspeed.pt.html) for a list of supported schedulers. | `"1Cycle"` |
| params | Dictionary of parameters to instantiate the scheduler. The parameter names should match the scheduler constructor signature. | `{"lr": 0.001, "eps": 1e-8}` |
Example of ***scheduler***
```json
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
```
### Communication options
***fp32\_allreduce***: [boolean]
| Description | Default |
| ------------------------------------ | ------- |
| During gradient averaging perform allreduce with 32 bit values | `false` |
***disable\_allgather***: [boolean]
| Description | Default |
| ---------------------------- | ------- |
| Disable allgather when using ZeRO optimizer and instead use broadcast | `false` |
***prescale\_gradients***: [boolean]
| Description | Default |
| -------------------------------------- | ------- |
| Scale gradients before doing allreduce | `false` |
***sparse\_gradients***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable sparse compression of [torch.nn.Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) gradients. | `false` |
### FP16 training options
***zero\_optimization***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |
***fp16***: [dictionary]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Configuration for using mixed precision/FP16 training that leverages [NVIDIA's Apex package](https://nvidia.github.io/apex/). An example, including the available dictionary keys is illustrated below. | None |
```json
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 32,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
}
```
***fp16:enabled***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***enabled*** is a **fp16** parameter indicating whether or not FP16 training is enabled. | `false` |
***fp16:loss\_scale***: [float]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***loss\_scale*** is a ***fp16*** parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling; otherwise the value will be used for static fixed loss scaling. | `0.0` |
***fp16:initial\_scale\_power***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***initial\_scale\_power*** is a **fp16** parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2<sup>***initial\_scale\_power***</sup>. | `32` |
***fp16:loss\_scale\_window***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***loss\_scale\_window*** is a **fp16** parameter representing the window over which to raise/lower the dynamic loss scale value. | `1000` |
***fp16:hysteresis***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***hysteresis*** is a **fp16** parameter representing the delay shift in dynamic loss scaling. | `2` |
***fp16:min\_loss\_scale***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| ***min\_loss\_scale*** is a **fp16** parameter representing the minimum dynamic loss scale value. | `1` |
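The loss-scaling parameters above interact over the course of training. The sketch below is a deliberately simplified simulation of that interaction (start at 2^`initial_scale_power`, back off after overflows subject to `hysteresis` and `min_loss_scale`, grow again after `loss_scale_window` clean steps); it illustrates the documented behavior only and is not DeepSpeed's loss scaler.
```python
# Simplified, illustrative simulation of dynamic loss scaling; this mirrors the
# parameter descriptions above, not DeepSpeed's actual implementation.
def simulate_loss_scale(overflow_flags, initial_scale_power=32,
                        loss_scale_window=1000, hysteresis=2, min_loss_scale=1):
    scale = 2.0 ** initial_scale_power
    clean_steps, overflow_budget = 0, hysteresis
    for overflowed in overflow_flags:
        if overflowed:
            overflow_budget -= 1
            if overflow_budget <= 0:               # tolerated overflows exhausted
                scale = max(scale / 2.0, min_loss_scale)
                overflow_budget = hysteresis
            clean_steps = 0
        else:
            clean_steps += 1
            if clean_steps >= loss_scale_window:   # stable window: raise the scale
                scale *= 2.0
                clean_steps = 0
    return scale

# Two consecutive overflows (with hysteresis=2) halve the scale once.
print(simulate_loss_scale([True, True, False, False], loss_scale_window=3))
```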
### Gradient Clipping
***gradient\_clipping***: [float]
| Description | Default |
| ----------------------------------- | ------- |
| Enable gradient clipping with value | `0` |
### Logging
***steps\_per\_print***: [integer]
| Description | Default |
| ----------- | ------- |
| Print train loss every N steps | `10` |
***wall\_clock\_breakdown***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable timing of the latency of forward/backward/update training phases | `false` |
***dump_state***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Print out state information of DeepSpeed object after initialization | `false` |
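Putting several of the documented parameters together, the sketch below writes a small config file from Python; the file name `ds_config.json` and all values are illustrative placeholders rather than recommended settings.
```python
import json

# Assemble an illustrative DeepSpeed config from the parameters documented above.
# Values are placeholders; tune them for your own model and hardware.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.001, "eps": 1e-8, "weight_decay": 3e-7}
    },
    "fp16": {"enabled": True, "loss_scale": 0, "loss_scale_window": 1000},
    "gradient_clipping": 1.0,
    "steps_per_print": 10,
    "wall_clock_breakdown": False
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# Point training at this file via --deepspeed_config ds_config.json
```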
...@@ -6,33 +6,6 @@ toc: true
toc_label: "Contents"
---
## Distributed Training with Mixed Precision
### Mixed Precision Training
...@@ -81,7 +54,7 @@ mpu.get_data_parallel_world_size()
### Integration with Megatron-LM
DeepSpeed is fully compatible with [Megatron](https://github.com/NVIDIA/Megatron-LM).
Please see the [Megatron-LM tutorial](/tutorials/megatron/) for details.
...@@ -95,11 +68,9 @@ over 6 billion parameters without any model parallelism, and up to 100 billion
parameter models with model parallelism on current generation hardware.
For more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054), [GPT
tutorial](/tutorials/megatron/) on integration with
DeepSpeed. Additional tutorials, including a *BERT Tutorial*, are coming soon.
### Constant Buffer Optimization (CBO)
CBO enables high network and memory throughput while restricting memory usage to a
constant size. For memory- and network-bound operations such as normalization or
...@@ -131,18 +102,18 @@ The DeepSpeed core API consists of just a handful of methods:
DeepSpeed supports all the features described in this document via these APIs,
along with a `deepspeed_config` JSON file for enabling and disabling the features.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
### Gradient Clipping
DeepSpeed handles gradient clipping under the hood based on the max gradient norm
specified by the user.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
### Automatic loss scaling with mixed precision
DeepSpeed internally handles loss scaling for mixed precision training. The parameters
for loss scaling can be specified in the `deepspeed_config` JSON file.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
## Training Optimizers
...@@ -176,19 +147,19 @@ more details see [ZeRO paper](https://arxiv.org/abs/1910.02054).
DeepSpeed can simplify checkpointing for you regardless of whether you are using data
parallel training, model parallel training, mixed-precision training, a mix of these
three, or using the ZeRO optimizer to enable larger model sizes.
Please see the [Getting Started](/getting-started/) guide and the
[core API doc](https://deepspeed.readthedocs.io/) for more details.
## Advanced parameter search
DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for
large batch scaling.
### Learning Rate Range Test
Please refer to the [Learning Rate Range Test](/tutorials/lrrt/) tutorial.
### 1Cycle Learning Rate Schedule
Please refer to the [1Cycle Learning Rate Schedule](/tutorials/1Cycle/) tutorial.
## Simplified Data Loader
...@@ -200,7 +171,7 @@ can automatically handle batch creation appropriately.
For performance debugging, DeepSpeed can give you a detailed breakdown of the time spent
in different parts of the training by simply enabling it in the `deepspeed_config`
file.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
```json
{
  "wall_clock_breakdown": true
}
```
---
title: "1-Cycle Schedule"
---
This tutorial shows how to implement 1Cycle schedules for learning rate and
momentum in PyTorch.
## 1-Cycle Schedule
Recent research has demonstrated that the slow convergence problems of large
batch size training can be addressed by tuning critical hyperparameters such
as learning rate and momentum during training, using cyclic and decay
schedules. In DeepSpeed, we have implemented a state-of-the-art schedule called
[1-Cycle](https://arxiv.org/abs/1803.09820) to help data scientists
effectively use larger batch sizes to train their models in PyTorch.
## Prerequisites
To use the 1-cycle schedule for model training, you should satisfy these two requirements:
1. Integrate DeepSpeed into your training script using the [Getting
Started](/getting-started/) guide.
2. Add the parameters to configure a 1-Cycle schedule to the parameters of your
model. We will define the 1-Cycle parameters below.
## Overview
The 1-cycle schedule operates in two phases, a cycle phase and a decay phase,
which span one iteration over the training data. For concreteness, we will
review how the 1-cycle schedule of the learning rate works. In the cycle phase,
the learning rate oscillates between a minimum value and a maximum value over a
number of training steps. In the decay phase, the learning rate decays starting
from the minimum value of the cycle phase. An example of 1-cycle learning rate
schedule during model training is illustrated below.
![1cycle_lr](/assets/images/1cycle_lr.png)
### 1-Cycle Parameters
The 1-Cycle schedule is defined by a number of parameters which allow users
to explore different configurations. The literature recommends concurrent tuning
of learning rate and momentum because they are correlated hyperparameters. We
have leveraged this recommendation to reduce the configuration burden by organizing
the 1-cycle parameters into two groups:
1. Global parameters for configuring the cycle and decay phase
2. Local parameters for configuring learning rate and momentum
The global parameters for configuring the 1-cycle phases are:
1. `cycle_first_step_size`: The count of training steps to complete the first step of the cycle phase
2. `cycle_first_stair_count`: The count of updates (or stairs) in the first step of the cycle phase
3. `cycle_second_step_size`: The count of training steps to complete the second step of the cycle phase
4. `cycle_second_stair_count`: The count of updates (or stairs) in the second step of the cycle phase
5. `post_cycle_decay_step_size`: The interval, in training steps, at which to decay the hyperparameter in the decay phase
The local parameters for the hyperparameters are:
**Learning rate**:
1. `cycle_min_lr`: minimum learning rate in cycle phase
2. `cycle_max_lr`: maximum learning rate in cycle phase
3. `decay_lr_rate`: decay rate for learning rate in decay phase
Although appropriate `cycle_min_lr` and `cycle_max_lr` values can be
selected based on experience or expertise, we recommend using the [learning rate
range test](/tutorials/lrrt/) feature of DeepSpeed to configure them.
**Momentum**
1. `cycle_min_mom`: minimum momentum in cycle phase
2. `cycle_max_mom`: maximum momentum in cycle phase
3. `decay_mom_rate`: decay rate for momentum in decay phase
## Required Model Configuration Changes
To illustrate the required model configuration changes to use 1-Cycle schedule
in model training, we will use a schedule with the following properties:
1. A symmetric cycle phase, where each half of the cycle spans the same number
of training steps. For this example, it will take 1000 training steps for the
learning rate to increase from 0.0001 to 0.0010 (10X scale), and then to
decrease back to 0.0001. The momentum will correspondingly cycle between 0.85
and 0.99 in a similar number of steps.
2. A decay phase, where learning rate decays by 0.001 every 1000 steps, while
momentum is not decayed.
Note that these parameters are processed by DeepSpeed as session parameters,
and so should be added to the appropriate section of the model configuration.
### **PyTorch model**
PyTorch versions 1.0.1 and newer provide a feature for implementing schedulers
for hyper-parameters, called [learning rate
schedulers](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html).
We have implemented the 1-Cycle schedule using this feature. You will add a
scheduler entry of type **"OneCycle"** as illustrated below.
```json
"scheduler": {
"type": "OneCycle",
"params": {
"cycle_first_step_size": 1000,
"cycle_first_stair_count": 500,
"cycle_second_step_size": 1000,
"cycle_second_stair_count": 500,
"decay_step_size": 1000,
"cycle_min_lr": 0.0001,
"cycle_max_lr": 0.0010,
"decay_lr_rate": 0.001,
"cycle_min_mom": 0.85,
"cycle_max_mom": 0.99,
"decay_mom_rate": 0.0
}
},
```
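To see what this configuration implies, the sketch below traces the learning-rate trajectory of the schedule above (linear ramp up, ramp down, then decay). It is only an illustration of the schedule's shape under these assumptions; it is not DeepSpeed's `OneCycle` scheduler, and it ignores the stair counts and momentum cycling.
```python
# Illustrative trace of the 1-Cycle learning-rate shape configured above.
# Not DeepSpeed's OneCycle implementation; stair counts and momentum are ignored.
def one_cycle_lr(step, cycle_min_lr=0.0001, cycle_max_lr=0.0010,
                 cycle_first_step_size=1000, cycle_second_step_size=1000,
                 decay_step_size=1000, decay_lr_rate=0.001):
    if step <= cycle_first_step_size:                 # first half: ramp up
        frac = step / cycle_first_step_size
        return cycle_min_lr + frac * (cycle_max_lr - cycle_min_lr)
    cycle_end = cycle_first_step_size + cycle_second_step_size
    if step <= cycle_end:                             # second half: ramp down
        frac = (step - cycle_first_step_size) / cycle_second_step_size
        return cycle_max_lr - frac * (cycle_max_lr - cycle_min_lr)
    # decay phase: shrink the learning rate every decay_step_size steps
    intervals = (step - cycle_end) // decay_step_size
    return cycle_min_lr * (1.0 - decay_lr_rate) ** intervals

for step in (0, 500, 1000, 1500, 2000, 5000):
    print(step, round(one_cycle_lr(step), 6))
```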
## Batch Scaling Example
As an example of how the 1-Cycle schedule can enable effective batch scaling, we
briefly share our experience with an internal model in Microsoft. In this case,
the model was well-tuned for fast convergence (in data samples) on a single
GPU, but was converging slowly to target performance (AUC) when training on 8
GPUs (8X batch size). The plot below shows model convergence with 8 GPUs for
these learning rate schedules:
1. **Fixed**: using an optimal fixed learning rate for 1-GPU training.
2. **LinearScale**: using a fixed learning rate that is 8X of **Fixed**.
3. **1Cycle**: using 1-Cycle schedule.
![model_convergence](/assets/images/model_convergence.png)
With **1Cycle**, the model converges faster than the other schedules to the
target AUC. In fact, **1Cycle** converges as fast as the optimal 1-GPU
training (not shown). For **Fixed**, convergence is about 5X slower (needs 5X
more data samples). With **LinearScale**, the model diverges because the
learning rate is too high. The plot below illustrates the schedules by
reporting the learning rate values during 8-GPU training.
![lr_schedule](/assets/images/lr_schedule.png)
We see that the learning rate for **1Cycle** is always larger than **Fixed**
and is briefly larger than **LinearScale** to achieve faster convergence. Also
**1Cycle** lowers the learning rate later during training to avoid model
divergence, in contrast to **LinearScale**. In summary, by configuring an
appropriate 1-Cycle schedule we were able to effectively scale the training batch
size for this model by 8X without loss of convergence speed.
---
title: "Getting Started with DeepSpeed on Azure"
---
This tutorial will help you get started running DeepSpeed on [Azure virtual
machines](https://azure.microsoft.com/en-us/services/virtual-machines/).
Looking forward, we will be integrating these techniques and additional enhancements
into the [Azure ML](https://azure.microsoft.com/en-us/services/machine-learning/) platform to
benefit all your large model training jobs.
If you don't already have an Azure account please see more details here: [https://azure.microsoft.com/](https://azure.microsoft.com/).
To help with launching Azure instances we suggest using the [Azure
CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
several helper scripts to get you quickly started using DeepSpeed with Azure.
* Install Azure CLI on your local box: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
* Alternatively you can use the Azure in-browser shell: https://shell.azure.com/
## Create an SSH key
Generate an SSH key that will be used across this tutorial to SSH into your VMs and
between Docker containers. `ssh-keygen` is the recommended way of doing this. Our scripts
assume your key is located inside the same directory as the Azure scripts.
## Azure Config JSON
Our helper scripts depend on the following configuration JSON for deployment
and setup. We have provided a simple example JSON in `azure_config.json` that
sets up a basic environment with two VMs. This config uses the NV6_Promo
instance type which has one NVIDIA Tesla M60 GPU per VM. You can read more
details about the VM on the [Linux Virtual Machines
Pricing](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)
page.
See the example below:
```json
{
"num_vms": 2,
"location": "southcentralus",
"azure_sku": "Standard_NV6_Promo",
"ssh_private_key": "id_rsa",
"docker_ssh_port": 2222
}
```
## Dependencies
The scripts in this tutorial require [jq](https://stedolan.github.io/jq/) to help with
parsing JSON from the command line. Also it is recommended to install
[pdsh](https://linux.die.net/man/1/pdsh) to help launch ssh connections in parallel.
## Create Azure VMs
We first need to allocate the VMs. We provide a script
```bash
./create_vms.sh
```
to create VMs with the Azure SKU in the region specified in `azure_config.json`. Feel
free to customize your JSON to your desired region/SKU. This step will take a few minutes
to complete while it sets up all of your VMs on Azure.
## Setup VM environment to use DeepSpeed
Next, we need to configure the VM environment for DeepSpeed. We provide a script
```bash
./setup_vms.sh
```
to generate a [hostfile](/getting-started/#resource-configuration-multi-node) and SSH
configuration on all of the VMs. This configuration will be used by the DeepSpeed
Docker containers in the next step.
## Start the DeepSpeed docker container
We now set up the DeepSpeed Docker containers on the VMs. We provide a script
```bash
./setup_docker.sh
```
to pull the DeepSpeed image onto all VMs and start a container instance in the
background. This will take several minutes since it needs to pull the entire Docker
image.
## Access VMs
The tool `azure_ssh.sh` will let you SSH into any of the VMs with this
syntax:
```bash
./azure_ssh.sh <node-id> [command]
```
where the `node-id` is a number between `0` and `num_vms-1`. This script will find the
public IP address of your VM and use the SSH key provided in the Azure configuration
JSON.
## Access DeepSpeed container
Everything should be up and running at this point. Let's access the running DeepSpeed
container on the first VM and make sure we can talk to the other containers in our deployment.
* SSH into the first VM via: `./azure_ssh.sh 0`
* Change directories into the azure folder of this repo via: `cd ~/workdir/DeepSpeed/azure`
* Attach the running docker container via: `./attach.sh`
* You should now be able to `ssh` into any other docker container; the containers can be
accessed via their SSH alias of `worker-N`, where `N` is the VM number between `0`
and `num_vms-1`. In this example we should be able to successfully run `ssh worker-1
hostname` which will return the hostname of worker-1.
## Parallel SSH across containers
DeepSpeed comes installed with a helper script `ds_ssh` which is a wrapper around
the [pdsh](https://linux.die.net/man/1/pdsh) command that lets you issue commands
to groups of hosts (via SSH) in parallel. This wrapper simply connects with the
hostfile that defines all the containers in your deployment. For example if you run
`ds_ssh hostname` you should see a list of all the hostnames in your deployment.
## Run CIFAR-10 example model
We will now run the DeepSpeed CIFAR-10 model example to test the VM setup. From inside
the first DeepSpeed container:
1) Install the python dependencies necessary to run the CIFAR-10 example model. You can
do this across your cluster via:
```bash
ds_ssh pip install -r ~/workdir/DeepSpeed/DeepSpeedExamples/cifar/requirements.txt
```
2) Now change directories to the CIFAR example:
```bash
cd ~/workdir/DeepSpeed/DeepSpeedExamples/cifar
```
3) Finally, launch training across all VMs:
```bash
deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
```
## Megatron-LM GPT2
DeepSpeed includes an example model using Megatron-LM's GPT2. Please refer to the full
[Megatron tutorial](/tutorials/megatron/) for more details.
* In order to fully train GPT2 with DeepSpeed and ZeRO we recommend using 8 instances of
Azure's Standard_ND40rs_v2 SKU for a total of 64 NVIDIA V100 GPUs. With this setup and
a batch size of 1536 you should be able to complete 100k training steps (153.6 million
samples) in less than 2 weeks of training.
...@@ -6,9 +6,10 @@ excerpt: "First steps with DeepSpeed"
## Installation
* Please see our [Azure tutorial](/tutorials/azure/) to get started with DeepSpeed on Azure!
* If you're not on Azure, we recommend using our docker image via `docker pull deepspeed/deepspeed:latest` which contains a pre-installed version of DeepSpeed and all the necessary dependencies.
* If you want to install DeepSpeed manually, we provide an install script `install.sh` to help install on a local machine or across an entire cluster.
## Writing DeepSpeed Models
DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
...@@ -114,8 +115,8 @@ the `step` value is stored as part of the `client_sd`.
## DeepSpeed Configuration
DeepSpeed features can be enabled, disabled, or configured using a config JSON
file that should be specified as `args.deepspeed_config`. A sample config file
is shown below. For a full set of features see the
[configuration documentation](/docs/config_json/).
```json
{
---
title: "Learning Rate Range Test"
---
This tutorial shows how to perform learning rate range tests in PyTorch.
## Learning Rate Range Test (LRRT)
Learning rate range test ([LRRT](https://arxiv.org/abs/1803.09820)) is a
method for discovering the largest learning rate values that can be used to
train a model without divergence. Data scientists are often interested in this
information because large learning rates lead to faster model convergence than
small learning rates. Moreover, large learning rates are crucial in learning
rate schedules such as [CLR](https://arxiv.org/abs/1506.01186) and
[1Cycle](https://arxiv.org/abs/1803.09820), which are used to train effectively
with large batch sizes. DeepSpeed provides LRRT for model training in PyTorch
frameworks.
## Prerequisites
To use DeepSpeed's LRRT, you must satisfy the following two conditions:
1. Integrate DeepSpeed into your training script using the [Getting
Started](/getting-started/) guide.
2. Add the parameters to configure LRRT to the parameters of your model. The
LRRT parameters are defined below.
## LRRT Parameters
LRRT works by linearly increasing the learning rate by a predefined amount, at
predefined intervals. Thus, LRRT is a form of learning rate schedule because it
defines how and when the learning rate should change during model training. To
configure LRRT, you will need to set these parameters:
1. `lr_range_test_min_lr` : The initial learning rate for training `(float)`
2. `lr_range_test_step_size`: The interval for scaling up learning rate,
defined in training steps `(integer)`
3. `lr_range_test_step_rate`: The scaling factor for increasing learning rate
`(float)`
4. `lr_range_test_staircase`: If true, learning rate is changed every
`lr_range_test_step_size` training steps, otherwise learning rate is changed at
every training step `(boolean)`
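For intuition, the sketch below shows one way to read these four parameters as a linear ramp; it only illustrates their roles, and the exact formula used by DeepSpeed's `LRRangeTest` scheduler may differ.
```python
# Illustrative linear ramp built from the LRRT parameters described above.
# This shows the role of each parameter; DeepSpeed's exact formula may differ.
def lrrt_lr(step, lr_range_test_min_lr=0.0001, lr_range_test_step_size=200,
            lr_range_test_step_rate=5.0, lr_range_test_staircase=False):
    if lr_range_test_staircase:
        interval = step // lr_range_test_step_size   # change only at interval boundaries
    else:
        interval = step / lr_range_test_step_size    # change at every training step
    return lr_range_test_min_lr * (1.0 + interval * lr_range_test_step_rate)

for step in (0, 100, 200, 400, 800):
    print(step, lrrt_lr(step))   # grows linearly from the minimum learning rate
```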
## Required Model Configuration Changes
We will illustrate the required model configuration changes with an example LRRT
schedule that:
1. Starts training with an initial learning rate of 0.0001
2. Uses a scaling rate of 5
3. Uses a scaling interval of 200 training steps
4. Scales learning rate at every training step, i.e., does not use staircase
### PyTorch
For PyTorch models, LRRT is implemented as a [learning rate
scheduler](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html),
a feature that is available in PyTorch versions 1.0.1 and newer. Thus, you can
add a `"scheduler"` entry of type `"LRRangeTest"` into your model configuration
as illustrated below:
```json
"scheduler": {
"type": "LRRangeTest",
"params": {
"lr_range_test_min_lr": 0.0001,
"lr_range_test_step_size": 200,
"lr_range_test_step_rate": 5,
"lr_range_test_staircase": false
}
}
```
## Example: Tuning for Large Batch Sizes
We illustrate how LRRT can benefit data scientists with a snippet of our
experience of tuning an internal production model to converge efficiently on
larger batch sizes, as we scaled from one GPU (batch size 512) to four GPUs
(batch size 2048). Our goal was to train the model with the larger batch size
to match the performance of the smaller batch size using the same amount of
data samples. The challenge here is the well known problem of slow convergence
of large batch size training. Our approach was to use a
[1Cycle](/tutorials/1Cycle/) schedule in DeepSpeed to tackle
this problem, and we used LRRT to configure the schedule.
In the plots below, we illustrate using LRRT to discover the maximum learning
rates for effective training with batch size 2048. The plot on the left shows
the impact of large learning rates on validation loss over the first 9000
batches of training. The plot on the right shows the learning rate values
during the same period of training. Using grid search we discover that the
best fixed learning rate for the batch size 2048 is 0.0002. The blue line
(`lr=0.0002`) represents training with this fixed learning rate. We compare the
two LRRT schedules with this fixed learning rate. The orange
(`lr_range_test_step_rate=5`) and gray (`lr_range_test_step_rate=50`) lines
represent training with similar LRRT schedules that differ only in
`lr_range_test_step_rate` values. Although the LRRT schedules start from the
same base learning rate, the gray line's learning rate grows about 10 times
faster than the orange line. Also, the learning rates of the LRRT schedules had
grown larger than that of the blue line in the presented data points. We
subsequently refer to the gray line as "fast growing", and the orange line as
"slow growing" LRRT schedules respectively.
![validation_loss](/assets/images/loss_and_lr.png)
We make the following observations from this small example.
1. Larger learning rates clearly benefit model performance, up to some point.
The fast growing LRRT schedule achieves validation loss of 0.46 after 3000
batches, which the fixed learning rate does not achieve with 9000 batches. The
slow growing LRRT does not match that score until after 6000 batches; however,
it maintains an increasing performance advantage over the fixed learning rate.
2. There is an upper bound on learning rate values that are useful for training
the model. The fast growing LRRT schedule hits this boundary quickly and
diverges, while the slow growing LRRT will later diverge for the same reason.
LRRT helped us discover these boundaries quickly, using less than 2% of the
training data. These boundaries are useful information for constructing
learning rate schedules.
These observations from LRRT helped us to configure the learning rate
boundaries and the cycle span for a 1Cycle schedule that solves the problem, as
shown below.
```json
"OneCycle": {
"cycle_min_lr": 0.002,
"cycle_max_lr": 0.005,
"cycle_first_step_size": 2000,
"cycle_second_step_size": 2000,
...
}
```
In our experience, these are the four most critical parameters of 1Cycle schedules.
1. We chose to use the slower LRRT schedule (`lr_range_test_step_rate=5`) to
set `cycle_min_lr` because it achieves the best loss and the faster schedule
diverges fairly quickly.
2. We set `cycle_max_lr` to 0.005 even though the plot shows that performance
was still improving at a slightly higher learning rate. This is because we
observed that if we wait till the maximum learning rate, the model could be at
the point of divergence and impossible to recover.
3. Since it takes 8000 batches for the learning rate to become 0.005, we set
`cycle_first_step_size` (and `cycle_second_step_size`) to 2000, which is the
number of steps it takes for four GPUs to process 8000 batches.
We hope this brief example sparks your imagination on using LRRT for your own
unique tuning challenges.
---
title: "Megatron-LM GPT2"
---
If you haven't already, we advise you to first read through the [Getting
Started](/getting-started/) guide before stepping through this tutorial.
In this tutorial we will be adding DeepSpeed to the Megatron-LM GPT2 model, which
is a large, powerful transformer. Megatron-LM supports model-parallel and multi-node
training. Please see the corresponding paper for more details: [Megatron-LM:
Training Multi-Billion Parameter Language Models Using Model
Parallelism](https://arxiv.org/abs/1909.08053).
First, we discuss data and environment setup and how to train the GPT-2 model with the
original Megatron-LM. Next, we proceed step-by-step in enabling this model to run with
DeepSpeed. Finally, we demonstrate the **_performance gains_**, and **_memory footprint
reduction_** from using DeepSpeed.
## Training GPT-2 with the Original Megatron-LM
The original model code is from
[Megatron-LM](https://github.com/NVIDIA/Megatron-LM). We've copied this repo
under
[DeepSpeedExamples/Megatron-LM/](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM)
and made it available as a submodule. To download, execute:
```bash
git submodule update --init --recursive
```
### Training Data Setup
* Follow Megatron's [instructions](https://github.com/NVIDIA/Megatron-LM#collecting-gpt2-webtext-data)
to download the webtext data and place a symbolic link under `DeepSpeedExamples/Megatron-LM/data`:
### Running Unmodified Megatron-LM GPT2 model
* For a single GPU run:
- change `scripts/pretrain_gpt2.sh`, set its `--train-data` argument as `"webtext"`.
- run `bash scripts/pretrain_gpt2.sh`
* For multiple GPUs and/or nodes run:
- change `scripts/pretrain_gpt2_model_parallel.sh`
- set its `--train-data` argument as `"webtext"`
    - `GPUS_PER_NODE` indicates how many GPUs per node are involved in the testing
    - `NNODES` indicates how many nodes are involved in the testing
- run `bash scripts/pretrain_gpt2_model_parallel.sh`
## Enabling DeepSpeed
To use DeepSpeed we will modify three files:
* `arguments.py` : Arguments configurations
* `pretrain_gpt2.py` : Main entry point for training
* `utils.py` : Checkpoints saving and loading utilities
### Argument Parsing
The first step in applying DeepSpeed is adding DeepSpeed arguments to the
Megatron-LM GPT2 model, using `deepspeed.add_config_arguments()` in
`arguments.py`.
```python
def get_args():
"""Parse all the args."""
parser = argparse.ArgumentParser(description='PyTorch BERT Model')
parser = add_model_config_args(parser)
parser = add_fp16_config_args(parser)
parser = add_training_args(parser)
parser = add_evaluation_args(parser)
parser = add_text_generate_args(parser)
parser = add_data_args(parser)
# Include DeepSpeed configuration arguments
parser = deepspeed.add_config_arguments(parser)
```
### Initialization and Training
We modify `pretrain_gpt2.py` to enable training with DeepSpeed.
#### Initialization
We use `deepspeed.initialize` to create `model_engine`, `optimizer` and LR
`scheduler`. Below is its definition:
```python
def initialize(args,
model,
optimizer=None,
model_parameters=None,
training_data=None,
lr_scheduler=None,
mpu=None,
dist_init_required=True,
collate_fn=None):
```
For the Megatron-LM GPT2 model, we initialize DeepSpeed in its
`setup_model_and_optimizer()` function as below, to pass the raw `model`,
`optimizer`, `args`, `lr_scheduler` and `mpu`.
```python
def setup_model_and_optimizer(args):
"""Setup model and optimizer."""
model = get_model(args)
optimizer = get_optimizer(model, args)
lr_scheduler = get_learning_rate_scheduler(optimizer, args)
if args.deepspeed:
import deepspeed
print_rank_0("DeepSpeed is enabled.")
model, optimizer, _, lr_scheduler = deepspeed.initialize(
model=model,
optimizer=optimizer,
args=args,
lr_scheduler=lr_scheduler,
mpu=mpu,
dist_init_required=False
)
```
Note that when FP16 is enabled, Megatron-LM GPT2 adds a wrapper to the `Adam`
optimizer. DeepSpeed has its own FP16 Optimizer, so we need to pass the `Adam`
optimizer to DeepSpeed directly without any wrapper. We return the unwrapped
Adam optimizer from `get_optimizer()` when DeepSpeed is enabled.
```python
def get_optimizer(model, args):
"""Setup the optimizer."""
......
# Use Adam.
optimizer = Adam(param_groups,
lr=args.lr, weight_decay=args.weight_decay)
if args.deepspeed:
# fp16 wrapper is not required for DeepSpeed.
return optimizer
```
#### Using the Training API
The `model` returned by `deepspeed.initialize` is the _DeepSpeed Model Engine_
that we will use to train the model using the forward, backward and step API.
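For orientation, here is a minimal, self-contained sketch of that forward/backward/step pattern using a toy model; the tiny model, random data, and argument handling are illustrative placeholders and are not part of the Megatron-LM integration.
```python
import argparse
import torch
import deepspeed

# Toy example of the DeepSpeed engine's forward/backward/step training API.
# Launch with the deepspeed launcher and a --deepspeed_config JSON file.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

model = torch.nn.Linear(10, 1)                      # placeholder model
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

for step in range(10):
    x = torch.randn(4, 10, device=model_engine.device)   # placeholder batch
    y = torch.randn(4, 1, device=model_engine.device)
    loss = torch.nn.functional.mse_loss(model_engine(x), y)  # forward pass
    model_engine.backward(loss)                     # backward handled by the engine
    model_engine.step()                             # optimizer and LR scheduler update
```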
##### Forward Propagation
The forward propagation API is compatible with PyTorch, and no change is required.
##### Backward Propagation
Backward propagation is done by calling `backward(loss)` directly on the model engine.
```python
def backward_step(optimizer, model, lm_loss, args, timers):
"""Backward step."""
# Total loss.
loss = lm_loss
# Backward pass.
if args.deepspeed:
model.backward(loss)
else:
optimizer.zero_grad()
if args.fp16:
optimizer.backward(loss, update_master_grads=False)
else:
loss.backward()
```
Zeroing the gradients is handled automatically by DeepSpeed after the weights
have been updated using a mini-batch.
Furthermore, DeepSpeed addresses distributed data parallel and FP16 under the
hood, simplifying code in multiple places.
(A) DeepSpeed also performs gradient averaging automatically at the gradient
accumulation boundaries. So we skip the allreduce communication.
```python
if args.deepspeed:
# DeepSpeed backward propagation already addressed all reduce communication.
# Reset the timer to avoid breaking timer logs below.
timers('allreduce').reset()
else:
torch.distributed.all_reduce(reduced_losses.data)
reduced_losses.data = reduced_losses.data / args.world_size
if not USE_TORCH_DDP:
timers('allreduce').start()
model.allreduce_params(reduce_after=False,
fp32_allreduce=args.fp32_allreduce)
timers('allreduce').stop()
```
(B) We also skip updating master gradients, since DeepSpeed addresses it internally.
```python
# Update master gradients.
if not args.deepspeed:
if args.fp16:
optimizer.update_master_grads()
# Clipping gradients helps prevent the exploding gradient.
if args.clip_grad > 0:
if not args.fp16:
mpu.clip_grad_norm(model.parameters(), args.clip_grad)
else:
optimizer.clip_master_grads(args.clip_grad)
return lm_loss_reduced
```
##### Updating the Model Parameters
The `step()` function in DeepSpeed engine updates the model parameters as well
as the learning rate.
```python
if args.deepspeed:
model.step()
else:
optimizer.step()
# Update learning rate.
if not (args.fp16 and optimizer.overflow):
lr_scheduler.step()
else:
skipped_iter = 1
```
##### Loss Scaling
The GPT2 training script logs the loss scaling value during training. Inside
the DeepSpeed optimizer, this value is stored as `cur_scale` instead of
`loss_scale` as in Megatron's optimizer. Therefore, we appropriately replace it in
the logging string.
```python
if args.fp16:
log_string += ' loss scale {:.1f} |'.format(
optimizer.cur_scale if args.deepspeed else optimizer.loss_scale)
```
### Checkpoints Saving & Loading
The DeepSpeed engine has flexible APIs for checkpoint saving and loading that handle
the states of both the client model and its own internals.
```python
def save_checkpoint(self, save_dir, tag, client_state={})
def load_checkpoint(self, load_dir, tag)
```
Applying DeepSpeed requires updating `utils.py`, in which Megatron-LM GPT2 saves and
loads its checkpoints.
A new function `save_ds_checkpoint()` is created as below for DeepSpeed. It
collects the client model states and passes them to the DeepSpeed engine by calling
DeepSpeed's `save_checkpoint()`.
```python
def save_ds_checkpoint(iteration, model, args):
"""Save a model checkpoint."""
sd = {}
sd['iteration'] = iteration
# rng states.
if not args.no_save_rng:
sd['random_rng_state'] = random.getstate()
sd['np_rng_state'] = np.random.get_state()
sd['torch_rng_state'] = torch.get_rng_state()
sd['cuda_rng_state'] = torch.cuda.get_rng_state()
sd['rng_tracker_states'] = mpu.get_cuda_rng_tracker().get_states()
model.save_checkpoint(args.save, iteration, client_state = sd)
```
In the Megatron-LM GPT2 `save_checkpoint()` function, add the following lines to
invoke the above function for DeepSpeed.
```python
def save_checkpoint(iteration, model, optimizer,
lr_scheduler, args):
"""Save a model checkpoint."""
if args.deepspeed:
save_ds_checkpoint(iteration, model, args)
else:
......
```
In the `load_checkpoint()` function, use the DeepSpeed checkpoint loading API as below,
and return the states for the client model.
```python
def load_checkpoint(model, optimizer, lr_scheduler, args):
"""Load a model checkpoint."""
iteration, release = get_checkpoint_iteration(args)
if args.deepspeed:
checkpoint_name, sd = model.load_checkpoint(args.load, iteration)
if checkpoint_name is None:
if mpu.get_data_parallel_rank() == 0:
print("Unable to load checkpoint.")
return iteration
else:
......
```
### Train scripts
Assuming the webtext data was prepared in the previous step, execute the following
commands to start training the Megatron-LM GPT2 model with DeepSpeed applied.
- Single GPU run
- run `bash scripts/ds_pretrain_gpt2.sh`
- Multiple GPUs/Nodes run
- run `bash scripts/ds_pretrain_gpt2_model_parallel.sh`
## Performance Improvements
DeepSpeed enables training very large models effectively via the advanced [ZeRO
optimizer](https://arxiv.org/abs/1910.02054v2). ZeRO significantly reduces the memory
footprint for training large models, which means large models can be trained with (i) less
model parallelism and (ii) larger batch sizes. A lower model parallelism degree improves
training efficiency by increasing the granularity of computations such as matrix
multiplication, where performance is directly related to the size of the matrices.
Furthermore, less model parallelism also results in less communication between model
parallel GPUs, which further boosts performance. A larger batch size has a similar effect,
increasing computational granularity and reducing communication, which also results in
better performance. Therefore, DeepSpeed combines ZeRO-powered data parallelism with
Megatron-LM tensor-slicing model parallelism, which is
significantly faster than using Megatron-LM alone.

The observed performance improvements depend on several factors, such as the memory per
GPU, the local GPU interconnect (i.e., PCI-E vs. NVLink vs. NVSwitch), the model size,
and the inter-node network interconnect. Below, we show some of the performance improvements
from using DeepSpeed over Megatron on a 16-GPU low-bandwidth (40 Gbps) cluster and a 400-GPU DGX-2 high-bandwidth (800 Gbps) cluster.
For details, please see the [ZeRO paper](https://arxiv.org/abs/1910.02054v2). We also
present performance improvements on a 64-GPU cluster, along with a detailed configuration
analysis to show where the improvements come from.
![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.</em>
</p>
### On Low Bandwidth GPU Cluster
The figure above shows that training a 1.5B parameter model with DeepSpeed is
nearly 4x faster than without DeepSpeed on a cluster with 4 nodes, 4 GPUs per
node, and 16 GPUs in total. These GPUs have 16 GB of memory each, with PCI-E
connecting the GPUs within a node and 40 Gbps InfiniBand across nodes.

The performance improvement comes from the lower model parallelism degree and
larger batch size, as discussed earlier. Training a 1.5B parameter model with
Megatron-LM alone requires 4-way model parallelism and can only fit an effective
batch size of 32 using all 16 GPUs. In contrast, DeepSpeed does not
require any model parallelism to train this model and can support an
effective batch size of 128 without running out of memory, resulting in
significantly higher performance.
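To make the memory argument concrete, here is a rough back-of-envelope sketch (not from the tutorial) using the ZeRO paper's estimate of roughly 16 bytes of model states per parameter under mixed-precision Adam (fp16 parameters and gradients plus fp32 optimizer states). It only illustrates partitioning the optimizer states across the 16 data-parallel ranks and ignores activation memory, so it is an approximation rather than the exact configuration used in these experiments.
```python
# Back-of-envelope memory estimate for GPT2 1.5B model states (assumptions above).
GiB = 2**30
params = 1.5e9

fp16_params_and_grads = 4 * params   # 2 bytes of fp16 params + 2 bytes of fp16 grads
fp32_optimizer_states = 12 * params  # fp32 master params, momentum, and variance

# Plain data parallelism replicates all model states on every GPU.
replicated = (fp16_params_and_grads + fp32_optimizer_states) / GiB
print(f'replicated model states:       {replicated:.1f} GiB')   # ~22.4 GiB > 16 GB GPU

# ZeRO-style partitioning of optimizer states across 16 data-parallel ranks.
partitioned = (fp16_params_and_grads + fp32_optimizer_states / 16) / GiB
print(f'with partitioned optim states: {partitioned:.1f} GiB')  # ~6.6 GiB per GPU
```
Under this estimate, the replicated model states alone would not fit on a 16 GB GPU, while partitioning the optimizer states leaves room for activations and the larger batch size without any model parallelism.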
### On High Bandwidth DGX-2 GPU Cluster
Each GPU on the DGX-2 cluster has 32 GB of memory, and the GPUs inside a node are connected via
high-bandwidth NVSwitch. DGX-2 nodes are connected to each other via an 800 Gbps (8 x 100 Gbps)
InfiniBand interconnect. As such, running a 1.5B model on DGX-2 requires less model
parallelism, and the performance improvement from DeepSpeed for this model size is less
significant. However, at larger model sizes, Megatron still requires a significantly larger
model parallelism degree and can only run much smaller batch sizes than DeepSpeed.
Therefore, as model sizes get larger, DeepSpeed, by combining ZeRO with Megatron model parallelism, starts to significantly outperform
using Megatron-LM alone.
### Performance Improvements with Configuration Details
The figure below compares DeepSpeed with Megatron on a 64-GPU cluster with 4
DGX-2 nodes. To give readers a clear idea of the source of the performance
improvements, we also present configuration tables for both Megatron and
DeepSpeed. They show the smallest model parallelism degree and the largest batch
size that can be used to train these models without running out of memory. As
discussed above, the tables demonstrate that DeepSpeed runs with a smaller model
parallelism degree and achieves better performance.
![DeepSpeed Performance SpeedUp](/assets/images/megatron-gpt2-perf-test.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.</em>
</p>
**a) Megatron-LM GPT2 Baseline**
| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
| 1.5B | 2 | 32 | 64 | 512 | 48 | 1600 | 16 | 128.56 |
| 4B | 4 | 16 | 64 | 128 | 64 | 2304 | 16 | 49.36 |
| 8B | 4 | 16 | 64 | 128 | 72 | 3072 | 24 | 24.57 |
| 20B | 16 | 4 | 64 | 16 | 111 | 3808 | 32 | 3.42 |
**b) Megatron-LM GPT2 with DeepSpeed**
| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | -----------:| --------------: | ------------: |
| 1.5B | 1 | 64 | 64 | 2048 | 48 | 1600 | 16 | 151.35 |
| 4B | 1 | 64 | 64 | 512 | 64 | 2304 | 16 | 75.13 |
| 8B | 2 | 32 | 64 | 512 | 72 | 3072 | 24 | 43.52 |
| 20B | 4 | 16 | 64 | 128 | 111 | 3808 | 32 | 12.65 |
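Dividing the throughput column of table (b) by that of table (a) gives the end-to-end speedup from DeepSpeed at each model size; the snippet below simply reproduces that arithmetic from the numbers reported above.
```python
# Throughput (samples/sec) copied from tables (a) and (b) above.
megatron_baseline = {'1.5B': 128.56, '4B': 49.36, '8B': 24.57, '20B': 3.42}
with_deepspeed    = {'1.5B': 151.35, '4B': 75.13, '8B': 43.52, '20B': 12.65}

for size, baseline in megatron_baseline.items():
    print(f'{size}: {with_deepspeed[size] / baseline:.2f}x')
# 1.5B: 1.18x, 4B: 1.52x, 8B: 1.77x, 20B: 3.70x
```
The relative benefit grows with model size, from roughly 1.2x at 1.5B parameters to roughly 3.7x at 20B.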
...@@ -31,8 +31,7 @@ ...@@ -31,8 +31,7 @@
border-radius: $border-radius; border-radius: $border-radius;
-webkit-box-shadow: $box-shadow; -webkit-box-shadow: $box-shadow;
box-shadow: $box-shadow; box-shadow: $box-shadow;
position: fixed; //position: fixed;
.nav__title { .nav__title {
color: #fff; color: #fff;
font-size: $type-size-6; font-size: $type-size-6;
......
---
title: "Contributing"
permalink: /contributing/
---
DeepSpeed welcomes your contributions!
## Prerequisites
DeepSpeed uses [pre-commit](https://pre-commit.com/) to ensure that formatting is
consistent across DeepSpeed. First, ensure that `pre-commit` is installed, either by
installing DeepSpeed or via `pip install pre-commit`. Next, the pre-commit hooks must be
installed once before commits can be made:
```bash
pre-commit install
```
Afterwards, our suite of formatting tests runs automatically before each `git commit`. You
can also run them manually:
```bash
pre-commit run --all-files
```
If a formatting test fails, it will fix the modified code in place and abort
the `git commit`. After looking over the changes, you can `git add <modified files>`
and then repeat the previous `git commit` command.
## Testing
DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests.
The model convergence tests train
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) and measure
end-to-end convergence and related metrics. Unit tests are found in `tests/unit/` and
the model convergence tests are found in `tests/model/`.
### Unit Tests
[PyTest](https://docs.pytest.org/en/latest/) is used to execute tests. PyTest can be
installed from PyPI via `pip install pytest`. Simply invoke `pytest --forked` to run the
unit tests:
```bash
pytest --forked tests/unit/
```
You can also provide the `-v` flag to `pytest` to see additional information about the
tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the
`--forked` flag are required to test CUDA functionality in distributed tests.
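For illustration only, a unit test in this style might look like the following sketch; the file name and test are hypothetical and not taken from `tests/unit/`. Running it under `pytest --forked` gives each test its own process, so CUDA initialization does not leak between tests.
```python
# tests/unit/test_example.py -- hypothetical example, not an actual DeepSpeed test.
import pytest
import torch


def test_cuda_tensor_sum():
    # Under `pytest --forked`, this test runs in its own process, keeping
    # CUDA context state isolated from other tests.
    if not torch.cuda.is_available():
        pytest.skip("CUDA is required for this test")
    x = torch.ones(4, device="cuda")
    assert x.sum().item() == 4.0
```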
### Model Tests
Model tests require four GPUs and training data downloaded for
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/).
To execute model tests, first [install DeepSpeed](#installation). The
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
as part of this process. Next, execute the model test driver:
```bash
cd tests/model/
pytest run_sanity_check.py
```
Note that the `--forked` flag is not necessary for the model tests.
## Contributor License Agreement
This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
actually do, grant us the rights to use your contribution. For details, visit
https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
follow the instructions provided by the bot. You will only need to do this once across
all repos using our CLA.
## Code of Conduct
This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
comments.
...@@ -71,7 +71,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example: ...@@ -71,7 +71,7 @@ optimizations on advanced hyperparameter tuning and optimizers. For example:
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than state-of-art, NVIDIA * DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than state-of-art, NVIDIA
Megatron on Azure GPUs. Megatron on Azure GPUs.
*Read more*: [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md) *Read more*: [GPT tutorial](/tutorials/megatron/)
...@@ -105,8 +105,7 @@ combination. ZeRO boosts the scaling capability and efficiency further. ...@@ -105,8 +105,7 @@ combination. ZeRO boosts the scaling capability and efficiency further.
significant performance gains compared to using model parallelism alone. significant performance gains compared to using model parallelism alone.
*Read more*: [technical report](https://arxiv.org/abs/1910.02054), *Read more*: [technical report](https://arxiv.org/abs/1910.02054),
and [GPT tutorial](./docs/tutorials/MegatronGPT2Tutorial.md). and [GPT tutorial](/tutorials/megatron).
<!-- and [QANet tutorial](../../Tutorials/QANetTutorial.md). -->
![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png) ![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png)
<p align="center"> <p align="center">
...@@ -120,13 +119,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the ...@@ -120,13 +119,7 @@ optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to effectiveness of model training and reduce the number of samples required to
convergence to desired accuracy. convergence to desired accuracy.
*Read more*: [Tuning tutorial](./docs/tutorials/1Cycle.md), *Read more*: [Tuning tutorial](/tutorials/1Cycle).
<!---
and *BERT Tutorial*: Coming Soon.
[BERT tutorial](../../Tutorials/BingBertSquad/BingBertSquadTutorial.md),
[QANet tutorial](../../Tutorials/QANet/QANetTutorial.md)
-->
## Good Usability ## Good Usability
...@@ -165,24 +158,9 @@ overview](features) for descriptions and usage. ...@@ -165,24 +158,9 @@ overview](features) for descriptions and usage.
* [Performance Analysis and Debugging](features.md#performance-analysis-and-debugging) * [Performance Analysis and Debugging](features.md#performance-analysis-and-debugging)
# Further Reading
| Article | Description |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](features.md) | DeepSpeed features |
| [DeepSpeed JSON Configuration](config_json.md) | Configuring DeepSpeed |
| [API Documentation](/code-docs/) | Generated DeepSpeed API documentation |
| [CIFAR-10 Tutorial](./docs/tutorials/CIFAR-10.md) | Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](./docs/tutorials/MegatronGPT2Tutorial.md) | Train GPT2 with DeepSpeed and Megatron-LM |
| [Learning Rate Range Test Tutorial](./docs/tutorials/lrrt.md) | Faster training with large learning rates |
| [1Cycle Tutorial](./docs/tutorials/1Cycle.md) | SOTA learning schedule in DeepSpeed |
# Contributing # Contributing
DeepSpeed welcomes your contributions! Please see our DeepSpeed welcomes your contributions! Please see our
[contributing](CONTRIBUTING.md) guide for more details on formatting, testing, [contributing](/contributing/) guide for more details on formatting, testing,
etc. etc.
## Contributor License Agreement ## Contributor License Agreement
......