Unverified Commit 093f09ff authored by Ammar Ahmad Awan, committed by GitHub

Update documentation for 1-bit Adam (#388)


Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
parent 65c2f974
@@ -66,6 +66,8 @@ lnav:
    url: /tutorials/lrrt/
  - title: "DeepSpeed Sparse Attention"
    url: /tutorials/sparse-attention/
  - title: "DeepSpeed with 1-bit Adam"
    url: /tutorials/onebit-adam/
  - title: "Pipeline Parallelism"
    url: /tutorials/pipeline/
  - title: "Contributing"
...
@@ -34,7 +34,7 @@ title: "DeepSpeed Configuration JSON"
| Fields | Value | Example |
| ------ | ------------------------------------------------------------ | ------------------------------ |
| type | The optimizer name. DeepSpeed natively supports Adam, OneBitAdam, and LAMB optimizers and will import other optimizers from [torch](https://pytorch.org/docs/stable/optim.html). | `"Adam"` |
| params | Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam)). | `{"lr": 0.001, "eps": 1e-8}` |

Example of ***optimizer***
@@ -53,6 +53,24 @@ title: "DeepSpeed Configuration JSON"
  }
}
```
Another example of ***optimizer*** with 1-bit Adam-specific parameters is as follows.
```json
"optimizer": {
"type": "OneBitAdam",
"params": {
"lr": 0.001,
"betas": [
0.8,
0.999
],
"eps": 1e-8,
"weight_decay": 3e-7,
"freeze_step": 400,
"cuda_aware": true
}
}
```
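For context, an optimizer block like the one above is not instantiated by hand; DeepSpeed reads it from the JSON config when the engine is created. Below is a minimal sketch of that wiring, assuming the config above is saved as `ds_onebitadam_config.json`; the model and file name are placeholders for illustration.

```python
import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config, ...
# e.g. launched as: deepspeed train.py --deepspeed --deepspeed_config ds_onebitadam_config.json
args = parser.parse_args()

model = torch.nn.Linear(1024, 1024)  # placeholder model

# DeepSpeed reads the "optimizer" block (here OneBitAdam) from the JSON config
# and constructs the optimizer internally.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)
```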
### Scheduler Parameters
...
@@ -158,6 +158,16 @@ Please see the [core API doc](https://deepspeed.readthedocs.io/) for more detail
## Training Optimizers
### 1-bit Adam optimizer with up to 5x less communication
DeepSpeed has an efficient implementation of a novel algorithm called 1-bit Adam.
It offers the same convergence as Adam, incurs up to 5x less communication, and enables
up to 3.5x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput
for SQuAD fine-tuning on bandwidth-limited clusters. For more details on usage and performance,
please refer to the detailed [tutorial](https://www.deepspeed.ai/tutorials/onebit-adam) and
[blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html), respectively.
<!-- **TODO: add paper link when it is ready ** -->
### Fused Adam optimizer and arbitrary torch.optim.Optimizer
With DeepSpeed, the user can choose to use a high performance implementation of ADAM from
NVIDIA, or any training optimizer that extends torch's `torch.optim.Optimizer` class.
...
---
layout: single
title: "DeepSpeed with 1-bit Adam: 5x less communication and 3.4x faster training"
excerpt: ""
categories: news
new_post: false
date: 2020-09-09 00:00:00
---
## 1. Introduction
Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth.
Communication compression is an important technique to reduce training time on such systems. One of the most effective ways to compress communication is via error compensation compression, which offers robust convergence speed, even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like Stochastic Gradient Descent (SGD) and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offers state-of-the-art convergence efficiency and accuracy for many tasks, including training of BERT-like models.
For a powerful optimizer like Adam, the non-linear dependence on the gradient (in the variance term) makes it challenging to develop error compensation-based compression techniques, limiting the practical value of state-of-the-art communication compression.
### 1.1 Background: Classic compression techniques
One way to compress communication is 1-bit compression, which can be expressed as:
<img src="https://render.githubusercontent.com/render/math?math=x%5Cto%20%5Cfrac%7B%5C%7Cx%5C%7C%7D%7B%5C%7CSign(x)%5C%7C%7DSign(x)">
With this compression, we achieve a 32x reduction in size by representing each number with a single bit. The problem is that this straightforward method significantly degrades convergence speed, which makes it inapplicable in practice. To solve this problem, recent studies show that by using error compensation compression, we can expect almost the same convergence rate as without compression.
The idea of error compensation can be summarized as: 1) doing compression, 2) memorizing the compression error, and then 3) adding the compression error back in during the next iteration. For SGD, applying error compensation leads to:
<img src="https://render.githubusercontent.com/render/math?math=x_t%3D%20x_%7Bt-1%7D%20-%20%5Cgamma%20C(g_t%20%2B%20e_%7Bt-1%7D)%2C%20%5Cquad%20e_t%20%3D%20g_t%2Be_%7Bt-1%7D-C(g_t%2Be_%7Bt-1%7D%20)">
where C(⋅) is the 1-bit compression operator. The benefit of this error compensation is that the historical compression errors (e_t and e_{t-1}) eventually cancel out, which can be seen from:
<img src="https://render.githubusercontent.com/render/math?math=x_t%3Dx_%7Bt-1%7D-%5Cgamma(g_t%2Be_%7Bt-1%7D-e_t%20)">
This strategy has been proven to work for optimization algorithms that are linearly dependent on the gradient, such as SGD and Momentum SGD.
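To make these two pieces concrete, below is a minimal single-worker sketch in PyTorch of the 1-bit compressor C(·) and the error-compensated SGD step above. The L1-norm scaling and the function names are illustrative assumptions, not a specific library's API.

```python
import torch

def one_bit_compress(x: torch.Tensor) -> torch.Tensor:
    """C(x): keep only the sign of each element, rescaled so the compressed
    tensor has a comparable magnitude (L1-norm scaling assumed here)."""
    sign = torch.sign(x)
    sign[sign == 0] = 1.0                      # make every entry +/-1
    return (x.norm(p=1) / sign.norm(p=1)) * sign

def error_compensated_sgd_step(param, grad, error, lr):
    """x_t = x_{t-1} - lr * C(g_t + e_{t-1}),  e_t = (g_t + e_{t-1}) - C(g_t + e_{t-1})."""
    corrected = grad + error                   # add back last step's compression error
    compressed = one_bit_compress(corrected)   # 1-bit compression
    new_error = corrected - compressed         # remember what compression dropped
    return param - lr * compressed, new_error

# Toy usage: one parameter tensor and a few noisy gradient steps.
param = torch.zeros(8)
error = torch.zeros_like(param)
for _ in range(5):
    grad = torch.randn(8)
    param, error = error_compensated_sgd_step(param, grad, error, lr=0.1)
```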
### 1.2 Challenges in applying error-compensation to Adam
We provide an overview of the Adam algorithm below. The update rules are as follows.
<img src="https://render.githubusercontent.com/render/math?math=m_%7Bt%2B1%7D%3D%5Cbeta_1%20m_t%2B(1-%5Cbeta_1%20)%20g_t">
<img src="https://render.githubusercontent.com/render/math?math=v_%7Bt%2B1%7D%3D%5Cbeta_2%20v_t%2B(1-%5Cbeta_2%20)%20(g_t%20)%5E2">
<img src="https://render.githubusercontent.com/render/math?math=x_%7Bt%2B1%7D%3Dx_t-%5Cgamma%20%5Cfrac%7Bm_%7Bt%2B1%7D%7D%7B%5Csqrt%7Bv_%7Bt%2B1%7D%7D%20%2B%5Ceta%7D">
As shown in the equations above, the variance term v_t is nonlinearly dependent on the gradient g_t. If we apply basic error compensation compression to Adam, we observe that Adam will not converge as shown in Figure 1.
![Inapplicability of Error-compensation Compression for Adam due to non-linear dependence on the gradient](/assets/images/adam-convergence.png){: .align-center}
Figure 1: Inapplicability of Error-compensation Compression for Adam due to non-linear dependence on the gradient
## 2. Compressing communication with 1-bit Adam
To compress communication while using the Adam optimizer, we develop 1-bit Adam, which addresses the non-linearity in gradients via preconditioning. We observe that the magnitude of changes in the non-linear term, the variance (v_t), decreases significantly after a few epochs of training, and keeping v_t constant afterwards does not change the convergence speed. The proposed 1-bit Adam optimizer, as shown in Figure 2, consists of two parts: the warmup stage, which is essentially the vanilla Adam algorithm; and the compression stage, which keeps the variance term constant and compresses the remaining linear term, that is the momentum, into a 1-bit representation.
The compression stage of the algorithm is controlled by a threshold parameter (as shown in Figure 2). When we detect that the change in “variance” falls below a certain threshold, we switch to the compression stage. Our study shows that only 15-20% of the overall training steps are needed for the warmup stage.
![Comparison of distributed training steps in classic Adam and the proposed 1-bit compressed Adam algorithm](/assets/images/onebit-adam-overview.png){: .align-center}
Figure 2: Comparison of distributed training steps in classic Adam and the proposed 1-bit compressed Adam algorithm
### 2.1 How 1-bit Adam works under the hood
The weight update rule for 1-bit Adam is governed by the following equations.
For the i-th worker, in the compression stage:
<img src="https://render.githubusercontent.com/render/math?math=m_%7Bt%2B1%7D%5E%7B(i)%7D%3D%5Cbeta_1%20m_t%2B(1-%5Cbeta_1%20)%20g_t%5E%7B(i)%7D">
<img src="https://render.githubusercontent.com/render/math?math=%5Cwidehat%7Bm%7D_%7Bt%2B1%7D%5E%7B(i)%7D%3DC(m_%7Bt%2B1%7D%5E%7B(i)%7D%2Be_t%5E%7B(i)%7D)%2C%20%5Cquad%20e_%7Bt%2B1%7D%5E%7B(i)%7D%3D(m_%7Bt%2B1%7D%5E%7B(i)%7D%2Be_t%5E%7B(i)%7D%20)-%5Chat%7Bm%7D_%7Bt%2B1%7D%5E%7B(i)%7D">
<img src="https://render.githubusercontent.com/render/math?math=m_%7Bt%2B1%7D%5E%7B(ave)%7D%20%3D%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D1%7D%5En%20%5Chat%7Bm%7D_%7Bt%2B1%7D%5E%7B(i)%7D">
<img src="https://render.githubusercontent.com/render/math?math=%5Chat%7Bm%7D_%7Bt%2B1%7D%5E%7B(ave)%7D%3DC(m_%7Bt%2B1%7D%5E%7B(ave)%7D%2Be_t%5E%7B(ave)%7D%20)%2C%5Cquad%20%20%20e_%7Bt%2B1%7D%5E%7B(ave)%7D%3D(%5Chat%7Bm%7D_%7Bt%2B1%7D%5E%7B(ave)%7D%2Be_t%5E%7B(ave)%7D%20)-%5Chat%7Bm%7D_%7Bt%2B1%7D%5E%7B(ave)%7D">
<img src="https://render.githubusercontent.com/render/math?math=m_%7Bt%2B1%7D%3D%5Chat%7Bm%7D_%7Bt%2B1%7D%5E%7B(ave)%7D">
<img src="https://render.githubusercontent.com/render/math?math=x_%7Bt%2B1%7D%3Dx_t-%5Cgamma%20%5Cfrac%7Bm_%7Bt%2B1%7D%7D%7B%5Csqrt%7Bv_%7Bwarmup%7D%7D%2B%5Ceta%7D">
where x_t is the model after iteration (t-1), m_t^(i) and e_t^(i) are the momentum and compression error on worker i after iteration (t-1), and v_warmup is the variance term frozen at the end of the warmup stage.
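Putting these equations together, here is a minimal single-worker sketch of one compression-stage step in PyTorch. The cross-worker averaging of compressed momenta is elided, the compressor mirrors the earlier SGD sketch, and the names and defaults are illustrative rather than DeepSpeed's internal API.

```python
import torch

def one_bit_compress(x):
    # Same style of 1-bit compressor as in the earlier SGD sketch.
    sign = torch.sign(x)
    sign[sign == 0] = 1.0
    return (x.norm(p=1) / sign.norm(p=1)) * sign

def onebit_adam_compression_step(x, m, e, v_warmup, grad, lr=1e-3, beta1=0.9, eta=1e-8):
    """One local 1-bit Adam step during the compression stage.

    m, e     : momentum and compression error carried over from the previous step
    v_warmup : variance term frozen at the end of the warmup stage
    In a real distributed run the compressed momentum would be exchanged and
    averaged across workers (with error feedback on both sides); that exchange
    is skipped here and the local compressed momentum is used directly.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # momentum update, linear in the gradient
    compressed = one_bit_compress(m + e)        # compress with error feedback
    e = (m + e) - compressed                    # keep what compression dropped
    m = compressed                              # stands in for the averaged momentum
    x = x - lr * m / (v_warmup.sqrt() + eta)    # Adam-style update with frozen variance
    return x, m, e
```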
### 2.2 Addressing system challenges for 1-bit Adam
Besides the algorithmic challenge, there are two system challenges in applying 1-bit Adam in training systems. First, we need efficient kernels that convert the momentum to 1-bit representations. Second, we need efficient communication schemes to exchange this compressed momentum across different GPUs. The goal of compression is to reduce the overall training time so that commodity systems with bandwidth-limited interconnects can be used to train large models. We address these challenges in DeepSpeed and introduce a fully optimized 1-bit Adam implementation for training on communication-constrained systems.
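To give a rough picture of what such a communication scheme can look like, the sketch below averages compressed momentum chunks across MPI ranks, with error feedback applied on the averaging side as well. It is a simplified illustration, not DeepSpeed's actual kernels or collectives: the chunking, the use of `alltoall`/`allgather`, and keeping the signs as floats instead of bit-packing them are all simplifying assumptions.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def one_bit_compress(x):
    # Signs plus a single scale; a real kernel would bit-pack the signs.
    sign = np.where(x >= 0, 1.0, -1.0)
    return (np.abs(x).sum() / sign.size) * sign

def compressed_allreduce(momentum, server_error):
    """Average `momentum` across ranks while only exchanging compressed chunks.
    Rank i is responsible for averaging and re-compressing chunk i."""
    chunks = np.array_split(momentum, size)
    # 1) exchange chunks so that rank i holds everyone's i-th chunk
    received = comm.alltoall(chunks)
    # 2) average the received chunks and re-compress with error feedback
    avg = np.mean(received, axis=0) + server_error
    compressed = one_bit_compress(avg)
    server_error = avg - compressed
    # 3) share the compressed averaged chunks so every rank has the full momentum
    gathered = comm.allgather(compressed)
    return np.concatenate(gathered), server_error

# Toy usage (run under mpirun): each rank contributes its own momentum vector.
momentum = np.random.randn(1024) + rank
server_error = np.zeros(np.array_split(momentum, size)[rank].shape)
avg_momentum, server_error = compressed_allreduce(momentum, server_error)
```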
## 3. Benefits of 1-bit Adam on communication-constrained systems
1-bit Adam offers the same convergence as Adam and incurs up to 5x less communication, which enables up to 3.5x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning. This end-to-end throughput improvement is enabled by the 6.6x (Figure 3) and 6.2x (Figure 4) speedups observed during the compression stage. It is worth mentioning that our 1-bit Adam optimizer scales so well on a 40 Gigabit Ethernet system that its performance is comparable to Adam's scalability on a 40 Gigabit InfiniBand QDR system. We note that the effective bandwidth on 40 Gigabit Ethernet is 4.1 Gbps based on iperf benchmarks, whereas InfiniBand provides near-peak bandwidth of 32 Gbps based on InfiniBand perftest microbenchmarks.
![BERT-Large Pretraining](/assets/images/bert-scaling.png){: .align-center}
Figure 3: Scalability of 1-bit Adam for BERT-Large Pretraining on V100 GPUs with batch size of 16/GPU.
![SQuAD Finetuning](/assets/images/squad-scaling.png){: .align-center}
Figure 4: Scalability of 1-bit Adam for SQuAD Finetuning on V100 GPUs with batch size of 3/GPU.
## 4. Dive deeper into 1-bit Adam evaluation results
### Same convergence as Adam
One major question about using 1-bit Adam is its convergence speed, and we find that 1-bit Adam achieves the same convergence speed and comparable testing performance as Adam using the same number of training samples, as shown in Figure 5.
![1-bit Adam convergence](/assets/images/onebit-convergence.png){: .align-center}
Figure 5: 1-bit Adam converges like Adam using the same number of training samples.
Detailed BERT-Base and BERT-Large results are shown in Table 1. We see that the scores are on par with or better than the original model for both the uncompressed and compressed cases.
![1-bit Adam convergence table](/assets/images/convergence-table.png){: .align-center}
Table 1: Verifying correctness of 1-bit Adam on various testing tasks
Up to 5x less communication: 1-bit Adam provides the same convergence as Adam and reduces the communication volume by 16x during the compression stage for 16-bit (FP16) training. For BERT pretraining, this leads to an overall communication reduction of 5x as we observed the warmup stage to be just 15% of the end-to-end training time.
The formula to calculate the communication volume ratio of the original versus 1-bit Adam is as follows:
1 / (warmup + (1 – warmup)/16)
With a warmup stage of 15%, the original Adam incurs 5x the communication of 1-bit Adam.
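A quick sanity check of this formula for a few warmup fractions (the 16x factor corresponds to compressing 16-bit values down to 1 bit):

```python
def comm_reduction(warmup_fraction, compression_ratio=16.0):
    """Ratio of original Adam's communication volume to 1-bit Adam's."""
    return 1.0 / (warmup_fraction + (1.0 - warmup_fraction) / compression_ratio)

for w in (0.15, 0.20, 0.25):
    print(f"warmup {w:.0%}: {comm_reduction(w):.1f}x less communication")
# warmup 15%: 4.9x, warmup 20%: 4.0x, warmup 25%: 3.4x
```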
### 1-bit Adam is 3.5x faster for training BERT-Large
We present two main results for training BERT-Large on systems with two different bandwidth-limited interconnects: 1) 40 Gigabit Ethernet (Figure 6) and 2) 40 Gbps InfiniBand QDR (Figure 7). During the compression phase, we observe up to 6.6x higher throughput on the system with Ethernet and up to 2x higher throughput on the system with InfiniBand, resulting in end-to-end speedups (including both warmup and compression stages) of 3.5x and 2.7x, respectively. The major benefit of 1-bit Adam comes from the communication volume reduction, enabled by our compressed momentum exchange, and from our custom allreduce operation that implements efficient 1-bit communication using non-blocking gather operations followed by an allgather operation.
It is important to note that one can also increase the total batch size to reduce communication by using optimizers like LAMB instead of Adam for BERT pretraining. However, 1-bit Adam avoids the need for rigorous hyperparameter tuning, which, in our experience, is often more difficult with large batches. Furthermore, 1-bit Adam also works very well for workloads that have a small critical batch size (i.e., that cannot converge well with large batch sizes), like many fine-tuning tasks.
![Performance of 1-bit Adam for BERT-Large training on 40 gbps Ethernet](/assets/images/bert-tcp.png){: .align-center}
Figure 6: Performance of 1-bit Adam for BERT-Large training on 40 Gbps Ethernet interconnect during the compression stage.
![Performance of 1-bit Adam for BERT-Large training on 40 gbps InfiniBand](/assets/images/bert-ib.png){: .align-center}
Figure 7: Performance of 1-bit Adam for BERT-Large training on 40 Gbps InfiniBand interconnect during the compression stage.
### 1-bit Adam is 2.7x faster for SQuAD fine-tuning
1-bit Adam offers scalability not only on large-scale training tasks but also on tasks like SQuAD fine-tuning. As shown in Figures 8 and 9, 1-bit Adam scales well on both Ethernet- and InfiniBand-based systems and offers up to 6.2x higher throughput (during the compression stage) on the Ethernet-based system, resulting in a 2.7x end-to-end speedup (25% warmup plus 75% compression stage). For SQuAD fine-tuning, we observed that a total batch size of 96 offers the best F1 score. Batch sizes larger than this value lower the convergence rate and require additional hyperparameter tuning. Therefore, in order to scale to 32 GPUs, we can only apply a small batch size of 3-4 per GPU. This makes fine-tuning tasks communication intensive and hard to scale. 1-bit Adam addresses this scaling challenge well, obtaining a 3.4x communication reduction without enlarging the batch size, which results in a 2.7x end-to-end speedup.
![SQuAD fine-tuning on 40 Gbps Ethernet](/assets/images/squad-tcp.png){: .align-center}
Figure 8: Performance of 1-bit Adam for SQuAD fine-tuning on 40 Gbps Ethernet during the compression stage.
![SQuAD fine-tuning on 40 Gbps InfiniBand](/assets/images/squad-ib.png){: .align-center}
Figure 9: Performance of 1-bit Adam for SQuAD fine-tuning on 40 Gbps InfiniBand interconnect during the compression stage.
---
layout: single
title: "Up to 5x less communication and 3.4x faster training through 1-bit Adam"
excerpt: ""
categories: news
new_post: true
date: 2020-09-09 00:00:00
---
Adam is an effective and probably the most widely used optimizer for
training many large-scale deep learning models. However, Adam is generally
not compatible with communication-efficient optimization algorithms, and
therefore the communication cost can become a bottleneck when scaling
across distributed devices. We introduce a new algorithm, 1-bit Adam, and
its efficient implementation in DeepSpeed. 1-bit Adam offers the ***same convergence*** as Adam, incurs up to ***5x less communication***, and enables up to ***3.5x higher throughput for BERT-Large pretraining*** and up to ***2.7x higher throughput for SQuAD fine-tuning*** on bandwidth-limited clusters.
* For a brief overview, see our [press release](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
* For a detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html).
* For a tutorial on how to reproduce our results, see our [1-bit Adam tutorial](/tutorials/onebit-adam/).
* The source code for 1-bit Adam can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed). The implementation of 1-bit Adam is in [onebit_adam.py](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/fp16/onebit_adam.py) and CUDA-Aware communication for 1-bit Adam is in [custom_collectives.py](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py). Example codes to try this feature can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples) as shown in the [tutorial](/tutorials/onebit-adam/).
@@ -2,16 +2,18 @@
title: "1-bit Adam: Up to 5x less communication volume and up to 2x faster training"
---

In this tutorial, we introduce the 1-bit Adam optimizer in DeepSpeed. 1-bit Adam can improve model training speed on communication-constrained clusters, especially for communication-intensive large models, by reducing the overall communication volume by up to 5x. A detailed description of the 1-bit Adam algorithm, its implementation in DeepSpeed, and its performance evaluation is available in our [blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html).

To illustrate the benefits and usage of the 1-bit Adam optimizer in DeepSpeed, we use the following two training tasks as examples:

1. BingBertSQuAD Fine-tuning
2. BERT Pre-training

For more details on these tasks, please refer to the tutorial posts on [BingBertSQuAD Fine-tuning](/tutorials/bert-finetuning/) and [BERT Pre-training](/tutorials/bert-pretraining/).
## 1. Overview

### Pre-requisites for installing DeepSpeed

If you don't already have a copy of the DeepSpeed repository, please clone it
now and checkout the DeepSpeedExamples submodule that contains the BingBertSQuAD and BERT Pre-training examples.
@@ -22,7 +24,8 @@ cd DeepSpeed
git submodule update --init --recursive
cd DeepSpeedExamples/
```
### Pre-requisites for 1-bit Adam

1-bit Adam uses advanced communication schemes that are not yet supported by PyTorch distributed and NCCL. We rely on Message Passing Interface (MPI) for these advanced communication primitives.
@@ -40,7 +43,11 @@ Alternatively, the standard mpirun launcher can also be used as follows:
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash [training_script.sh]
```

### 1-bit Algorithm

A detailed description of the 1-bit Adam algorithm is available in our [blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html).

### Configuration of 1-bit Adam

The 1-bit Adam feature can be used by setting the optimizer configuration options as follows. An example json config file is shown below.

```json
@@ -67,7 +74,7 @@ This feature is only supported on systems with InfiniBand interconnect and a CUD

`freeze_step` is the number of warmup steps before 1-bit compression gets applied to the communication. To determine the number of warmup steps, one strategy is to set it to 15-25% of the total training steps for a given model. If this provides the desired outcome, one can try to extract more performance by reducing the steps systematically. In the future, we plan to introduce a threshold that can automatically search for and decide the number of warmup steps for different models. The examples below have been tuned for the number of warmup steps. The `freeze_step` parameter has already been set to the best number we found in the corresponding run scripts.
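As a rough starting point before any tuning, `freeze_step` can simply be computed as a fraction of the planned number of optimizer steps; the numbers below are placeholders for illustration, not the tuned values shipped in the run scripts.

```python
# Pick freeze_step as ~20% of the total optimizer steps (placeholder numbers).
num_training_samples = 88_000        # e.g. roughly the size of the SQuAD v1.1 training set
total_batch_size = 96
num_epochs = 2

steps_per_epoch = num_training_samples // total_batch_size
total_steps = steps_per_epoch * num_epochs
freeze_step = int(0.20 * total_steps)
print(total_steps, freeze_step)      # ~1832 total steps, freeze_step ~366
```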
## 2. BingBertSQuAD Fine-tuning with 1-bit Adam

* Download the SQuAD dataset:
  * Training set: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
@@ -78,7 +85,7 @@ This feature is only supported on systems with InfiniBand interconnect and a CUD
You can also use a pre-trained BERT model checkpoint from either DeepSpeed, [HuggingFace](https://github.com/huggingface/transformers), or [TensorFlow](https://github.com/google-research/bert#pre-trained-models) to run the fine-tuning.

### 2.1 Running BingBertSQuAD with DeepSpeed and 1-bit Adam

The main part of training is done in `nvidia_run_squad_deepspeed.py`, which has
already been modified to use DeepSpeed. The `run_squad_deepspeed.sh` script
@@ -99,10 +106,10 @@ To enable the 1-bit compressed training, 1-bit Adam uses an MPI library (E.g. MV

### Launch with deepspeed

The following helper script in DeepSpeedExamples/BingBertSQuAD will launch the training without the need to set any `mpirun` parameters. The number of nodes and GPUs will be automatically detected and the job will be launched on all available resources.

```shell
bash run_squad_deepspeed_onebitadam.sh <PATH_TO_OUTPUT_DIR>
```
### Launch with mpirun
@@ -110,21 +117,22 @@ bash run_squad_deepspeed_onebitadam.sh

Alternatively, we show how the standard `mpirun` launcher can be used for launching the fine-tuning job.

```shell
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash run_squad_mpi_onebitadam.sh
```

For example, in order to use 32 GPUs (4 GPUs/node, 8 nodes in total), with the support of InfiniBand, you can use the `mpirun` launcher packaged with the MVAPICH2 library. Please run the following command:

```shell
mpirun -np 32 -ppn 4 -hostfile hosts -env MV2_USE_CUDA=1 -env MV2_SUPPORT_DL=1 -env MV2_ENABLE_AFFINITY=0 -env MV2_SMP_USE_CMA=0 bash run_squad_mpi_onebitadam.sh
```
### 2.2 Configuration for BingBertSQuAD with DeepSpeed and 1-bit Adam enabled

The `deepspeed_onebitadam_bsz96_config.json` file gives the user the ability to specify DeepSpeed
options in terms of batch size, micro batch size, optimizer, learning rate, and other parameters.
When running `nvidia_run_squad_deepspeed.py`, in addition to the
`--deepspeed` flag to enable DeepSpeed, the appropriate DeepSpeed configuration
file must be specified using `--deepspeed_config deepspeed_onebitadam_bsz96_config.json`.

Table 1 shows the fine-tuning configuration we used in our experiments.
@@ -142,8 +150,11 @@ Table 1 shows the fine-tuning configuration we used in our experiments.
Table 1. Fine-tuning configuration
**Note:** For more details about loading checkpoints, argument parsing, initialization, forward pass, backward pass, weight update and evaluation, please refer to the [BingBertSQuAD Fine-tuning](/tutorials/bert-finetuning/) tutorial.

### 2.3 Performance Results for BingBertSQuAD Fine-tuning

***Accuracy:***
The results are summarized in the table below. The total batch size is set to 96 and training is conducted
on 32 GPUs for 2 epochs. A set of parameters (seeds and learning rates) was tried and the best ones were selected.
We fixed the learning rate to 3e-5. The table below shows the F1 and EM scores we achieved, which are on par with or better than the [HuggingFace results](https://github.com/huggingface/transformers/tree/master/examples/question-answering).
@@ -152,19 +163,24 @@ We fixed the learning rate to 3e-5. The table below shows the F1 and the EM scor
| ----------- | ------------------------------------- | --------- | ----- | ----- |
| HuggingFace | [Bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin) | FP16 | 87.26 | 93.32 |
***Training Speed and Scalability:***

1-bit Adam enables up to 2.7x overall speedup in training speed for SQuAD fine-tuning. This is made possible by up to 6.2x faster throughput during the compression stage of the algorithm, as shown in Figure 1.

![SQuAD Finetuning](/assets/images/squad-scaling.png){: .align-center}

Figure 1: Scalability of 1-bit Adam for SQuAD Finetuning on V100 GPUs with batch size of 3/GPU.

## 3. BERT Pre-training with 1-bit Adam

For data downloading and pre-processing, please refer to the [BERT Pre-training](/tutorials/bert-pretraining/) post.

### 3.1 Running Pre-training with DeepSpeed and 1-bit Adam
The main part of training is done in `deepspeed_train.py`, which has
already been modified to use DeepSpeed. The `ds_train_bert_onebit_bsz4k_seq128.sh` and `ds_train_bert_bsz64k_seq128.sh`
are the shell scripts that help to invoke training and set up several different hyperparameters relevant
to the training process.

- **DeepSpeed-enabled:** Start training with DeepSpeed by running the command below:
@@ -180,7 +196,7 @@ bash ds_train_bert_bsz64k_seq128.sh

As discussed for BingBertSQuAD fine-tuning, we can simply use the `deepspeed` launcher to launch our BERT pre-training jobs as follows.

```shell
bash ds_train_bert_onebit_bsz4k_seq128.sh
```
### Launch with mpirun
@@ -188,27 +204,28 @@ bash ds_train_bert_onebitadam_bsz4k_seq128.sh

Alternatively, use the following command to launch using `mpirun`.

```shell
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash mpi_train_bert_onebit_bsz4k_seq128.sh
```

For example, in order to use 32 GPUs (4 GPUs/node, 8 nodes in total), with the support of InfiniBand, you can use MVAPICH2 as the launcher and run the following command:

```shell
mpirun -np 32 -ppn 4 -hostfile hosts -env MV2_USE_CUDA=1 -env MV2_SUPPORT_DL=1 -env MV2_ENABLE_AFFINITY=0 -env MV2_SMP_USE_CMA=0 bash ds_train_bert_onebit_bsz4k_seq128.sh
```
### 3.2 Configuration for BERT Pre-training with DeepSpeed and 1-bit Adam enabled

The `deepspeed_bsz4k_onebit_config_seq128.json` file gives the user the ability to specify DeepSpeed
options in terms of batch size, micro batch size, optimizer, learning rate, and other parameters.
Below is the DeepSpeed configuration file for running BERT-large pre-training with sequence length of 128 using the 1-bit Adam optimizer.
```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 100,
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 2e-4,
      "max_grad_norm": 1.0,
@@ -225,10 +242,8 @@ Below is the DeepSpeed configuration file for running BERT-large pre-training wi
  }
}
```
The above file is for BERT-large. For BERT-base training (sequence length 128), the suggested `freeze_step` is 16000. For the rest of the pre-training with sequence length 512, we suggest a `freeze_step` of 1500.

### 3.3 Performance Results for BERT Pre-training

Performance results of BERT pre-training can be seen in our detailed [blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html).