Unverified Commit f2ac7eaf authored by Jeff Rasley, committed by GitHub

ZeRO-2 (#217)



Updates for ZeRO stage 2 + ZeRO stage 1 with reduce-scatter (RS)
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
parent c61e23b4
@@ -102,15 +102,8 @@ Example of ***scheduler***
| ------------------------------------------------------------ | ------- |
| Enable sparse compression of [torch.nn.Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) gradients. | `false` |

### FP16 training options
***zero\_optimization***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |
***fp16***: [dictionary]

| Description | Default |
@@ -172,6 +165,66 @@ Example of ***scheduler***
| ----------------------------------- | ------- |
| Enable gradient clipping with value | `0` |
### ZeRO Optimizations for FP16 Training
Enabling and configuring ZeRO memory optimizations:
```json
"zero_optimization": {
"stage": [0|1|2],
"allgather_partitions": [true|false],
"allgather_bucket_size": 500000000,
"reduce_scatter": [true|false],
"reduce_bucket_size": 500000000,
"contiguous_gradients" : [true|false]
}
```
***zero\_optimization***: [dictionary]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |
***stage***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Chooses different stages of the ZeRO Optimizer. Stage 0, 1, and 2 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitioning, respectively. | `0` |
***allgather_partitions***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step | `true` |
***allgather_bucket_size***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `500000000` |
***reduce_scatter***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Uses reduce or reduce scatter instead of allreduce to average gradients | `true` |
***reduce_bucket_size***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Number of elements reduced/allreduced at a time. Limits the memory required for the reduction for large model sizes | `500000000` |
***contiguous_gradients***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during the backward pass. Only useful when running very large models. | `false` |
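
For example, a configuration enabling ZeRO stage 2 with the documented default bucket sizes would look like the following (a sketch, not a tuned recommendation):

```json
"zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true
}
```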
### Logging

***steps\_per\_print***: [integer]

@@ -191,3 +244,52 @@ Example of ***scheduler***
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Print out state information of DeepSpeed object after initialization | `false` |
### Activation Checkpointing
```json
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
```
***partition\_activations***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enables partition activation when used with model parallelism | `false` |
***cpu\_checkpointing***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Offloads partitioned activations to CPU if partition_activations is enabled | `false` |
***contiguous\_memory\_optimization***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Copies partitioned activations so that they are contiguous in memory | `false` |
***number_checkpoints***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Total number of activation checkpoints used to allocate memory buffer for contiguous_memory_optimization | `None` |
***synchronize\_checkpoint\_boundary***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Inserts torch.cuda.synchronize() at each checkpoint boundary. | `false` |
***profile***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Logs the forward and backward time for each checkpoint function | `false` |
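
For instance, a configuration that partitions activation checkpoints, offloads them to CPU, and keeps them in contiguous buffers might look like this (values are illustrative; `number_checkpoints` should match the number of layers being checkpointed):

```json
"activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 24,
    "synchronize_checkpoint_boundary": false,
    "profile": true
}
```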
@@ -57,19 +57,33 @@ DeepSpeed is fully compatible with [Megatron](https://github.com/NVIDIA/Megatron
Please see the [Megatron-LM tutorial](/tutorials/megatron/) for details.

## The Zero Redundancy Optimizer
The Zero Redundancy Optimizer ([ZeRO](https://arxiv.org/abs/1910.02054)) is at
the heart of DeepSpeed and enables large model training at a scale that is
simply not possible with model parallelism alone. When enabled, ZeRO allows
training models with over 13 billion parameters without any model parallelism,
and up to 200 billion parameter models with model parallelism on current
generation hardware.

For more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054) and the
[GPT tutorial](/tutorials/megatron/) on integration with DeepSpeed.
### Optimizer State and Gradient Partitioning
Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the
model states (optimizer states, gradients, and parameters) by 8x compared to standard
data parallelism by partitioning these states across the data parallel processes instead of
replicating them.
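
To make the 8x figure concrete, here is a back-of-the-envelope sketch. It follows the ZeRO paper's accounting for mixed-precision Adam (2 bytes per parameter each for fp16 weights and gradients, plus 12 bytes per parameter of fp32 optimizer state), which is an assumption carried over from the paper rather than stated above:

```python
def model_state_bytes_per_gpu(num_params, dp_degree, stage):
    """Approximate per-GPU memory for model states under ZeRO stages 0-2."""
    params = 2 * num_params  # fp16 parameters
    grads = 2 * num_params   # fp16 gradients
    optim = 12 * num_params  # fp32 master weights, momentum, and variance (Adam)
    if stage >= 1:
        optim /= dp_degree   # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= dp_degree   # stage 2 also partitions gradients
    return params + grads + optim

baseline = model_state_bytes_per_gpu(1.5e9, dp_degree=64, stage=0)  # 16 bytes/param
zero2 = model_state_bytes_per_gpu(1.5e9, dp_degree=64, stage=2)
print(baseline / zero2)  # ~7.2x at 64 GPUs; approaches 8x as dp_degree grows
```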
### Activation Partitioning
Activation Partitioning is a memory optimization in ZeRO that can reduce the memory
consumed by activations during model parallel (MP) training. In MP, certain
activations may be required by all MP processes, resulting in a replication of
activations across MP GPUs. Activation Partitioning stores these activations in a
partitioned state once they have been used for computation in the forward propagation. The
activations are allgathered right before they are needed again during the backward propagation.
By storing activations in a partitioned state, ZeRO in DeepSpeed can reduce the activation
memory footprint in proportion to the MP degree.
### Constant Buffer Optimization (CBO)

CBO enables high network and memory throughput while restricting memory usage to a
@@ -80,6 +94,17 @@ unnecessary memory overhead. CBO in DeepSpeed fuses smaller operands into approx
pre-defined sized buffer large enough to achieve great performance without the
unnecessary memory overhead.
### Contiguous Memory Optimization (CMO)

CMO reduces memory fragmentation during training, preventing out-of-memory errors
due to lack of contiguous memory. Memory fragmentation is a result of interleaving between
short-lived and long-lived memory objects. During the forward propagation, activation
checkpoints are long-lived but the activations that are recomputed are short-lived. Similarly,
during the backward computation, the activation gradients are short-lived while the parameter
gradients are long-lived. CMO transfers activation checkpoints and parameter gradients
to contiguous buffers, preventing memory fragmentation.
## Additional Memory and Bandwidth Optimizations
### Smart Gradient Accumulation

Gradient accumulation allows running larger batch sizes with limited memory by breaking an
effective batch into several sequential micro-batches, and averaging the parameter
@@ -90,6 +115,11 @@ averaged gradients for the effective batch across all GPUs. This strategy signif
reduces the communication involved over the approach of averaging globally for each
micro-batch, especially when the number of micro-batches per effective batch is large.
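
In DeepSpeed, gradient accumulation is driven by the batch size keys of the JSON config. As a sketch (key names as in DeepSpeed's config, values illustrative), an effective batch of 1024 built from micro-batches of 4 per GPU, accumulated over 4 steps, on 64 GPUs (4 x 4 x 64 = 1024):

```json
{
  "train_batch_size": 1024,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4
}
```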
### Communication Overlapping
During back propagation, DeepSpeed can overlap the communication required for averaging
parameter gradients that have already been computed with the ongoing gradient computation.
This computation-communication overlap allows DeepSpeed to achieve higher throughput even
at modest batch sizes.
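
This idea can be sketched with plain PyTorch hooks (a conceptual sketch, not DeepSpeed's implementation, which additionally buckets gradients into the constant-size buffers described above; it assumes `torch.distributed` has already been initialized):

```python
import torch.distributed as dist

def attach_overlapped_allreduce(model):
    """Launch an async all-reduce for each gradient as soon as backward
    produces it, so communication overlaps the remaining computation."""
    handles = []

    def hook(grad):
        # starts immediately; backward keeps computing earlier layers
        work = dist.all_reduce(grad, async_op=True)
        handles.append((work, grad))
        return grad

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

    def finalize():
        # drain in-flight reductions, then scale the sums to averages
        for work, grad in handles:
            work.wait()
            grad.div_(dist.get_world_size())
        handles.clear()

    return finalize
```

After `loss.backward()`, calling `finalize()` before the optimizer step guarantees every gradient has been fully averaged.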
## Training Features

@@ -100,12 +130,23 @@ The DeepSpeed core API consists of just a handful of methods:
* argument parsing: `add_config_arguments`
* checkpointing: `load_checkpoint` and `save_checkpoint`

DeepSpeed supports most of the features described in this document, via the use of these APIs,
along with a `deepspeed_config` JSON file for enabling and disabling the features.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
### Activation Checkpointing API
DeepSpeed's Activation Checkpointing API supports activation checkpoint partitioning,
CPU checkpointing, and contiguous memory optimizations, while also allowing layerwise
profiling. Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
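
A minimal sketch of how this API can be wired in (mirroring the Megatron integration shown in the tutorials; `mpu`, `args`, `transformer_layer`, and its inputs are placeholders borrowed from that context):

```python
import deepspeed

# Configure once, before the first checkpointed forward pass; options can
# also be picked up from the deepspeed_config JSON file.
deepspeed.checkpointing.configure(mpu, deepspeed_config=args.deepspeed_config)

# Checkpoint a layer instead of storing all of its activations: inputs are
# saved (possibly partitioned or on CPU) and recomputed during backward.
hidden = deepspeed.checkpointing.checkpoint(transformer_layer, hidden, mask)
```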
### Gradient Clipping
```json
{
"gradient_clipping": 1.0
}
```
DeepSpeed handles gradient clipping under the hood based on the max gradient norm
specified by the user.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
@@ -136,8 +177,8 @@ DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Opt
For more details on LAMB, see the [LAMB paper](https://arxiv.org/pdf/1904.00962.pdf).
### Memory-Efficient Training with ZeRO Optimizer

DeepSpeed can train models with up to 13 billion parameters without parallelism, and
models with up to 200 billion parameters with 16-way model parallelism. This leap in
model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For
more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054).
@@ -174,6 +215,10 @@ file.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.

```json
{
  "wall_clock_breakdown": true,
  "activation_checkpointing": {
    "profile": true
  }
}
```
---
title: "ZeRO stage 2"
sneak_preview: true
excerpt: "Reduce memory footprint to enable training 10B models without model parallelism!"
---
* Reduce memory footprint of gradients
* Train larger models: e.g., 10B parameters on 32 GPUs without model parallelism
* Train larger batch sizes
## Further updates coming soon!
---
layout: single
title: "DeepSpeed optimizes transformer kernels to achieve world's fastest BERT training record: 44 minutes on 1024 NVIDIA V100 GPUs"
excerpt: ""
categories: news
new_post: true
date: 2020-05-19 00:00:00
---
We introduce new technology to accelerate single GPU performance via
kernel optimizations. These optimizations not only create a strong
foundation for scaling out large models, but also improve the single GPU
performance of highly tuned and moderately sized models like BERT by more
than 30%, reaching a staggering performance of 66 teraflops per V100 GPU,
which is 52% of the hardware peak. **Using these optimizations as the building
block, DeepSpeed achieves the fastest BERT training record: 44 minutes on
1,024 NVIDIA V100 GPUs**, compared with the best published result
of 67 minutes on the same number and generation of GPUs.
**Code and tutorials are coming soon!**
---
layout: single
title: "ZeRO-2 empowers training models as large as 170 billion parameters up to 10x faster compared to state-of-the-art"
excerpt: ""
categories: news
new_post: true
date: 2020-05-19 01:00:00
---
ZeRO-2 expands the scope of memory optimizations in the original ZeRO by
tackling the full spectrum of memory consumption during training. More
specifically, ZeRO-2 introduces new technology to reduce the memory footprint
of gradients, activation memory, and fragmented memory, in addition to
optimizer state memory optimization in the original ZeRO. Altogether, the
memory savings empower DeepSpeed to improve the scale and speed of deep
learning training by an order of magnitude. More concretely, ZeRO-2 allows
training models as large as 170 billion parameters up to 10x faster compared
to the state of the art.
For more information on using ZeRO-2, see the [Megatron tutorial](/tutorials/megatron/).
For a technical deep dive, see our [technical report](https://arxiv.org/abs/1910.02054).
@@ -2,6 +2,7 @@
title: "Getting Started"
permalink: /getting-started/
excerpt: "First steps with DeepSpeed"
date: 2020-05-15
---

## Installation
...
@@ -320,6 +320,43 @@ and return the states for the client model.
```
### DeepSpeed Activation Checkpoints (Optional)
DeepSpeed can reduce the activation memory during model parallel training by partitioning activation checkpoints across model parallel GPUs, or by offloading them to CPU. These optimizations are optional, and can be skipped unless activation memory becomes a bottleneck. To enable partitioned activations, we use the `deepspeed.checkpointing` API to replace Megatron's activation checkpointing and random state tracker APIs. The replacement should happen before the first invocation of these APIs.

a) Replace in `pretrain_gpt.py`:
```python
# Optional DeepSpeed Activation Checkpointing Features
#
def set_deepspeed_activation_checkpointing(args):
    # Configure DeepSpeed's activation checkpointing, then swap in its
    # checkpoint and RNG-tracker implementations for Megatron's.
    deepspeed.checkpointing.configure(mpu,
                                      deepspeed_config=args.deepspeed_config,
                                      partition_activation=True)
    mpu.checkpoint = deepspeed.checkpointing.checkpoint
    mpu.get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
    mpu.model_parallel_cuda_manual_seed = \
        deepspeed.checkpointing.model_parallel_cuda_manual_seed

if args.deepspeed and args.deepspeed_activation_checkpointing:
    set_deepspeed_activation_checkpointing(args)
```
b) Replace in `mpu/transformer.py`:
```python
if deepspeed.checkpointing.is_configured():
global get_cuda_rng_tracker, checkpoint
    get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
checkpoint = deepspeed.checkpointing.checkpoint
```
With these replacements, various DeepSpeed activation checkpointing optimizations, such as activation partitioning, contiguous checkpointing, and CPU checkpointing, can be specified either with `deepspeed.checkpointing.configure` or in the `deepspeed_config` file.
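
Putting the pieces together, a `deepspeed_config` for such a run might combine ZeRO stage 2 with the activation checkpointing options above. This is a sketch with illustrative values, not the exact file shipped with the training scripts:

```json
{
  "train_batch_size": 32,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "reduce_scatter": true,
    "contiguous_gradients": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  }
}
```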
### Train scripts

Assuming the webtext data was prepared in the previous step, to start training
the Megatron-LM GPT2 model with DeepSpeed applied, execute the following command to
@@ -328,13 +365,18 @@ start training.

- Single GPU run
    - run `bash scripts/ds_pretrain_gpt2.sh`
- Multiple GPUs/Nodes run
    - run `bash scripts/ds_zero2_pretrain_gpt2_model_parallel.sh`
## Performance Improvements
DeepSpeed enables training very large models effectively via the advanced [ZeRO
optimizer](https://arxiv.org/abs/1910.02054v2). In February, we released a subset
of optimizations from ZeRO in DeepSpeed that performs optimizer state partitioning.
We refer to them as ZeRO-1. In May 2020, we extended ZeRO-1 in DeepSpeed to include
additional optimizations from ZeRO, including gradient and activation partitioning,
as well as contiguous memory optimizations. We refer to this release as ZeRO-2.

ZeRO-2 significantly reduces the memory
footprint for training large models, which means large models can be trained with i) less
model parallelism and ii) larger batch sizes. A lower model parallelism degree improves
training efficiency by increasing the granularity of the computation, such as the matrix
@@ -342,80 +384,25 @@
multiplication, where performance is directly related to the size of the matrices.
Furthermore, less model parallelism also results in less communication between model
parallel GPUs, which further boosts performance. A larger batch size has a similar effect
of increasing the computational granularity as well as reducing communication, also
resulting in better performance. Therefore, with DeepSpeed and ZeRO-2 integration into Megatron,
we elevate the model scale and speed to an entirely new level compared to Megatron alone.
The observed performance improvements depend on several factors such as the memory per
GPU, the local GPU interconnect (i.e., PCI-E vs NVLINK vs NVSwitch), the model size,
the inter-node network interconnect, etc. Below, we show some of the performance improvements
from using DeepSpeed over Megatron on a 16 GPU Low Bandwidth (40 Gbps) cluster and a 400 GPU DGX-2 High Bandwidth (800 Gbps) cluster.
For details please see the [ZeRO Paper](https://arxiv.org/abs/1910.02054v2). We also
present performance improvements on a 64 GPU cluster along with detailed configuration
analysis to show where the improvements come from.
![DeepSpeed-vs-Megatron](/assets/images/DeepSpeed-vs-Megatron.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.</em>
</p>
![DeepSpeed-vs-Megatron](../assets/images/zero-full.png)
### On Low Bandwidth GPU Cluster
The figure above shows that training a 1.5B parameter model with DeepSpeed is
nearly 4x faster than without DeepSpeed on a cluster with 4 nodes, 4 GPUs per
node, and 16 GPUs in total. These GPUs have 16 GB of memory each; PCI-E
interconnects GPUs within a node, and 40 Gbps InfiniBand connects the nodes.

The performance improvement comes from the lower model parallelism degree and
larger batch size as discussed earlier. Training the 1.5B parameter model with
Megatron-LM alone requires 4-way model parallelism, and can only fit an effective
batch size of 32 using all 16 GPUs. On the other hand, DeepSpeed does not
require any model parallelism to train this model, and can support an
effective batch size of 128 without running out of memory, resulting in
significantly higher performance.
### On High Bandwidth DGX-2 GPU Cluster

Each GPU on the DGX-2 cluster has 32 GB of memory, and GPUs inside a box are connected via
the high-bandwidth NVSwitch. DGX-2 nodes are connected to each other via an 800 Gbps (8 x 100 Gbps) InfiniBand interconnect. As such, running a 1.5B model on DGX-2 requires less model
parallelism, and the performance improvement from DeepSpeed for this model size is less
significant. However, at larger model sizes, Megatron still requires a significantly larger
model parallelism degree, and can only run much smaller batch sizes than DeepSpeed.
Therefore, as the model sizes get larger, DeepSpeed, by combining ZeRO with Megatron model
parallelism, starts to significantly outperform using Megatron-LM alone.
### Performance Improvements with Configuration Details
The figure below compares DeepSpeed with Megatron on a 64 GPU cluster with 4
DGX-2 nodes. To give the reader a clear idea of the source of the performance
improvements, we also present the configuration table for both Megatron and
DeepSpeed. It shows the smallest model parallelism degree and the largest batch
size that can be used to train these models without running out of memory. As
discussed above, the tables demonstrate that DeepSpeed runs with a smaller model
parallelism degree and achieves better performance.
![DeepSpeed Performance SpeedUp](/assets/images/megatron-gpt2-perf-test.png)
<p align="center"> <p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Nvidia Megatron-LM) over using Megatron-LM alone.</em> <em>Figure 2: ZeRO-2 scales to 170 billion parameters, has up to 10x higher throughput, obtains super linear speedup, and improves usability by avoiding the need for code refactoring for models up to 13 billion parameters.</em>
</p> </p>
More concretely, DeepSpeed and ZeRO-2 excel in four aspects (as visualized in Figure 2), supporting an order-of-magnitude bigger models, up to 10x faster, with superlinear scalability, and improved usability to democratize large model training. These four aspects are detailed below.
Model size: State-of-the-art large models such as OpenAI GPT-2, NVIDIA Megatron-LM, Google T5, and Microsoft Turing-NLG have sizes of 1.5B, 8.3B, 11B, and 17B parameters respectively. ZeRO-2 provides system support to efficiently run models of 170 billion parameters, an order of magnitude bigger than these largest models (Figure 2, top left).

Speed: Improved memory efficiency powers higher throughput and faster training. Figure 2 (bottom left) shows the system throughput of ZeRO-2 and ZeRO-1 (both combining ZeRO-powered data parallelism with NVIDIA Megatron-LM model parallelism), as well as of the state-of-the-art model parallelism approach Megatron-LM alone (the baseline in Figure 2, bottom left). ZeRO-2 runs 100-billion-parameter models on a 400 NVIDIA V100 GPU cluster with over 38 teraflops per GPU and aggregated performance over 15 petaflops. For models of the same size, ZeRO-2 is 10x faster in training speed when compared with using Megatron-LM alone and 5x faster when compared with ZeRO-1.

Scalability: We observe superlinear speedup (Figure 2, top right), where the performance more than doubles when the number of GPUs is doubled. ZeRO-2 reduces the memory footprint of the model states as we increase the data parallelism degree, allowing us to fit larger batch sizes per GPU and resulting in better performance.

Democratizing large model training: ZeRO-2 empowers model scientists to train models up to 13 billion parameters efficiently without any model parallelism, which typically requires model refactoring (Figure 2, bottom right). 13 billion parameters is larger than most of the largest state-of-the-art models (such as Google T5, with 11 billion parameters). Model scientists can therefore experiment freely with large models without worrying about model parallelism. In comparison, the implementations of classic data-parallelism approaches (such as PyTorch Distributed Data Parallel) run out of memory with 1.4-billion-parameter models, while ZeRO-1 supports up to 6 billion parameters.

Furthermore, in the absence of model parallelism, these models can be trained on low-bandwidth clusters while still achieving significantly better throughput compared to using model parallelism. For example, the GPT-2 model can be trained nearly 4x faster with ZeRO-powered data parallelism compared to using model parallelism on a four-node cluster connected with a 40 Gbps InfiniBand interconnect, where each node has four NVIDIA 16 GB V100 GPUs connected with PCI-E. Therefore, with this performance improvement, large model training is no longer limited to GPU clusters with ultra-fast interconnects but is also accessible on modest clusters with limited bandwidth.

**a ) Megatron-LM GPT2 Baseline**

| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | ----------: | --------------: | ------------: |
| 1.5B | 2 | 32 | 64 | 512 | 48 | 1600 | 16 | 128.56 |
| 4B | 4 | 16 | 64 | 128 | 64 | 2304 | 16 | 49.36 |
| 8B | 4 | 16 | 64 | 128 | 72 | 3072 | 24 | 24.57 |
| 20B | 16 | 4 | 64 | 16 | 111 | 3808 | 32 | 3.42 |

**b ) Megatron-LM GPT2 with DeepSpeed**

| | Model Parallelism | Data Parallelism | #gpus | batch size | layers | hidden size | attention heads | samples / sec |
| ---- | ----------------: | ---------------: | ----: | ---------: | -----: | ----------: | --------------: | ------------: |
| 1.5B | 1 | 64 | 64 | 2048 | 48 | 1600 | 16 | 151.35 |
| 4B | 1 | 64 | 64 | 512 | 64 | 2304 | 16 | 75.13 |
| 8B | 2 | 32 | 64 | 512 | 72 | 3072 | 24 | 43.52 |
| 20B | 4 | 16 | 64 | 128 | 111 | 3808 | 32 | 12.65 |
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100644 to 100755
Activation Checkpointing
========================
The activation checkpointing APIs in DeepSpeed can be used to enable a range
of memory optimizations relating to activation checkpointing. These include
activation partitioning across GPUs when using model parallelism, CPU
checkpointing, contiguous memory optimizations, etc.
Please see the `DeepSpeed JSON config <https://www.deepspeed.ai/docs/config-json/>`_
for the full set.
Here we present the activation checkpointing API. Please see the
`Megatron-LM tutorial <https://www.deepspeed.ai/tutorials/megatron/>`_ on enabling
DeepSpeed for example usage.
Configuring Activation Checkpointing
------------------------------------
.. autofunction:: deepspeed.checkpointing.configure
.. autofunction:: deepspeed.checkpointing.is_configured
Using Activation Checkpointing
------------------------------
.. autofunction:: deepspeed.checkpointing.checkpoint
.. autofunction:: deepspeed.checkpointing.reset
Configuring and Checkpointing Random Seeds
------------------------------------------
.. autofunction:: deepspeed.checkpointing.get_cuda_rng_tracker
.. autofunction:: deepspeed.checkpointing.model_parallel_cuda_manual_seed
.. autoclass:: deepspeed.checkpointing.CudaRNGStatesTracker
.. autoclass:: deepspeed.checkpointing.CheckpointFunction
DeepSpeed Activation Checkpointing
==================================

The activation checkpointing APIs in DeepSpeed can be used to enable a range of memory optimizations relating
to activation checkpointing. These include activation partitioning across
GPUs when using model parallelism, CPU checkpointing, contiguous memory optimizations, etc.

Please see the `DeepSpeed JSON config <https://www.deepspeed.ai/docs/config-json/>`_ for the full set.

Here we present the activation checkpointing APIs.
Please see the Megatron-LM tutorial on enabling DeepSpeed for usage details.
.. autofunction:: deepspeed.checkpointing.configure
.. autofunction:: deepspeed.checkpointing.is_configured
.. autofunction:: deepspeed.checkpointing.checkpoint
.. autofunction:: deepspeed.checkpointing.reset
.. autofunction:: deepspeed.checkpointing.get_cuda_rng_tracker
.. autofunction:: deepspeed.checkpointing.model_parallel_cuda_manual_seed
.. autoclass:: deepspeed.checkpointing.CudaRNGStatesTracker
.. autoclass:: deepspeed.checkpointing.CheckpointFunction
@@ -71,25 +71,9 @@ html_context = {
from unittest.mock import MagicMock
sys.path.insert(0, os.path.abspath('../../../'))
class Mock(MagicMock):
    @classmethod
    def __getattr__(cls, name):
        return MagicMock()

MOCK_MODULES = [
    'torch',
    'torch.utils',
    'torch.utils.data',
    'torch.utils.data.distributed',
    'torch._utils',
    'torch.cuda',
    'torch.nn.modules',
    'torch.nn',
    'torch.distributed',
    'torch.distributed.distributed_c10d',
    'torch.optim',
    'torch._six'
]
sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)

# Prepend module names to class descriptions?
add_module_names = True

autoclass_content = 'both'

autodoc_mock_imports = ["torch", "apex", "mpi4py", "tensorboardX"]
DeepSpeed
=========

Model Setup
-----------

.. toctree::
   :maxdepth: 2

   initialize

Training API
------------

.. toctree::
   :maxdepth: 2

   training

Checkpointing API
-----------------

.. toctree::
   :maxdepth: 2

   model-checkpointing
   activation-checkpointing

Indices and tables
------------------

* :ref:`genindex`
* :ref:`modindex`
...
Training Setup
==============
.. _deepspeed-args:
Argument Parsing
----------------
DeepSpeed uses the `argparse <https://docs.python.org/3/library/argparse.html>`_ library to
supply commandline configuration to the DeepSpeed runtime. Use ``deepspeed.add_config_arguments()``
to add DeepSpeed's builtin arguments to your application's parser.
.. code-block:: python

    parser = argparse.ArgumentParser(description='My training script.')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')
    # Include DeepSpeed configuration arguments
    parser = deepspeed.add_config_arguments(parser)
    cmd_args = parser.parse_args()
.. autofunction:: deepspeed.add_config_arguments
.. _deepspeed-init:
Training Initialization
-----------------------
The entrypoint for all training with DeepSpeed is ``deepspeed.initialize()``.
Example usage:
...
Model Checkpointing
===================
DeepSpeed provides routines for checkpointing model state during training.
Loading Training Checkpoints
----------------------------
.. autofunction:: deepspeed.DeepSpeedLight.load_checkpoint
Saving Training Checkpoints
---------------------------
.. autofunction:: deepspeed.DeepSpeedLight.save_checkpoint
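
A brief usage sketch (the directory and tag are placeholder values; ``model_engine`` is the engine returned by ``deepspeed.initialize``):

.. code-block:: python

    # save model and optimizer state under a named tag
    model_engine.save_checkpoint('checkpoints', 'step_1000')

    # later: restore from the same directory and tag
    load_path, client_state = model_engine.load_checkpoint('checkpoints', 'step_1000')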
Training API
============
:func:`deepspeed.initialize` returns a *model engine* in its first argument
of type ``DeepSpeedLight``. This engine is used to progress training:
.. code-block:: python

    for step, batch in enumerate(data_loader):
        # forward() method
        loss = model_engine(batch)

        # runs backpropagation
        model_engine.backward(loss)

        # weight update
        model_engine.step()
Forward Propagation
-------------------
.. autofunction:: deepspeed.DeepSpeedLight.forward
Backward Propagation
--------------------
.. autofunction:: deepspeed.DeepSpeedLight.backward
Optimizer Step
--------------
.. autofunction:: deepspeed.DeepSpeedLight.step