Unverified Commit f2ac7eaf authored by Jeff Rasley, committed by GitHub

ZeRO-2 (#217)



Updates for ZeRO stage 2 + ZeRO stage 1 with reduce scatter (RS)
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
parent c61e23b4
@@ -102,15 +102,8 @@ Example of ***scheduler***
| ------------------------------------------------------------ | ------- |
| Enable sparse compression of [torch.nn.Embedding](https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding) gradients. | `false` |
### FP16 training options
***zero\_optimization***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |
***fp16***: [dictionary]
| Description | Default |
@@ -172,6 +165,66 @@ Example of ***scheduler***
| ----------------------------------- | ------- |
| Enable gradient clipping with value | `0` |
### ZeRO Optimizations for FP16 Training
Enabling and configuring ZeRO memory optimizations
```json
"zero_optimization": {
"stage": [0|1|2],
"allgather_partitions": [true|false],
"allgather_bucket_size": 500000000,
"reduce_scatter": [true|false],
"reduce_bucket_size": 500000000,
"contiguous_gradients" : [true|false]
}
```
***zero\_optimization***: [dictionary]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. | `false` |
***stage***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Chooses different stages of the ZeRO Optimizer. Stage 0, 1, and 2 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitioning, respectively. | `0` |
***allgather_partitions***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step | `true` |
***allgather_bucket_size***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `500000000` |
***reduce_scatter***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Uses reduce or reduce scatter instead of allreduce to average gradients | `true` |
***reduce_bucket_size***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Number of elements reduced/allreduced at a time. Limits the memory required for allreduce for large model sizes | `500000000` |
***contiguous_gradients***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass. Only useful when running very large models. | `false` |
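For reference, a concrete configuration that enables ZeRO stage 2 using only the fields documented above might look like the following sketch (the bucket sizes shown are the defaults and can be tuned for your model):

```json
"zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true
}
```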
### Logging
***steps\_per\_print***: [integer]
@@ -191,3 +244,52 @@ Example of ***scheduler***
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Print out state information of DeepSpeed object after initialization | `false` |
### Activation Checkpointing
```json
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
```
***partition\_activations***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Enables partitioning of activations when used with model parallelism | `false` |
***cpu\_checkpointing***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Offloads partitioned activations to CPU if partition_activations is enabled | `false` |
***contiguous\_memory\_optimization***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Copies partitioned activations so that they are contiguous in memory | `false` |
***number_checkpoints***: [integer]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Total number of activation checkpoints used to allocate memory buffer for contiguous_memory_optimization | `None` |
***synchronize\_checkpoint\_boundary***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Inserts torch.cuda.synchronize() at each checkpoint boundary. | `false` |
***profile***: [boolean]
| Description | Default |
| ------------------------------------------------------------ | ------- |
| Logs the forward and backward time for each checkpoint function | `false` |
@@ -57,19 +57,33 @@ DeepSpeed is fully compatible with [Megatron](https://github.com/NVIDIA/Megatron
Please see the [Megatron-LM tutorial](/tutorials/megatron/) for details.
## Memory and Bandwidth Optimizations
### The Zero Redundancy Optimizer (ZeRO)
[ZeRO](https://arxiv.org/abs/1910.02054) is at the heart of DeepSpeed and
enables large model training at a scale that is simply not possible with model
parallelism alone. When enabled, ZeRO allows training models with
over 6 billion parameters without any model parallelism, and up to 100 billion
parameter models with model parallelism on current generation hardware.
## The Zero Redundancy Optimizer
The Zero Redundancy Optimizer ([ZeRO](https://arxiv.org/abs/1910.02054)) is at
the heart of DeepSpeed and enables large model training at a scale that is
simply not possible with model parallelism alone. When enabled, ZeRO allows
training models with over 13 billion parameters without any model parallelism,
and up to 200 billion parameter models with model parallelism on current
generation hardware.
For more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054), [GPT
tutorial](/tutorials/megatron/) on integration with
DeepSpeed. Additional tutorials including *BERT Tutorial*: Coming Soon.
DeepSpeed.
### Optimizer State and Gradient Partitioning
Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the
model states (optimizer states, gradients, and parameters) by 8x compared to standard
data parallelism by partitioning these states across data parallel processes instead of
replicating them.
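As a minimal sketch, the partitioning level is selected through the `stage` field of the `zero_optimization` section described in the configuration documentation; stage 2 enables both optimizer state and gradient partitioning:

```json
"zero_optimization": {
    "stage": 2
}
```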
### Activation Partitioning
Activation Partitioning is a memory optimization in ZeRO that can reduce the memory
consumed by activations during model parallel training (MP). In MP, certain
activations may be required by all MP processes, resulting in a replication of
activations across MP GPUs. Activation Partitioning stores these activations in a
partitioned state once they are used for computation in the forward propagation. These
activations are allgathered right before they are needed again during the backward propagation.
By storing activations in a partitioned state, ZeRO in DeepSpeed can reduce the activation
memory footprint in proportion to the MP degree.
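Activation Partitioning is controlled through the `activation_checkpointing` section of the DeepSpeed config described in the configuration documentation; a minimal sketch that turns it on:

```json
"activation_checkpointing": {
    "partition_activations": true
}
```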
### Constant Buffer Optimization (CBO)
CBO enables high network and memory throughput while restricting memory usage to a
@@ -80,6 +94,17 @@ unnecessary memory overhead. CBO in DeepSpeed fuses smaller operands into approx
pre-defined sized buffer large enough to achieve great performance without the
unnecessary memory overhead.
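Under ZeRO, the size of these fused communication buffers is bounded by the bucket-size fields of the `zero_optimization` section; this mapping is our reading of the configuration documentation above, and the values shown are the documented defaults:

```json
"zero_optimization": {
    "allgather_bucket_size": 500000000,
    "reduce_bucket_size": 500000000
}
```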
### Contiguous Memory Optimization (CMO)
CMO reduces memory fragmentation during training, preventing out-of-memory errors
due to lack of contiguous memory. Memory fragmentation is a result of interleaving between
short-lived and long-lived memory objects. During the forward propagation, activation
checkpoints are long lived but the activations that are recomputed are short lived. Similarly,
during the backward computation, the activation gradients are short lived while the parameter
gradients are long lived. CMO transfers activation checkpoints and parameter gradients
to contiguous buffers, preventing memory fragmentation.
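As a sketch, the relevant knobs are `contiguous_gradients` in the `zero_optimization` section (for parameter gradients) and `contiguous_memory_optimization` in the `activation_checkpointing` section (for activation checkpoints); the `number_checkpoints` value below is illustrative:

```json
"zero_optimization": {
    "contiguous_gradients": true
},
"activation_checkpointing": {
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4
}
```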
## Additional Memory and Bandwidth Optimizations
### Smart Gradient Accumulation
Gradient accumulation allows running larger batch sizes with limited memory by breaking an
effective batch into several sequential micro-batches, and averaging the parameter
@@ -90,6 +115,11 @@ averaged gradients for the effective batch across all GPUs. This strategy signif
reduces the communication involved over the approach of averaging globally for each
micro-batch, especially when the number of micro-batches per effective batch is large.
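Gradient accumulation itself is configured through the standard DeepSpeed batch-size fields; in the illustrative sketch below, an effective batch of 256 is split into 4 micro-batches of 8 per GPU across 8 data-parallel GPUs (256 = 8 × 4 × 8):

```json
{
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4
}
```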
### Communication Overlapping
During back propagation, DeepSpeed can overlap the communication required for averaging
parameter gradients that have already been computed with the ongoing gradient computation.
This computation-communication overlap allows DeepSpeed to achieve higher throughput even
at modest batch sizes.
## Training Features
@@ -100,12 +130,23 @@ The DeepSpeed core API consists of just a handful of methods:
* argument parsing: `add_config_arguments`
* checkpointing: `load_checkpoint` and `store_checkpoint`
DeepSpeed supports all the features described in this document, via the use of these APIs,
DeepSpeed supports most of the features described in this document, via the use of these APIs,
along with a `deepspeed_config` JSON file for enabling and disabling the features.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
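For illustration, a minimal `deepspeed_config` JSON that enables a few of the features described in this document might look like the following sketch; all values are illustrative:

```json
{
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2
    },
    "gradient_clipping": 1.0
}
```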
### Activation Checkpointing API
DeepSpeed's Activation Checkpointing API supports activation checkpoint partitioning,
CPU checkpointing, and contiguous memory optimizations, while also allowing layerwise
profiling. Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
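These capabilities map onto the `activation_checkpointing` section of the DeepSpeed config shown in the configuration documentation; a sketch that enables all of them (the `number_checkpoints` value is illustrative, and `cpu_checkpointing` only takes effect when `partition_activations` is enabled):

```json
"activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "number_checkpoints": 4,
    "profile": true
}
```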
### Gradient Clipping
```json
{
"gradient_clipping": 1.0
}
```
DeepSpeed handles gradient clipping under the hood based on the max gradient norm
specified by the user.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
@@ -136,8 +177,8 @@ DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Opt
For more details on LAMB, see the [LAMB paper](https://arxiv.org/pdf/1904.00962.pdf).
### Memory-Efficient Training with ZeRO Optimizer
DeepSpeed can train models with up to 6 billion parameters without parallelism, and
models with up to 100 billion parameters with 16-way model parallelism. This leap in
DeepSpeed can train models with up to 13 billion parameters without parallelism, and
models with up to 200 billion parameters with 16-way model parallelism. This leap in
model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For
more details see the [ZeRO paper](https://arxiv.org/abs/1910.02054).
@@ -174,6 +215,10 @@ file.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
```json
{
"wall_clock_breakdown": true
"wall_clock_breakdown": true,
"activation_checkpointing": {
"profile": true
}
}
```
---
title: "ZeRO stage 2"
sneak_preview: true
excerpt: "Reduce memory footprint to enable training 10B models without model parallelism!"
---
* Reduce memory footprint of gradients
* Train larger models: e.g., 10B parameters on 32 GPUs without model parallelism
* Train larger batch sizes
## Further updates coming soon!
---
layout: single
title: "DeepSpeed optimizes transformer kernels to achieve world's fastest BERT training record: 44 minutes on 1024 NVIDIA V100 GPUs"
excerpt: ""
categories: news
new_post: true
date: 2020-05-19 00:00:00
---
We introduce new technology to accelerate single GPU performance via
kernel optimizations. These optimizations not only create a strong
foundation for scaling out large models, but also improve the single GPU
performance of highly tuned and moderately sized models like BERT by more
than 30%, reaching a staggering performance of 66 teraflops per V100 GPU,
which is 52% of the hardware peak. **Using these optimizations as the building
block, DeepSpeed achieves the fastest BERT training record: 44 minutes on
1,024 NVIDIA V100 GPUs**, compared with the best published result
of 67 minutes on the same number and generation of GPUs.
**Code and tutorials are coming soon!**
---
layout: single
title: "ZeRO-2 empowers training models as large as 170 billion parameters up to 10x faster compared to state-of-the-art"
excerpt: ""
categories: news
new_post: true
date: 2020-05-19 01:00:00
---
ZeRO-2 expands the scope of memory optimizations in the original ZeRO by
tackling the full spectrum of memory consumption during training. More
specifically, ZeRO-2 introduces new technology to reduce the memory footprint
of gradients, activation memory, and fragmented memory, in addition to
optimizer state memory optimization in the original ZeRO. Altogether, the
memory savings empower DeepSpeed to improve the scale and speed of deep
learning training by an order of magnitude. More concretely, ZeRO-2 allows
training models as large as 170 billion parameters up to 10x faster compared
to state of the art.
For more information on using ZeRO-2, see the [Megatron tutorial](/tutorials/megatron/).
For a technical deep dive, see our [technical report](https://arxiv.org/abs/1910.02054).