ZeRO leverages the aggregate computation and memory resources of data parallelism to reduce the memory and compute requirements of each device (GPU) used for model training.
* **Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states (see the sketch below).
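To make the stage 2 partitioning concrete, here is a minimal, framework-free Python sketch of the idea rather than DeepSpeed's implementation; the data parallel degree, gradient length, and the `shard_for_rank` helper are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of ZeRO stage 2 style gradient partitioning (not DeepSpeed's code).
# Each data parallel rank holds a full local gradient; after the reduction (averaging)
# step, a rank keeps only the shard matching its slice of the optimizer states.

world_size = 4                      # assumed data parallel degree
num_params = 16                     # tiny stand-in for the flattened gradient length
rng = np.random.default_rng(0)

# Local (unreduced) gradients, one full-length vector per rank.
local_grads = [rng.standard_normal(num_params) for _ in range(world_size)]

# Equivalent of the reduce step: average gradients across ranks.
reduced = sum(local_grads) / world_size

# Stage 2: instead of every rank storing `reduced` in full, rank r retains only
# the contiguous shard covering its partition of the optimizer states.
def shard_for_rank(rank: int, vector: np.ndarray, world: int) -> np.ndarray:
    shard_len = len(vector) // world
    return vector[rank * shard_len:(rank + 1) * shard_len]

shards = [shard_for_rank(r, reduced, world_size) for r in range(world_size)]
for r, shard in enumerate(shards):
    print(f"rank {r} keeps {shard.size} of {num_params} reduced gradient values")
```

In practice DeepSpeed performs the reduction and partitioning together, for example via bucketed reduce-scatter collectives when `reduce_scatter` is enabled, so no rank needs to hold the full reduced gradient.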
## Training environment
We use the DeepSpeed [Megatron-LM](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) GPT-2 code for this exercise. You can step through the Megatron-LM [tutorial](/tutorials/megatron/) to familiarize yourself with the code. We will train the models in this tutorial on [NVIDIA Tesla V100-SXM3 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/v100/) with 32GB RAM.
## Enabling ZeRO Optimization
To enable ZeRO optimizations for a DeepSpeed model, we simply add the **_zero_optimization_** key to the DeepSpeed json configuration. A full description of configuration knobs of the **zero_optimization** key is available [here](/docs/config-json/#zero-optimizations-for-fp16-training).
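For orientation, the sketch below shows one common way such a json configuration reaches DeepSpeed from a training script; the toy `model`, the batch size, and the fp16 setting are placeholder assumptions, and depending on the DeepSpeed version the configuration may instead be supplied as a json file path through the `--deepspeed_config` launcher argument.

```python
import torch
import deepspeed

# Placeholder model used purely for illustration; in this tutorial the actual model
# comes from the Megatron-LM GPT-2 code.
model = torch.nn.Linear(1024, 1024)

# The same configuration that would otherwise live in the DeepSpeed json file.
ds_config = {
    "train_batch_size": 8,            # assumed value for this sketch
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8
    }
}

# deepspeed.initialize wraps the model in an engine that applies the configured
# ZeRO optimizations. Run under the deepspeed launcher, e.g. `deepspeed train.py`.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```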
...
...
A key reason why this model does not fit in GPU memory is that the Adam optimizer states for the model consume a large share of the 32GB of GPU memory; ZeRO stage 1 partitions these optimizer states across the data parallel processes so that each GPU holds only a fraction of them. To enable ZeRO stage 1, we update the DeepSpeed json configuration as shown below:

```json
{
"zero_optimization":{
"stage":1,
"reduce_bucket_size":500000000
"reduce_bucket_size":5e8
}
}
```
As seen above, we set two fields in the **zero_optimization** key. Specifically we set the _stage_ field to 1, and the optional _reduce_bucket_size_ for gradient reduction to 500M. With ZeRO stage 1 enabled, the model can now train smoothly on 8 GPUs without running out of memory. Below we provide some screenshots of the model training:
From the nvidia-smi screenshot above we can see that only GPUs 6-7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to increase the model size and/or the batch size. In contrast, such benefits are not possible with data parallelism alone.
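As a rough illustration of where these savings come from, the short calculation below estimates the Adam optimizer state footprint using the commonly cited 12 bytes per parameter for mixed-precision Adam (fp32 master weights, momentum, and variance) and divides it by the data parallel degree as ZeRO stage 1 does; the byte-per-parameter figure and the chosen degrees are assumptions, so treat the numbers as estimates rather than measurements.

```python
# Back-of-the-envelope estimate of ZeRO stage 1 savings for a 1.5B parameter model.
# Assumes ~12 bytes of Adam optimizer state per parameter in mixed precision
# (4B fp32 master weight + 4B momentum + 4B variance); actual usage will vary.

params = 1.5e9
opt_bytes_per_param = 12
GB = 1e9

total_opt_state_gb = params * opt_bytes_per_param / GB
print(f"Adam optimizer states, unpartitioned: {total_opt_state_gb:.1f} GB on every GPU")

for dp_degree in (8, 16, 32):        # illustrative data parallel degrees
    per_gpu_gb = total_opt_state_gb / dp_degree
    print(f"ZeRO stage 1, data parallel degree {dp_degree}: {per_gpu_gb:.2f} GB per GPU")
```

Plain data parallelism would replicate the full optimizer state on every device, which is why only the partitioned configurations benefit from adding GPUs.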
### Training a 10B Parameter GPT-2 model
ZeRO stage 2 optimizations further increase the size of models that can be trained using data parallelism. We show this by training a model with 10B parameters using 32 V100 GPUs.
First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
```bash
--model-parallel-size 1 \
...
...
```

Next, we need to update the DeepSpeed json configuration, as shown below, to enable ZeRO stage 2 optimizations:

```json
{
"zero_optimization":{
"stage":2,
"contiguous_gradients":true,
"overlap_comm":true,
"reduce_scatter":true,
"reduce_bucket_size":50000000,
"allgather_bucket_size":500000000
"reduce_bucket_size":5e8,
"allgather_bucket_size":5e8
}
}
```
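To see why stage 2 matters for the 10B parameter run on 32 GPUs, the sketch below applies the per-GPU model-state accounting from the ZeRO paper (2 bytes per parameter each for fp16 weights and gradients, roughly 12 bytes per parameter for mixed-precision Adam states), with stage 1 partitioning the optimizer states and stage 2 additionally partitioning the gradients across the data parallel degree; activation memory is ignored here because activation checkpointing keeps it comparatively small, so these are estimates, not measurements.

```python
# Per-GPU model-state memory estimate in the style of the ZeRO paper's accounting.
# fp16 parameters and gradients cost 2 bytes/parameter each; mixed-precision Adam
# states cost ~12 bytes/parameter. Stage 1 partitions the optimizer states and
# stage 2 additionally partitions the gradients across the data parallel ranks.
# Activations are excluded.

def per_gpu_model_state_gb(num_params: float, dp_degree: int, stage: int) -> float:
    GB = 1e9
    params_mem = 2 * num_params            # fp16 weights, replicated in stages 1-2
    grads_mem = 2 * num_params             # fp16 gradients
    optim_mem = 12 * num_params            # fp32 master weights + Adam moments
    if stage >= 1:
        optim_mem /= dp_degree
    if stage >= 2:
        grads_mem /= dp_degree
    return (params_mem + grads_mem + optim_mem) / GB

num_params, dp_degree = 10e9, 32
for stage in (0, 1, 2):
    gb = per_gpu_model_state_gb(num_params, dp_degree, stage)
    print(f"ZeRO stage {stage}: ~{gb:.1f} GB of model states per GPU")
```

Only the stage 2 estimate comes in under the 32GB available on each V100 in this setup, which is consistent with needing stage 2 rather than stage 1 to train the 10B parameter model with data parallelism alone.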
...
...
Here is a screenshot of the training log: