"tutorials/basics/1_first.py" did not exist on "cab1fdf2ec8bb5b281db804dc8f5d282b653d5f8"
Unverified Commit 7b0bee0b authored by Stas Bekman, committed by GitHub

[tutorials] typos (#676)


Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
parent 82cecf69
@@ -229,15 +229,15 @@ Example of ***scheduler***
 ### ZeRO Optimizations for FP16 Training
-Enabling and configure ZeRO memory optimizations
+Enabling and configuring ZeRO memory optimizations
 ```json
 "zero_optimization": {
   "stage": [0|1|2],
   "allgather_partitions": [true|false],
-  "allgather_bucket_size": 500000000,
+  "allgather_bucket_size": 5e8,
   "overlap_comm": false,
   "reduce_scatter": [true|false],
-  "reduce_bucket_size": 500000000,
+  "reduce_bucket_size": 5e8,
   "contiguous_gradients" : [true|false],
   "cpu_offload": [true|false]
 }
@@ -265,7 +265,7 @@ Enabling and configure ZeRO memory optimizations
 | Description | Default |
 | ------------------------------------------------------------ | ------- |
-| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `500000000` |
+| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `5e8` |
 ***overlap_comm***: [boolean]
@@ -283,7 +283,7 @@ Enabling and configure ZeRO memory optimizations
 | Description | Default |
 | ------------------------------------------------------------ | ------- |
-| Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes | `500000000` |
+| Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes | `5e8` |
 ***contiguous_gradients***: [boolean]
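To make the schema above easier to try out, here is a small illustrative Python sketch (not part of this commit) that fills the documented `zero_optimization` keys with concrete values and writes them to a DeepSpeed config file. The file name `ds_config.json`, the `train_batch_size`, the optimizer block, and the specific values chosen are assumptions for the example.

```python
import json

# Illustrative values for the zero_optimization keys documented above.
# train_batch_size, the optimizer block, and the specific numbers are
# assumptions for this sketch, not values taken from the commit.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1.5e-4}},
    "zero_optimization": {
        "stage": 2,                         # 0 disables ZeRO; 1 partitions optimizer states; 2 adds gradients
        "allgather_partitions": True,
        "allgather_bucket_size": int(5e8),  # elements per allgather; the documented default
        "overlap_comm": False,
        "reduce_scatter": True,
        "reduce_bucket_size": int(5e8),     # elements per reduce/allreduce; the documented default
        "contiguous_gradients": True,
        "cpu_offload": False,
    },
}

# Written to disk so it can be referenced via --deepspeed_config.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The `5e8` bucket sizes mirror the 500M-element defaults listed in the tables above; smaller buckets trade communication efficiency for lower peak memory.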
@@ -13,7 +13,7 @@ ZeRO leverages the aggregate computation and memory resources of data parallelis
 * **Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
 ## Training environment
-We use the DeepSpeed [Megatrom-LM](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) GPT-2 code for this exercise. You can step through the Megatron-LM [tutorial](/tutorials/megatron/) to familiarize yourself with the code. We will train the models in this tutorial on [NVIDIA Tesla V100-SXM3 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/v100/) with 32GB RAM.
+We use the DeepSpeed [Megatron-LM](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) GPT-2 code for this exercise. You can step through the Megatron-LM [tutorial](/tutorials/megatron/) to familiarize yourself with the code. We will train the models in this tutorial on [NVIDIA Tesla V100-SXM3 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/v100/) with 32GB RAM.
 ## Enabling ZeRO Optimization
 To enable ZeRO optimizations for a DeepSpeed model, we simply add the **_zero_optimization_** key to the DeepSpeed json configuration. A full description of configuration knobs of the **zero_optimization** key is available [here](/docs/config-json/#zero-optimizations-for-fp16-training).
@@ -39,20 +39,22 @@ A key reason why this model does not fit in GPU memory is that the Adam optimize
 {
   "zero_optimization": {
     "stage":1,
-    "reduce_bucket_size": 500000000
+    "reduce_bucket_size": 5e8
   }
 }
 ```
-As seen above, we set two fields in the **zero_optimization** key. Specifically we set the _stage_ field to 1, and the optional _reduce_bucket_size_ for gradient reduction to 50M. With ZeRO stage 1 enabled, the model can now train smoothly on 8 GPUs without running out of memory. Below we provide some screenshots of the model training:
+As seen above, we set two fields in the **zero_optimization** key. Specifically we set the _stage_ field to 1, and the optional _reduce_bucket_size_ for gradient reduction to 500M. With ZeRO stage 1 enabled, the model can now train smoothly on 8 GPUs without running out of memory. Below we provide some screenshots of the model training:
 ![ZERO1_DP8_1.5B_LOG](/assets/images/zero1_dp8_1.5B_log.png)
 ![ZERO1_DP8_1.5B_SMI](/assets/images/zero1_dp8_1.5B_smi.png)
-From the nvidia-smi screenshot above we can see that that only GPUs 0--7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to either increase model size and/or batch size. In contrast, such benefits are not possible with data parallelism alone.
+From the nvidia-smi screenshot above we can see that only GPUs 6-7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to either increase model size and/or batch size. In contrast, such benefits are not possible with data parallelism alone.
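To put rough numbers on the memory argument, here is a back-of-the-envelope sketch (an illustration added here, not part of the tutorial) using the common mixed-precision accounting of about 16 bytes of model state per parameter: 2 bytes for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 Adam states (parameter copy, momentum, variance). Activations, temporary buffers, and fragmentation are ignored.

```python
# Rough model-state memory per GPU for mixed-precision Adam training.
# Assumes 2 B fp16 weights + 2 B fp16 grads + 12 B fp32 Adam states per
# parameter; activations and temporary buffers are not counted.
GB = 1024 ** 3

def model_state_gb(num_params, dp_degree=1, zero_stage=0):
    weights_and_grads = num_params * 4.0   # fp16 weights + fp16 gradients
    optimizer_states = num_params * 12.0   # fp32 params + momentum + variance
    if zero_stage >= 1:                    # ZeRO-1: partition optimizer states
        optimizer_states /= dp_degree
    if zero_stage >= 2:                    # ZeRO-2: also partition gradients
        weights_and_grads = num_params * 2.0 + num_params * 2.0 / dp_degree
    return (weights_and_grads + optimizer_states) / GB

params = 1.5e9
print(f"plain DP     : {model_state_gb(params):.1f} GB/GPU")    # ~22 GB before activations
print(f"ZeRO-1, DP=8 : {model_state_gb(params, 8, 1):.1f} GB/GPU")
print(f"ZeRO-2, DP=8 : {model_state_gb(params, 8, 2):.1f} GB/GPU")
```

Under these assumptions the Adam states alone drop from roughly 16.8 GB to about 2.1 GB per GPU at data-parallel degree 8, which is consistent with the tutorial's observation that stage 1 lets the 1.5B model train on 32 GB GPUs.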
 ### Training a 10B Parameter GPT-2 model
-ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
+ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this by training a model with 10B parameters using 32 V100 GPUs.
+
+First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
 ```bash
 --model-parallel-size 1 \
@@ -73,8 +75,8 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
     "contiguous_gradients": true,
     "overlap_comm": true,
     "reduce_scatter": true,
-    "reduce_bucket_size": 50000000,
-    "allgather_bucket_size": 500000000
+    "reduce_bucket_size": 5e8,
+    "allgather_bucket_size": 5e8
   }
 }
 ```
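For completeness, here is a minimal sketch (not from this commit) of the script-side pattern that consumes such a JSON file, following DeepSpeed's usual argparse integration. The toy model and argument defaults are placeholders, the config is assumed to also define an optimizer (as in the earlier sketch), and the exact `deepspeed.initialize` signature can differ slightly between DeepSpeed releases.

```python
import argparse

import deepspeed
import torch

parser = argparse.ArgumentParser(description="Toy DeepSpeed ZeRO example")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="passed in by the deepspeed launcher")
# Adds the --deepspeed and --deepspeed_config arguments to the parser.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

# Placeholder model; any torch.nn.Module would do here.
model = torch.nn.Linear(1024, 1024)

# DeepSpeed reads the JSON passed via --deepspeed_config (including the
# zero_optimization block above) and wraps the model accordingly.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)
```

A launch along the lines of `deepspeed --num_gpus=8 toy_zero.py --deepspeed --deepspeed_config ds_config.json` (script name and GPU count are placeholders) would then pick up the ZeRO settings.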
@@ -85,7 +87,7 @@ Here is a screenshot of the training log:
 ![ZERO2_DP32_10B_LOG](/assets/images/zero2_dp32_10B_log.png)
-Here is a screenshot of nvidia-smi show GPU activity during training:
+Here is a screenshot of nvidia-smi showing GPU activity during training:
 ![ZERO2_DP32_10B_SMI](/assets/images/zero2_dp32_10B_smi.png)