Unverified commit 7b0bee0b authored by Stas Bekman, committed by GitHub

[tutorials] typos (#676)


Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
parent 82cecf69
@@ -229,15 +229,15 @@ Example of ***scheduler***
### ZeRO Optimizations for FP16 Training
-Enabling and configure ZeRO memory optimizations
+Enabling and configuring ZeRO memory optimizations
```json
"zero_optimization": {
"stage": [0|1|2],
"allgather_partitions": [true|false],
"allgather_bucket_size": 500000000,
"allgather_bucket_size": 5e8,
"overlap_comm": false,
"reduce_scatter": [true|false],
"reduce_bucket_size": 500000000,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : [true|false],
"cpu_offload": [true|false]
}
@@ -265,7 +265,7 @@ Enabling and configure ZeRO memory optimizations
| Description | Default |
| ------------------------------------------------------------ | ------- |
-| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `500000000` |
+| Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes | `5e8` |
***overlap_comm***: [boolean]
@@ -283,7 +283,7 @@ Enabling and configure ZeRO memory optimizations
| Description | Default |
| ------------------------------------------------------------ | ------- |
-| Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes | `500000000` |
+| Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes | `5e8` |
***contiguous_gradients***: [boolean]
@@ -13,7 +13,7 @@ ZeRO leverages the aggregate computation and memory resources of data parallelis
* **Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
## Training environment
-We use the DeepSpeed [Megatrom-LM](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) GPT-2 code for this exercise. You can step through the Megatron-LM [tutorial](/tutorials/megatron/) to familiarize yourself with the code. We will train the models in this tutorial on [NVIDIA Tesla V100-SXM3 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/v100/) with 32GB RAM.
+We use the DeepSpeed [Megatron-LM](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) GPT-2 code for this exercise. You can step through the Megatron-LM [tutorial](/tutorials/megatron/) to familiarize yourself with the code. We will train the models in this tutorial on [NVIDIA Tesla V100-SXM3 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/v100/) with 32GB RAM.
## Enabling ZeRO Optimization
To enable ZeRO optimizations for a DeepSpeed model, we simply add the **_zero_optimization_** key to the DeepSpeed json configuration. A full description of configuration knobs of the **zero_optimization** key is available [here](/docs/config-json/#zero-optimizations-for-fp16-training).
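As a rough sketch of how such a JSON configuration reaches the training run, the command below passes it through the DeepSpeed launcher; the config file name is a placeholder, `pretrain_gpt2.py` assumes the Megatron-LM example's entry point, and `--deepspeed`/`--deepspeed_config` are the standard flags the example script accepts.

```bash
# Minimal sketch: launch the Megatron-LM GPT-2 script with a DeepSpeed JSON
# config that contains the "zero_optimization" key. The config file name is
# a placeholder; the remaining model/data arguments are omitted.
deepspeed --num_gpus=8 pretrain_gpt2.py \
    --deepspeed \
    --deepspeed_config ds_zero_config.json
```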
@@ -39,20 +39,22 @@ A key reason why this model does not fit in GPU memory is that the Adam optimize
{
"zero_optimization": {
"stage":1,
"reduce_bucket_size": 500000000
"reduce_bucket_size": 5e8
}
}
```
-As seen above, we set two fields in the **zero_optimization** key. Specifically we set the _stage_ field to 1, and the optional _reduce_bucket_size_ for gradient reduction to 50M. With ZeRO stage 1 enabled, the model can now train smoothly on 8 GPUs without running out of memory. Below we provide some screenshots of the model training:
+As seen above, we set two fields in the **zero_optimization** key. Specifically we set the _stage_ field to 1, and the optional _reduce_bucket_size_ for gradient reduction to 500M. With ZeRO stage 1 enabled, the model can now train smoothly on 8 GPUs without running out of memory. Below we provide some screenshots of the model training:
![ZERO1_DP8_1.5B_LOG](/assets/images/zero1_dp8_1.5B_log.png)
![ZERO1_DP8_1.5B_SMI](/assets/images/zero1_dp8_1.5B_smi.png)
-From the nvidia-smi screenshot above we can see that that only GPUs 0--7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to either increase model size and/or batch size. In contrast, such benefits are not possible with data parallelism alone.
+From the nvidia-smi screenshot above we can see that only GPUs 6-7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to either increase model size and/or batch size. In contrast, such benefits are not possible with data parallelism alone.
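As a hypothetical illustration of raising the data parallelism degree, the sketch below simply hands the launcher more devices; `--num_nodes` and `--num_gpus` are standard DeepSpeed launcher options, and the script and config names are the same placeholders used earlier.

```bash
# Sketch: raise the data-parallel degree from 8 to 16 by launching on two
# nodes with 8 GPUs each; the training script and ZeRO config stay the same.
deepspeed --num_nodes=2 --num_gpus=8 pretrain_gpt2.py \
    --deepspeed \
    --deepspeed_config ds_zero_config.json
```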
### Training a 10B Parameter GPT-2 model
-ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
+ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this by training a model with 10B parameters using 32 V100 GPUs.
+First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
```bash
--model-parallel-size 1 \
@@ -73,8 +75,8 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 50000000,
"allgather_bucket_size": 500000000
"reduce_bucket_size": 5e8,
"allgather_bucket_size": 5e8
}
}
```
@@ -85,7 +87,7 @@ Here is a screenshot of the training log:
![ZERO2_DP32_10B_LOG](/assets/images/zero2_dp32_10B_log.png)
-Here is a screenshot of nvidia-smi show GPU activity during training:
+Here is a screenshot of nvidia-smi showing GPU activity during training:
![ZERO2_DP32_10B_SMI](/assets/images/zero2_dp32_10B_smi.png)