Unverified Commit 0ad4fd88 authored by Samyam Rajbhandari, committed by GitHub

Update zero.md tutorial (#495)



* Update zero.md

Update the ZeRO tutorial to specify the use of activation checkpointing

* Update zero-offload.md

Use activation checkpointing with ZeRO-Offload
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
parent eea1c285
@@ -15,7 +15,7 @@ For this tutorial, we will configure a 10 billion parameter GPT-2 model using th
 We need to make changes to the Megatron-LM launch script and to the DeepSpeed configuration json.
 ### Megatron-LM GPT-2 launch script changes
-We need to apply two changes to the launch script for the DeepSpeed Megatron-LM GPT-2 model. The first change is to configure a 10B parameter GPT-2 model, which can be achieved by the following set of changes:
+We need to apply two changes to the launch script for the DeepSpeed Megatron-LM GPT-2 model. The first change is to configure a 10B parameter GPT-2 model with activation checkpointing enabled, which can be achieved by the following set of changes:
 ```bash
 --model-parallel-size 1 \
@@ -23,9 +23,9 @@ We need to apply two changes to the launch script for the DeepSpeed Megatron-LM
 --hidden-size 4096 \
 --num-attention-heads 32 \
 --batch-size 10 \
---d \
 --deepspeed_config ds_zero_offload.config \
 --cpu_optimizer \
+--checkpoint-activations
 ```
 Most of the flags in the changes above should be familiar if you have stepped through the Megatron-LM [tutorial](/tutorials/megatron/), except for the **_--cpu_optimizer_**. This flag informs the model script to pass a CPU-based Adam optimizer, rather than a GPU-based one, to DeepSpeed as the client optimizer. It is very important that this flag be used when training with ZeRO-Offload to ensure correct operation of the DeepSpeed engine.
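As an aside, here is a minimal sketch of how a model script might select a CPU-based client optimizer when such a flag is set. This is not the actual Megatron-LM code; the `build_optimizer` helper, the `args` fields, and the `model` object are assumptions made purely for illustration.

```python
# Hypothetical sketch: choosing a CPU-based Adam as the client optimizer for
# ZeRO-Offload. `args` and `model` stand in for the real training setup; this
# is not the Megatron-LM implementation of --cpu_optimizer.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam


def build_optimizer(args, model):
    params = [p for p in model.parameters() if p.requires_grad]
    if args.cpu_optimizer:
        # Optimizer states live in host memory and the optimizer step runs on
        # the CPU, matching what ZeRO-Offload expects from the client optimizer.
        return DeepSpeedCPUAdam(params, lr=args.lr)
    # Otherwise fall back to a regular GPU-based Adam.
    return torch.optim.Adam(params, lr=args.lr)


# The client optimizer is then handed to DeepSpeed rather than stepped directly:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     args=args, model=model, optimizer=build_optimizer(args, model))
```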
...
@@ -27,7 +27,6 @@ We demonstrate the benefits of ZeRO stage 1 by showing that it enables data para
 --hidden-size 1600 \
 --num-attention-heads 16 \
 --batch-size 1 \
---d \
 --deepspeed_config ds_zero_stage_1.config \
 ```
@@ -53,7 +52,7 @@ As seen above, we set two fields in the **zero_optimization** key. Specifically
 From the nvidia-smi screenshot above we can see that only GPUs 0--7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to increase the model size and/or the batch size. In contrast, such benefits are not possible with data parallelism alone.
 ### Training a 10B Parameter GPT-2 model
-ZeRO stage 2 optimizations further increase the size of models that can be trained using data parallelism. We show this by training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
+ZeRO stage 2 optimizations further increase the size of models that can be trained using data parallelism. We show this by training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
 ```bash
 --model-parallel-size 1 \
@@ -61,8 +60,8 @@ ZeRO stage 2 optimizations further increases the size of models that can be trai
 --hidden-size 4096 \
 --num-attention-heads 32 \
 --batch-size 1 \
---d \
 --deepspeed_config ds_zero_stage_2.config \
+--checkpoint-activations
 ```
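The `--checkpoint-activations` flag added above turns on activation checkpointing inside the model: instead of keeping every layer's activations for the backward pass, they are recomputed when needed. Below is a generic sketch of the idea using `torch.utils.checkpoint`; it is not the Megatron-LM implementation, and the layer class and sizes are invented for illustration.

```python
# Generic sketch of activation checkpointing with PyTorch. Megatron-LM's
# --checkpoint-activations flag enables the same idea inside its own layers.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """A stand-in transformer-style block (invented for illustration)."""
    def __init__(self, hidden):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )

    def forward(self, x):
        return x + self.ff(x)


class TinyModel(nn.Module):
    def __init__(self, hidden=1024, layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(hidden) for _ in range(layers))

    def forward(self, x):
        for block in self.blocks:
            # Do not store the activations produced inside `block`; recompute
            # them during backward. This trades extra compute for less memory.
            x = checkpoint(block, x)
        return x


x = torch.randn(2, 16, 1024, requires_grad=True)
TinyModel()(x).sum().backward()
```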
 Next, we need to update the DeepSpeed json configuration, as shown below, to enable ZeRO stage 2 optimizations:
@@ -80,7 +79,7 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
 }
 ```
-In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmentation during the backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now run the launch the training run.
+In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmentation during the backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now launch the training run.
 Here is a screenshot of the training log:
...
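For reference, the ZeRO stage 2 portion of such a DeepSpeed configuration might look like the sketch below, written as the Python-dict equivalent of the json file passed via `--deepspeed_config`. The keys are real DeepSpeed options; the numeric values are illustrative placeholders, not necessarily the tutorial's exact settings.

```python
# Illustrative sketch of a ZeRO stage 2 DeepSpeed configuration, written as a
# Python dict; the json file referenced by --deepspeed_config carries the same
# structure. The values here are placeholders.
ds_config = {
    "train_batch_size": 32,            # placeholder global batch size
    "fp16": {
        "enabled": True                # mixed-precision training
    },
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states and gradients
        "contiguous_gradients": True,  # reduce memory fragmentation in the backward pass
        "overlap_comm": True,          # overlap gradient reduction with backward compute
        "reduce_scatter": True,        # use reduce-scatter instead of all-reduce
        "reduce_bucket_size": 50000000,
        "allgather_bucket_size": 500000000
    }
}
```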