Commit cb00a196 authored by Jared Casper

Merge branch 'main' into t5_pipeline_parallelism

parents 38a774e9 5ab64637
@@ -156,7 +156,7 @@ OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
-            --checkpoint-activations"
+            --activations-checkpoint-method uniform"

python pretrain_bert.py \
       $BERT_ARGS \
@@ -302,6 +302,15 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_<model>.py \
The interleaved pipelining schedule (more details in Section 2.2.2 of [our paper](https://arxiv.org/pdf/2104.04473.pdf)) can be enabled using the `--num-layers-per-virtual-pipeline-stage` argument, which controls the number of transformer layers in a virtual stage (by default with the non-interleaved schedule, each GPU will execute a single virtual stage with `NUM_LAYERS / PIPELINE_MP_SIZE` transformer layers). The total number of layers in the transformer model should be divisible by this argument value. Additionally, the number of microbatches in the pipeline (computed as `GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)`) should be divisible by the `PIPELINE_MP_SIZE` when using this schedule (this condition is checked in an assertion in the code). The interleaved schedule is not supported for pipelines with 2 stages (`PIPELINE_MP_SIZE=2`).
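For concreteness, here is a minimal Python sketch of these constraints (illustrative only, not part of the repository; all sizes below are made-up example values):

```python
# Illustrative check of the interleaved-schedule constraints described above.
# All values are example settings, not defaults.
NUM_LAYERS = 24
PIPELINE_MP_SIZE = 4
LAYERS_PER_VIRTUAL_STAGE = 3      # --num-layers-per-virtual-pipeline-stage
GLOBAL_BATCH_SIZE = 512
DATA_PARALLEL_SIZE = 8
MICRO_BATCH_SIZE = 4

# The total layer count must be divisible by the virtual-stage size.
assert NUM_LAYERS % LAYERS_PER_VIRTUAL_STAGE == 0

# The number of microbatches in the pipeline must be divisible by the
# pipeline-parallel size (checked by an assertion in the code).
num_microbatches = GLOBAL_BATCH_SIZE // (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)
assert num_microbatches % PIPELINE_MP_SIZE == 0

# The interleaved schedule is not supported with exactly 2 pipeline stages.
assert PIPELINE_MP_SIZE != 2
```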
## Activation Checkpointing and Recomputation
To reduce GPU memory usage when deploying a large model to a training system, we support activation checkpointing and recomputation. We use a Transformer layer as the unit of checkpointing because activations grow large in the middle of a Transformer layer, so checkpointing only the input of each Transformer layer is storage-efficient. We support two activation checkpointing methods: `uniform` and `block`.
The `uniform` method uniformly divides the Transformer layers into groups of layers and stores the input activations of each group in memory. The baseline group size is 1; in this case, the input activation of every Transformer layer is checkpointed. When GPU memory is insufficient, increasing the number of layers per group reduces memory usage and thus enables running a bigger model. For example, with a group size of 4, the input activation of each group of 4 Transformer layers is checkpointed.
The `block` method checkpoints the input activations of a set number of individual Transformer layers per pipeline stage and runs the remaining layers without any checkpointing. This method can be used to skip checkpointing some Transformer layers when there is unused GPU memory; checkpointing fewer layers avoids unnecessary activation recomputation in the backward pass and thus improves training performance. For example, if we specify 5 layers to checkpoint out of 8 layers per pipeline stage, the input activations of only the first 5 Transformer layers are checkpointed, and activation recomputation for the remaining 3 layers is not needed in the backward pass.
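The difference between the two methods can be sketched as follows (a hypothetical illustration, not the actual implementation; layer indices are 0-based):

```python
# Hypothetical sketch of which Transformer-layer inputs are checkpointed per
# pipeline stage under each method; not the actual Megatron implementation.

def uniform_checkpointed_layers(num_layers, layers_per_group=1):
    """Checkpoint the input of every group of `layers_per_group` layers."""
    assert num_layers % layers_per_group == 0
    return [layer for layer in range(num_layers) if layer % layers_per_group == 0]

def block_checkpointed_layers(num_layers, num_layers_to_checkpoint):
    """Checkpoint only the first `num_layers_to_checkpoint` layers; the rest
    keep their activations and need no recomputation in the backward pass."""
    return list(range(min(num_layers_to_checkpoint, num_layers)))

# With 8 Transformer layers per pipeline stage:
print(uniform_checkpointed_layers(8, 4))  # [0, 4] -> one checkpoint per group of 4 layers
print(block_checkpointed_layers(8, 5))    # [0, 1, 2, 3, 4] -> layers 5-7 skip recomputation
```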
## GPT-3 Example
In `examples/pretrain_gpt3_175B.sh` we have provided an example of how to configure Megatron to run [GPT-3](https://arxiv.org/abs/2005.14165) with 175 billion parameters on 1024 GPUs. The script is designed for [slurm](https://slurm.schedmd.com/documentation.html) with the [pyxis](https://github.com/NVIDIA/pyxis) plugin but can be easily adapted to any other scheduler. It uses 8-way tensor parallelism and 16-way pipeline parallelism. With the options `global-batch-size 1536` and `rampup-batch-size 16 16 5859375`, training will start with a global batch size of 16 and linearly increase the global batch size to 1536 over 5,859,375 samples in increments of 16. The training dataset can be either a single set or multiple datasets combined with a set of weights.
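As a rough illustration of the ramp-up arithmetic (a sketch only, assuming the schedule advances in equal sample intervals; the exact bookkeeping lives in the training code):

```python
# Rough sketch of the batch-size ramp-up implied by
# `--rampup-batch-size 16 16 5859375` with `--global-batch-size 1536`.
# Assumes equally spaced increments; the actual schedule is computed by Megatron.
START, INCREMENT, RAMP_SAMPLES = 16, 16, 5_859_375
FINAL = 1536

num_increments = (FINAL - START) // INCREMENT           # 95 increases of 16
samples_per_increment = RAMP_SAMPLES // num_increments  # ~61,677 samples each

def global_batch_size(consumed_samples):
    steps = min(consumed_samples // samples_per_increment, num_increments)
    return START + steps * INCREMENT

print(global_batch_size(0))          # 16 at the start of training
print(global_batch_size(3_000_000))  # 784, partway through the ramp
print(global_batch_size(6_000_000))  # 1536 once the ramp has completed
```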
@@ -345,7 +354,7 @@ python pretrain_ict.py \
        --max-position-embeddings 256 \
        --ict-head-size 128 \
        --train-iters 100000 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --bert-load /path/to/pretrained_bert \
        --load checkpoints \
        --save checkpoints \
@@ -375,7 +384,7 @@ python tools/create_doc_index.py \
        --ict-head-size 128 \
        --num-attention-heads 12 \
        --batch-size 128 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --seq-length 256 \
        --max-position-embeddings 256 \
        --ict-load /path/to/pretrained_ict \
@@ -482,7 +491,7 @@ python tasks/main.py \
        --merge-file $MERGE_FILE \
        --load $CHECKPOINT_PATH \
        --micro-batch-size 8 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --log-interval 10 \
        --no-load-optim \
        --no-load-rng
@@ -512,7 +521,7 @@ python tasks/main.py \
        --merge-file $MERGE_FILE \
        --load $CHECKPOINT_PATH \
        --micro-batch-size 8 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --log-interval 10 \
        --no-load-optim \
        --no-load-rng
@@ -542,7 +551,7 @@ COMMON_TASK_ARGS="--num-layers 24 \
COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
                      --valid-data $VALID_DATA \
                      --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
-                     --checkpoint-activations \
+                     --activations-checkpoint-method uniform \
                      --save-interval 10000 \
                      --save $CHECKPOINT_PATH \
                      --log-interval 100 \
...
@@ -20,7 +20,7 @@ python tasks/main.py \
        --num-attention-heads 12 \
        --tensor-model-parallel-size 1 \
        --micro-batch-size 128 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --seq-length 512 \
        --max-position-embeddings 512 \
        --load ${CHECKPOINT_PATH} \
...
@@ -29,7 +29,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --batch-size 8 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --log-interval 10 \
...
@@ -29,7 +29,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --micro-batch-size 8 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --lr 5.0e-5 \
        --lr-decay-style linear \
        --lr-warmup-fraction 0.065 \
...
@@ -29,7 +29,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --micro-batch-size 4 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --lr 1.0e-5 \
        --lr-decay-style linear \
        --lr-warmup-fraction 0.06 \
...
@@ -33,7 +33,7 @@ python pretrain_gpt.py \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --lr-warmup-fraction .01 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --log-interval 100 \
        --save-interval 10000 \
        --eval-interval 1000 \
...
@@ -49,7 +49,7 @@ options=" \
        --init-method-std 0.006 \
        --tensorboard-dir <TENSORBOARD DIRECTORY> \
        --fp16 \
-       --checkpoint-activations "
+       --activations-checkpoint-method uniform "

run_cmd="python -u ${DIR}/pretrain_gpt.py $@ ${options}"
...
@@ -40,7 +40,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --lr-warmup-fraction .01 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --log-interval 100 \
        --save-interval 10000 \
        --eval-interval 1000 \
...
@@ -42,7 +42,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --lr-warmup-fraction .01 \
-       --checkpoint-activations \
+       --activations-checkpoint-method uniform \
        --log-interval 100 \
        --save-interval 10000 \
        --eval-interval 1000 \
...
#!/bin/bash
# This example will start serving the 345M model.
DISTRIBUTED_ARGS="--nproc_per_node 1 \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT=<Path to checkpoint (e.g. /345m)>
VOCAB_FILE=<Path to vocab.json (e.g. /gpt2-vocab.json)>
MERGE_FILE=<Path to merges.txt (e.g. /gpt2-merges.txt)>

pip install flask-restful

python -m torch.distributed.launch $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 1 \
       --num-layers 24 \
       --hidden-size 1024 \
       --load ${CHECKPOINT} \
       --num-attention-heads 16 \
       --max-position-embeddings 1024 \
       --tokenizer-type GPT2BPETokenizer \
       --fp16 \
       --micro-batch-size 1 \
       --seq-length 1024 \
       --out-seq-length 1024 \
       --temperature 1.0 \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --top_p 0.9 \
       --seed 42
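Once the server above is running, it can be queried over HTTP. The snippet below is a hypothetical client (the port, route, and JSON fields are assumptions; check `tools/run_text_generation_server.py` for the actual interface):

```python
# Hypothetical client for the text generation server started above.
# The URL, route, and payload fields are assumptions, not a documented API.
import json
import requests

url = "http://localhost:5000/api"   # assumed Flask default port and route
payload = {"prompts": ["Megatron-LM is"], "tokens_to_generate": 32}

response = requests.put(url,
                        data=json.dumps(payload),
                        headers={"Content-Type": "application/json"})
print(response.json())
```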
#!/bin/bash
# This example will start serving the 345M model partitioned with 8-way tensor parallelism.
DISTRIBUTED_ARGS="--nproc_per_node 8 \
                  --nnodes 1 \
                  --node_rank 0 \
                  --master_addr localhost \
                  --master_port 6000"

CHECKPOINT=<Path to checkpoint (e.g. /345m)>
VOCAB_FILE=<Path to vocab.json (e.g. /gpt2-vocab.json)>
MERGE_FILE=<Path to merges.txt (e.g. /gpt2-merges.txt)>

pip install flask-restful

python -m torch.distributed.launch $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 1 \
       --num-layers 24 \
       --hidden-size 1024 \
       --load ${CHECKPOINT} \
       --num-attention-heads 16 \
       --max-position-embeddings 1024 \
       --tokenizer-type GPT2BPETokenizer \
       --fp16 \
       --micro-batch-size 1 \
       --seq-length 1024 \
       --out-seq-length 1024 \
       --temperature 1.0 \
       --vocab-file $VOCAB_FILE \
       --merge-file $MERGE_FILE \
       --top_p 0.9 \
       --seed 42
@@ -25,7 +25,7 @@ MBS=1
HS=20480
NAH=128
DDP=local
-MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
# Name of the job.
...
@@ -16,9 +16,9 @@ GBS=12
# Set interleaved schedule options.
if [ ${INTERLEAVED} == "YES" ]; then
-    MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 "
+    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${INTERLEAVED} == "NO" ]; then
-    MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
else
    echo "Invalid configuration"
    exit 1
...
@@ -24,7 +24,7 @@ NLS=32
HS=20480
NAH=128
DDP=local
-MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8
...
@@ -25,7 +25,7 @@ NLS=32
HS=3840
NAH=32
DDP=local
-MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8
...
@@ -25,7 +25,7 @@ NLS=32
HS=3840
NAH=32
DDP=local
-MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8
...
@@ -21,7 +21,7 @@ NLS=32
HS=15360
NAH=128
DDP=local
-MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8
...
@@ -16,7 +16,7 @@ GBS=1
# Set activation recomputation.
if [ ${ACTIVATION_RECOMPUTATION} == "YES" ]; then
-    MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${ACTIVATION_RECOMPUTATION} == "NO" ]; then
    MEGATRON_EXTRA_PARAMS=""
else
...
@@ -16,9 +16,9 @@ GBS=12
# Set scatter-gather communication optimization options.
if [ ${SCATTER_GATHER} == "YES" ]; then
-    MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 "
+    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${SCATTER_GATHER} == "NO" ]; then
-    MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 --no-scatter-gather-tensors-in-pipeline "
+    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 2 --no-scatter-gather-tensors-in-pipeline "
else
    echo "Invalid configuration"
    exit 1
...
@@ -21,7 +21,7 @@ if [ ${MODEL_SIZE} == "1.7B" ]; then
    NAH=24
    DDP=torch
    NNODES=4
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "3.6B" ]; then
    TP=2
    PP=1
@@ -32,7 +32,7 @@ elif [ ${MODEL_SIZE} == "3.6B" ]; then
    NAH=32
    DDP=torch
    NNODES=8
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "7.5B" ]; then
    TP=4
    PP=1
@@ -43,7 +43,7 @@ elif [ ${MODEL_SIZE} == "7.5B" ]; then
    NAH=32
    DDP=torch
    NNODES=16
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "18B" ]; then
    TP=8
    PP=1
@@ -54,7 +54,7 @@ elif [ ${MODEL_SIZE} == "18B" ]; then
    NAH=48
    DDP=torch
    NNODES=32
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "39B" ]; then
    TP=8
    PP=2
@@ -65,7 +65,7 @@ elif [ ${MODEL_SIZE} == "39B" ]; then
    NAH=64
    DDP=local
    NNODES=64
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "76B" ]; then
    TP=8
    PP=4
@@ -76,7 +76,7 @@ elif [ ${MODEL_SIZE} == "76B" ]; then
    NAH=80
    DDP=local
    NNODES=128
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 5"
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 5"
elif [ ${MODEL_SIZE} == "145B" ]; then
    TP=8
    PP=8
@@ -87,7 +87,7 @@ elif [ ${MODEL_SIZE} == "145B" ]; then
    NAH=96
    DDP=local
    NNODES=192
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 5 "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 5 "
elif [ ${MODEL_SIZE} == "310B" ]; then
    TP=8
    PP=16
@@ -98,7 +98,7 @@ elif [ ${MODEL_SIZE} == "310B" ]; then
    NAH=128
    DDP=local
    NNODES=240
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 3 "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 3 "
elif [ ${MODEL_SIZE} == "530B" ]; then
    TP=8
    PP=35
@@ -109,7 +109,7 @@ elif [ ${MODEL_SIZE} == "530B" ]; then
    NAH=128
    DDP=local
    NNODES=315
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 1 "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 1 "
elif [ ${MODEL_SIZE} == "1T" ]; then
    TP=8
    PP=64
@@ -120,7 +120,7 @@ elif [ ${MODEL_SIZE} == "1T" ]; then
    NAH=160
    DDP=local
    NNODES=384
-   MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
+   MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
else
    echo "Invalid configuration"
    exit 1