Commit 50a4b5fa authored by Mohammad Shoeybi, committed by Deepak Narayanan

Release fixes

parent 23632ee5
...@@ -10,13 +10,12 @@ Below are some of the projects where we have directly used Megatron:
* [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html)
* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)
Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.
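As a rough aid to reading these numbers, the sketch below converts a model size, batch size, sequence length, and iteration time into an achieved TFLOP/s-per-GPU figure using the common "~6 × parameters × tokens" approximation for forward-plus-backward FLOPs. This is only a rule of thumb, not the accounting used by the codebase (it ignores attention and recomputation terms, for example), and the function name and inputs are illustrative.

```python
def approx_tflops_per_gpu(params_billion, batch_size, seq_len,
                          iteration_time_ms, num_gpus):
    """Rule-of-thumb achieved TFLOP/s per GPU, assuming roughly
    6 * parameters * tokens floating-point operations per iteration."""
    flops_per_iteration = 6.0 * params_billion * 1e9 * batch_size * seq_len
    seconds = iteration_time_ms / 1e3
    return flops_per_iteration / seconds / num_gpus / 1e12

# Hypothetical inputs (not a row from the tables below): a 1.2B-parameter
# model, batch size 8, sequence length 2048, and a 1000 ms iteration on a
# single GPU comes out to roughly 118 TFLOP/s per GPU.
print(approx_tflops_per_gpu(1.2, 8, 2048, 1000, 1))
```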
![Cases](images/cases_april2021.png)
All the cases from 1 billion to 1 trillion parameters achieve more than 43% half-precision utilization, which is high for an end-to-end application. We observe that the utilization initially remains roughly constant, but as the hidden size grows for the larger models, utilization increases and reaches 52% for the largest model. We also note that the achieved aggregate petaFLOPs across all GPUs increases almost linearly with the number of GPUs, demonstrating good weak scaling.
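For reference, these utilization percentages are simply the achieved per-GPU rate divided by the GPU's theoretical half-precision tensor-core peak (about 125 TFLOP/s for V100 and 312 TFLOP/s for A100). A minimal example of the arithmetic, using one entry from the tables further down:

```python
A100_PEAK_TFLOPS = 312.0  # FP16/BF16 dense tensor-core peak
V100_PEAK_TFLOPS = 125.0  # FP16 tensor-core peak (SXM)

def half_precision_utilization(achieved_tflops_per_gpu, peak_tflops):
    """Fraction of theoretical peak achieved by end-to-end training."""
    return achieved_tflops_per_gpu / peak_tflops

# 149.8 TFLOP/s per GPU for the 1B case on a single A100 -> about 48%.
print(f"{half_precision_utilization(149.8, A100_PEAK_TFLOPS):.0%}")
```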
![Model Parallel Scaling](images/scaling.png)
# Contents
* [Contents](#contents)
...
# Generate the figure PNGs from the LaTeX tables in tables.tex.
default: cases.png scaling-mp.png scaling-dp.png

# For some reason the size option to convert in scaling.tex doesn't work,
# so the images are resized manually after conversion.
cases.png scaling-mp.png scaling-dp.png: tables.tex
	latex --shell-escape $<
	convert tables-1.png -resize 650 cases.png
	convert tables-2.png -resize 600 scaling-mp.png
	convert tables-3.png -resize 350 scaling-dp.png

clean:
	rm -rf *.aux *.log *.dvi *.ps
	rm -rf tables-*.png
\documentclass[multi,convert]{standalone}
\usepackage{multirow}
\standaloneenv{tabular}
\begin{document}
\begin{tabular}{cccccc}
Case & Hidden Size & Attention Heads & Layers & Parameters (billions) & Model Parallel Partitions \\
\hline
1B & 1920 & 15 & 24 & 1.16 & 1 \\
2B & 2304 & 18 & 30 & 2.03 & 2 \\
4B & 3072 & 24 & 36 & 4.24 & 4 \\
8B & 4096 & 32 & 42 & 8.67 & 8 \\
\end{tabular}
\begin{tabular}{cc|ccc|ccc}
& & \multicolumn{3}{c|}{\textbf{DGX-2 (V100) batch size 8}} & \multicolumn{3}{c}{\textbf{DGX-A100 batch size 16}} \\
\hline
\multirow{2}{*}{Case} & Number of & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs \\
& GPUs & Time (ms) & & per GPU & Time (ms) & & per GPU \\
\hline
1B & 1 & 1121 & 100.0\% & 71.9 & 1076 & 100.0\% & 149.8 \\
2B & 2 & 1093 & 89.6\% & 64.2 & 1026 & 91.7\% & 136.8 \\
4B & 4 & 1238 & 82.5\% & 58.5 & 1162 & 84.5\% & 124.7 \\
8B & 8 & 1407 & 74.3\% & 52.2 & 1343 & 74.7\% & 109.3 \\
\end{tabular}
\begin{tabular}{cc|ccc}
& & \multicolumn{3}{c}{\textbf{DGX-A100 batch size 2048}} \\
\hline
\multirow{2}{*}{Case} & Number of & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs \\
& GPUs & Time (ms) & & per GPU \\
\hline
1B & 128 & 1153 & 93.3\% & 139.8 \\
2B & 256 & 1101 & 85.5\% & 127.5 \\
4B & 512 & 1242 & 79.0\% & 116.7 \\
8B & 1024 & 1380 & 72.7\% & 106.5 \\
\end{tabular}
\end{document}
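One way to read the "Scaling" columns above, which reproduces the tabulated values to within rounding, is as weak-scaling efficiency relative to the single-GPU 1B baseline, with the ideal iteration time scaled by the amount of model held per GPU. The sketch below writes that interpretation out; it is illustrative only and not code from the repository.

```python
def weak_scaling_efficiency(base_time_ms, base_params_b,
                            time_ms, params_b, num_gpus):
    """Ideal iteration time (baseline time scaled by parameters-per-GPU
    relative to the single-GPU baseline) divided by the measured time."""
    ideal_time_ms = base_time_ms * (params_b / num_gpus) / base_params_b
    return ideal_time_ms / time_ms

# 2B model on 2 V100s vs. the 1B model on a single V100 (first table above):
# about 0.897, matching the tabulated 89.6% up to rounding of the inputs.
print(f"{weak_scaling_efficiency(1121, 1.16, 1093, 2.03, 2):.1%}")
```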
...@@ -136,6 +136,13 @@ def parse_args(extra_args_provider=None, defaults={},
    if args.bf16:
        assert not args.fp16
        args.params_dtype = torch.bfloat16
        # bfloat16 requires gradient accumulation and all-reduce to
        # be done in fp32.
        if not args.accumulate_allreduce_grads_in_fp32:
            args.accumulate_allreduce_grads_in_fp32 = True
            if args.rank == 0:
                print('accumulate and all-reduce gradients in fp32 for '
                      'bfloat16 data type.', flush=True)
    if args.rank == 0:
        print('using {} for parameters ...'.format(args.params_dtype),
...
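As a side note on why this default is forced on: bfloat16 keeps fp32's exponent range but has only about 8 bits of significand, so repeatedly adding small gradient contributions into a bf16 buffer eventually rounds the updates away, while an fp32 accumulator keeps them. A minimal, self-contained illustration of the effect (not Megatron code):

```python
import torch

# Accumulate 10,000 updates of 1e-3 in bf16 vs. fp32. Once the bf16
# accumulator reaches 0.5, its spacing (~0.004) is large enough that
# adding 1e-3 rounds back to the same value, so the sum stalls; the
# fp32 accumulator keeps summing accurately.
step = torch.tensor(1e-3)
acc_bf16 = torch.zeros((), dtype=torch.bfloat16)
acc_fp32 = torch.zeros((), dtype=torch.float32)
for _ in range(10000):
    acc_bf16 += step.to(torch.bfloat16)
    acc_fp32 += step
print(acc_bf16.item(), acc_fp32.item())  # roughly 0.5 vs. roughly 10.0
```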
...@@ -170,7 +170,7 @@ def _train(model, optimizer, lr_scheduler, forward_step,
    report_memory_flag = True
    # For each remaining epoch
    timers('interval-time').start()
    for epoch in range(start_epoch, args.epochs):
        print_rank_0('working on epoch {} ...'.format(epoch + 1))
...
...@@ -149,7 +149,7 @@ def _train(
    report_memory_flag = True
    # For each remaining epoch
    timers("interval-time").start()
    for epoch in range(start_epoch, args.epochs):
        print_rank_0("working on epoch {} ...".format(epoch + 1))
...