Commit 19301985 authored by Jared Casper

Merge branch 'readme_update' into 'main'

Update scaling numbers in README and other small tweaks.

See merge request ADLR/megatron-lm!130
parents 91ee60df 52819194
[Megatron](https://arxiv.org/pdf/1909.08053.pdf) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel, and multinode training of [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [BERT](https://arxiv.org/pdf/1810.04805.pdf) using mixed precision.
Using our GPT-2 model we achieve a perplexity of 10.8 on the WikiText-103 dataset (improving SOTA from 15.8) and an accuracy of 66.5% on the LAMBADA dataset (improving SOTA from 63.2%). For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to the GPT-2 architecture), which allowed the models to continue to improve as they were scaled up. Our 3.9 billion parameter BERT model reaches a loss of 1.16, a SQuAD 2.0 F1-score of 91.7, and a RACE accuracy of 90.9%.
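In transformer terms, this swap moves from the original BERT ("post-LN") sublayer ordering, where the residual is added before the layer norm, to a GPT-2-style ("pre-LN") ordering, where the layer norm is applied first and the residual is added afterwards. A minimal sketch of the two orderings (illustrative only, not the actual Megatron implementation):

```python
from torch import nn

# Original BERT ("post-LN") sublayer ordering: residual add first, layer norm after.
def post_ln_sublayer(x, sublayer, norm: nn.LayerNorm):
    return norm(x + sublayer(x))

# GPT-2-style ("pre-LN") ordering described above: layer norm first, residual add after.
def pre_ln_sublayer(x, sublayer, norm: nn.LayerNorm):
    return x + sublayer(norm(x))
```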
Our codebase is capable of efficiently training very large (several billion parameter) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs, we consider the following GPT-2 model sizes. All models use a vocabulary size of 51,200 and a sequence length of 1024.
![Cases](images/cases.png)
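The parameter counts in the table follow, to a good approximation, from the hidden size and layer count via the usual GPT-2 accounting: token and position embeddings plus roughly 12h² weights per transformer layer. A back-of-the-envelope sketch (biases and layer norms are only approximated):

```python
# Approximate parameter counts for the model configurations in the table above.
VOCAB, SEQ = 51200, 1024

def approx_params(hidden, layers):
    embeddings = VOCAB * hidden + SEQ * hidden       # token + position embeddings
    per_layer = 12 * hidden * hidden + 13 * hidden   # attention, 4h MLP, biases, layer norms
    return embeddings + layers * per_layer

for case, (hidden, layers) in {"1B": (1920, 24), "2B": (2304, 30),
                               "4B": (3072, 36), "8B": (4096, 42)}.items():
    print(f"{case}: {approx_params(hidden, layers) / 1e9:.2f}B parameters")
# -> roughly 1.16B, 2.03B, 4.24B, and 8.67B, matching the table
```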
The table below details the weak scaling from 1 to 8 GPUs of our model parallelism code on both a DGX-2 and a DGX-A100. Notice that we double the batch size on the DGX-A100, yet the iteration time still decreases compared to the DGX-2, resulting in a **2.1x** speedup for the end-to-end application.
![Model Parallel Scaling](images/scaling-mp.png)
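As a rough sanity check, the quoted **2.1x** end-to-end speedup follows directly from the batch sizes and iteration times in the table (Case 1B shown below; the other cases come out similarly):

```python
# Throughput comparison using the Case 1B numbers from the table above.
dgx2_samples_per_sec = 8 / 1.121       # DGX-2: batch size 8, 1121 ms per iteration
dgxa100_samples_per_sec = 16 / 1.076   # DGX-A100: batch size 16, 1076 ms per iteration
print(dgxa100_samples_per_sec / dgx2_samples_per_sec)  # ~2.08, i.e. roughly 2.1x
```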
The following table details how Megatron scales using data parallelism in conjunction with model parallelism in a cluster of DGX-A100s. All of these cases use 128-way data parallelism, and the scaling numbers are relative to a single A100 (Case 1B with a 1076 ms iteration time).
![Data Parallel Scaling](images/scaling-dp.png)
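The combined scaling numbers are consistent with multiplying the DGX-A100 model-parallel efficiency from the previous table by the additional efficiency lost when scaling out to 128-way data parallelism. This decomposition is an interpretation of the two tables rather than something stated explicitly; a quick check:

```python
# Combined scaling ~= model-parallel scaling x data-parallel scaling,
# using the DGX-A100 numbers from the two tables above.
cases = {
    # case: (model-parallel scaling, model-parallel iter ms, data-parallel iter ms)
    "1B": (1.000, 1076, 1153),
    "2B": (0.917, 1026, 1101),
    "4B": (0.845, 1162, 1242),
    "8B": (0.747, 1343, 1380),
}
for name, (mp_scaling, mp_ms, dp_ms) in cases.items():
    print(f"{name}: {mp_scaling * mp_ms / dp_ms:.1%}")
# -> ~93.3%, 85.5%, 79.1%, 72.7%, close to the tabulated 93.3%, 85.5%, 79.0%, 72.7%
```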
<a id="contents"></a> <a id="contents"></a>
# Contents # Contents
@@ -266,7 +276,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt2.py \
<a id="realm"></a> <a id="realm"></a>
## REALM Pipeline ## REALM Pipeline
We are working on implementing the [REALM](https://arxiv.org/pdf/2002.08909.pdf) system. The following sections will eventually reflect its three training stages; for now, only the ICT code is included.
Loosely, these stages are: pretraining the retriever modules, then jointly training the language model and the retriever, and finally finetuning a question answering head on the language model with a fixed retriever.
### Inverse Cloze Task (ICT) Pretraining
default: cases.png scaling-mp.png scaling-dp.png

# For some reason the size option to convert in tables.tex doesn't work, so resize manually afterwards.
cases.png scaling-mp.png scaling-dp.png: tables.tex
	latex --shell-escape $<
	convert tables-1.png -resize 650 cases.png
	convert tables-2.png -resize 600 scaling-mp.png
	convert tables-3.png -resize 350 scaling-dp.png

clean:
	rm -rf *.aux *.log *.dvi *.ps
	rm -rf tables-*.png
\documentclass[multi,convert]{standalone}
\usepackage{multirow}
\standaloneenv{tabular}
\begin{document}
\begin{tabular}{cccccc}
Case & Hidden Size & Attention Heads & Layers & Parameters (billions) & Model Parallel Partitions \\
\hline
1B & 1920 & 15 & 24 & 1.16 & 1 \\
2B & 2304 & 18 & 30 & 2.03 & 2 \\
4B & 3072 & 24 & 36 & 4.24 & 4 \\
8B & 4096 & 32 & 42 & 8.67 & 8 \\
\end{tabular}
\begin{tabular}{cc|ccc|ccc}
& & \multicolumn{3}{c|}{\textbf{DGX-2 (V100) batch size 8}} & \multicolumn{3}{c}{\textbf{DGX-A100 batch size 16}} \\
\hline
\multirow{2}{*}{Case} & Number of & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs \\
& GPUs & Time (ms) & & per GPU & Time (ms) & & per GPU \\
\hline
1B & 1 & 1121 & 100.0\% & 71.9 & 1076 & 100\% & 149.8 \\
2B & 2 & 1093 & 89.6\% & 64.2 & 1026 & 91.7\% & 136.8 \\
4B & 4 & 1238 & 82.5\% & 58.5 & 1162 & 84.5\% & 124.7 \\
8B & 8 & 1407 & 74.3\% & 52.2 & 1343 & 74.7\% & 109.3 \\
\end{tabular}
\begin{tabular}{cc|ccc}
& & \multicolumn{3}{c}{\textbf{DGX-A100 batch size 2048}} \\
\hline
\multirow{2}{*}{Case} & Number of & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs \\
& GPUs & Time (ms) & & per GPU \\
\hline
1B & 128 & 1153 & 93.3\% & 139.8 \\
2B & 256 & 1101 & 85.5\% & 127.5 \\
4B & 512 & 1242 & 79.0\% & 116.7 \\
8B & 1024 & 1380 & 72.7\% & 106.5 \\
\end{tabular}
\end{document}