Commit 50a4b5fa authored by Mohammad Shoeybi, committed by Deepak Narayanan

Release fixes

parent 23632ee5
...@@ -10,13 +10,12 @@ Below are some of the projects where we have directly used Megatron:
* [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html)
* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)
Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.
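As a rough aid to reading these numbers, the sketch below converts a model size, batch size, sequence length, and iteration time into an achieved TFLOP/s-per-GPU figure using the common "~6 × parameters × tokens" approximation for forward-plus-backward FLOPs. This is only a rule of thumb, not the accounting used by the codebase (it ignores attention and recomputation terms, for example), and the function name and inputs are illustrative.

```python
def approx_tflops_per_gpu(params_billion, batch_size, seq_len,
                          iteration_time_ms, num_gpus):
    """Rule-of-thumb achieved TFLOP/s per GPU, assuming roughly
    6 * parameters * tokens floating-point operations per iteration."""
    flops_per_iteration = 6.0 * params_billion * 1e9 * batch_size * seq_len
    seconds = iteration_time_ms / 1e3
    return flops_per_iteration / seconds / num_gpus / 1e12

# Hypothetical inputs (not a row from the tables below): a 1.2B-parameter
# model, batch size 8, sequence length 2048, and a 1000 ms iteration on a
# single GPU comes out to roughly 118 TFLOP/s per GPU.
print(approx_tflops_per_gpu(1.2, 8, 2048, 1000, 1))
```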
![Cases](images/cases_april2021.png)
All the cases from 1 billion to 1 trillion parameters achieve more than 43% half-precision utilization, which is high for an end-to-end application. We observe that the utilization initially remains roughly constant, but as the hidden size grows for the larger models, utilization increases and reaches 52% for the largest model. We also note that the achieved aggregate petaFLOPs across all GPUs increases almost linearly with the number of GPUs, demonstrating good weak scaling.
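For reference, these utilization percentages are simply the achieved per-GPU rate divided by the GPU's theoretical half-precision tensor-core peak (about 125 TFLOP/s for V100 and 312 TFLOP/s for A100). A minimal example of the arithmetic, using one entry from the tables further down:

```python
A100_PEAK_TFLOPS = 312.0  # FP16/BF16 dense tensor-core peak
V100_PEAK_TFLOPS = 125.0  # FP16 tensor-core peak (SXM)

def half_precision_utilization(achieved_tflops_per_gpu, peak_tflops):
    """Fraction of theoretical peak achieved by end-to-end training."""
    return achieved_tflops_per_gpu / peak_tflops

# 149.8 TFLOP/s per GPU for the 1B case on a single A100 -> about 48%.
print(f"{half_precision_utilization(149.8, A100_PEAK_TFLOPS):.0%}")
```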
![Model Parallel Scaling](images/scaling.png)
# Contents
* [Contents](#contents)
...
# Generate the figure PNGs from the LaTeX tables in tables.tex.
default: cases.png scaling-mp.png scaling-dp.png

# For some reason the size option to convert in scaling.tex doesn't work,
# so the images are resized manually after conversion.
cases.png scaling-mp.png scaling-dp.png: tables.tex
	latex --shell-escape $<
	convert tables-1.png -resize 650 cases.png
	convert tables-2.png -resize 600 scaling-mp.png
	convert tables-3.png -resize 350 scaling-dp.png

clean:
	rm -rf *.aux *.log *.dvi *.ps
	rm -rf tables-*.png
\documentclass[multi,convert]{standalone}
\usepackage{multirow}
\standaloneenv{tabular}
\begin{document}
\begin{tabular}{cccccc}
Case & Hidden Size & Attention Heads & Layers & Parameters (billions) & Model Parallel Partitions \\
\hline
1B & 1920 & 15 & 24 & 1.16 & 1 \\
2B & 2304 & 18 & 30 & 2.03 & 2 \\
4B & 3072 & 24 & 36 & 4.24 & 4 \\
8B & 4096 & 32 & 42 & 8.67 & 8 \\
\end{tabular}
\begin{tabular}{cc|ccc|ccc}
& & \multicolumn{3}{c|}{\textbf{DGX-2 (V100) batch size 8}} & \multicolumn{3}{c}{\textbf{DGX-A100 batch size 16}} \\
\hline
\multirow{2}{*}{Case} & Number of & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs \\
& GPUs & Time (ms) & & per GPU & Time (ms) & & per GPU \\
\hline
1B & 1 & 1121 & 100.0\% & 71.9 & 1076 & 100.0\% & 149.8 \\
2B & 2 & 1093 & 89.6\% & 64.2 & 1026 & 91.7\% & 136.8 \\
4B & 4 & 1238 & 82.5\% & 58.5 & 1162 & 84.5\% & 124.7 \\
8B & 8 & 1407 & 74.3\% & 52.2 & 1343 & 74.7\% & 109.3 \\
\end{tabular}
\begin{tabular}{cc|ccc}
& & \multicolumn{3}{c}{\textbf{DGX-A100 batch size 2048}} \\
\hline
\multirow{2}{*}{Case} & Number of & Iteration & \multirow{2}{*}{Scaling} & TeraFLOPs \\
& GPUs & Time (ms) & & per GPU \\
\hline
1B & 128 & 1153 & 93.3\% & 139.8 \\
2B & 256 & 1101 & 85.5\% & 127.5 \\
4B & 512 & 1242 & 79.0\% & 116.7 \\
8B & 1024 & 1380 & 72.7\% & 106.5 \\
\end{tabular}
\end{document}
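One way to read the "Scaling" columns above, which reproduces the tabulated values to within rounding, is as weak-scaling efficiency relative to the single-GPU 1B baseline, with the ideal iteration time scaled by the amount of model held per GPU. The sketch below writes that interpretation out; it is illustrative only and not code from the repository.

```python
def weak_scaling_efficiency(base_time_ms, base_params_b,
                            time_ms, params_b, num_gpus):
    """Ideal iteration time (baseline time scaled by parameters-per-GPU
    relative to the single-GPU baseline) divided by the measured time."""
    ideal_time_ms = base_time_ms * (params_b / num_gpus) / base_params_b
    return ideal_time_ms / time_ms

# 2B model on 2 V100s vs. the 1B model on a single V100 (first table above):
# about 0.897, matching the tabulated 89.6% up to rounding of the inputs.
print(f"{weak_scaling_efficiency(1121, 1.16, 1093, 2.03, 2):.1%}")
```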
...@@ -136,6 +136,13 @@ def parse_args(extra_args_provider=None, defaults={},
    if args.bf16:
        assert not args.fp16
        args.params_dtype = torch.bfloat16
        # bfloat16 requires gradient accumulation and all-reduce to
        # be done in fp32.
        if not args.accumulate_allreduce_grads_in_fp32:
            args.accumulate_allreduce_grads_in_fp32 = True
            if args.rank == 0:
                print('accumulate and all-reduce gradients in fp32 for '
                      'bfloat16 data type.', flush=True)
    if args.rank == 0:
        print('using {} for parameters ...'.format(args.params_dtype),
...
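As a side note on why this default is forced on: bfloat16 keeps fp32's exponent range but has only about 8 bits of significand, so repeatedly adding small gradient contributions into a bf16 buffer eventually rounds the updates away, while an fp32 accumulator keeps them. A minimal, self-contained illustration of the effect (not Megatron code):

```python
import torch

# Accumulate 10,000 updates of 1e-3 in bf16 vs. fp32. Once the bf16
# accumulator reaches 0.5, its spacing (~0.004) is large enough that
# adding 1e-3 rounds back to the same value, so the sum stalls; the
# fp32 accumulator keeps summing accurately.
step = torch.tensor(1e-3)
acc_bf16 = torch.zeros((), dtype=torch.bfloat16)
acc_fp32 = torch.zeros((), dtype=torch.float32)
for _ in range(10000):
    acc_bf16 += step.to(torch.bfloat16)
    acc_fp32 += step
print(acc_bf16.item(), acc_fp32.item())  # roughly 0.5 vs. roughly 10.0
```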
...@@ -170,7 +170,7 @@ def _train(model, optimizer, lr_scheduler, forward_step,
    report_memory_flag = True
    # For each remaining epoch
    timers('interval-time').start()
    for epoch in range(start_epoch, args.epochs):
        print_rank_0('working on epoch {} ...'.format(epoch + 1))
...
...@@ -149,7 +149,7 @@ def _train(
    report_memory_flag = True
    # For each remaining epoch
    timers("interval-time").start()
    for epoch in range(start_epoch, args.epochs):
        print_rank_0("working on epoch {} ...".format(epoch + 1))
...