Commit ac837a4e authored by Mohammad Shoeybi

Merge branch 'community-fixes' into 'main'

Community fixes

See merge request ADLR/megatron-lm!176
parents e7c7a78f 356f8771
 [Megatron](https://arxiv.org/pdf/1909.08053.pdf) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel, and multinode training of [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [BERT](https://arxiv.org/pdf/1810.04805.pdf) using mixed precision.
-Using our GPT-2 model we achieve a perplexity of 10.8 on the WikiText-103 dataset (improving SOTA from 15.8) and an accuracy of 66.5% on the LAMBADA datasets. For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to GPT-2 architucture), which allowed the models to continue to improve as they were scaled up. Our BERT models with 3.9 billion parameters reaches a loss of 1.16, SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.
+Using our GPT-2 model we achieve a perplexity of 10.8 on the WikiText-103 dataset (improving SOTA from 15.8) and an accuracy of 66.5% on the LAMBADA datasets. For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to GPT-2 architucture), which allowed the models to continue to improve as they were scaled up. Our BERT model with 3.9 billion parameters reaches a loss of 1.16, SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.
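To make the layer-normalization reordering concrete, below is a minimal PyTorch sketch (illustrative only, not code from this repository) contrasting the original BERT-style post-LN residual block with the GPT-2-style pre-LN block; `sublayer` stands for either the self-attention or the MLP sub-block.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original BERT ordering: apply the sub-layer, add the residual, then normalize."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """GPT-2-style ordering: normalize first, apply the sub-layer, then add the residual."""
    def __init__(self, hidden_size, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```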
 Our codebase is capable of efficiently training very large (several billion parameter) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs we consider the following GPT-2 model sizes. All models use a vocabulary size of 51,200 and a sequence length of 1024.
......
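As a rough illustration of how the GPT-2 model sizes follow from the architecture, the sketch below estimates the parameter count of a GPT-2-style decoder from its hidden size and depth, using the vocabulary size of 51,200 and sequence length of 1024 quoted above. The formula and the example hidden size/layer count are assumptions for illustration, not figures taken from this page.

```python
def gpt2_param_count(hidden_size, num_layers, vocab_size=51200, seq_length=1024):
    """Approximate parameter count of a GPT-2-style decoder with learned
    positional embeddings and a tied output projection."""
    # Per layer: attention (QKV + output projection) ~ 4*h^2 + 4*h,
    # MLP (h -> 4h -> h) ~ 8*h^2 + 5*h, and two layer norms ~ 4*h.
    per_layer = 12 * hidden_size**2 + 13 * hidden_size
    embeddings = (vocab_size + seq_length) * hidden_size
    final_layernorm = 2 * hidden_size
    return num_layers * per_layer + embeddings + final_layernorm

# Assumed example configuration: hidden size 3072 with 72 layers
# comes out at roughly 8.3 billion parameters.
print(f"{gpt2_param_count(3072, 72) / 1e9:.1f}B")
```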
@@ -358,7 +358,7 @@ def _add_mixed_precision_args(parser):
 def _add_distributed_args(parser):
-    group = parser.add_argument_group(title='mixed precision')
+    group = parser.add_argument_group(title='distributed')
     group.add_argument('--model-parallel-size', type=int, default=1,
                        help='Size of the model parallel.')
@@ -402,8 +402,8 @@ def _add_data_args(parser):
     group.add_argument('--split', type=str, default='969, 30, 1',
                        help='Comma-separated list of proportions for training,'
                        ' validation, and test split. For example the split '
-                       '`90,5,5` will use 90% of data for training, 5% for '
-                       'validation and 5% for test.')
+                       '`90,5,5` will use 90%% of data for training, 5%% for '
+                       'validation and 5%% for test.')
     group.add_argument('--vocab-file', type=str, default=None,
                        help='Path to the vocab file.')
     group.add_argument('--merge-file', type=str, default=None,
......
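The `%` to `%%` change above matters because argparse runs help strings through %-style string formatting when rendering `--help` (that is how substitutions like `%(default)s` work), so a bare `%` raises a formatting error while `%%` prints as a literal percent sign. A minimal standalone sketch of the escaped form:

```python
import argparse

parser = argparse.ArgumentParser(prog='example')
# argparse applies %-style interpolation to help strings, so a literal
# percent sign must be written as '%%'.
parser.add_argument('--split', type=str, default='969, 30, 1',
                    help='Comma-separated list of proportions for training,'
                         ' validation, and test split. For example the split '
                         '`90,5,5` will use 90%% of data for training, 5%% for '
                         'validation and 5%% for test.')
parser.print_help()  # renders the help text with literal '%' signs
```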
@@ -90,7 +90,7 @@ class _CopyToModelParallelRegion(torch.autograd.Function):
 class _ReduceFromModelParallelRegion(torch.autograd.Function):
-    """All-redcue the input from the model parallel region."""
+    """All-reduce the input from the model parallel region."""
     @staticmethod
     def symbolic(graph, input_):
......
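For context, `_ReduceFromModelParallelRegion` wraps an all-reduce in an autograd function. The condensed sketch below shows the forward/backward pair such a mapping typically implements; it omits the `symbolic` method shown in the hunk and assumes the `get_model_parallel_group()` helper defined elsewhere in the mpu package.

```python
import torch
from .initialize import get_model_parallel_group  # helper assumed from the mpu package


class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """All-reduce the input from the model parallel region."""

    @staticmethod
    def forward(ctx, input_):
        # Each model-parallel rank holds a partial result; sum them in place.
        torch.distributed.all_reduce(input_, group=get_model_parallel_group())
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # The all-reduce acts as an identity in the backward pass:
        # every rank already receives the full gradient.
        return grad_output


# Typical usage inside the model-parallel layers:
# output = _ReduceFromModelParallelRegion.apply(partial_output)
```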
@@ -224,7 +224,7 @@ def model_parallel_cuda_manual_seed(seed):
     # 2718 is just for fun and any POSITIVE value will work.
     offset = seed + 2718
     model_parallel_seed = offset + get_model_parallel_rank()
-    # Data parallel gets the original sedd.
+    # Data parallel gets the original seed.
     data_parallel_seed = seed
     if torch.distributed.get_rank() == 0:
......
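The hunk above derives two seeds from the single user-provided seed: each model-parallel rank gets a distinct offset seed (so random operations such as dropout can differ across the shards of one model), while every data-parallel replica keeps the original seed. A small illustrative calculation with assumed values:

```python
# Assumed example: seed=1234 with 2-way model parallelism (values are illustrative).
seed = 1234
offset = seed + 2718                                     # 3952; any positive offset works
for model_parallel_rank in range(2):
    model_parallel_seed = offset + model_parallel_rank   # 3952, 3953 -> unique per shard
    data_parallel_seed = seed                            # 1234 on every rank
    print(model_parallel_rank, model_parallel_seed, data_parallel_seed)
```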