Merge branch 'community-fixes' into 'main'

Community fixes

See merge request ADLR/megatron-lm!176

ac837a4e · Mohammad Shoeybi · e7c7a78f · 356f8771 · ac837a4e · ac837a4e
Commit ac837a4e authored Nov 17, 2020 by Mohammad Shoeybi
5 changed files
--- a/README.md
+++ b/README.md
 [Megatron](https://arxiv.org/pdf/1909.08053.pdf) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel, and multinode training of [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [BERT](https://arxiv.org/pdf/1810.04805.pdf) using mixed precision.

-Using our GPT-2 model we achieve a perplexity of 10.8 on the WikiText-103 dataset (improving SOTA from 15.8) and an accuracy of 66.5% on the LAMBADA datasets. For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to GPT-2 architucture), which allowed the models to continue to improve as they were scaled up. Our BERT models with 3.9 billion parameters reaches a loss of 1.16, SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.
+Using our GPT-2 model we achieve a perplexity of 10.8 on the WikiText-103 dataset (improving SOTA from 15.8) and an accuracy of 66.5% on the LAMBADA datasets. For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to GPT-2 architucture), which allowed the models to continue to improve as they were scaled up. Our BERT model with 3.9 billion parameters reaches a loss of 1.16, SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.

 Our codebase is capable of efficiently training very large (several billion parameter) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs we consider the following GPT-2 model sizes. All models use a vocabulary size of 51,200 and a sequence length of 1024.


--- a/megatron/arguments.py
+++ b/megatron/arguments.py
@@ -358,7 +358,7 @@ def _add_mixed_precision_args(parser):


 def _add_distributed_args(parser):
-    group = parser.add_argument_group(title='mixed precision')
+    group = parser.add_argument_group(title='distributed')

    group.add_argument('--model-parallel-size', type=int, default=1,
                       help='Size of the model parallel.')
@@ -402,8 +402,8 @@ def _add_data_args(parser):
    group.add_argument('--split', type=str, default='969, 30, 1',
                       help='Comma-separated list of proportions for training,'
                       ' validation, and test split. For example the split '
-                       '`90,5,5` will use 90% of data for training, 5% for '
-                       'validation and 5% for test.')
+                       '`90,5,5` will use 90%% of data for training, 5%% for '
+                       'validation and 5%% for test.')
    group.add_argument('--vocab-file', type=str, default=None,
                       help='Path to the vocab file.')
    group.add_argument('--merge-file', type=str, default=None,
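
Note on the `%%` change above: argparse expands help strings with Python's `%` formatting (so that specifiers such as `%(default)s` work), which means a bare `%` in help text is read as the start of a format specifier. The following is a minimal sketch of that behavior, not code from this repository; `build_parser` and `render_help` are hypothetical helpers, and the help text is shortened from the `--split` argument. Depending on the Python version, the unescaped string may be rejected either when the argument is added or when the help is rendered, so the sketch catches both cases.

```python
import argparse


def build_parser(help_text):
    """Build a tiny parser with one titled argument group (illustrative only)."""
    parser = argparse.ArgumentParser(prog="demo", add_help=False)
    group = parser.add_argument_group(title="data")
    group.add_argument("--split", type=str, default="969, 30, 1",
                       help=help_text)
    return parser


def render_help(help_text):
    """Return the formatted help text, or None if argparse rejects the string."""
    try:
        return build_parser(help_text).format_help()
    except (TypeError, ValueError):
        # A bare '%' followed by ' o' or ' f' is parsed as a format specifier.
        return None


# Bare '%': "% o" is treated as an octal conversion and formatting fails.
unescaped = render_help("`90,5,5` will use 90% of data for training.")

# Escaped '%%': rendered as a literal percent sign in the help output.
escaped = render_help("`90,5,5` will use 90%% of data for training.")

print("unescaped help renders:", unescaped is not None)
print(escaped)
```

With the original strings, requesting `--help` would be expected to fail in the same way, which is presumably what this hunk addresses.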

--- a/megatron/mpu/mappings.py
+++ b/megatron/mpu/mappings.py
@@ -90,7 +90,7 @@ class _CopyToModelParallelRegion(torch.autograd.Function):


 class _ReduceFromModelParallelRegion(torch.autograd.Function):
-    """All-redcue the input from the model parallel region."""
+    """All-reduce the input from the model parallel region."""

    @staticmethod
    def symbolic(graph, input_):

--- a/megatron/mpu/random.py
+++ b/megatron/mpu/random.py
@@ -224,7 +224,7 @@ def model_parallel_cuda_manual_seed(seed):
    # 2718 is just for fun and any POSITIVE value will work.
    offset = seed + 2718
    model_parallel_seed = offset + get_model_parallel_rank()
-    # Data parallel gets the original sedd.
+    # Data parallel gets the original seed.
    data_parallel_seed = seed

    if torch.distributed.get_rank() == 0:
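
For readers following the comment fix above, the seed arithmetic visible in this hunk can be sketched in isolation. This is an illustration under stated assumptions, not the repository's function: `derive_seeds` is a hypothetical helper, and the model-parallel rank is passed in as a plain integer instead of being read from `get_model_parallel_rank()`.

```python
def derive_seeds(seed, model_parallel_rank):
    """Mirror the arithmetic shown in the hunk using plain integers."""
    # 2718 is the fixed offset used in the hunk.
    offset = seed + 2718
    # Each model-parallel rank gets its own seed.
    model_parallel_seed = offset + model_parallel_rank
    # Data parallel gets the original seed.
    data_parallel_seed = seed
    return model_parallel_seed, data_parallel_seed


# Example: four model-parallel ranks sharing the base seed 1234.
for rank in range(4):
    mp_seed, dp_seed = derive_seeds(1234, rank)
    print(f"rank {rank}: model-parallel seed {mp_seed}, data-parallel seed {dp_seed}")
```

The model-parallel seeds differ by rank, while the data-parallel seed stays equal to the value passed in.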

--- a/tools/openwebtext/group_duplicates_url.py
+++ b/tools/openwebtext/group_duplicates_url.py