Commit 34f55429 authored by Jared Casper

Merge branch 'main' of https://github.com/satpalsr/Megatron-LM into github-pr

parents adebe364 a7f882fe
@@ -207,7 +207,7 @@ Further command line arguments are described in the source file [`arguments.py`]
## T5 Pretraining
-Very similar to BERT and GPT, the `examples/pretrain_t5.sh` script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accomodate the T5 architecture:
+Very similar to BERT and GPT, the `examples/pretrain_t5.sh` script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:
* `--kv-channels` sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.
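As a worked illustration of that default (not part of this diff), the sketch below assumes a "base"-sized model with hidden size 768 and 12 attention heads; both figures are assumptions chosen for the example only.

<pre>
# Sketch: how the default value of --kv-channels is derived.
# hidden_size and num_attention_heads are assumed example values,
# not taken from this commit.
hidden_size = 768
num_attention_heads = 12

# BERT/GPT default: key/value projection width per attention head.
default_kv_channels = hidden_size // num_attention_heads
print(default_kv_channels)  # 64

# For T5, --kv-channels can be set explicitly instead of relying
# on this hidden_size / num_attention_heads default.
</pre>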
@@ -400,7 +400,7 @@ python tools/create_doc_index.py \
We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the `--finetune` flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the `--finetune` flag before continuing, otherwise the training will start again from the beginning.
-Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on input and pipeline model parallelsim on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.
+Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this. Currently only tensor model parallelism is supported on input and pipeline model parallelism on the output. This example reads in a model with 2-way tensor model parallelism and writes out a model with 2-way pipeline model parallelism.
<pre>
TENSOR_MODEL_PARALLEL_SIZE=2
@@ -485,7 +485,7 @@ python tasks/main.py \
### LAMBADA Cloze Accuracy
-To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceeding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).
+To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).
We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the `--strict-lambada` flag should be used to require whole word matching. Make sure that `lambada` is part of the file path.
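Purely as a conceptual sketch of what "cloze accuracy with whole word matching" means, the snippet below scores last-word predictions; it is not the Megatron evaluation code, and `predict_last_word` and the example passages are placeholders.

<pre>
# Conceptual sketch only: score a model on predicting the final word
# of each passage given the preceding words. predict_last_word is a
# placeholder for the real model; the passages are made up.
def cloze_accuracy(passages, predict_last_word):
    correct = 0
    for passage in passages:
        context, target = passage.rsplit(" ", 1)
        # Strict (whole-word) matching, in the spirit of --strict-lambada:
        # the predicted word must equal the target word exactly.
        if predict_last_word(context) == target:
            correct += 1
    return correct / len(passages)

passages = ["she opened the door", "he read the book"]
print(cloze_accuracy(passages, lambda context: "door"))  # 0.5
</pre>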
...
@@ -84,7 +84,7 @@ def average_losses_across_data_parallel_group(losses):
        [loss.clone().detach().view(1) for loss in losses])
    torch.distributed.all_reduce(averaged_losses,
                                 group=mpu.get_data_parallel_group())
    averaged_losses = averaged_losses / \
        torch.distributed.get_world_size(group=mpu.get_data_parallel_group())
    return averaged_losses
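For context, a minimal sketch of the arithmetic this helper performs: the all_reduce sums the per-rank losses across the data-parallel group, and dividing by the group's world size recovers the mean. The rank count and loss values below are invented for illustration and torch.distributed is deliberately left out.

<pre>
# Illustration only: the reduction performed above, without torch.distributed.
# Four hypothetical data-parallel ranks, each holding a local loss value.
local_losses = [1.0, 2.0, 3.0, 4.0]
world_size = len(local_losses)      # size of the data-parallel group

reduced = sum(local_losses)         # what all_reduce(SUM) produces on every rank
averaged = reduced / world_size     # division by get_world_size(...) above
print(averaged)                     # 2.5 -- identical on every rank
</pre>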
...