Unverified Commit a7f882fe authored by Satpal Singh Rathore, committed by GitHub

Merge branch 'main' into main

parents 8f241a96 9d86ca67
@@ -11,7 +11,9 @@ Below are some of the projects where we have directly used Megatron:
* [Scaling Language Model Training to a Trillion Parameters Using Megatron](https://arxiv.org/pdf/2104.04473.pdf)
* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)
Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., they include all operations including data loading, optimization, and even logging. Megatron is also used in [NeMo Megatron](https://developer.nvidia.com/nvidia-nemo#nemo-megatron), a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters.
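The model sizes in these studies follow directly from the architecture hyperparameters. As a rough guide, the sketch below estimates the parameter count from the number of layers, hidden size, vocabulary size, and sequence length using the closed-form approximation from the scaling paper linked above; it is an illustration, not a utility shipped with this repository.
<pre>
def approx_gpt_params(num_layers, hidden_size, vocab_size=51200, seq_length=2048):
    # P ~= 12*l*h^2 * (1 + 13/(12*h) + (V + s)/(12*l*h))
    l, h, V, s = num_layers, hidden_size, vocab_size, seq_length
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))

# Example: a 128-layer model with hidden size 25,600 comes out to roughly one trillion parameters.
print(f"{approx_gpt_params(128, 25600) / 1e9:.0f}B")
</pre>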
![Cases](images/cases_april2021.png)
@@ -426,33 +428,23 @@ WORLD_SIZE=$TENSOR_MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.
## GPT Text Generation
`bash examples/generate_text.sh`

We generate text samples using largely the GPT pretraining script. A few changes need to be made, such as providing the path to the pretrained checkpoint, the length of the output samples, and whether to generate text unconditionally (`--num-samples` denotes how many samples to generate) or conditionally (pass `--sample-input-file <filename>`, where each line of the file is used as the conditional text). There are a few optional parameters to play with, e.g. `top-k`, `top-p`, or greedy (set top-k and top-p to 0) sampling.
<pre>
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
GPT_ARGS=&#60;same as those in <a href="#gpt-pretraining">GPT pretraining</a> above&#62;
MAX_OUTPUT_SEQUENCE_LENGTH=1024
TEMPERATURE=1.0
TOP_P=0.9
NUMBER_OF_SAMPLES=2
OUTPUT_FILE=samples.json

python tools/generate_samples_gpt.py \
    $GPT_ARGS \
    --load $CHECKPOINT_PATH \
    --out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
    --temperature $TEMPERATURE \
    --genfile $OUTPUT_FILE \
    --num-samples $NUMBER_OF_SAMPLES \
    --top_p $TOP_P \
    --recompute
</pre>

We have also included a simple REST server to use for text generation in `tools/run_text_generation_server.py`. You run it much like you would start a pretraining job, specifying an appropriate pretrained checkpoint. There are also a few optional parameters: `temperature`, `top-k` and `top-p`. See `--help` or the source file for more information. See [examples/run_text_generation_server_345M.sh](examples/run_text_generation_server_345M.sh) for an example of how to run the server.

Once the server is running, you can use `tools/text_generation_cli.py` to query it; it takes a single argument, the host the server is running on.
<pre>
tools/text_generation_cli.py localhost
</pre>

You can also use CURL or any other tool to query the server directly:
<pre>
curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8' -d '{"prompts":["Hello world"], "tokens_to_generate":1}'
</pre>

See [megatron/text_generation_server.py](megatron/text_generation_server.py) for more API options.
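If you prefer Python over curl, the same PUT request can be issued with the `requests` library. This is a minimal sketch; the response schema is defined by the server, so inspect the returned JSON rather than relying on a particular field name.
<pre>
import requests

# Same request as the curl example above: PUT a JSON payload to the server's /api endpoint.
url = "http://localhost:5000/api"
headers = {"Content-Type": "application/json; charset=UTF-8"}
payload = {"prompts": ["Hello world"], "tokens_to_generate": 32}

response = requests.put(url, json=payload, headers=headers)
response.raise_for_status()
print(response.json())  # see megatron/text_generation_server.py for the exact response format
</pre>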
## GPT Evaluation
We include example scripts for GPT evaluation on WikiText perplexity and LAMBADA Cloze accuracy.
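WikiText perplexity is the exponential of the average per-token cross-entropy over the evaluation set; the snippet below only illustrates that arithmetic and is not the evaluation script itself, which also handles detokenization-aware normalization.
<pre>
import math

def perplexity(summed_nll, num_tokens):
    # Perplexity = exp(average negative log-likelihood per token, in nats).
    return math.exp(summed_nll / num_tokens)

# For example, an average loss of 2.0 nats/token corresponds to a perplexity of ~7.39.
print(perplexity(2.0 * 1000, 1000))
</pre>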
...
#!/bin/bash
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
python tools/generate_samples_gpt2.py \
--tensor-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--load $CHECKPOINT_PATH \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--batch-size 2 \
--seq-length 1024 \
--out-seq-length 1024 \
--temperature 1.0 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--genfile unconditional_samples.json \
--num-samples 2 \
--top_p 0.9 \
--recompute
# Multi-Stage Prompting for Knowledgeable Dialogue Generation
This directory contains all the scripts for multi-stage prompting for knowledgeable dialogue generation, covering data preparation as well as knowledge and response generation. More details are available in the [`knowledgeable task directory`](../../tasks/msdp).
#!/bin/bash
# Data preparation for our framework: preprocessing the WoW and WoI datasets
# The datasets can be downloaded through the following links:
# WoW: https://parl.ai/projects/wizard_of_wikipedia/
# WoI: https://parl.ai/projects/sea/
DIR=`pwd`
# Before running the preprocessing, please download
# the Wizard of Wikipedia and Wizard of the Internet datasets
WOW_DATA_FOLDER=<PATH_OF_WIZARD_OF_WIKIPEDIA_DATA_FOLDER>
WOI_DATA_FOLDER=<PATH_OF_WIZARD_OF_INTERNET_DATA_FOLDER>
# We provide examples for processing the raw data from Wizard of Wikipedia
# Processing the train dataset (train.json)
python ${DIR}/tasks/msdp/preprocessing.py \
--func process_wow_dataset \
--raw_file ${WOW_DATA_FOLDER}/train.json \
--processed_file ${WOW_DATA_FOLDER}/train_processed.txt
# Processing test seen dataset (test_random_split.json)
python ${DIR}/tasks/msdp/preprocessing.py \
--func process_wow_dataset \
--raw_file ${WOW_DATA_FOLDER}/test_random_split.json \
--processed_file ${WOW_DATA_FOLDER}/testseen_processed.txt \
--knwl_ref_file ${WOW_DATA_FOLDER}/output_testseen_knowledge_reference.txt \
--resp_ref_file ${WOW_DATA_FOLDER}/output_testseen_response_reference.txt
# processing test unseen dataset (test_topic_split.json)
python ${DIR}/tasks/msdp/preprocessing.py \
--func process_wow_dataset \
--raw_file ${WOW_DATA_FOLDER}/test_topic_split.json \
--processed_file ${WOW_DATA_FOLDER}/testunseen_processed.txt \
--knwl_ref_file ${WOW_DATA_FOLDER}/output_testunseen_knowledge_reference.txt \
--resp_ref_file ${WOW_DATA_FOLDER}/output_testunseen_response_reference.txt
# We provide the following script to process the raw data from Wizard of Internet
# Processing the test dataset (test.jsonl)
python ${DIR}/tasks/msdp/preprocessing.py \
--func process_woi_dataset \
--raw_file ${WOI_DATA_FOLDER}/test.jsonl \
--processed_file ${WOI_DATA_FOLDER}/test_processed.txt \
--knwl_ref_file ${WOI_DATA_FOLDER}/output_test_knowledge_reference.txt \
--resp_ref_file ${WOI_DATA_FOLDER}/output_test_response_reference.txt
# Get the knowledge generation prompts for each test dataset in WoW and WoI
MODEL_FILE=<PATH_OF_THE_FINETUNED_DPR_MODEL>
# WoW test seen
python ${DIR}/tasks/msdp/preprocessing.py \
--func get_knwl_gen_prompts \
--test_file ${WOW_DATA_FOLDER}/testseen_processed.txt \
--train_file ${WOW_DATA_FOLDER}/train_processed.txt \
--model_file ${MODEL_FILE} \
--processed_file ${WOW_DATA_FOLDER}/output_testseen_knowledge_prompts.json \
--data_type wow_seen
# WoW test unseen
python ${DIR}/tasks/msdp/preprocessing.py \
--func get_knwl_gen_prompts \
--test_file ${WOW_DATA_FOLDER}/testunseen_processed.txt \
--train_file ${WOW_DATA_FOLDER}/train_processed.txt \
--model_file ${MODEL_FILE} \
--processed_file ${WOW_DATA_FOLDER}/output_testunseen_knowledge_prompts.json \
--data_type wow_unseen
# WoI
python ${DIR}/tasks/msdp/preprocessing.py \
--func get_knwl_gen_prompts \
--test_file ${WOI_DATA_FOLDER}/test_processed.txt \
--train_file ${WOW_DATA_FOLDER}/train_processed.txt \
--model_file ${MODEL_FILE} \
--processed_file ${WOI_DATA_FOLDER}/output_test_knowledge_prompts.json \
--data_type woi
# Get the response generation prompts (can be applied for all the test datasets)
python ${DIR}/tasks/msdp/preprocessing.py \
--func get_resp_gen_prompts \
--train_file ${WOW_DATA_FOLDER}/train_processed.txt \
--processed_file ${WOW_DATA_FOLDER}/output_response_prompts.txt
#!/bin/bash
#########################
# Evaluate the F1 scores.
#########################
WORLD_SIZE=1
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
MODEL_GEN_PATH=<PATH_OF_THE_KNOWLEDGE_GENERATION> \
(e.g., /testseen_knowledge_generations.txt)
GROUND_TRUTH_PATH=<PATH_OF_THE_GROUND_TRUTH_KNOWLEDGE> \
(e.g., /testseen_knowledge_reference.txt)
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/msdp/main.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 4 \
--task MSDP-EVAL-F1 \
--guess-file ${MODEL_GEN_PATH} \
--answer-file ${GROUND_TRUTH_PATH}
############################################
# Evaluate BLEU, METEOR, and ROUGE-L scores.
############################################
# We follow nlg-eval (https://github.com/Maluuba/nlg-eval) to
# evaluate the BLEU, METEOR, and ROUGE-L scores.
# To evaluate these metrics, please set up the environment based on
# the nlg-eval GitHub repository, and run the corresponding evaluation commands.
nlg-eval \
--hypothesis=<PATH_OF_THE_KNOWLEDGE_GENERATION> \
--references=<PATH_OF_THE_GROUND_TRUTH_KNOWLEDGE>
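The F1 evaluated here is the usual unigram-overlap F1 between a generated line and its reference. The sketch below is illustrative only; the authoritative implementation lives under tasks/msdp/.
<pre>
from collections import Counter

def unigram_f1(guess, answer):
    # Harmonic mean of token-overlap precision and recall.
    guess_tokens = guess.lower().split()
    answer_tokens = answer.lower().split()
    overlap = sum((Counter(guess_tokens) & Counter(answer_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess_tokens)
    recall = overlap / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat sat on a mat"))
</pre>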
#!/bin/bash
#########################
# Evaluate the F1 scores.
#########################
WORLD_SIZE=1
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
MODEL_GEN_PATH=<PATH_OF_THE_RESPONSE_GENERATION> \
(e.g., /testseen_response_generations.txt)
GROUND_TRUTH_PATH=<PATH_OF_THE_GROUND_TRUTH_RESPONSE> \
(e.g., /testseen_response_reference.txt)
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/msdp/main.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 4 \
--task MSDP-EVAL-F1 \
--guess-file ${MODEL_GEN_PATH} \
--answer-file ${GROUND_TRUTH_PATH}
##########################
# Evaluate the KF1 scores.
##########################
MODEL_GEN_PATH=<PATH_OF_THE_RESPONSE_GENERATION> \
(e.g., /testseen_response_generations.txt)
GROUND_TRUTH_PATH=<PATH_OF_THE_GROUND_TRUTH_KNOWLEDGE> \
(e.g., /testseen_knowledge_reference.txt)
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/msdp/main.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 4 \
--task MSDP-EVAL-F1 \
--guess-file ${MODEL_GEN_PATH} \
--answer-file ${GROUND_TRUTH_PATH}
############################################
# Evaluate BLEU, METEOR, and ROUGE-L scores.
############################################
# We follow nlg-eval (https://github.com/Maluuba/nlg-eval) to
# evaluate the BLEU, METEOR, and ROUGE-L scores.
# To evaluate these metrics, please set up the environment based on
# the nlg-eval GitHub repository, and run the corresponding evaluation commands.
nlg-eval \
--hypothesis=<PATH_OF_THE_RESPONSE_GENERATION> \
--references=<PATH_OF_THE_GROUND_TRUTH_RESPONSE>
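nlg-eval can also be driven from Python instead of the CLI shown above. This sketch assumes its compute_metrics helper; check the nlg-eval README for the exact signature of the version you install.
<pre>
# Requires the nlg-eval package and its data files (see https://github.com/Maluuba/nlg-eval).
from nlgeval import compute_metrics

# hypothesis: one generated response per line; references: a list of reference files.
metrics = compute_metrics(hypothesis='testseen_response_generations.txt',
                          references=['testseen_response_reference.txt'])
print(metrics)  # dict with Bleu_1..4, METEOR, ROUGE_L, ...
</pre>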
#!/bin/bash
# Preparing the input file for the response generation (second-stage prompting)
DIR=`pwd`
TEST_FILE=<PATH_OF_PROCESSED_TEST_DATA> \
(e.g., /testseen_processed.txt)
KNOWLEDGE_FILE=<PATH_OF_GENERATED_KNOWLEDGE_DATA> \
(e.g., /testseen_knowledge_generations.txt)
PROCESSED_FILE=<PATH_OF_INPUT_FILE_FOR_RESPONSE_GENERATION> \
(e.g., /testseen_processed_with_generated_knowledge.txt)
python ${DIR}/tasks/msdp/preprocessing.py \
--func prepare_input \
--test_file ${TEST_FILE} \
--knowledge_gen_file ${KNOWLEDGE_FILE} \
--processed_file ${PROCESSED_FILE}
#!/bin/bash
# Stage-1: Prompt a pretrained language model to generate the context-relevant knowledge
# The input contains prompts and the current dialogue context; the output is the relevant knowledge
# The size of the pretrained language model is 357M
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT_PATH=<PATH_OF_LANGUAGE_MODEL> (e.g., /357m)
VOCAB_PATH=<PATH_OF_VOCAB_FILE> (e.g., /gpt2-vocab.json)
MERGE_PATH=<PATH_OF_MERGE_FILE> (e.g., /gpt2-merges.txt)
INPUT_PATH=<PATH_OF_PROCESSED_TEST_DATA_FILE> \
(e.g., /testseen_processed.txt)
PROMPT_PATH=<PATH_OF_KNOWLEDGE_GENERATION_PROMPTS> \
(e.g., /testseen_knowledge_prompts.json)
OUTPUT_PATH=<PATH_OF_OUTPUT_GENERATION_FILE> \
(e.g., /testseen_knowledge_generations.txt)
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/msdp/main.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--load ${CHECKPOINT_PATH} \
--fp16 \
--DDP-impl torch \
--tokenizer-type GPT2BPETokenizer \
--sample-input-file ${INPUT_PATH} \
--sample-output-file ${OUTPUT_PATH} \
--prompt-file ${PROMPT_PATH} \
--prompt-type knowledge \
--num-prompt-examples 10 \
--task MSDP-PROMPT
# NOTE: If you use an API for the model generation, please pass
# the "--api-prompt" flag (set this value to True).
#!/bin/bash
# Stage-2: Prompt a pretrained language model to generate the corresponding response
# The input contains prompts, the current dialogue context, and the knowledge generated in Stage-1
# The output is the corresponding response.
# The size of the pretrained language model is 357M
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT_PATH=<PATH_OF_LANGUAGE_MODEL> (e.g., /357m)
VOCAB_PATH=<PATH_OF_VOCAB_FILE> (e.g., /gpt2-vocab.json)
MERGE_PATH=<PATH_OF_MERGE_FILE> (e.g., /gpt2-merges.txt)
INPUT_PATH=<PATH_OF_INPUT_TEST_DATA_FILE> (e.g., /testseen_processed.txt)
PROMPT_PATH=<PATH_OF_RESPONSE_GENERATION_PROMPTS> \
(e.g., /response_prompts.txt)
OUTPUT_PATH=<PATH_OF_OUTPUT_GENERATION_FILE> \
(e.g., /output_testseen_response_generations.txt)
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/msdp/main.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--micro-batch-size 1 \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--load ${CHECKPOINT_PATH} \
--fp16 \
--DDP-impl torch \
--tokenizer-type GPT2BPETokenizer \
--sample-input-file ${INPUT_PATH} \
--sample-output-file ${OUTPUT_PATH} \
--prompt-file ${PROMPT_PATH} \
--prompt-type response \
--num-prompt-examples 20 \
--task MSDP-PROMPT
# NOTE: If you use an API for the model generation, please pass
# the "--api-prompt" flag (set this value to True).
@@ -12,21 +12,21 @@ MERGE_FILE=<Path to merges.txt (e.g. /gpt2-merges.txt)>
pip install flask-restful
python -m torch.distributed.run $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--load ${CHECKPOINT} \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--micro-batch-size 1 \
--seq-length 1024 \
--out-seq-length 1024 \
--temperature 1.0 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--top_p 0.9 \
--seed 42
@@ -12,21 +12,21 @@ MERGE_FILE=<Path to merges.txt (e.g. /gpt2-merges.txt)>
pip install flask-restful
python -m torch.distributed.launch $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--load ${CHECKPOINT} \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--micro-batch-size 1 \
--seq-length 1024 \
--out-seq-length 1024 \
--temperature 1.0 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--top_p 0.9 \
--seed 42
@@ -17,6 +17,7 @@ import torch
from .global_vars import get_args
from .global_vars import get_current_global_batch_size
from .global_vars import get_num_microbatches
from .global_vars import get_signal_handler
from .global_vars import update_num_microbatches
from .global_vars import get_tokenizer
from .global_vars import get_tensorboard_writer
...
@@ -41,6 +41,7 @@ def parse_args(extra_args_provider=None, defaults={},
parser = _add_biencoder_args(parser)
parser = _add_vit_args(parser)
parser = _add_logging_args(parser)
parser = _add_inference_args(parser)
# Custom arguments.
if extra_args_provider is not None:
@@ -244,17 +245,29 @@ def parse_args(extra_args_provider=None, defaults={},
if args.fp32_residual_connection:
assert args.fp16 or args.bf16, \
'residual connection in fp32 only supported when using fp16 or bf16.'
TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])
# Persistent fused layer norm.
if TORCH_MAJOR < 1 or (TORCH_MAJOR == 1 and TORCH_MINOR < 11):
args.no_persist_layer_norm = True
if args.rank == 0:
print('Persistent fused layer norm kernel is supported from '
'pytorch v1.11 (nvidia pytorch container paired with v1.11). '
'Defaulting to no_persist_layer_norm=True')
# Activation checkpointing.
if args.distribute_checkpointed_activations:
assert args.tensor_model_parallel_size > 1, 'can distribute ' \
'checkpointed activations only across tensor model ' \
'parallel groups'
assert args.activations_checkpoint_method is not None, \
'for distributed checkpoint activations to work you '\
'need to use a activation-checkpoint method '
assert TORCH_MAJOR >= 1 and TORCH_MINOR >= 10, \
'distributed checkpoint activations are supported for pytorch ' \
'v1.10 and above (Nvidia Pytorch container >= 21.07). Current ' \
'pytorch version is v%s.%s.' % (TORCH_MAJOR, TORCH_MINOR)
_print_args(args)
return args
@@ -279,6 +292,18 @@ def _check_arg_is_not_none(args, arg):
assert getattr(args, arg) is not None, '{} argument is None'.format(arg)
def _add_inference_args(parser):
group = parser.add_argument_group(title='inference')
group.add_argument('--inference-batch-times-seqlen-threshold',
type=int, default=512,
help='During inference, if batch-size times '
'sequence-length is smaller than this threshold '
'then we will not use pipelining, otherwise we will.')
return parser
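The threshold above just gates whether inference uses micro-batch pipelining; the decision it encodes reduces to a one-line check (illustrative, not the actual call site):
<pre>
def use_inference_pipelining(batch_size, sequence_length, threshold=512):
    # Pipelining only pays off once batch_size * sequence_length is large enough.
    return batch_size * sequence_length >= threshold
</pre>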
def _add_network_size_args(parser):
group = parser.add_argument_group(title='network size')
@@ -354,6 +379,9 @@ def _add_logging_args(parser):
group.add_argument('--log-memory-to-tensorboard',
action='store_true',
help='Enable memory logging to tensorboard.')
group.add_argument('--log-world-size-to-tensorboard',
action='store_true',
help='Enable world size logging to tensorboard.')
return parser
@@ -449,6 +477,9 @@ def _add_training_args(parser):
'by this value.')
group.add_argument('--exit-duration-in-mins', type=int, default=None,
help='Exit the program after this many minutes.')
group.add_argument('--exit-signal-handler', action='store_true',
help='Dynamically save the checkpoint and shutdown the '
'training if SIGTERM is received')
group.add_argument('--tensorboard-dir', type=str, default=None,
help='Write TensorBoard logs to this directory.')
group.add_argument('--no-masked-softmax-fusion',
@@ -473,6 +504,11 @@ def _add_training_args(parser):
help='Disable asynchronous execution of '
'tensor-model-parallel all-reduce with weight '
'gradient compuation of a column-linear layer.')
group.add_argument('--no-persist-layer-norm', action='store_true',
help='Disable using persistent fused layer norm kernel. '
'This kernel supports only a set of hidden sizes. Please '
'check persist_ln_hidden_sizes if your hidden '
'size is supported.')
return parser
...
import signal

import torch


def get_world_size():
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        world_size = torch.distributed.get_world_size()
    else:
        world_size = 1
    return world_size


def get_device(local_rank=None):
    backend = torch.distributed.get_backend()
    if backend == 'nccl':
        if local_rank is None:
            device = torch.device('cuda')
        else:
            device = torch.device(f'cuda:{local_rank}')
    elif backend == 'gloo':
        device = torch.device('cpu')
    else:
        raise RuntimeError
    return device


def all_gather_item(item, dtype, group=None, async_op=False, local_rank=None):
    if not torch.distributed.is_available() or \
       not torch.distributed.is_initialized():
        return [item]

    device = get_device(local_rank)

    if group is not None:
        group_size = group.size()
    else:
        group_size = get_world_size()

    tensor = torch.tensor([item], device=device, dtype=dtype)
    output_tensors = [
        torch.zeros(1, dtype=tensor.dtype, device=tensor.device)
        for _ in range(group_size)
    ]
    torch.distributed.all_gather(output_tensors, tensor, group, async_op)
    output = [elem.item() for elem in output_tensors]
    return output


class DistributedSignalHandler:
    def __init__(self, sig=signal.SIGTERM):
        self.sig = sig

    def signals_received(self):
        all_received = all_gather_item(
            self._signal_received, dtype=torch.int32
        )
        return all_received

    def __enter__(self):
        self._signal_received = False
        self.released = False
        self.original_handler = signal.getsignal(self.sig)

        def handler(signum, frame):
            self._signal_received = True

        signal.signal(self.sig, handler)

        return self

    def __exit__(self, type, value, tb):
        self.release()

    def release(self):
        if self.released:
            return False

        signal.signal(self.sig, self.original_handler)
        self.released = True

        return True
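A minimal usage sketch: install the handler around a training loop and poll signals_received() so that every rank agrees on whether to checkpoint and exit. The train_step and save_checkpoint names below are placeholders, not Megatron functions; the real wiring goes through get_signal_handler() in megatron/global_vars.py when --exit-signal-handler is set.
<pre>
from megatron.dist_signal_handler import DistributedSignalHandler

def train(train_step, save_checkpoint, num_iterations):
    with DistributedSignalHandler() as handler:
        for iteration in range(num_iterations):
            train_step(iteration)
            # All ranks exchange their local SIGTERM status, so the decision
            # to stop is made collectively rather than by a single rank.
            if any(handler.signals_received()):
                save_checkpoint(iteration)
                break
</pre>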
@@ -329,6 +329,7 @@ void cuApplyLayerNorm(
mean[i1] = mu;
invvar[i1] = c_invvar;
}
__syncthreads();
}
}
@@ -644,6 +645,8 @@ void cuComputeGradInput(
k_grad_input[l] = static_cast<T>(f_grad_input);
}
}
// prevent race where buf is written again before reads are done
__syncthreads();
}
}
...
@@ -21,6 +21,7 @@ import time
import torch
from megatron import dist_signal_handler
from megatron.tokenizer import build_tokenizer
from .arguments import parse_args
from .microbatches import build_num_microbatches_calculator
@@ -31,6 +32,7 @@ _GLOBAL_TOKENIZER = None
_GLOBAL_TENSORBOARD_WRITER = None
_GLOBAL_ADLR_AUTORESUME = None
_GLOBAL_TIMERS = None
_GLOBAL_SIGNAL_HANDLER = None
def get_args():
@@ -75,6 +77,14 @@ def get_timers():
_ensure_var_is_initialized(_GLOBAL_TIMERS, 'timers')
return _GLOBAL_TIMERS
def get_signal_handler():
_ensure_var_is_initialized(_GLOBAL_SIGNAL_HANDLER, 'signal handler')
return _GLOBAL_SIGNAL_HANDLER
def _set_signal_handler():
global _GLOBAL_SIGNAL_HANDLER
_ensure_var_is_not_initialized(_GLOBAL_SIGNAL_HANDLER, 'signal handler')
_GLOBAL_SIGNAL_HANDLER = dist_signal_handler.DistributedSignalHandler().__enter__()
def set_global_variables(extra_args_provider=None, args_defaults={},
ignore_unknown_args=False):
@@ -89,6 +99,9 @@ def set_global_variables(extra_args_provider=None, args_defaults={},
_set_adlr_autoresume(args)
_set_timers()
if args.exit_signal_handler:
_set_signal_handler()
def _parse_args(extra_args_provider=None, defaults={},
ignore_unknown_args=False):
...
@@ -180,7 +180,7 @@ def _initialize_distributed():
torch.distributed.init_process_group(
backend=args.distributed_backend,
world_size=args.world_size, rank=args.rank,
timeout=timedelta(minutes=10))
# Set the tensor model-parallel, pipeline model-parallel, and
# data-parallel communicators.
...
@@ -23,6 +23,12 @@ from torch.nn.parameter import Parameter
from torch.nn import init
import importlib
try:
from apex.contrib.layer_norm.layer_norm import FastLayerNormFN
HAVE_PERSIST_LAYER_NORM = True
except:
HAVE_PERSIST_LAYER_NORM = False
global fused_mix_prec_layer_norm_cuda
fused_mix_prec_layer_norm_cuda = None
@@ -61,13 +67,23 @@ class FusedLayerNormAffineFunction(torch.autograd.Function):
class MixedFusedLayerNorm(torch.nn.Module):
def __init__(self, normalized_shape, eps=1e-5, no_persist_layer_norm=True):
super(MixedFusedLayerNorm, self).__init__()
global fused_mix_prec_layer_norm_cuda
fused_mix_prec_layer_norm_cuda = importlib.import_module(
"fused_mix_prec_layer_norm_cuda")
# List of hiddens sizes supported in the persistent layer norm kernel
# If the hidden size is not supported, fall back to the non-persistent
# kernel.
persist_ln_hidden_sizes = [1024, 1536, 2048, 2304, 3072, 3840, 4096,
5120, 6144, 8192, 10240, 12288, 12800, 15360, 16384, 18432, 20480,
24576, 25600, 30720, 32768, 40960, 49152, 65536]
if normalized_shape not in persist_ln_hidden_sizes or \
not HAVE_PERSIST_LAYER_NORM:
no_persist_layer_norm = True
if isinstance(normalized_shape, numbers.Integral):
normalized_shape = (normalized_shape,)
self.normalized_shape = torch.Size(normalized_shape)
@@ -75,6 +91,7 @@ class MixedFusedLayerNorm(torch.nn.Module):
self.weight = Parameter(torch.Tensor(*normalized_shape))
self.bias = Parameter(torch.Tensor(*normalized_shape))
self.reset_parameters()
self.no_persist_layer_norm = no_persist_layer_norm
def reset_parameters(self):
@@ -85,6 +102,10 @@ class MixedFusedLayerNorm(torch.nn.Module):
def forward(self, input):
if self.no_persist_layer_norm:
return FusedLayerNormAffineFunction.apply(
input, self.weight, self.bias, self.normalized_shape, self.eps)
else:
return FastLayerNormFN.apply(
input, self.weight, self.bias, self.eps)
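The selection above reduces to a simple eligibility check: the persistent FastLayerNormFN path is taken only when apex provides the kernel and the hidden size is on the supported list; otherwise the code falls back to the non-persistent fused kernel. A standalone sketch of that check (it mirrors, rather than replaces, the constructor logic):
<pre>
PERSIST_LN_HIDDEN_SIZES = {1024, 1536, 2048, 2304, 3072, 3840, 4096,
                           5120, 6144, 8192, 10240, 12288, 12800, 15360,
                           16384, 18432, 20480, 24576, 25600, 30720, 32768,
                           40960, 49152, 65536}

def can_use_persistent_layer_norm(hidden_size, have_persist_kernel):
    # True only if apex's FastLayerNormFN is importable and the size is supported.
    return have_persist_kernel and hidden_size in PERSIST_LN_HIDDEN_SIZES

print(can_use_persistent_layer_norm(4096, True))   # True
print(can_use_persistent_layer_norm(4097, True))   # False -> non-persistent fused kernel
</pre>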
@@ -82,16 +82,13 @@ class GPTModel(MegatronModule):
self.language_model.set_input_tensor(input_tensor)
def forward(self, input_ids, position_ids, attention_mask, labels=None,
tokentype_ids=None, inference_params=None):
lm_output = self.language_model(
input_ids,
position_ids,
attention_mask,
inference_params=inference_params)
if self.post_process:
return post_language_model_processing(
...
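The two keyword arguments removed here (set_inference_key_value_memory and inference_max_sequence_len) are folded into a single inference_params object that is threaded through forward(). The dataclass below is only a guess at the kind of state such an object carries, for orientation; see the Megatron inference code for the actual class.
<pre>
from dataclasses import dataclass, field

@dataclass
class SimpleInferenceParams:
    # Illustrative stand-in for the inference_params argument, not Megatron's class.
    max_batch_size: int
    max_sequence_len: int
    # Tokens already processed, for incremental (key/value cached) decoding.
    sequence_len_offset: int = 0
    # Per-layer key/value cache allocated once and reused across decoding steps.
    key_value_memory_dict: dict = field(default_factory=dict)

params = SimpleInferenceParams(max_batch_size=4, max_sequence_len=1024)
</pre>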
@@ -334,10 +334,6 @@ class TransformerLanguageModel(MegatronModule):
# Decoder (usually set to False, True if part of an encoder-decoder
# architecture and in decoder-only stage).
if self.add_decoder:
self.decoder = ParallelTransformer(
self.init_method,
output_layer_init_method,
@@ -386,8 +382,7 @@ class TransformerLanguageModel(MegatronModule):
def forward(self, enc_input_ids, enc_position_ids, enc_attn_mask,
dec_input_ids=None, dec_position_ids=None, dec_attn_mask=None,
enc_dec_attn_mask=None, tokentype_ids=None,
inference_params=None,
pooling_sequence_index=0,
enc_hidden_states=None, output_enc_hidden=False):
@@ -404,8 +399,7 @@ class TransformerLanguageModel(MegatronModule):
encoder_output = self.encoder(
encoder_input,
enc_attn_mask,
inference_params=inference_params)
else:
encoder_output = self.encoder_hidden_state
else:
@@ -438,8 +432,7 @@ class TransformerLanguageModel(MegatronModule):
dec_attn_mask,
encoder_output=encoder_output,
enc_dec_attn_mask=enc_dec_attn_mask,
inference_params=inference_params)
if self.add_pooler and self.post_process:
return decoder_output, encoder_output, pooled_output
...