Commit 22c0e300 authored by Jared Casper

Merge branch 'staging' into 'master'

April 2020 Release

See merge request ADLR/megatron-lm!69
parents 70174ae3 acf8780b
The following applies to all files unless otherwise noted:
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
...
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--
This repository also contains code from Hugging Face Inc., Google Research,
and Facebook (from their Fairseq project). Files from these
organizations have notices at the top of each file. Below are licenses
used in those files, as indicated.
------------- LICENSE FOR huggingface and Google Research code --------------
Apache License
...
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
------------- LICENSE FOR Facebook Fairseq code --------------
MIT License
Copyright (c) Facebook, Inc. and its affiliates.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
[Megatron](https://arxiv.org/pdf/1909.08053.pdf) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel, and multinode training of [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [BERT](https://arxiv.org/pdf/1810.04805.pdf) using mixed precision.
Our codebase is capable of efficiently training a 72-layer, 8.3 billion parameter GPT-2 language model with 8-way model and 64-way data parallelism across 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak theoretical FLOPs. Using our GPT-2 model we achieve SOTA results on the WikiText-103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets.
For BERT training, we swapped the position of the layer normalization and the residual connection in the model architecture (similar to the GPT-2 architecture), which allowed the models to continue to improve as they were scaled up. Our BERT model with 3.9 billion parameters reaches a loss of 1.16, a SQuAD 2.0 F1-score of 91.7, and RACE accuracy of 90.9%.
<a id="contents"></a>
# Contents
<!-- MarkdownTOC -->
- [Setup](#setup)
- [Downloading Checkpoints](#downloading-checkpoints)
- [Usage](#usage)
- [Training](#training)
- [Data Preprocessing](#data-preprocessing)
- [BERT Pretraining](#bert-pretraining)
- [GPT-2 Pretraining](#gpt-2-pretraining)
- [Distributed BERT or GPT-2 Pretraining](#distributed-bert-or-gpt-2-pretraining)
- [Evaluation and Tasks](#evaluation-and-tasks)
- [GPT-2 Text Generation](#gpt-2-text-generation)
- [GPT-2 Evaluation](#gpt-2-evaluation)
- [WikiText Perplexity Evaluation](#wikitext-perplexity-evaluation)
- [LAMBADA Cloze Accuracy](#lambada-cloze-accuracy)
- [BERT Task Evaluation](#bert-task-evaluation)
- [RACE Evaluation](#race-evaluation)
- [MNLI Evaluation](#mnli-evaluation)
- [Datasets](#datasets)
- [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
- [Collecting GPT-2 Webtext Data](#collecting-gpt-2-webtext-data)
<!-- /MarkdownTOC -->
<a id="setup"></a>
# Setup
We officially support only python 3.6, pytorch 1.5, cuda 10, and nccl 2.6 versions and above.
To use this repo please install the latest supported versions of PyTorch with GPU support. We strongly recommend using one of [NGC's recent PyTorch containers](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) (the latest compatible version at time of publication can be pulled with `docker pull nvcr.io/nvidia/pytorch:20.03-py3`). Data preprocessing requires [NLTK](https://www.nltk.org/install.html), though this is not required for training, evaluation or downstream tasks.
<a id="downloading-checkpoints"></a>
## Downloading Checkpoints
We've provided two pretrained checkpoints for use in evaluation or for finetuning downstream tasks. To access these checkpoints, first [sign up](https://ngc.nvidia.com/signup) for and [set up](https://ngc.nvidia.com/setup/installers/cli) the NVIDIA GPU Cloud (NGC) Registry CLI.
The checkpoints can be downloaded with:
<pre>
ngc registry model download-version --dest &#60;output_base_directory&#62; nvidia/&#60;model_name&#62;:&#60;version&#62;
</pre>
The available models along with `<model_name>:<version>` are below:
* [BERT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m): megatron\_bert\_345m:v0.0
* [GPT-2-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m): megatron\_lm\_345m:v0.0
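For example, to fetch the 345M parameter BERT checkpoint into a local `checkpoints` directory (the destination name is arbitrary), one could run:
<pre>
ngc registry model download-version --dest checkpoints nvidia/megatron_bert_345m:v0.0
</pre>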
Further documentation for downloading models can be found in the [NGC documentation](https://docs.nvidia.com/dgx/ngc-registry-cli-user-guide/index.html#topic_6_4_1).
<a id="usage"></a>
# Usage
After installation, there are several possible workflows. The most comprehensive is:
1. Data preprocessing
2. Pretraining
3. Finetuning (Optional for zero-shot tasks)
4. Downstream task evaluation or text generation
However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above.
We've provided several scripts for pretraining both BERT and GPT-2 in the [`examples`](./examples) directory, as well as scripts for both zero-shot and fine-tuned downstream tasks, including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT-2 interactive text generation.
<a id="training"></a>
# Training
<a id="data-preprocessing"></a>
## Data Preprocessing
We support three file formats for training, but all require preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example:
<pre>
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
</pre>
The name of the `text` field of the json can be changed by using the `--json-key` flag in [`preprocess_data.py`](./tools/preprocess_data.py). The other metadata are optional and are not used in training.
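For example, if a corpus stored each sample under a hypothetical `content` key instead of `text`, the key can be selected at preprocessing time:
<pre>
python tools/preprocess_data.py \
       --input my-corpus.json \
       --json-key content \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences
</pre>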
The loose json is then processed into a binary format for training. To convert the json into mmap, cached index file, or the lazy loader format use `preprocess_data.py`. Set the `--dataset-impl` flag to `mmap`, `cached`, or `lazy`, respectively (default is `mmap`). An example script to prepare data for BERT training is:
<pre>
python tools/preprocess_data.py \
--input my-corpus.json \
--output-prefix my-bert \
--vocab bert-vocab.txt \
--dataset-impl mmap \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences
</pre>
The output will be two files named, in this case, `my-bert_text_sentence.bin` and `my-bert_text_sentence.idx`. The `--data-path` specified in later BERT training is the full path and new filename, but without the file extension.
Some minor modifications are required for GPT-2 data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:
<pre>
python tools/preprocess_data.py \
--input my-corpus.json \
--output-prefix my-gpt2 \
--vocab gpt2-vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file gpt2-merges.txt \
--append-eod
</pre>
Here the output files are named `my-gpt2_text_document.bin` and `my-gpt2_text_document.idx`. As before, in GPT-2 training, use the longer name without the extension as `--data-path`.
Further command line arguments are described in the source file [`preprocess_data.py`](./tools/preprocess_data.py).
<a id="bert-pretraining"></a>
## BERT Pretraining
`bash examples/pretrain_bert.sh`
This script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at `--lr` to a minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--warmup`. While this is single GPU training, the batch size specified by `--batch-size` is per GPU used for data parallelism. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`).
The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions.
<pre>
CHECKPOINT_PATH=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
DATA_PATH=my-bert_text_sentence
BERT_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--lr 0.0001 \
--train-iters 2000000 \
--min-lr 0.00001 \
--lr-decay-iters 990000 \
--warmup 0.01 \
--batch-size 8 \
--vocab-file $VOCAB_FILE \
--split 949,50,1 \
--fp16"
OUTPUT_ARGS="--log-interval 10 \
--save-interval 500 \
--eval-interval 100 \
--eval-iters 10 \
--checkpoint-activations"
python pretrain_bert.py \
$BERT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH
</pre>
Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
<a id="gpt-2-pretraining"></a>
## GPT-2 Pretraining
`bash examples/pretrain_gpt2.sh`
This script runs single GPU 345M parameter GPT-2 pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training.
It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a `json` vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the `--lr-decay-style` has been set to cosine decay. Note that the `--data-path` now includes the additional `_text_document` suffix added in preprocessing, but does not include the file extensions.
<pre>
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH=my-gpt2_text_document
GPT2_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--batch-size 4 \
--lr 0.00015 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--lr-decay-style cosine \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--warmup .01 \
--fp16"
OUTPUT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;
python pretrain_gpt2.py \
$GPT2_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH
</pre>
Further command line arguments are described in the source file [`arguments.py`](./megatron/arguments.py).
<a id="distributed-bert-or-gpt-2-pretraining"></a>
## Distributed BERT or GPT-2 Pretraining
`bash examples/pretrain_bert_distributed.sh`
`bash examples/pretrain_gpt2_distributed.sh`
These scripts use the PyTorch distributed launcher for distributed training. As such, multinode training can be achieved by properly setting environment variables and using `init_method='env://'` in the launcher. See the official PyTorch [documentation](https://pytorch.org/docs/stable/distributed.html#launch-utility) for further description of these [environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization). By default, multinode training uses the [nccl](https://developer.nvidia.com/nccl) distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the Python flag `-m torch.distributed.launch`, detailed below, are the only additional requirements to adopt distributed training.
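For example, a hypothetical two-node launch (set `--node_rank 1` on the second node and point `--master_addr` at node 0's address) extends the launcher arguments used below:
<pre>
python -m torch.distributed.launch --nproc_per_node 8 \
       --nnodes 2 \
       --node_rank 0 \
       --master_addr &#60;ip address of node 0&#62; \
       --master_port 6000 \
       pretrain_gpt2.py &#60;same arguments as below&#62;
</pre>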
The two tiers of parallelism are data and model parallelism. First, we facilitate two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of the back propagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with back propagation computation. To switch between these two options use `--DDP-impl local` or `--DDP-impl torch`, respectively. As expected, Torch distributed data parallelism is more efficient at larger model parallel sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory and for some configurations (e.g., 2.5 billion parameters using 2-way model parallel and 1.2 billion parameters with no model parallel) can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time.
Second, we developed a simple and efficient intra-layer model parallel approach. To use model parallelism, add the `--model-parallel-size` flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. With `WORLD_SIZE` GPUs and `MP_SIZE` model parallel size, `WORLD_SIZE`/`MP_SIZE` GPUs will be used for data parallelism. For example, with a world size of 8 and model parallelism of 2, training will use 2-way model parallelism and 4-way data parallelism. The default value for `--model-parallel-size` is 1, which will not implement model parallelism.
Other than these minor changes, the distributed training is identical to the training on a single GPU.
Distributed BERT training:
<pre>
WORLD_SIZE=8
MP_SIZE=2
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT_PATH=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
DATA_PATH=my-bert_text_sentence
BERT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;
OUTPUT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_bert.py \
$BERT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--model-parallel-size $MP_SIZE \
--DDP-impl torch
</pre>
Distributed GPT-2 training:
<pre>
WORLD_SIZE=8
MP_SIZE=2
DISTRIBUTED_ARGS=&#60;same as those directly above&#62;
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
DATA_PATH=my-gpt2_text_document
GPT2_ARGS=&#60;same as those in <a href="#gpt-2-pretraining">GPT-2 pretraining</a> above&#62;
OUTPUT_ARGS=&#60;same as those in <a href="#bert-pretraining">BERT pretraining</a> above&#62;
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./pretrain_gpt2.py \
$GPT2_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--model-parallel-size $MP_SIZE \
--DDP-impl torch
</pre>
<a id="evaluation-and-tasks"></a>
# Evaluation and Tasks
We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the `--finetune` flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the `--finetune` flag before continuing, otherwise the training will start again from the beginning.
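As a sketch (reusing the argument groups from the BERT script above, with a hypothetical corpus name), finetuning on a new preprocessed corpus might look like:
<pre>
python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --finetune \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path my-finetune-corpus_text_sentence
</pre>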
Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on a single GPU in downstream tasks. The following script accomplishes this.
<pre>
MODEL_PARALLEL_SIZE=2
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m
WORLD_SIZE=$MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
--model-type BERT \
--model-parallel-size $MODEL_PARALLEL_SIZE \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--load $CHECKPOINT_PATH
</pre>
Several downstream tasks are described for both GPT-2 and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.
<a id="gpt-2-text-generation"></a>
## GPT-2 Text Generation
`bash examples/generate_text.sh`
We generate text samples using largely the GPT-2 pretraining script. A few changes are needed: we must provide the path to the pretrained checkpoint, the length of the output samples, and whether to generate text unconditionally (`--num-samples` denotes how many samples to generate) or conditionally (pass `--sample-input-file <filename>`, where each line of the file is used as the conditioning text). There are also a few optional parameters to play with, e.g. `top-k`, `top-p`, or greedy (set top-k and top-p to 0) sampling.
<pre>
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
GPT2_ARGS=&#60;same as those in <a href="#gpt-2-pretraining">GPT-2 pretraining</a> above&#62;
MAX_OUTPUT_SEQUENCE_LENGTH=1024
TEMPERATURE=1.0
TOP_P=0.9
NUMBER_OF_SAMPLES=2
OUTPUT_FILE=samples.json
python tools/generate_samples_gpt2.py \
$GPT2_ARGS \
--load $CHECKPOINT_PATH \
--out-seq-length $MAX_OUTPUT_SEQUENCE_LENGTH \
--temperature $TEMPERATURE \
--genfile $OUTPUT_FILE \
--num-samples $NUMBER_OF_SAMPLES \
--top_p $TOP_P \
--recompute
</pre>
<a id="gpt-2-evaluation"></a>
## GPT-2 Evaluation
We include example scripts for GPT-2 WikiText perplexity evaluation and LAMBADA cloze accuracy.
<a id="wikitext-perplexity-evaluation"></a>
### WikiText Perplexity Evaluation
For even comparison with prior works, we evaluate perplexity on the word-level [WikiText-103 test dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), and appropriately compute perplexity given the change in tokens when using our subword tokenizer.
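Concretely, the evaluation code renormalizes the average loss per subword token by the subword-to-word token ratio before exponentiating; a minimal sketch of that adjustment (mirroring the evaluation logic in this release):
<pre>
import math

def adjusted_perplexity(avg_loss, num_tokenized_tokens, num_original_tokens):
    # The loss is averaged over subword tokens, but WikiText perplexity is
    # conventionally reported per word-level token, hence the ratio.
    token_ratio = (num_tokenized_tokens - 1) / (num_original_tokens - 1)
    return math.exp(min(20, avg_loss * token_ratio))
</pre>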
We use the following command to run WikiText-103 evaluation on a 345M parameter model.
<pre>
TASK="WIKITEXT103"
VALID_DATA=&#60;wikitext path&#62;.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--fp16 \
--vocab-file $VOCAB_FILE"
python tasks/main.py \
--task $TASK \
$COMMON_TASK_ARGS \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--merge-file $MERGE_FILE \
--load $CHECKPOINT_PATH \
--batch-size 8 \
--checkpoint-activations \
--log-interval 10 \
--no-load-optim \
--no-load-rng
</pre>
<a id="lambada-cloze-accuracy"></a>
### LAMBADA Cloze Accuracy
To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the [LAMBADA dataset](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).
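With strict matching, the final whitespace-delimited word is held out and encoded separately from the context so that whole-word prediction can be enforced; a minimal sketch (with `encode` standing in for the repository's tokenizer call):
<pre>
def split_cloze(text, encode):
    # The model must predict every subword token of the held-out last word
    # for the example to count as correct.
    last_word = text.split()[-1]
    context = text[:text.rfind(last_word)].strip()
    return encode(context), encode(' ' + last_word)
</pre>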
We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the `--strict-lambada` flag should be used to require whole word matching. Make sure that `lambada` is part of the file path.
<pre>
TASK="LAMBADA"
VALID_DATA=&#60;lambada path&#62;.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS=&#60;same as those in <a href="#wikitext-perplexity-evaluation">WikiText Perplexity Evaluation</a> above&#62;
python tasks/main.py \
--task $TASK \
$COMMON_TASK_ARGS \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--strict-lambada \
--merge-file $MERGE_FILE \
--load $CHECKPOINT_PATH \
--batch-size 8 \
--checkpoint-activations \
--log-interval 10 \
--no-load-optim \
--no-load-rng
</pre>
Further command line arguments are described in the source file [`main.py`](./tasks/main.py).
<a id="bert-task-evaluation"></a>
## BERT Task Evaluation
<a id="race-evaluation"></a>
### RACE Evaluation
The following script finetunes the BERT model for evaluation on the [RACE dataset](http://www.cs.cmu.edu/~glai1/data/race/). The `TRAIN_DATA` and `VALID_DATA` directories contain the RACE dataset as separate `.txt` files.
<pre>
TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
COMMON_TASK_ARGS="--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--fp16 \
--vocab-file $VOCAB_FILE"
COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
--valid-data $VALID_DATA \
--pretrained-checkpoint $PRETRAINED_CHECKPOINT \
--checkpoint-activations \
--save-interval 10000 \
--save $CHECKPOINT_PATH \
--log-interval 100 \
--eval-interval 1000 \
--eval-iters 10 \
--weight-decay 1.0e-1"
python tasks/main.py \
--task RACE \
$COMMON_TASK_ARGS \
$COMMON_TASK_ARGS_EXT \
--tokenizer-type BertWordPieceLowerCase \
--epochs 3 \
--batch-size 4 \
--lr 1.0e-5 \
--warmup 0.06
</pre>
<a id="mnli-evaluation"></a>
### MNLI Evaluation
The following script finetunes the BERT model for evaluation with the [MultiNLI sentence pair corpus](https://www.nyu.edu/projects/bowman/multinli/). Because the matching tasks are quite similar, the script can be quickly tweaked to work with the [Quora Question Pairs](https://www.kaggle.com/quora/question-pairs-dataset) (QQP) dataset as well.
<pre>
TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
COMMON_TASK_ARGS=&#60;same as those in <a href="#race-evaluation">RACE Evaluation</a> above&#62;
COMMON_TASK_ARGS_EXT=&#60;same as those in <a href="#race-evaluation">RACE Evaluation</a> above&#62;
python tasks/main.py \
--task MNLI \
$COMMON_TASK_ARGS \
$COMMON_TASK_ARGS_EXT \
--tokenizer-type BertWordPieceLowerCase \
--epochs 5 \
--batch-size 8 \
--lr 5.0e-5 \
--warmup 0.065
</pre>
<a id="datasets"></a>
# Datasets
We do not host any datasets for GPT-2 or BERT training; however, we detail their collection so that our results may be reproduced.
<a id="collecting-wikipedia-training-data"></a>
## Collecting Wikipedia Training Data
We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download [the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), extract the text with [WikiExtractor.py](https://github.com/attardi/wikiextractor), and then apply any necessary cleanup to convert it into plain text."
We recommend using the `--json` argument when using WikiExtractor, which will dump the Wikipedia data into loose json format (one json per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this json dataset with nltk punctuation standardization. For BERT training, add newlines between sentences during data preprocessing. This is done with the `--split-sentences` flag in `preprocess_data.py` as described [above](#data-preprocessing). (Note that if you'd like to use Wikipedia data for GPT-2 training you should still clean it with nltk/spacy/ftfy, but do not split it into newline separated sentences.)
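A typical extraction invocation, assuming WikiExtractor's standard command line (the output directory name is arbitrary), is:
<pre>
python WikiExtractor.py --json enwiki-latest-pages-articles.xml.bz2 -o extracted
</pre>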
<a id="collecting-gpt-2-webtext-data"></a>
## Collecting GPT-2 Webtext Data
We utilize the publicly available [OpenWebText](https://github.com/eukaryote31/openwebtext) library from [jcpeterson](https://github.com/jcpeterson/openwebtext) and [eukaryote31's](https://github.com/eukaryote31/openwebtext) work to download URLs. We then filtered, cleaned, and deduplicated all downloaded content according to the procedure described in our [openwebtext](./tools/openwebtext) directory. For Reddit URLs corresponding to content up to October 2018 we arrived at approximately 37GB of content.
import re


def ptb_detokenizer(string):
    string = string.replace(" '", "'")
    string = string.replace(" \n", "\n")
    string = string.replace("\n ", "\n")
    string = string.replace(" n't", "n't")
    string = string.replace(" N ", "1 ")
    string = string.replace("$ 1", "$1")
    string = string.replace("# 1", "#1")
    return string


def wikitext_detokenizer(string):
    # contractions
    string = string.replace("s '", "s'")
    string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string)
    # number separators
    string = string.replace(" @-@ ", "-")
    string = string.replace(" @,@ ", ",")
    string = string.replace(" @.@ ", ".")
    # punctuation
    string = string.replace(" : ", ": ")
    string = string.replace(" ; ", "; ")
    string = string.replace(" . ", ". ")
    string = string.replace(" ! ", "! ")
    string = string.replace(" ? ", "? ")
    string = string.replace(" , ", ", ")
    # double brackets
    string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string)
    string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string)
    string = re.sub(r"{\s*([^}]*?)\s*}", r"{\1}", string)
    string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string)
    string = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", string)
    # miscellaneous
    string = string.replace("= = = =", "====")
    string = string.replace("= = =", "===")
    string = string.replace("= =", "==")
    string = string.replace(" " + chr(176) + " ", chr(176))
    string = string.replace(" \n", "\n")
    string = string.replace("\n ", "\n")
    string = string.replace(" N ", " 1 ")
    string = string.replace(" 's", "'s")
    return string


def lambada_detokenizer(string):
    return string


def get_detokenizer(path):
    for key in DETOKENIZERS.keys():
        if key in path:
            print(key)
            return DETOKENIZERS[key]


DETOKENIZERS = {
    'ptb': ptb_detokenizer,
    'wikitext': wikitext_detokenizer,
    'lambada': lambada_detokenizer,
}
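
# Example (hypothetical path): get_detokenizer('data/wikitext-103/wiki.test.tokens')
# matches the 'wikitext' key above and returns wikitext_detokenizer, which is
# applied to the raw file contents before tokenization.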
# ===========
# base images
# ===========
FROM nvcr.io/nvidia/pytorch:19.09-py3
# ===============
# system packages
# ===============
RUN apt-get update && apt-get install -y \
bash-completion \
emacs \
git \
graphviz \
htop \
libopenexr-dev \
rsync \
wget \
&& rm -rf /var/lib/apt/lists/*
# ============
# pip packages
# ============
RUN pip install --upgrade pip && \
pip install --upgrade setuptools
COPY requirements.txt /tmp/
RUN pip install --upgrade --ignore-installed -r /tmp/requirements.txt
boto3
google-cloud-language
inflect
nltk
numpy
pandas
requests
sentencepiece
tensorflow
tqdm
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pretrain BERT"""
import os
import json
import math
import random
import numpy as np
import torch
from arguments import get_args
from configure_data import configure_data
from megatron.fp16 import FP16_Module
from megatron.fp16 import FP16_Optimizer
from megatron.learning_rates import AnnealingLR
from megatron.model import GPT2Model
from megatron.model import DistributedDataParallel as DDP
from megatron import mpu
from apex.optimizers import FusedAdam as Adam
from megatron.utils import Timers
from megatron.utils import load_checkpoint
from megatron.utils import report_memory
from megatron.utils import print_params_min_max_norm
from megatron.utils import print_rank_0
from megatron.data_utils import make_tokenizer
from detokenizer import *


def get_model(args):
    """Build the model."""

    print_rank_0('building GPT2 model ...')
    model = GPT2Model(num_layers=args.num_layers,
                      vocab_size=args.vocab_size,
                      hidden_size=args.hidden_size,
                      num_attention_heads=args.num_attention_heads,
                      embedding_dropout_prob=args.hidden_dropout,
                      attention_dropout_prob=args.attention_dropout,
                      output_dropout_prob=args.hidden_dropout,
                      max_sequence_length=args.max_position_embeddings,
                      checkpoint_activations=args.checkpoint_activations,
                      checkpoint_num_layers=args.checkpoint_num_layers,
                      parallel_output=not args.cloze_eval)

    print_rank_0(' > number of parameters: {}'.format(
        sum([p.nelement() for p in model.parameters()])))

    # GPU allocation.
    model.cuda(torch.cuda.current_device())

    # Fp16 conversion.
    if args.fp16:
        model = FP16_Module(model)

    # Wrap model for distributed training.
    model = DDP(model)

    return model


def setup_model(args):
    """Setup model and optimizer."""

    model = get_model(args)

    if args.load is not None:
        _ = load_checkpoint(model, None, None, args)

    return model


def get_masks_and_position_ids(data,
                               eod_token,
                               reset_position_ids,
                               reset_attention_mask):
    # Extract batch size and sequence length.
    batch_size, seq_length = data.size()

    # Attention mask (lower triangular).
    if reset_attention_mask:
        att_mask_batch = batch_size
    else:
        att_mask_batch = 1
    attention_mask = torch.tril(torch.ones(
        (att_mask_batch, seq_length, seq_length), device=data.device)).view(
            att_mask_batch, 1, seq_length, seq_length)

    # Loss mask.
    loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
    loss_mask[data == eod_token] = 0.0

    # Position ids.
    position_ids = torch.arange(seq_length, dtype=torch.long,
                                device=data.device)
    position_ids = position_ids.unsqueeze(0).expand_as(data)
    # We need to clone as the ids will be modified based on batch index.
    if reset_position_ids:
        position_ids = position_ids.clone()

    if reset_position_ids or reset_attention_mask:
        # Loop through the batches:
        for b in range(batch_size):

            # Find indices where EOD token is.
            eod_index = position_ids[b, data[b] == eod_token]
            # Detach indices from positions if going to modify positions.
            if reset_position_ids:
                eod_index = eod_index.clone()

            # Loop through EOD indices:
            prev_index = 0
            for j in range(eod_index.size()[0]):
                i = eod_index[j]
                # Mask attention loss.
                if reset_attention_mask:
                    attention_mask[b, 0, (i+1):, :(i+1)] = 0
                # Reset positions.
                if reset_position_ids:
                    position_ids[b, (i+1):] -= (i + 1 - prev_index)
                    prev_index = i + 1

    return attention_mask, loss_mask, position_ids


def get_batch(data_iterator, args, timers):
    ''' get_batch subdivides the source data into chunks of
    length args.seq_length. If source is equal to the example
    output of the data loading example, with a seq_length limit
    of 2, we'd get the following two Variables for i = 0:
    ┌ a g m s ┐ ┌ b h n t ┐
    └ b h n t ┘ └ c i o u ┘
    Note that despite the name of the function, the subdivison of data is not
    done along the batch dimension (i.e. dimension 1), since that was handled
    by the data loader. The chunks are along dimension 0, corresponding
    to the seq_len dimension in the LSTM. A Variable representing an appropriate
    shard reset mask of the same dimensions is also returned.
    '''
    # Items and their type.
    keys = ['text', 'pad_mask']
    datatype = torch.int64

    # Broadcast data.
    timers('data loader').start()
    if data_iterator is not None:
        data = next(data_iterator)
    else:
        data = None
    timers('data loader').stop()
    data_b = mpu.broadcast_data(keys, data, datatype)

    # Unpack.
    tokens_ = data_b['text'].long()
    lm_labels = tokens_[:, 1:].contiguous()
    tokens = tokens_[:, :-1].contiguous()
    padding_mask = data_b['pad_mask'].byte()

    # Get the masks and position ids.
    attention_mask, loss_mask, position_ids = get_masks_and_position_ids(
        tokens,
        args.eod_token,
        args.reset_position_ids,
        args.reset_attention_mask)

    # Convert the attention mask to half precision if needed.
    if args.fp16:
        attention_mask = attention_mask.half()

    return tokens, lm_labels, attention_mask, position_ids, padding_mask


def forward_step(data_iterator, model, args, timers):
    """Forward step."""

    # Get the batch.
    timers('batch generator').start()
    batch = get_batch(data_iterator, args, timers)
    if batch is None:
        return None
    tokens, lm_labels, attention_mask, position_ids, loss_mask = batch
    timers('batch generator').stop()

    # Forward model.
    if args.eval_hf:
        output, _ = model(tokens)
    else:
        output = model(tokens, position_ids, attention_mask)

    if not args.cloze_eval:
        # losses = torch.nn.CrossEntropyLoss(reduce=False)(
        losses = mpu.vocab_parallel_cross_entropy(
            output.contiguous().float(), lm_labels.contiguous())
        loss_mask = loss_mask.contiguous()
        loss_mask = loss_mask.view(-1)
        lm_loss = torch.sum(losses.view(-1) * loss_mask.float())
    else:
        outputs = torch.argmax(output, -1)
        correct = (outputs == lm_labels).float()
        correct[(1 - loss_mask).bool()] = 1
        correct = correct.prod(-1)
        lm_loss = correct.sum()
        # loss_mask = loss_mask.contiguous().view(-1).float()
        # lm_loss = torch.sum(acc * loss_mask)

    return lm_loss


def evaluate(data_loader, model, args, timers, num_iterations=None):
    """Evaluation."""

    # Turn on evaluation mode which disables dropout.
    model.eval()

    total_lm_loss = 0

    if num_iterations is not None:
        max_iters = num_iterations
    else:
        if mpu.get_model_parallel_rank() == 0:
            max_iters_gpu = torch.cuda.LongTensor([len(data_loader)])
        else:
            max_iters_gpu = torch.cuda.LongTensor([0])
        torch.distributed.broadcast(max_iters_gpu,
                                    mpu.get_model_parallel_src_rank(),
                                    group=mpu.get_model_parallel_group())
        max_iters = max_iters_gpu[0].item()
        print_rank_0('global rank: {} | max iters: {}'.format(
            torch.distributed.get_rank(), max_iters))

    if data_loader is not None:
        data_iterator = iter(data_loader)
    else:
        data_iterator = None

    with torch.no_grad():
        iteration = 0
        while iteration < max_iters:
            if iteration % args.log_interval == 0:
                print_rank_0('global rank: {} | iteration: {}'.format(
                    torch.distributed.get_rank(), iteration))
            # Forward evaluation.
            lm_loss = forward_step(data_iterator, model, args, timers)
            if lm_loss is None:
                break
            # Reduce across processes.
            if isinstance(model, DDP):
                torch.distributed.all_reduce(lm_loss.data)
                if args.cloze_eval:
                    lm_loss.data = lm_loss.data / args.world_size
                else:
                    lm_loss.data = lm_loss.data / args.model_parallel_size
            if not args.cloze_eval:
                total_lm_loss += lm_loss.data.detach().float().item() / \
                    (args.num_tokenized_tokens - 1)
            else:
                total_lm_loss += lm_loss.data.detach().float().item()
            iteration += 1

    # Move model back to the train mode.
    model.train()

    return total_lm_loss


def evaluate_and_print_results(prefix, data_iterator, model,
                               args, timers, num_iterations=None):
    """Helper function to evaluate and dump results on screen."""

    if not args.cloze_eval:
        lm_loss = evaluate(data_iterator, model, args, timers, num_iterations)
        val_loss = lm_loss
        ppl = math.exp(min(20, val_loss))
        token_ratio = (args.num_tokenized_tokens - 1) / \
            (args.num_original_tokens - 1)
        adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
        print_rank_0('-' * 100)
        string = ' validation results on {} | '.format(prefix)
        string += 'avg loss: {:.4E} | '.format(val_loss)
        string += 'ppl: {:.4E} | '.format(ppl)
        string += 'adjusted ppl: {:.4E} | '.format(adjusted_ppl)
        string += 'token ratio: {} |'.format(token_ratio)
        length = len(string) + 1
        print_rank_0('-' * length)
        print_rank_0(string)
        print_rank_0('-' * length)
        return val_loss
    else:
        num_correct = evaluate(data_iterator, model, args, timers,
                               num_iterations)
        acc = num_correct / args.num_examples
        print_rank_0('-' * 100)
        string = ' validation results on {} | '.format(prefix)
        string += 'number correct: {:.4E} | '.format(num_correct)
        string += 'total examples: {:.4E} | '.format(args.num_examples)
        string += 'avg accuracy: {:.4E}'.format(acc)
        length = len(string) + 1
        print_rank_0('-' * length)
        print_rank_0(string)
        print_rank_0('-' * length)
        return acc


def initialize_distributed(args):
    """Initialize torch.distributed."""

    # Manually set the device ids.
    device = args.rank % torch.cuda.device_count()
    if args.local_rank is not None:
        device = args.local_rank
    torch.cuda.set_device(device)
    # Call the init process.
    init_method = 'tcp://'
    master_ip = os.getenv('MASTER_ADDR', 'localhost')
    master_port = os.getenv('MASTER_PORT', '6000')
    init_method += master_ip + ':' + master_port
    torch.distributed.init_process_group(
        backend=args.distributed_backend,
        world_size=args.world_size, rank=args.rank,
        init_method=init_method)

    # Set the model-parallel / data-parallel communicators.
    mpu.initialize_model_parallel(args.model_parallel_size)


def set_random_seed(seed):
    """Set random seed for reproducibility."""

    if seed is not None and seed > 0:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        mpu.model_parallel_cuda_manual_seed(seed)


class LM_Eval_Dataset(torch.utils.data.Dataset):

    def __init__(self, tokens, seq_len, pad_idx, overalapping_eval=None,
                 **kwargs):
        self.tokens = tokens
        self.seq_len = seq_len
        self.pad_idx = pad_idx
        self.overalapping_eval = overalapping_eval
        if self.overalapping_eval is None:
            self.overalapping_eval = self.seq_len
        self.overalapping_eval = max(1, self.overalapping_eval)
        self.total_targets = len(self.tokens) - 1
        # remove first sequence tokens
        targets = max(self.total_targets - self.overalapping_eval, 0)
        self.total_sequences = max(
            math.ceil(targets / self.overalapping_eval) + 1, 1)

    def __len__(self):
        return self.total_sequences

    def __getitem__(self, idx):
        start_idx = idx * self.overalapping_eval
        end_idx = start_idx + self.seq_len
        tokens = self.tokens[start_idx:end_idx + 1]
        num_tokens = len(tokens)
        pad_mask = [1] * num_tokens
        if num_tokens < self.seq_len + 1:
            num_pad = (self.seq_len + 1 - num_tokens)
            pad_mask += [0] * (num_pad)
            tokens += [self.pad_idx] * num_pad
        pad_mask = np.array(pad_mask[1:])
        if self.overalapping_eval != self.seq_len and idx != 0:
            pad_mask[:-self.overalapping_eval] *= 0
        return {'text': np.array(tokens), 'pad_mask': pad_mask}
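
# Note: when overalapping_eval is smaller than seq_len, consecutive windows
# overlap; for every window after the first, the pad mask zeroes all but the
# last overalapping_eval targets, so each token is scored exactly once.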


class Lambada_Eval_Dataset(torch.utils.data.Dataset):

    def __init__(self, path, tokenizer, seq_len, strict=False, **kwargs):
        self.seq_len = seq_len
        self.pad_idx = tokenizer.get_command('pad').Id
        self.tokenizer = tokenizer
        self.strict = strict

        self.tokens = []
        self.labels = []
        with open(path, 'r') as f:
            for line in f.readlines():
                text = json.loads(line)['text']
                tokens, labels = self.get_tokens(text)
                self.tokens.append(tokens)
                self.labels.append(labels)

    def get_tokens(self, text):
        if not self.strict:
            tokens = self.tokenizer.EncodeAsIds(text).tokenization
            return tokens[:-1], [tokens[-1]]
        last_token = text.split()[-1]
        start_idx = text.rfind(last_token)
        beginning_tokens = self.tokenizer.EncodeAsIds(
            text[:start_idx].strip()).tokenization
        last_token = self.tokenizer.EncodeAsIds(' ' + last_token).tokenization
        return beginning_tokens, last_token

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        tokens = self.tokens[idx]
        num_tokens = len(tokens)
        pad_mask = [0] * num_tokens
        labels = self.labels[idx]
        pad_mask += [1] * len(labels)
        tokens = tokens + labels
        num_tokens = len(tokens)
        if num_tokens < self.seq_len + 1:
            num_pad = (self.seq_len + 1 - num_tokens)
            pad_mask += [0] * (num_pad)
            tokens += [self.pad_idx] * num_pad
        pad_mask = np.array(pad_mask[1:])
        return {'text': np.array(tokens), 'pad_mask': pad_mask}


def get_tokenizer(args):
    tokenizer_args = {
        'tokenizer_type': args.tokenizer_type,
        'corpus': None,
        'model_path': args.tokenizer_path,
        'vocab_size': args.vocab_size,
        'model_type': args.tokenizer_model_type,
        'cache_dir': args.cache_dir}
    return make_tokenizer(**tokenizer_args)


def get_eval_data(args):
    val_dataloader = None
    if mpu.get_model_parallel_rank() == 0:
        eval_batch_size = args.eval_batch_size
        eval_batch_size = args.batch_size if eval_batch_size is None \
            else eval_batch_size
        seq_len = args.seq_length
        valid_data = args.valid_data
        valid_data = valid_data[0] if isinstance(valid_data, list) \
            else valid_data

        tokenizer = get_tokenizer(args)

        if not args.cloze_eval:
            with open(valid_data, "rb") as reader:
                entire_data = reader.read().decode('utf-8')
            num_original_tokens = len(entire_data.strip().split(" "))
            entire_data = get_detokenizer(valid_data)(entire_data)
            tokenized_data = tokenizer.EncodeAsIds(entire_data).tokenization
            num_tokenized_tokens = len(tokenized_data)
            string = 'Original Tokens: %d, Detokenized tokens: %d' % (
                num_tokenized_tokens, num_original_tokens)
            print_rank_0(string)

            eod_token = tokenizer.get_command('pad').Id
            val_dataset = LM_Eval_Dataset(tokenized_data, seq_len, eod_token,
                                          args.overlapping_eval)
        else:
            val_dataset = Lambada_Eval_Dataset(valid_data, tokenizer, seq_len,
                                               args.strict_lambada)
            num_tokenized_tokens = 0
            num_original_tokens = 0

        val_dataloader = torch.utils.data.DataLoader(
            val_dataset, batch_size=eval_batch_size, drop_last=False)

        before = tokenizer.num_tokens
        after = before
        multiple = args.make_vocab_size_divisible_by * \
            mpu.get_model_parallel_world_size()
        while (after % multiple) != 0:
            after += 1
        print_rank_0('> padded vocab (size: {}) with {} dummy tokens '
                     '(new size: {})'.format(before, after - before, after))

        eod_token = tokenizer.get_command('pad').Id
        num_examples = len(val_dataset)
        token_counts = torch.cuda.LongTensor([after, eod_token, num_examples,
                                              num_original_tokens,
                                              num_tokenized_tokens])
    else:
        token_counts = torch.cuda.LongTensor([0, 0, 0, 0, 0])

    torch.distributed.broadcast(token_counts,
                                mpu.get_model_parallel_src_rank(),
                                group=mpu.get_model_parallel_group())
    args.vocab_size = token_counts[0].item()
    args.eod_token = token_counts[1].item()
    args.num_examples = token_counts[2].item()
    args.num_original_tokens = token_counts[3].item()
    args.num_tokenized_tokens = token_counts[4].item()
    print('global rank: {} | vocab size: {} | eod token: {} | '
          'num_examples: {} | num_original_tokens: {} | '
          'num_tokenized_tokens: {}'.format(
              torch.distributed.get_rank(), args.vocab_size,
              args.eod_token, args.num_examples, args.num_original_tokens,
              args.num_tokenized_tokens))
    return val_dataloader
def main():
"""Main training program."""
print('Evaluate GPT2 model')
# Disable CuDNN.
torch.backends.cudnn.enabled = False
# Timer.
timers = Timers()
# Arguments.
args = get_args()
# Pytorch distributed.
initialize_distributed(args)
# Random seeds for reproducibility.
set_random_seed(args.seed)
# Data stuff.
eval_data = get_eval_data(args)
# Model, optimizer, and learning rate.
if args.eval_hf:
from pytorch_pretrained_bert import GPT2LMHeadModel
from pytorch_pretrained_bert import GPT2Model as HFGPT2Model
if args.num_layers == 24:
model_path = args.load
#model_path = '/home/universal-lm-data.cosmos549/repos/gpt2_mp/models/345M'
hfmodel = HFGPT2Model.from_pretrained(model_path, cache_dir='gpt2_weights', from_tf=True).cuda()
model = GPT2LMHeadModel(hfmodel.config)
model.transformer.load_state_dict(hfmodel.state_dict())
model.cuda()
else:
model = GPT2LMHeadModel.from_pretrained('gpt2', cache_dir='gpt2_weights').cuda()
else:
if args.load_openai:
from megatron.utils import move_weights
model_path = args.load
args.load = None
model = setup_model(args)
from pytorch_pretrained_bert import GPT2LMHeadModel
from pytorch_pretrained_bert import GPT2Model as HFGPT2Model
model_path = 'gpt2'
from_tf = False
print('loading openai weights')
model.cpu()
if args.num_layers == 24:
#model_path = '/home/universal-lm-data.cosmos549/repos/gpt2_mp/models/345M'
hfmodel = HFGPT2Model.from_pretrained(model_path, cache_dir='gpt2_weights', from_tf=True)
gpt2model = GPT2LMHeadModel(hfmodel.config)
gpt2model.transformer.load_state_dict(hfmodel.state_dict())
else:
gpt2model = GPT2LMHeadModel.from_pretrained('gpt2', cache_dir='gpt2_weights')
model2fill = model
while isinstance(model2fill, (DDP, FP16_Module)):
model2fill = model2fill.module
move_weights(model2fill, gpt2model)
model.cuda()
else:
model = setup_model(args)
# Run on test data.
prefix = "wiki" #os.path.basename(args.valid_data)
evaluate_and_print_results(prefix, eval_data,
model, args, timers)
if __name__ == "__main__":
main()
#!/bin/bash
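# Zero-shot evaluation of a GPT-2 345M checkpoint on the LAMBADA
# last-word prediction task (strict formulation) using 8 GPUs.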
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
TASK="LAMBADA"
VALID_DATA=<lambada path>
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT=checkpoints/gpt2_345m
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
--task $TASK \
--valid-data $VALID_DATA \
--tokenizer-type GPT2BPETokenizer \
--strict-lambada \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--load $CHECKPOINT \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 8 \
--checkpoint-activations \
--seq-length 1024 \
--max-position-embeddings 1024 \
--log-interval 10 \
--fp16 \
--no-load-optim \
--no-load-rng
#!/bin/bash
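# Finetune a pretrained BERT 345M checkpoint on the GLUE MNLI task
# using 8 GPUs.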
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
--task MNLI \
--seed 1234 \
--train-data $TRAIN_DATA \
--valid-data $VALID_DATA \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--epochs 5 \
--pretrained-checkpoint $PRETRAINED_CHECKPOINT \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 8 \
--checkpoint-activations \
--lr 5.0e-5 \
--lr-decay-style linear \
--warmup 0.065 \
--seq-length 512 \
--max-position-embeddings 512 \
--save-interval 500000 \
--save $CHECKPOINT_PATH \
--log-interval 10 \
--eval-interval 100 \
--eval-iters 50 \
--weight-decay 1.0e-1 \
--fp16
#!/bin/bash
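# Finetune a pretrained BERT 345M checkpoint on the RACE reading
# comprehension dataset using 8 GPUs.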
WORLD_SIZE=8
DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
--task RACE \
--seed 1234 \
--train-data $TRAIN_DATA \
--valid-data $VALID_DATA \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--epochs 3 \
--pretrained-checkpoint $PRETRAINED_CHECKPOINT \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 4 \
--checkpoint-activations \
--lr 1.0e-5 \
--lr-decay-style linear \
--warmup 0.06 \
--seq-length 512 \
--max-position-embeddings 512 \
--save-interval 100000 \
--save $CHECKPOINT_PATH \
--log-interval 10 \
--eval-interval 100 \
--eval-iters 50 \
--weight-decay 1.0e-1 \
--clip-grad 1.0 \
--hidden-dropout 0.1 \
--attention-dropout 0.1 \
--fp16
#!/bin/bash
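# Generate unconditional samples from a GPT-2 345M checkpoint with
# top-p (nucleus) sampling.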
CHECKPOINT_PATH=checkpoints/gpt2_345m
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
python tools/generate_samples_gpt2.py \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--load $CHECKPOINT_PATH \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--batch-size 2 \
--seq-length 1024 \
--out-seq-length 1024 \
--temperature 1.0 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--genfile unconditional_samples.json \
--num-samples 2 \
--top_p 0.9 \
--recompute
#!/bin/bash
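# Merge a 2-way model-parallel BERT checkpoint into a single
# partition that can be loaded with --model-parallel-size 1.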
MODEL_PARALLEL_SIZE=2
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m
WORLD_SIZE=$MODEL_PARALLEL_SIZE python tools/merge_mp_partitions.py \
--model-type BERT \
--model-parallel-size $MODEL_PARALLEL_SIZE \
--tokenizer-type BertWordPieceLowerCase \
--vocab-file $VOCAB_FILE \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 512 \
--max-position-embeddings 512 \
--load $CHECKPOINT_PATH
#!/bin/bash
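# Pretrain BERT 345M (24 layers, hidden size 1024) on a single GPU.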
RANK=0
WORLD_SIZE=1
DATA_PATH=<Specify path and file prefix>_text_sentence
CHECKPOINT_PATH=<Specify path>
python pretrain_bert.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 4 \
--seq-length 512 \
--max-position-embeddings 512 \
--train-iters 2000000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file bert-vocab.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.0001 \
--min-lr 0.00001 \
--lr-decay-style linear \
--lr-decay-iters 990000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16
#!/bin/bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>_text_sentence
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_bert.py \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 4 \
--seq-length 512 \
--max-position-embeddings 512 \
--train-iters 1000000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file bert-vocab.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.0001 \
--lr-decay-style linear \
--min-lr 1.0e-5 \
--lr-decay-iters 990000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16
#!/bin/bash
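# Pretrain GPT-2 345M (24 layers, hidden size 1024) on a single GPU.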
RANK=0
WORLD_SIZE=1
DATA_PATH=<Specify path and file prefix>_text_document
CHECKPOINT_PATH=<Specify path>
python pretrain_gpt2.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 8 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--min-lr 1.0e-5 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--checkpoint-activations \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16
#!/bin/bash
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=<Specify path and file prefix>_text_document
CHECKPOINT_PATH=<Specify path>
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt2.py \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 8 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file gpt2-vocab.json \
--merge-file gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--checkpoint-activations \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16
set +x
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import numpy as np
import torch
from torch.multiprocessing import Lock
from torch.utils.data import Dataset
from megatron import mpu
from megatron.data_utils.samplers import DistributedBatchSampler
from megatron.data_utils.tokenization_gpt2 import GPT2Tokenizer
def make_gpt2_dataloaders(args):
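"""Build train/valid/test dataloaders over directories of
pre-tokenized .npy shards, using a sequential sampler wrapped in a
DistributedBatchSampler for data parallelism."""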
# Input parameters.
input_data_sizes_file = args.input_data_sizes_file
seq_length = args.seq_length
initial_seed = args.seed
# Data parallel arguments.
world_size = mpu.get_data_parallel_world_size()
rank = mpu.get_data_parallel_rank()
global_batch_size = args.batch_size * world_size
num_workers = args.num_workers
def make_data_loader_(data_path):
# Build the dataset.
dataset = GPT2Dataset(data_path, input_data_sizes_file,
seq_length, initial_seed)
# Use a simple sampler with distributed batch sampler.
sampler = torch.utils.data.SequentialSampler(dataset)
batch_sampler = DistributedBatchSampler(sampler=sampler,
batch_size=global_batch_size,
drop_last=True,
rank=rank,
world_size=world_size)
# Torch dataloader.
return torch.utils.data.DataLoader(dataset,
batch_sampler=batch_sampler,
num_workers=num_workers,
pin_memory=True)
train = make_data_loader_(args.train_data)
valid = make_data_loader_(args.valid_data)
test = make_data_loader_(args.test_data)
args.do_train = False
args.do_valid = False
args.do_test = False
if train is not None:
args.do_train = True
if valid is not None:
args.do_valid = True
if test is not None:
args.do_test = True
# Tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', cache_dir=args.cache_dir)
eod_token = tokenizer.encoder['<|endoftext|>']
num_tokens = eod_token + 1
return (train, valid, test), num_tokens, eod_token
class GPT2Dataset(Dataset):
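"""Dataset over pre-tokenized .npy shards. Shard order is reshuffled
every epoch, and shards are loaded into memory lazily, one at a time,
on first access."""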
def __init__(self, data_path, sizes_filename, seq_length,
initial_seed, max_epochs=100):
# Input parameters.
self.data_path = data_path
self.sizes_filename = sizes_filename
self.seq_length = seq_length
self.initial_seed = initial_seed
self.max_epochs = max_epochs
# Lock for building the dataset.
self.lock = Lock()
# Shard stuff.
# Dictionary from shard name to its size (number of elements).
self.master_shard_size_dict = None
# Dictionary from shard name to modified size so it is
# divisible by self.seq_length.
self.shard_size_dict = None
# Long array (self.max_epochs * num-shards) populated
# randomly with shard names.
self.shards_name = None
# Start index of the data for a shard.
self.shards_start_index = None
self.build_shard_mappings_()
self.data_length = self.shards_start_index[-1]
# Data.
self.shards_data = [None]*self.shards_name.size
self.shards_sample_index = [None]*self.shards_name.size
def __len__(self):
return self.data_length
def __getitem__(self, idx):
# Find which shard we need.
shard_index = np.searchsorted(self.shards_start_index,
idx, side='right') - 1
# data index in the shard.
data_idx = idx - self.shards_start_index[shard_index]
# Load the shard if it is not in memory.
#self.lock.acquire()
if self.shards_data[shard_index] is None:
print('global rank {} is building data for shard index {} ...'.
format(torch.distributed.get_rank(), shard_index))
self.build_dataset_(shard_index)
#assert self.shards_data[shard_index] is not None
#self.lock.release()
# Start index.
start_index = self.shards_sample_index[shard_index][data_idx]
# Add one for label shift.
end_index = start_index + self.seq_length + 1
data = self.shards_data[shard_index][start_index:end_index]
return {'text': np.array(data, dtype=np.int64)}
def build_dataset_(self, shard_index):
# Garbage collect so we don't use a lot of memory.
# Leave the last one in case other threads have not caught up yet.
#for i in range(shard_index - 1):
for i in range(shard_index):
self.shards_data[i] = None
self.shards_sample_index[i] = None
# Read the shard.
filename = os.path.join(self.data_path, self.shards_name[shard_index])
print('loading {}'.format(filename))
data = np.load(filename, allow_pickle=True)
# Shuffle the data
rng = np.random.RandomState(self.initial_seed + shard_index)
rng.shuffle(data)
# Flatten.
data = np.hstack(data)
size = (data.shape[0] - 1) // self.seq_length
last_index = size * self.seq_length + 1
data = data[0:last_index]
self.shards_data[shard_index] = data
indices = np.arange(size) * self.seq_length
rng.shuffle(indices)
self.shards_sample_index[shard_index] = indices
def build_shard_mappings_(self):
# Load the sizes file.
sizes_filename = os.path.join(self.data_path, self.sizes_filename)
if torch.distributed.get_rank() == 0:
print(' > loading sizes from {}'.format(sizes_filename))
with open(sizes_filename, 'r') as f:
self.master_shard_size_dict = json.load(f)
if torch.distributed.get_rank() == 0:
print(' found {} shards'.format(len(self.master_shard_size_dict)))
# Adjust sizes to be a multiple of seq_length.
self.shard_size_dict = self.master_shard_size_dict.copy()
total_samples = 0
for shard in self.shard_size_dict:
size = self.shard_size_dict[shard]
size = ((size - 1) // self.seq_length) * self.seq_length
total_samples += size // self.seq_length
self.shard_size_dict[shard] = size
if torch.distributed.get_rank() == 0:
print(' found {} samples in the dataset'.format(total_samples))
# Build a list of shards.
shards_ = np.sort(np.array(list(self.shard_size_dict.keys())))
rng = np.random.RandomState(self.initial_seed)
self.shards_name = np.copy(shards_)
rng.shuffle(self.shards_name)
for i in range(1, self.max_epochs):
shards_c = np.copy(shards_)
rng.shuffle(shards_c)
self.shards_name = np.append(self.shards_name, shards_c)
# Build the global indexing.
self.shards_start_index = np.zeros(self.shards_name.size, dtype=np.int64)
self.shards_start_index[0] = 0
for i in range(1, self.shards_name.size):
shard = str(self.shards_name[i-1])
size = self.shard_size_dict[shard]
self.shards_start_index[i] = self.shards_start_index[i-1] + \
size // self.seq_length
'''
if __name__ == '__main__':
print('gpt2 data loader ...')
path = '/raid/mshoeybi/data/gpt2/adlr/reddit_all_ftfy_lg200/npys'
dataset = GPT2Dataset(path, 'sizes.txt', 1024, 1234, 100)
print('dataset contains {} samples'.format(dataset.data_length))
for i in range(len(dataset)):
if i % 512000 == 0:
print(i)
data = dataset[i]
'''
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
sys.path.append('..')
import torch
from megatron.data_utils.tokenization_gpt2 import GPT2Tokenizer
from .global_vars import get_args
from .global_vars import get_tokenizer
from .global_vars import get_tensorboard_writer
from .global_vars import get_adlr_autoresume
from .global_vars import get_timers
class Tokenizer:
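"""Thin wrapper around the pretrained GPT-2 BPE tokenizer that
appends an end-of-document token to every encoded document."""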
def __init__(self, cache_dir=None):
self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2',
cache_dir=cache_dir)
self.tokenizer.max_len = int(1e12)
self.eod_token = self.tokenizer.encoder['<|endoftext|>']
assert self.eod_token < 65535, 'vocab size will not fit in uint16'
print('> GPT2 tokenizer with {} vocab size and eod token {} ...'.format(
len(self.tokenizer.encoder), self.eod_token))
def tokenize_document(self, document):
tokens = self.tokenizer.encode(document)
tokens.append(self.eod_token)
return tokens
def print_rank_0(message):
"""If distributed is initialized print only on rank 0."""
if torch.distributed.is_initialized():
if torch.distributed.get_rank() == 0:
print(message, flush=True)
else:
print(message, flush=True)
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""argparser configuration"""
"""Megatron arguments."""
import argparse
import os
import torch
def add_model_config_args(parser):
"""Model arguments"""
def parse_args(extra_args_provider=None, defaults={}):
"""Parse all arguments."""
parser = argparse.ArgumentParser(description='Megatron-LM Arguments')
# Standard arguments.
parser = _add_network_size_args(parser)
parser = _add_regularization_args(parser)
parser = _add_training_args(parser)
parser = _add_initialization_args(parser)
parser = _add_learning_rate_args(parser)
parser = _add_checkpointing_args(parser)
parser = _add_mixed_precision_args(parser)
parser = _add_distributed_args(parser)
parser = _add_validation_args(parser)
parser = _add_data_args(parser)
parser = _add_autoresume_args(parser)
# Custom arguments.
if extra_args_provider is not None:
parser = extra_args_provider(parser)
# Parse.
args = parser.parse_args()
# Set input defaults.
for key in defaults:
# For default to be valid, it should not be provided in the
# arguments that are passed to the program. We check this by
# ensuring the arg is set to None.
assert getattr(args, key) is None, \
'defaults can only be overwritten for args with None values.'
setattr(args, key, defaults[key])
# Check required arguments.
required_args = ['num_layers', 'hidden_size', 'num_attention_heads',
'max_position_embeddings']
for req_arg in required_args:
_check_arg_is_not_none(args, req_arg)
# Distributed args.
args.rank = int(os.getenv('RANK', '0'))
args.world_size = int(os.getenv("WORLD_SIZE", '1'))
args.model_parallel_size = min(args.model_parallel_size, args.world_size)
if args.rank == 0:
print('using world size: {} and model-parallel size: {} '.format(
args.world_size, args.model_parallel_size))
group.add_argument('--pretrained-bert', action='store_true',
help='use a pretrained bert-large-uncased model instead'
'of initializing from scratch. See '
'--tokenizer-model-type to specify which pretrained '
'BERT model to use')
group.add_argument('--attention-dropout', type=float, default=0.1,
help='dropout probability for attention weights')
group.add_argument('--num-attention-heads', type=int, default=16,
help='num of transformer attention heads')
group.add_argument('--hidden-size', type=int, default=1024,
help='transformer hidden size')
group.add_argument('--intermediate-size', type=int, default=None,
help='transformer embedding dimension for FFN'
'set to 4*`--hidden-size` if it is None')
group.add_argument('--num-layers', type=int, default=24,
help='num decoder layers')
group.add_argument('--layernorm-epsilon', type=float, default=1e-5,
help='layer norm epsilon')
group.add_argument('--hidden-dropout', type=float, default=0.1,
help='dropout probability for hidden state transformer')
group.add_argument('--max-position-embeddings', type=int, default=512,
help='maximum number of position embeddings to use')
group.add_argument('--vocab-size', type=int, default=None,
help='vocab size to use for non-character-level '
'tokenization. This value will only be used when '
'creating a tokenizer')
group.add_argument('--deep-init', action='store_true',
help='initialize bert model similar to gpt2 model.'
'scales initialization of projection layers by a '
'factor of 1/sqrt(2N). Necessary to train bert '
'models larger than BERT-Large.')
# Fp16 loss scaling.
args.dynamic_loss_scale = False
if args.loss_scale is None:
args.dynamic_loss_scale = True
# Checks.
assert args.hidden_size % args.num_attention_heads == 0
if args.seq_length is not None:
assert args.max_position_embeddings >= args.seq_length
if args.lr is not None:
assert args.min_lr <= args.lr
if args.save is not None:
assert args.save_interval is not None
_print_args(args)
return args
def _print_args(args):
"""Print arguments."""
if args.rank == 0:
print('-------------------- arguments --------------------', flush=True)
str_list = []
for arg in vars(args):
dots = '.' * (32 - len(arg))
str_list.append(' {} {} {}'.format(arg, dots, getattr(args, arg)))
for arg in sorted(str_list, key=lambda x: x.lower()):
print(arg, flush=True)
print('---------------- end of arguments ----------------', flush=True)
def _check_arg_is_not_none(args, arg):
assert getattr(args, arg) is not None, '{} argument is None'.format(arg)
def _add_network_size_args(parser):
group = parser.add_argument_group(title='network size')
group.add_argument('--num-layers', type=int, default=None,
help='Number of transformer layers.')
group.add_argument('--hidden-size', type=int, default=None,
help='Transformer hidden size.')
group.add_argument('--num-attention-heads', type=int, default=None,
help='Number of transformer attention heads.')
group.add_argument('--max-position-embeddings', type=int, default=None,
help='Maximum number of position embeddings to use. '
'This is the size of position embedding.')
group.add_argument('--make-vocab-size-divisible-by', type=int, default=128,
help='Pad the vocab size to be divisible by this value. '
'This is added for computational efficiency reasons.')
group.add_argument('--layernorm-epsilon', type=float, default=1e-5,
help='Layer norm epsilon.')
group.add_argument('--apply-residual-connection-post-layernorm',
action='store_true',
help='If set, use original BERT residual connection '
'ordering.')
group.add_argument('--openai-gelu', action='store_true',
help='Use OpenAIs GeLU implementation. This option '
'should not be used unless for backward compatibility '
'reasons.')
return parser
def add_fp16_config_args(parser):
"""Mixed precision arguments."""
def _add_regularization_args(parser):
group = parser.add_argument_group(title='regularization')
group = parser.add_argument_group('fp16', 'fp16 configurations')
group.add_argument('--fp16', action='store_true',
help='Run model in fp16 mode')
group.add_argument('--apply-query-key-layer-scaling', action='store_true',
help='Scale Q * K^T by 1 / layer-number. If this flag '
'is set, then it will automatically set '
'attention-softmax-in-fp32 to true')
group.add_argument('--attention-softmax-in-fp32', action='store_true',
help='Run attention masking and softmax in fp32.')
group.add_argument('--fp32-embedding', action='store_true',
help='embedding in fp32')
group.add_argument('--fp32-layernorm', action='store_true',
help='layer norm in fp32')
group.add_argument('--fp32-tokentypes', action='store_true',
help='embedding token types in fp32')
group.add_argument('--fp32-allreduce', action='store_true',
help='all-reduce in fp32')
group.add_argument('--hysteresis', type=int, default=2,
help='hysteresis for dynamic loss scaling')
group.add_argument('--loss-scale', type=float, default=None,
help='Static loss scaling, positive power of 2 '
'values can improve fp16 convergence. If None, dynamic '
'loss scaling is used.')
group.add_argument('--loss-scale-window', type=float, default=1000,
help='Window over which to raise/lower dynamic scale')
group.add_argument('--min-scale', type=float, default=1,
help='Minimum loss scale for dynamic loss scale')
group.add_argument('--attention-dropout', type=float, default=0.1,
help='Post attention dropout probability.')
group.add_argument('--hidden-dropout', type=float, default=0.1,
help='Dropout probability for hidden state transformer.')
group.add_argument('--weight-decay', type=float, default=0.01,
help='Weight decay coefficient for L2 regularization.')
group.add_argument('--clip-grad', type=float, default=1.0,
help='Gradient clipping based on global L2 norm.')
return parser
def add_training_args(parser):
"""Training arguments."""
def _add_training_args(parser):
group = parser.add_argument_group(title='training')
group = parser.add_argument_group('train', 'training configurations')
group.add_argument('--batch-size', type=int, default=4,
help='Data Loader batch size')
group.add_argument('--weight-decay', type=float, default=0.01,
help='weight decay coefficient for L2 regularization')
group.add_argument('--batch-size', type=int, default=None,
help='Batch size per model instance (local batch size). '
'Global batch size is local batch size times data '
'parallel size.')
group.add_argument('--checkpoint-activations', action='store_true',
help='checkpoint activation to allow for training '
'with larger models and sequences')
help='Checkpoint activation to allow for training '
'with larger models, sequences, and batch sizes.')
group.add_argument('--checkpoint-num-layers', type=int, default=1,
help='chunk size (number of layers) for checkpointing')
group.add_argument('--clip-grad', type=float, default=1.0,
help='gradient clipping')
group.add_argument('--train-iters', type=int, default=1000000,
help='total number of iterations to train over all training runs')
help='chunk size (number of layers) for checkpointing.')
group.add_argument('--train-iters', type=int, default=None,
help='Total number of iterations to train over all '
'training runs.')
group.add_argument('--log-interval', type=int, default=100,
help='report interval')
help='Report loss and timing interval.')
group.add_argument('--exit-interval', type=int, default=None,
help='Exit the program after this many new iterations.')
help='Exit the program after the iteration is divisible '
'by this value.')
group.add_argument('--tensorboard-dir', type=str, default=None,
help='Write TensorBoard logs to this directory')
help='Write TensorBoard logs to this directory.')
return parser
def _add_initialization_args(parser):
group = parser.add_argument_group(title='initialization')
group.add_argument('--seed', type=int, default=1234,
help='random seed')
# Batch producer arguments
group.add_argument('--reset-position-ids', action='store_true',
help='Reset position ids after end-of-document token.')
group.add_argument('--reset-attention-mask', action='store_true',
help='Reset self attention mask after '
'end-of-document token.')
group.add_argument('--eod-mask-loss', action='store_true',
help='Mask loss for the end of document tokens')
help='Random seed used for python, numpy, '
'pytorch, and cuda.')
group.add_argument('--init-method-std', type=float, default=0.02,
help='Standard deviation of the zero mean normal '
'distribution used for weight initialization.')
# Learning rate.
group.add_argument('--lr-decay-iters', type=int, default=None,
help='number of iterations to decay LR over,'
' If None defaults to `--train-iters`*`--epochs`')
return parser
def _add_learning_rate_args(parser):
group = parser.add_argument_group(title='learning rate')
group.add_argument('--lr', type=float, default=None,
help='Initial learning rate. Depending on decay style '
'and initial warmup, the learning rate at each '
'iteration would be different.')
group.add_argument('--lr-decay-style', type=str, default='linear',
choices=['constant', 'linear', 'cosine', 'exponential'],
help='learning rate decay function')
group.add_argument('--lr', type=float, default=1.0e-4,
help='initial learning rate')
help='Learning rate decay function.')
group.add_argument('--lr-decay-iters', type=int, default=None,
help='number of iterations to decay learning rate over,'
' If None defaults to `--train-iters`')
group.add_argument('--min-lr', type=float, default=0.0,
help='Minimum value for learning rate. The scheduler '
'clips values below this threshold.')
group.add_argument('--warmup', type=float, default=0.01,
help='percentage of data to warmup on (.01 = 1% of all '
'training iters). Default 0.01')
help='Percentage of total iterations to warmup on '
'(.01 = 1 percent of all training iters).')
group.add_argument('--override-lr-scheduler', action='store_true',
help='Reset the values of the scheduler (learning rate,'
'warmup iterations, minimum learning rate, maximum '
'number of iterations, and decay style from input '
'arguments and ignore values from checkpoints. Note '
'that all the above values will be reset.')
group.add_argument('--use-checkpoint-lr-scheduler', action='store_true',
help='Use checkpoint to set the values of the scheduler '
'(learning rate, warmup iterations, minimum learning '
'rate, maximum number of iterations, and decay style '
'from checkpoint and ignore input arguments.')
return parser
def _add_checkpointing_args(parser):
group = parser.add_argument_group(title='checkpointing')
group.add_argument('--save', type=str, default=None,
help='Output directory to save checkpoints to.')
group.add_argument('--save-interval', type=int, default=5000,
help='number of iterations between saves')
group.add_argument('--save-interval', type=int, default=None,
help='Number of iterations between checkpoint saves.')
group.add_argument('--no-save-optim', action='store_true',
help='Do not save current optimizer.')
group.add_argument('--no-save-rng', action='store_true',
help='Do not save current rng state.')
group.add_argument('--load', type=str, default=None,
help='Path to a directory containing a model checkpoint.')
help='Directory containing a model checkpoint.')
group.add_argument('--no-load-optim', action='store_true',
help='Do not load optimizer when loading checkpoint.')
group.add_argument('--no-load-rng', action='store_true',
help='Do not load rng state when loading checkpoint.')
group.add_argument('--finetune', action='store_true',
help='Load model for finetuning. Do not load optimizer '
'or rng state from checkpoint and set iteration to 0. '
'Assumed when loading a release checkpoint.')
group.add_argument('--resume-dataloader', action='store_true',
help='Resume the dataloader when resuming training. '
'Does not apply to tfrecords dataloader, try resuming '
'with a different seed in this case.')
# distributed training args
group.add_argument('--distributed-backend', default='nccl',
help='which backend to use for distributed '
'training. One of [gloo, nccl]')
group.add_argument('--DDP-impl', default='local',
help='which DistributedDataParallel implementation '
'to use. One of [local, torch]')
group.add_argument('--local_rank', type=int, default=None,
help='local rank passed from distributed launcher')
# autoresume
group.add_argument('--adlr-autoresume', action='store_true',
help='enable autoresume on adlr cluster.')
group.add_argument('--adlr-autoresume-interval', type=int, default=1000,
help='Interval over which to check for autoresume '
'termination signal.')
return parser
def add_evaluation_args(parser):
"""Evaluation arguments."""
group = parser.add_argument_group('validation', 'validation configurations')
def _add_mixed_precision_args(parser):
group = parser.add_argument_group(title='mixed precision')
group.add_argument('--eval-batch-size', type=int, default=None,
help='Data Loader batch size for evaluation datasets. '
'Defaults to `--batch-size`')
group.add_argument('--eval-iters', type=int, default=100,
help='Number of iterations to run for evaluation on '
'validation/test sets')
group.add_argument('--eval-interval', type=int, default=1000,
help='interval between running evaluation on validation set')
group.add_argument('--eval-seq-length', type=int, default=None,
help='Maximum sequence length to process for '
'evaluation. Defaults to `--seq-length`')
group.add_argument('--eval-max-preds-per-seq', type=int, default=None,
help='Maximum number of predictions to use for '
'evaluation. Defaults to '
'math.ceil(`--eval-seq-length`*.15/10)*10')
group.add_argument('--overlapping-eval', type=int, default=32,
help='sliding window for overlapping eval ')
group.add_argument('--cloze-eval', action='store_true',
help='Evaluation dataset from `--valid-data` is a cloze task')
group.add_argument('--strict-lambada', action='store_true',
help='use more difficult formulation of lambada')
group.add_argument('--eval-hf', action='store_true',
help='perform evaluation with huggingface openai model. '
'use `--load` to specify weights path to be loaded')
group.add_argument('--load-openai', action='store_true',
help='load openai weights into our model. Use `--load` '
'to specify weights path to be loaded')
return parser
group.add_argument('--fp16', action='store_true',
help='Run model in fp16 mode.')
group.add_argument('--apply-query-key-layer-scaling', action='store_true',
help='Scale Q * K^T by 1 / layer-number. If this flag '
'is set, then it will automatically set '
'attention-softmax-in-fp32 to true')
group.add_argument('--attention-softmax-in-fp32', action='store_true',
help='Run attention masking and softmax in fp32.')
group.add_argument('--fp32-allreduce', action='store_true',
help='All-reduce in fp32')
group.add_argument('--hysteresis', type=int, default=2,
help='hysteresis for dynamic loss scaling')
group.add_argument('--loss-scale', type=float, default=None,
help='Static loss scaling, positive power of 2 '
'values can improve fp16 convergence. If None, dynamic '
'loss scaling is used.')
group.add_argument('--loss-scale-window', type=float, default=1000,
help='Window over which to raise/lower dynamic scale.')
group.add_argument('--min-scale', type=float, default=1,
help='Minimum loss scale for dynamic loss scale.')
def add_text_generate_args(parser):
"""Text generate arguments."""
group = parser.add_argument_group('Text generation', 'configurations')
group.add_argument("--temperature", type=float, default=1.0)
group.add_argument("--greedy", action='store_true', default=False)
group.add_argument("--top_p", type=float, default=0.0)
group.add_argument("--top_k", type=int, default=0)
group.add_argument("--out-seq-length", type=int, default=1024)
group.add_argument("--sample-input-file", type=str, default="",
help='get input from file instead of interactive mode, '
'each line is an input' )
group.add_argument("--sample-output-file", type=str, default="",
help='output file got from --sample-input-file')
group.add_argument("--num-samples", type=int, default=0,
help='number of samples to generate unconditionally, '
'defaults to 0 and interactive conditional sampling')
group.add_argument("--genfile", type=str,
help='output file when generating unconditionally')
group.add_argument("--recompute", action='store_true',
help='during generation recompute all attention '
'instead of using previously computed keys/values.')
return parser
def add_data_args(parser):
"""Train/valid/test data arguments."""
group = parser.add_argument_group('data', 'data configurations')
def _add_distributed_args(parser):
group = parser.add_argument_group(title='distributed')
group.add_argument('--model-parallel-size', type=int, default=1,
help='size of the model parallel.')
group.add_argument('--shuffle', action='store_true',
help='Shuffle data. Shuffling is deterministic '
'based on seed and current epoch.')
group.add_argument('--data-loader', type=str, default=None,
choices=['raw', 'lazy', 'tfrecords', 'numpy', 'binary'],
help='Which data loader to use. Default varies by model.')
group.add_argument('--train-data', nargs='+', default=None,
help='Whitespace separated paths or corpora names '
'for training.')
group.add_argument('--valid-data', nargs='*', default=None,
help='path(s) to the validation data.')
group.add_argument('--test-data', nargs='*', default=None,
help='path(s) to the testing data.')
group.add_argument('--data-path', nargs='+', default=None,
help='path to combined dataset to split')
group.add_argument('--split', default='1000,1,1',
help='comma-separated list of proportions for training,'
' validation, and test split')
group.add_argument('--seq-length', type=int, default=512,
help="Maximum sequence length to process")
group.add_argument('--max-preds-per-seq', type=int, default=None,
help='Maximum number of predictions to use per sequence. '
'Defaults to math.ceil(`--seq-length`*.15/10)*10. '
'MUST BE SPECIFIED IF `--data-loader tfrecords`.')
# arguments for binary data loader
parser.add_argument('--vocab', type=str, default='vocab.txt',
help='path to vocab file')
parser.add_argument('--data-impl', type=str, default='infer',
help='implementation of indexed datasets',
choices=['lazy', 'cached', 'mmap', 'infer'])
parser.add_argument('--max-num-samples', type=int, default=None,
help='Maximum number of samples to plan for, defaults to total iters * batch-size.')
parser.add_argument('--data-epochs', type=int, default=None,
help='Number of epochs to plan for, defaults to using --max-num-samples')
parser.add_argument('--mask-prob', default=0.15, type=float,
help='probability of replacing a token with mask')
parser.add_argument('--short-seq-prob', default=0.1, type=float,
help='probability of producing a short sequence')
parser.add_argument('--skip-mmap-warmup', action='store_true',
help='skip warming up mmap files')
# arguments for numpy data loader
group.add_argument('--input-data-sizes-file', type=str, default='sizes.txt',
help='the filename containing all the shards sizes for numpy data loader')
# arguments for raw/tfrecords data loader
group.add_argument('--delim', default=',',
help='delimiter used to parse csv data files')
group.add_argument('--text-key', default='sentence',
help='key to use to extract text from json/csv')
group.add_argument('--eval-text-key', default=None,
help='key to use to extract text from '
'json/csv evaluation datasets')
group.add_argument('--loose-json', action='store_true',
help='Use loose json (one json-formatted string per '
'newline), instead of tight json (data file is one '
'json string)')
group.add_argument('--presplit-sentences', action='store_true',
help='Dataset content consists of documents where '
'each document consists of newline separated sentences')
group.add_argument('--num-workers', type=int, default=2,
help="""Number of workers to use for dataloading""")
group.add_argument('--tokenizer-model-type', type=str,
default='bert-large-uncased',
help="Model type to use for sentencepiece tokenization \
(one of ['bpe', 'char', 'unigram', 'word']) or \
bert vocab to use for BertWordPieceTokenizer (one of \
['bert-large-uncased', 'bert-large-cased', etc.])")
group.add_argument('--tokenizer-path', type=str, default='tokenizer.model',
help='path used to save/load sentencepiece tokenization '
'models')
group.add_argument('--tokenizer-type', type=str,
default='BertWordPieceTokenizer',
choices=['CharacterLevelTokenizer',
'SentencePieceTokenizer',
'BertWordPieceTokenizer',
'GPT2BPETokenizer'],
help='what type of tokenizer to use')
group.add_argument("--cache-dir", default=None, type=str,
help="Where to store pre-trained BERT downloads")
help='Size of the model parallel.')
group.add_argument('--distributed-backend', default='nccl',
choices=['nccl', 'gloo'],
help='Which backend to use for distributed training.')
group.add_argument('--DDP-impl', default='local',
choices=['local', 'torch'],
help='which DistributedDataParallel implementation '
'to use.')
group.add_argument('--local_rank', type=int, default=None,
help='local rank passed from distributed launcher.')
return parser
def get_args():
"""Parse all the args."""
parser = argparse.ArgumentParser(description='PyTorch BERT Model')
parser = add_model_config_args(parser)
parser = add_fp16_config_args(parser)
parser = add_training_args(parser)
parser = add_evaluation_args(parser)
parser = add_text_generate_args(parser)
parser = add_data_args(parser)
args = parser.parse_args()
def _add_validation_args(parser):
group = parser.add_argument_group(title='validation')
if not args.train_data and not args.data_path:
print('WARNING: No training data specified')
args.cuda = torch.cuda.is_available()
group.add_argument('--eval-iters', type=int, default=100,
help='Number of iterations to run for evaluation on '
'validation/test sets.')
group.add_argument('--eval-interval', type=int, default=1000,
help='Interval between running evaluation on '
'validation set.')
args.rank = int(os.getenv('RANK', '0'))
args.world_size = int(os.getenv("WORLD_SIZE", '1'))
return parser
if os.getenv('OMPI_COMM_WORLD_LOCAL_RANK'):
# We are using (OpenMPI) mpirun for launching distributed data parallel processes
local_rank = int(os.getenv('OMPI_COMM_WORLD_LOCAL_RANK'))
local_size = int(os.getenv('OMPI_COMM_WORLD_LOCAL_SIZE'))
# Possibly running with Slurm
num_nodes = int(os.getenv('SLURM_JOB_NUM_NODES', '1'))
nodeid = int(os.getenv('SLURM_NODEID', '0'))
def _add_data_args(parser):
group = parser.add_argument_group(title='data and dataloader')
group.add_argument('--data-path', type=str, default=None,
help='Path to combined dataset to split.')
group.add_argument('--split', type=str, default='969, 30, 1',
help='Comma-separated list of proportions for training,'
' validation, and test split. For example the split '
'`90,5,5` will use 90%% of data for training, 5%% for '
'validation and 5%% for test.')
group.add_argument('--vocab-file', type=str, default=None,
help='Path to the vocab file.')
group.add_argument('--merge-file', type=str, default=None,
help='Path to the BPE merge file.')
group.add_argument('--seq-length', type=int, default=None,
help="Maximum sequence length to process.")
group.add_argument('--mask-prob', type=float, default=0.15,
help='Probability of replacing a token with mask.')
group.add_argument('--short-seq-prob', type=float, default=0.1,
help='Probability of producing a short sequence.')
group.add_argument('--mmap-warmup', action='store_true',
help='Warm up mmap files.')
group.add_argument('--num-workers', type=int, default=2,
help="Dataloader number of workers.")
group.add_argument('--tokenizer-type', type=str,
default=None,
choices=['BertWordPieceLowerCase',
'GPT2BPETokenizer'],
help='What type of tokenizer to use.')
group.add_argument('--data-impl', type=str, default='infer',
choices=['lazy', 'cached', 'mmap', 'infer'],
help='Implementation of indexed datasets.')
group.add_argument('--reset-position-ids', action='store_true',
help='Reset position ids after end-of-document token.')
group.add_argument('--reset-attention-mask', action='store_true',
help='Reset self attention mask after '
'end-of-document token.')
group.add_argument('--eod-mask-loss', action='store_true',
help='Mask loss for the end of document tokens.')
args.local_rank = local_rank
args.rank = nodeid*local_size + local_rank
args.world_size = num_nodes*local_size
return parser
args.model_parallel_size = min(args.model_parallel_size, args.world_size)
if args.rank == 0:
print('using world size: {} and model-parallel size: {} '.format(
args.world_size, args.model_parallel_size))
args.dynamic_loss_scale = False
if args.loss_scale is None:
args.dynamic_loss_scale = True
if args.rank == 0:
print(' > using dynamic loss scaling')
def _add_autoresume_args(parser):
group = parser.add_argument_group(title='autoresume')
# The fp32_* and fp16_* args are meant to be active only when
# the fp16 arg is set, so they should all default to False.
if not args.fp16:
args.fp32_embedding = False
args.fp32_tokentypes = False
args.fp32_layernorm = False
group.add_argument('--adlr-autoresume', action='store_true',
help='Enable autoresume on adlr cluster.')
group.add_argument('--adlr-autoresume-interval', type=int, default=1000,
help='Interval over which to check for autoresume '
'termination signal.')
return args
return parser
# coding=utf-8
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Input/output checkpointing."""
import os
import random
import sys
import numpy as np
import torch
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
from megatron import mpu
from megatron import get_args
from megatron import print_rank_0
def check_checkpoint_args(checkpoint_args):
"""Ensure fixed arguments for a model are the same for the input
arguments and the one retreived frm checkpoint."""
args = get_args()
def _compare(arg_name):
checkpoint_value = getattr(checkpoint_args, arg_name)
args_value = getattr(args, arg_name)
error_message = '{} value from checkpoint ({}) is not equal to the ' \
'input argument value ({}).'.format(
arg_name, checkpoint_value, args_value)
assert checkpoint_value == args_value, error_message
_compare('num_layers')
_compare('hidden_size')
_compare('num_attention_heads')
_compare('max_position_embeddings')
_compare('make_vocab_size_divisible_by')
_compare('padded_vocab_size')
_compare('tokenizer_type')
_compare('model_parallel_size')
def ensure_directory_exists(filename):
"""Build filename's path if it does not already exists."""
dirname = os.path.dirname(filename)
if not os.path.exists(dirname):
os.makedirs(dirname)
def get_checkpoint_name(checkpoints_path, iteration,
release=False, mp_rank=None):
"""A unified checkpoint name."""
if release:
directory = 'release'
else:
directory = 'iter_{:07d}'.format(iteration)
return os.path.join(checkpoints_path, directory,
'mp_rank_{:02d}'.format(
mpu.get_model_parallel_rank() if mp_rank is None
else mp_rank),
'model_optim_rng.pt')
def get_checkpoint_tracker_filename(checkpoints_path):
"""Tracker file rescords the latest chckpoint during
training to restart from."""
return os.path.join(checkpoints_path, 'latest_checkpointed_iteration.txt')
def save_checkpoint(iteration, model, optimizer, lr_scheduler):
"""Save a model checkpoint."""
args = get_args()
# Only rank zero of the data parallel writes to the disk.
if isinstance(model, torchDDP):
model = model.module
if mpu.get_data_parallel_rank() == 0:
# Arguments, iteration, and model.
state_dict = {}
state_dict['args'] = args
state_dict['iteration'] = iteration
state_dict['model'] = model.state_dict_for_save_checkpoint()
# Optimizer stuff.
if not args.no_save_optim:
if optimizer is not None:
state_dict['optimizer'] = optimizer.state_dict()
if lr_scheduler is not None:
state_dict['lr_scheduler'] = lr_scheduler.state_dict()
# RNG states.
if not args.no_save_rng:
state_dict['random_rng_state'] = random.getstate()
state_dict['np_rng_state'] = np.random.get_state()
state_dict['torch_rng_state'] = torch.get_rng_state()
state_dict['cuda_rng_state'] = torch.cuda.get_rng_state()
state_dict['rng_tracker_states'] \
= mpu.get_cuda_rng_tracker().get_states()
# Save.
checkpoint_name = get_checkpoint_name(args.save, iteration)
print('global rank {} is saving checkpoint at iteration {:7d} to {}'.
format(torch.distributed.get_rank(), iteration, checkpoint_name))
ensure_directory_exists(checkpoint_name)
torch.save(state_dict, checkpoint_name)
print(' successfully saved {}'.format(checkpoint_name))
# Wait so everyone is done (necessary)
torch.distributed.barrier()
# And update the latest iteration
if torch.distributed.get_rank() == 0:
tracker_filename = get_checkpoint_tracker_filename(args.save)
with open(tracker_filename, 'w') as f:
f.write(str(iteration))
# Wait so everyone is done (not necessary)
torch.distributed.barrier()
def load_checkpoint(model, optimizer, lr_scheduler):
"""Load a model checkpoint and return the iteration."""
args = get_args()
if isinstance(model, torchDDP):
model = model.module
# Read the tracker file and set the iteration.
tracker_filename = get_checkpoint_tracker_filename(args.load)
# If no tracker file, return iteration zero.
if not os.path.isfile(tracker_filename):
print_rank_0('WARNING: could not find the metadata file {} '.format(
tracker_filename))
print_rank_0(' will not load any checkpoints and will start from '
'random')
return 0
# Otherwise, read the tracker file and either set the iteration or
# mark it as a release checkpoint.
iteration = 0
release = False
with open(tracker_filename, 'r') as f:
metastring = f.read().strip()
try:
iteration = int(metastring)
except ValueError:
release = metastring == 'release'
if not release:
print_rank_0('ERROR: Invalid metadata file {}. Exiting'.format(
tracker_filename))
sys.exit()
assert iteration > 0 or release, 'error parsing metadata file {}'.format(
tracker_filename)
# Checkpoint.
checkpoint_name = get_checkpoint_name(args.load, iteration, release)
if mpu.get_data_parallel_rank() == 0:
print('global rank {} is loading checkpoint {}'.format(
torch.distributed.get_rank(), checkpoint_name))
# Load the checkpoint.
try:
state_dict = torch.load(checkpoint_name, map_location='cpu')
except ModuleNotFoundError:
# For backward compatibility.
print_rank_0(' > deserializing using the old code structure ...')
sys.modules['fp16.loss_scaler'] = sys.modules[
'megatron.fp16.loss_scaler']
state_dict = torch.load(checkpoint_name, map_location='cpu')
sys.modules.pop('fp16.loss_scaler', None)
except BaseException:
print_rank_0('could not load the checkpoint')
sys.exit()
# Set iteration.
if args.finetune or release:
iteration = 0
else:
try:
iteration = state_dict['iteration']
except KeyError:
try: # Backward compatible with older checkpoints
iteration = state_dict['total_iters']
except KeyError:
print_rank_0('A metadata file exists but unable to load '
'iteration from checkpoint {}, exiting'.format(
checkpoint_name))
sys.exit()
# Check arguments.
if 'args' in state_dict:
checkpoint_args = state_dict['args']
check_checkpoint_args(checkpoint_args)
else:
print_rank_0('could not find arguments in the checkpoint ...')
# Model.
model.load_state_dict(state_dict['model'])
# Optimizer.
if not release and not args.finetune and not args.no_load_optim:
try:
if optimizer is not None:
optimizer.load_state_dict(state_dict['optimizer'])
if lr_scheduler is not None:
lr_scheduler.load_state_dict(state_dict['lr_scheduler'])
except KeyError:
print_rank_0('Unable to load optimizer from checkpoint {}. '
'Specify --no-load-optim or --finetune to prevent '
'attempting to load the optimizer state, '
'exiting ...'.format(checkpoint_name))
sys.exit()
# rng states.
if not release and not args.finetune and not args.no_load_rng:
try:
random.setstate(state_dict['random_rng_state'])
np.random.set_state(state_dict['np_rng_state'])
torch.set_rng_state(state_dict['torch_rng_state'])
torch.cuda.set_rng_state(state_dict['cuda_rng_state'])
mpu.get_cuda_rng_tracker().set_states(
state_dict['rng_tracker_states'])
except KeyError:
print_rank_0('Unable to load rng state from checkpoint {}. '
'Specify --no-load-rng or --finetune to prevent '
'attempting to load the rng state, '
'exiting ...'.format(checkpoint_name))
sys.exit()
torch.distributed.barrier()
if mpu.get_data_parallel_rank() == 0:
print(' successfully loaded {}'.format(checkpoint_name))
return iteration
from . import indexed_dataset
from .bert_tokenization import FullTokenizer as FullBertTokenizer
from .albert_dataset import AlbertDataset