Commit abe36e2e authored by Raul Puri's avatar Raul Puri
Browse files

large update including model parallelism and gpt2


Co-authored-by: default avatarshoeybi <shoeybim@gmail.com>
Co-authored-by: default avatarraulpuric <raulpuric@berkeley.edu>
Co-authored-by: default avatarjaredcasper <jaredcasper@gmail.com>
Co-authored-by: default avatarmpatwary <mostofa.patwary@gmail.com>
Co-authored-by: default avatarplegresl <plegresl@gmail.com>
parent 0399d32c
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
------------- LICENSE FOR huggingface(transformer) repository --------------
......
Megatron is a large, powerful transformer. This repo is for ongoing research on training large, powerful transformer language models at scale. Currently, we support model-parallel, multinode training of [GPT2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [BERT](https://arxiv.org/pdf/1810.04805.pdf) in mixed precision.
Our codebase is capable of efficiently training a 72-layer, 8.3 billion parameter GPT2 language model with 8-way model parallelism and 64-way data parallelism across 512 GPUs. We find that bigger language models are able to surpass the current GPT2-1.5B wikitext perplexities in as little as 5 epochs of training.
For BERT training, our codebase trains BERT Large on 64 V100 GPUs in 3 days. We achieved a final language modeling perplexity of 3.15 and a SQuAD F1-score of 90.7.
<!--
do we want to make any claims about GPT2 speed, convergence, or model release
-->
# Setup
We officially support only python3.6.
To use this repo please install the latest supported versions of PyTorch with GPU support.
Additionally, part of this codebase leverages tensorflow-cpu to (optionally) perform dataloading of TFRecords for BERT training. We recommend either utilizing the provided Dockerfile in [`./docker/`](./docker) or creating a virtual environment (to avoid breaking existing tf installations) and installing our `requirements.txt`.
```
python -m pip install virtualenv
......@@ -16,55 +23,155 @@ pip install -r requirements.txt
# Usage
We've provided 5 scripts that pretrain BERT and 3 scripts that pretrain GPT2. Save and load model checkpoints with `--save` and `--load`. Additionally, we provide GPT2 scripts for interactive text generation and zero-shot evaluation of GPT2 on wikitext and LAMBADA.
## BERT Pretraining
`bash scripts/pretrain_bert.sh`
This script runs single-GPU BERT pretraining and is mainly for debugging purposes. The optimization arguments are set with 64-way distributed training in mind.
To use this script place your `--train-data` in loose json format with one json per line. The text field of your json dictionaries should correspond to `--text-key`.
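For illustration, here is a minimal way to produce such a file (the file name and contents are hypothetical):
```
import json

# Write a small loose-json corpus: one json object per line.
# The "text" field is what you would point --text-key at.
documents = [
    {"text": "The first document. It contains a few sentences."},
    {"text": "The second document."},
]
with open("my_corpus.json", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")
```
The full single-GPU BERT pretraining command is: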
```
python pretrain_bert.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 4 \
--seq-length 512 \
--max-preds-per-seq 80 \
--max-position-embeddings 512 \
--train-iters 1000000 \
--save checkpoints/bert_345m \
--load checkpoints/bert_345m \
--resume-dataloader \
--train-data wikipedia \
--lazy-loader \
--tokenizer-type BertWordPieceTokenizer \
--tokenizer-model-type bert-large-uncased \
--presplit-sentences \
--cache-dir cache \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.0001 \
--lr-decay-style linear \
--lr-decay-iters 990000 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--fp16 \
--fp32-embedding
```
## GPT2 Pretraining
`bash scripts/pretrain_gpt2.sh`
This script runs single-GPU GPT2 pretraining and is mainly for debugging purposes. The optimization arguments are set with 64-way distributed training in mind.
It follows largely the same format as the previous script with a few notable differences: the `--tokenizer-type` has been switched to a `GPT2BPETokenizer`, the `--lr-decay-style` has been switched to cosine decay, and activation checkpointing has been turned on with `--checkpoint-activations` and `--checkpoint-num-layers` set to checkpoint every `1` layer.
Additionally, GPT2 uses a different parameter initialization from BERT, designed for training deep residual networks. To train BERT with this initialization instead, use `--deep-init`.
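As a rough illustration of what that initialization does (a sketch based on the `--deep-init` help text, not the repository's exact code), output projection weights are drawn with their standard deviation scaled by 1/sqrt(2N) for an N-layer model:
```
import math
import torch

def deep_init_(weight, num_layers, base_std=0.02):
    # Residual-friendly init: shrink the std of output projection layers by
    # a factor of 1/sqrt(2N), where N is the number of transformer layers.
    torch.nn.init.normal_(weight, mean=0.0, std=base_std / math.sqrt(2.0 * num_layers))
```
The full single-GPU GPT2 pretraining command is: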
```
python pretrain_gpt2.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--batch-size 8 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 320000 \
--save checkpoints/gpt2_345m \
--load checkpoints/gpt2_345m \
--resume-dataloader \
--train-data wikipedia \
--lazy-loader \
--tokenizer-type GPT2BPETokenizer \
--cache-dir cache \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--checkpoint-activations \
--fp16
```
## GPT2 Text Generation
`bash scripts/generate_text.sh`
Starts an interactive terminal session that generates text either conditionally or unconditionally depending on what the user enters into the prompt. Specify the model in the script by setting the `CHECKPOINT_PATH` variable and the appropriate model configuration.
The script is capable of greedy sampling, top-k sampling, or top-p (nucleus) sampling, as specified by the appropriate variables within the script.
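For reference, here is a minimal sketch of how top-k and top-p (nucleus) filtering of next-token logits typically work; it illustrates the idea rather than the script's exact implementation:
```
import torch

def filter_logits(logits, top_k=0, top_p=0.0):
    # Mask out logits outside the top-k tokens and/or outside the smallest set
    # of tokens whose cumulative probability exceeds top_p.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k)[0][..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float('-inf'))
    if top_p > 0.0:
        sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
        cumulative = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cumulative > top_p
        remove[..., 1:] = remove[..., :-1].clone()  # always keep the most likely token
        remove[..., 0] = False
        sorted_logits = sorted_logits.masked_fill(remove, float('-inf'))
        logits = sorted_logits.gather(-1, sorted_idx.argsort(dim=-1))
    return logits

# Greedy decoding takes the argmax of the logits; sampling draws from
# softmax(filter_logits(logits / temperature, top_k, top_p)).
```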
## GPT2 Evaluation
We support 3 modes of GPT2 evaluation with [`./scripts/run_gpt2_eval.py`](./scripts/run_gpt2_eval.py): wikitext perplexity (PPL) evaluation, LAMBADA cloze accuracy, and large-corpus PPL evaluation.
### Wikitext PPL evaluation
For an even comparison with prior works, we evaluate wikitext perplexity on the word-level wikitext test dataset, which can be downloaded [here](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), and appropriately adjust the perplexity computation for the change in token count introduced by our subword tokenizer.
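Concretely, the adjustment amounts to accumulating the model's negative log-likelihood over subword tokens but normalizing by the number of original word-level tokens, roughly as follows (a hypothetical sketch, not the script's exact code):
```
import math

def word_level_perplexity(total_subword_nll, num_word_level_tokens):
    # Average the summed loss over *word-level* tokens rather than subword
    # tokens so the result is comparable to word-level language models.
    return math.exp(total_subword_nll / num_word_level_tokens)
```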
We use the following command to run wikitext evaluation:
```
python scripts/run_gpt2_eval.py \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--model-path <gpt2_345_path> \
--data-path <wikitext_tokens_test_path> \
--batch-size 16 \
--cache-dir cache
```
### Lambada Cloze Accuracy
To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens), we utilize a detokenized, processed version of the LAMBADA dataset sourced from [here](https://github.com/cybertronai/bflm/blob/master/lambada_test.jsonl).
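The metric itself is exact-match accuracy on the final word: the model reads each passage minus its last word, and its greedy prediction is compared against that word. A hypothetical helper for illustration:
```
def cloze_accuracy(predicted_last_tokens, target_last_tokens):
    # Fraction of LAMBADA examples whose final token(s) are predicted exactly.
    correct = sum(int(p == t) for p, t in zip(predicted_last_tokens, target_last_tokens))
    return correct / len(target_last_tokens)
```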
We use the following command to run lambada evaluation:
```
python scripts/run_gpt2_eval.py \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--model-path <gpt2_345_path> \
--data-path <lambada_test_path> \
--batch-size 16 \
--cloze-eval \
--cache-dir cache
```
### Large Corpora PPL evaluation
This functionality allows one to evaluate the GPT2 model on a loose json file. With the following command, we evaluate the GPT2 model for 5000 iterations at a batch size of 16 on a webtext test data split. We recommend that the user presplit their dataset before training a model, according to the procedure outlined [below](#partitioning-datasets-into-train-val-test).
```
python scripts/run_gpt2_eval.py \
--model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--model-path <gpt2_345_path> \
--data-path <webtext_test_path> \
--batch-size 16 \
--eval-iters 5000 \
--webtext-eval \
--cache-dir cache
```
## Distributed BERT or GPT2 Pretraining
`bash scripts/pretrain_bert_distributed.sh` or `bash scripts/pretrain_gpt2_distributed.sh`
To use these scripts, follow the same data preparation procedure as in earlier sections. This script uses the pytorch distributed launcher to launch distributed training. As such, multinode training can be achieved by properly setting environment variables for the `env://` init method. See the official pytorch [documentation](https://pytorch.org/docs/stable/distributed.html#launch-utility) for further description of these [environment variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization). By default multinode training uses the nccl distributed backend.
## Model Parallel BERT or GPT2 Pretraining
`bash scripts/pretrain_bert_model_parallel.sh` or `bash scripts/pretrain_gpt2_model_parallel.sh`
These scripts build upon the distributed training scripts and are identical in setup. They differ in use of the `--model-parallel-size` flag. For model parallelism of 2 and a world size of 8, the scripts will launch training with 4-way distributed data parallelism and 2-way model parallelism.
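In other words, the data-parallel degree is the world size divided by `--model-parallel-size`:
```
world_size = 8            # total number of GPUs/processes
model_parallel_size = 2   # value passed as --model-parallel-size
data_parallel_size = world_size // model_parallel_size  # -> 4
```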
We note that we have experimented with multiple distributed data parallel implementations: a simple one of our own, which performs gradient all-reduce at the end of the backpropagation step, and torch's distributed data parallel wrapper, which overlaps gradient reduction with backpropagation computation. To switch between these two options, toggle the `USE_TORCH_DDP` flag (the default is set to `False` and uses our DDP implementation) at the top of `pretrain_bert.py` and `pretrain_gpt2.py`. We find that torch distributed data parallelism is more efficient at larger model parallel sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 74% when torch's distributed data parallel is used. However, the overlapping method requires more memory and for some configurations (e.g., 2.5 billion parameters using 2-way model parallelism and 1.2 billion parameters with no model parallelism) can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time.
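For intuition, here is a minimal sketch of the first approach (a single gradient all-reduce after the backward pass), assuming the data-parallel process group has already been initialized; the repository's actual DDP implementations are more involved:
```
import torch

def allreduce_gradients(model, data_parallel_group=None):
    # After loss.backward() completes, average each parameter's gradient
    # across all data-parallel ranks in one pass (no overlap with compute).
    world_size = torch.distributed.get_world_size(group=data_parallel_group)
    for param in model.parameters():
        if param.grad is not None:
            torch.distributed.all_reduce(param.grad.data, group=data_parallel_group)
            param.grad.data.div_(world_size)
```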
## Distributed BERT Pretraining with TFRecords
`bash scripts/pretrain_bert_tfrecords_distributed.sh`
......@@ -77,11 +184,31 @@ This script takes advantage of TensorFlow BERT's [`create_pretraining.py`](https
This script runs BERT pretraining with a `sentencepiece` tokenizer. If no sentencepiece tokenizer exists at `--tokenizer-path` one will be trained automatically. The sentencepiece tokenizer can be used with the previous scripts (NOTE: sentencepiece training can only happen during single gpu pretraining). `<--tokenizer-path>.vocab` can be used with [`create_pretraining_data.py`](https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/create_pretraining_data.py) to make a TFRecord dataset with the given tokenization.
# Collecting Wikipedia Training Data
# Data sets
We do not host any datasets for GPT2 or BERT training; however, we detail their collection so that our results may be reproduced.
## Collecting Wikipedia Training Data
We recommend following the wikipedia data extraction process specified by google research: "the recommended pre-processing is to download [the latest dump](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), extract the text with [WikiExtractor.py](https://github.com/attardi/wikiextractor), and then apply any necessary cleanup to convert it into plain text."
We recommend using the `--json` argument when using WikiExtractor, which will dump the wikipedia data into loose json format (one json per line), making it more manageable and readily consumable by our codebase. We recommend further preprocessing this json dataset with nltk punctuation standardization and presplitting each document into newline-separated sentences. This can be done with the provided script `./scripts/presplit_sentences_json.py` and will allow for faster data processing during training time. Pretraining with presplit data should be run with the `--presplit-sentences` flag as shown above. (Note that if you'd like to use wikipedia data for GPT2 training you should still clean it with nltk/spacy/ftfy, but do not split it into newline-separated sentences.)
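A rough sketch of the kind of preprocessing the provided script performs (the file names here are hypothetical; see the script itself for its exact behavior):
```
import json

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

# Split each document's text into newline-separated sentences while keeping
# the loose-json (one json per line) layout expected by the codebase.
with open('wiki.json') as fin, open('wiki_presplit.json', 'w') as fout:
    for line in fin:
        doc = json.loads(line)
        doc['text'] = '\n'.join(sent_tokenize(doc['text']))
        fout.write(json.dumps(doc) + '\n')
```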
Once the json dataset is ready make sure to set the path in line 27 of `data_utils/corpora.py`.
If your system is memory limited, we also recommend running pretraining with the `--lazy-loader` argument as we've done. After preprocessing the dataset once, this will allow the dataset to be lazily loaded from disk, as opposed to storing it in memory. Make sure to run the code once on a single GPU first so that the lazy-loader files are created before launching multi-GPU or distributed training.
## Collecting GPT2 Webtext Data
We utilize the publicly available [OpenWebText](https://github.com/eukaryote31/openwebtext) library from [jcpeterson](https://github.com/jcpeterson/openwebtext) and [eukaryote31's](https://github.com/eukaryote31/openwebtext) work to download urls. We then filtered, cleaned, and deduplicated all downloaded content according to the procedure described in our [openwebtext](./openwebtext) directory. For reddit URLs corresponding to content up to October 2018, we arrived at approximately 37GB of content.
We recommend creating an alias for this dataset as described below.
## Aliasing datasets with corpora.py
As mentioned in the previous Wikipedia data section, we recommend aliasing datasets with human-readable names (e.g. `--train-data wikipedia`). This helps avoid forgetting arguments when submitting jobs, and allows one to combine datasets that would otherwise require different command-line options/data structures.
Examples of how to create these dataset objects can be found in [`./data_utils/corpora.py`](./data_utils/corpora.py). We recommend that the objects inherit from or adhere to the interface laid out by `torch.utils.data.Dataset` objects.
Any created datasets should then be added to the `NAMED_CORPORA` dictionary object in [`./data_utils/corpora.py`](./data_utils/corpora.py). At runtime one can specify one or more corpora from the commandline with `--train-data corpus1 corpus2 corpus3`, `--valid-data corpus1 corpus2 corpus3`, or `--test-data ...`.
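For example, a hypothetical corpus could be registered alongside the existing `wikipedia`/`webtext` entries as follows (the class name and `PATH` below are made up for illustration):
```
# In data_utils/corpora.py, where json_dataset and NAMED_CORPORA are defined.
class my_corpus(json_dataset):
    # usage: --train-data my_corpus
    PATH = 'data/my_corpus/data.json'

    def __init__(self, **kwargs):
        kwargs['text_key'] = 'text'
        kwargs['loose_json'] = True
        super(my_corpus, self).__init__(my_corpus.PATH, **kwargs)

NAMED_CORPORA['my_corpus'] = my_corpus
```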
## Partitioning datasets into Train/Val/Test
We support multiple ways to partition corpora into train/val/test splits. By specifying a `--split 95,5` commandline argument, the corpora specified by `--train-data` will have their documents split proportionally into a 95%, 5% train/val split. The split is performed lazily on the fly and is efficient and deterministic from run to run given the same `--seed`. Note that if `--valid-data` or `--test-data` is specified, the train data will still be split accordingly, but `--valid-data`/`--test-data` will still be used as the validation/test source.
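For illustration, the comma-separated split values are treated as relative proportions (a hypothetical helper, not the repository's exact code):
```
def split_proportions(split_str):
    # "95,5" -> [0.95, 0.05]; "949,50,1" -> [0.949, 0.05, 0.001]
    parts = [float(s) for s in split_str.split(',')]
    total = sum(parts)
    return [p / total for p in parts]
```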
We do realize that this method, while effective, introduces noise into the development process, since different seeds will change the dataset and outcome. To have fixed training/validation/test sets across all your runs, please utilize our script [`./scripts/split_json.py`](./scripts/split_json.py).
......@@ -41,9 +41,9 @@ def add_model_config_args(parser):
'set to 4*`--hidden-size` if it is None')
group.add_argument('--num-layers', type=int, default=24,
help='num decoder layers')
group.add_argument('--layernorm-epsilon', type=float, default=1e-12,
group.add_argument('--layernorm-epsilon', type=float, default=1e-5,
help='layer norm epsilon')
group.add_argument('--hidden-dropout', type=float, default=0.0,
group.add_argument('--hidden-dropout', type=float, default=0.1,
help='dropout probability for hidden state transformer')
group.add_argument('--max-position-embeddings', type=int, default=512,
help='maximum number of position embeddings to use')
......@@ -51,6 +51,14 @@ def add_model_config_args(parser):
help='vocab size to use for non-character-level '
'tokenization. This value will only be used when '
'creating a tokenizer')
group.add_argument('--deep-init', action='store_true',
help='initialize bert model similar to gpt2 model. '
'scales initialization of projection layers by a '
'factor of 1/sqrt(2N). Necessary to train bert '
'models larger than BERT-Large.')
group.add_argument('--make-vocab-size-divisible-by', type=int, default=128,
help='Pad the vocab size to be divisible by this value. '
'This is added for computational efficiency reasons.')
return parser
......@@ -96,16 +104,26 @@ def add_training_args(parser):
group.add_argument('--checkpoint-activations', action='store_true',
help='checkpoint activation to allow for training '
'with larger models and sequences')
group.add_argument('--checkpoint-num-layers', type=int, default=1,
help='chunk size (number of layers) for checkpointing')
group.add_argument('--clip-grad', type=float, default=1.0,
help='gradient clipping')
group.add_argument('--epochs', type=int, default=1,
help='upper epoch limit')
group.add_argument('--train-iters', type=int, default=1000000,
help='total number of iterations to train over all training runs')
group.add_argument('--log-interval', type=int, default=100,
help='report interval')
group.add_argument('--train-iters', type=int, default=1000000,
help='number of iterations per epoch')
group.add_argument('--exit-interval', type=int, default=None,
help='Exit the program after this many new iterations.')
group.add_argument('--seed', type=int, default=1234,
help='random seed')
# Batch producer arguments
group.add_argument('--reset-position-ids', action='store_true',
help='Reset position ids after end-of-document token.')
group.add_argument('--reset-attention-mask', action='store_true',
help='Reset self attention mask after '
'end-of-document token.')
# Learning rate.
group.add_argument('--lr-decay-iters', type=int, default=None,
help='number of iterations to decay LR over,'
......@@ -121,28 +139,22 @@ def add_training_args(parser):
# model checkpointing
group.add_argument('--save', type=str, default=None,
help='Output directory to save checkpoints to.')
group.add_argument('--save-iters', type=int, default=None,
help='Save every so often iterations.')
group.add_argument('--save-optim', action='store_true',
help='Save current optimizer.')
group.add_argument('--save-rng', action='store_true',
help='Save current rng state.')
group.add_argument('--save-all-rng', action='store_true',
help='Save current rng state of each rank in '
'distributed training.')
group.add_argument('--save-interval', type=int, default=5000,
help='number of iterations between saves')
group.add_argument('--no-save-optim', action='store_true',
help='Do not save current optimizer.')
group.add_argument('--no-save-rng', action='store_true',
help='Do not save current rng state.')
group.add_argument('--load', type=str, default=None,
help='Path to a particular model checkpoint. \
(ex. `savedir/model.1000.pt`)')
group.add_argument('--load-optim', action='store_true',
help='Load most recent optimizer corresponding '
'to `--load`.')
group.add_argument('--load-rng', action='store_true',
help='Load most recent rng state corresponding '
'to `--load`.')
group.add_argument('--load-all-rng', action='store_true',
help='Load most recent rng state of each rank in '
'distributed training corresponding to `--load`('
'complementary to `--save-all-rng`).')
help='Path to a directory containing a model checkpoint.')
group.add_argument('--no-load-optim', action='store_true',
help='Do not load optimizer when loading checkpoint.')
group.add_argument('--no-load-rng', action='store_true',
help='Do not load rng state when loading checkpoint.')
group.add_argument('--finetune', action='store_true',
help='Load model for finetuning. Do not load optimizer '
'or rng state from checkpoint and set iteration to 0. '
'Assumed when loading a release checkpoint.')
group.add_argument('--resume-dataloader', action='store_true',
help='Resume the dataloader when resuming training. '
'Does not apply to tfrecords dataloader, try resuming'
......@@ -165,9 +177,11 @@ def add_evaluation_args(parser):
group.add_argument('--eval-batch-size', type=int, default=None,
help='Data Loader batch size for evaluation datasets. '
'Defaults to `--batch-size`')
group.add_argument('--eval-iters', type=int, default=2000,
help='number of iterations per epoch to run '
group.add_argument('--eval-iters', type=int, default=100,
help='number of iterations to run for evaluation'
'validation/test for')
group.add_argument('--eval-interval', type=int, default=1000,
help='interval between running evaluation on validation set')
group.add_argument('--eval-seq-length', type=int, default=None,
help='Maximum sequence length to process for '
'evaluation. Defaults to `--seq-length`')
......@@ -175,21 +189,57 @@ def add_evaluation_args(parser):
help='Maximum number of predictions to use for '
'evaluation. Defaults to '
'math.ceil(`--eval-seq-length`*.15/10)*10')
group.add_argument('--overlapping-eval', type=int, default=32,
help='sliding window for overlapping eval ')
group.add_argument('--cloze-eval', action='store_true',
help='Evaluation dataset from `--valid-data` is a cloze task')
group.add_argument('--eval-hf', action='store_true',
help='perform evaluation with huggingface openai model. '
'use `--load` to specify weights path to be loaded')
group.add_argument('--load-openai', action='store_true',
help='load openai weights into our model. Use `--load` '
'to specify weights path to be loaded')
return parser
def add_text_generate_args(parser):
"""Text generate arguments."""
group = parser.add_argument_group('Text generation', 'configurations')
group.add_argument("--temperature", type=float, default=1.0)
group.add_argument("--top_p", type=float, default=0.0)
group.add_argument("--top_k", type=int, default=0)
group.add_argument("--out-seq-length", type=int, default=256)
return parser
def add_data_args(parser):
"""Train/valid/test data arguments."""
group = parser.add_argument_group('data', 'data configurations')
group.add_argument('--model-parallel-size', type=int, default=1,
help='size of the model parallel.')
group.add_argument('--shuffle', action='store_true',
help='Shuffle data. Shuffling is deterministic '
'based on seed and current epoch.')
group.add_argument('--train-data', nargs='+', required=True,
help='Filename (or whitespace separated filenames) '
group.add_argument('--train-data', nargs='+', default=None,
help='Whitespace separated filenames or corpora names '
'for training.')
group.add_argument('--use-npy-data-loader', action='store_true',
help='Use the numpy data loader. If set, then '
'train-data-path, val-data-path, and test-data-path '
'should also be provided.')
group.add_argument('--train-data-path', type=str, default='',
help='path to the training data')
group.add_argument('--val-data-path', type=str, default='',
help='path to the validation data')
group.add_argument('--test-data-path', type=str, default='',
help='path to the test data')
group.add_argument('--input-data-sizes-file', type=str, default='sizes.txt',
help='the filename containing all the shards sizes')
group.add_argument('--delim', default=',',
help='delimiter used to parse csv data files')
group.add_argument('--text-key', default='sentence',
......@@ -229,7 +279,8 @@ def add_data_args(parser):
default='BertWordPieceTokenizer',
choices=['CharacterLevelTokenizer',
'SentencePieceTokenizer',
'BertWordPieceTokenizer'],
'BertWordPieceTokenizer',
'GPT2BPETokenizer'],
help='what type of tokenizer to use')
group.add_argument("--cache-dir", default=None, type=str,
help="Where to store pre-trained BERT downloads")
......@@ -247,15 +298,6 @@ def add_data_args(parser):
return parser
def print_args(args):
"""Print arguments."""
print('arguments:', flush=True)
for arg in vars(args):
dots = '.' * (29 - len(arg))
print(' {} {} {}'.format(arg, dots, getattr(args, arg)), flush=True)
def get_args():
"""Parse all the args."""
......@@ -264,18 +306,42 @@ def get_args():
parser = add_fp16_config_args(parser)
parser = add_training_args(parser)
parser = add_evaluation_args(parser)
parser = add_text_generate_args(parser)
parser = add_data_args(parser)
args = parser.parse_args()
if not args.train_data and not args.train_data_path:
print('WARNING: No training data specified')
args.cuda = torch.cuda.is_available()
args.rank = int(os.getenv('RANK', '0'))
args.world_size = int(os.getenv("WORLD_SIZE", '1'))
if os.getenv('OMPI_COMM_WORLD_LOCAL_RANK'):
# We are using (OpenMPI) mpirun for launching distributed data parallel processes
local_rank = int(os.getenv('OMPI_COMM_WORLD_LOCAL_RANK'))
local_size = int(os.getenv('OMPI_COMM_WORLD_LOCAL_SIZE'))
# Possibly running with Slurm
num_nodes = int(os.getenv('SLURM_JOB_NUM_NODES', '1'))
nodeid = int(os.getenv('SLURM_NODEID', '0'))
args.local_rank = local_rank
args.rank = nodeid*local_size + local_rank
args.world_size = num_nodes*local_size
args.model_parallel_size = min(args.model_parallel_size, args.world_size)
if args.rank == 0:
print('using world size: {} and model-parallel size: {} '.format(
args.world_size, args.model_parallel_size))
args.dynamic_loss_scale = False
if args.loss_scale is None:
args.dynamic_loss_scale = True
print(' > using dynamic loss scaling')
if args.rank == 0:
print(' > using dynamic loss scaling')
# The args fp32_* or fp16_* meant to be active when the
# args fp16 is set. So the default behaviour should all
......@@ -285,5 +351,4 @@ def get_args():
args.fp32_tokentypes = False
args.fp32_layernorm = False
print_args(args)
return args
......@@ -19,6 +19,7 @@ import copy
import torch
import data_utils
import mpu
class DataConfig:
......@@ -27,7 +28,8 @@ class DataConfig:
self.defaults = defaults
def apply(self, args):
print('configuring data')
if torch.distributed.get_rank() == 0:
print('configuring data')
self.apply_defaults(args)
return make_loaders(args)
......@@ -49,8 +51,9 @@ def make_data_loader(dataset, batch_size, args):
sampler = data_utils.samplers.RandomSampler(dataset, replacement=True, num_samples=batch_size*args.train_iters)
else:
sampler = torch.utils.data.SequentialSampler(dataset)
world_size = args.world_size
rank = args.rank
world_size = torch.distributed.get_world_size(
group=mpu.get_data_parallel_group())
rank = torch.distributed.get_rank(group=mpu.get_data_parallel_group())
distributed = world_size > 1
drop_last = distributed
......@@ -76,7 +79,7 @@ def make_data_loader(dataset, batch_size, args):
def make_tfrecord_loaders(args):
"""Load train/val/test dataset from shuffled TFRecords"""
import data_utils.tf_dl
import data_utils.tf_dl
data_set_args = {'batch_size': args.batch_size,
'max_seq_len': args.seq_length,
'max_preds_per_seq': args.max_preds_per_seq,
......@@ -115,16 +118,18 @@ def make_loaders(args):
if args.use_tfrecords:
return make_tfrecord_loaders(args)
batch_size = args.batch_size * args.world_size
world_size = torch.distributed.get_world_size(
group=mpu.get_data_parallel_group())
batch_size = args.batch_size * world_size
eval_batch_size = batch_size
if args.eval_batch_size is not None:
eval_batch_size = args.eval_batch_size * args.world_size
eval_batch_size = args.eval_batch_size * world_size
seq_length = args.seq_length
if seq_length < 0:
seq_length = seq_length * args.world_size
seq_length = seq_length * world_size
eval_seq_length = args.eval_seq_length
if eval_seq_length is not None and eval_seq_length < 0:
eval_seq_length = eval_seq_length * args.world_size
eval_seq_length = eval_seq_length * world_size
split = get_split(args)
data_set_args = {
'path': args.train_data,
......@@ -165,24 +170,34 @@ def make_loaders(args):
train, tokenizer = data_utils.make_dataset(**data_set_args)
if data_utils.should_split(split):
train, valid, test = train
eval_set_args['tokenizer'] = tokenizer
eval_set_args['tokenizer'] = tokenizer
# make training and val dataset if necessary
if valid is None and args.valid_data is not None:
eval_set_args['path'] = args.valid_data
valid, _ = data_utils.make_dataset(**eval_set_args)
valid, tokenizer = data_utils.make_dataset(**eval_set_args)
eval_set_args['tokenizer'] = tokenizer
if test is None and args.test_data is not None:
eval_set_args['path'] = args.test_data
test, _ = data_utils.make_dataset(**eval_set_args)
test, tokenizer = data_utils.make_dataset(**eval_set_args)
# wrap datasets with data loader
if train is not None and args.batch_size > 0:
train = make_data_loader(train, batch_size, args)
args.do_train = True
else:
args.do_train = False
eval_batch_size = eval_batch_size if eval_batch_size != 0 else batch_size
if valid is not None:
valid = make_data_loader(valid, eval_batch_size, args)
args.do_valid = True
else:
args.do_valid = False
if test is not None:
test = make_data_loader(test, eval_batch_size, args)
args.do_test = True
else:
args.do_test = False
return (train, valid, test), tokenizer
......
......@@ -17,9 +17,9 @@ import os
import math
from .samplers import DistributedBatchSampler
from .datasets import json_dataset, csv_dataset, split_ds, ConcatDataset, SplitDataset, bert_sentencepair_dataset
from .datasets import json_dataset, csv_dataset, split_ds, ConcatDataset, SplitDataset, bert_sentencepair_dataset, GPT2Dataset
from .lazy_loader import exists_lazy, make_lazy, lazy_array_loader
from .tokenization import Tokenization, CommandToken, Tokenizer, CharacterLevelTokenizer, BertWordPieceTokenizer, make_tokenizer
from .tokenization import Tokenization, CommandToken, Tokenizer, CharacterLevelTokenizer, BertWordPieceTokenizer, GPT2BPETokenizer, make_tokenizer
from . import corpora
TRAIN_DATA = 0
......@@ -109,9 +109,13 @@ def make_dataset(path, seq_length, text_key, label_key, lazy=False, process_fn=N
ds = split_ds(ds, split)
if ds_type.lower() == 'bert':
presplit_sentences = kwargs['presplit_sentences'] if 'presplit_sentences' in kwargs else False
ds = [bert_sentencepair_dataset(d, max_seq_len=seq_length, presplit_sentences=presplit_sentences) for d in ds]
ds = [bert_sentencepair_dataset(d, max_seq_len=seq_length, presplit_sentences=presplit_sentences) if d is not None else None for d in ds]
elif ds_type.lower() == 'gpt2':
ds = [GPT2Dataset(d, max_seq_len=seq_length) if d is not None else None for d in ds]
else:
if ds_type.lower() == 'bert':
presplit_sentences = kwargs['presplit_sentences'] if 'presplit_sentences' in kwargs else False
ds = bert_sentencepair_dataset(ds, max_seq_len=seq_length, presplit_sentences=presplit_sentences)
elif ds_type.lower() == 'gpt2':
ds = GPT2Dataset(ds, max_seq_len=seq_length)
return ds, tokenizer
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""several datasets with preset arguments"""
from .datasets import json_dataset, csv_dataset
class wikipedia(json_dataset):
"""
dataset for wikipedia with arguments configured for convenience
command line usage: `--train-data wikipedia`
"""
PATH = '<wikipedia_path>'
assert_str = "make sure to set PATH at line 27 of data_utils/corpora.py"
def __init__(self, **kwargs):
assert wikipedia.PATH != '<wikipedia_path>', \
wikipedia.assert_str
if not kwargs:
kwargs = {}
kwargs['text_key'] = 'text'
kwargs['loose_json'] = True
super(wikipedia, self).__init__(wikipedia.PATH, **kwargs)
NAMED_CORPORA = {
'wikipedia': wikipedia,
}
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""several datasets with preset arguments"""
from .datasets import json_dataset, csv_dataset
import os
class wikipedia(json_dataset):
"""
dataset for wikipedia with arguments configured for convenience
command line usage: `--train-data wikipedia`
"""
PATH = 'data/wikipedia/wikidump_lines.json'
assert_str = "make sure to set PATH for wikipedia in data_utils/corpora.py"
def __init__(self, **kwargs):
assert os.path.exists(wikipedia.PATH), \
wikipedia.assert_str
if not kwargs:
kwargs = {}
kwargs['text_key'] = 'text'
kwargs['loose_json'] = True
super(wikipedia, self).__init__(wikipedia.PATH, **kwargs)
class webtext(json_dataset):
"""
dataset for webtext with arguments configured for convenience
command line usage: `--train-data webtext`
"""
PATH = 'data/webtext/data.json'
assert_str = "make sure to set PATH for webtext in data_utils/corpora.py"
def __init__(self, **kwargs):
assert os.path.exists(webtext.PATH), \
webtext.assert_str
if not kwargs:
kwargs = {}
kwargs['text_key'] = 'text'
kwargs['loose_json'] = True
super(webtext, self).__init__(webtext.PATH, **kwargs)
NAMED_CORPORA = {
'wikipedia': wikipedia,
'webtext': webtext,
}
......@@ -22,13 +22,13 @@ import json
import csv
import math
import random
from itertools import accumulate
from torch.utils import data
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
from nltk import tokenize
from .lazy_loader import lazy_array_loader, exists_lazy, make_lazy
......@@ -57,9 +57,11 @@ class ConcatDataset(data.Dataset):
super(ConcatDataset, self).__init__()
assert len(datasets) > 0, 'datasets should not be an empty iterable'
self.datasets = list(datasets)
self.is_lazy = sum([isinstance(ds, lazy_array_loader) for ds in self.datasets]) == len(self.datasets)
self.cumulative_sizes = self.cumsum(self.datasets)
self._X = None
self._Y = None
self._lens = None
def SetTokenizer(self, tokenizer):
for ds in self.datasets:
......@@ -79,6 +81,18 @@ class ConcatDataset(data.Dataset):
sample_idx = idx - self.cumulative_sizes[dataset_idx - 1]
return self.datasets[dataset_idx][sample_idx]
@property
def lens(self):
if self._lens is None:
self._lens = []
if self.is_lazy:
for data in self.datasets:
self._lens.extend(data.lens)
else:
for data in self.datasets:
self._lens.extend([len(d['text']) if isinstance(d, dict) else len(d) for d in data])
return self._lens
@property
def X(self):
if self._X is None:
......@@ -115,7 +129,7 @@ class SplitDataset(data.Dataset):
def __init__(self, ds, split_inds, **kwargs):
self.split_inds = list(split_inds)
self.wrapped_data = ds
self.is_lazy = isinstance(ds, lazy_array_loader)
self.is_lazy = isinstance(ds, lazy_array_loader) or (hasattr(ds, 'is_lazy') and ds.is_lazy)
if self.is_lazy:
self.lens = itemgetter(*self.split_inds)(list(self.wrapped_data.lens))
self._X = None
......@@ -203,6 +217,7 @@ class csv_dataset(data.Dataset):
def __init__(self, path, tokenizer=None, preprocess_fn=None, delim=',',
binarize_sent=False, drop_unlabeled=False, text_key='sentence', label_key='label',
**kwargs):
self.is_lazy = False
self.preprocess_fn = preprocess_fn
self.SetTokenizer(tokenizer)
self.path = path
......@@ -314,6 +329,7 @@ class json_dataset(data.Dataset):
"""
def __init__(self, path, tokenizer=None, preprocess_fn=None, binarize_sent=False,
text_key='sentence', label_key='label', loose_json=False, **kwargs):
self.is_lazy = False
self.preprocess_fn = preprocess_fn
self.path = path
self.SetTokenizer(tokenizer)
......@@ -437,6 +453,117 @@ class json_dataset(data.Dataset):
j[self.label_key] = -1
yield j
class GPT2Dataset(data.Dataset):
def __init__(self, ds,
max_seq_len=1024,
num_samples=None,
weighted=True,
sample_across_doc=True,
random_across_doc_sampling=True,
sentence_start=False, **kwargs):
self.ds = ds
self.ds_len = len(self.ds)
self.num_samples = num_samples
if num_samples is None:
self.num_samples = 1000 * self.ds_len
self.max_seq_len = max_seq_len
self.tokenizer = self.ds.GetTokenizer()
self.ds.SetTokenizer(None)
self.weighted = weighted
self.sample_across_doc = sample_across_doc
self.random_across_doc_sampling = random_across_doc_sampling
self.sentence_start = sentence_start
self.init_weighting()
def init_weighting(self):
if self.weighted:
if hasattr(self.ds, 'is_lazy') and self.ds.is_lazy:
lens = np.array(self.ds.lens)
else:
lens = np.array([len(d['text']) if isinstance(d, dict)
else len(d) for d in self.ds])
self.total_len = np.sum(lens)
self.weighting = list(accumulate(lens))
else:
self.weighting = None
def get_weighted_samples(self, np_rng):
if self.weighting is not None:
idx = np_rng.randint(self.total_len)
return bisect_right(self.weighting, idx)
else:
return np_rng.randint(self.ds_len)
def __len__(self):
return self.num_samples
def __getitem__(self, idx):
# init rng
rng = random.Random(idx)
rng = np.random.RandomState(seed=[rng.randint(0, 2**32-1) for _ in range(16)])
# get possibly weighted random index from dataset
data_idx = self.get_weighted_samples(rng)
# data_idx = rng.choice(self.ds_len, p=self.weighting)
tokens = self.getidx(data_idx)
# truncate or pad tokens
num_tokens = len(tokens)
tokens_to_strip = num_tokens - self.max_seq_len - 1
if tokens_to_strip > 0:
strip_left_tokens = rng.randint(tokens_to_strip + 1)
tokens = tokens[strip_left_tokens:]
if self.sentence_start:
token_copy = list(tokens)
not_done = True
while (len(token_copy) > 0) and not_done:
tok = token_copy.pop(0)
if self.contains_sentence_end(tok):
tokens = token_copy
not_done = False
strip_right_tokens = len(tokens) - self.max_seq_len - 1
if strip_right_tokens > 0:
tokens = tokens[:-strip_right_tokens]
if self.sample_across_doc:
while (len(tokens) < (self.max_seq_len + 1)):
if self.random_across_doc_sampling:
data_idx = self.get_weighted_samples(rng)
else:
data_idx = (data_idx + 1) % self.ds_len
tokens += self.getidx(data_idx)
tokens = tokens[:(self.max_seq_len+1)]
tokens = self.pad_seq(tokens)
return {'text': np.array(tokens),}
def getidx(self, data_idx):
data = self.ds[data_idx]
if isinstance(data, dict):
data = data['text']
# tokenize
tokenization = self.tokenizer.EncodeAsIds(data)
tokenization.append(self.tokenizer.get_command('eos'))
tokens = tokenization.tokenization
return tokens
def pad_seq(self, seq):
total_tokens = self.max_seq_len + 1
num_pad_tokens = max(0, total_tokens - len(seq))
seq += [self.tokenizer.get_command('pad').Id]*(num_pad_tokens)
return seq
def contains_sentence_end(self, tok):
tok = self.tokenizer.IdToToken(tok)
if '.' in tok:
return True
if '?' in tok:
return True
if '!' in tok:
return True
return False
class bert_sentencepair_dataset(data.Dataset):
"""
Dataset containing sentencepairs for BERT training. Each index corresponds to a randomly generated sentence pair.
......@@ -449,7 +576,7 @@ class bert_sentencepair_dataset(data.Dataset):
dataset_size (int): number of random sentencepairs in the dataset. Default: len(ds)*(len(ds)-1)
"""
def __init__(self, ds, max_seq_len=512, mask_lm_prob=.15, max_preds_per_seq=None, short_seq_prob=.01, dataset_size=None, presplit_sentences=False, **kwargs):
def __init__(self, ds, max_seq_len=512, mask_lm_prob=.15, max_preds_per_seq=None, short_seq_prob=.01, dataset_size=None, presplit_sentences=False, weighted=True,**kwargs):
self.ds = ds
self.ds_len = len(self.ds)
self.tokenizer = self.ds.GetTokenizer()
......@@ -465,6 +592,28 @@ class bert_sentencepair_dataset(data.Dataset):
if self.dataset_size is None:
self.dataset_size = self.ds_len * (self.ds_len-1)
self.presplit_sentences = presplit_sentences
if not self.presplit_sentences:
nltk.download('punkt', download_dir="./nltk")
self.weighted = weighted
self.get_weighting()
def get_weighting(self):
if self.weighted:
if hasattr(self.ds, 'is_lazy') and self.ds.is_lazy:
lens = np.array(self.ds.lens)
else:
lens = np.array([len(d['text']) if isinstance(d, dict) else len(d) for d in self.ds])
self.total_len = np.sum(lens)
self.weighting = list(accumulate(lens))
else:
self.weighting = None
def get_weighted_samples(self, np_rng):
if self.weighting is not None:
idx = np_rng.randint(self.total_len)
return bisect_right(self.weighting, idx)
else:
return np_rng.randint(self.ds_len)
def __len__(self):
return self.dataset_size
......@@ -472,20 +621,23 @@ class bert_sentencepair_dataset(data.Dataset):
def __getitem__(self, idx):
# get rng state corresponding to index (allows deterministic random pair)
rng = random.Random(idx)
np_rng = np.random.RandomState(seed=[rng.randint(0, 2**32-1) for _ in range(16)])
# get seq length
target_seq_length = self.max_seq_len
short_seq = False
if rng.random() < self.short_seq_prob:
target_seq_length = rng.randint(2, target_seq_length)
short_seq = True
# get sentence pair and label
is_random_next = None
lena = 0
lenb = 0
while (is_random_next is None) or (lena < 1) or (lenb < 1):
tokensa, tokensb, is_random_next = self.create_random_sentencepair(target_seq_length, rng)
tokensa, tokensb, is_random_next = self.create_random_sentencepair(target_seq_length, rng, np_rng)
lena = len(tokensa[0])
lenb = len(tokensb[0])
# truncate sentence pair to max_seq_len
tokensa, tokensb = self.truncate_seq_pair(tokensa, tokensb, self.max_seq_len, rng)
# join sentence pair, mask, and pad
......@@ -518,7 +670,7 @@ class bert_sentencepair_dataset(data.Dataset):
rtn = rtn['text']
return rtn
def create_random_sentencepair(self, target_seq_length, rng):
def create_random_sentencepair(self, target_seq_length, rng, np_rng):
"""
fetches a random sentencepair corresponding to rng state similar to
https://github.com/google-research/bert/blob/master/create_pretraining_data.py#L248-L294
......@@ -533,7 +685,11 @@ class bert_sentencepair_dataset(data.Dataset):
curr_len = 0
doc_a = None
while doc_a is None:
doc_a_idx = rng.randint(0, self.ds_len-1)
if self.weighted:
# doc_a_idx = np_rng.choice(self.ds_len, p=self.weighting)
doc_a_idx = self.get_weighted_samples(np_rng)
else:
doc_a_idx = rng.randint(0, self.ds_len-1)
doc_a = self.sentence_split(self.get_doc(doc_a_idx))
if not doc_a:
doc_a = None
......
......@@ -18,9 +18,9 @@ import mmap
import pickle as pkl
import time
from itertools import accumulate
from threading import Lock
import torch
from torch.multiprocessing import Lock
def get_lazy_path(path):
"""
......@@ -187,7 +187,7 @@ class lazy_array_loader(object):
self.read_lock.release()
#TODO: @raulp figure out mem map byte string bug
#if mem map'd need to decode byte string to string
rtn = rtn.decode('utf-8')
rtn = rtn.decode('utf-8', 'ignore')
# rtn = str(rtn)
if self.mem_map:
rtn = rtn.decode('unicode_escape')
......
......@@ -81,6 +81,7 @@ class DistributedBatchSampler(data.sampler.BatchSampler):
def __init__(self, sampler, batch_size, drop_last, rank=-1, world_size=2, wrap_last=False):
super(DistributedBatchSampler, self).__init__(sampler, batch_size, drop_last)
if rank == -1:
assert False, 'should not be here'
rank = torch.distributed.get_rank()
self.rank = rank
self.world_size = world_size
......@@ -135,4 +136,4 @@ class DistributedBatchSampler(data.sampler.BatchSampler):
"""extracts samples only pertaining to this worker's batch"""
start = self.rank*self.batch_size//self.world_size
end = (self.rank+1)*self.batch_size//self.world_size
return batch[start:end]
\ No newline at end of file
return batch[start:end]
......@@ -17,14 +17,17 @@ from collections import namedtuple
import random
import os
import csv
import torch
import nltk
nltk.download('punkt')
from nltk import tokenize as nltk_tokenize
import sentencepiece as spm
from .wordpiece import BertTokenizer, PRETRAINED_VOCAB_ARCHIVE_MAP
from .tokenization_gpt2 import GPT2Tokenizer
import regex as re
def make_tokenizer(tokenizer_type, corpus, model_path=None, vocab_size=None, model_type='bpe', pad_token=0, character_coverage=1.0, command_tokens=None, type_tokens=None, **kwargs):
"""
Helper function to instantiate a tokenizer given common combinations of options.
......@@ -34,6 +37,8 @@ def make_tokenizer(tokenizer_type, corpus, model_path=None, vocab_size=None, mod
tokenizer_class = eval(tokenizer_class)
if tokenizer_class is BertWordPieceTokenizer:
return BertWordPieceTokenizer(model_type, **kwargs)
elif tokenizer_class is GPT2BPETokenizer:
return GPT2BPETokenizer(**kwargs)
text_tokenizer = tokenizer_class(corpus=corpus, vocab_size=vocab_size, model_path=model_path, model_type=model_type,
pad_token=pad_token, character_coverage=character_coverage)
return Tokenizer(text_tokenizer, command_tokens, type_tokens)
......@@ -84,11 +89,11 @@ class Tokenization(object):
if isinstance(other, (CommandToken, TypeToken)):
self.tokenization.insert(idx, other.Id)
if idx == 0:
self.text.insert(0, other.token)
self.original_text.insert(0, other.token)
self.text = other.token + self.text
self.original_text = other.token + self.original_text
elif idx == len(self.tokenization)-1:
self.text.insert(-1, other.token)
self.original_text.insert(-1, other.token)
self.text += other.token
self.original_text += other.token
elif isinstance(other, Tokenization):
self.tokenization = self.tokenization[:idx] + other.tokenization + self.tokenization[idx:]
else:
......@@ -97,8 +102,8 @@ class Tokenization(object):
def append(self, other):
if isinstance(other, (CommandToken, TypeToken)):
self.tokenization.append(other.Id)
self.text.append(other.token)
self.original_text.append(other.token)
self.text += other.token
self.original_text += other.token
elif isinstance(other, Tokenization):
self.tokenization.extend(other.tokenization)
self.text += other.text
......@@ -110,8 +115,8 @@ class Tokenization(object):
def extend(self, other):
if isinstance(other, (CommandToken, TypeToken)):
self.tokenization.append(other.Id)
self.text.append(other.token)
self.original_text.append(other.token)
self.text += other.token
self.original_text += other.token
elif isinstance(other, list) and isinstance(other[0], (CommandToken, TypeToken)):
self.tokenization.extend([o.Id for o in other])
self.text += [o.token for o in other]
......@@ -522,6 +527,7 @@ def get_corpus_freq(dataset, filepath, filetype='tsv'):
Write frequencies to `filepath` as a tsv. Only write the first
MAX_SENTENCEPIECE_SENTENCES most common words to the file.
"""
nltk.download('punkt', download_dir="./nltk")
if filetype == 'tsv':
delimiter = '\t'
else:
......@@ -687,10 +693,12 @@ class BertWordPieceTokenizer(Tokenizer):
# default to bert-large-uncased tokenizer
if tokenizer_model_type not in PRETRAINED_VOCAB_ARCHIVE_MAP:
tokenizer_model_type = 'bert-large-uncased'
print('loading BertWordPieceTokenizer (', tokenizer_model_type, ') from cache_dir ', cache_dir)
if torch.distributed.get_rank() == 0:
print('loading BertWordPieceTokenizer (', tokenizer_model_type, ') from cache_dir ', cache_dir)
do_lower_case = not ('-cased' in tokenizer_model_type or 'chinese' in tokenizer_model_type)
self.text_tokenizer = BertTokenizer.from_pretrained(tokenizer_model_type, do_lower_case=do_lower_case, cache_dir=cache_dir)
print('loaded', tokenizer_model_type)
if torch.distributed.get_rank() == 0:
print('loaded', tokenizer_model_type)
# disable max len warnings by increasing max len
self.text_tokenizer.max_len = int(1e12)
......@@ -786,3 +794,97 @@ class BertWordPieceTokenizer(Tokenizer):
if isinstance(Tokens, Tokenization):
Tokens = Tokens.tokenization
return ' '.join(Tokens)
class GPT2BPETokenizer(Tokenizer):
def __init__(self, cache_dir=None, **kwargs):
self.text_tokenizer = GPT2Tokenizer.from_pretrained('gpt2',
cache_dir=cache_dir)
#disable max len warnings by increasing max len
self.text_tokenizer.max_len = int(1e12)
self.num_command_tokens = 2
self.num_tokens = len(self.text_tokenizer.encoder)
self.num_text_tokens = self.num_tokens-1
self.num_type_tokens = 2
self._command_tokens = [
CommandToken('pad', '<|endoftext|>', self.text_tokenizer.encoder['<|endoftext|>']),
CommandToken('eos', '<|endoftext|>', self.text_tokenizer.encoder['<|endoftext|>']),
]
self.command_name_map = {tok.name: tok for tok in self._command_tokens}
self.command_token_map = {tok.token: tok for tok in self._command_tokens}
self.command_id_map = {tok.Id: tok for tok in self._command_tokens}
self.type_tokens = [
TypeToken('str0', '<str0>', 0),
TypeToken('str1', '<str1>', 1),
]
self.type_name_map = {tok.name: tok for tok in self.type_tokens}
self.type_token_map = {tok.token: tok for tok in self.type_tokens}
self.type_id_map = {tok.Id: tok for tok in self.type_tokens}
self._tokens = list(self.text_tokenizer.encoder.keys())
self._vocab = {k:v for k,v in self.text_tokenizer.encoder.items()}
self._text_tokens = list(self._tokens)
self._text_token_vocab = {k:v for k,v in self.text_tokenizer.encoder.items()}
self._command_token_tokens = list(self.command_token_map.keys())
self._command_token_vocab = {t:Id for Id,t in self.command_id_map.items()}
self._token_types = list(self.type_token_map.keys())
self._token_type_vocab = {t:Id for Id, t in self.type_id_map.items()}
def EncodeAsIds(self, text, process_fn=None):
processed_text = text
if process_fn is not None:
processed_text = process_fn(processed_text)
Ids = self.text_tokenizer.encode(processed_text)
#return Tokenization(Ids, processed_text, text)
tokenization = Tokenization(Ids, processed_text, text)
tokenization.set_command_tokens(self._command_tokens)
return tokenization
def EncodeAsTokens(self, text, process_fn=None):
processed_text = text
if process_fn is not None:
processed_text = process_fn(processed_text)
tokens = []
for token in re.findall(self.text_tokenizer.pat, processed_text):
token = ''.join(self.text_tokenizer.byte_encoder[b] for b in token.encode('utf-8'))
tokens.extend(bpe_token for bpe_token in self.text_tokenizer.bpe(token).split(' '))
tokenization=Tokenization(tokens, processed_text, text, asIds=False)
tokenization.set_command_tokens(self._command_tokens)
return tokenization
#return Tokenization(tokens, processed_text, text, asIds=False)
def IdToToken(self, Id, type_token=False):
if isinstance(Id, (TypeToken, CommandToken)):
return Id.token
if type_token:
return self.type_id_map[Id].token
return self.text_tokenizer.decoder[Id]
def TokenToId(self, token, type_token=False):
if isinstance(token, (TypeToken, CommandToken)):
return token.Id
if type_token:
return self.type_token_map[token].Id
return self.text_tokenizer.encoder[token]
def DecodeIds(self, Ids, type_token=False):
if type_token:
return ' '.join(Id.token if isinstance(Id, TypeToken) else self.type_id_map[Id].token for Id in Ids)
if isinstance(Ids, Tokenization):
Ids = Ids.tokenization
return self.text_tokenizer.decode(Ids)
def DecodeTokens(self, Tokens, type_token=False):
if type_token:
return ' '.join(t.token if isinstance(t, TypeToken) else t for t in Tokens)
if isinstance(Tokens, Tokenization):
Tokens = Tokens.tokenization
return self.text_tokenizer.decode([self.TokenToId(tok) for tok in Tokens])
# coding=utf-8
# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for OpenAI GPT."""
from __future__ import (absolute_import, division, print_function,
unicode_literals)
import sys
import json
import logging
import os
import regex as re
from io import open
try:
from functools import lru_cache
except ImportError:
# Just a dummy decorator to get the checks to run on python2
# because honestly I don't want to support a byte-level unicode BPE tokenizer on python 2 right now.
def lru_cache():
return lambda func: func
from .file_utils import cached_path
logger = logging.getLogger(__name__)
PRETRAINED_VOCAB_ARCHIVE_MAP = {
'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json",
}
PRETRAINED_MERGES_ARCHIVE_MAP = {
'gpt2': "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt",
}
PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP = {
'gpt2': 1024,
}
VOCAB_NAME = 'vocab.json'
MERGES_NAME = 'merges.txt'
SPECIAL_TOKENS_NAME = 'special_tokens.txt'
@lru_cache()
def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
_chr = unichr if sys.version_info[0] == 2 else chr
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
cs = [_chr(n) for n in cs]
return dict(zip(bs, cs))
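# A minimal usage sketch (the helper name below is illustrative only): every raw
# UTF-8 byte is mapped to a printable unicode character before any BPE merges are
# applied, and the mapping is reversible, so decoding recovers the exact bytes.
def _bytes_to_unicode_example(text="hello world"):
    byte_encoder = bytes_to_unicode()
    byte_decoder = {v: k for k, v in byte_encoder.items()}
    mapped = ''.join(byte_encoder[b] for b in text.encode('utf-8'))
    # Round-tripping through the byte decoder recovers the original string.
    recovered = bytearray(byte_decoder[c] for c in mapped).decode('utf-8')
    assert recovered == text
    return mapped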
def get_pairs(word):
"""Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
return pairs
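# A small sketch of the output (illustrative only): for the symbol tuple
# ('h', 'e', 'l', 'l', 'o'), get_pairs returns
# {('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')}; the bpe() method below
# repeatedly merges whichever of these pairs has the lowest merge rank.
def _get_pairs_example():
    return get_pairs(('h', 'e', 'l', 'l', 'o'))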
class GPT2Tokenizer(object):
"""
GPT-2 BPE tokenizer. Peculiarities:
- Byte-level BPE
"""
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs):
"""
Instantiate a GPT2Tokenizer from a pre-trained model file.
Download and cache the pre-trained model file if needed.
"""
if pretrained_model_name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
vocab_file = PRETRAINED_VOCAB_ARCHIVE_MAP[pretrained_model_name_or_path]
merges_file = PRETRAINED_MERGES_ARCHIVE_MAP[pretrained_model_name_or_path]
special_tokens_file = None
else:
vocab_file = os.path.join(pretrained_model_name_or_path, VOCAB_NAME)
merges_file = os.path.join(pretrained_model_name_or_path, MERGES_NAME)
special_tokens_file = os.path.join(pretrained_model_name_or_path, SPECIAL_TOKENS_NAME)
if not os.path.exists(special_tokens_file):
special_tokens_file = None
else:
logger.info("loading special tokens file {}".format(special_tokens_file))
# redirect to the cache, if necessary
try:
resolved_vocab_file = cached_path(vocab_file, cache_dir=cache_dir)
resolved_merges_file = cached_path(merges_file, cache_dir=cache_dir)
except EnvironmentError:
logger.error(
"Model name '{}' was not found in model name list ({}). "
"We assumed '{}' was a path or url but couldn't find files {} and {} "
"at this path or url.".format(
pretrained_model_name_or_path,
', '.join(PRETRAINED_VOCAB_ARCHIVE_MAP.keys()),
pretrained_model_name_or_path,
vocab_file, merges_file))
return None
if resolved_vocab_file == vocab_file and resolved_merges_file == merges_file:
logger.info("loading vocabulary file {}".format(vocab_file))
logger.info("loading merges file {}".format(merges_file))
else:
logger.info("loading vocabulary file {} from cache at {}".format(
vocab_file, resolved_vocab_file))
logger.info("loading merges file {} from cache at {}".format(
merges_file, resolved_merges_file))
if pretrained_model_name_or_path in PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP:
# if we're using a pretrained model, ensure the tokenizer won't index sequences longer
# than the number of positional embeddings
max_len = PRETRAINED_VOCAB_POSITIONAL_EMBEDDINGS_SIZE_MAP[pretrained_model_name_or_path]
kwargs['max_len'] = min(kwargs.get('max_len', int(1e12)), max_len)
# Instantiate tokenizer.
if special_tokens_file and 'special_tokens' not in kwargs:
special_tokens = open(special_tokens_file, encoding='utf-8').read().split('\n')[:-1]
else:
special_tokens = kwargs.pop('special_tokens', [])
tokenizer = cls(resolved_vocab_file, resolved_merges_file, special_tokens=special_tokens, *inputs, **kwargs)
return tokenizer
def __init__(self, vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None):
self.max_len = max_len if max_len is not None else int(1e12)
self.encoder = json.load(open(vocab_file))
self.decoder = {v:k for k,v in self.encoder.items()}
self.errors = errors # how to handle errors in decoding
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
bpe_data = open(merges_file, encoding='utf-8').read().split('\n')[1:-1]
bpe_merges = [tuple(merge.split()) for merge in bpe_data]
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
self.cache = {}
# Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
self.special_tokens = {}
self.special_tokens_decoder = {}
self.set_special_tokens(special_tokens)
def __len__(self):
return len(self.encoder) + len(self.special_tokens)
def set_special_tokens(self, special_tokens):
""" Add a list of additional tokens to the encoder.
The additional tokens are indexed starting from the last index of the
current vocabulary in the order of the `special_tokens` list.
"""
if not special_tokens:
self.special_tokens = {}
self.special_tokens_decoder = {}
return
self.special_tokens = dict((tok, len(self.encoder) + i) for i, tok in enumerate(special_tokens))
self.special_tokens_decoder = {v:k for k, v in self.special_tokens.items()}
logger.info("Special tokens {}".format(self.special_tokens))
def bpe(self, token):
if token in self.cache:
return self.cache[token]
word = tuple(token)
pairs = get_pairs(word)
if not pairs:
return token
while True:
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word)-1 and word[i+1] == second:
new_word.append(first+second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = ' '.join(word)
self.cache[token] = word
return word
def tokenize(self, text):
""" Tokenize a string. """
bpe_tokens = []
for token in re.findall(self.pat, text):
if sys.version_info[0] == 2:
token = ''.join(self.byte_encoder[ord(b)] for b in token)
else:
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' '))
return bpe_tokens
def convert_tokens_to_ids(self, tokens):
""" Converts a sequence of tokens into ids using the vocab. """
ids = []
if isinstance(tokens, str) or (sys.version_info[0] == 2 and isinstance(tokens, unicode)):
if tokens in self.special_tokens:
return self.special_tokens[tokens]
else:
return self.encoder.get(tokens, 0)
for token in tokens:
if token in self.special_tokens:
ids.append(self.special_tokens[token])
else:
ids.append(self.encoder.get(token, 0))
if len(ids) > self.max_len:
logger.warning(
"Token indices sequence length is longer than the specified maximum "
" sequence length for this OpenAI GPT model ({} > {}). Running this"
" sequence through the model will result in indexing errors".format(len(ids), self.max_len)
)
return ids
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
"""Converts a sequence of ids in BPE tokens using the vocab."""
tokens = []
for i in ids:
if i in self.special_tokens_decoder:
if not skip_special_tokens:
tokens.append(self.special_tokens_decoder[i])
else:
tokens.append(self.decoder[i])
return tokens
def encode(self, text):
return self.convert_tokens_to_ids(self.tokenize(text))
def decode(self, tokens):
text = ''.join([self.decoder[token] for token in tokens])
text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
return text
def save_vocabulary(self, vocab_path):
"""Save the tokenizer vocabulary and merge files to a directory."""
if not os.path.isdir(vocab_path):
logger.error("Vocabulary path ({}) should be a directory".format(vocab_path))
return
vocab_file = os.path.join(vocab_path, VOCAB_NAME)
merge_file = os.path.join(vocab_path, MERGES_NAME)
special_tokens_file = os.path.join(vocab_path, SPECIAL_TOKENS_NAME)
with open(vocab_file, 'w', encoding='utf-8') as f:
f.write(json.dumps(self.encoder, ensure_ascii=False))
index = 0
with open(merge_file, "w", encoding="utf-8") as writer:
writer.write(u'#version: 0.2\n')
for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning("Saving vocabulary to {}: BPE merge indices are not consecutive."
" Please check that the tokenizer is not corrupted!".format(merge_file))
index = token_index
writer.write(' '.join(bpe_tokens) + u'\n')
index += 1
index = len(self.encoder)
with open(special_tokens_file, 'w', encoding='utf-8') as writer:
for token, token_index in sorted(self.special_tokens.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning("Saving special tokens vocabulary to {}: BPE indices are not consecutive."
" Please check that the tokenizer is not corrupted!".format(special_tokens_file))
index = token_index
writer.write(token + u'\n')
index += 1
return vocab_file, merge_file, special_tokens_file
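# A minimal usage sketch, assuming `vocab.json` and `merges.txt` have already been
# downloaded (the paths and helper name here are placeholders): encode a string to
# BPE ids and decode them back. With the released GPT-2 vocabulary this
# round-trips exactly because the tokenizer operates on bytes rather than on a
# closed word vocabulary.
def _gpt2_tokenizer_roundtrip_example(vocab_file='vocab.json', merges_file='merges.txt'):
    tokenizer = GPT2Tokenizer(vocab_file, merges_file)
    ids = tokenizer.encode("Megatron is a large, powerful transformer.")
    text = tokenizer.decode(ids)
    return ids, text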
import re
def ptb_detokenizer(string):
string = string.replace(" '", "'")
string = string.replace(" \n", "\n")
string = string.replace("\n ", "\n")
string = string.replace(" n't", "n't")
string = string.replace(" N ","1 ")
string = string.replace("$ 1", "$1")
string = string.replace("# 1", "#1")
return string
def wikitext_detokenizer(string):
#contractions
string = string.replace("s '", "s'")
string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string)
# number separators
string = string.replace(" @-@ ", "-")
string = string.replace(" @,@ ", ",")
string = string.replace(" @.@ ", ".")
#punctuation
string = string.replace(" : ", ": ")
string = string.replace(" ; ", "; ")
string = string.replace(" . ", ". ")
string = string.replace(" ! ", "! ")
string = string.replace(" ? ", "? ")
string = string.replace(" , ", ", ")
# double brackets
string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string)
string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string)
string = re.sub(r"{\s*([^}]*?)\s*}", r"{\1}", string)
string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string)
string = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", string)
# miscellaneous
string = string.replace("= = = =", "====")
string = string.replace("= = =", "===")
string = string.replace("= =", "==")
string = string.replace(" "+chr(176)+" ", chr(176))
string = string.replace(" \n", "\n")
string = string.replace("\n ", "\n")
string = string.replace(" N ", " 1 ")
string = string.replace(" 's", "'s")
return string
def lambada_detokenizer(string):
return string
def get_detokenizer(path):
for key in DETOKENIZERS.keys():
if key in path:
print(key)
return DETOKENIZERS[key]
DETOKENIZERS = {
'ptb': ptb_detokenizer,
'wikitext': wikitext_detokenizer,
'lambada': lambada_detokenizer,
}
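# A minimal sketch of how the lookup above is used: the evaluation data path is
# matched against the keys of DETOKENIZERS, so a (hypothetical) file such as
# 'data/wikitext-103/wiki.test.tokens' selects wikitext_detokenizer, which undoes
# the ' @.@ ' / ' @-@ ' style escaping before BPE tokenization and scoring.
def _detokenizer_example():
    detok = get_detokenizer('data/wikitext-103/wiki.test.tokens')
    # -> "it scales to 8.3 billion parameters"
    return detok("it scales to 8 @.@ 3 billion parameters")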
# ===========
# base images
# ===========
FROM nvcr.io/nvidia/pytorch:19.05-py3
# ===============
# system packages
# ===============
RUN apt-get update && apt-get install -y \
bash-completion \
emacs \
git \
graphviz \
htop \
libopenexr-dev \
rsync \
wget \
&& rm -rf /var/lib/apt/lists/*
# ============
# pip packages
# ============
RUN pip install --upgrade pip && \
pip install --upgrade setuptools
COPY requirements.txt /tmp/
RUN pip install --upgrade --ignore-installed -r /tmp/requirements.txt
# ===========
# latest apex
# ===========
RUN pip uninstall -y apex && \
git clone https://github.com/NVIDIA/apex.git ~/apex && \
cd ~/apex && \
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
Note that, as of now, you need to have PySOL cloned into this directory before building the container.
boto3
google-cloud-language
inflect
nltk
numpy
pandas
requests
sentencepiece
tensorflow
tqdm
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pretrain BERT"""
import os
import json
import math
import random
import numpy as np
import torch
from arguments import get_args
from configure_data import configure_data
from fp16 import FP16_Module
from fp16 import FP16_Optimizer
from learning_rates import AnnealingLR
from model import GPT2Model
from model import gpt2_get_params_for_weight_decay_optimization
from model import DistributedDataParallel as DDP
import mpu
from apex.optimizers import FusedAdam as Adam
from utils import Timers
from utils import save_checkpoint
from utils import save_checkpoint_model_parallel
from utils import load_checkpoint
from utils import load_checkpoint_model_parallel
from utils import report_memory
from utils import print_params_min_max_norm
from utils import print_rank_0
from data_utils import make_tokenizer
from detokenizer import *
def get_model(args):
"""Build the model."""
print_rank_0('building GPT2 model ...')
model = GPT2Model(num_layers=args.num_layers,
vocab_size=args.vocab_size,
hidden_size=args.hidden_size,
num_attention_heads=args.num_attention_heads,
embedding_dropout_prob=args.hidden_dropout,
attention_dropout_prob=args.attention_dropout,
output_dropout_prob=args.hidden_dropout,
max_sequence_length=args.max_position_embeddings,
checkpoint_activations=args.checkpoint_activations,
checkpoint_num_layers=args.checkpoint_num_layers,
parallel_output=not args.cloze_eval)
print_rank_0(' > number of parameters: {}'.format(
sum([p.nelement() for p in model.parameters()])))
# GPU allocation.
model.cuda(torch.cuda.current_device())
# Fp16 conversion.
if args.fp16:
model = FP16_Module(model)
# Wrap model for distributed training.
model = DDP(model)
return model
def setup_model(args):
"""Setup model and optimizer."""
model = get_model(args)
if args.load is not None:
_ = load_checkpoint_model_parallel(
model, None, None, args)
return model
def get_masks_and_position_ids(data,
eod_token,
reset_position_ids,
reset_attention_mask):
# Extract batch size and sequence length.
batch_size, seq_length = data.size()
# Attention mask (lower triangular).
if reset_attention_mask:
att_mask_batch = batch_size
else:
att_mask_batch = 1
attention_mask = torch.tril(torch.ones(
(att_mask_batch, seq_length, seq_length), device=data.device)).view(
att_mask_batch, 1, seq_length, seq_length)
# Loss mask.
loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
loss_mask[data == eod_token] = 0.0
# Position ids.
position_ids = torch.arange(seq_length, dtype=torch.long,
device=data.device)
position_ids = position_ids.unsqueeze(0).expand_as(data)
# We need to clone as the ids will be modified based on batch index.
if reset_position_ids:
position_ids = position_ids.clone()
if reset_position_ids or reset_attention_mask:
# Loop through the batches:
for b in range(batch_size):
# Find indices where EOD token is.
eod_index = position_ids[b, data[b] == eod_token]
# Detach indices from positions if going to modify positions.
if reset_position_ids:
eod_index = eod_index.clone()
# Loop through EOD indices:
prev_index = 0
for j in range(eod_index.size()[0]):
i = eod_index[j]
# Prevent tokens after the EOD from attending across the boundary.
if reset_attention_mask:
attention_mask[b, 0, (i+1):, :(i+1)] = 0
# Reset positions.
if reset_position_ids:
position_ids[b, (i+1):] -= (i + 1 - prev_index)
prev_index = i + 1
return attention_mask, loss_mask, position_ids
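# A minimal sketch (toy values, CPU tensors, illustrative helper name) of the
# function above: for one sequence of length 6 with eod_token=0, the attention
# mask comes back lower-triangular with shape [1, 1, 6, 6] and additionally
# blocked across EOD boundaries, the loss mask zeroes the EOD positions, and the
# position ids restart after each EOD.
def _masks_and_position_ids_example():
    data = torch.tensor([[5, 7, 0, 9, 4, 0]])
    attention_mask, loss_mask, position_ids = get_masks_and_position_ids(
        data, eod_token=0, reset_position_ids=True, reset_attention_mask=True)
    # position_ids -> [[0, 1, 2, 0, 1, 2]]; loss_mask -> [[1, 1, 0, 1, 1, 0]]
    return attention_mask.shape, loss_mask, position_ids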
def get_batch(data_iterator, args, timers):
''' Get a batch from the data iterator and broadcast it across the
model-parallel group. Returns the input tokens, the language modeling
labels (the inputs shifted by one position), the causal attention mask,
the position ids, and the padding mask (used downstream as the loss mask).
'''
# Items and their type.
keys = ['text', 'pad_mask']
datatype = torch.int64
# Broadcast data.
timers('data loader').start()
if data_iterator is not None:
data = next(data_iterator)
else:
data = None
timers('data loader').stop()
data_b = mpu.broadcast_data(keys, data, datatype)
# Unpack.
tokens_ = data_b['text'].long()
lm_labels = tokens_[:, 1:].contiguous()
tokens = tokens_[:, :-1].contiguous()
padding_mask = data_b['pad_mask'].byte()
# Get the masks and position ids.
attention_mask, loss_mask, position_ids = get_masks_and_position_ids(
tokens,
args.eod_token,
args.reset_position_ids,
args.reset_attention_mask)
# Convert
if args.fp16:
attention_mask = attention_mask.half()
return tokens, lm_labels, attention_mask, position_ids, padding_mask
def forward_step(data_iterator, model, args, timers):
"""Forward step."""
# Get the batch.
timers('batch generator').start()
batch = get_batch(data_iterator, args, timers)
if batch is None:
return None
tokens, lm_labels, attention_mask, position_ids, loss_mask = batch
timers('batch generator').stop()
# Forward model.
if args.eval_hf:
output, _ = model(tokens)
else:
output = model(tokens, position_ids, attention_mask)
if not args.cloze_eval:
#losses = torch.nn.CrossEntropyLoss(reduce=False)(
losses = mpu.vocab_parallel_cross_entropy(
output.contiguous().float(), lm_labels.contiguous())
loss_mask = loss_mask.contiguous()
loss_mask = loss_mask.view(-1)
lm_loss = torch.sum(
losses.view(-1) * loss_mask.float())
else:
outputs = torch.argmax(output, -1).contiguous().view(-1)
acc = (outputs == lm_labels.contiguous().view(-1)).float()
loss_mask = loss_mask.contiguous().view(-1).float()
lm_loss = torch.sum(acc * loss_mask)
return lm_loss
def evaluate(data_loader, model, args, timers,
num_iterations=None):
"""Evaluation."""
# Turn on evaluation mode which disables dropout.
model.eval()
total_lm_loss = 0
if num_iterations is not None:
max_iters = num_iterations
else:
if mpu.get_model_parallel_rank() == 0:
max_iters_gpu = torch.cuda.LongTensor([len(data_loader)])
else:
max_iters_gpu = torch.cuda.LongTensor([0])
torch.distributed.broadcast(max_iters_gpu,
mpu.get_model_parallel_src_rank(),
group=mpu.get_model_parallel_group())
max_iters = max_iters_gpu[0].item()
print_rank_0('global rank: {} | max iters: {}'.format(
torch.distributed.get_rank(), max_iters))
if data_loader is not None:
data_iterator = iter(data_loader)
else:
data_iterator = None
with torch.no_grad():
iteration = 0
while iteration < max_iters:
if iteration % args.log_interval == 0:
print_rank_0('global rank: {} | iteration: {}'.format(
torch.distributed.get_rank(), iteration))
# Forward evaluation.
lm_loss = forward_step(data_iterator, model, args, timers)
if lm_loss is None:
break
# Reduce across processes.
if isinstance(model, DDP):
torch.distributed.all_reduce(lm_loss.data)
if args.cloze_eval:
lm_loss.data = lm_loss.data / args.world_size
else:
lm_loss.data = lm_loss.data / args.model_parallel_size
if not args.cloze_eval:
total_lm_loss += lm_loss.data.detach().float().item()/(args.num_tokenized_tokens-1)
else:
total_lm_loss += lm_loss.data.detach().float().item()
iteration += 1
# Move model back to the train mode.
model.train()
return total_lm_loss
def evaluate_and_print_results(prefix, data_iterator, model,
args, timers, num_iterations=None):
"""Helper function to evaluate and dump results on screen."""
if not args.cloze_eval:
lm_loss = evaluate(data_iterator, model, args, timers, num_iterations)
val_loss = lm_loss
ppl = math.exp(min(20, val_loss))
token_ratio = (args.num_tokenized_tokens-1)/(args.num_original_tokens-1)
adjusted_ppl = math.exp(min(20, val_loss*token_ratio))
print_rank_0('-' * 100)
string = ' validation results on {} | '.format(prefix)
string += 'avg loss: {:.4E} | '.format(val_loss)
string += 'ppl: {:.4E} | '.format(ppl)
string += 'adjusted ppl: {:.4E} | '.format(adjusted_ppl)
string += 'token ratio: {} |'.format(token_ratio)
length = len(string) + 1
print_rank_0('-' * length)
print_rank_0(string)
print_rank_0('-' * length)
return val_loss
else:
num_correct = evaluate(data_iterator, model, args, timers, num_iterations)
acc = num_correct / args.num_examples
print_rank_0('-' * 100)
string = ' validation results on {} | '.format(prefix)
string += 'number correct: {:.4E} | '.format(num_correct)
string += 'total examples: {:.4E} | '.format(args.num_examples)
string += 'avg accuracy: {:.4E}'.format(acc)
length = len(string) + 1
print_rank_0('-' * length)
print_rank_0(string)
print_rank_0('-' * length)
return acc
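# A minimal sketch (toy numbers, illustrative helper) of the perplexity
# adjustment above: the measured loss is an average over BPE tokens, so it is
# rescaled by the BPE-to-original token ratio before exponentiating, which makes
# the reported number comparable to word-level perplexities computed on the
# original (pre-detokenization) token count.
def _adjusted_ppl_example(val_loss=2.8, num_tokenized_tokens=130, num_original_tokens=100):
    token_ratio = (num_tokenized_tokens - 1) / (num_original_tokens - 1)
    ppl = math.exp(min(20, val_loss))
    adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
    return ppl, adjusted_ppl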
def initialize_distributed(args):
"""Initialize torch.distributed."""
# Manually set the device ids.
device = args.rank % torch.cuda.device_count()
if args.local_rank is not None:
device = args.local_rank
torch.cuda.set_device(device)
# Call the init process
init_method = 'tcp://'
master_ip = os.getenv('MASTER_ADDR', 'localhost')
master_port = os.getenv('MASTER_PORT', '6000')
init_method += master_ip + ':' + master_port
torch.distributed.init_process_group(
backend=args.distributed_backend,
world_size=args.world_size, rank=args.rank,
init_method=init_method)
# Set the model-parallel / data-parallel communicators.
mpu.initialize_model_parallel(args.model_parallel_size)
def set_random_seed(seed):
"""Set random seed for reproducability."""
if seed is not None and seed > 0:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
mpu.model_parallel_cuda_manual_seed(seed)
class LM_Eval_Dataset(torch.utils.data.Dataset):
def __init__(self, tokens, seq_len, pad_idx, overalapping_eval=None):
self.tokens = tokens
self.seq_len = seq_len
self.pad_idx = pad_idx
self.overalapping_eval = overalapping_eval
if self.overalapping_eval is None:
self.overalapping_eval = self.seq_len
self.overalapping_eval = max(1, self.overalapping_eval)
self.total_targets = len(self.tokens) - 1
# remove first sequence tokens
targets = max(self.total_targets - self.overalapping_eval, 0)
self.total_sequences = max(math.ceil(targets / self.overalapping_eval)+1, 1)
def __len__(self):
return self.total_sequences
def __getitem__(self, idx):
start_idx = idx * self.overalapping_eval
end_idx = start_idx + self.seq_len
tokens = self.tokens[start_idx:end_idx+1]
num_tokens = len(tokens)
pad_mask = [1]*num_tokens
if num_tokens < self.seq_len+1:
num_pad = (self.seq_len+1-num_tokens)
pad_mask += [0]*(num_pad)
tokens += [self.pad_idx] * num_pad
pad_mask = np.array(pad_mask[1:])
if self.overalapping_eval != self.seq_len and idx!=0:
pad_mask[:-self.overalapping_eval] *= 0
return {'text': np.array(tokens), 'pad_mask': pad_mask}
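# A minimal sketch (toy sizes, illustrative helper) of the sliding-window
# evaluation above: with 10 tokens, seq_len=4 and an overlapping stride of 2, the
# dataset yields 5 windows starting at tokens 0, 2, 4, 6, 8, and pad_mask zeroes
# out positions already scored by the previous window so every target token is
# counted exactly once.
def _lm_eval_dataset_example():
    dataset = LM_Eval_Dataset(list(range(10)), 4, 0, 2)  # tokens, seq_len, pad_idx, stride
    sample = dataset[1]
    # sample['text'] -> [2, 3, 4, 5, 6]; sample['pad_mask'] -> [0, 0, 1, 1]
    return len(dataset), sample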
class Lambada_Eval_Dataset(torch.utils.data.Dataset):
def __init__(self, path, tokenizer, seq_len):
self.seq_len = seq_len
self.pad_idx = tokenizer.get_command('pad').Id
self.tokens = []
with open(path, 'r') as f:
for line in f.readlines():
text = json.loads(line)['text']
self.tokens.append(tokenizer.EncodeAsIds(text).tokenization)
def __len__(self):
return len(self.tokens)
def __getitem__(self, idx):
tokens = self.tokens[idx]
num_tokens = len(tokens)
pad_mask = [0]*num_tokens
pad_mask[-1] = 1
if num_tokens < self.seq_len+1:
num_pad = (self.seq_len+1-num_tokens)
pad_mask += [0]*(num_pad)
tokens += [self.pad_idx] * num_pad
pad_mask = np.array(pad_mask[1:])
return {'text': np.array(tokens), 'pad_mask': pad_mask}
def get_tokenizer(args):
tokenizer_args = {
'tokenizer_type': args.tokenizer_type,
'corpus': None,
'model_path': args.tokenizer_path,
'vocab_size': args.vocab_size,
'model_type': args.tokenizer_model_type,
'cache_dir': args.cache_dir}
return make_tokenizer(**tokenizer_args)
def get_eval_data(args):
val_dataloader = None
if mpu.get_model_parallel_rank() == 0:
eval_batch_size = args.eval_batch_size
eval_batch_size = args.batch_size if eval_batch_size is None else eval_batch_size
seq_len = args.seq_length
valid_data = args.valid_data
valid_data = valid_data[0] if isinstance(valid_data, list) else valid_data
tokenizer = get_tokenizer(args)
if not args.cloze_eval:
with open(valid_data, "rb") as reader:
entire_data = reader.read().decode('utf-8')
num_original_tokens = len(entire_data.strip().split(" "))
entire_data = get_detokenizer(valid_data)(entire_data)
tokenized_data = tokenizer.EncodeAsIds(entire_data).tokenization
num_tokenized_tokens = len(tokenized_data)
string = 'Original tokens: %d, BPE tokens after detokenization: %d' % (num_original_tokens, num_tokenized_tokens)
print_rank_0(string)
eod_token = tokenizer.get_command('pad').Id
val_dataset = LM_Eval_Dataset(tokenized_data, seq_len, eod_token,
args.overlapping_eval)
else:
val_dataset = Lambada_Eval_Dataset(valid_data, tokenizer, seq_len)
num_tokenized_tokens = 0
num_original_tokens = 0
val_dataloader = torch.utils.data.DataLoader(
val_dataset, batch_size=eval_batch_size, drop_last=False)
before = tokenizer.num_tokens
after = before
while after % mpu.get_model_parallel_world_size() != 0:
after += 1
print_rank_0('> padded vocab (size: {}) with {} dummy tokens (new size: {})'.
format(before, after - before, after))
eod_token = tokenizer.get_command('pad').Id
num_examples = len(val_dataset)
token_counts = torch.cuda.LongTensor([after, eod_token, num_examples,
num_original_tokens,
num_tokenized_tokens])
else:
token_counts = torch.cuda.LongTensor([0, 0, 0, 0, 0])
torch.distributed.broadcast(token_counts,
mpu.get_model_parallel_src_rank(),
group=mpu.get_model_parallel_group())
args.vocab_size = token_counts[0].item()
args.eod_token = token_counts[1].item()
args.num_examples = token_counts[2].item()
args.num_original_tokens = token_counts[3].item()
args.num_tokenized_tokens = token_counts[4].item()
print('global rank: {} | vocab size: {} | eod token: {} | '
'num_examples: {} | num_original_tokens: {} | '
'num_tokenized_tokens: {}'.format(
torch.distributed.get_rank(), args.vocab_size,
args.eod_token, args.num_examples, args.num_original_tokens,
args.num_tokenized_tokens ))
return val_dataloader
def main():
"""Main training program."""
print('Evaluate GPT2 model')
# Disable CuDNN.
torch.backends.cudnn.enabled = False
# Timer.
timers = Timers()
# Arguments.
args = get_args()
# Pytorch distributed.
initialize_distributed(args)
# Random seeds for reproducibility.
set_random_seed(args.seed)
# Data stuff.
eval_data = get_eval_data(args)
# Model, optimizer, and learning rate.
if args.eval_hf:
from pytorch_pretrained_bert import GPT2LMHeadModel
from pytorch_pretrained_bert import GPT2Model as HFGPT2Model
if args.num_layers == 24:
model_path = args.load
#model_path = '/home/universal-lm-data.cosmos549/repos/gpt2_mp/models/345M'
hfmodel = HFGPT2Model.from_pretrained(model_path, cache_dir='gpt2_weights', from_tf=True).cuda()
model = GPT2LMHeadModel(hfmodel.config)
model.transformer.load_state_dict(hfmodel.state_dict())
model.cuda()
else:
model = GPT2LMHeadModel.from_pretrained('gpt2', cache_dir='gpt2_weights').cuda()
else:
if args.load_openai:
from utils import move_weights
model_path = args.load
args.load = None
model = setup_model(args)
from pytorch_pretrained_bert import GPT2LMHeadModel
from pytorch_pretrained_bert import GPT2Model as HFGPT2Model
model_path = 'gpt2'
from_tf = False
print('loading openai weights')
model.cpu()
if args.num_layers == 24:
#model_path = '/home/universal-lm-data.cosmos549/repos/gpt2_mp/models/345M'
hfmodel = HFGPT2Model.from_pretrained(model_path, cache_dir='gpt2_weights', from_tf=True)
gpt2model = GPT2LMHeadModel(hfmodel.config)
gpt2model.transformer.load_state_dict(hfmodel.state_dict())
gpt2model
else:
gpt2model = GPT2LMHeadModel.from_pretrained('gpt2', cache_dir='gpt2_weights')
model2fill = model
while isinstance(model2fill, (DDP, FP16_Module)):
model2fill = model2fill.module
move_weights(model2fill, gpt2model)
model.cuda()
else:
model = setup_model(args)
# Run on test data.
prefix = "wiki" #os.path.basename(args.valid_data)
evaluate_and_print_results(prefix, eval_data,
model, args, timers)
if __name__ == "__main__":
main()
......@@ -197,6 +197,8 @@ class FP16_Optimizer(object):
fp16_params_this_group.append(param)
master_param = param.detach().clone().float()
master_param.requires_grad = True
# Copy the model parallel flag.
master_param.model_parallel = param.model_parallel
param_group['params'][i] = master_param
fp32_from_fp16_params_this_group.append(master_param)
# Reset existing state dict key to the new master param.
......
......@@ -18,6 +18,8 @@ import torch.nn as nn
from torch.autograd import Variable
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
import mpu
class tofp16(nn.Module):
"""
......@@ -194,7 +196,9 @@ def to_python_float(t):
TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])
if TORCH_MAJOR == 0 and TORCH_MINOR <= 4:
clip_grad_norm = torch.nn.utils.clip_grad_norm
else:
clip_grad_norm = torch.nn.utils.clip_grad_norm_
\ No newline at end of file
clip_grad_norm = mpu.clip_grad_norm
#elif TORCH_MAJOR == 0 and TORCH_MINOR <= 4:
# clip_grad_norm = torch.nn.utils.clip_grad_norm
#else:
# clip_grad_norm = torch.nn.utils.clip_grad_norm_
......@@ -14,6 +14,7 @@
# limitations under the License.
import torch
import mpu
# item() is a recent addition, so this helps with backward compatibility.
def to_python_float(t):
......@@ -103,13 +104,25 @@ class DynamicLossScaler:
self.consecutive_hysteresis = consecutive_hysteresis
# `params` is a list / generator of torch.Variable
def has_overflow(self, params):
def has_overflow_serial(self, params):
for p in params:
if p.grad is not None and DynamicLossScaler._has_inf_or_nan(p.grad.data):
return True
return False
def has_overflow(self, params):
overflow = self.has_overflow_serial(params)
# Since each model parallel GPU carries only part of the model,
# make sure overflow flag is synced across all the model parallel GPUs
overflow_gpu = torch.cuda.ByteTensor([overflow])
torch.distributed.all_reduce(overflow_gpu,
op=torch.distributed.ReduceOp.MAX,
group=mpu.get_model_parallel_group())
overflow = overflow_gpu[0].item()
return bool(overflow)
# `x` is a torch.Tensor
def _has_inf_or_nan(x):
try:
......@@ -133,6 +146,7 @@ class DynamicLossScaler:
# `overflow` is boolean indicating whether the gradient overflowed
def update_scale(self, overflow):
if not hasattr(self, 'min_scale'):
self.min_scale = 1
if not hasattr(self, 'delayed_shift'):
......
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sample Generate GPT2"""
import os
import random
import numpy as np
import torch
import torch.nn.functional as F
import argparse
import time
from arguments import get_args
from utils import Timers
from pretrain_gpt2 import initialize_distributed
from pretrain_gpt2 import set_random_seed
from pretrain_gpt2 import get_train_val_test_data
from pretrain_gpt2 import get_masks_and_position_ids
from utils import load_checkpoint
from data_utils import make_tokenizer
from configure_data import configure_data
import mpu
from fp16 import FP16_Module
from model import GPT2Model
from model import DistributedDataParallel as DDP
from utils import print_rank_0
def get_model(args):
"""Build the model."""
print_rank_0('building GPT2 model ...')
model = GPT2Model(num_layers=args.num_layers,
vocab_size=args.vocab_size,
hidden_size=args.hidden_size,
num_attention_heads=args.num_attention_heads,
embedding_dropout_prob=args.hidden_dropout,
attention_dropout_prob=args.attention_dropout,
output_dropout_prob=args.hidden_dropout,
max_sequence_length=args.max_position_embeddings,
checkpoint_activations=args.checkpoint_activations,
checkpoint_num_layers=args.checkpoint_num_layers,
parallel_output=False)
if mpu.get_data_parallel_rank() == 0:
print(' > number of parameters on model parallel rank {}: {}'.format(
mpu.get_model_parallel_rank(),
sum([p.nelement() for p in model.parameters()])), flush=True)
# GPU allocation.
model.cuda(torch.cuda.current_device())
# Fp16 conversion.
if args.fp16:
model = FP16_Module(model)
# Wrap model for distributed training.
model = DDP(model)
return model
def setup_model(args):
"""Setup model and optimizer."""
model = get_model(args)
if args.load is not None:
_ = load_checkpoint(
model, None, None, args)
return model
def get_batch(context_tokens, device, args):
tokens = context_tokens
tokens = tokens.view(args.batch_size, -1).contiguous()
tokens = tokens.to(device)
# Get the masks and position ids.
attention_mask, loss_mask, position_ids = get_masks_and_position_ids(
tokens,
args.eod_token,
args.reset_position_ids,
args.reset_attention_mask)
return tokens, attention_mask, position_ids
def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
# This function has been mostly taken from huggingface conversational ai code at
# https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
if top_k > 0:
# Remove all tokens with a probability less than the last token of the top-k
indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
logits[indices_to_remove] = filter_value
if top_p > 0.0:
#convert to 1D
logits=logits.view(logits.size()[1]).contiguous()
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability above the threshold
sorted_indices_to_remove = cumulative_probs > top_p
# Shift the indices to the right to keep also the first token above the threshold
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
indices_to_remove = sorted_indices[sorted_indices_to_remove]
logits[indices_to_remove] = filter_value
#going back to 2D
logits=logits.view(1, -1).contiguous()
return logits
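# A minimal sketch (toy logits, batch of one, illustrative helper) of the
# filtering above: keep only the top-2 logits, renormalize with softmax, and draw
# one sample. Note that the top-p branch reshapes the logits assuming a single
# row, which is one reason the sampling loop below runs with batch_size set to 1.
def _top_k_sampling_example():
    logits = torch.tensor([[1.0, 3.0, 0.5, 2.0]])
    filtered = top_k_logits(logits.clone(), top_k=2, top_p=0.0)
    probs = F.softmax(filtered, dim=-1)  # probability mass only on ids 1 and 3
    return torch.multinomial(probs, num_samples=1)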
def generate_samples(model, tokenizer, args, device):
context_count=0
model.eval()
with torch.no_grad():
while True:
torch.distributed.barrier(group=mpu.get_model_parallel_group())
terminate_runs=0
if mpu.get_model_parallel_rank() == 0:
raw_text = input("\nContext prompt (stop to exit) >>> ")
while not raw_text:
print('Prompt should not be empty!')
raw_text = input("\nContext prompt (stop to exit) >>> ")
if "stop" in raw_text:
terminate_runs = 1
else:
context_tokens = tokenizer.EncodeAsIds(raw_text).tokenization
context_length = len(context_tokens)
if context_length >=args.seq_length//2:
print("\nContext length", context_length, \
"\nPlease give smaller context (half of the sequence length)!")
continue
else:
context_tokens = tokenizer.EncodeAsIds("EMPTY TEXT").tokenization
context_length = len(context_tokens)
terminate_runs_tensor = torch.cuda.LongTensor([terminate_runs])
torch.distributed.broadcast(terminate_runs_tensor, mpu.get_model_parallel_src_rank(), group=mpu.get_model_parallel_group())
terminate_runs = terminate_runs_tensor[0].item()
if terminate_runs == 1:
return
pad_id = tokenizer.get_command('pad').Id
if context_length < args.seq_length:
context_tokens.extend([pad_id] * (args.seq_length - context_length))
context_tokens_tensor = torch.cuda.LongTensor(context_tokens)
context_length_tensor = torch.cuda.LongTensor([context_length])
torch.distributed.broadcast(context_length_tensor, mpu.get_model_parallel_src_rank(), group=mpu.get_model_parallel_group())
torch.distributed.broadcast(context_tokens_tensor, mpu.get_model_parallel_src_rank(), group=mpu.get_model_parallel_group())
context_length = context_length_tensor[0].item()
tokens, attention_mask, position_ids=get_batch(context_tokens_tensor, device, args)
start_time = time.time()
counter = 0
org_context_length = context_length
while counter < (org_context_length + args.out_seq_length):
logits = model(tokens, position_ids, attention_mask)
logits = logits[:, context_length - 1, :] / args.temperature
logits = top_k_logits(logits, top_k=args.top_k, top_p=args.top_p)
log_probs = F.softmax(logits, dim=-1)
prev = torch.multinomial(log_probs, num_samples=1)
tokens[0, context_length] = prev[0]
context_length += 1
counter += 1
output_tokens_list = tokens.view(-1).contiguous()
decode_tokens = tokenizer.DecodeIds(output_tokens_list.tolist())
token_end = decode_tokens.find("<|endoftext|>")
if mpu.get_model_parallel_rank() == 0 and (counter % 16 == 0 or token_end != -1):
os.system('clear')
print("\nTaken time {:.2f}\n".format(time.time() - start_time), flush=True)
print("\nContext:", raw_text, flush=True)
trim_decode_tokens = decode_tokens[len(raw_text):decode_tokens.find("<|endoftext|>")]
print("\nGPT2:", trim_decode_tokens, flush=True)
if token_end != -1:
break
if mpu.get_model_parallel_rank() == 0:
os.system('clear')
print("\nTaken time {:.2f}\n".format(time.time() - start_time), flush=True)
print("\nContext:", raw_text, flush=True)
output_tokens_list = tokens.view(-1).contiguous()
decode_tokens = tokenizer.DecodeIds(output_tokens_list.tolist())
trim_decode_tokens = decode_tokens[len(raw_text):decode_tokens.find("<|endoftext|>")]
print("\nGPT2:", trim_decode_tokens, flush=True)
raw_text = None
torch.distributed.barrier(group=mpu.get_model_parallel_group())
context_count += 1
def prepare_tokenizer(args):
tokenizer_args = {
'tokenizer_type': args.tokenizer_type,
'corpus': None,
'model_path': args.tokenizer_path,
'vocab_size': args.vocab_size,
'model_type': args.tokenizer_model_type,
'cache_dir': args.cache_dir}
tokenizer = make_tokenizer(**tokenizer_args)
args.tokenizer_num_tokens = tokenizer.num_tokens
args.tokenizer_num_type_tokens = tokenizer.num_type_tokens
args.eod_token = tokenizer.get_command('eos').Id
after = tokenizer.num_tokens
while after % mpu.get_model_parallel_world_size() != 0:
after += 1
args.vocab_size = after
print("prepare tokenizer done", flush=True)
return tokenizer
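# A minimal sketch (illustrative helper) of the vocabulary padding above: the
# token embedding is partitioned along the vocabulary dimension across
# model-parallel ranks, so the vocabulary size is rounded up to the next multiple
# of the model-parallel world size (e.g. a 50257-token GPT-2 vocabulary becomes
# 50264 with 8-way model parallelism).
def _padded_vocab_size_example(num_tokens=50257, model_parallel_size=8):
    padded = num_tokens
    while padded % model_parallel_size != 0:
        padded += 1
    return padded  # 50264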
def main():
"""Main training program."""
print('Generate Samples')
# Disable CuDNN.
torch.backends.cudnn.enabled = False
# Timer.
timers = Timers()
# Arguments.
args = get_args()
# Pytorch distributed.
initialize_distributed(args)
# Random seeds for reproducibility.
set_random_seed(args.seed)
#get the tokenizer
tokenizer = prepare_tokenizer(args)
# Model, optimizer, and learning rate.
model = setup_model(args)
#setting default batch size to 1
args.batch_size = 1
#generate samples
generate_samples(model, tokenizer, args, torch.cuda.current_device())
if __name__ == "__main__":
main()