Commit abe36e2e authored by Raul Puri

large update including model parallelism and gpt2


Co-authored-by: shoeybi <shoeybim@gmail.com>
Co-authored-by: raulpuric <raulpuric@berkeley.edu>
Co-authored-by: jaredcasper <jaredcasper@gmail.com>
Co-authored-by: mpatwary <mostofa.patwary@gmail.com>
Co-authored-by: plegresl <plegresl@gmail.com>
parent 0399d32c
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch


def ensure_divisibility(numerator, denominator):
    """Ensure that numerator is divisible by the denominator."""
    assert numerator % denominator == 0, '{} is not divisible by {}'.format(
        numerator, denominator)


def divide(numerator, denominator):
    """Ensure that numerator is divisible by the denominator and return
    the division value."""
    ensure_divisibility(numerator, denominator)
    return numerator // denominator


def split_tensor_along_last_dim(tensor, num_partitions,
                                contiguous_split_chunks=False):
    """Split a tensor along its last dimension.
    Arguments:
        tensor: input tensor.
        num_partitions: number of partitions to split the tensor into.
        contiguous_split_chunks: If True, make each chunk contiguous
                                 in memory.
    """
    # Get the size and dimension.
    last_dim = tensor.dim() - 1
    last_dim_size = divide(tensor.size()[last_dim], num_partitions)
    # Split.
    tensor_list = torch.split(tensor, last_dim_size, dim=last_dim)
    # Note: torch.split does not create contiguous tensors by default.
    if contiguous_split_chunks:
        return tuple(chunk.contiguous() for chunk in tensor_list)
    return tensor_list


class VocabUtility:
    """Split the vocabulary into `world_size` chunks and return the
    first and last index of the vocabulary belonging to the `rank`
    partition. Note that indices are in [first, last)."""

    @staticmethod
    def vocab_range_from_per_partition_vocab_size(per_partition_vocab_size,
                                                  rank, world_size):
        index_f = rank * per_partition_vocab_size
        index_l = index_f + per_partition_vocab_size
        return index_f, index_l

    @staticmethod
    def vocab_range_from_global_vocab_size(global_vocab_size, rank, world_size):
        per_partition_vocab_size = divide(global_vocab_size, world_size)
        return VocabUtility.vocab_range_from_per_partition_vocab_size(
            per_partition_vocab_size, rank, world_size)
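For reference (not part of the file above), here is a minimal usage sketch showing how these helpers partition a vocabulary across model-parallel ranks; the module name `utils` is an assumption:
```
from utils import VocabUtility  # module name is an assumption

world_size = 4
vocab_size = 50304  # example vocabulary size divisible by world_size
for rank in range(world_size):
    first, last = VocabUtility.vocab_range_from_global_vocab_size(
        vocab_size, rank, world_size)
    # rank 0 owns indices [0, 12576), rank 1 owns [12576, 25152), ...
    print('rank {} owns vocabulary indices [{}, {})'.format(rank, first, last))
```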
The following steps show how to prepare the training dataset used to train the model.
# Libraries to install
```
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract
git clone https://github.com/mattilyra/LSH
cd LSH
python setup.py install
```
# Download the dataset
1. Download the deduplicated URLs from [jcpeterson](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ!cc4RgQQZ)
2. Remove blacklisted URLs.
```
python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for clean urls, e.g. clean_urls.txt>
```
3. Download the content from the clean urls with [openwebtext's utilities](https://github.com/eukaryote31/openwebtext/blob/master/download.py).
4. Merge the contents into one loose json file with one json object per line in the format `{'text': text, 'url': unique_url}`. It is important for the url to be unique. A minimal merging sketch is shown below.
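For illustration, the sketch below assumes the downloaded documents are available as an iterable of `(url, text)` pairs; the exact layout produced by the download step may differ.
```
import json


def merge_to_loose_json(docs, output_filename):
    """Write one json object per line in the form {'text': ..., 'url': ...},
    keeping only the first occurrence of each url."""
    seen_urls = set()
    with open(output_filename, 'w') as f:
        for url, text in docs:
            if url in seen_urls:
                continue
            seen_urls.add(url)
            f.write(json.dumps({'text': text, 'url': url}) + '\n')
```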
# Prepare the data for GPT-2 training:
1. Run ftfy, perform English detection, and remove documents with fewer than 128 tokens. This step can be sharded and run on shards. (A rough sketch of this filtering appears after step 5 below.)
```
python cleanup_dataset.py <input data file> <output cleaned data filename>
```
2. Using LSH, find possible duplicates and store them in a file for later processing. This step can NOT be sharded and usually takes 12 to 24 hours for the OpenWebText dataset.
```
python find_duplicates.py <input cleaned data file> <output possible duplicate urls filename>
```
3. Based on the similarity measure defined inside the function `is_similar` (default threshold: 0.9), group urls that are similar. For each group, keep only one url and remove the rest. (A sketch of this kind of similarity check appears after step 5 below.)
```
python group_duplicate_urls.py <possible duplicate urls file> <output file containing similar urls>
```
4. Remove similar documents that were detected in the last step.
```
python remove_group_duplicates.py <file containing similar documents> <cleaned data file> <output file containing deduplicated data>
```
5. Shuffle the dataset.
```
shuf <cleaned deduped data file> -o train_data.json
```
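As referenced in step 1, a rough sketch of the cleanup filter is shown below. It assumes `ftfy` and `langdetect` (both in the install list above) and approximates the token count with a whitespace split; the actual `cleanup_dataset.py` may differ in its details.
```
import json

import ftfy
from langdetect import detect


def clean_shard(input_filename, output_filename, min_tokens=128):
    """Fix text with ftfy and keep only English documents with at least
    `min_tokens` tokens (whitespace split as a stand-in for the real tokenizer)."""
    with open(input_filename, 'r') as fin, open(output_filename, 'w') as fout:
        for line in fin:
            doc = json.loads(line)
            text = ftfy.fix_text(doc['text'])
            if len(text.split()) < min_tokens:
                continue
            try:
                if detect(text) != 'en':
                    continue
            except Exception:
                # langdetect can fail on degenerate inputs; drop those documents.
                continue
            doc['text'] = text
            fout.write(json.dumps(doc) + '\n')
```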
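Steps 2 and 3 rely on locality-sensitive hashing over document shingles and a Jaccard-style similarity threshold inside `is_similar`. The snippet below only illustrates that similarity check; it is not the actual `find_duplicates.py` or `group_duplicate_urls.py` code.
```
def shingles(text, char_ngram=5):
    """Set of overlapping character n-grams representing a document."""
    return {text[i:i + char_ngram] for i in range(len(text) - char_ngram + 1)}


def jaccard(set_a, set_b):
    """Jaccard similarity of two shingle sets."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_similar(text_a, text_b, threshold=0.9):
    """Treat two documents as near-duplicates above the threshold."""
    return jaccard(shingles(text_a), shingles(text_b)) > threshold
```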
# coding=utf-8
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import numpy as np
import time
import os
import sys

from tokenizer import Tokenizer


def tokenize_corpus(filename, np_filename, print_interval=10000):

    print(' > tokenizing {}'.format(filename))

    tokenizer = Tokenizer(cache_dir='./cache')

    tokenized_docs = []
    num_docs = 0
    num_tokens = 0
    start_time = time.time()
    with open(filename, 'r') as f:
        for line in f:
            try:
                myjson = json.loads(line)
                url = myjson['url']
                sample = myjson['text']
                tokens = tokenizer.tokenize_document(sample)
                tokenized_docs.append(np.array(tokens, dtype=np.uint16))
                num_docs += 1
                num_tokens += len(tokens)
                if num_docs % print_interval == 0:
                    print('    processed {:9d} documents in {:.2f} (s) so far'.
                          format(num_docs, time.time() - start_time),
                          flush=True)
            except Exception as e:
                print('    skipping ', line, e)

    print(' >> processed {} documents with a total of {} tokens ...'.format(
        num_docs, num_tokens))

    tokenized_docs = np.array(tokenized_docs, dtype=object)
    np.save(np_filename, tokenized_docs, allow_pickle=True)
    print(' >> saved the tokenized documents to {} ...'.format(np_filename))


if __name__ == '__main__':

    print('building gpt2 dataset ...')

    path = sys.argv[1]
    shard = sys.argv[2]
    input_filename = os.path.join(path,
                                  'shards/shard_{:04d}'.format(int(shard)))
    output_filename = os.path.join(path,
                                   'npys/shard_{:04d}.npy'.format(int(shard)))
    print('will be reading {}'.format(input_filename))
    print('and will write the results to {}'.format(output_filename))

    tokenize_corpus(input_filename, output_filename)
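Each shard's output is an object array of per-document `uint16` token arrays. A quick way to inspect a saved shard (illustrative; the path is assumed):
```
import numpy as np

docs = np.load('npys/shard_0000.npy', allow_pickle=True)
print('documents in shard:', len(docs))
print('tokens in first document:', len(docs[0]), docs[0].dtype)  # uint16
```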
#!/bin/bash
echo "processing gpt2 data ..."
DIR="/raid/mpatwary/redownload_v0/0-21"
for thread in {0..3}; do
echo " launching thread "$thread && python make_gpt2_dataset.py $DIR $thread > $DIR/logs/shard_$thread.log 2>&1 &
done