# Run GPT With Colossal-AI
## Overview
In Colossal-AI, there are many ways to run GPT in a distributed manner. The `train_gpt.py` script runs training with the configuration scripts in `gpt2_configs/`, each of which covers a different parallelism strategy for GPT-2. We provide several example GPT-2 configuration files that you can modify to fit your own use case.
## How to Prepare Webtext Dataset
We do not host any datasets for GPT or BERT training, however, we provide a detailed guide on how to prepare the dataset so that our results may be reproduced.
### Overview
We utilize the publicly available [OpenWebText](https://github.com/eukaryote31/openwebtext) library, building on the work of [jcpeterson](https://github.com/jcpeterson/openwebtext) and [eukaryote31](https://github.com/eukaryote31/openwebtext), to download URLs of different web pages. We then filter, clean, and deduplicate all downloaded content according to the procedure described in the following section.
### Install necessary packages
**Note: LSH requires an earlier version of GCC. We have verified that GCC 9.3.0 works, while 10.3.0 does not.**
```bash
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract cached-path
git clone https://github.com/mattilyra/LSH.git
cd LSH
python setup.py install
```
If you cannot install it successfully, you may try replacing the `cMinhash.cpp` in `LSH/lsh` with ours, which is provided in `tools/lsh/cMinhash.cpp`.
### Download Data
1. Download the deduplicated URLs from [jcpeterson](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ!cc4RgQQZ).
2. Unzip the file to get a folder `URLs`, which contains many txt files of URLs.
3. Remove blacklisted URLs.
*We appreciate Megatron-LM for making the data preprocessing code public. We have forked Megatron-LM and fixed some bugs. For your convenience, we have collated the needed files in `tools/Megatron`. Click [here](https://github.com/NVIDIA/Megatron-LM.git) to check the source code of Megatron-LM.*
```bash
cd path/to/tools
python Megatron/blacklist_urls.py <path/to/URLs> <path/to/clean_urls.txt>
```
4. Download the content from the cleaned URLs and merge it into one loose JSON file, with one JSON object per line in the format `{'text': text, 'url': unique_url}`.
*We have forked and modified [openwebtext](https://github.com/yet-another-account/openwebtext) as there are some bugs in it. For your convenience, we provide our modified version in `tools/download`.*
```bash
python download/download.py <path/to/clean_urls.txt> --n_procs 50 --output <path/to/raw.json>
```
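Each line of the resulting file is a standalone JSON object. As a quick sanity check (a minimal sketch; `raw.json` is the output path used above, and the field names follow the format described in this step), you can inspect a few records like this:
```Python
import json

# Peek at the first few records of the loose JSON file produced by the download step.
# Each line is expected to be a JSON object with 'text' and 'url' keys.
with open('raw.json') as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record['url'], '-', len(record['text']), 'characters')
        if i >= 4:
            break
```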
### Prepare Data for GPT Training
1. Perform ftfy cleanup and English detection, and remove documents with fewer than 128 tokens. This step can be sharded and run on shards.
```bash
python Megatron/cleanup_dataset.py <path/to/raw.json> <path/to/clean.json>
```
Additional cleanup (e.g. removing documents with fewer than 512 characters, or dataset-specific cleaning for the stories and realnews datasets) can be done using `cleanup_fix_dataset.py`. More details can be found by running `python cleanup_fix_dataset.py --help`.
2. Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplication, and is also multithreaded for faster processing. More details can be found by running `python find_duplicates.py --help`.
```bash
python Megatron/find_duplicates.py --inputs <path/to/clean.json> url --output <path/to/process_stage_one.json>
```
3. Based on the similarity measure defined inside the `is_similar` function (default threshold: 0.9), group URLs that are similar. For each group, keep only one URL and remove the rest.
```bash
python Megatron/group_duplicate_url.py <path/to/process_stage_one.json> <path/to/process_stage_two.json>
```
4. Remove the similar documents detected in the previous step. `dedup.json` contains the data after deduplication.
```bash
python Megatron/remove_group_duplicates.py <path/to/process_stage_two.json> <path/to/clean.json> <path/to/dedup.json>
```
5. Shuffle the dataset.
```bash
shuf <path/to/dedup.json> -o <path/to/train_data.json>
```
## How to Prepare Yuan Dataset
### Overview
The Yuan dataset is a large-scale Chinese dataset with 1 TB of high-quality text, released by Inspur. You can apply at https://air.inspur.com/home to get access to the dataset. Download and load all content according to the procedure described in the following section.
### Download
The dataset can be downloaded from the website once your application is approved.
You also need to download the vocab file from https://github.com/Shawn-Inspur/Yuan-1.0/blob/main/src/vocab.txt
The final data directory should be organized as:
```
|--dataset
| |--001.txt
| |--002.txt
| |--...
|--vocab.txt
```
### Process & Load
Before you run the code, you should replace line 44 in `train_gpt.py` with:
```Python
from dataset.yuan import YuanDataset

train_ds = YuanDataset(os.environ['DATA'], vocab_path='/path/to/data/vocab.txt', seq_len=gpc.config.SEQ_LEN)
```
Then you can run the model following the Usage section. The dataset will be processed and cached the first time you run it; afterwards the data is loaded from the cache automatically.
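As a quick sanity check (a minimal sketch with placeholder paths; it assumes the example's `dataset.yuan` module is importable), you can load the dataset directly and look at one sample:
```Python
from dataset.yuan import YuanDataset

# Placeholder paths: point them at your own Yuan data folder and vocab file.
train_ds = YuanDataset('/path/to/data/dataset', vocab_path='/path/to/data/vocab.txt', seq_len=2048)
inputs, labels = train_ds[0]
print(inputs['input_ids'].shape, inputs['attention_mask'].shape)  # both torch.Size([2048])
```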
## **Usage**
```Bash
#!/usr/bin/env sh
export DATA=/path/to/train_data.json
colossalai run --nproc_per_node=<num_gpus> train_gpt.py --config=gpt2_configs/<config_file>
```
You can copy it and save it as `run.sh`. Then use `bash ./run.sh` to run the script in your terminal.
Please modify `DATA`, `num_gpus` and `config_file` with the path to your dataset, the number of GPUs and the config file path, respectively.
If you are going to train GPT-3, just replace `gpt2_configs` with `gpt3_configs`.
## GPT-2
Here are the default parameters of the GPT-2 configs:
| config | scale | GPU* | batch size | MiB of each GPU | TP | PP | DP |
| ------------ | ----- | ---- | ----------- | --------------- | --- | --- | --- |
| gpt2-vanilla | small | 1 | 1 | 6071 | 1 | 1 | 1 |
| gpt2-vanilla | small | 2 | 1 | 6449*2 | 1 | 1 | 2 |
| gpt2-1d | small | 2 | 1 | 5287*2 | 2 | 1 | 1 |
| gpt2-2d | small | 4 | 1 | 4590*4 | 4 | 1 | 1 |
| gpt2-2.5d | small | 8 | 1 | 4815*8 | 8 | 1 | 1 |
| gpt2-3d | small | 8 | 1 | 4901*8 | 8 | 1 | 1 |
| gpt2-pp | small | 2 | 1 | 5877*2 | 1 | 2 | 1 |
| gpt2-zero2 | small | 1 | 1 | 5459 | 1 | 1 | 1 |
| gpt2-zero3 | small | 1 | 1 | 6577 | 1 | 1 | 1 |
| gpt2-nvme | small | 1 | 1 | 5067 | 1 | 1 | 1 |
| gpt2-pp1d | small | 8 | 8 | 5411*8 | 2 | 2 | 2 |
*\*Note: For GPUs, we use Nvidia A100 80G.*
*\*Note: Results of ZeRO are outdated, we will update them soon.*
**We set `TENSOR_PARALLEL`, `PIPELINE_PARALLEL` and `DATA_PARALLEL` as small as possible so that every demo can run with the least number of GPUs.**
### **Modify the config file**
#### **General**
There are some **general rules** when modifying the config files.
```text
TP denotes Tensor Parallel
PP denotes Pipeline Parallel
DP denotes Data Parallel
GPUS = TP * PP * DP
where DP is set automatically
```
You can set the **batch size** and the **epoch** number by changing the values of
`BATCH_SIZE` and `NUM_EPOCHS`, respectively. Next, we will introduce the config file of each mode.
Please note that `gpt2_zero3.py` has nothing but `BATCH_SIZE` and `NUM_EPOCHS` to change.
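For example (an illustrative combination rather than one of the provided configs), launching on 8 GPUs with TP = 2 and PP = 2 leaves DP = 2 to be set automatically:
```Python
# Illustration of the general rules above; not a provided config file.
TENSOR_PARALLEL = 2   # TP
PIPELINE = 2          # PP
NUM_GPUS = 8          # e.g. launched with --nproc_per_node=8

# DP is set automatically so that GPUS = TP * PP * DP
DATA_PARALLEL = NUM_GPUS // (TENSOR_PARALLEL * PIPELINE)   # 2
assert TENSOR_PARALLEL * PIPELINE * DATA_PARALLEL == NUM_GPUS
```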
#### **Vanilla & Data Parallel**
`Vanilla` is the basic mode of GPT-2 with no parallelism at all. However, if you use more than 1 GPU and TP * PP is smaller than the number of GPUs, Colossal-AI will **set DP for you automatically**.
#### **1D, 2D, 2.5D, 3D**
In files `gpt2_1d.py, gpt2_2d.py, gpt2_2p5d.py, gpt2_3d.py`, there is a line:
```Python
TENSOR_PARALLEL = 2
```
You can modify it to use a larger tensor-parallel size, as long as the general rules are satisfied.
In particular, `TENSOR_PARALLEL` should be a square number for 2D and a cubic number for 3D,
and `TENSOR_PARALLEL / DEPTH` should be a square number for 2.5D.
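For instance, `TENSOR_PARALLEL = 4` is valid for 2D (2 x 2), 8 is valid for 2.5D with `DEPTH = 2` (8 / 2 = 4 is a square), and 8 is valid for 3D (2 x 2 x 2). The sketch below (not part of the example scripts) encodes these rules:
```Python
import math

# Sketch of the constraints described above; not part of the example scripts.
def is_valid_tensor_parallel(size, mode, depth=1):
    if mode == '1d':
        return size >= 1
    if mode == '2d':      # must be a square number
        return round(math.sqrt(size)) ** 2 == size
    if mode == '2.5d':    # size / depth must be a square number
        return size % depth == 0 and is_valid_tensor_parallel(size // depth, '2d')
    if mode == '3d':      # must be a cubic number
        return round(size ** (1 / 3)) ** 3 == size
    return False

assert is_valid_tensor_parallel(4, '2d')             # 2 x 2
assert is_valid_tensor_parallel(8, '2.5d', depth=2)  # 8 / 2 = 4
assert is_valid_tensor_parallel(8, '3d')             # 2 x 2 x 2
```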
#### **Pipeline Parallel**
To use pipeline parallel training, you should install colossalai from the **latest** main branch.
In `gpt2_pp.py`, there are lines:
```Python
# BATCH_SIZE / NUM_MICRO_BATCHES should be an integer
NUM_MICRO_BATCHES = 1
PIPELINE = 2
```
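`NUM_MICRO_BATCHES` splits each batch into micro-batches for the pipeline schedule, so `BATCH_SIZE / NUM_MICRO_BATCHES` must be an integer. For example (illustrative values only):
```Python
# Illustrative values only: each pipeline micro-batch holds BATCH_SIZE / NUM_MICRO_BATCHES samples.
BATCH_SIZE = 8
NUM_MICRO_BATCHES = 4
assert BATCH_SIZE % NUM_MICRO_BATCHES == 0
micro_batch_size = BATCH_SIZE // NUM_MICRO_BATCHES   # 2 samples per micro-batch
```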
#### **Pipeline + 1D + Data Parallel**
In `gpt2_pp1d.py`, we have
```Python
BATCH_SIZE = 8
NUM_EPOCHS = 60
NUM_MICRO_BATCHES = 1
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
MODE = '1d'
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)
```
We have introduced `BATCH_SIZE`, `NUM_EPOCHS`, `NUM_MICRO_BATCHES`, `PIPELINE`, `TENSOR_PARALLEL` as discussed above.
`HIDDEN_SIZE` refers to the hidden dimension of the model, e.g. 768 for `gpt2_small`.
You can choose `None, '1d', '2d', '2.5d', '3d'` for `MODE`.
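As a quick consistency check against the table above (a sketch using the `gpt2_pp1d.py` values and the 8 GPUs of the `gpt2-pp1d` row):
```Python
# Values from gpt2_pp1d.py; 8 GPUs as in the gpt2-pp1d row of the table above.
BATCH_SIZE = 8
NUM_MICRO_BATCHES = 1
SEQ_LEN = 1024
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
NUM_GPUS = 8

TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)   # (8, 1024, 768)
DATA_PARALLEL = NUM_GPUS // (PIPELINE * TENSOR_PARALLEL)                 # 2, matching the table
```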
## GPT-3
GPT-3 is a huge model, so it is not feasible to train it with a small number of GPUs. Therefore, we choose some common sets of parameters instead of the smallest possible ones.
Here are our default parameters of GPT-3 configs:
| config | GPU* | batch size | TP | PP | DP |
| -------------- | ---- | ---------- | --- | --- | --- |
| gpt3_pp1d_min | 96 | 192 | 4 | 24 | 1 |
| gpt3_pp1d | 128 | 192 | 4 | 32 | 1 |
| gpt3_pp2d | 96 | 2*48 | 4 | 24 | 1 |
| gpt3_pp2p5d | 96 | 2*48 | 4 | 24 | 1 |
| gpt3_zero3_min | 64 | 3 | 1 | 1 | 64 |
| gpt3_zero3 | 96 | 2 | 1 | 1 | 96 |
*\*Note: we use Nvidia A100 40G GPUs*
*\*Note: Results of ZeRO are outdated, we will update them soon.*
In the table above, the suffix `_min` means the set of hyper-parameters that requires the least number of GPUs for the same mode.
GPT-3 and GPT-2 configs share the same set of hyper-parameters.
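For example, in `gpt3_pp1d_min` the 96 GPUs come from 4-way tensor parallelism times 24 pipeline stages, and the global batch of 192 is split into 192 micro-batches (a sketch of the arithmetic, not an additional config):
```Python
# Arithmetic for the gpt3_pp1d_min row above; not an additional config file.
TENSOR_PARALLEL = 4
PIPELINE = 24
DATA_PARALLEL = 1
NUM_GPUS = TENSOR_PARALLEL * PIPELINE * DATA_PARALLEL    # 96

BATCH_SIZE = 192
NUM_MICRO_BATCHES = 192
micro_batch_size = BATCH_SIZE // NUM_MICRO_BATCHES       # 1 sample per micro-batch
```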
import json
import os
import torch
from torch.utils.data import Dataset
from colossalai.registry import DATASETS
from transformers import GPT2Tokenizer
@DATASETS.register_module
class WebtextDataset(Dataset):
def __init__(self, path, seq_len=1024) -> None:
super().__init__()
root = os.path.dirname(path)
encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
if os.path.isfile(encoded_data_cache_path):
seq_len_, data, attention_mask = torch.load(encoded_data_cache_path)
if seq_len_ == seq_len:
self.data = data
self.attention_mask = attention_mask
return
raw_data = []
with open(path) as f:
for line in f.readlines():
raw_data.append(json.loads(line)['text'])
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.unk_token
encoded_data = tokenizer(raw_data, padding=True, truncation=True, max_length=seq_len, return_tensors='pt')
self.data = encoded_data['input_ids']
self.attention_mask = encoded_data['attention_mask']
torch.save((seq_len, self.data, self.attention_mask), encoded_data_cache_path)
def __len__(self):
return len(self.data)
def __getitem__(self, index):
return {'input_ids': self.data[index], 'attention_mask': self.attention_mask[index]}, self.data[index]
import collections
import glob
import logging
import multiprocessing
import os
import sys
import jieba
import six
import torch
from tools.tokenization_enc_dec import EncDecTokenizer
from torch.utils.data import Dataset
from tqdm import tqdm
from colossalai.registry import DATASETS
try:
import nltk
nltk_available = True
except ImportError:
nltk_available = False
jieba.setLogLevel(logging.INFO)
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
def is_contain_chinese(check_str):
for ch in check_str:
if u'\u4e00' <= ch <= u'\u9fff':
return True
return False
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Should be running on Python 3")
class WordpieceTokenizer(object):
def __init__(self, vocab, unk_token="<unk>", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, token):
token = convert_to_unicode(token)
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
return [self.unk_token]
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if is_contain_chinese(substr):
if substr in self.vocab:
cur_substr = substr
break
else:
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
sub_tokens.append(self.unk_token)
start += 1
continue
sub_tokens.append(cur_substr)
start = end
return sub_tokens
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
index = 0
with open(vocab_file, "r", encoding='utf-8') as reader:
while True:
token = convert_to_unicode(reader.readline())
if not token:
break
token = token.strip()
vocab[token] = index
index += 1
return vocab
class EncDecTokenizer(object):
def __init__(self, vocab_file, max_len=None, max_sentinels=190):
self.max_len = max_len if max_len is not None else int(1e12)
self.encoder = load_vocab(vocab_file)
self.decoder = {v: k for k, v in self.encoder.items()}
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder)
self.translator = str.maketrans(" \n", "\u2582\u2583")
self.sentinel_list = [self.encoder['<s_{}>'.format(i)] for i in range(max_sentinels)]
self.en_vocab = {}
for k, v in self.encoder.items():
if is_contain_chinese(k):
self.en_vocab[v] = False
else:
self.en_vocab[v] = True
self.en_vocab[10] = False
@property
def vocab_size(self):
return len(self.encoder)
def __len__(self):
return len(self.encoder)
@property
def eod_id(self):
return self.encoder[self.eod_token]
@property
def pad_id(self):
return self.encoder[self.pad_token]
@property
def eod_token(self):
return '<eod>'
@property
def pad_token(self):
return '<pad>'
def get_sentinel_num(self):
return len(self.sentinel_list)
def get_sentinel_id(self, idx):
return self.sentinel_list[idx]
def tokenize(self, text):
""" Tokenize a string. """
output_tokens = []
for x in jieba.cut(text, cut_all=False):
x = x.translate(self.translator)
output_tokens.extend(self.wordpiece_tokenizer.tokenize(x))
# print(output_tokens)
return output_tokens
def encode(self, text):
output_tokens = [self.encoder[x] for x in self.tokenize(text)]
# filter space
new_output_tokens = [output_tokens[0]]
for i, x in enumerate(output_tokens[1:-1]):
if x == 10:
if self.en_vocab[output_tokens[i]] and self.en_vocab[output_tokens[i + 2]]:
continue
new_output_tokens.append(x)
new_output_tokens.append(output_tokens[-1])
return new_output_tokens
def decode(self, tokens):
new_tokens = []
for i, x in enumerate(tokens[:-1]):
if self.en_vocab[x] and self.en_vocab[tokens[i + 1]]:
new_tokens.append(x)
new_tokens.append(10)
else:
new_tokens.append(x)
new_tokens.append(tokens[-1])
# text = ''.join([self.decoder[x] for x in new_tokens])
# text = text.replace('\u2582', ' ').replace('\u2583', '\n')
# return text
return [self.decoder[x] for x in tokens]
class IdentitySplitter(object):
@staticmethod
def tokenize(*text):
return text
class Encoder(object):
def __init__(self, vocab_path, length, sentence_splitter):
self.vocab_path = vocab_path
self.length = length
self.sentence_splitter = sentence_splitter
self.tokenizer = EncDecTokenizer(os.path.join(self.vocab_path))
self.splitter = IdentitySplitter()
def initializer(self):
# Use Encoder class as a container for global data
pass
def encode(self, line):
# end with <eod>
if len(line) > 20000:
return None, 0
if len(line) < 10:
return None, 0
data = line.strip().strip('<n>')
data = data.replace("<n>", "\n")
doc_ids = self.tokenizer.encode(data)
doc_ids.append(self.tokenizer.eod_id)
return doc_ids, len(line)
@DATASETS.register_module
class YuanDataset(Dataset):
"""
Yuan is an open source Chinese dataset, which can be accessed on https://github.com/Shawn-Inspur/Yuan-1.0.
Args:
path(str): Path to dataset's folder, raw data should be organized under the folder as 001.txt, 002.txt...
eg:/path/yuan/dataset
vocab_path(str): Path to the vocab file. eg:/path/yuan/vocab.txt
seq_len(int): Sequence length of the transformer, defaults to 2048.
"""
def __init__(self, path, vocab_path, seq_len=2048) -> None:
super().__init__()
self.input_path = path
workers = 16
sentence_splitter = None
self.vocab_path = vocab_path
self.pad_id = EncDecTokenizer(os.path.join(self.vocab_path)).pad_id
self.length = seq_len
if self.input_path[-1] == '/':
self.input_path = self.input_path[:-1]
if os.path.exists(os.path.join(self.input_path, 'data_list.pt')):
self.data_path = torch.load(os.path.join(self.input_path, 'data_list.pt'))
return
fin_list = glob.glob(self.input_path + '/0[0-9][0-9].txt')
self.data_path = []
for fin_path in fin_list:
if not os.path.exists(fin_path):
continue
if '.txt' not in fin_path:
continue
all_data = []
print("Processing ", fin_path)
with open(fin_path, 'r', encoding='utf-8', errors='ignore') as fin:
encoder = Encoder(self.vocab_path, seq_len, sentence_splitter)
pool = multiprocessing.Pool(workers, initializer=encoder.initializer)
encoded_docs = pool.imap_unordered(encoder.encode, fin, 30)
for i, (no_noise_tokens, bytes_processed) in tqdm(enumerate(encoded_docs, start=1)):
if no_noise_tokens is None:
continue
all_data.append(no_noise_tokens)
pool.close()
print('Saving ', fin_path)
base_path = fin_path.replace('.txt', '')
if not os.path.exists(base_path):
os.mkdir(base_path)
idx = 0
for d in tqdm(all_data):
idx += 1
cur_path = os.path.join(base_path, str(idx) + '.txt')
with open(cur_path, 'w+', encoding='utf-8') as f:
for i in d:
f.write(str(i) + ' ')
f.write('\n')
self.data_path.append(cur_path.replace(self.input_path + '/', ''))
torch.save(self.data_path, os.path.join(self.input_path, 'data_list.pt'))
def __len__(self):
return len(self.data_path)
def __getitem__(self, index):
path = self.data_path[index]
root = os.path.join(self.input_path, path)
with open(root, "r") as f:
data = f.readlines()
assert len(data) == 1
data = data[0][:-2].split(' ')
try:
data = list(map(int, data))
except:
while '' in data:
data.remove('')
data = list(map(int, data))
if len(data) > self.length:
data = data[:self.length - 1] + [data[-1]]
mask = [1] * self.length
else:
data += [self.pad_id] * (self.length - len(data))
mask = [1] * len(data) + [0] * (self.length - len(data))
data = torch.tensor(data)
mask = torch.tensor(mask)
return {'input_ids': data, 'attention_mask': mask}, data
if __name__ == '__main__':
dataset = YuanDataset('/data/gpt-yuan/ASC22/dataset', vocab_path='/data/gpt-yuan/ASC22/vocab.txt', seq_len=2048)
test = dataset.__getitem__(0)
print(test)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 1
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 2
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, mode='1d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 4
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 4
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, mode='2d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 4
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 8
DEPTH = 2
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, depth=DEPTH, mode='2.5d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 4
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 8
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, mode='3d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
#from model_zoo.gpt.gpt import gpt2_small_pipeline
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 8
SEQ_LEN = 1024
NUM_EPOCHS = 60
HIDDEN_SIZE = 768
NUM_MICRO_BATCHES = 4
PIPELINE = 2
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=PIPELINE,
tensor=dict(size=1, mode=None),
)
import torch
from titans.loss.lm_loss import GPTLMLoss
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 8
NUM_EPOCHS = 60
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
MODE = '1d'
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=PIPELINE, tensor=dict(mode=MODE, size=TENSOR_PARALLEL))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt2_small,
checkpoint=True,
dtype=torch.half,
)
loss_fn = dict(type=vocab_parallel_cross_entropy)
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 1
NUM_EPOCHS = 60
SEQ_LEN = 1024
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=1, mode=None),
)
from titans.model.gpt import gpt2_small
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import TensorShardStrategy
BATCH_SIZE = 2
NUM_EPOCHS = 60
SEQ_LEN = 1024
zero = dict(model_config=dict(tensor_placement_policy='auto',
shard_strategy=TensorShardStrategy(),
reuse_fp16_shard=True),
optimizer_config=dict())
optimizer = dict(
type=HybridAdam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt2_small,
checkpoint=True,
)
from model import GPT2_small_pipeline_hybrid
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import BucketTensorShardStrategy, TensorShardStrategy
BATCH_SIZE = 8
NUM_EPOCHS = 60
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4
HIDDEN_SIZE = 768
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)
zero = dict(model_config=dict(tensor_placement_policy='cpu', shard_strategy=BucketTensorShardStrategy()),
optimizer_config=dict())
optimizer = dict(
type=HybridAdam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(type=GPT2_small_pipeline_hybrid, checkpoint=True, num_chunks=1)
parallel = dict(
pipeline=2,
tensor=dict(size=2, mode='1d'),
)
import torch
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 192
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 192
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, 12288)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=32, tensor=dict(mode='1d', size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
loss_fn = dict(type=vocab_parallel_cross_entropy)
import torch
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 192
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 192
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, 12288)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=24, tensor=dict(mode='1d', size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
loss_fn = dict(type=vocab_parallel_cross_entropy)
import torch
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 2 * 48
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 48
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES // 2, SEQ_LEN, 12288 // 2)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=24, tensor=dict(mode='2d', size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
import torch
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 2 * 48
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 48
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES // 2, SEQ_LEN, 12288 // 2)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=24, tensor=dict(mode='2.5d', depth=1, size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
export NODE_RANK=${NODE_RANK:-0}
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
export MASTER_PORT=${MASTER_PORT:-"12345"}
env OMP_NUM_THREADS=16 torchrun --standalone --nproc_per_node=2 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch 2>&1 | tee logs/log
# coding=utf-8
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import re
import sys
import time
import tldextract
# List of the domains to blacklist.
domain_blacklist = set([
'500px',
'aapks',
'akamaihd',
'amazon',
'apple',
'artifactfire',
'artstation',
'awwni',
'bandcamp',
'battleforthenet',
'coinscalendar',
'dailymotion',
'deviantart',
'discord',
'discordapp',
'dlapkandroid',
'dropbox',
'e621',
'ebay',
'edealinfo',
'erome',
'eroshare',
'explosm',
'facebook',
'fbcdn',
'flickr',
'furaffinity',
'futhead',
'gatopardo',
'gfycat',
'gifsound',
'gifsoup',
'giphy',
'github',
'google',
'gunprime',
'gyazo',
'hotdealstar',
'imagefap',
'imageshack',
'imgflip',
'imgur',
'instagram',
'karmadecay',
'kryptocal',
'kym-cdn',
'liveleak',
'livememe',
'lmgtfy',
'magaimg',
'memegenerator',
'minorplanetcenter',
'minus',
'mobafire',
'morejpeg',
'nocookie',
'pcpartpicker',
'photobucket',
'pinimg',
'pinterest',
'pixiv',
'pornhub',
'prntscr',
'puu',
'qkme',
'quickmeme',
'radd',
'redd',
'reddit',
'reddit-stream',
'redditlog',
'redditmedia',
'reddituploads',
'redtube',
'reupp',
'reverb',
'roanoke',
'rollingstone',
'sli',
'soundcloud',
'soundgasm',
'spankbang',
'spotify',
'strawpoll',
'streamable',
'timeanddate',
'tinypic',
'touhouradio',
'tumblr',
'twimg',
'twitch',
'twitter',
'vid',
'vimeo',
'vine',
'vkaao',
'vocaroo',
'voyagefusion',
'walmart',
'wciu',
'wikimedia',
'wikipedia',
'xhamster',
'xkcd',
'xvideos',
'youtu',
'youtube',
'youtubedoubler',
'ytimg',
'zillexplorer',
])
def domain_is_in_blacklist(url):
domain = tldextract.extract(url).domain
return domain in domain_blacklist
# List of extentions to blacklist.
extentions_blacklist = (
'.3gp',
'.7z',
'.ai',
'.aif',
'.apk',
'.app',
'.avi',
'.bin',
'.bmp',
'.bz2',
'.css',
'.csv',
'.dat',
'.deb',
'.dmg',
'.doc',
'.docx',
'.exe',
'.gif',
'.gifv',
'.gz',
'.iso',
'.jar',
'.jpeg',
'.jpg',
'.js',
'.log',
'.mid',
'.midi',
'.mkv',
'.mov',
'.mp3',
'.mp4',
'.mpeg',
'.mpg',
'.ogg',
'.ogv',
'.otf',
'.pdf',
'.pkg',
'.png',
'.pps',
'.ppt',
'.pptx',
'.psd',
'.py',
'.qt',
'.ram',
'.rar',
'.sql',
'.svg',
'.swf',
'.tar.gz',
'.tar',
'.tgz',
'.tiff',
'.ttf',
'.txt',
'.wav',
'.webm',
'.wma',
'.wmv',
'.xls',
'.xlsx',
'.xml',
'.xz',
'.zip',
)
def extention_is_in_blacklist(url):
if url.split('?')[0].lower().endswith(extentions_blacklist):
return True
return False
# Malformed urls.
# This function is adapted from:
# https://stackoverflow.com/questions/7160737/python-how-to-validate-a-url-in-python-malformed-or-not
url_regex = re.compile(
r'^(?:http)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$',
re.IGNORECASE)
def url_is_malformed(url):
return re.match(url_regex, url) is None
def print_progress(prefix, start_time, urls_counter, domain_blacklist_counter, extention_blacklist_counter,
short_url_counter, malformed_url_counter, duplicate_url_counter):
string = prefix + ' | '
string += 'time elapsed (s): {:.2f} | '.format(time.time() - start_time)
string += 'number of urls: {} | '.format(urls_counter)
string += 'domain blacklisted: {} | '.format(domain_blacklist_counter)
string += 'extention blacklisted: {} | '.format(extention_blacklist_counter)
string += 'short urls (<=8): {} | '.format(short_url_counter)
string += 'malformed urls: {} | '.format(malformed_url_counter)
string += 'duplicate urls: {}'.format(duplicate_url_counter)
print(string, flush=True)
if __name__ == '__main__':
print('remove blacklisted urls ..')
# Path to the url files.
path = sys.argv[1]
# Output url file.
output = sys.argv[2]
# Get the list of url files.
files = glob.glob(path + '/*.txt')
print('> found {} files'.format(len(files)))
urls = set()
urls_counter = 0
domain_blacklist_counter = 0
extention_blacklist_counter = 0
short_url_counter = 0
malformed_url_counter = 0
duplicate_url_counter = 0
start_time = time.time()
for filename in files:
with open(filename, 'r') as f:
for line in f:
url = line.strip()
urls_counter += 1
if domain_is_in_blacklist(url):
print('[DOMAIN BLACKLIST]: {}'.format(url), flush=True)
domain_blacklist_counter += 1
elif extention_is_in_blacklist(url):
print('[EXTENTION BLACKLIST]: {}'.format(url), flush=True)
extention_blacklist_counter += 1
elif len(url) <= 8:
print('[SHORT URL]: {}'.format(url), flush=True)
short_url_counter += 1
elif url_is_malformed(url):
print('[MALFORMED URL]: {}'.format(url), flush=True)
malformed_url_counter += 1
elif url in urls:
print('[DUPLICATE URL]: {}'.format(url), flush=True)
duplicate_url_counter += 1
else:
urls.add(url)
if urls_counter % 100000 == 0:
print_progress('PROGRESS', start_time, urls_counter, domain_blacklist_counter,
extention_blacklist_counter, short_url_counter, malformed_url_counter,
duplicate_url_counter)
print_progress('FINAL', start_time, urls_counter, domain_blacklist_counter, extention_blacklist_counter,
short_url_counter, malformed_url_counter, duplicate_url_counter)
# Write the final set of urls.
print('> writing cleaned up url list to {}'.format(output))
with open(output, 'w') as f:
for url in urls:
f.write(url + '\n')
print('done :-)')