# Run GPT With Colossal-AI
## Overview
In Colossal-AI, there are many ways to run GPT in a distributed manner. The `train_gpt.py` script runs training with the configuration scripts in `gpt2_configs/`, each of which covers a different parallelism strategy for GPT-2. We provide several example GPT-2 configuration files that you can modify to fit your own use case.
## How to Prepare Webtext Dataset
We do not host any datasets for GPT or BERT training, however, we provide a detailed guide on how to prepare the dataset so that our results may be reproduced.
### Overview
We utilize the publicly available [OpenWebText](https://github.com/eukaryote31/openwebtext) library, building on the work of [jcpeterson](https://github.com/jcpeterson/openwebtext) and [eukaryote31](https://github.com/eukaryote31/openwebtext), to download URLs of different web pages. We then filter, clean, and deduplicate all downloaded content according to the procedure described in the following section.
### Install necessary packages
**Note: LSH requires an earlier version of GCC. We have verified that GCC 9.3.0 works, while 10.3.0 does not.**
```bash
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract cached-path
git clone https://github.com/mattilyra/LSH.git
cd LSH
python setup.py install
```
If you cannot install it successfully, you may try replacing the `cMinhash.cpp` in `LSH/lsh` with ours, which is provided in `tools/lsh/cMinhash.cpp`.
### Download Data
1. Download the deduplicated URLs from [jcpeterson](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ!cc4RgQQZ).
2. Unzip the file to get a folder `URLs`, which contains many txt files of URLs.
3. Remove blacklisted URLs.
*We appreciate Megatron-LM for making the data preprocessing code public. We have forked Megatron-LM and fixed some bugs. For your convenience, we have collated the needed files in `tools/Megatron`. Click [here](https://github.com/NVIDIA/Megatron-LM.git) to check the source code of Megatron-LM.*
```bash
cd path/to/tools
python Megatron/blacklist_urls.py <path/to/URLs> <path/to/clean_urls.txt>
```
4. Download the content from the cleaned URLs and merge it into one loose JSON file, with one JSON object per line in the format `{'text': text, 'url': unique_url}`.
*We have forked and modified [openwebtext](https://github.com/yet-another-account/openwebtext) as there are some bugs in it. For your convenience, we provide our modified version in `tools/download`.*
```bash
python download/download.py <path/to/clean_urls.txt> --n_procs 50 --output <path/to/raw.json>
```
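Each line of the resulting file is a standalone JSON object. As a quick sanity check (a minimal sketch; `raw.json` is the output path used above, and the field names follow the format described in this step), you can inspect a few records like this:
```Python
import json

# Peek at the first few records of the loose JSON file produced by the download step.
# Each line is expected to be a JSON object with 'text' and 'url' keys.
with open('raw.json') as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record['url'], '-', len(record['text']), 'characters')
        if i >= 4:
            break
```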
### Prepare Data for GPT Training
1. Perform ftfy cleanup and English detection, and remove documents with fewer than 128 tokens. This step can be sharded and run on shards.
```bash
python Megatron/cleanup_dataset.py <path/to/raw.json> <path/to/clean.json>
```
Additional cleanup (e.g. removing documents with fewer than 512 characters, or dataset-specific cleaning for the stories and realnews datasets) can be done using `cleanup_fix_dataset.py`. More details can be found by running `python cleanup_fix_dataset.py --help`.
2. Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplication, and is also multithreaded for faster processing. More details can be found by running `python find_duplicates.py --help`.
```bash
python Megatron/find_duplicates.py --inputs <path/to/clean.json> url --output <path/to/process_stage_one.json>
```
3. Based on the similarity measure defined inside the `is_similar` function (default threshold: 0.9), group URLs that are similar. For each group, keep only one URL and remove the rest.
```bash
python Megatron/group_duplicate_url.py <path/to/process_stage_one.json> <path/to/process_stage_two.json>
```
4. Remove the similar documents detected in the previous step. `dedup.json` contains the data after deduplication.
```bash
python Megatron/remove_group_duplicates.py <path/to/process_stage_two.json> <path/to/clean.json> <path/to/dedup.json>
```
5. Shuffle the dataset.
```bash
shuf <path/to/dedup.json> -o <path/to/train_data.json>
```
## How to Prepare Yuan Dataset
### Overview
The Yuan dataset is a large-scale Chinese dataset with 1 TB of high-quality text, released by Inspur. You can apply at https://air.inspur.com/home to get access to the dataset. Download and load all content according to the procedure described in the following section.
### Download
The dataset can be downloaded from the website once your application is approved.
You also need to download the vocab file from https://github.com/Shawn-Inspur/Yuan-1.0/blob/main/src/vocab.txt
The final data directory should be organized as:
```
|--dataset
| |--001.txt
| |--002.txt
| |--...
|--vocab.txt
```
### Process & Load
Before you run the code, you should replace line 44 in `train_gpt.py` with:
```Python
from dataset.yuan import YuanDataset

train_ds = YuanDataset(os.environ['DATA'], vocab_path='/path/to/data/vocab.txt', seq_len=gpc.config.SEQ_LEN)
```
Then you can run the model following the Usage section. The dataset will be processed and cached the first time you run it; afterwards the data is loaded from the cache automatically.
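As a quick sanity check (a minimal sketch with placeholder paths; it assumes the example's `dataset.yuan` module is importable), you can load the dataset directly and look at one sample:
```Python
from dataset.yuan import YuanDataset

# Placeholder paths: point them at your own Yuan data folder and vocab file.
train_ds = YuanDataset('/path/to/data/dataset', vocab_path='/path/to/data/vocab.txt', seq_len=2048)
inputs, labels = train_ds[0]
print(inputs['input_ids'].shape, inputs['attention_mask'].shape)  # both torch.Size([2048])
```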
## **Usage**
```Bash
#!/usr/bin/env sh
export DATA=/path/to/train_data.json
colossalai run --nproc_per_node=<num_gpus> train_gpt.py --config=gpt2_configs/<config_file>
```
You can copy it and save it as `run.sh`. Then use `bash ./run.sh` to run the script in your terminal.
Please modify `DATA`, `num_gpus` and `config_file` with the path to your dataset, the number of GPUs and the config file path, respectively.
If you are going to train GPT-3, just replace `gpt2_configs` with `gpt3_configs`.
## GPT-2
Here are the default parameters of the GPT-2 configs:
| config | scale | GPU* | batch size | MiB of each GPU | TP | PP | DP |
| ------------ | ----- | ---- | ----------- | --------------- | --- | --- | --- |
| gpt2-vanilla | small | 1 | 1 | 6071 | 1 | 1 | 1 |
| gpt2-vanilla | small | 2 | 1 | 6449*2 | 1 | 1 | 2 |
| gpt2-1d | small | 2 | 1 | 5287*2 | 2 | 1 | 1 |
| gpt2-2d | small | 4 | 1 | 4590*4 | 4 | 1 | 1 |
| gpt2-2.5d | small | 8 | 1 | 4815*8 | 8 | 1 | 1 |
| gpt2-3d | small | 8 | 1 | 4901*8 | 8 | 1 | 1 |
| gpt2-pp | small | 2 | 1 | 5877*2 | 1 | 2 | 1 |
| gpt2-zero2 | small | 1 | 1 | 5459 | 1 | 1 | 1 |
| gpt2-zero3 | small | 1 | 1 | 6577 | 1 | 1 | 1 |
| gpt2-nvme | small | 1 | 1 | 5067 | 1 | 1 | 1 |
| gpt2-pp1d | small | 8 | 8 | 5411*8 | 2 | 2 | 2 |
*\*Note: For GPUs, we use Nvidia A100 80G.*
*\*Note: Results of ZeRO are outdated, we will update them soon.*
**We set `TENSOR_PARALLEL`, `PIPELINE_PARALLEL` and `DATA_PARALLEL` as small as possible so that every demo can run with the least number of GPUs.**
### **Modify the config file**
#### **General**
There are some **general rules** when modifying the config files.
```text
TP denotes Tensor Parallel
PP denotes Pipeline Parallel
DP denotes Data Parallel
GPUS = TP * PP * DP
where DP is set automatically
```
You can set the **batch size** and the **epoch** number by changing the values of
`BATCH_SIZE` and `NUM_EPOCHS`, respectively. Next, we will introduce the config file of each mode.
Please note that `gpt2_zero3.py` has nothing but `BATCH_SIZE` and `NUM_EPOCHS` to change.
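For example (an illustrative combination rather than one of the provided configs), launching on 8 GPUs with TP = 2 and PP = 2 leaves DP = 2 to be set automatically:
```Python
# Illustration of the general rules above; not a provided config file.
TENSOR_PARALLEL = 2   # TP
PIPELINE = 2          # PP
NUM_GPUS = 8          # e.g. launched with --nproc_per_node=8

# DP is set automatically so that GPUS = TP * PP * DP
DATA_PARALLEL = NUM_GPUS // (TENSOR_PARALLEL * PIPELINE)   # 2
assert TENSOR_PARALLEL * PIPELINE * DATA_PARALLEL == NUM_GPUS
```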
#### **Vanilla & Data Parallel**
`Vanilla` is the basic mode of GPT-2 with no parallelism at all. However, if you use more than 1 GPU and TP * PP is smaller than the number of GPUs, Colossal-AI will **set DP for you automatically**.
#### **1D, 2D, 2.5D, 3D**
In files `gpt2_1d.py, gpt2_2d.py, gpt2_2p5d.py, gpt2_3d.py`, there is a line:
```Python
TENSOR_PARALLEL = 2
```
You can modify it to use a larger tensor-parallel size, as long as the general rules are satisfied.
In particular, `TENSOR_PARALLEL` should be a square number for 2D and a cubic number for 3D,
and `TENSOR_PARALLEL / DEPTH` should be a square number for 2.5D.
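For instance, `TENSOR_PARALLEL = 4` is valid for 2D (2 x 2), 8 is valid for 2.5D with `DEPTH = 2` (8 / 2 = 4 is a square), and 8 is valid for 3D (2 x 2 x 2). The sketch below (not part of the example scripts) encodes these rules:
```Python
import math

# Sketch of the constraints described above; not part of the example scripts.
def is_valid_tensor_parallel(size, mode, depth=1):
    if mode == '1d':
        return size >= 1
    if mode == '2d':      # must be a square number
        return round(math.sqrt(size)) ** 2 == size
    if mode == '2.5d':    # size / depth must be a square number
        return size % depth == 0 and is_valid_tensor_parallel(size // depth, '2d')
    if mode == '3d':      # must be a cubic number
        return round(size ** (1 / 3)) ** 3 == size
    return False

assert is_valid_tensor_parallel(4, '2d')             # 2 x 2
assert is_valid_tensor_parallel(8, '2.5d', depth=2)  # 8 / 2 = 4
assert is_valid_tensor_parallel(8, '3d')             # 2 x 2 x 2
```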
#### **Pipeline Parallel**
To use pipeline parallel training, you should install colossalai from the **latest** main branch.
In `gpt2_pp.py`, there are lines:
```Python
# BATCH_SIZE / NUM_MICRO_BATCHES should be an integer
NUM_MICRO_BATCHES = 1
PIPELINE = 2
```
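`NUM_MICRO_BATCHES` splits each batch into micro-batches for the pipeline schedule, so `BATCH_SIZE / NUM_MICRO_BATCHES` must be an integer. For example (illustrative values only):
```Python
# Illustrative values only: each pipeline micro-batch holds BATCH_SIZE / NUM_MICRO_BATCHES samples.
BATCH_SIZE = 8
NUM_MICRO_BATCHES = 4
assert BATCH_SIZE % NUM_MICRO_BATCHES == 0
micro_batch_size = BATCH_SIZE // NUM_MICRO_BATCHES   # 2 samples per micro-batch
```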
#### **Pipeline + 1D + Data Parallel**
In `gpt2_pp1d.py`, we have
```Python
BATCH_SIZE = 8
NUM_EPOCHS = 60
NUM_MICRO_BATCHES = 1
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
MODE = '1d'
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)
```
We have introduced `BATCH_SIZE`, `NUM_EPOCHS`, `NUM_MICRO_BATCHES`, `PIPELINE`, `TENSOR_PARALLEL` as discussed above.
`HIDDEN_SIZE` refers to the hidden dimension of the model, e.g. 768 for `gpt2_small`.
You can choose `None, '1d', '2d', '2.5d', '3d'` for `MODE`.
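As a quick consistency check against the table above (a sketch using the `gpt2_pp1d.py` values and the 8 GPUs of the `gpt2-pp1d` row):
```Python
# Values from gpt2_pp1d.py; 8 GPUs as in the gpt2-pp1d row of the table above.
BATCH_SIZE = 8
NUM_MICRO_BATCHES = 1
SEQ_LEN = 1024
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
NUM_GPUS = 8

TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)   # (8, 1024, 768)
DATA_PARALLEL = NUM_GPUS // (PIPELINE * TENSOR_PARALLEL)                 # 2, matching the table
```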
## GPT-3
GPT-3 is a huge model, so it is not feasible to train it with a small number of GPUs. Therefore, we choose some common sets of parameters instead of the smallest possible ones.
Here are our default parameters of GPT-3 configs:
| config | GPU* | batch size | TP | PP | DP |
| -------------- | ---- | ---------- | --- | --- | --- |
| gpt3_pp1d_min | 96 | 192 | 4 | 24 | 1 |
| gpt3_pp1d | 128 | 192 | 4 | 32 | 1 |
| gpt3_pp2d | 96 | 2*48 | 4 | 24 | 1 |
| gpt3_pp2p5d | 96 | 2*48 | 4 | 24 | 1 |
| gpt3_zero3_min | 64 | 3 | 1 | 1 | 64 |
| gpt3_zero3 | 96 | 2 | 1 | 1 | 96 |
*\*Note: we use Nvidia A100 40G GPUs*
*\*Note: Results of ZeRO are outdated, we will update them soon.*
In the table above, the suffix `_min` means the set of hyper-parameters that requires the least number of GPUs for the same mode.
GPT-3 and GPT-2 configs share the same set of hyper-parameters.
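For example, in `gpt3_pp1d_min` the 96 GPUs come from 4-way tensor parallelism times 24 pipeline stages, and the global batch of 192 is split into 192 micro-batches (a sketch of the arithmetic, not an additional config):
```Python
# Arithmetic for the gpt3_pp1d_min row above; not an additional config file.
TENSOR_PARALLEL = 4
PIPELINE = 24
DATA_PARALLEL = 1
NUM_GPUS = TENSOR_PARALLEL * PIPELINE * DATA_PARALLEL    # 96

BATCH_SIZE = 192
NUM_MICRO_BATCHES = 192
micro_batch_size = BATCH_SIZE // NUM_MICRO_BATCHES       # 1 sample per micro-batch
```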
import json
import os
import torch
from torch.utils.data import Dataset
from colossalai.registry import DATASETS
from transformers import GPT2Tokenizer
@DATASETS.register_module
class WebtextDataset(Dataset):
def __init__(self, path, seq_len=1024) -> None:
super().__init__()
root = os.path.dirname(path)
encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
if os.path.isfile(encoded_data_cache_path):
seq_len_, data, attention_mask = torch.load(encoded_data_cache_path)
if seq_len_ == seq_len:
self.data = data
self.attention_mask = attention_mask
return
raw_data = []
with open(path) as f:
for line in f.readlines():
raw_data.append(json.loads(line)['text'])
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.unk_token
encoded_data = tokenizer(raw_data, padding=True, truncation=True, max_length=seq_len, return_tensors='pt')
self.data = encoded_data['input_ids']
self.attention_mask = encoded_data['attention_mask']
torch.save((seq_len, self.data, self.attention_mask), encoded_data_cache_path)
def __len__(self):
return len(self.data)
def __getitem__(self, index):
return {'input_ids': self.data[index], 'attention_mask': self.attention_mask[index]}, self.data[index]
import collections
import glob
import logging
import multiprocessing
import os
import sys
import jieba
import six
import torch
from tools.tokenization_enc_dec import EncDecTokenizer
from torch.utils.data import Dataset
from tqdm import tqdm
from colossalai.registry import DATASETS
try:
import nltk
nltk_available = True
except ImportError:
nltk_available = False
jieba.setLogLevel(logging.INFO)
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
def is_contain_chinese(check_str):
for ch in check_str:
if u'\u4e00' <= ch <= u'\u9fff':
return True
return False
def convert_to_unicode(text):
"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
if six.PY3:
if isinstance(text, str):
return text
elif isinstance(text, bytes):
return text.decode("utf-8", "ignore")
else:
raise ValueError("Unsupported string type: %s" % (type(text)))
else:
raise ValueError("Should be running on Python 3")
class WordpieceTokenizer(object):
def __init__(self, vocab, unk_token="<unk>", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word
def tokenize(self, token):
token = convert_to_unicode(token)
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
return [self.unk_token]
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if is_contain_chinese(substr):
if substr in self.vocab:
cur_substr = substr
break
else:
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
sub_tokens.append(self.unk_token)
start += 1
continue
sub_tokens.append(cur_substr)
start = end
return sub_tokens
def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
index = 0
with open(vocab_file, "r", encoding='utf-8') as reader:
while True:
token = convert_to_unicode(reader.readline())
if not token:
break
token = token.strip()
vocab[token] = index
index += 1
return vocab
class EncDecTokenizer(object):
def __init__(self, vocab_file, max_len=None, max_sentinels=190):
self.max_len = max_len if max_len is not None else int(1e12)
self.encoder = load_vocab(vocab_file)
self.decoder = {v: k for k, v in self.encoder.items()}
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder)
self.translator = str.maketrans(" \n", "\u2582\u2583")
self.sentinel_list = [self.encoder['<s_{}>'.format(i)] for i in range(max_sentinels)]
self.en_vocab = {}
for k, v in self.encoder.items():
if is_contain_chinese(k):
self.en_vocab[v] = False
else:
self.en_vocab[v] = True
self.en_vocab[10] = False
@property
def vocab_size(self):
return len(self.encoder)
def __len__(self):
return len(self.encoder)
@property
def eod_id(self):
return self.encoder[self.eod_token]
@property
def pad_id(self):
return self.encoder[self.pad_token]
@property
def eod_token(self):
return '<eod>'
@property
def pad_token(self):
return '<pad>'
def get_sentinel_num(self):
return len(self.sentinel_list)
def get_sentinel_id(self, idx):
return self.sentinel_list[idx]
def tokenize(self, text):
""" Tokenize a string. """
output_tokens = []
for x in jieba.cut(text, cut_all=False):
x = x.translate(self.translator)
output_tokens.extend(self.wordpiece_tokenizer.tokenize(x))
# print(output_tokens)
return output_tokens
def encode(self, text):
output_tokens = [self.encoder[x] for x in self.tokenize(text)]
# filter space
new_output_tokens = [output_tokens[0]]
for i, x in enumerate(output_tokens[1:-1]):
if x == 10:
if self.en_vocab[output_tokens[i]] and self.en_vocab[output_tokens[i + 2]]:
continue
new_output_tokens.append(x)
new_output_tokens.append(output_tokens[-1])
return new_output_tokens
def decode(self, tokens):
new_tokens = []
for i, x in enumerate(tokens[:-1]):
if self.en_vocab[x] and self.en_vocab[tokens[i + 1]]:
new_tokens.append(x)
new_tokens.append(10)
else:
new_tokens.append(x)
new_tokens.append(tokens[-1])
# text = ''.join([self.decoder[x] for x in new_tokens])
# text = text.replace('\u2582', ' ').replace('\u2583', '\n')
# return text
return [self.decoder[x] for x in tokens]
class IdentitySplitter(object):
@staticmethod
def tokenize(*text):
return text
class Encoder(object):
def __init__(self, vocab_path, length, sentence_splitter):
self.vocab_path = vocab_path
self.length = length
self.sentence_splitter = sentence_splitter
self.tokenizer = EncDecTokenizer(os.path.join(self.vocab_path))
self.splitter = IdentitySplitter()
def initializer(self):
# Use Encoder class as a container for global data
pass
def encode(self, line):
# end with <eod>
if len(line) > 20000:
return None, 0
if len(line) < 10:
return None, 0
data = line.strip().strip('<n>')
data = data.replace("<n>", "\n")
doc_ids = self.tokenizer.encode(data)
doc_ids.append(self.tokenizer.eod_id)
return doc_ids, len(line)
@DATASETS.register_module
class YuanDataset(Dataset):
"""
Yuan is an open source Chinese dataset, which can be accessed on https://github.com/Shawn-Inspur/Yuan-1.0.
Args:
path(str): Path to dataset's folder, raw data should be organized under the folder as 001.txt, 002.txt...
eg:/path/yuan/dataset
vocab_path(str): Path to the vocab file. eg:/path/yuan/vocab.txt
seq_len(int): Sequence length of the transformer, defaults to 2048.
"""
def __init__(self, path, vocab_path, seq_len=2048) -> None:
super().__init__()
self.input_path = path
workers = 16
sentence_splitter = None
self.vocab_path = vocab_path
self.pad_id = EncDecTokenizer(os.path.join(self.vocab_path)).pad_id
self.length = seq_len
if self.input_path[-1] == '/':
self.input_path = self.input_path[:-1]
if os.path.exists(os.path.join(self.input_path, 'data_list.pt')):
self.data_path = torch.load(os.path.join(self.input_path, 'data_list.pt'))
return
fin_list = glob.glob(self.input_path + '/0[0-9][0-9].txt')
self.data_path = []
for fin_path in fin_list:
if not os.path.exists(fin_path):
continue
if '.txt' not in fin_path:
continue
all_data = []
print("Processing ", fin_path)
with open(fin_path, 'r', encoding='utf-8', errors='ignore') as fin:
encoder = Encoder(self.vocab_path, seq_len, sentence_splitter)
pool = multiprocessing.Pool(workers, initializer=encoder.initializer)
encoded_docs = pool.imap_unordered(encoder.encode, fin, 30)
for i, (no_noise_tokens, bytes_processed) in tqdm(enumerate(encoded_docs, start=1)):
if no_noise_tokens is None:
continue
all_data.append(no_noise_tokens)
pool.close()
print('Saving ', fin_path)
base_path = fin_path.replace('.txt', '')
if not os.path.exists(base_path):
os.mkdir(base_path)
idx = 0
for d in tqdm(all_data):
idx += 1
cur_path = os.path.join(base_path, str(idx) + '.txt')
with open(cur_path, 'w+', encoding='utf-8') as f:
for i in d:
f.write(str(i) + ' ')
f.write('\n')
self.data_path.append(cur_path.replace(self.input_path + '/', ''))
torch.save(self.data_path, os.path.join(self.input_path, 'data_list.pt'))
def __len__(self):
return len(self.data_path)
def __getitem__(self, index):
path = self.data_path[index]
root = os.path.join(self.input_path, path)
with open(root, "r") as f:
data = f.readlines()
assert len(data) == 1
data = data[0][:-2].split(' ')
try:
data = list(map(int, data))
except:
while '' in data:
data.remove('')
data = list(map(int, data))
if len(data) > self.length:
data = data[:self.length - 1] + [data[-1]]
mask = [1] * self.length
else:
data += [self.pad_id] * (self.length - len(data))
mask = [1] * len(data) + [0] * (self.length - len(data))
data = torch.tensor(data)
mask = torch.tensor(mask)
return {'input_ids': data, 'attention_mask': mask}, data
if __name__ == '__main__':
dataset = YuanDataset('/data/gpt-yuan/ASC22/dataset', vocab_path='/data/gpt-yuan/ASC22/vocab.txt', seq_len=2048)
test = dataset.__getitem__(0)
print(test)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 1
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 2
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, mode='1d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 4
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 4
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, mode='2d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 4
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 8
DEPTH = 2
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, depth=DEPTH, mode='2.5d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 4
SEQ_LEN = 1024
NUM_EPOCHS = 60
TENSOR_PARALLEL = 8
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=TENSOR_PARALLEL, mode='3d'),
)
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_small
#from model_zoo.gpt.gpt import gpt2_small_pipeline
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 8
SEQ_LEN = 1024
NUM_EPOCHS = 60
HIDDEN_SIZE = 768
NUM_MICRO_BATCHES = 4
PIPELINE = 2
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss,)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=PIPELINE,
tensor=dict(size=1, mode=None),
)
import torch
from titans.loss.lm_loss import GPTLMLoss
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 8
NUM_EPOCHS = 60
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
MODE = '1d'
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=PIPELINE, tensor=dict(mode=MODE, size=TENSOR_PARALLEL))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt2_small,
checkpoint=True,
dtype=torch.half,
)
loss_fn = dict(type=vocab_parallel_cross_entropy)
from titans.model.gpt import gpt2_small
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 1
NUM_EPOCHS = 60
SEQ_LEN = 1024
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
model = dict(
type=gpt2_small,
checkpoint=True,
)
parallel = dict(
pipeline=1,
tensor=dict(size=1, mode=None),
)
from titans.model.gpt import gpt2_small
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import TensorShardStrategy
BATCH_SIZE = 2
NUM_EPOCHS = 60
SEQ_LEN = 1024
zero = dict(model_config=dict(tensor_placement_policy='auto',
shard_strategy=TensorShardStrategy(),
reuse_fp16_shard=True),
optimizer_config=dict())
optimizer = dict(
type=HybridAdam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt2_small,
checkpoint=True,
)
from model import GPT2_small_pipeline_hybrid
from colossalai.nn.optimizer import HybridAdam
from colossalai.zero.shard_utils import BucketTensorShardStrategy, TensorShardStrategy
BATCH_SIZE = 8
NUM_EPOCHS = 60
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4
HIDDEN_SIZE = 768
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)
zero = dict(model_config=dict(tensor_placement_policy='cpu', shard_strategy=BucketTensorShardStrategy()),
optimizer_config=dict())
optimizer = dict(
type=HybridAdam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(type=GPT2_small_pipeline_hybrid, checkpoint=True, num_chunks=1)
parallel = dict(
pipeline=2,
tensor=dict(size=2, mode='1d'),
)
import torch
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 192
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 192
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, 12288)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=32, tensor=dict(mode='1d', size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
loss_fn = dict(type=vocab_parallel_cross_entropy)
import torch
from titans.loss.vocab_cross_entropy import vocab_parallel_cross_entropy
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 192
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 192
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, 12288)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=24, tensor=dict(mode='1d', size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
loss_fn = dict(type=vocab_parallel_cross_entropy)
import torch
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 2 * 48
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 48
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES // 2, SEQ_LEN, 12288 // 2)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=24, tensor=dict(mode='2d', size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
import torch
from titans.model.gpt import gpt3
from torch.optim import Adam
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 2 * 48
NUM_EPOCHS = 60
SEQ_LEN = 2048
NUM_MICRO_BATCHES = 48
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES // 2, SEQ_LEN, 12288 // 2)
fp16 = dict(mode=AMP_TYPE.NAIVE)
parallel = dict(pipeline=24, tensor=dict(mode='2.5d', depth=1, size=4))
optimizer = dict(
type=Adam,
lr=0.00015,
weight_decay=1e-2,
)
model = dict(
type=gpt3,
checkpoint=True,
dtype=torch.half,
)
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
export NODE_RANK=${NODE_RANK:-0}
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
export MASTER_PORT=${MASTER_PORT:-"12345"}
env OMP_NUM_THREADS=16 torchrun --standalone --nproc_per_node=2 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch 2>&1 | tee logs/log
# coding=utf-8
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import re
import sys
import time
import tldextract
# List of the domains to blacklist.
domain_blacklist = set([
'500px',
'aapks',
'akamaihd',
'amazon',
'apple',
'artifactfire',
'artstation',
'awwni',
'bandcamp',
'battleforthenet',
'coinscalendar',
'dailymotion',
'deviantart',
'discord',
'discordapp',
'dlapkandroid',
'dropbox',
'e621',
'ebay',
'edealinfo',
'erome',
'eroshare',
'explosm',
'facebook',
'fbcdn',
'flickr',
'furaffinity',
'futhead',
'gatopardo',
'gfycat',
'gifsound',
'gifsoup',
'giphy',
'github',
'google',
'gunprime',
'gyazo',
'hotdealstar',
'imagefap',
'imageshack',
'imgflip',
'imgur',
'instagram',
'karmadecay',
'kryptocal',
'kym-cdn',
'liveleak',
'livememe',
'lmgtfy',
'magaimg',
'memegenerator',
'minorplanetcenter',
'minus',
'mobafire',
'morejpeg',
'nocookie',
'pcpartpicker',
'photobucket',
'pinimg',
'pinterest',
'pixiv',
'pornhub',
'prntscr',
'puu',
'qkme',
'quickmeme',
'radd',
'redd',
'reddit',
'reddit-stream',
'redditlog',
'redditmedia',
'reddituploads',
'redtube',
'reupp',
'reverb',
'roanoke',
'rollingstone',
'sli',
'soundcloud',
'soundgasm',
'spankbang',
'spotify',
'strawpoll',
'streamable',
'timeanddate',
'tinypic',
'touhouradio',
'tumblr',
'twimg',
'twitch',
'twitter',
'vid',
'vimeo',
'vine',
'vkaao',
'vocaroo',
'voyagefusion',
'walmart',
'wciu',
'wikimedia',
'wikipedia',
'xhamster',
'xkcd',
'xvideos',
'youtu',
'youtube',
'youtubedoubler',
'ytimg',
'zillexplorer',
])
def domain_is_in_blacklist(url):
domain = tldextract.extract(url).domain
return domain in domain_blacklist
# List of extentions to blacklist.
extentions_blacklist = (
'.3gp',
'.7z',
'.ai',
'.aif',
'.apk',
'.app',
'.avi',
'.bin',
'.bmp',
'.bz2',
'.css',
'.csv',
'.dat',
'.deb',
'.dmg',
'.doc',
'.docx',
'.exe',
'.gif',
'.gifv',
'.gz',
'.iso',
'.jar',
'.jpeg',
'.jpg',
'.js',
'.log',
'.mid',
'.midi',
'.mkv',
'.mov',
'.mp3',
'.mp4',
'.mpeg',
'.mpg',
'.ogg',
'.ogv',
'.otf',
'.pdf',
'.pkg',
'.png',
'.pps',
'.ppt',
'.pptx',
'.psd',
'.py',
'.qt',
'.ram',
'.rar',
'.sql',
'.svg',
'.swf',
'.tar.gz',
'.tar',
'.tgz',
'.tiff',
'.ttf',
'.txt',
'.wav',
'.webm',
'.wma',
'.wmv',
'.xls',
'.xlsx',
'.xml',
'.xz',
'.zip',
)
def extention_is_in_blacklist(url):
if url.split('?')[0].lower().endswith(extentions_blacklist):
return True
return False
# Malformed urls.
# This function is adapted from:
# https://stackoverflow.com/questions/7160737/python-how-to-validate-a-url-in-python-malformed-or-not
url_regex = re.compile(
r'^(?:http)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$',
re.IGNORECASE)
def url_is_malformed(url):
return re.match(url_regex, url) is None
def print_progress(prefix, start_time, urls_counter, domain_blacklist_counter, extention_blacklist_counter,
short_url_counter, malformed_url_counter, duplicate_url_counter):
string = prefix + ' | '
string += 'time elapsed (s): {:.2f} | '.format(time.time() - start_time)
string += 'number of urls: {} | '.format(urls_counter)
string += 'domain blacklisted: {} | '.format(domain_blacklist_counter)
string += 'extention blacklisted: {} | '.format(extention_blacklist_counter)
string += 'short urls (<=8): {} | '.format(short_url_counter)
string += 'malformed urls: {} | '.format(malformed_url_counter)
string += 'duplicate urls: {}'.format(duplicate_url_counter)
print(string, flush=True)
if __name__ == '__main__':
print('remove blacklisted urls ..')
# Path to the url files.
path = sys.argv[1]
# Output url file.
output = sys.argv[2]
# Get the list of url files.
files = glob.glob(path + '/*.txt')
print('> found {} files'.format(len(files)))
urls = set()
urls_counter = 0
domain_blacklist_counter = 0
extention_blacklist_counter = 0
short_url_counter = 0
malformed_url_counter = 0
duplicate_url_counter = 0
start_time = time.time()
for filename in files:
with open(filename, 'r') as f:
for line in f:
url = line.strip()
urls_counter += 1
if domain_is_in_blacklist(url):
print('[DOMAIN BLACKLIST]: {}'.format(url), flush=True)
domain_blacklist_counter += 1
elif extention_is_in_blacklist(url):
print('[EXTENTION BLACKLIST]: {}'.format(url), flush=True)
extention_blacklist_counter += 1
elif len(url) <= 8:
print('[SHORT URL]: {}'.format(url), flush=True)
short_url_counter += 1
elif url_is_malformed(url):
print('[MALFORMED URL]: {}'.format(url), flush=True)
malformed_url_counter += 1
elif url in urls:
print('[DUPLICATE URL]: {}'.format(url), flush=True)
duplicate_url_counter += 1
else:
urls.add(url)
if urls_counter % 100000 == 0:
print_progress('PROGRESS', start_time, urls_counter, domain_blacklist_counter,
extention_blacklist_counter, short_url_counter, malformed_url_counter,
duplicate_url_counter)
print_progress('FINAL', start_time, urls_counter, domain_blacklist_counter, extention_blacklist_counter,
short_url_counter, malformed_url_counter, duplicate_url_counter)
# Write the final set of urls.
print('> writing cleaned up url list to {}'.format(output))
with open(output, 'w') as f:
for url in urls:
f.write(url + '\n')
print('done :-)')