In this example, we implemented BERT with sequence parallelism. Sequence parallelism splits the input tensor and the intermediate activations along the sequence dimension. This method achieves better memory efficiency and allows us to train with larger batch sizes and longer sequence lengths.
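As a toy illustration (this is not the implementation used in this example), splitting an activation of shape `[batch, seq_len, hidden]` across ranks along the sequence dimension could look like the following sketch:

```python
import torch

def split_along_sequence(x: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Keep only this rank's chunk of the sequence dimension.

    Assumes x has shape [batch, seq_len, hidden] and that seq_len is
    divisible by world_size.
    """
    chunk = x.size(1) // world_size
    return x[:, rank * chunk:(rank + 1) * chunk, :]
```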
Paper: [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120)

## Table of contents
- [Sequence Parallelism](#sequence-parallelism)
- [Table of contents](#table-of-contents)
- [📚 Overview](#-overview)
- [How to Prepare the Dataset](#how-to-prepare-the-dataset)
- [🚀 Quick Start](#-quick-start)
- [🏎 How to Train with Sequence Parallelism](#-how-to-train-with-sequence-parallelism)
  - [Step 1. Set data path and vocab path](#step-1-set-data-path-and-vocab-path)
  - [Step 2. Make Dataset Helper](#step-2-make-dataset-helper)
  - [Step 3. Configure your parameters](#step-3-configure-your-parameters)
  - [Step 4. Invoke parallel training](#step-4-invoke-parallel-training)
## How to Prepare the Dataset

After running the preprocessing scripts, you will obtain two files:

1. `my-bert_text_sentence.bin`
2. `my-bert_text_sentence.idx`

If you happen to encounter an `index out of range` problem when running Megatron's script, it is probably because a sentence starts with a punctuation mark and cannot be tokenized. A workaround is to update the `Encoder.encode` method with the code below:
```python
import string  # needed for string.punctuation; add it to the script's imports if missing


class Encoder(object):
    def __init__(self, args):
        ...

    def initializer(self):
        ...

    def encode(self, json_line):
        data = json.loads(json_line)
        ids = {}
        for key in self.args.json_keys:
            text = data[key]
            doc_ids = []
            # lsg: avoid sentences which start with a punctuation mark,
            # as they cannot be tokenized by the splitter
            if len(text) > 0 and text[0] in string.punctuation:
                text = text[1:]
            for sentence in Encoder.splitter.tokenize(text):
                sentence_ids = Encoder.tokenizer.tokenize(sentence)
                if len(sentence_ids) > 0:
                    doc_ids.append(sentence_ids)
            if len(doc_ids) > 0 and self.args.append_eod:
                doc_ids[-1].append(Encoder.tokenizer.eod)
            ids[key] = doc_ids
        return ids, len(json_line)
```
## 🚀 Quick Start
1. Install PyTorch.

2. Install the dependencies.

```bash
pip install -r requirements.txt
```

3. Run `train.py` with your distributed launcher (see [Step 4. Invoke parallel training](#step-4-invoke-parallel-training) for details).

> The default config is sequence parallel size = 2, pipeline size = 1; let's change the pipeline size to 2 and try it again.

## 🏎 How to Train with Sequence Parallelism

We provided `train.py` for you to execute training. Before invoking the script, there are several steps to perform.

### Step 1. Set data path and vocab path

At the top of `config.py`, you can see two global variables, `DATA_PATH` and `VOCAB_FILE_PATH`.

```python
DATA_PATH = <data-path>
VOCAB_FILE_PATH = <vocab-path>
```

`DATA_PATH` refers to the path to the data file generated by Megatron's script. For example, in the section above you obtained two data files (`my-bert_text_sentence.bin` and `my-bert_text_sentence.idx`); you just need to set `DATA_PATH` to the path of the `.bin` file without the file extension. So if your `my-bert_text_sentence.bin` is located at `/home/Megatron-LM/my-bert_text_sentence.bin`, you should set `DATA_PATH` to `/home/Megatron-LM/my-bert_text_sentence`.

`VOCAB_FILE_PATH` refers to the path to the vocabulary file downloaded when you prepared the dataset (e.g. `bert-large-uncased-vocab.txt`).
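Putting the two together, the top of `config.py` might then look like this sketch (the vocabulary file location here is just an assumed path for illustration):

```python
# Illustrative values only; substitute the paths on your own machine.
DATA_PATH = '/home/Megatron-LM/my-bert_text_sentence'              # no .bin / .idx extension
VOCAB_FILE_PATH = '/home/Megatron-LM/bert-large-uncased-vocab.txt'
```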
### Step 2. Make Dataset Helper

Build the BERT dataset helper. Requirements are `CUDA`, `g++`, `pybind11` and `make`.
```bash
cd ./data/datasets
make
```
### Step 3. Configure your parameters
In the provided `config.py`, a set of parameters is defined, including the training scheme, the model, and so on.
You can also modify the ColossalAI settings. For example, if you wish to parallelize over the
sequence dimension on 8 GPUs, you can change `size=4` to `size=8`. If you wish to use pipeline parallelism, you can set `pipeline=<num_of_pipeline_stages>`.
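As a rough sketch, the relevant parallel setting in `config.py` might look something like the block below; the exact dictionary layout depends on your ColossalAI version, so treat it as an assumption rather than the definitive format:

```python
# Hypothetical parallel setting: 8-way sequence parallelism, no pipeline stages.
parallel = dict(
    pipeline=1,
    tensor=dict(size=8, mode='sequence'),
)
```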
### Step 4. Invoke parallel training
Lastly, you can start training with sequence parallelism. How you invoke `train.py` depends on your machine setup and the distributed launcher you use.
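For instance, a single-node run with the ColossalAI launcher could look like the sketch below; the process count of 4 is only an assumption and should match the product of your sequence parallel and pipeline sizes:

```bash
# Illustrative launch: 4 processes on the local node.
colossalai run --nproc_per_node 4 train.py
```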