# Sequence Parallelism with BERT

In this example, we implement BERT with sequence parallelism. Sequence parallelism splits the input tensor and intermediate
activations along the sequence dimension. This method achieves better memory efficiency and allows us to train with larger batch sizes and longer sequence lengths.

Paper: [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120)

## 🚀 Quick Start
1. Run with the following command
```bash
export PYTHONPATH=$PWD
colossalai run --nproc_per_node 4 train.py -s
```
2. The default config uses sequence parallel size = 2 and pipeline size = 1. Try changing the pipeline size to 2 and running it again (see the sketch below).
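
The change could look roughly like the snippet below in `config.py`. The field names are assumptions based on the parallel settings described later in this README, so check your `config.py` for the exact names.

```python
# sketch of the parallel settings in config.py (field names are assumptions)
parallel = dict(
    pipeline=2,                            # pipeline parallel size, changed from 1 to 2
    tensor=dict(size=2, mode='sequence'),  # sequence parallel size stays at 2
)
```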


## How to Prepare the Wikipedia Dataset

First, let's prepare the Wikipedia dataset from scratch. To generate a preprocessed dataset, we need four items:
1. raw WikiPedia dataset
2. wikipedia extractor (extract data from the raw dataset)
3. vocabulary file
4. preprocessing scripts (generate final data from extracted data)

For the preprocessing script, we thank Megatron-LM for providing the script that converts the extracted corpus into the final data files.

```bash
# download raw data
mkdir data && cd ./data
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# install wiki extractor
git clone https://github.com/FrankLeeeee/wikiextractor.git
pip install ./wikiextractor

# extract text from the raw dump
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
cat text/*/* > ./corpus.json
cd ..

# download vocab file
mkdir vocab && cd ./vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
cd ..

# preprocess some data
git clone https://github.com/NVIDIA/Megatron-LM.git
cd ./Megatron-LM
python tools/preprocess_data.py \
    --input ../data/corpus.json \
    --output-prefix my-bert \
    --vocab ../vocab/bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 24
```

After running the preprocessing scripts, you will obtain two files:
1. my-bert_text_sentence.bin
2. my-bert_text_sentence.idx

If you happen to encounter an `index out of range` problem when running Megatron's script,
it is probably because a sentence starts with a punctuation mark and cannot be tokenized. A workaround is to update the `Encoder.encode` method with the code below (note that it uses `string.punctuation`, so make sure the `string` module is imported in the script):

```python
class Encoder(object):
    def __init__(self, args):
        ...

    def initializer(self):
        ...

    def encode(self, json_line):
        data = json.loads(json_line)
        ids = {}
        for key in self.args.json_keys:
            text = data[key]
            doc_ids = []

            # lsg: avoid sentences which start with a punctuation
            # as it cannot be tokenized by splitter
            if len(text) > 0 and text[0] in string.punctuation:
                text = text[1:]

            for sentence in Encoder.splitter.tokenize(text):
                sentence_ids = Encoder.tokenizer.tokenize(sentence)
                if len(sentence_ids) > 0:
                    doc_ids.append(sentence_ids)
            if len(doc_ids) > 0 and self.args.append_eod:
                doc_ids[-1].append(Encoder.tokenizer.eod)
            ids[key] = doc_ids
        return ids, len(json_line)
```

## How to Train with Sequence Parallelism

We provide `train.py` for you to execute training. Before invoking the script, there are several steps to perform.

### Step 1. Set data path and vocab path

At the top of `config.py`, you can see two global variables `DATA_PATH` and `VOCAB_FILE_PATH`.

```python
DATA_PATH = <data-path>
VOCAB_FILE_PATH = <vocab-path>
```

`DATA_PATH` refers to the path to the data files generated by Megatron's script. For example, following the section above, you should get two data files (my-bert_text_sentence.bin and my-bert_text_sentence.idx). You just need to set `DATA_PATH` to the path of the bin file without the file extension.

For example, if your `my-bert_text_sentence.bin` is located at `/home/Megatron-LM/my-bert_text_sentence.bin`, then you should set

```python
DATA_PATH = '/home/Megatron-LM/my-bert_text_sentence'
```

The `VOCAB_FILE_PATH` refers to the path to the vocabulary file downloaded when you prepared the dataset (e.g. bert-large-uncased-vocab.txt).
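
For example, if you downloaded the vocabulary into the `vocab` directory created earlier, the setting might look like this (the path below is illustrative):

```python
VOCAB_FILE_PATH = '/home/vocab/bert-large-uncased-vocab.txt'
```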

### Step 2. Make the Dataset Helper

Build the BERT dataset helper. The requirements are `CUDA`, `g++`, `pybind11` and `make`.

```bash
cd ./data/datasets
make
```

### Step 3. Configure your parameters

In the `config.py` provided, a set of parameters is defined, including the training scheme, model, etc. You can also modify the ColossalAI settings. For example, if you wish to parallelize over the sequence dimension on 8 GPUs, you can change `size=4` to `size=8`. If you wish to use pipeline parallelism, you can set `pipeline=<num_of_pipeline_stages>`.
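
As an illustration, the parallel settings for that 8-GPU sequence-parallel setup might look like the snippet below. This is a sketch only: the exact structure of the configuration in `config.py` may differ, so check the file itself.

```python
# sketch: sequence parallelism over 8 GPUs without pipeline parallelism
parallel = dict(
    pipeline=1,                            # set to <num_of_pipeline_stages> to enable pipeline parallelism
    tensor=dict(size=8, mode='sequence'),  # sequence parallel size changed from 4 to 8
)
```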

### Step 4. Invoke parallel training

Lastly, you can start training with sequence parallelism. How you invoke `train.py` depends on your machine setup.

- If you are using a single machine with multiple GPUs, the `colossalai run` launch utility can easily
  start your script. A sample command is shown below:

  ```bash
    colossalai run --nproc_per_node <num_gpus_on_this_machine> --master_addr localhost --master_port 29500 train.py
  ```

- If you are using multiple machines with multiple GPUs, we suggest that you refer to
  `colossalai.launch_from_slurm` or `colossalai.launch_from_openmpi`, as it is easier to use SLURM and
  OpenMPI to start multiple processes over multiple nodes. If you have your own launcher, you can fall
  back to the default `colossalai.launch` function. A minimal sketch of a SLURM-based launch follows.
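
  The snippet below is a sketch only: the function name comes from the suggestion above, but its exact
  signature can vary between ColossalAI versions, and the host/port values are placeholders to adapt.

  ```python
  # Sketch: initializing ColossalAI from inside a SLURM job
  # (the exact signature of launch_from_slurm may vary between versions).
  import colossalai

  def main():
      # launch_from_slurm reads the process rank and world size from
      # SLURM environment variables set by srun.
      colossalai.launch_from_slurm(
          config='./config.py',           # the same config file used by train.py
          host='<master-node-hostname>',  # placeholder: hostname of the rank-0 node
          port=29500,                     # placeholder: a free TCP port on that node
      )
      # ... then build the model and data loaders and run the training loop,
      # as train.py does after initialization.

  if __name__ == '__main__':
      main()
  ```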