# Megatron-LM-GPT2

### 1. Prepare the DeepSpeed framework
1. Download the build matched to release 21.10.1: [**DeepSpeed-0.3.13**](http://10.0.50.210:8000/jenkins/rocm/yum/21.10.1/whl/deepspeed-0.3.13%2Bcd3feaa-cp36-cp36m-linux_x86_64.whl)
2. Comment out the following line in `{pythonenv}/deepspeed/ops/sparse_attention/softmax.py`:
```
from .trsrc import softmax_fwd, softmax_bwd
```
3. Comment out the following line in `{pythonenv}/deepspeed/ops/sparse_attention/matmul.py`:
```
from .trsrc import matmul
```
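If you prefer to patch the two files from the command line, a minimal sketch (the `comment_out` helper is illustrative, not part of DeepSpeed, and `{pythonenv}` stands for your environment's site-packages path):

```shell
# Hypothetical helper: prefix a matching trsrc import line with '# ' in-place.
comment_out() {
  # $1 = file to patch, $2 = imported names as they appear on the line
  sed -i "s|^from \.trsrc import $2|# &|" "$1"
}

# Usage (substitute {pythonenv} with your actual site-packages path):
# comment_out "{pythonenv}/deepspeed/ops/sparse_attention/softmax.py" "softmax_fwd, softmax_bwd"
# comment_out "{pythonenv}/deepspeed/ops/sparse_attention/matmul.py" "matmul"
```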

### 2. Download the datasets

Dataset: [**OpenAI WebText dataset**](https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/webtext.train.jsonl)
[**GPT2 vocab**](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)
[**GPT2 merge table**](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt)

The full set of downloadable datasets can be seen in this snippet:

```
import requests

# Download each split of every dataset from the OpenAI output-dataset mirror.
for ds in ['webtext',
           'small-117M', 'small-117M-k40',
           'medium-345M', 'medium-345M-k40',
           'large-762M', 'large-762M-k40',
           'xl-1542M', 'xl-1542M-k40']:
    for split in ['train', 'valid', 'test']:
        filename = ds + "." + split + '.jsonl'
        r = requests.get("https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/" + filename, stream=True)
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```

### 3. Preprocess the dataset
```
python3 tools/preprocess_data.py \
       --input webtext.train.jsonl \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod
```
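With `--output-prefix my-gpt2` and `--dataset-impl mmap`, preprocessing should leave a binary/index pair on disk. The check below is a sketch: the `_text_document` naming follows Megatron-LM's usual `{prefix}_{json-key}_{level}` convention and is worth verifying against your own run.

```python
import os

def expected_outputs(prefix, key="text", level="document"):
    """Names of the .bin/.idx pair tools/preprocess_data.py is expected to write."""
    base = f"{prefix}_{key}_{level}"
    return [base + ".bin", base + ".idx"]

# Report any missing outputs after preprocessing finishes.
missing = [f for f in expected_outputs("my-gpt2") if not os.path.exists(f)]
if missing:
    print("missing preprocessed files:", missing)
```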

### 4. Run training
Training script: `examples/ds_pretrain_gpt2.sh`
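The script wraps a `deepspeed` launcher invocation that ties together the files prepared above. A rough, unverified sketch of the shape of that command (flag names follow Megatron-LM conventions; the config filename is an assumption — check `examples/ds_pretrain_gpt2.sh` itself for the authoritative flags):

```shell
# Illustrative only -- the real script sets many more model/parallelism flags.
deepspeed pretrain_gpt2.py \
       --data-path my-gpt2_text_document \
       --vocab-file gpt2-vocab.json \
       --merge-file gpt2-merges.txt \
       --tokenizer-type GPT2BPETokenizer \
       --deepspeed --deepspeed_config ds_config.json   # assumed config filename
```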