# Megatron-LM-GPT2

### 1. Prepare the DeepSpeed framework
1. Download the build matched to release 21.10.1: [**DeepSpeed-0.3.13**](http://10.0.50.210:8000/jenkins/rocm/yum/21.10.1/whl/deepspeed-0.3.13%2Bcd3feaa-cp36-cp36m-linux_x86_64.whl)
2. Comment out the following line in `{pythonenv}/deepspeed/ops/sparse_attention/softmax.py`:
```
from .trsrc import softmax_fwd, softmax_bwd
```
3. Comment out the following line in `{pythonenv}/deepspeed/ops/sparse_attention/matmul.py`:
```
from .trsrc import matmul
```
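If you prefer to patch the two files from the command line, a minimal sketch (the `comment_out` helper is illustrative, not part of DeepSpeed, and `{pythonenv}` stands for your environment's site-packages path):

```shell
# Hypothetical helper: prefix a matching trsrc import line with '# ' in-place.
comment_out() {
  # $1 = file to patch, $2 = imported names as they appear on the line
  sed -i "s|^from \.trsrc import $2|# &|" "$1"
}

# Usage (substitute {pythonenv} with your actual site-packages path):
# comment_out "{pythonenv}/deepspeed/ops/sparse_attention/softmax.py" "softmax_fwd, softmax_bwd"
# comment_out "{pythonenv}/deepspeed/ops/sparse_attention/matmul.py" "matmul"
```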

### 2. Download the datasets

Dataset: [**OpenAI WebText dataset**](https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/webtext.train.jsonl)
[**GPT2 vocab**](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)
[**GPT2 merge table**](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt)

The full set of downloadable datasets can be seen in this snippet:

```
import requests

# Download each split of every dataset from the OpenAI output-dataset mirror.
for ds in ['webtext',
           'small-117M', 'small-117M-k40',
           'medium-345M', 'medium-345M-k40',
           'large-762M', 'large-762M-k40',
           'xl-1542M', 'xl-1542M-k40']:
    for split in ['train', 'valid', 'test']:
        filename = ds + "." + split + '.jsonl'
        r = requests.get("https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/" + filename, stream=True)
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```

### 3. Preprocess the dataset
```
python3 tools/preprocess_data.py \
       --input webtext.train.jsonl \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod
```
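With `--output-prefix my-gpt2` and `--dataset-impl mmap`, preprocessing should leave a binary/index pair on disk. The check below is a sketch: the `_text_document` naming follows Megatron-LM's usual `{prefix}_{json-key}_{level}` convention and is worth verifying against your own run.

```python
import os

def expected_outputs(prefix, key="text", level="document"):
    """Names of the .bin/.idx pair tools/preprocess_data.py is expected to write."""
    base = f"{prefix}_{key}_{level}"
    return [base + ".bin", base + ".idx"]

# Report any missing outputs after preprocessing finishes.
missing = [f for f in expected_outputs("my-gpt2") if not os.path.exists(f)]
if missing:
    print("missing preprocessed files:", missing)
```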

### 4. Run training
Training script: `examples/ds_pretrain_gpt2.sh`
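The script wraps a `deepspeed` launcher invocation that ties together the files prepared above. A rough, unverified sketch of the shape of that command (flag names follow Megatron-LM conventions; the config filename is an assumption — check `examples/ds_pretrain_gpt2.sh` itself for the authoritative flags):

```shell
# Illustrative only -- the real script sets many more model/parallelism flags.
deepspeed pretrain_gpt2.py \
       --data-path my-gpt2_text_document \
       --vocab-file gpt2-vocab.json \
       --merge-file gpt2-merges.txt \
       --tokenizer-type GPT2BPETokenizer \
       --deepspeed --deepspeed_config ds_config.json   # assumed config filename
```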