# Megatron-LM-GPT2

### 1. Prepare the DeepSpeed framework

1. Download the DeepSpeed build matched to 21.10.1: [**DeepSpeed-0.3.13**](http://10.0.50.210:8000/jenkins/rocm/yum/21.10.1/whl/deepspeed-0.3.13%2Bcd3feaa-cp36-cp36m-linux_x86_64.whl)
2. Comment out the following import in `{pythonenv}/deepspeed/ops/sparse_attention/softmax.py`:
   ```
   from .trsrc import softmax_fwd, softmax_bwd
   ```
3. Comment out the following import in `{pythonenv}/deepspeed/ops/sparse_attention/matmul.py`:
   ```
   from .trsrc import matmul
   ```
   (A sed sketch for both patches appears at the end of this document.)

### 2. Download the dataset

Dataset: [**OpenAI webtext**](https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/webtext.train.jsonl)

[**GPT2 vocab**](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)

[**GPT2 merge table**](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt)

For reference, the full set of downloadable datasets is enumerated by the loop below (only `webtext.train.jsonl` is needed for the steps that follow):

```
import requests

for ds in ['webtext', 'small-117M', 'small-117M-k40', 'medium-345M', 'medium-345M-k40', 'large-762M', 'large-762M-k40', 'xl-1542M', 'xl-1542M-k40']:
    for split in ['train', 'valid', 'test']:
        filename = ds + "." + split + '.jsonl'
        r = requests.get("https://openaipublic.azureedge.net/gpt-2/output-dataset/v1/" + filename, stream=True)
        with open(filename, 'wb') as f:  # stream the response to disk
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```

### 3. Preprocess the dataset

```
python3 tools/preprocess_data.py \
       --input webtext.train.jsonl \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod
```

### 4. Run training

Training script: `examples/ds_pretrain_gpt2.sh`
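
With `--output-prefix my-gpt2`, step 3 should produce the indexed pair `my-gpt2_text_document.bin` / `my-gpt2_text_document.idx`, and the training script must be pointed at that prefix. A minimal sketch of the wiring (the variable names below are illustrative assumptions, not the script's actual contents; verify them inside `examples/ds_pretrain_gpt2.sh` before editing):

```
# Illustrative sketch: the variable names are assumptions; edit the
# corresponding entries inside examples/ds_pretrain_gpt2.sh itself.
DATA_PATH=my-gpt2_text_document   # prefix of the .bin/.idx pair from step 3
VOCAB_PATH=gpt2-vocab.json
MERGE_PATH=gpt2-merges.txt

bash examples/ds_pretrain_gpt2.sh
```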
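
Finally, the two import patches from step 1 can be applied by hand, or with the sed sketch below; `$PY_SITE` stands in for the `{pythonenv}` site-packages root and is an assumption of this sketch:

```
# Assumes PY_SITE points at the environment that holds the deepspeed package,
# e.g. PY_SITE=$(python3 -c 'import site; print(site.getsitepackages()[0])')
sed -i 's/from .trsrc import softmax_fwd, softmax_bwd/# &/' \
    "$PY_SITE/deepspeed/ops/sparse_attention/softmax.py"
sed -i 's/from .trsrc import matmul/# &/' \
    "$PY_SITE/deepspeed/ops/sparse_attention/matmul.py"
```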