## Introduction

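Pai-Megatron-Patch is a companion tool for the large-model best-practice solution on PAI-Lingjun, Alibaba Cloud's intelligent computing service. It is developed by the algorithm team of Alibaba's AI platform PAI.
Pai-Megatron-Patch acts as a bridge between open-source large models and the Megatron training acceleration engine: it makes training open-source models with Megatron easy while keeping the flexibility to customize for LLM algorithm scenarios.
It also helps large-model developers get started quickly with PAI-Lingjun and covers the full development pipeline: efficient distributed training, supervised instruction fine-tuning, and offline inference validation.
The project provides Megatron-based training and offline inference validation workflows for mainstream open-source large models, so users can start training quickly.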

**This repository is the DCU-adapted version of the project and includes a number of DCU-specific optimizations.**

## Supported Models

|    Model    | Megatron-LM-Dense | Megatron-Core-Dense |             Notes              |
| :---------: | :---------------: | :-----------------: | :----------------------------: |
|  baichuan   |         √         |          -          |                                |
|  baichuan2  |         √         |          -          |  TE does not support bf16 yet  |
|   chatglm   |         √         |          -          | Upstream does not support TP/PP |
|   llama3    |         √         |          √          |  TE does not support bf16 yet  |
|  llama3.1   |         -         |          √          |                                |
| llava_mcore |         -         |          √          |                                |
|   qwen1.5   |         √         |          √          |                                |
|    qwen2    |         -         |          √          |  TE does not support bf16 yet  |
|   qwen2.5   |         -         |          √          |                                |
|  qwen2_vl   |         -         |          √          |                                |

## Environment Setup

### Docker (Option 1)

```
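# Pull the image
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-py3.10-dtk24.04.3-ubuntu20.04

# Start a container from the pulled image (adjust the mounted path and the container name as needed)
docker run -it -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro --shm-size=80G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-py3.10-dtk24.04.3-ubuntu20.04 bash

# Install dependencies
pip install transformers

cd Pai-Megatron-Patch/unsloth
pip install -e .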
```
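
Before starting the container, it can help to confirm that the host paths passed to `docker run` actually exist. The following is a minimal sketch and not part of the project: the helper name `missing_paths` is hypothetical, and the paths checked are simply the device and runtime paths used in the `docker run` command of this section.

```shell
#!/bin/sh
# Hypothetical helper: print every argument that does not exist on the host.
# Prints nothing and returns 0 when all paths exist; returns 1 otherwise.
missing_paths() {
    status=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "$p"
            status=1
        fi
    done
    return "$status"
}

# Paths taken from the docker run command in this section.
if missing=$(missing_paths /dev/kfd /dev/dri /opt/hyhal); then
    echo "all DCU device and runtime paths are present"
else
    echo "missing on this host:"
    echo "$missing"
fi
```

If any path is reported as missing, the DCU driver or the hyhal runtime is probably not installed on the host, and the container would start without access to the cards.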

### Dockerfile (Option 2)

```
# Build the image from the directory that contains the Dockerfile
cd /home/Index/docker
docker build --no-cache -t llama:latest .

# Start a container from the image built above
docker run -it --network=host --privileged=true --name=Index-1.9B --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=32G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal/:/opt/hyhal/:ro llama:latest /bin/bash

# Install dependencies
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

cd Pai-Megatron-Patch/unsloth
pip install -e .
```


### Anaconda (Option 3)

The DCU-specific deep learning libraries required by this project can be downloaded from the [Guanghe (光合)](https://developer.hpccube.com/tool/) developer community.

```
DTK driver:dtk24.04.3
python:3.10
torch:2.1.0
flash-attn:2.6.1
apex:1.3.0
bitsandbytes:0.42.0
deepspeed:0.14.2
faiss:1.7.2
mmcv:2.0.1
torchvision:0.16.0
transformer-engine:1.8.0
triton:2.1.0
xformers:0.0.25
lmslim:0.1.0

cd Pai-Megatron-Patch
pip install -r requirements.txt
cd unsloth
pip install -e .
```
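
After installation, the installed versions can be compared with the list in this section. A minimal sketch follows. The helpers `version_matches` and `installed_version` are hypothetical, and the sketch assumes that a vendor build may append a local suffix such as `+<tag>` to the version, which is stripped before comparing. The distribution names passed to `pip show` are also assumptions and may differ for the DCU wheels.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the installed version equals the expected
# one, ignoring any local suffix after "+" (for example 2.1.0+abc -> 2.1.0).
version_matches() {
    expected=$1
    installed=${2%%+*}
    [ "$expected" = "$installed" ]
}

# Hypothetical helper: look up the installed version with pip.
# Prints an empty string if the package (or python itself) is not available.
installed_version() {
    { python -m pip show "$1" 2>/dev/null || true; } | sed -n 's/^Version: //p'
}

# Check a few entries from the version list (package names are assumptions).
for spec in torch=2.1.0 torchvision=0.16.0 deepspeed=0.14.2 triton=2.1.0; do
    name=${spec%%=*}
    want=${spec#*=}
    have=$(installed_version "$name")
    if version_matches "$want" "$have"; then
        echo "ok       $name $have"
    else
        echo "mismatch $name: expected $want, found ${have:-none}"
    fi
done
```

Extend the `for` list with the remaining packages once their distribution names in your environment are known.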


## Quick Start

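```shell
# The examples folder contains one directory per model; follow each model's README to train it.

# llama example

# 1. Enter the model directory, download the model and dataset as described in the README, and convert the model if needed
cd examples/llama3


# 2. Edit the hyperparameters in run-pretrain.sh. The llama3 parameters are:
ENV=$1                          # Runtime environment: dlc, dsw
MEGATRON_PATCH_PATH=$2          # Path to the Megatron Patch code
MODEL_SIZE=$3                   # Model size: 7B, 13B
BATCH_SIZE=$4                   # Samples per card per iteration: 4, 8
GLOBAL_BATCH_SIZE=$5            # Global batch size
LR=$6                           # Learning rate: 1e-5, 5e-5
MIN_LR=$7                       # Minimum learning rate: 1e-6, 5e-6
SEQ_LEN=$8                      # Sequence length
PAD_LEN=$9                      # Padding length: 100
EXTRA_VOCAB_SIZE=${10}          # Number of extra vocabulary entries
PR=${11}                        # Training precision: fp16, bf16
TP=${12}                        # Tensor model parallel size
PP=${13}                        # Pipeline parallel size
AC=${14}                        # Activation checkpointing mode: sel, full
DO=${15}                        # Use Megatron's ZeRO-1 memory-saving optimizer: true, false
FL=${16}                        # Use Flash Attention: true, false
SP=${17}                        # Use sequence parallelism: true, false
TE=${18}                        # Use Transformer Engine: true, false
SAVE_INTERVAL=${19}             # Checkpoint save interval
DATASET_PATH=${20}              # Training dataset path
PRETRAIN_CHECKPOINT_PATH=${21}  # Pretrained model path
TRAIN_TOKENS=${22}              # Number of training tokens
WARMUP_TOKENS=${23}             # Number of warmup tokens
OUTPUT_BASEPATH=${24}           # Output path for training artifacts

# 3. Set the cards to use in run_pretrain_megatron_llama-dcu.sh
# Here the first four cards are used
export HIP_VISIBLE_DEVICES=0,1,2,3  # IDs of the cards to use
MASTER_ADDR=localhost
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=4                     # Number of cards to use


# 4. Launch the training script
sh run-pretrain.sh
```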
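
The card settings in step 3 and the parallelism parameters in step 2 have to agree with each other. The sketch below is not part of the project scripts, and the helper names `count_devices` and `check_layout` are hypothetical. It counts the cards listed in `HIP_VISIBLE_DEVICES` and applies the usual Megatron constraints: the total number of cards must be divisible by `TP × PP`, and `GLOBAL_BATCH_SIZE` must be divisible by `BATCH_SIZE` times the resulting data-parallel size.

```shell
#!/bin/sh
# Hypothetical helper: count the entries of a comma-separated device list,
# e.g. "0,1,2,3" -> 4.
count_devices() {
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    echo "$#"
}

# Hypothetical helper: validate a layout and print the data-parallel size.
# Arguments: total cards, TP, PP, per-card batch size, global batch size.
check_layout() {
    world=$1 tp=$2 pp=$3 micro=$4 global=$5
    model_parallel=$((tp * pp))
    if [ $((world % model_parallel)) -ne 0 ]; then
        echo "card count $world is not divisible by TP*PP=$model_parallel" >&2
        return 1
    fi
    dp=$((world / model_parallel))
    if [ $((global % (micro * dp))) -ne 0 ]; then
        echo "GLOBAL_BATCH_SIZE=$global is not divisible by BATCH_SIZE*DP=$((micro * dp))" >&2
        return 1
    fi
    echo "$dp"
}

# Example values only; substitute the ones from your own scripts.
HIP_VISIBLE_DEVICES=0,1,2,3
NNODES=1
GPUS_PER_NODE=$(count_devices "$HIP_VISIBLE_DEVICES")
DP=$(check_layout $((NNODES * GPUS_PER_NODE)) 2 1 4 32)
echo "GPUS_PER_NODE=$GPUS_PER_NODE DP=$DP"
```

Deriving `GPUS_PER_NODE` from the device list instead of typing it twice avoids the two values drifting apart when the set of cards changes.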

For more examples, see Pai-Megatron-Patch/README_zh-CN.md. For a specific model, see Pai-Megatron-Patch/examples/XXX(model)/README.md.

## References

[Pai-Megatron-Patch](https://github.com/alibaba/Pai-Megatron-Patch/tree/main)