Change formatting.

Commit 7b47409b authored Nov 09, 2023 by qianyj
9 changed files
--- a/README.md
+++ b/README.md
@@ -11,11 +11,21 @@ The Baichuan model series is a family of open-source, large-scale pretrained models developed by Baichuan Intelligence,
 | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
 | Baichuan-7B | 4,096 | 32 | 32 | 64,000 | 7,000,559,616 | 1.2 trillion | RoPE | 4096 |
 | Baichuan-13B | 5,120 | 40 | 40 | 64,000 | 13,264,901,120 | 1.4 trillion | ALiBi | 4096 |
+<div align="center">
+<img src="./assets/transformer.jpg" width="400" height="300">
+</div>
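 ## Algorithm principles
 Baichuan is built on the standard Transformer architecture and follows the same model design as LLaMA. Baichuan-7B uses Rotary Embedding positional encoding, the SwiGLU activation function, and RMSNorm-based Pre-Normalization. Baichuan-13B uses ALiBi linear biases, which require less computation than Rotary Embedding and significantly improve inference performance.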
+<div align="center">
+<img src="./assets/transformer.png" width="450" height="300">
+</div>
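 ## Environment setup
+Note 1: If accelerate, transformers, or other libraries require deepspeed 0.9.3, comment out the corresponding version check code. deepspeed 0.9.3 has not been adapted yet; deepspeed 0.9.2 works.
+Note 2: For LoRA training, install transformers 4.31.0.
 ### Docker (option 1)
 Running with Docker is recommended. Pull the provided Docker image: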
 ```
@@ -23,9 +33,9 @@ docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk
 ```
 Install the dependencies that the Docker image does not include:
 ```
-pip install transformers==4.31.0 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
+pip install transformers==4.28.0 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
-pip install accelerate==0.22.0 --no-dependencies -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
+pip install accelerate --no-dependencies -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
-pip install datasets peft trl tiktoken jieba rouge-chinese nltk gradio matplotlib uvicore fastapi sse-starlette
+pip install datasets peft trl tiktoken jieba rouge-chinese nltk gradio matplotlib uvicore fastapi sse-starlette sentencepiece
 ```
@@ -54,8 +64,6 @@ conda create -n chatglm python=3.8
 pip install -r requirements.txt --no-dependencies -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com 
 ```
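-Note: If accelerate, transformers, or other libraries require deepspeed 0.9.3, comment out the corresponding version check code. deepspeed 0.9.3 has not been adapted yet; deepspeed 0.9.2 works.
 ## Dataset
 Input data are JSON files placed in the project's [data](.data) directory and selected with the --dataset option (see the example below); separate multiple input files with `,`. An example JSON file and its field descriptions follow: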
@@ -93,36 +101,42 @@ Hugging Face model download links:
 [Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat)
+## Training
+### Full-parameter fine-tuning
-## Full-parameter fine-tuning
+1. Single-node training
-### Single-node training
 ```
 bash run-full.sh
 ```
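 You can adjust the batch size, model path, dataset, DeepSpeed config file, and other settings in the script to suit your needs.
 Note: The pretrained model loaded in the example above is an aligned model, so --template is set to `baichuan`; if you load a base model instead, set `--template default`.
-### Cluster training
+2. Multi-node training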
 ```
-cd slurm_script
+cd multi_node
-sbatch run-13b-sft.sh
 ``` 
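+On node 1, edit hostfile for your environment and make sure both nodes use the same file paths and the same configuration. In run-13b-sft.sh, adjust `--mca btl_tcp_if_include enp97s0f1` as needed: replace enp97s0f1 with the name of the NIC that holds the node's IP, as shown by `ip a`. The NUMA binding can be changed to match the topology of the current node. Fine-tuning command: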
+``` 
+bash run-13b-sft.sh
+``` 
+### LoRA fine-tuning
-## LoRA fine-tuning
+1. Single-node training
-### Single-node training
 ```
 bash run-lora.sh
 ```
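 You can adjust the batch size, model path, dataset, DeepSpeed config file, lora_rank, lora_target, and other settings in the script to suit your needs. Run python src/train_bash.py -h to see all available options.
-### Cluster training
+2. Multi-node training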
-```
-cd slurm_script
-sbatch run-7b-sft-lora.sh
 ```
+cd multi_node
+``` 
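+On node 1, edit hostfile for your environment and make sure both nodes use the same file paths and the same configuration. In run-7b-sft-lora.sh, adjust `--mca btl_tcp_if_include enp97s0f1` as needed: replace enp97s0f1 with the name of the NIC that holds the node's IP, as shown by `ip a`. The NUMA binding can be changed to match the topology of the current node. Fine-tuning command: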
+``` 
+bash run-7b-sft-lora.sh
+``` 
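The multi-node instructions above ask for `enp97s0f1` to be replaced with the interface that carries each node's IP. A minimal sketch of that lookup, assuming the one-line-per-address output of `ip -o -4 addr show`; the function name `iface_for_ip` is hypothetical and not part of the repository:

```shell
#!/bin/bash
# Hypothetical helper: print the interface name that owns a given IPv4 address.
# Reads `ip -o -4 addr show` style lines on stdin, for example:
#   2: enp97s0f1    inet 10.0.21.163/24 brd 10.0.21.255 scope global enp97s0f1
iface_for_ip() {
  local target_ip=$1
  awk -v ip="$target_ip" '{
    split($4, addr, "/")                    # field 4 is "address/prefix"
    if (addr[1] == ip) { print $2; exit }   # field 2 is the interface name
  }'
}

# Usage on a node (10.0.21.163 is the first address in multi_node/hostfile):
#   ip -o -4 addr show | iface_for_ip 10.0.21.163
```

The printed name could then be passed to `--mca btl_tcp_if_include` in place of the hard-coded interface.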
 ## Inference
@@ -185,6 +199,9 @@ CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
 ```
 ## Result
+None
+## Accuracy
 - The loss convergence from our full-parameter instruction fine-tuning test on the baichuan-13b-base model is shown below:
 <div align="center">
 <img src="./assets/training_loss.png" width="300" height="250">

--- a/assets/transformer.jpg
+++ b/assets/transformer.jpg
--- a/assets/transformer.png
+++ b/assets/transformer.png
--- a/multi_node/deepspeed.json
+++ b/multi_node/deepspeed.json
+{
+    "train_micro_batch_size_per_gpu": "auto",
+    "train_batch_size": "auto",
+    "zero_allow_untested_optimizer": true,
+    "fp16": {
+      "enabled": "auto",
+      "loss_scale": 0,
+      "initial_scale_power": 16, 
+      "loss_scale_window": 1000,
+      "hysteresis": 2,
+      "min_loss_scale": 1
+    }, 
+    "zero_force_ds_cpu_optimizer": false,
+    "zero_optimization": {
+      "stage": 3,
+      "offload_param": {
+        "device": "cpu",
+        "pin_memory": true
+      },
+      "offload_optimizer": {
+        "device": "cpu",
+        "pin_memory": true
+      },
+      "stage3_gather_16bit_weights_on_model_save": true,
+      "allgather_partitions": true,
+      "allgather_bucket_size": 5e8,
+      "overlap_comm": false,
+      "reduce_scatter": true,
+      "reduce_bucket_size": 5e8,
+      "contiguous_gradients": true
+    }
+}
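With `train_batch_size` and `train_micro_batch_size_per_gpu` set to `"auto"`, the actual values are expected to come from the training arguments rather than from this file. As a rough worked example, DeepSpeed's global batch size is the product of the per-GPU micro batch, the gradient accumulation steps, and the number of ranks. The sketch below assumes 16 ranks (two hosts with 8 slots each, as in `multi_node/hostfile`) and the batch size of 1 and accumulation of 1 used by the scripts in this commit; the function name is hypothetical:

```shell
#!/bin/bash
# Global batch = micro batch per GPU * gradient accumulation steps * number of ranks.
global_batch_size() {
  local micro=$1 accum=$2 world=$3
  echo $(( micro * accum * world ))
}

# Assumed values: batch size 1, accumulation 1, 2 hosts * 8 slots = 16 ranks.
global_batch_size 1 1 16   # prints 16
```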
--- a/multi_node/hostfile
+++ b/multi_node/hostfile
+10.0.21.163 slots=8
+10.0.21.116 slots=8
--- a/multi_node/run-13b-sft-single.sh
+++ b/multi_node/run-13b-sft-single.sh
+#!/bin/bash
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export MIOPEN_FIND_MODE=3
+export MIOPEN_COMPILE_PARALLEL_LEVEL=1
+export NCCL_PLUGIN_P2P=ucx
+export RCCL_NCHANNELS=2
+export NCCL_SOCKET_IFNAME=ib0
+export NCCL_P2P_LEVEL=5
+lrank=$OMPI_COMM_WORLD_LOCAL_RANK
+echo "LRANK===============================$lrank"
+RANK=$OMPI_COMM_WORLD_RANK
+WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
+export NCCL_IB_HCA=mlx5_0  # NIC 0
+APP="python3 ../src/train_bash.py --stage sft \
+    --model_name_or_path ../../baichuan-13b-base/ \
+    --do_train \
+    --template default \
+    --dataset alpaca_gpt4_en,alpaca_gpt4_zh,self_cognition,oaast_sft,lima \
+    --finetuning_type full \
+    --output_dir output/baichuan-13b \
+    --per_device_train_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --preprocessing_num_workers 16 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --save_steps 2000 \
+    --learning_rate 1e-4 \
+    --num_train_epochs 1.0 \
+    --plot_loss \
+    --fp16 \
+    --deepspeed deepspeed.json
+"
+case ${lrank} in
+[0])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_0:1
+  export UCX_IB_PCI_BW=mlx5_0:50Gbs
+  numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
+[1])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_1:1
+  export UCX_IB_PCI_BW=mlx5_1:50Gbs
+  numactl --cpunodebind=1 --membind=1 ${APP}
+  ;;
+[2])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_2:1
+  export UCX_IB_PCI_BW=mlx5_2:50Gbs
+  numactl --cpunodebind=2 --membind=2 ${APP}
+  ;;
+[3])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_3:1
+  export UCX_IB_PCI_BW=mlx5_3:50Gbs
+  numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+esac
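The `case` block above matches only local ranks 0 through 3. If the launcher starts 8 ranks per host, as `run-13b-sft.sh` requests, local ranks 4 to 7 would match no branch and exit without starting the application. One possible way to cover every rank is to compute the NUMA node from the rank. The sketch assumes 4 NUMA nodes per host and a round-robin assignment; neither is stated by the commit, and `numa_node_for_rank` is a hypothetical name:

```shell
#!/bin/bash
# Hypothetical helper: map a local rank onto a NUMA node, round-robin.
numa_node_for_rank() {
  local lrank=$1 numa_nodes=${2:-4}   # 4 NUMA nodes per host is an assumption
  echo $(( lrank % numa_nodes ))
}

lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
node=$(numa_node_for_rank "$lrank")
# Print the binding prefix instead of launching anything.
echo "numactl --cpunodebind=${node} --membind=${node}"
```

Whether two ranks should share a NUMA node depends on the host topology, so the mapping should be checked against `numactl --hardware` on the target machine.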
--- a/multi_node/run-13b-sft.sh
+++ b/multi_node/run-13b-sft.sh
+ulimit -u 200000
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export OMP_NUM_THREADS=1
+export NCCL_DEBUG=INFO
+export MIOPEN_FIND_MODE=3
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export MIOPEN_COMPILE_PARALLEL_LEVEL=1
+export NCCL_PLUGIN_P2P=ucx
+export NCCL_SOCKET_IFNAME=ib0
+export NCCL_P2P_LEVEL=5
+echo "START TIME: $(date)"
+hostfile=./hostfile
+np=$(cat $hostfile|sort|uniq |wc -l)
+np=$(($np*8))
+which mpirun
+mpirun -np $np --allow-run-as-root --hostfile hostfile --bind-to none --mca btl_tcp_if_include enp97s0f1 mpi_single.sh 8
+echo "END TIME: $(date)"
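The launcher above derives the rank count as the number of unique hostfile lines multiplied by a hard-coded 8. A sketch of an alternative that sums the `slots=` values instead, so the count follows whatever the hostfile declares; `total_slots` is a hypothetical name and the hostfile format is assumed to be the `host slots=N` form used in this commit:

```shell
#!/bin/bash
# Hypothetical helper: sum the slots=N entries of an Open MPI style hostfile.
total_slots() {
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^slots=[0-9]+$/) { sub(/^slots=/, "", $i); total += $i }
  } END { print total + 0 }' "$1"
}

# Usage: np=$(total_slots ./hostfile)
```

For the two-line hostfile in this commit the result is 16, the same value the hard-coded multiplication produces.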
--- a/multi_node/run-7b-sft-lora-single.sh
+++ b/multi_node/run-7b-sft-lora-single.sh
+#!/bin/bash
+export MIOPEN_FIND_MODE=3
+export GPU_MAX_HW_QUEUES=16
+lrank=$OMPI_COMM_WORLD_LOCAL_RANK
+comm_rank=$OMPI_COMM_WORLD_RANK
+comm_size=$OMPI_COMM_WORLD_SIZE
+export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
+export RANK=$comm_rank
+export WORLD_SIZE=$comm_size
+export NCCL_IB_HCA=mlx5
+export NCCL_SOCKET_IFNAME=ib0 
+export HIP_DIRECT_DISPATCH=0
+APP="python3 ../src/train_bash.py --stage sft \
+    --model_name_or_path ../../baichuan-7b-base \
+    --do_train \
+    --template default \
+    --dataset alpaca_gpt4_en \
+    --finetuning_type lora \
+    --lora_rank 16 \
+    --lora_target W_pack,o_proj,gate_proj,down_proj,up_proj \
+    --output_dir out/baichuan-7b-lora-test7 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --preprocessing_num_workers 8 \
+    --lr_scheduler_type cosine \
+    --logging_steps 10 \
+    --save_steps 2 \
+    --eval_steps 2 \
+    --learning_rate 1e-4 \
+    --max_grad_norm 0.5 \
+    --num_train_epochs 1.0 \
+    --val_size 0.001 \
+    --evaluation_strategy steps \
+    --load_best_model_at_end \
+    --plot_loss \
+    --fp16 \
+    --deepspeed deepspeed.json
+"
+case ${lrank} in
+[0])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_0:1
+  export UCX_IB_PCI_BW=mlx5_0:50Gbs
+  numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
+[1])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_1:1
+  export UCX_IB_PCI_BW=mlx5_1:50Gbs
+  numactl --cpunodebind=1 --membind=1 ${APP}
+  ;;
+[2])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_2:1
+  export UCX_IB_PCI_BW=mlx5_2:50Gbs
+  numactl --cpunodebind=2 --membind=2 ${APP}
+  ;;
+[3])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_3:1
+  export UCX_IB_PCI_BW=mlx5_3:50Gbs
+  numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+esac
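Both per-rank scripts read Open MPI's `OMPI_COMM_WORLD_*` variables. A small sketch of translating them into the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` names exported above, with single-process defaults so the script can also be tried outside `mpirun`; the defaults and the function name `export_dist_env` are assumptions, not part of the commit:

```shell
#!/bin/bash
# Hypothetical helper: export torch-style rank variables from Open MPI's,
# falling back to a single-process layout when mpirun did not set them.
export_dist_env() {
  export RANK=${OMPI_COMM_WORLD_RANK:-0}
  export LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
  export WORLD_SIZE=${OMPI_COMM_WORLD_SIZE:-1}
}

export_dist_env
echo "rank ${RANK}/${WORLD_SIZE}, local rank ${LOCAL_RANK}"
```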
--- a/multi_node/run-7b-sft-lora.sh
+++ b/multi_node/run-7b-sft-lora.sh
+ulimit -u 200000
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export OMP_NUM_THREADS=1
+export NCCL_DEBUG=INFO
+export MIOPEN_FIND_MODE=3
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export MIOPEN_COMPILE_PARALLEL_LEVEL=1
+export NCCL_PLUGIN_P2P=ucx
+export NCCL_SOCKET_IFNAME=ib0
+export NCCL_P2P_LEVEL=5
+echo "START TIME: $(date)"
+hostfile=./hostfile
+np=$(cat $hostfile|sort|uniq |wc -l)
+np=$(($np*8))
+which mpirun
+mpirun -np $np --allow-run-as-root --hostfile hostfile --bind-to none --mca btl_tcp_if_include enp97s0f1 `pwd`/run-7b-sft-lora-single.sh 8
+echo "END TIME: $(date)"