re-organize the code

Commit 4321d0c8 authored Sep 07, 2023 by zhaoying1
13 changed files
--- a/Dockerfile
+++ b/Dockerfile
+FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
+COPY requirements.txt requirements.txt
+RUN source /opt/dtk-23.04/env.sh
+RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone 
+ENV LANG C.UTF-8
+RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
--- a/README.md
+++ b/README.md
 # ChatGLM-6B
-## Model Introduction
+## Paper
+`GLM: General Language Model Pretraining with Autoregressive Blank Infilling`
+- [https://arxiv.org/abs/2103.10360](https://arxiv.org/abs/2103.10360)
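+## Model Architecture
 ChatGLM-6B is an open-source conversational language model from Tsinghua University that supports both Chinese and English. It is based on the [General Language Model (GLM)](https://github.com/THUDM/GLM) architecture and has 6.2 billion parameters. ChatGLM-6B uses techniques similar to ChatGPT and is optimized for Chinese question answering and dialogue. After training on roughly 1T tokens of Chinese and English text, supplemented by supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback, the 6.2-billion-parameter ChatGLM-6B can generate answers that align fairly well with human preferences.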
+<div align="center">
+<img src="ptuning/media/GLM.png" width="550" height="200">
+</div>
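+The main network parameters of ChatGLM-6B are:
+| Model | Hidden size | Layers | Heads | Vocabulary size | Positional encoding | Max sequence length |
+| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
+| ChatGLM-6B | 4096 | 28 | 32 | 130528 | RoPE | 2048 |
+## Algorithm
+ChatGLM-6B is built on the GLM architecture. GLM is a Transformer-based language model trained with an autoregressive blank-infilling objective, which gives it both autoregressive and autoencoding capabilities.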
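-## Dataset
-This repository uses the [ADGEN](https://aclanthology.org/D19-1321.pdf) (advertisement generation) dataset as an example to show how to use the code. The task is to generate an advertising text (summary) from the input (content). Download the preprocessed ADGEN dataset from [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1), and place the extracted AdvertiseGen directory under the [ptuning](./ptuning) directory.
 ## Environment Setup
-### Docker
+### Docker (Option 1)
 Running in Docker is recommended. Pull the provided image: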
 ```
 docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
@@ -21,72 +35,100 @@ docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk
 pip install transformers==4.28.0 -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
 pip install accelerate sentencepiece mdtex2html gradio rouge_chinese nltk jieba datasets protobuf -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
 ```
+### Dockerfile (Option 2)
+```
+docker build -t chatglm:latest .
+docker run -dit --network=host --name=chatglm --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G  --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 chatglm:latest
+docker exec -it chatglm /bin/bash
+```
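-## P-tuning v2 Fine-tuning
+### Conda (Option 3)
-This repository implements fine-tuning of the ChatGLM-6B model with [P-Tuning v2](https://github.com/THUDM/P-tuning-v2). P-Tuning v2 is a parameter-efficient fine-tuning method proposed by Tsinghua University; with it, the number of parameters to fine-tune can be reduced to 0.1% of the original.
+1. Create a conda virtual environment: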
+```
+conda create -n chatglm python=3.8
+```
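-### Experimental Setup
+2. The toolkits, deep learning libraries, and other components this project needs for DCU cards can all be downloaded and installed from the [光合](https://developer.hpccube.com/tool/) developer community.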
+- [DTK 23.04](https://cancon.hpccube.com:65024/1/main/DTK-23.04.1)
+- [Pytorch 1.13.1](https://cancon.hpccube.com:65024/4/main/pytorch/dtk23.04)
+- [Deepspeed 0.9.2](https://cancon.hpccube.com:65024/4/main/deepspeed/dtk23.04)
-    max_source_length=64
+    Tips: the versions of the DTK driver, Python, DeepSpeed, and the other tools above must match one another exactly.
-    max_target_length=64
-    max_steps=3000
-    pre_seq_len=128
-    learning_rate=5e-3
-    per_device_train_batch_size=16
-    gradient_accumulation_steps=1
-    fp16
-### Training
+3. Install the remaining dependencies from requirements.txt:
-This fine-tuning script was run on 1 node with 4 DCU-Z100-32G cards
+```
+pip install -r requirements.txt
+```
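-Fine-tuning command
+## Dataset
+This repository uses the [ADGEN](https://aclanthology.org/D19-1321.pdf) (advertisement generation) dataset as an example to show how to use the code. The task is to generate an advertising text (summary) from the input (content). Download links:
+- [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1)
+Download the preprocessed ADGEN dataset and place the extracted AdvertiseGen directory under the [ptuning](./ptuning) directory. The dataset directory structure is: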
+```
+AdvertiseGen
+├── dev.json
+└── train.json
+```
-    cd ptuning
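+## P-tuning v2 Fine-tuning
-    bash pt_train.sh
+This repository implements fine-tuning of the ChatGLM-6B model with [P-Tuning v2](https://github.com/THUDM/P-tuning-v2). P-Tuning v2 is a parameter-efficient fine-tuning method proposed by Tsinghua University.
-### Training Loss Convergence
+### Single-node Multi-card Training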
-![](./ptuning/logs/6B_ds_pt_bs16_accum1_4cards_zero2_5e-3.jpg)
+```
+    cd ptuning
+    bash ptuning_train.sh
+```
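+Note: adjust the model path, dataset path, batch size, learning rate, and other parameters in the script to suit your needs.
 ### Inference Evaluation
 During P-tuning v2 training only the PrefixEncoder parameters are saved, so inference must load both the original ChatGLM-6B model and the PrefixEncoder weights. Run the following commands: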
+```
    cd ptuning
-    bash evaluate_pt.sh
+    bash evaluate_ptuning.sh
+```
+### Results
+- Training loss
+<div align="center">
+<img src="./ptuning/media/6B_ds_pt_bs16_accum1_4cards_zero2_5e-3.jpg" width="400" height="300">
+</div>
+- Inference test results:
-Test results:
 | Checkpoint | Training Loss |BLEU-4 | Rouge-1 |  Rouge-2 | Rouge-l |
 | :------: | :------: |:------: | :------: |:------: | :------: |
 | 2000 steps |  3.57 | 7.9777 | 31.0344 |  6.981 | 24.7393 |
 ## Full-parameter Fine-tuning (Finetune)
-### Experimental Setup
-    max_source_length=64
-    max_target_length=64
-    max_steps=5000
-    pre_seq_len=128
-    learning_rate=5e-5
-    per_device_train_batch_size=32
-    gradient_accumulation_steps=1
-    fp16
-### Training
-This fine-tuning script was run on 1 node with 4 DCU-Z100-32G cards
-Fine-tuning command
+### Single-node Multi-card Training
+```
    cd ptuning
    bash ft_train.sh
+```
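+Note: adjust the model path, dataset path, batch size, learning rate, and other parameters in the script to suit your needs.
-### Training Loss Convergence
+### Cluster Training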
-![](./ptuning/logs/6B_ds_ft_bs32_accum1_4cards_zero3_5e-5.jpg)
+```
+    cd ptuning/slurm_scripts
+    bash run.sh
+```
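+Note: adjust the model path, dataset path, batch size, learning rate, and other parameters in the script to suit your needs.
-### Inference Evaluation
+### Inference Evaluation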
+```
    cd ptuning
    bash evaluate_ft.sh
+```
+### Results
+- Training loss
+<div align="center">
+<img src="./ptuning/media/6B_ds_ft_bs32_accum1_4cards_zero3_5e-5.jpg" width="400" height="300">
+</div>
+- Inference test results:
-Test results:
 | Checkpoint | Training Loss |BLEU-4 | Rouge-1 |  Rouge-2 | Rouge-l |
 | :------: | :------: |:------: | :------: |:------: | :------: |
 | 3000 steps |  2.3398 | 7.6501 | 29.2229 | 6.466 | 23.8506 |
@@ -100,7 +142,9 @@ pip install accelerate sentencepiece mdtex2html gradio rouge_chinese nltk jieba
 | Rouge-2       | 7.36    | 7.11 | 6.96 |
 | Rouge-l       | 25.08  | 24.97 | 24.80 |
 | Training Loss | 3.00 | 3.57 | 3.32 | -->
-## Model Usage
+## Inference
 Run the following command:
    python cli_demo.py
@@ -156,8 +200,10 @@ HIP_VISIBLE_DEVICES=0,1,2,3 deepspeed --num_gpus=4 --master_port $MASTER_PORT ma
 ### Training Loss Convergence
 Because the pretraining dataset in this example is small, the loss falls to a low level, around 0.1.
+<div align="center">
+<img src="./ptuning/media/pretrain.jpeg" width="400" height="300">
+</div>
-![img](https://developer.hpccube.com/codes/modelzoo/chatglm/-/raw/main/ptuning/logs/pretrain.jpeg)
 ## Reinforcement Learning (RLHF) Fine-tuning Options
@@ -166,11 +212,22 @@ HIP_VISIBLE_DEVICES=0,1,2,3 deepspeed --num_gpus=4 --master_port $MASTER_PORT ma
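 - With LoRA, only the low-rank adaptation layers are updated; see this project for reference: https://github.com/hiyouga/ChatGLM-Efficient-Tuning/blob/main/examples/covid_doctor.md
 - With the DeepSpeed-Chat approach for full-parameter fine-tuning; the adaptation is already complete, feel free to try it: https://github.com/yuguo-Jack/ChatGLM-6B-in-DeepSpeed-Chat
+## Application Scenarios
+### Algorithm Category
+`Natural language processing`
+### Popular Application Industries
+`nlp, intelligent chat assistant, scientific research`
 ## Source Repository and Issue Feedback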
-https://developer.hpccube.com/codes/modelzoo/chatglm
+- https://developer.hpccube.com/codes/modelzoo/chatglm
 ## References
-[THUDM/ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B/tree/main)
+- [THUDM/ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B/tree/main)
--- a/model.properties
+++ b/model.properties
+# Unique model identifier
+modelCode=193
 # Model name
-modelName=ChatGLM-6B
+modelName=chatglm_torch
 # Model description
-modelDescription=基于Pytorch框架的ChatGLM-6B
+modelDescription=基于Pytorch框架的chatglm-6b
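-# Application scenarios (separate multiple tags with ASCII commas)
+# Application scenarios
 appScenario=训练,推理,train,inference,nlp,智能聊天助手
-# Framework type (separate multiple tags with ASCII commas)
+# Framework type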
-frameType=Pytorch
+frameType=Pytorch,Transformers,Deepspeed
--- a/ptuning/evaluate_ptuning.sh
+++ b/ptuning/evaluate_ptuning.sh
+PRE_SEQ_LEN=128
+CHECKPOINT=adgen-chatglm-6b-pt-4c-5e-3
+STEP=3000
+CUDA_VISIBLE_DEVICES=0 python3 main.py \
+    --do_predict \
+    --validation_file AdvertiseGen/dev.json \
+    --test_file AdvertiseGen/dev.json \
+    --overwrite_cache \
+    --model_name_or_path THUDM/chatglm-6b \
+    --ptuning_checkpoint ./output_pt/$CHECKPOINT/checkpoint-$STEP \
+    --prompt_column content \
+    --response_column summary \
+    --output_dir ./output_pt/$CHECKPOINT \
+    --overwrite_output_dir \
+    --max_source_length 64 \
+    --max_target_length 64 \
+    --per_device_eval_batch_size 1 \
+    --predict_with_generate \
+    --pre_seq_len $PRE_SEQ_LEN
\ No newline at end of file
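The `--ptuning_checkpoint` path above only resolves if training actually saved a checkpoint at `$STEP`. The sketch below lists the checkpoint directories a run would be expected to leave behind, assuming the `checkpoint-<step>` naming that the evaluation script itself relies on and the `--save_steps 1000` / `--max_steps 3000` values from `ptuning_train.sh` in this commit. The helper name `expected_checkpoints` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not in the repository): list the checkpoint directories
# a run is expected to leave behind, assuming `checkpoint-<step>` naming.
expected_checkpoints() (
    run_dir=$1      # value passed to --output_dir during training
    save_steps=$2   # --save_steps
    max_steps=$3    # --max_steps
    step=$save_steps
    while [ "$step" -le "$max_steps" ]; do
        echo "$run_dir/checkpoint-$step"
        step=$((step + save_steps))
    done
)

# Values taken from ptuning_train.sh (LR=5e-3, save every 1000 of 3000 steps).
LR=5e-3
expected_checkpoints "./output_pt/adgen-chatglm-6b-pt-4c-$LR" 1000 3000
```

With these values, the directory name built from `$LR` matches `CHECKPOINT=adgen-chatglm-6b-pt-4c-5e-3`, and `STEP` should be one of 1000, 2000, or 3000.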
--- a/ptuning/media/6B_ds_ft_bs32_accum1_4cards_zero3_5e-5.jpg
+++ b/ptuning/media/6B_ds_ft_bs32_accum1_4cards_zero3_5e-5.jpg
--- a/ptuning/media/6B_ds_pt_bs16_accum1_4cards_zero2_5e-3.jpg
+++ b/ptuning/media/6B_ds_pt_bs16_accum1_4cards_zero2_5e-3.jpg
--- a/ptuning/media/GLM.png
+++ b/ptuning/media/GLM.png
--- a/ptuning/media/pretrain.jpeg
+++ b/ptuning/media/pretrain.jpeg
--- a/ptuning/ptuning_train.sh
+++ b/ptuning/ptuning_train.sh
+PRE_SEQ_LEN=128
+LR=5e-3
+MASTER_PORT=$(shuf -n 1 -i 10000-65535)
+CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port $MASTER_PORT main.py \
+    --deepspeed deepspeed.json \
+    --do_train \
+    --train_file AdvertiseGen/train.json \
+    --test_file AdvertiseGen/dev.json \
+    --prompt_column content \
+    --response_column summary \
+    --overwrite_cache \
+    --model_name_or_path THUDM/chatglm-6b \
+    --output_dir ./output_pt/adgen-chatglm-6b-pt-4c-$LR \
+    --overwrite_output_dir \
+    --max_source_length 64 \
+    --max_target_length 64 \
+    --per_device_train_batch_size 16 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --predict_with_generate \
+    --max_steps 3000 \
+    --logging_steps 10 \
+    --save_steps 1000 \
+    --learning_rate $LR \
+    --pre_seq_len $PRE_SEQ_LEN \
+    --fp16
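As a rough check on the settings in `ptuning_train.sh`, the number of samples consumed per optimizer step is the per-device batch size times the number of devices times the gradient-accumulation steps. The sketch below is illustrative arithmetic only: the helper name `global_batch_size` is hypothetical, and it assumes plain data parallelism across the 4 devices in `CUDA_VISIBLE_DEVICES` and that `deepspeed.json` does not override these values.

```shell
# Hypothetical helper (not in the repository): samples consumed per optimizer
# step, assuming plain data parallelism across all visible devices.
global_batch_size() (
    per_device=$1   # --per_device_train_batch_size
    devices=$2      # number of devices in CUDA_VISIBLE_DEVICES
    accum=$3        # --gradient_accumulation_steps
    echo $((per_device * devices * accum))
)

# Values taken from ptuning_train.sh: 16 per device, 4 devices, no accumulation.
batch=$(global_batch_size 16 4 1)
echo "global batch size: $batch"
echo "samples seen in 3000 steps: $((batch * 3000))"
```

Under these assumptions the run uses a global batch of 64 and sees 192000 samples over `--max_steps 3000`.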
--- a/ptuning/slurm_scripts/run.sh
+++ b/ptuning/slurm_scripts/run.sh
+#!/bin/bash
+mkdir -p logs
+#rm -rf log/*
+mkdir -p pt_output
+mkdir -p hostfile
+sbatch run_train.sh
--- a/ptuning/slurm_scripts/run_train.sh
+++ b/ptuning/slurm_scripts/run_train.sh
+#!/bin/bash
+#SBATCH -p kshdnormal01
+#SBATCH -N 4
+#SBATCH --cpus-per-task=1
+#SBATCH --ntasks-per-node=32
+#SBATCH --mem 100G
+#SBATCH --gres=dcu:4
+#SBATCH -J chatglm
+#SBATCH -o logs/pt-%j.out
+#SBATCH -e logs/pt-%j.err
+ulimit -u 200000
+export OMP_NUM_THREADS=1
+export NCCL_DEBUG=INFO
+export MIOPEN_FIND_MODE=3
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export MIOPEN_COMPILE_PARALLEL_LEVEL=1
+export NCCL_PLUGIN_P2P=ucx
+export NCCL_SOCKET_IFNAME=ib0
+export NCCL_P2P_LEVEL=5
+export NCCL_NET_PLUGIN=none
+unset RCCL_NCHANNELS
+unset NCCL_NET_GDR_LEVEL
+rm -rf ./hostfile/*
+echo "START TIME: $(date)"
+hostfile=./hostfile/$SLURM_JOB_ID
+scontrol show hostnames $SLURM_JOB_NODELIST > ${hostfile}
+for i in `cat $hostfile`
+do
+    echo ${i} slots=4 >> `pwd`/hostfile/hostfile-dl-$SLURM_JOB_ID
+done
+np=$(cat $hostfile|sort|uniq |wc -l)
+np=$(($np*4))
+nodename=$(cat $hostfile |sed -n "1p")
+dist_url=`echo $nodename | awk '{print $1}'`
+echo ${dist_url}
+mpirun -np $np --hostfile hostfile/hostfile-dl-$SLURM_JOB_ID --bind-to none `pwd`/run_train_single.sh $dist_url
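The hostfile and process-count logic in `run_train.sh` can be tried without a Slurm allocation. The sketch below mirrors it under stated assumptions: `list_nodes` is a hypothetical stand-in for `scontrol show hostnames $SLURM_JOB_NODELIST`, the node names are made up, and `make_hostfile` is an illustrative helper that is not part of the repository.

```shell
# Hypothetical stand-in for `scontrol show hostnames $SLURM_JOB_NODELIST`:
# prints one (made-up) node name per line.
list_nodes() {
    printf '%s\n' node01 node02 node03 node04
}

slots_per_node=4   # DCUs per node, matching `#SBATCH --gres=dcu:4`

# Read host names on stdin and emit one "<host> slots=<n>" line per unique host.
make_hostfile() (
    slots=$1
    sort -u | while read -r host; do
        echo "$host slots=$slots"
    done
)

hostfile_text=$(list_nodes | make_hostfile "$slots_per_node")
nodes=$(list_nodes | sort -u | wc -l)
np=$((nodes * slots_per_node))        # total ranks handed to mpirun -np
master=$(list_nodes | sed -n '1p')    # first node, passed on as dist_url

echo "$hostfile_text"
echo "np=$np master=$master"
```

With `-N 4` and 4 slots per node this yields 16 ranks, which is the value `mpirun -np` receives in the script above.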
--- a/ptuning/slurm_scripts/run_train_single.sh
+++ b/ptuning/slurm_scripts/run_train_single.sh
+#!/bin/bash
+export HSA_FORCE_FINE_GRAIN_PCIE=1
+export MIOPEN_FIND_MODE=3
+export MIOPEN_COMPILE_PARALLEL_LEVEL=1
+export NCCL_PLUGIN_P2P=ucx
+export RCCL_NCHANNELS=2
+export NCCL_SOCKET_IFNAME=ib0
+export NCCL_P2P_LEVEL=5
+export NCCL_IB_HCA=mlx5_0
+export NCCL_DEBUG=INFO
+export NCCL_NET_GDR_LEVEL=SYS
+export NCCL_NET_PLUGIN=none
+unset RCCL_NCHANNELS
+unset NCCL_NET_GDR_LEVEL
+lrank=$OMPI_COMM_WORLD_LOCAL_RANK
+echo "LRANK===============================$lrank"
+RANK=$OMPI_COMM_WORLD_RANK
+WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
+export HIP_VISIBLE_DEVICES=0,1,2,3
+LR=1e-5
+APP="python3 ../main.py \
+    --deepspeed ../deepspeed.json \
+    --do_train \
+    --train_file AdvertiseGen/train.json \
+    --prompt_column prompt \
+    --response_column response \
+    --model_name_or_path THUDM/chatglm-6b \
+    --output_dir ./output_ft/pretrain \
+    --overwrite_output_dir \
+    --max_source_length 64 \
+    --max_target_length 64 \
+    --per_device_train_batch_size 1 \
+    --per_device_eval_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --predict_with_generate \
+    --max_steps 2000 \
+    --logging_steps 5 \
+    --save_steps 1000 \
+    --learning_rate $LR \
+    --fp16 \
+    --local_rank $lrank "
+case ${lrank} in
+[0])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_0:1
+  export UCX_IB_PCI_BW=mlx5_0:50Gbs
+  numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
+[1])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_1:1
+  export UCX_IB_PCI_BW=mlx5_1:50Gbs
+  numactl --cpunodebind=1 --membind=1 ${APP}
+  ;;
+[2])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_2:1
+  export UCX_IB_PCI_BW=mlx5_2:50Gbs
+  numactl --cpunodebind=2 --membind=2 ${APP}
+  ;;
+[3])
+  export HIP_VISIBLE_DEVICES=0,1,2,3
+  export UCX_NET_DEVICES=mlx5_3:1
+  export UCX_IB_PCI_BW=mlx5_3:50Gbs
+  numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+esac
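The four `case` branches above differ only in the local rank used for the InfiniBand device and the NUMA node. The sketch below shows one compact way to express that mapping; it only prints the settings rather than launching anything. The helper name `rank_settings` is hypothetical, and the one-to-one pairing of local rank, `mlx5_<rank>`, and NUMA node is taken from the script and assumed to match the target hardware.

```shell
# Hypothetical helper (not in the repository): print the settings that the
# case statement above selects for a given local rank (0-3).
rank_settings() (
    lrank=$1
    case "$lrank" in
        [0-3])
            echo "UCX_NET_DEVICES=mlx5_${lrank}:1"
            echo "UCX_IB_PCI_BW=mlx5_${lrank}:50Gbs"
            echo "numactl --cpunodebind=${lrank} --membind=${lrank}"
            ;;
        *)
            # The original script silently does nothing for other ranks;
            # this sketch reports them instead.
            echo "unsupported local rank: $lrank" >&2
            exit 1
            ;;
    esac
)

rank_settings 2
```

Unlike the original, this version fails loudly for a local rank outside 0-3, which can make a mismatch between `slots` in the hostfile and the devices on a node easier to spot.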
--- a/requirements.txt
+++ b/requirements.txt
 protobuf
-transformers==4.27.1
+transformers==4.28.0
-gradio
+accelerate
-mdtex2html
 sentencepiece
-accelerate
+mdtex2html
\ No newline at end of file
+gradio
+rouge_chinese
+nltk
+jieba
+datasets
+protobuf
\ No newline at end of file