ModelZoo / Qwen_pytorch
Commit 5ad54d77, authored Oct 10, 2023 by hepj987
Commit message: Standardization changes (标准修改)
parent 60a64b85
Showing 6 changed files with 99 additions and 37 deletions
README.md +31 -21
hostfile +2 -0
model.properties +10 -0
mpirun-nodes.sh +11 -0
qwen.png +0 -0
single_ddp.sh +45 -16
README.md @ 5ad54d77
...
...
@@ -22,28 +22,20 @@ https://arxiv.org/pdf/2308.12966.pdf
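## Algorithm principle
```
Model architecture: Qwen-7B is built on an architecture similar to LLaMA. The main differences from the standard transformer are: 1) untied embeddings; 2) rotary positional embeddings; 3) no biases in attention except for QKV; 4) RMSNorm instead of LayerNorm; 5) SwiGLU instead of ReLU; and 6) flash attention to accelerate training. The model has 32 layers, an embedding dimension of 4096, and 32 attention heads.
```
## Dataset

```
Uses the alpaca_gpt4_zh dataset, which is already included in the data directory; the file is alpaca_gpt4_data_zh.json
Model architecture: Qwen-7B is built on an architecture similar to LLaMA. The main differences from the standard transformer are: 1) untied embeddings; 2) rotary positional embeddings; 3) no biases in attention except for QKV; 4) RMSNorm instead of LayerNorm; 5) SwiGLU instead of ReLU; and 6) flash attention to accelerate training. The model has 32 layers, an embedding dimension of 4096, and 32 attention heads.
```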
## Model download
[Qwen model download](https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main)
## Qwen training
### Environment setup
## Environment setup
Running in Docker is recommended. Pull the Docker image provided by [SourceFind (光源)](https://www.sourcefind.cn/#/main-page):
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py39-latest
docker run -dit --network=host --name=qwen_pytorch --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py39-latest
```
Enter the Docker container
...
...
@@ -54,25 +46,43 @@ pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --t
The apex, torch, and deepspeed packages must be downloaded in the matching versions from the [developer community](https://cancon.hpccube.com:65024/4/main/).
### Training (single node)
## Dataset
```
bash run-node.sh
Uses the alpaca_gpt4_zh dataset, which is already included in the data directory; the file is alpaca_gpt4_data_zh.json
```
```
# Dataset directory tree
data
├── alpaca_gpt4_data_en.json
└── alpaca_gpt4_data_zh.json
```
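Before launching training, it can help to confirm that the dataset files listed in the tree above are present. The following is a minimal sketch; the helper name `check_dataset` is hypothetical and not part of this repository, and it only checks that each file exists and is non-empty.

```shell
# Hypothetical helper (not part of the repository): check that both
# dataset files from the tree above exist and are non-empty.
# Usage: check_dataset [data_dir]   (data_dir defaults to "data")
check_dataset() {
  dir="${1:-data}"
  for f in alpaca_gpt4_data_en.json alpaca_gpt4_data_zh.json; do
    [ -s "$dir/$f" ] || { echo "missing or empty: $dir/$f" >&2; return 1; }
  done
}
```

Calling `check_dataset data` from the repository root returns a non-zero status and names the first missing file if the data directory is incomplete.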
### Training (multi-node cluster)
## Model download
[Qwen model download](https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main)
## Qwen training
### Training (single node)
```
# Modify the node names, load the matching virtual environment, set the model path, etc.
# Run on the cluster
sbatch run-dtk23.04.sh
bash run-node.sh
```
### Training (multi-node)
```
# Modify the node names, load the matching virtual environment, set the model path, etc., and edit hostfile to list the nodes you are using
sh mpirun-nodes.sh
```
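Since the multi-node run depends on a hand-edited hostfile, a quick format check before calling `mpirun-nodes.sh` may catch typos early. This is a sketch under the assumption that every non-blank line has exactly the form `<host> slots=<n>`, as in the hostfile added by this commit; the helper name `validate_hostfile` is hypothetical.

```shell
# Hypothetical helper (not part of the repository): verify that every
# non-blank hostfile line is "<host> slots=<positive integer>".
# Returns non-zero and reports the offending lines otherwise.
validate_hostfile() {
  awk '
    NF == 0 { next }                                   # skip blank lines
    NF != 2 || $2 !~ /^slots=[1-9][0-9]*$/ {
      printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

For example, `validate_hostfile ./hostfile && sh mpirun-nodes.sh` would only start the run when the hostfile passes this check.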
## Model training loss
## result
Two-node ZeRO-3 training on the Wuzhen cluster
Two-node, eight-card ZeRO-3 training on the Wuzhen cluster
| train | loss |
| :-------------------: | :----: |
...
...
hostfile (new file, 0 → 100644) @ 5ad54d77
10.0.21.163 slots=8
10.0.21.116 slots=8
model.properties (new file, 0 → 100644) @ 5ad54d77
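# Unique model identifier
modelCode=397
# Model name
modelName=Qwen_pytorch
# Model description (in English: Qwen is a pretrained language representation model open-sourced by Alibaba.)
modelDescription=Qwen是阿里开源的预训练语言表征模型。
# Application scenarios (in English: training, NLP, text question answering)
appScenario=训练,NLP,文本问答
# Framework type
frameType=Pytorch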
mpirun-nodes.sh (new file, 0 → 100644) @ 5ad54d77
hostfile=./hostfile
np=$(cat $hostfile|sort|uniq |wc -l)
np=$(($np*8))
nodename=$(cat $hostfile |sed -n "1p")
dist_url=`echo $nodename | awk '{print $1}'`
which mpirun
mpirun -np $np --allow-run-as-root --hostfile $hostfile --bind-to none --mca btl_tcp_if_include $dist_url single-16B.sh $dist_url
echo "END TIME: $(date)"
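The script above derives the process count by multiplying the number of unique hostfile lines by a hard-coded 8, which matches the `slots=8` entries in this commit's hostfile. As an illustration only, the count could instead be read from the `slots=` fields themselves. The helper name `total_slots` is hypothetical, and the sketch assumes each line carries a `slots=<n>` field.

```shell
# Hypothetical helper (not part of the repository): sum the slots=<n>
# values over the unique lines of a hostfile and print the total.
total_slots() {
  sort -u "$1" | awk '
    {
      # look for a slots=<n> field after the host name
      for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=[0-9]+$/) { sub(/^slots=/, "", $i); total += $i }
    }
    END { print total + 0 }
  '
}
```

With the two-node hostfile from this commit, `np=$(total_slots ./hostfile)` gives the same value as the hard-coded multiplication, and it would follow the hostfile if the slot counts ever differed between nodes.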
qwen.png (new file, 0 → 100644) @ 5ad54d77
112 KB
single_ddp.sh @ 5ad54d77
...
...
@@ -34,23 +34,52 @@ APP="python ./src/train_bash.py \
 case ${lrank} in
 [0])
-  export HIP_VISIBLE_DEVICES=0,1,2,3
-  export UCX_NET_DEVICES=mlx5_0:1
-  numactl --cpunodebind=0 --membind=0 ${APP}
-  ;;
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_0:1
+  export UCX_IB_PCI_BW=mlx5_0:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
 [1])
-  export HIP_VISIBLE_DEVICES=0,1,2,3
-  export UCX_NET_DEVICES=mlx5_1:1
-  numactl --cpunodebind=1 --membind=1 ${APP}
-  ;;
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_1:1
+  export UCX_IB_PCI_BW=mlx5_1:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
 [2])
-  export HIP_VISIBLE_DEVICES=0,1,2,3
-  export UCX_NET_DEVICES=mlx5_2:1
-  numactl --cpunodebind=2 --membind=2 ${APP}
-  ;;
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_2:1
+  export UCX_IB_PCI_BW=mlx5_2:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
 [3])
-  export HIP_VISIBLE_DEVICES=0,1,2,3
-  export UCX_NET_DEVICES=mlx5_3:1
-  numactl --cpunodebind=3 --membind=3 ${APP}
-  ;;
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_3:1
+  export UCX_IB_PCI_BW=mlx5_3:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
+[4])
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_4:1
+  export UCX_IB_PCI_BW=mlx5_4:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+[5])
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_5:1
+  export UCX_IB_PCI_BW=mlx5_5:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+[6])
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_6:1
+  export UCX_IB_PCI_BW=mlx5_6:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+[7])
+  export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+  export UCX_NET_DEVICES=mlx5_7:1
+  export UCX_IB_PCI_BW=mlx5_7:50Gbs
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
 esac
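The eight-device branches in the case statement above follow a regular pattern: the local rank selects the `mlx5_<rank>` device, ranks 0 to 3 bind to NUMA node 0, and ranks 4 to 7 bind to NUMA node 3. The sketch below restates that mapping as a single function, purely to make the pattern explicit. The helper name `rank_settings` and the `NUMA_NODE` label in its output are hypothetical, and the commit itself keeps the explicit case statement.

```shell
# Hypothetical helper (not part of the repository): print the per-rank
# settings that the 8-device branches of the case statement apply.
rank_settings() {
  lrank="$1"
  case "$lrank" in
    [0-3]) numa=0 ;;   # ranks 0-3 use --cpunodebind=0 --membind=0
    [4-7]) numa=3 ;;   # ranks 4-7 use --cpunodebind=3 --membind=3
    *) echo "unsupported local rank: $lrank" >&2; return 1 ;;
  esac
  echo "UCX_NET_DEVICES=mlx5_${lrank}:1 UCX_IB_PCI_BW=mlx5_${lrank}:50Gbs NUMA_NODE=${numa}"
}
```

Running `rank_settings 5` prints the device and NUMA values for local rank 5, which can be compared against the corresponding branch when adapting the script to a node with a different NIC or NUMA layout.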