wangkx1 / multi_machine_train_summary
Commit 8148977d, authored Nov 07, 2024 by wangkaixiong
Commit message: init
readme.md (new file, +264 -0)
I. Base environment setup:
Miniconda
- Miniconda download page: https://docs.anaconda.com/miniconda/
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```
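II. Multi-machine training:
Assume machine A has the IP 10.10.10.1 and machine B has the IP 10.10.10.2:
1. Set up NFS shared storage and grant the worker node write permission:
Machine A: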
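```bash
# 1. Install the NFS server
sudo apt update
sudo apt install nfs-kernel-server
# 2. On machine A, declare /path/to/AI as the shared directory
sudo vim /etc/exports
# Append the following line to the end of /etc/exports (file content, not a shell command), then save and exit
/path/to/AI 10.10.10.2(rw,sync,no_subtree_check)
# 3. Restart the service so that the configuration takes effect
sudo service nfs-kernel-server restart
# 4. Re-export all directories, then list them; if /path/to/AI is not listed, the configuration did not take effect
sudo exportfs -a
sudo exportfs -v
# 5. Allow every user to read and write the contents of the directory
sudo chmod -R 777 /path/to/AI
```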
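Machine B:
```bash
# 1. Install the NFS client
sudo apt update
sudo apt install nfs-common
# 2. Create the mount point
sudo mkdir -p /mnt/AI
# 3. Mount the directory exported by machine A
sudo mount -t nfs 10.10.10.1:/path/to/AI /mnt/AI -o rw
# 4. Machine B can now read and write the directory shared by machine A
```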
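After mounting, it can help to confirm that the share is really writable from machine B before starting a long training run. The following is a minimal sketch, not part of the original setup: `check_dir_rw` is a hypothetical helper name, and it only writes a small marker file, reads it back, and removes it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original README):
# verify that a directory can be written to and read back.
check_dir_rw() {
    local dir="$1"
    local marker="${dir}/.rw_check_$$"   # $$ is the shell PID, used to keep the name unique
    local token="rw-check-$$"
    if [ ! -d "${dir}" ]; then
        echo "${dir}: not a directory" >&2
        return 1
    fi
    # Write a marker file into the directory
    if ! echo "${token}" > "${marker}"; then
        echo "${dir}: not writable" >&2
        return 1
    fi
    # Read the marker back and compare it with what was written
    if [ "$(cat "${marker}")" != "${token}" ]; then
        rm -f "${marker}"
        echo "${dir}: read-back mismatch" >&2
        return 1
    fi
    rm -f "${marker}"
    echo "${dir}: read/write ok"
}
```

On machine B this would be called as `check_dir_rw /mnt/AI`; a non-zero exit status suggests the export options or the directory permissions on machine A need another look.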
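2. Passwordless SSH between machines A and B:
- On both machine A and machine B, change the SSH server configuration to allow root login, and set a password for the root user: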
```bash
vim /etc/ssh/sshd_config
# Change the PermitRootLogin line to the following (a line in sshd_config, not a shell command)
PermitRootLogin yes
# Save and exit
# Restart the SSH server
systemctl restart sshd
# Set the root password:
passwd
```
- Configure passwordless SSH login between the machines:
```bash
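# Run on machine A and on machine B to generate a key pair
ssh-keygen -t ed25519 -C "node"
# On machine A, install the key for itself and for machine B:
ssh-copy-id root@10.10.10.1
ssh-copy-id root@10.10.10.2
# On machine B, install the key for itself and for machine A:
ssh-copy-id root@10.10.10.2
ssh-copy-id root@10.10.10.1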
```
- Edit /etc/hosts on machine A and machine B, adding the following entries:
```bash
10.10.10.1 node1
10.10.10.2 node2
```
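mpirun starts the remote processes over SSH, so every node should be reachable by name without a password prompt. A minimal sketch of such a check is shown here; `check_passwordless_ssh` is a hypothetical helper name, and it assumes root login and the node names from /etc/hosts above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original README):
# report whether each host accepts key-based root login without prompting.
check_passwordless_ssh() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of asking for a password;
        # ConnectTimeout=5 avoids hanging on an unreachable host.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" true; then
            echo "${host}: ok"
        else
            echo "${host}: FAILED"
            failed=1
        fi
    done
    return "${failed}"
}
```

Running `check_passwordless_ssh node1 node2` on both machines should print `ok` for each name. A host that has never been connected to may also fail in batch mode until its host key has been accepted once interactively.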
3. Multi-machine project with mpirun:
- Create a hostfile with the following content:
```
node1 slots=8
node2 slots=8
```
- The script that mpirun finally runs on every machine must have 777 permissions:
`chmod 777 scripts-run/single_process.sh`
- Write the multi-machine training script:
```bash
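# Run from the training project directory
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
hostfile=./hostfile
# Number of processes = unique host lines in the hostfile x 8 GPUs per host
np=$(sort -u "$hostfile" | wc -l)
np=$((np * 8))
# Confirm that the Open MPI daemon and launcher are on PATH
which orted
which mpirun
echo "np = ${np}"
# Host names must match /etc/hosts; enp209s0f0 is the NIC used for MPI traffic, adjust both to your machines
mpirun -np ${np} --host node1:8,node2:8 \
    --bind-to none \
    -mca btl_tcp_if_include enp209s0f0 \
    --allow-run-as-root \
    -x HSA_FORCE_FINE_GRAIN_PCIE=1 \
    -x PATH="/opt/mpi-4.1.6/bin:/root/miniconda3/envs/tf2.13.1/bin:$PATH" \
    -x LD_LIBRARY_PATH="/root/miniconda3/envs/tf2.13.1/lib:/opt/mpi-4.1.6/lib:$LD_LIBRARY_PATH" \
    scripts-run/single_process.sh
echo "END TIME: $(date)"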
```
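The launch script above derives `np` as the number of unique hostfile lines multiplied by a fixed 8. As an alternative sketch, the process count could be taken from the `slots=` values themselves, so that nodes with different GPU counts are handled. `total_slots` is a hypothetical helper name; lines without a `slots=` field contribute nothing in this sketch, which is a simplification rather than Open MPI's own rule.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original README):
# sum the slots=N values of an Open MPI style hostfile.
total_slots() {
    awk '
        /^[ \t]*#/ { next }                    # skip comment lines
        {
            for (i = 2; i <= NF; i++)          # field 1 is the host name
                if ($i ~ /^slots=[0-9]+$/)
                    total += substr($i, 7)     # text after "slots="
        }
        END { print total + 0 }                # prints 0 for an empty file
    ' "$1"
}
```

With the hostfile shown earlier, `np=$(total_slots ./hostfile)` would yield 16, the same value the original arithmetic produces for two 8-GPU nodes.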
- Contents of `scripts-run/single_process.sh`:
```bash
#!/bin/bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/root/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/root/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate tf2.13.1
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
echo "start ...."
# Ranks assigned by Open MPI: local rank within this node, global rank across all nodes
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
drank=$OMPI_COMM_WORLD_RANK
export NCCL_SOCKET_IFNAME=en,eth,em,bond
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=WARN
# export NCCL_SOCKET_NBUFS=512
# export NCCL_P2P_DISABLE=1
# export NCCL_SOCKET_NBUFS=128
# export NCCL_MIN_NRINGS=4
# export NCCL_SOCKET_TIMEOUT=180
# export NCCL_BLOCKING_WAIT=1
# export NCCL_SOCKET_TIMEOUT=60
# export NCCL_ALGO=ring
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_MIN_CU=16
# export NCCL_MAX_CU=256
# export PYTHONPATH=/workspace/train/resnet50_tensorflow:$PYTHONPATH
model_save_dir=./output_model
datasets=./imagenet_tf
echo "lrank = ${lrank}"
echo "drank = ${drank}"
APP="python3 ./official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --num_gpus=1 --skip_eval=true --batch_size=128 --train_epochs=90 --use_synthetic_data=false --distribution_strategy=multi_worker_mirrored --all_reduce_alg=nccl --dtype=fp16 --data_dir=${datasets} --task_index=${drank} "
# Pin each local rank (0-7) to the GPU with the same index, then start training
case ${lrank} in
    [0-7])
        export HIP_VISIBLE_DEVICES=${lrank}
        ${APP}
        # Optional NUMA pinning (NUMA node = local rank modulo 4):
        # numactl --cpunodebind=$((lrank % 4)) --membind=$((lrank % 4)) ${APP}
        ;;
esac
```
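The `case` statement in the script does nothing when the local rank is unset or outside 0-7, for example when the script is started by hand rather than through mpirun. A hedged sketch of a stricter variant follows; `select_gpu` is a hypothetical helper name and the default of 8 GPUs per node is an assumption taken from the `slots=8` hostfile.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original README):
# validate the local rank, then expose exactly one GPU to this process.
select_gpu() {
    local lrank="${1:-}"
    local ngpus="${2:-8}"    # assumed GPUs per node
    case "${lrank}" in
        ''|*[!0-9]*)
            echo "local rank is not a non-negative integer: '${lrank}'" >&2
            return 1
            ;;
    esac
    if [ "${lrank}" -ge "${ngpus}" ]; then
        echo "local rank ${lrank} is out of range for ${ngpus} GPUs" >&2
        return 1
    fi
    export HIP_VISIBLE_DEVICES="${lrank}"
}
```

Inside `single_process.sh` this could be used as `select_gpu "${OMPI_COMM_WORLD_LOCAL_RANK:-}" 8 || exit 1` before running `${APP}`, so that a misconfigured launch fails loudly instead of exiting silently.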
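III. Common issues:
1. Shared storage: running the NFS server inside a container causes problems; run it on the physical host instead.
2. In a cluster created with docker swarm, the network created through docker network may cause sockets to receive data whose length differs from what is expected; avoid combining docker swarm with mpirun and NCCL.
3. If docker swarm cannot join worker nodes, the firewall may need to be disabled.
4. When machine A uses mpirun to run in machine B's environment, enter the conda environment as follows: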
```bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/root/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/root/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate tf2.13.1
```
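Processes started remotely by mpirun run in a non-interactive shell, which is presumably why the conda block has to be repeated inside the script. To catch the case where activation did not happen, a small guard can be placed after `conda activate`. This is a sketch only: `require_conda_env` is a hypothetical helper name, and it relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not from the original README):
# fail if the expected conda environment is not the active one.
require_conda_env() {
    local want="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "${want}" ]; then
        echo "expected conda env '${want}', got '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
}
```

For example, `require_conda_env tf2.13.1 || exit 1` directly after the activation line makes every rank stop early when the environment is missing on one of the nodes.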