I. Basic environment setup:
Miniconda
- Miniconda download page: https://docs.anaconda.com/miniconda/
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```
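After downloading, run the installer and create a conda environment for training. A minimal sketch, assuming the install prefix `/root/miniconda3` and the environment name `tf2.13.1` used later in this document; the Python version is an assumption:
```bash
# Install Miniconda non-interactively into /root/miniconda3 (the prefix the
# scripts below assume), then create and activate the training environment.
bash Miniconda3-latest-Linux-x86_64.sh -b -p /root/miniconda3
source /root/miniconda3/etc/profile.d/conda.sh
conda create -n tf2.13.1 python=3.10 -y
conda activate tf2.13.1
```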
II. Multi-node training:
Assume machine A has the IP 10.10.10.1 and machine B has the IP 10.10.10.2:
1. Set up NFS shared storage and grant the worker node write access:
Machine A:
```bash
# 1. Install the NFS server
sudo apt update
sudo apt install nfs-kernel-server
# 2. On machine A, export the directory /path/to/AI as a share
vim /etc/exports
# Append the following line to the end of the file, then save and exit
/path/to/AI 10.10.10.2(rw,sync,no_subtree_check)
# 3. Restart the service so the new configuration is picked up
sudo service nfs-kernel-server restart
# 4. Export the shared directories; if `sudo exportfs -v` does not list the share afterwards, the configuration has not taken effect
sudo exportfs -a
# 5. Allow all users to read and write the shared directory
sudo chmod -R 777 /path/to/AI
```
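To confirm on machine A that the export is active (a sanity check, not part of the original steps):
```bash
# The share /path/to/AI should be listed with client 10.10.10.2 and the rw flag.
sudo exportfs -v
# Query the export list the way a client would see it.
showmount -e localhost
```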
Machine B:
```bash
# 1. Install the NFS client
sudo apt update
sudo apt install nfs-common
# 2. Create the mount point
sudo mkdir -p /mnt/AI
# 3. Mount the share
sudo mount -t nfs 10.10.10.1:/path/to/AI /mnt/AI -o rw
# 4. Machine B can now read and write the shared directory on machine A
```
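To verify the mount on machine B, and optionally make it persistent across reboots (the fstab line is a suggestion, not part of the original steps):
```bash
# Check that the export is visible and mounted, then try a write.
showmount -e 10.10.10.1
df -h /mnt/AI
touch /mnt/AI/nfs_write_test && rm /mnt/AI/nfs_write_test
# Optional: mount automatically at boot by appending this line to /etc/fstab:
# 10.10.10.1:/path/to/AI  /mnt/AI  nfs  rw,defaults,_netdev  0  0
```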
2. Passwordless SSH between machines A and B:
- On both machine A and machine B, change the SSH server configuration to allow root login, and set a root password:
```bash
vim /etc/ssh/sshd_config
# Change the line containing PermitRootLogin to the following
PermitRootLogin yes
# Save and exit, then restart the SSH server
systemctl restart sshd
# Set the root password:
passwd
```
- Configure passwordless SSH login between the machines
```bash
# On both machine A and machine B, generate a key pair
ssh-keygen -t ed25519 -C "node"
# On machine A, authorize passwordless login to itself and to machine B:
ssh-copy-id root@10.10.10.1
ssh-copy-id root@10.10.10.2
# On machine B, authorize passwordless login to itself and to machine A:
ssh-copy-id root@10.10.10.2
ssh-copy-id root@10.10.10.1
```
- On both machine A and machine B, append the following entries to /etc/hosts:
```bash
10.10.10.1 node1
10.10.10.2 node2
```
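- A quick check that hostname resolution and passwordless login both work (not part of the original steps; run from either machine):
```bash
# Each command should print the remote hostname without prompting for a password.
ssh root@node1 hostname
ssh root@node2 hostname
```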
3. Multi-node launch with mpirun:
- Create a hostfile with the following content:
```
node1 slots=8
node2 slots=8
```
- The script that mpirun ultimately runs on the nodes must be readable and executable everywhere; here it is simply set to 777: `chmod 777 scripts-run/single_process.sh`
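- Before writing the full training script, a quick connectivity check confirms that mpirun can start processes on both nodes (a minimal sketch; `/opt/mpi-4.1.6` is the Open MPI prefix used in the launch script below):
```bash
# Each of the 16 slots should print its node's hostname. Failures here usually
# point at SSH, /etc/hosts, or PATH/LD_LIBRARY_PATH problems rather than NCCL.
export PATH=/opt/mpi-4.1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/mpi-4.1.6/lib:$LD_LIBRARY_PATH
mpirun --hostfile ./hostfile -np 16 --allow-run-as-root hostname
```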
- Write the multi-node launch script:
```bash
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
hostfile=./hostfile
# Number of processes = unique hosts in the hostfile x 8 GPUs per node
np=$(cat $hostfile | sort | uniq | wc -l)
np=$(($np*8))
which orted
which mpirun
echo "np = ${np}"
# btl_tcp_if_include must name the NIC that carries inter-node traffic (check with `ip addr`)
mpirun -np ${np} --hostfile ${hostfile} \
    --bind-to none \
    -mca btl_tcp_if_include enp209s0f0 \
    --allow-run-as-root \
    -x HSA_FORCE_FINE_GRAIN_PCIE=1 \
    -x PATH="/opt/mpi-4.1.6/bin:/root/miniconda3/envs/tf2.13.1/bin:$PATH" \
    -x LD_LIBRARY_PATH="/root/miniconda3/envs/tf2.13.1/lib:/opt/mpi-4.1.6/lib:$LD_LIBRARY_PATH" \
    scripts-run/single_process.sh
echo "END TIME: $(date)"
```
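- Saved in the training directory as, for example, `run_multi_node.sh` (the file name is illustrative), the job can then be started from one node and the output captured to a log:
```bash
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
bash run_multi_node.sh 2>&1 | tee train_$(date +%Y%m%d_%H%M%S).log
```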
- Contents of `scripts-run/single_process.sh`:
```bash
#!/bin/bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
. "/root/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/root/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate tf2.13.1
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
echo "start ...."
lrank=$OMPI_COMM_WORLD_LOCAL_RANK   # local rank on this node, selects the GPU
drank=$OMPI_COMM_WORLD_RANK         # global rank across all nodes, passed as --task_index below
export NCCL_SOCKET_IFNAME=en,eth,em,bond
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=WARN
# export NCCL_SOCKET_NBUFS=512
# export NCCL_P2P_DISABLE=1
# export NCCL_SOCKET_NBUFS=128
# export NCCL_MIN_NRINGS=4
# export NCCL_SOCKET_TIMEOUT=180
# export NCCL_BLOCKING_WAIT=1
# export NCCL_SOCKET_TIMEOUT=60
# export NCCL_ALGO=ring
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_MIN_CU=16
# export NCCL_MAX_CU=256
# export PYTHONPATH=/workspace/train/resnet50_tensorflow:$PYTHONPATH
model_save_dir=./output_model
datasets=./imagenet_tf
echo "lrank = ${lrank}"
echo "drank = ${drank}"
APP="python3 ./official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --num_gpus=1 --skip_eval=true --batch_size=128 --train_epochs=90 --use_synthetic_data=false --distribution_strategy=multi_worker_mirrored --all_reduce_alg=nccl --dtype=fp16 --data_dir=${datasets} --task_index=${drank} "
case ${lrank} in
[0])
export HIP_VISIBLE_DEVICES=0
${APP}
# numactl --cpunodebind=0 --membind=0 ${APP}
;;
[1])
export HIP_VISIBLE_DEVICES=1
${APP}
# numactl --cpunodebind=1 --membind=1 ${APP}
;;
[2])
export HIP_VISIBLE_DEVICES=2
${APP}
# numactl --cpunodebind=2 --membind=2 ${APP}
;;
[3])
export HIP_VISIBLE_DEVICES=3
${APP}
# numactl --cpunodebind=3 --membind=3 ${APP}
;;
[4])
export HIP_VISIBLE_DEVICES=4
${APP}
# numactl --cpunodebind=0 --membind=0 ${APP}
;;
[5])
export HIP_VISIBLE_DEVICES=5
${APP}
# numactl --cpunodebind=1 --membind=1 ${APP}
;;
[6])
export HIP_VISIBLE_DEVICES=6
${APP}
# numactl --cpunodebind=2 --membind=2 ${APP}
;;
[7])
export HIP_VISIBLE_DEVICES=7
${APP}
# numactl --cpunodebind=3 --membind=3 ${APP}
;;
esac
```
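- Before the full multi-node run, a quick single-GPU smoke test with synthetic data can confirm that the conda environment and the GPU stack work. A sketch only: the `one_device` strategy value and the reduced batch/epoch settings are assumptions based on the flags used above, so adjust them to the repository's actual options.
```bash
source /root/miniconda3/etc/profile.d/conda.sh
conda activate tf2.13.1
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
export HIP_VISIBLE_DEVICES=0
# Synthetic data avoids any dependence on the ImageNet TFRecords.
python3 ./official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py \
    --num_gpus=1 --batch_size=32 --train_epochs=1 --skip_eval=true \
    --use_synthetic_data=true --distribution_strategy=one_device --dtype=fp16
```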
III. Common issues:
1. Shared storage: running the NFS server inside a container tends to cause problems; run the NFS server on the physical host.
2. In clusters created with docker swarm, networks created through docker network may return socket receive lengths that do not match what is expected; it is best not to combine docker swarm with mpirun + NCCL.
3. If worker nodes cannot join a docker swarm cluster, the firewall may need to be disabled (see the port sketch at the end of this section).
4. When machine A launches processes on machine B through mpirun, the launched script runs in a non-interactive shell, so it has to initialize and activate the conda environment itself:
```bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
. "/root/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/root/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate tf2.13.1
```
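For item 3 above, rather than disabling the firewall outright, the docker swarm ports documented by Docker can be opened explicitly (a sketch assuming ufw is the firewall in use):
```bash
# Option 1: disable the firewall entirely.
sudo ufw disable
# Option 2: open only the ports docker swarm needs.
sudo ufw allow 2377/tcp   # cluster management traffic
sudo ufw allow 7946/tcp   # node-to-node communication
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp   # overlay network (VXLAN) traffic
```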