I. Environment setup: Miniconda

- Miniconda download page: https://docs.anaconda.com/miniconda/

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```

II. Multi-machine training

Assume machine A has IP 10.10.10.1 and machine B has IP 10.10.10.2.

1. Set up NFS shared storage and give the child node write access:

Machine A:

```bash
# 1. Install the NFS server
sudo apt update
sudo apt install nfs-kernel-server

# 2. Export /path/to/AI on machine A as a shared directory
vim /etc/exports
# Append the following line, then save and exit:
/path/to/AI 10.10.10.2(rw,sync,no_subtree_check)

# 3. Restart the service so the configuration takes effect
sudo service nfs-kernel-server restart

# 4. Re-export all shared directories; errors, if any, are printed
#    (use `sudo exportfs -v` to list the active exports)
sudo exportfs -a

# 5. Allow all users to read and write the shared directory
sudo chmod -R 777 /path/to/AI
```

Machine B:

```bash
# 1. Install the NFS client
sudo apt update
sudo apt install nfs-common

# 2. Create the mount point
sudo mkdir -p /mnt/AI

# 3. Mount the share read-write
sudo mount -t nfs 10.10.10.1:/path/to/AI /mnt/AI -o rw

# 4. Machine B can now read and write the shared directory on machine A
```

2. Passwordless SSH between machines A and B:

- On both machines, change the SSH server configuration to allow root login, and set a root password:

```bash
vim /etc/ssh/sshd_config
# Change the PermitRootLogin line to:
PermitRootLogin yes
# Save and exit, then restart the SSH server
systemctl restart sshd

# Set the root password
passwd
```

- Configure passwordless SSH between the machines:

```bash
# Generate a key pair on both machine A and machine B
ssh-keygen -t ed25519 -C "node"

# On machine A, install the key for itself and for machine B:
ssh-copy-id root@10.10.10.1
ssh-copy-id root@10.10.10.2

# On machine B, install the key for itself and for machine A:
ssh-copy-id root@10.10.10.2
ssh-copy-id root@10.10.10.1
```

- Append the following entries to /etc/hosts on both machines:

```bash
10.10.10.1 node1
10.10.10.2 node2
```
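Before moving on to mpirun, it is worth verifying both pieces of plumbing from one node. The following is a minimal sketch, assuming the node names from /etc/hosts above and the /mnt/AI mount point; the script name `check_cluster.sh` is hypothetical:

```shell
#!/bin/bash
# check_cluster.sh -- sanity-check passwordless SSH and the NFS mount
# before launching any multi-node job. Node names and the mount point
# are the ones used in the examples above.
set -u
status=0

for host in node1 node2; do
    # BatchMode=yes fails immediately instead of prompting for a password,
    # so a failure here means key-based login is not set up.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" true 2>/dev/null; then
        echo "ssh ${host}: ok"
    else
        echo "ssh ${host}: FAILED (passwordless login not working)"
        status=1
    fi
done

# The shared directory must be writable from the client side.
if touch /mnt/AI/.write_test 2>/dev/null; then
    rm -f /mnt/AI/.write_test
    echo "nfs /mnt/AI: writable"
else
    echo "nfs /mnt/AI: NOT writable (check exportfs and the mount)"
    status=1
fi

echo "checks done, status=${status}"
```

Run it on both machines; any FAILED line points at the step above that still needs attention.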
3. Multi-node jobs with mpirun:

- Create a hostfile with the following content:

```
node1 slots=8
node2 slots=8
```

- The script that mpirun ultimately runs on each node must be executable:
  `chmod 777 scripts-run/single_process.sh`

- Launch script for multi-node training:

```bash
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
hostfile=./hostfile
# One process per GPU: number of unique hosts x 8 GPUs per host
np=$(cat $hostfile | sort | uniq | wc -l)
np=$(($np*8))
which orted
which mpirun
echo "np = ${np}"

mpirun -np ${np} --host node1:8,node2:8 \
    --bind-to none \
    -mca btl_tcp_if_include enp209s0f0 \
    --allow-run-as-root \
    -x HSA_FORCE_FINE_GRAIN_PCIE=1 \
    -x PATH="/opt/mpi-4.1.6/bin:/root/miniconda3/envs/tf2.13.1/bin:$PATH" \
    -x LD_LIBRARY_PATH="/root/miniconda3/envs/tf2.13.1/lib:/opt/mpi-4.1.6/lib:$LD_LIBRARY_PATH" \
    scripts-run/single_process.sh

echo "END TIME: $(date)"
```

- Contents of `scripts-run/single_process.sh`:

```bash
#!/bin/bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/root/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/root/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

conda activate tf2.13.1
cd /datav/ai-oem/peixun/train/resnet50_tensorflow
echo "start ...."
source /root/miniconda3/bin/activate /root/miniconda3/envs/tf2.13.1

lrank=$OMPI_COMM_WORLD_LOCAL_RANK   # rank local to this node
drank=$OMPI_COMM_WORLD_RANK         # global rank across all nodes

export NCCL_SOCKET_IFNAME=en,eth,em,bond
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=WARN
# Optional NCCL tuning knobs, kept commented for reference:
# export NCCL_SOCKET_NBUFS=512
# export NCCL_P2P_DISABLE=1
# export NCCL_SOCKET_NBUFS=128
# export NCCL_MIN_NRINGS=4
# export NCCL_SOCKET_TIMEOUT=180
# export NCCL_BLOCKING_WAIT=1
# export NCCL_SOCKET_TIMEOUT=60
# export NCCL_ALGO=ring
# export NCCL_NET_GDR_LEVEL=2
# export NCCL_MIN_CU=16
# export NCCL_MAX_CU=256
# export PYTHONPATH=/workspace/train/resnet50_tensorflow:$PYTHONPATH

model_save_dir=./output_model
datasets=./imagenet_tf
echo "lrank = ${lrank}"
echo "drank = ${drank}"

APP="python3 ./official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --num_gpus=1 --skip_eval=true --batch_size=128 --train_epochs=90 --use_synthetic_data=false --distribution_strategy=multi_worker_mirrored --all_reduce_alg=nccl --dtype=fp16 --data_dir=${datasets} --task_index=${drank}"

# One GPU per local rank; the original per-rank case statement collapses to:
export HIP_VISIBLE_DEVICES=${lrank}
${APP}
# NUMA-binding alternative (same node mapping as the commented-out originals):
# numactl --cpunodebind=$((lrank % 4)) --membind=$((lrank % 4)) ${APP}
```

III. Common problems:

1. Shared storage: running the NFS server side inside a container causes problems; run the server on the physical host.
2. In clusters created with docker swarm, networks created by docker network may return socket fields whose length differs from what is expected; avoid combining docker swarm with mpirun + NCCL.
3. If docker swarm cannot join a worker node, the firewall may need to be disabled.
4. Entering the conda environment when machine A launches processes on machine B via mpirun: the remotely launched shell is non-interactive and does not source ~/.bashrc, so the launched script must run the conda init block itself:

```bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/root/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/root/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

conda activate tf2.13.1
```
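An alternative to embedding the init block in every launched script is conda's `conda run` subcommand, which activates an environment for a single command. A minimal sketch of building such a remote command, using the install prefix and environment name from this guide (`REMOTE_CMD` is a hypothetical variable name):

```shell
# `conda run -n <env> <cmd>` activates the named environment just for <cmd>,
# so the remote script needs no conda initialization of its own.
# Paths assume miniconda is installed at /root/miniconda3 on every node.
CONDA=/root/miniconda3/bin/conda
ENV_NAME=tf2.13.1
REMOTE_CMD="${CONDA} run -n ${ENV_NAME} bash scripts-run/single_process.sh"
echo "${REMOTE_CMD}"
```

mpirun would then launch `${REMOTE_CMD}` instead of the bare script; note that `conda run` spawns a subprocess per invocation, which is usually negligible next to training startup time.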