一、Basic environment setup:

Miniconda
   - Miniconda download page: https://docs.anaconda.com/miniconda/

       ```bash
       wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
       ```

二、Multi-node training:

Assume machine A has IP 10.10.10.1 and machine B has IP 10.10.10.2:

1. Set up NFS shared storage and give the worker node write access:

Machine A:

   ```bash
    # 1. Install the NFS server
    sudo apt update
    sudo apt install nfs-kernel-server

    # 2. On machine A, export the directory /path/to/AI
    vim /etc/exports

    # Append the following line to the end of the file, then save and exit
    /path/to/AI 10.10.10.2(rw,sync,no_subtree_check)

    # 3. Restart the service so the configuration takes effect
    sudo service nfs-kernel-server restart

    # 4. Verify the export; if this prints nothing, the configuration did not take effect
    sudo exportfs -v

    # 5. Allow every user to read and write the shared directory
    sudo chmod -R 777 /path/to/AI
   ```
       
Machine B:

   ```bash
    # 1. Install the NFS client
    sudo apt update
    sudo apt install nfs-common

    # 2. Create the mount point
    sudo mkdir -p /mnt/AI

    # 3. Mount the share
    sudo mount -t nfs 10.10.10.1:/path/to/AI /mnt/AI -o rw

    # 4. Machine B can now read and write the directory shared by machine A
   ```
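
To make the mount on machine B survive reboots, an /etc/fstab entry can replace the manual `mount` step (a sketch; the IP and paths follow the example above):

   ```bash
    # /etc/fstab entry on machine B; `_netdev` delays the mount until the
    # network is up. Apply with `sudo mount -a`, verify with `df -h /mnt/AI`.
    10.10.10.1:/path/to/AI  /mnt/AI  nfs  rw,_netdev  0  0
   ```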
   

2. Passwordless SSH between machine A and machine B:

   - On both machine A and machine B, change the SSH server configuration to allow root login, and set a root password:

       ```bash
       vim /etc/ssh/sshd_config

       # Change the line containing PermitRootLogin to:
       PermitRootLogin yes
       # Save and exit

       # Restart the SSH server
       systemctl restart sshd

       # Set the root password:
       passwd
       ```

   - Configure passwordless SSH login across the machines

       ```bash
       # Generate a key pair on machine A and on machine B
       ssh-keygen -t ed25519 -C "node"

       # On machine A, set up passwordless login to itself and to machine B:
       ssh-copy-id root@10.10.10.1
       ssh-copy-id root@10.10.10.2

       # On machine B, set up passwordless login to itself and to machine A:
       ssh-copy-id root@10.10.10.2
       ssh-copy-id root@10.10.10.1
       ```

   - On both machine A and machine B, append the following entries to /etc/hosts:

     ```bash
     10.10.10.1 node1
     10.10.10.2 node2
     ```
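
The name-to-IP mapping can be sanity-checked without touching the real /etc/hosts by running the same lookup against a scratch copy (a sketch; `node2` is one of the names added above):

   ```bash
    # Build a scratch hosts-style file and resolve a node name from it,
    # mimicking what the resolver does with the /etc/hosts entries above.
    hosts_file=$(mktemp)
    printf '10.10.10.1 node1\n10.10.10.2 node2\n' > "$hosts_file"

    ip=$(awk -v host=node2 '$2 == host { print $1 }' "$hosts_file")
    echo "node2 -> ${ip}"    # node2 -> 10.10.10.2
    rm -f "$hosts_file"
   ```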

3. The multi-node mpirun setup:

   - Create a hostfile with the following content:
       ```
       node1 slots=8
       node2 slots=8
       ```
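
The launch script below derives the process count from this hostfile; that arithmetic can be checked on its own (a sketch using a throwaway copy of the file):

   ```bash
    # Reproduce the hostfile and derive np the same way the launch script does:
    # number of unique node lines times 8 slots per node.
    hostfile=$(mktemp)
    printf 'node1 slots=8\nnode2 slots=8\n' > "$hostfile"

    np=$(cat "$hostfile" | sort | uniq | wc -l)   # 2 unique nodes
    np=$((np * 8))                                # 8 ranks per node
    echo "np = ${np}"                             # np = 16
    rm -f "$hostfile"
   ```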

   - The script that mpirun ultimately runs on each node must have 777 permissions: `chmod 777 scripts-run/single_process.sh`

   - Write the multi-node launch script:

       ```bash
        cd /datav/ai-oem/peixun/train/resnet50_tensorflow

        hostfile=./hostfile

        np=$(cat $hostfile|sort|uniq |wc -l)
        np=$(($np*8))
        which orted
        which mpirun

        echo "np = ${np}"

         # NOTE: the node list comes from the hostfile above; change
         # btl_tcp_if_include to the NIC that connects your nodes
         mpirun -np ${np} --hostfile ${hostfile} \
        --bind-to none \
        -mca btl_tcp_if_include enp209s0f0 \
        --allow-run-as-root \
        -x HSA_FORCE_FINE_GRAIN_PCIE=1 \
        -x PATH="/opt/mpi-4.1.6/bin:/root/miniconda3/envs/tf2.13.1/bin:$PATH" \
        -x LD_LIBRARY_PATH="/root/miniconda3/envs/tf2.13.1/lib:/opt/mpi-4.1.6/lib:$LD_LIBRARY_PATH" \
        scripts-run/single_process.sh

        echo "END TIME: $(date)"
       ```

   - Contents of `scripts-run/single_process.sh`:

       ```bash
        #!/bin/bash

        # >>> conda initialize >>>
        # !! Contents within this block are managed by 'conda init' !!
        __conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
        if [ $? -eq 0 ]; then
            eval "$__conda_setup"
        else
            if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
                . "/root/miniconda3/etc/profile.d/conda.sh"
            else
                export PATH="/root/miniconda3/bin:$PATH"
            fi
        fi
        unset __conda_setup
        # <<< conda initialize <<<

        conda activate tf2.13.1


        cd /datav/ai-oem/peixun/train/resnet50_tensorflow

        echo "start ...."

        lrank=$OMPI_COMM_WORLD_LOCAL_RANK
        drank=$OMPI_COMM_WORLD_RANK


        export NCCL_SOCKET_IFNAME=en,eth,em,bond
        export NCCL_IB_DISABLE=1
        export NCCL_DEBUG=WARN
        # export NCCL_SOCKET_NBUFS=512

        # export NCCL_P2P_DISABLE=1
        # export NCCL_SOCKET_NBUFS=128
        # export NCCL_MIN_NRINGS=4
        # export NCCL_SOCKET_TIMEOUT=180
        # export NCCL_BLOCKING_WAIT=1
        # export NCCL_SOCKET_TIMEOUT=60
        # export NCCL_ALGO=ring
        # export NCCL_NET_GDR_LEVEL=2
        # export NCCL_MIN_CU=16
        # export NCCL_MAX_CU=256

        # export PYTHONPATH=/workspace/train/resnet50_tensorflow:$PYTHONPATH

        model_save_dir=./output_model
        datasets=./imagenet_tf

        echo "lrank = ${lrank}"
        echo "drank = ${drank}"

        APP="python3 ./official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py --num_gpus=1  --skip_eval=true   --batch_size=128 --train_epochs=90  --use_synthetic_data=false  --distribution_strategy=multi_worker_mirrored  --all_reduce_alg=nccl --dtype=fp16  --data_dir=${datasets}  --task_index=${drank} "
        case ${lrank} in
        [0])
        export HIP_VISIBLE_DEVICES=0
        ${APP}
        # numactl --cpunodebind=0 --membind=0 ${APP}
        ;;
        [1])
        export HIP_VISIBLE_DEVICES=1
        ${APP}
        # numactl --cpunodebind=1 --membind=1 ${APP}
        ;;
        [2])
        export HIP_VISIBLE_DEVICES=2
        ${APP}
        # numactl --cpunodebind=2 --membind=2 ${APP}
        ;;
        [3])
        export HIP_VISIBLE_DEVICES=3
        ${APP}
        # numactl --cpunodebind=3 --membind=3 ${APP}
        ;;
        [4])
        export HIP_VISIBLE_DEVICES=4
        ${APP}
        # numactl --cpunodebind=0 --membind=0 ${APP}
        ;;
        [5])
        export HIP_VISIBLE_DEVICES=5
        ${APP}
        # numactl --cpunodebind=1 --membind=1 ${APP}
        ;;
        [6])
        export HIP_VISIBLE_DEVICES=6
        ${APP}
        # numactl --cpunodebind=2 --membind=2 ${APP}
        ;;
        [7])
        export HIP_VISIBLE_DEVICES=7
        ${APP}
        # numactl --cpunodebind=3 --membind=3 ${APP}
        ;;
        esac

       ```
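
The per-rank GPU binding in the script can be exercised without MPI by supplying the local rank directly (a sketch; in a real run `OMPI_COMM_WORLD_LOCAL_RANK` provides the value):

   ```bash
    # Map a local rank to a GPU the way the script's case statement does.
    bind_gpu() {
        local lrank=$1
        case ${lrank} in
        [0-7]) export HIP_VISIBLE_DEVICES=${lrank} ;;
        *)     echo "unexpected local rank: ${lrank}" >&2; return 1 ;;
        esac
    }

    bind_gpu 5
    echo "HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES}"   # HIP_VISIBLE_DEVICES=5
   ```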


三、Common issues:

1. Shared storage: running the NFS server inside a container causes problems; run the server on the physical host.

2. In clusters created with docker swarm, networks created by docker network may deliver socket payloads whose length differs from what is expected; avoid combining docker swarm with mpirun + NCCL.

3. If docker swarm cannot add a worker node, the firewall may need to be turned off.

4. When machine A uses mpirun to invoke machine B's environment, the remote script must enter the conda environment itself:

```bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/root/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/root/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/root/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/root/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

conda activate tf2.13.1
```
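
This conda init block has to be repeated in the remote script because the shell mpirun spawns on the other node is non-interactive, so ~/.bashrc (which normally carries the block) is typically never sourced. That is easy to confirm (a minimal sketch):

```bash
# $- lists the active shell flags; an interactive shell includes 'i'.
# Shells launched as `bash -c` (like those mpirun spawns) do not.
mode=$(bash -c 'case "$-" in *i*) echo interactive ;; *) echo non-interactive ;; esac')
echo "spawned shell is ${mode}"    # spawned shell is non-interactive
```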