# Fast-Rcnn

[TOC]

## Introduction

This test case is used to test the PyTorch object detection model Faster R-CNN.

## Dataset preparation

This test case uses the Objects365 dataset. On the Kunshan platform, Objects365 is located at: /public/software/apps/DeepLearning/Data/objects365

Set `--data-path` in the parser in train.py to the path of the dataset. The number of epochs, the learning rate, the output location, and other parameters can be set there as well.

Parameter notes:

- In train.py, pay particular attention to `--batch-size` and `-j`;
- `--save-path` is the checkpoint save path and must be a folder that already exists (the example commands in this document pass the output location as `--output-dir`);
- `--data-path` is the location of the dataset.

## Environment preparation

Download the dtk22.04.1 installation package: [download link](https://cancon.hpccube.com:65024/1/main/DTK-22.04.1/centos7.6) (centos7.6 | 光合开发者社区, hpccube.com)

After extracting it, run the following command to load the dtk22.04.1 environment:

```
source env.sh
```

Prepare a Python 3.7 environment:

```
conda create --name Fast-Rcnn python=3.7
conda env list
conda activate Fast-Rcnn
```

Install the other dependencies (because of network issues, using a mirror source is recommended):

```
pip3 install torch-1.10.0a0+git450cdd1.dtk22.4-cp37-cp37m-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install torchvision-0.10.0a0_dtk22.04_300a8a4-cp37-cp37m-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install pycocotools -i https://pypi.tuna.tsinghua.edu.cn/simple
```

## Training

### Single card

    export HIP_VISIBLE_DEVICES=0
    python3 train.py --batch-size=2 -j 8 --epochs=26 --data-path=/path/to/datasets/folder --output-dir=/path/to/result/save/folder

### Single node, multiple cards

single_process.sh

```
#!/bin/bash
# Rank information provided by Open MPI
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
comm_rank=$OMPI_COMM_WORLD_RANK
comm_size=$OMPI_COMM_WORLD_SIZE

# ${1} is the master address passed in by mpirun
APP="python3 `pwd`/train.py --batch-size=8 -j 16 --epochs=12 --dist-url tcp://${1}:34567 --world-size=${comm_size} --rank=${comm_rank} --output-dir `pwd`/../output_dir_4card --data-path=/public/software/apps/DeepLearning/Data/objects365 --lr 0.01"

# Bind each local rank to its own NUMA node and network device
case ${lrank} in
[0])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_0:1
  export UCX_IB_PCI_BW=mlx5_0:50Gbs
  echo numactl --cpunodebind=0 --membind=0 ${APP}
  numactl --cpunodebind=0 --membind=0 ${APP}
  ;;
[1])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_1:1
  export UCX_IB_PCI_BW=mlx5_1:50Gbs
  echo numactl --cpunodebind=1 --membind=1 ${APP}
  numactl --cpunodebind=1 --membind=1 ${APP}
  ;;
[2])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_2:1
  export UCX_IB_PCI_BW=mlx5_2:50Gbs
  echo numactl --cpunodebind=2 --membind=2 ${APP}
  numactl --cpunodebind=2 --membind=2 ${APP}
  ;;
[3])
  export HIP_VISIBLE_DEVICES=0,1,2,3
  export UCX_NET_DEVICES=mlx5_3:1
  export UCX_IB_PCI_BW=mlx5_3:50Gbs
  echo numactl --cpunodebind=3 --membind=3 ${APP}
  numactl --cpunodebind=3 --membind=3 ${APP}
  ;;
esac
```

run_multi.sh

    mpirun -np 4 --bind-to none single_process.sh localhost
    #mpirun -np 4 --hostfile hostfile --bind-to none `pwd`/single_process.sh localhost

Run:

```
nohup $WORK_PATH/run_multi.sh > train_4card_32.log 2>&1 &
```

### Multiple nodes, multiple cards

    mpirun -np $np --hostfile hostfile --bind-to none `pwd`/single_process.sh ${master_ip}

## References

[https://github.com/pytorch/vision/tree/master/references/detection](https://github.com/pytorch/vision/tree/master/references/detection)
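
## Note on the per-rank binding in single_process.sh

The four `case` branches in single_process.sh differ only in the local rank index: rank N uses network device `mlx5_N` and NUMA node N. The pattern can be sketched as a function of the local rank. This is a minimal sketch for illustration only, not part of the test case; the function name `binding_for_rank` and the `NUMA_NODE=` output line are hypothetical, and the sketch assumes the same four-rank layout as the script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original scripts): print the settings that
# single_process.sh applies for a given local rank (0-3).
binding_for_rank() {
    local lrank="$1"
    case "${lrank}" in
        [0-3])
            # Same values the original script exports in each case branch
            echo "UCX_NET_DEVICES=mlx5_${lrank}:1"
            echo "UCX_IB_PCI_BW=mlx5_${lrank}:50Gbs"
            # NUMA node passed to numactl --cpunodebind / --membind
            echo "NUMA_NODE=${lrank}"
            ;;
        *)
            echo "unsupported local rank: ${lrank}" >&2
            return 1
            ;;
    esac
}
```

Calling `binding_for_rank 2` prints the `mlx5_2` settings and NUMA node 2, matching the `[2])` branch of the script; any rank outside 0 to 3 returns a non-zero status.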