Merge branch 'main' into 'main'

Added multi-node, multi-GPU training for some models, fixed MaskRCNN training printing useless logs, and updated and corrected several README files. See merge request dcutoolkit/deeplearing/dlexamples_new!43

Commit 6d8087d4 authored Dec 29, 2022 by sunxx1
14 changed files
--- a/PyTorch/Compute-Vision/Objection/Faster-rcnn/README.md
+++ b/PyTorch/Compute-Vision/Objection/Faster-rcnn/README.md
 # Fast-Rcnn 

-## Introduction 
+[TOC]
+
+
+
+## Introduction

 This test case exercises the PyTorch object detection model Faster R-CNN.


--- a/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/README.md
+++ b/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/README.md
@@ -17,7 +17,10 @@ COCO2017 dataset
 	     --lr-steps 16 22 --aspect-ratio-group-factor 3 \
 	     --data-path /path/to/{COCO2017_data_dir}  
 If downloading "https://download.pytorch.org/models/resnet50-19c8e357.pth" to .cache/torch/checkpoints/resnet50-19c8e357.pth fails, download resnet50-19c8e357.pth in advance and copy it to .cache/torch/checkpoints/.  
-### Multi-GPU
+
+### Single node, multiple GPUs
+
+1) Launching with PyTorch

 	export HIP_VISIBLE_DEVICES=0,1,2,3
 	export NGPUS=4
@@ -25,6 +28,24 @@ COCO2017 dataset
 	python3 -m torch.distributed.launch --nproc_per_node=${NGPUS} --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.005 --data-path /path/to/{COCO2017_data_dir} > train_2gpu_lr0.005.log 2>&1 &
 Note: for multi-GPU runs, the learning rate scales with the number of GPUs as 0.02/8*$NGPUS; for example, lr_4gpu=0.01, lr_2gpu=0.005, lr_1gpu=0.0025.  

+2) Launching with MPI
+
+```
+cd references/detection 
+mpirun -np 4  --bind-to none single_process.sh localhost
+```
+
+### Multiple nodes, multiple GPUs
+
+Launching with MPI
+
+```
+mpirun -np $np --hostfile hostfile --bind-to none single_process.sh $dist_url
+```
+
+Here, $dist_url is the IP address of the master node. For multi-node runs, edit the node list following the format used in the hostfile file.
+
 # References
+
 [https://github.com/pytorch/vision/tree/master/references/detection](https://github.com/pytorch/vision/tree/master/references/detection)
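
The learning-rate note in the MaskRCNN README (0.02/8 times the number of GPUs) can be sketched in Python as follows. The base values 0.02 and 8 come from that note; the function name `scaled_lr` and its parameter names are hypothetical and not part of train.py.

```python
def scaled_lr(ngpus, base_lr=0.02, base_gpus=8):
    """Linear scaling rule from the README note: lr = base_lr / base_gpus * ngpus."""
    if ngpus < 1:
        raise ValueError("ngpus must be at least 1")
    return base_lr / base_gpus * ngpus


if __name__ == "__main__":
    # Print the value that would be passed to --lr for a few GPU counts.
    for n in (1, 2, 4, 8):
        print(f"lr_{n}gpu = {scaled_lr(n):g}")
```

Under this rule, the value passed to `--lr` would be recomputed whenever `NGPUS` changes.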

--- a/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/env.sh
+++ b/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/env.sh
+#!/bin/bash
+which python3
--- a/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/hostfile
+++ b/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/hostfile
+a01r1n13 slots=4
+a01r1n14 slots=4
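
The `$np` passed to `mpirun -np` in the multi-node command presumably matches the total number of slots listed in this hostfile. A minimal sketch of that arithmetic, assuming the `host slots=N` format shown above; `total_slots` is a hypothetical helper and not part of the repository, and counting a host without `slots=` as one slot is an assumption made here for illustration.

```python
def total_slots(hostfile_text):
    """Sum the slots=N values over all host lines of a hostfile."""
    total = 0
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        slots = 1  # assumption: a host listed without slots= counts once
        for field in line.split()[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        total += slots
    return total


if __name__ == "__main__":
    hostfile = "a01r1n13 slots=4\na01r1n14 slots=4\n"
    print(total_slots(hostfile))
```

For the two-node hostfile in this commit, the sum is 8, so the launch would use `-np 8`.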
--- a/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/single_process.sh
+++ b/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/single_process.sh
+#!/bin/bash
+export MIOPEN_DEBUG_DISABLE_FIND_DB=1
+export NCCL_SOCKET_IFNAME=ib0
+export HSA_USERPTR_FOR_PAGED_MEM=0
+#source /public/software/apps/DeepLearning/PyTorch/pytorch-env.sh
+
+module rm compiler/dtk/21.10
+module load compiler/dtk/22.10
+cd /work/home/sugon_ldc/Gitlab/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection
+#conda init
+#source activate
+#source deactivate
+#conda activate maskrcnn
+source env.sh
+
+#module load apps/PyTorch/1.5.0a0/hpcx-2.4.1-gcc-7.3.1-rocm3.3
+#source /public/home/aiss/Pytorch/env_rocm3.3_torch1.5.sh
+lrank=$OMPI_COMM_WORLD_LOCAL_RANK
+comm_rank=$OMPI_COMM_WORLD_RANK
+comm_size=$OMPI_COMM_WORLD_SIZE
+
+echo $lrank
+echo $comm_rank
+echo $comm_size
+echo '##################'
+
+#APP="python3 `pwd`/main_bench.py --batch-size=${3} --a=${2} -j 24 --epochs=1 --dist-url tcp://${1}:34567 --dist-backend nccl --world-size=${comm_size} --rank=${comm_rank} --synthetic /public/software/apps/DeepLearning/Data/ImageNet-pytorch/"
+APP="python3 train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --dist-url tcp://${1}:34567 --dist-backend nccl --world-size=${comm_size} --rank=${comm_rank} --lr-steps 16 22 --aspect-ratio-group-factor 3 --data-path /work/home/sugon_ldc/datasets/COCO2017/"
+echo "dist_url: ${1}"
+
+case ${lrank} in
+[0])
+  export HIP_VISIBLE_DEVICES=0
+  export UCX_NET_DEVICES=mlx5_0:1
+  export UCX_IB_PCI_BW=mlx5_0:50Gbs
+  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  
+  #echo GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP} 
+  #GLOO_SOCKET_IFNAME=ib0 numactl --cpunodebind=0 --membind=0 ${APP}
+  ;;
+[1])
+  export HIP_VISIBLE_DEVICES=1
+  export UCX_NET_DEVICES=mlx5_1:1
+  export UCX_IB_PCI_BW=mlx5_1:50Gbs
+  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=1 --membind=1 ${APP}
+  ;;
+[2])
+  export HIP_VISIBLE_DEVICES=2
+  export UCX_NET_DEVICES=mlx5_2:1
+  export UCX_IB_PCI_BW=mlx5_2:50Gbs
+  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP} 
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=2 --membind=2 ${APP}
+  ;;
+[3])
+  export HIP_VISIBLE_DEVICES=3
+  export UCX_NET_DEVICES=mlx5_3:1
+  export UCX_IB_PCI_BW=mlx5_3:50Gbs
+  echo NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
+  NCCL_SOCKET_IFNAME=ib0 numactl --cpunodebind=3 --membind=3 ${APP}
+  ;;
+esac
+
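
The `case` statement in single_process.sh maps each local MPI rank to one GPU, one InfiniBand device, and one NUMA node. The sketch below restates that mapping in Python for review purposes; `per_rank_env` and `numactl_prefix` are hypothetical names, and the values are copied from the script. Note that the script has no default branch, so a local rank outside 0-3 would run nothing; the sketch raises an error in that case instead.

```python
import os


def per_rank_env(local_rank):
    """Environment variables the script sets for one local rank (0-3)."""
    if local_rank not in range(4):
        raise ValueError("single_process.sh only handles local ranks 0-3")
    return {
        "HIP_VISIBLE_DEVICES": str(local_rank),
        "UCX_NET_DEVICES": f"mlx5_{local_rank}:1",
        "UCX_IB_PCI_BW": f"mlx5_{local_rank}:50Gbs",
        "NCCL_SOCKET_IFNAME": "ib0",
    }


def numactl_prefix(local_rank):
    """Command prefix that binds CPU and memory to the rank's NUMA node."""
    return ["numactl", f"--cpunodebind={local_rank}", f"--membind={local_rank}"]


if __name__ == "__main__":
    # Open MPI exports this variable for every launched process.
    lrank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    print(per_rank_env(lrank))
    print(" ".join(numactl_prefix(lrank)))
```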
--- a/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/train.py
+++ b/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/train.py
@@ -29,6 +29,7 @@ from engine import train_one_epoch, evaluate

 import presets
 import utils
+import torch.distributed as dist


 def get_dataset(name, image_set, transform, data_path):
@@ -111,6 +112,12 @@ def get_args_parser(add_help=True):
    #PAN
    #Mixed precision training parameters
    parser.add_argument("--amp", action="store_true", help="Use torch.cuda.amp for mixed precision training")
+    parser.add_argument('--dist-backend', default='nccl', type=str,help='distributed backend')
+    parser.add_argument('--multiprocessing-distributed', action='store_true',
+                    help='Use multi-processing distributed training to launch '
+                         'N processes per node, which has N GPUs. This is the '
+                         'fastest way to use PyTorch for either single node or '
+                         'multi node data parallel training')

    return parser

@@ -118,7 +125,9 @@ def get_args_parser(add_help=True):
 def main(args):
    if args.output_dir:
        utils.mkdir(args.output_dir)
-
+    if args.dist_url == "env://" and args.world_size == -1:
+        args.world_size = int(os.environ["WORLD_SIZE"])
+    #ngpus_per_node = torch.cuda.device_count()
    utils.init_distributed_mode(args)
    print(args)

@@ -133,8 +142,19 @@ def main(args):

    print("Creating data loaders")
    if args.distributed:
+        #train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
+        #test_sampler = torch.utils.data.distributed.DistributedSampler(dataset_test)
+        #if args.dist_url == "env://" and args.rank == -1:
+        #    args.rank = int(os.environ["RANK"])
+        #if args.multiprocessing_distributed:
+            # For multiprocessing distributed training, rank needs to be the
+            # global rank among all the processes
+        #args.rank = args.rank * ngpus_per_node + gpu
+        #dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
+        #                        world_size=args.world_size, rank=args.rank)
        train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
        test_sampler = torch.utils.data.distributed.DistributedSampler(dataset_test)
+
    else:
        train_sampler = torch.utils.data.RandomSampler(dataset)
        test_sampler = torch.utils.data.SequentialSampler(dataset_test)
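
The world-size fallback added to `main` and the `tcp://${1}:34567` URL built in single_process.sh can be sketched together as below. The helper names `resolve_world_size` and `tcp_init_url` are hypothetical. The `-1` sentinel only takes effect if `--world-size` defaults to `-1`; that parser default lies outside this hunk, so it is an assumption here.

```python
import os


def resolve_world_size(dist_url, world_size, environ=None):
    """Read WORLD_SIZE from the environment when dist_url is "env://"
    and world_size still holds the -1 sentinel; otherwise keep it."""
    environ = os.environ if environ is None else environ
    if dist_url == "env://" and world_size == -1:
        return int(environ["WORLD_SIZE"])
    return world_size


def tcp_init_url(master_ip, port=34567):
    """Build the init_method URL the MPI launcher passes via --dist-url."""
    return f"tcp://{master_ip}:{port}"


if __name__ == "__main__":
    print(resolve_world_size("env://", -1, {"WORLD_SIZE": "8"}))
    print(tcp_init_url("localhost"))
```

With the MPI launcher, `--dist-url` is a `tcp://` URL and `--world-size` is passed explicitly, so the environment lookup is skipped.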

--- a/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/utils.py
+++ b/PyTorch/Compute-Vision/Objection/MaskRCNN/vision/references/detection/utils.py
@@ -292,6 +292,11 @@ def init_distributed_mode(args):
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
+    print('**********************')
+    print('backend:',args.dist_backend)
+    print('init_method:',args.dist_url)
+    print('world_size:',args.world_size)
+    print('rank:',args.rank)
    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                         world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()

--- a/PyTorch/Compute-Vision/Objection/YOLOv3/README.md
+++ b/PyTorch/Compute-Vision/Objection/YOLOv3/README.md
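 # Introduction  
+[TOC]
+
 This test case measures the training performance, inference performance, and detection accuracy of the YOLOv3 object detection model with the PyTorch framework on the ROCm platform, using the COCO2017 dataset. The test procedure is as follows:
+
 # Test procedure    
 ## Preparing test data  
 ### Data preprocessing  
@@ -22,24 +25,25 @@ git clone https://github.com/RTalha/COCOTOYOLO-Annotations
 Run this conversion step separately for the training dataset and the val dataset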

 ```
-java -jar cocotoyolo.jar "coco/annotations/instances_train2017.json" "/usr/home/madmax/coco/images/train2017/" "person" "coco/yolo"
+java -jar cocotoyolo.jar "coco/annotations/instances_train2017.json" "/usr/home/madmax/coco/images/train2017/" "person,bicycle" "coco/yolo"

-java -jar cocotoyolo.jar "coco/annotations/instances_val2017.json" "/usr/home/madmax/coco/images/val2017/" "person" "coco/yolo"
+java -jar cocotoyolo.jar "coco/annotations/instances_val2017.json" "/usr/home/madmax/coco/images/val2017/" "person,bicycle" "coco/yolo"
 ```

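 Then run the final conversion file to merge all the txt files into a single annotation .txt file

- 1. Update the final conversion file by adding the output path of the Java step above
- 1. Then set the image path in the final_conversion file
- 1. Finally, update the person annotation .txt as needed.
+- 1. Update the final conversion file by adding the output path of the Java step above
+- 2. Then set the image path in the final_conversion file
+- 3. Finally, update the annotation .txt as needed.

-To copy custom images out of all the COCO images
+To copy custom images out of all the COCO images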

 ```
 ls person-dataset/ |sed 's/.txt/.jpg/' | xargs -i bash -c 'cp train2017/{} person-dataset-images/ '
 ```

-### Download the pretrained model  
+### Download the pretrained model
+
 Download link 
 [https://drive.google.com/drive/folders/1LezFG5g3BCW6iYaV89B2i64cqEUZD7e0](https://drive.google.com/drive/folders/1LezFG5g3BCW6iYaV89B2i64cqEUZD7e0) 
 After downloading, place it in the weights directory
@@ -99,19 +103,19 @@ export MIOPEN_FIND_MODE=3
 #### Single GPU  

 ```
-python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.pt --data data/coco2017.data --batch 32 --accum 2 --device 0 --epochs 300
+python3 train.py --epochs 500 --batch-size 32 --cfg cfg/yolov3.cfg --weights weights/yolov3.pt --data data/coco.data --device 0
 ```

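-Before running, check the data paths in train2017.txt and val2017.txt referenced in coco2017.data  
+Before running, check the data paths in train2017.txt and val2017.txt referenced in coco2017.data. If a "No such file or directory" error occurs, fix the data paths in those files, or place the COCO2017 dataset directly in the program's root directory.
 #### Multi-GPU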
-	python3 train.py --cfg cfg/yolov3.cfg --weights weights/yolov3.weights --data data/coco2017.data --batch 64 --accum 1 --device 0,1
-### Inference  
-	python3 test.py --cfg cfg/yolov3.cfg --weights weights/yolov3.pt --task benchmark --augment --device 1  
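-When the run completes, it generates benchmark.txt and benchmark_yolov3.log. benchmark.txt records the mAP@0.5...0.9 and mAP@0.5 values for 5 input image sizes and 2 IoU thresholds; benchmark_yolov3.log records the inference/NMS/total time for each image.  
-### Detection  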
+
+	python -m torch.distributed.run --nproc_per_node 2 train.py --epochs 500 --batch-size 64 --cfg cfg/yolov3.cfg --weights weights/yolov3.pt --data data/coco.data --device 0,1  
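+### Detection
+
 Use detect.py to test a practical application of the YOLOv3 model: specify an image, detect the objects in it, and inspect the accuracy. Run the following command: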
 	python3 detect.py --cfg cfg/yolov3.cfg --weights weights/yolov3.pt  
 When the run completes, an image with detection boxes is generated.  
+
 # References
 [https://github.com/ultralytics/yolov3](https://github.com/ultralytics/yolov3)
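
The `ls | sed | xargs cp` one-liner in the YOLOv3 README copies, for every label file, the image with the same base name. A minimal Python sketch of the same step follows; `copy_matching_images` is a hypothetical helper, and the directory names in the usage lines are the ones from the README command. Unlike the shell version, it reports images that are missing instead of letting `cp` fail.

```python
import shutil
from pathlib import Path


def copy_matching_images(label_dir, image_dir, out_dir):
    """Copy image_dir/<stem>.jpg for every <stem>.txt in label_dir.

    Returns the names of images that were expected but not found.
    """
    label_dir, image_dir, out_dir = Path(label_dir), Path(image_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for label in sorted(label_dir.glob("*.txt")):
        image = image_dir / (label.stem + ".jpg")
        if image.is_file():
            shutil.copy2(image, out_dir / image.name)
        else:
            missing.append(image.name)
    return missing


if __name__ == "__main__":
    # Directory names taken from the README command.
    if Path("person-dataset").is_dir():
        not_found = copy_matching_images("person-dataset", "train2017", "person-dataset-images")
        print(f"{len(not_found)} images not found")
```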


--- a/PyTorch/Compute-Vision/Objection/YOLOv3/train.py
+++ b/PyTorch/Compute-Vision/Objection/YOLOv3/train.py
@@ -407,6 +407,7 @@ if __name__ == '__main__':
    parser.add_argument('--device', default='', help='device id (i.e. 0 or 0,1 or cpu)')
    parser.add_argument('--adam', action='store_true', help='use adam optimizer')
    parser.add_argument('--single-cls', action='store_true', help='train as single-class dataset')
+    parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter, do not modify')
    opt = parser.parse_args()
    opt.weights = last if opt.resume else opt.weights
    print(opt)
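
The new `--local_rank` argument is filled in only when the launcher passes it on the command line. As far as I understand, `torch.distributed.launch` (without `--use_env`) does so, whereas `torch.distributed.run` exports a `LOCAL_RANK` environment variable instead. A minimal sketch that accepts either source is shown below; `parse_local_rank` is a hypothetical helper and not part of train.py.

```python
import argparse
import os


def parse_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank, else from LOCAL_RANK, else -1."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='DDP parameter, do not modify')
    # parse_known_args ignores the other training flags in this sketch
    opt, _unknown = parser.parse_known_args(argv)
    if opt.local_rank == -1:
        opt.local_rank = int(environ.get("LOCAL_RANK", -1))
    return opt.local_rank


if __name__ == "__main__":
    print(parse_local_rank())
```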

--- a/PyTorch/Compute-Vision/Objection/ssd/README.md
+++ b/PyTorch/Compute-Vision/Objection/ssd/README.md
@@ -2,6 +2,10 @@

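 This script is a functional test case based on the object detection model SSD_ResNet34. Following the MLPerf project, the model is considered converged once mAP reaches 0.23, and the job then ends successfully.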

+[TOC]
+
+
+
 # 2. Running

 ## Installing dependencies

--- a/PyTorch/Compute-Vision/Objection/yolov5/README.md
+++ b/PyTorch/Compute-Vision/Objection/yolov5/README.md
@@ -2,18 +2,20 @@

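 # YOLOV5 compute performance test

-## Preparation before testing
+[TOC]

-### Dataset
+## 1. Preparation before testing
+
+### 1.1 Dataset

 Use the COCO2017 dataset

-### Environment setup
+### 1.2 Environment setup

 Create a Python 3.7 environment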

 ```
-conda create -n yolov5 python='3.7'
+conda create -n yolov5 python=3.7

 conda activate yolov5
 ```
@@ -39,7 +41,7 @@ pip3 install torchvision-0.10.0a0_dtk22.04_300a8a4-cp37-cp37m-linux_x86_64.whl -
 pip3 install pycocotools -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
 ```

-## Training
+## 2. Single-GPU training

 ```
 export HSA_FORCE_FINE_GRAIN_PCIE=1
@@ -48,9 +50,51 @@ export MIOPEN_FIND_MODE=3
 python3 train.py --data data/coco.yaml --cfg models/yolov5x.yaml --weights weights/yolov5x.pt --device 0 --batch-size 32 --epochs 10
 ```

-## Accuracy test
+## 3. Multi-GPU training
+
+### 3.1 Single node, multiple GPUs
+
+```
+python3 -m torch.distributed.run --nproc_per_node 4 train.py --batch 256 --data coco.yaml --cfg 'yolov5s.yaml' --weights 'yolov5s.pt' --project 'run_origin_yolov5s/train' --hyp 'data/hyps/hyp.scratch-low.yaml' --device 0,1,2,3 --epochs 1000
+```
+
+Here, --nproc_per_node is the number of GPUs and --batch is the global batch size.
+
+### 3.2 Multiple nodes, multiple GPUs
+
+```
+python3 -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr "a03r4n01" --master_port 34567 train.py --batch 256 --data coco.yaml --weight 'yolov5s.pt' --project 'multi/train' --hyp 'data/hyps/hyp.scratch-low.yaml' --cfg 'yolov5s.yaml' --epochs 1000  2>&1 | tee  multi.log
+
+python3 -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr "a03r4n01" --master_port 34567 train.py --batch 256 --data coco.yaml --weight 'yolov5s.pt' --project 'multi/train' --hyp 'data/hyps/hyp.scratch-low.yaml' --cfg 'yolov5s.yaml' --epochs 1000  2>&1 | tee  multi.log
+```
+
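+Note that --master_addr is your master node, i.e. the node where the log is written. The master node must be the same in both commands, --node_rank must differ between them, and --nnodes is the number of nodes used.
+
+**Tip: when choosing hyperparameters, use hyp.scratch-low for small models such as yolov5s and hyp.scratch-high for large models such as yolov5m. The difference is that low converges faster, whereas high converges more slowly but is less likely to get stuck in a local optimum.**
+
+## 4. Inference test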

 ```
 python3 val.py --data data/coco-v5.yaml --weights runs/train/exp12/weights/best.pt --device 0
 ```

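+## 5. Plotting loss and accuracy curves
+
+To obtain loss and mAP curves like those above after training for some time, we provide view_code.py. Write the path you passed to --project during training into it, then run python3 view_code.py to generate the curve images under that path.
+
+## 6. Known issues and solutions
+
+### 6.1 pycocotools reports abnormally low results
+
+After training or inference finishes, pycocotools sometimes reports incorrect, very low values, as shown below
+
+![incorrect pycocotools results](pycoco错误结果.png)
+
+This is caused by a Python version that is too old. Besides upgrading Python, you can also fix it by modifying the code: in val.py, comment out the code in the red box at the location shown in the figure to get correct results.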
+
+![pycocotools](pycocotools.png)
+
+
+
+
+
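
The YOLOv5 README states that `--batch` is the global batch size and that the two multi-node commands differ only in `--node_rank`. The sketch below works through both points, assuming the global batch is split evenly across all processes; `per_gpu_batch` and `launch_commands` are hypothetical helpers, and the flag values in the usage lines are the ones from the README.

```python
def per_gpu_batch(global_batch, nproc_per_node, nnodes=1):
    """Split a global batch evenly over nproc_per_node * nnodes processes."""
    world_size = nproc_per_node * nnodes
    if global_batch % world_size != 0:
        raise ValueError("global batch must be a multiple of the process count")
    return global_batch // world_size


def launch_commands(master_addr, nnodes, nproc_per_node, train_args, master_port=34567):
    """Build one torch.distributed.launch command line per node."""
    return [
        "python3 -m torch.distributed.launch "
        f"--nproc_per_node {nproc_per_node} --nnodes {nnodes} --node_rank {rank} "
        f'--master_addr "{master_addr}" --master_port {master_port} '
        f"train.py {train_args}"
        for rank in range(nnodes)
    ]


if __name__ == "__main__":
    print(per_gpu_batch(256, 4))      # single node, 4 GPUs
    print(per_gpu_batch(256, 4, 2))   # 2 nodes, 4 GPUs each
    for cmd in launch_commands("a03r4n01", 2, 4, "--batch 256 --data coco.yaml"):
        print(cmd)
```

Under this assumption, keeping `--batch 256` while going from one node to two halves the per-GPU batch from 64 to 32.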
--- a/PyTorch/Compute-Vision/Objection/yolov5/view_code.py
+++ b/PyTorch/Compute-Vision/Objection/yolov5/view_code.py
+from utils.plots import plot_results
+
+plot_results(file='/work/home/sugon_ldc/project/yolov5_test/yolov5-6.0/9.30_v6.0_direct_high/train/exp/results.csv',dir='')
+#plot_results(file='/work/home/sugon_ldc/project/yolov5-6.0/run_direct_bs256/train/exp/results.csv',dir='')
--- a/TensorFlow2x/ComputeVision/Detection/SSD/README.md
+++ b/TensorFlow2x/ComputeVision/Detection/SSD/README.md
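@@ -18,7 +18,7 @@ The VOC dataset can be downloaded from the address below; it already includes the training set, test set,

 ### Python dependencies

-Use Conda to set up the TF2.7 environment, which includes TensorFlow 2.7 installed from a whl file and Python 3.6
+Use Conda to set up the TF2.7 environment, which includes TensorFlow 2.7 installed from a whl file and Python 3.7

 Create the TF2 dependency file requirements.txt

@@ -28,122 +28,6 @@ The VOC dataset can be downloaded from the address below; it already includes the training set, test set,
 pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/
 ```

-requirements.txt includes: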
-
-```
-# Python dependencies required for development
-
-astor 
-
-astunparse 
-
-cached-property 
-
-cachetools 
-
-certifi 
-
-charset-normalizer 
-
-dataclasses 
-
-devscripts
-
-distro 
-
-flatbuffers 
-
-gast
-
-google-auth 
-
-google-auth-oauthlib
-
-google-pasta 
-
-grpcio 
-
-h5py 
-
-idna 
-
-importlib-metadata 
-
-keras 
-
-Keras-Applications 
-
-Keras-Preprocessing
-
-libclang 
-
-Markdown 
-
-mock 
-
-numpy
-
-oauthlib 
-
-opt-einsum
-
-packaging
-
-portpicker 
-
-protobuf 
-
-pyasn1
-
-pyasn1-modules 
-
-pyparsing 
-
-requests 
-
-requests-oauthlib 
-
-rsa 
-
-scikit-build 
-
-scipy 
-
-setuptools
-
-six
-
-tensorboard 
-
-tensorboard-data-server 
-
-tensorboard-plugin-wit 
-
-tensorflow-estimator 
-
-tensorflow-io-gcs-filesystem 
-
-tensorflow-model-optimization 
-
-termcolor 
-
-tf-models-official 
-
-typing-extensions
-
-urllib3 
-
-Werkzeug
-
-wheel 
-
-wrapt 
-
-zipp 
-
-horovod
-```
-
 ### Environment variable settings

 ```
@@ -166,13 +50,7 @@ export HSA_FORCE_FINE_GRAIN_PCIE=1  
 export MIOPEN_FIND_MODE=3  
 export HIP_VISIBLE_DEVICES=0  
 cd /public/home/libodi/work1/ssd-tf2-master  
-python3 train32.py --dtype=fp32
-```
-
-Run the following command to start training
-
-```
-python3 train32.py --dtype=fp32
+python3 train.py --dtype=fp32
 ```

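 Use ssd-tf2-master/voc_annotation.py to generate the training and validation sets automatically: 5717 training images and 5823 validation images. **Specifically, set annotation_mode=2 in voc_annotation.py, then run voc_annotation.py to generate 2007_train.txt and 2007_val.txt in the root directory.**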

--- a/TensorFlow2x/ComputeVision/Detection/SSD/utils/dataloader.py
+++ b/TensorFlow2x/ComputeVision/Detection/SSD/utils/dataloader.py
@@ -46,7 +46,7 @@ class SSDDatasets(keras.utils.Sequence):

            image_data.append(image)               
            box_data.append(box)
-            print(preprocess_input(np.array(image_data)), np.array(box_data))
+            #print(preprocess_input(np.array(image_data)), np.array(box_data))
            break

        return preprocess_input(np.array(image_data)), np.array(box_data)