# Benchmarks

Here we benchmark the training and testing speed of models in MMDetection3D, together with other open-source 3D detection codebases.

## Settings

* Hardware: 8 NVIDIA Tesla V100 (32G) GPUs, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
* Software: Python 3.7, CUDA 10.1, cuDNN 7.6.5, PyTorch 1.3, numba 0.48.0
* Model: Since the other codebases implement different models, we compare the corresponding models, including SECOND, PointPillars, Part-A2, and VoteNet, with them separately.
* Metrics: We use the average throughput in iterations over the entire training run, skipping the first 50 iterations of each epoch to exclude GPU warmup time. Note that the throughput of a detector typically changes during training, because it depends on the predictions of the model.

## Main Results

### VoteNet

We compare our implementation of VoteNet with [votenet](https://github.com/facebookresearch/votenet/) and report the performance on the SUNRGB-D v2 dataset under the AP@0.5 metric. Since our implementation achieves higher accuracy, we also report the AP here.

```eval_rst
+----------------+---------------------+--------------------+--------+
| Implementation | Training (sample/s) | Testing (sample/s) | AP@0.5 |
+================+=====================+====================+========+
| MMDetection3D  | 358                 | 17                 | 35.8   |
+----------------+---------------------+--------------------+--------+
| votenet        | 77                  | 3                  | 31.5   |
+----------------+---------------------+--------------------+--------+
```

### Single-Class PointPillars

Since [Det3D](https://github.com/poodarchu/Det3D/) only provides PointPillars on the car class, we compare the training speed of single-class PointPillars here.
```eval_rst
+----------------+---------------------+--------------------+
| Implementation | Training (sample/s) | Testing (sample/s) |
+================+=====================+====================+
| MMDetection3D  | 141                 | 44                 |
+----------------+---------------------+--------------------+
| Det3D          | 140                 | 24                 |
+----------------+---------------------+--------------------+
```

### Multi-Class PointPillars

Since [OpenPCDet](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2) only provides PointPillars on 3 classes, we compare the training speed of multi-class PointPillars here.

```eval_rst
+----------------+---------------------+--------------------+
| Implementation | Training (sample/s) | Testing (sample/s) |
+================+=====================+====================+
| MMDetection3D  | 107                 |                    |
+----------------+---------------------+--------------------+
| OpenPCDet      | 44                  | 67                 |
+----------------+---------------------+--------------------+
```

### SECOND

[Det3D](https://github.com/poodarchu/Det3D/) provides a different version of SECOND on the car class, and we cannot train the original SECOND by simply modifying the config. So we only compare SECOND with [OpenPCDet](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2), which provides a SECOND model on 3 classes. We report the AP under the moderate condition following the KITTI benchmark, and compare the average AP over all classes under the moderate condition for performance on 3 classes.

```eval_rst
+----------------+---------------------+--------------------+
| Implementation | Training (sample/s) | Testing (sample/s) |
+================+=====================+====================+
| MMDetection3D  | 40                  |                    |
+----------------+---------------------+--------------------+
| OpenPCDet      | 30                  | 32                 |
+----------------+---------------------+--------------------+
```

### Part-A2

We benchmark Part-A2 against the implementation in [OpenPCDet](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2).
We report the AP under the moderate condition following the KITTI benchmark, and compare the average AP over all classes under the moderate condition for performance on 3 classes.

```eval_rst
+----------------+---------------------+--------------------+
| Implementation | Training (sample/s) | Testing (sample/s) |
+================+=====================+====================+
| MMDetection3D  | 17                  |                    |
+----------------+---------------------+--------------------+
| OpenPCDet      | 14                  | 13                 |
+----------------+---------------------+--------------------+
```

## Details of Comparison

### Modification for Calculating Speed

* __MMDetection3D__: We use settings as similar as possible to those of the other codebases, via the [benchmark configs](https://github.com/open-mmlab/MMDetection3D/blob/master/configs/benchmark).
* __Det3D__: For comparison with Det3D, we use the commit [519251e](https://github.com/poodarchu/Det3D/tree/519251e72a5c1fdd58972eabeac67808676b9bb7).
* __OpenPCDet__: For comparison with OpenPCDet, we use the commit [b32fbddb](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2).

  For training speed, we add code to record the running time in the file `./tools/train_utils/train_utils.py`. We calculate the speed of each epoch and report the average speed over all epochs.
(diff to make it use the same method for benchmarking speed - click to expand)

```diff
diff --git a/tools/train_utils/train_utils.py b/tools/train_utils/train_utils.py
index 91f21dd..021359d 100644
--- a/tools/train_utils/train_utils.py
+++ b/tools/train_utils/train_utils.py
@@ -2,6 +2,7 @@ import torch
 import os
 import glob
 import tqdm
+import datetime
 from torch.nn.utils import clip_grad_norm_


@@ -13,7 +14,10 @@ def train_one_epoch(model, optimizer, train_loader, model_func, lr_scheduler, ac
     if rank == 0:
         pbar = tqdm.tqdm(total=total_it_each_epoch, leave=leave_pbar, desc='train', dynamic_ncols=True)

+    start_time = None
     for cur_it in range(total_it_each_epoch):
+        if cur_it > 49 and start_time is None:
+            start_time = datetime.datetime.now()
         try:
             batch = next(dataloader_iter)
         except StopIteration:
@@ -55,9 +59,11 @@ def train_one_epoch(model, optimizer, train_loader, model_func, lr_scheduler, ac
                 tb_log.add_scalar('learning_rate', cur_lr, accumulated_iter)
                 for key, val in tb_dict.items():
                     tb_log.add_scalar('train_' + key, val, accumulated_iter)
+    endtime = datetime.datetime.now()
+    speed = (endtime - start_time).seconds / (total_it_each_epoch - 50)
     if rank == 0:
         pbar.close()
-    return accumulated_iter
+    return accumulated_iter, speed


 def train_model(model, optimizer, train_loader, model_func, lr_scheduler, optim_cfg,
@@ -65,6 +71,7 @@ def train_model(model, optimizer, train_loader, model_func, lr_scheduler, optim_
                 lr_warmup_scheduler=None, ckpt_save_interval=1, max_ckpt_save_num=50,
                 merge_all_iters_to_one_epoch=False):
     accumulated_iter = start_iter
+    speeds = []
     with tqdm.trange(start_epoch, total_epochs, desc='epochs', dynamic_ncols=True, leave=(rank == 0)) as tbar:
         total_it_each_epoch = len(train_loader)
         if merge_all_iters_to_one_epoch:
@@ -82,7 +89,7 @@ def train_model(model, optimizer, train_loader, model_func, lr_scheduler, optim_
                 cur_scheduler = lr_warmup_scheduler
             else:
                 cur_scheduler = lr_scheduler
-            accumulated_iter = train_one_epoch(
+            accumulated_iter, speed = train_one_epoch(
                 model, optimizer, train_loader, model_func,
                 lr_scheduler=cur_scheduler,
                 accumulated_iter=accumulated_iter, optim_cfg=optim_cfg,
@@ -91,7 +98,7 @@ def train_model(model, optimizer, train_loader, model_func, lr_scheduler, optim_
                 total_it_each_epoch=total_it_each_epoch,
                 dataloader_iter=dataloader_iter
             )
-
+            speeds.append(speed)
             # save trained model
             trained_epoch = cur_epoch + 1
             if trained_epoch % ckpt_save_interval == 0 and rank == 0:
@@ -107,6 +114,8 @@ def train_model(model, optimizer, train_loader, model_func, lr_scheduler, optim_
             save_checkpoint(
                 checkpoint_state(model, optimizer, trained_epoch, accumulated_iter), filename=ckpt_name,
             )
+    print(speed)
+    print(f'*******{sum(speeds) / len(speeds)}******')


 def model_state_to_cpu(model_state):
```
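Stripped of the training-loop details, the timing logic in the diff above amounts to: skip the first 50 iterations of each epoch as warmup, time the rest, and average the resulting per-epoch speeds. The sketch below is an illustration of that pattern only, not code from any of the benchmarked repositories; `run_iteration` is a hypothetical stand-in for a single training step.

```python
import datetime


def benchmark_epoch(run_iteration, total_it_each_epoch, num_warmup=50):
    """Time one epoch, skipping the first ``num_warmup`` iterations.

    ``run_iteration`` is a hypothetical stand-in for one training step.
    Returns the average seconds per iteration over the timed portion.
    """
    start_time = None
    for cur_it in range(total_it_each_epoch):
        # Start the clock only once the warmup iterations are done.
        if cur_it >= num_warmup and start_time is None:
            start_time = datetime.datetime.now()
        run_iteration(cur_it)
    elapsed = (datetime.datetime.now() - start_time).total_seconds()
    return elapsed / (total_it_each_epoch - num_warmup)


def average_speed(per_epoch_speeds):
    """Average the per-epoch speeds, as the modified train loop reports."""
    return sum(per_epoch_speeds) / len(per_epoch_speeds)
```

Note that the diff uses `timedelta.seconds`, which truncates to whole seconds; `total_seconds()` (used here) keeps the fractional part.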
For testing speed, we add code to record the running time in the file `./tools/eval_utils/eval_utils.py`.
(diff to make it use the same method for benchmarking speed - click to expand)

```diff
diff --git a/tools/eval_utils/eval_utils.py b/tools/eval_utils/eval_utils.py
index 0cbf17b..f51e687 100644
--- a/tools/eval_utils/eval_utils.py
+++ b/tools/eval_utils/eval_utils.py
@@ -49,8 +49,11 @@ def eval_one_epoch(cfg, model, dataloader, epoch_id, logger, dist_test=False, sa
     if cfg.LOCAL_RANK == 0:
         progress_bar = tqdm.tqdm(total=len(dataloader), leave=True, desc='eval', dynamic_ncols=True)
-    start_time = time.time()
+    num_warmup = 5
+    pure_inf_time = 0
     for i, batch_dict in enumerate(dataloader):
+        torch.cuda.synchronize()
+        start_time = time.perf_counter()
         for key, val in batch_dict.items():
             if not isinstance(val, np.ndarray):
                 continue
@@ -61,7 +64,14 @@ def eval_one_epoch(cfg, model, dataloader, epoch_id, logger, dist_test=False, sa
         with torch.no_grad():
             pred_dicts, ret_dict = model(batch_dict)
         disp_dict = {}
-
+        torch.cuda.synchronize()
+        elapsed = time.perf_counter() - start_time
+        if i >= num_warmup:
+            pure_inf_time += elapsed
+        if (i + 1) == 2000:
+            pure_inf_time += elapsed
+            fps = (i + 1 - num_warmup) / pure_inf_time
+            out_str = f'Overall fps: {fps:.1f} img / s'
         statistics_info(cfg, ret_dict, metric, disp_dict)
         annos = dataset.generate_prediction_dicts(
             batch_dict, pred_dicts, class_names,
@@ -71,7 +81,7 @@ def eval_one_epoch(cfg, model, dataloader, epoch_id, logger, dist_test=False, sa
         if cfg.LOCAL_RANK == 0:
             progress_bar.set_postfix(disp_dict)
             progress_bar.update()
-
+    print(out_str)
     if cfg.LOCAL_RANK == 0:
         progress_bar.close()
```
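The measurement pattern in the diff above is: exclude a few warmup batches, accumulate pure inference time around each forward pass, and report samples per second. A minimal CPU-only sketch of that pattern follows; it is an illustration, not code from any benchmarked repository, and `run_inference` is a hypothetical stand-in for one forward pass. On GPU, a `torch.cuda.synchronize()` before each clock read is required for accurate timings, because CUDA kernels execute asynchronously.

```python
import time


def measure_fps(run_inference, batches, num_warmup=5):
    """Measure pure inference throughput, ignoring the first few batches.

    ``run_inference`` is a hypothetical stand-in for one forward pass.
    Returns batches per second over the timed portion.
    """
    pure_inf_time = 0.0
    num_timed = 0
    for i, batch in enumerate(batches):
        start_time = time.perf_counter()
        run_inference(batch)
        elapsed = time.perf_counter() - start_time
        # Skip the first ``num_warmup`` batches to avoid startup overhead.
        if i >= num_warmup:
            pure_inf_time += elapsed
            num_timed += 1
    return num_timed / pure_inf_time
```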
### VoteNet

* __MMDetection3D__: With release v0.1.0, run

  ```bash
  ./tools/dist_train.sh configs/votenet/votenet_16x8_sunrgbd-3d-10class.py 8 --no-validate
  ```

  Then benchmark the test speed by running

  ```bash
  python tools/benchmark.py configs/votenet/votenet_16x8_sunrgbd-3d-10class.py ${CHECKPOINTS}
  ```

* __votenet__: At commit 2f6d6d3, run

  ```bash
  python train.py --dataset sunrgbd --batch_size 16
  ```

  Then benchmark the test speed by running

  ```bash
  python eval.py --dataset sunrgbd --checkpoint_path log_sunrgbd/checkpoint.tar --batch_size 1 --dump_dir eval_sunrgbd --cluster_sampling seed_fps --use_3d_nms --use_cls_nms --per_class_proposal
  ```

### Single-Class PointPillars

* __MMDetection3D__: With release v0.1.0, run

  ```bash
  ./tools/dist_train.sh configs/benchmark/hv_pointpillars_secfpn_3x8_100e_det3d_kitti-3d-car.py 8 --no-validate
  ```

  Then benchmark the test speed by running

  ```bash
  python tools/benchmark.py configs/benchmark/hv_pointpillars_secfpn_3x8 ${CHECKPOINT}
  ```

* __Det3D__: At commit 519251e, use `kitti_point_pillars_mghead_syncbn.py` and run

  ```bash
  ./tools/scripts/train.sh --launcher=slurm --gpus=8
  ```

  Note that the config in `train.sh` is modified to train PointPillars.
(diff to benchmark the similar models - click to expand)

```diff
diff --git a/tools/scripts/train.sh b/tools/scripts/train.sh
index 3a93f95..461e0ea 100755
--- a/tools/scripts/train.sh
+++ b/tools/scripts/train.sh
@@ -16,9 +16,9 @@ then
 fi

 # Voxelnet
-python -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py examples/second/configs/kitti_car_vfev3_spmiddlefhd_rpn1_mghead_syncbn.py --work_dir=$SECOND_WORK_DIR
+# python -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py examples/second/configs/kitti_car_vfev3_spmiddlefhd_rpn1_mghead_syncbn.py --work_dir=$SECOND_WORK_DIR
 # python -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py examples/cbgs/configs/nusc_all_vfev3_spmiddleresnetfhd_rpn2_mghead_syncbn.py --work_dir=$NUSC_CBGS_WORK_DIR
 # python -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py examples/second/configs/lyft_all_vfev3_spmiddleresnetfhd_rpn2_mghead_syncbn.py --work_dir=$LYFT_CBGS_WORK_DIR

 # PointPillars
-# python -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py ./examples/point_pillars/configs/original_pp_mghead_syncbn_kitti.py --work_dir=$PP_WORK_DIR
+python -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py ./examples/point_pillars/configs/kitti_point_pillars_mghead_syncbn.py
```
Then benchmark the test speed by running

```bash
./tools/scripts/test.sh examples/point_pillars/configs/kitti_point_pillars_mghead_syncbn.py ./work_dir/Point_Pillars/latest.pth
```

Note that `tools/dist_test.py` is modified to benchmark PointPillars.
(diff to benchmark the similar models - click to expand)

```diff
diff --git a/tools/dist_test.py b/tools/dist_test.py
index 3e37f8a..0908fee 100644
--- a/tools/dist_test.py
+++ b/tools/dist_test.py
@@ -3,6 +3,7 @@ import json
 import os
 import sys
+import time
 import apex
 import numpy as np
 import torch
@@ -128,12 +129,26 @@ def main():
     detections = {}
     cpu_device = torch.device("cpu")
+    sample_time = list()

     for i, data_batch in enumerate(data_loader):
         with torch.no_grad():
+
+            torch.cuda.synchronize()
+            start_time = time.perf_counter()
+
             outputs = batch_processor(
                 model, data_batch, train_mode=False, local_rank=args.local_rank,
             )
+
+            torch.cuda.synchronize()
+            elapsed = time.perf_counter() - start_time
+            sample_time.append(elapsed)
+            if i == 2006:
+                st_arr = np.array(sample_time)[5:]
+
+                print('avg time elapsed: %f s'%st_arr.mean())
+
             for output in outputs:
                 token = output["metadata"]["token"]
                 for k, v in output.items():
@@ -185,3 +200,4 @@ def main():

 if __name__ == "__main__":
     main()
```

### Multi-Class PointPillars

* __MMDetection3D__: With release v0.1.0, run

  ```bash
  ./tools/dist_train.sh configs/benchmark/hv_pointpillars_secfpn_4x8_80e_pcdet_kitti-3d-3class.py 8 --no-validate
  ```

  Then benchmark the test speed by running

  ```bash
  python tools/benchmark.py configs/benchmark/hv_pointpillars_secfpn_4x8_80e_pcdet_kitti-3d-3class.py ${CKPT}
  ```

* __OpenPCDet__: At commit [b32fbddb](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2), run

  ```bash
  cd tools
  sh scripts/slurm_train.sh ${PARTITION} ${JOB_NAME} 8 --cfg_file ./cfgs/pointpillar.yaml --batch_size 32 --workers 32
  ```

  Then benchmark the test speed by running

  ```bash
  cd tools
  python test.py --cfg_file cfgs/kitti_models/pointpillar.yaml --ckpt ${CKPT} --batch_size 1 --workers 4
  ```

### SECOND

* __MMDetection3D__: With release v0.1.0, run

  ```bash
  ./tools/dist_train.sh configs/benchmark/hv_second_secfpn_4x8_80e_pcdet_kitti-3d-3class.py 8 --no-validate
  ```

  Then benchmark the test speed by running

  ```bash
  python tools/benchmark.py configs/benchmark/hv_second_secfpn_4x8_80e_pcdet_kitti-3d-3class.py ${CKPT}
  ```

* __OpenPCDet__: At commit [b32fbddb](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2), run

  ```bash
  cd tools
  ./scripts/slurm_train.sh ${PARTITION} ${JOB_NAME} 8 --cfg_file ./cfgs/second.yaml --batch_size 32 --workers 32
  ```

  Then benchmark the test speed by running

  ```bash
  cd tools
  python test.py --cfg_file cfgs/kitti_models/second.yaml --ckpt ${CKPT} --batch_size 1 --workers 4
  ```

### Part-A2

* __MMDetection3D__: With release v0.1.0, run

  ```bash
  ./tools/dist_train.sh configs/benchmark/hv_PartA2_secfpn_4x8_cyclic_80e_pcdet_kitti-3d-3class.py 8 --no-validate
  ```

  Then benchmark the test speed by running

  ```bash
  python tools/benchmark.py configs/benchmark/hv_PartA2_secfpn_4x8_cyclic_80e_pcdet_kitti-3d-3class.py ${CKPT}
  ```

* __OpenPCDet__: At commit [b32fbddb](https://github.com/open-mmlab/OpenPCDet/tree/b32fbddbe06183507bad433ed99b407cbc2175c2), train the model by running

  ```bash
  cd tools
  ./scripts/slurm_train.sh ${PARTITION} ${JOB_NAME} 8 --cfg_file ./cfgs/PartA2.yaml --batch_size 32 --workers 32
  ```

  Then benchmark the test speed by running

  ```bash
  cd tools
  python test.py --cfg_file cfgs/kitti_models/PartA2.yaml --ckpt ${CKPT} --batch_size 1 --workers 4
  ```