Unverified commit 3fa8b512, authored by Shaoshuai Shi and committed by GitHub

Merge pull request #683 from sshaoshuai/master

Release OpenPCDet v0.5.0
parents 70857b83 d1368b01
......@@ -4,8 +4,9 @@
`OpenPCDet` is a clear, simple, self-contained open source project for LiDAR-based 3D object detection.
It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/1812.04244), [`[Part-A^2 net]`](https://arxiv.org/abs/1907.03670), [`[PV-RCNN]`](https://arxiv.org/abs/1912.13192) and [`[Voxel R-CNN]`](https://arxiv.org/abs/2012.15712).
It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/1812.04244), [`[Part-A2-Net]`](https://arxiv.org/abs/1907.03670), [`[PV-RCNN]`](https://arxiv.org/abs/1912.13192) and [`[Voxel R-CNN]`](https://arxiv.org/abs/2012.15712).
**NEW**: `OpenPCDet` has been updated to `v0.5.0` (Dec. 2021).
## Overview
- [Changelog](#changelog)
......@@ -18,15 +19,22 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
## Changelog
[2021-12-01] **NEW:** `OpenPCDet` v0.5.0 is released with the following features:
* Improve the performance of all models on the [Waymo Open Dataset](#waymo-open-dataset-baselines). Note that you need to re-prepare the training/validation data and ground-truth database of the Waymo Open Dataset (see [GETTING_STARTED.md](docs/GETTING_STARTED.md)).
* Support anchor-free [CenterHead](pcdet/models/dense_heads/center_head.py), and add configs of `CenterPoint` and `PV-RCNN with CenterHead`.
* Support the latest **PyTorch 1.1~1.10** and **spconv 1.0~2.x**, where **spconv 2.x** should be easy to install with pip and faster than previous versions (see the official update of spconv [here](https://github.com/traveller59/spconv)).
* Support the config [`USE_SHARED_MEMORY`](tools/cfgs/dataset_configs/waymo_dataset.yaml) to use shared memory to potentially speed up the training process in case data loading IO is a bottleneck.
* Support a better and faster [visualization script](tools/visual_utils/open3d_vis_utils.py); note that you need to install [Open3D](https://github.com/isl-org/Open3D) first.
[2021-06-08] Added support for the voxel-based 3D object detection model [`Voxel R-CNN`](#KITTI-3D-Object-Detection-Baselines)
[2021-05-14] Added support for the monocular 3D object detection model [`CaDDN`](#KITTI-3D-Object-Detection-Baselines)
[2020-11-27] **Bugfixed:** Please re-prepare the validation infos of Waymo dataset (version 1.2) if you would like to
[2020-11-27] Bugfixed: Please re-prepare the validation infos of Waymo dataset (version 1.2) if you would like to
use our provided Waymo evaluation tool (see [PR](https://github.com/open-mmlab/OpenPCDet/pull/383)).
Note that you do not need to re-prepare the training data and ground-truth database.
[2020-11-10] **NEW:** The [Waymo Open Dataset](#waymo-open-dataset-baselines) has been supported with state-of-the-art results. Currently we provide the
[2020-11-10] The [Waymo Open Dataset](#waymo-open-dataset-baselines) has been supported with state-of-the-art results. Currently we provide the
configs and results of `SECOND`, `PartA2` and `PV-RCNN` on the Waymo Open Dataset, and more models could be easily supported by modifying their dataset configs.
[2020-08-10] Bugfixed: The provided NuScenes models have been updated to fix the loading bugs. Please redownload them if you need to use the pretrained NuScenes models.
......@@ -102,22 +110,15 @@ Selected supported methods are shown in the below table. The results are the 3D
|---------------------------------------------|----------:|:-------:|:-------:|:-------:|:---------:|
| [PointPillar](tools/cfgs/kitti_models/pointpillar.yaml) |~1.2 hours| 77.28 | 52.29 | 62.68 | [model-18M](https://drive.google.com/file/d/1wMxWTpU1qUoY3DsCH31WJmvJxcjFXKlm/view?usp=sharing) |
| [SECOND](tools/cfgs/kitti_models/second.yaml) | ~1.7 hours | 78.62 | 52.98 | 67.15 | [model-20M](https://drive.google.com/file/d/1-01zsPOsqanZQqIIyy7FpNXStL3y4jdR/view?usp=sharing) |
| [SECOND-IoU](tools/cfgs/kitti_models/second_iou.yaml) | - | 79.09 | 55.74 | 71.31 | [model](https://drive.google.com/file/d/1AQkeNs4bxhvhDQ-5sEo_yvQUlfo73lsW/view?usp=sharing) |
| [SECOND-IoU](tools/cfgs/kitti_models/second_iou.yaml) | - | 79.09 | 55.74 | 71.31 | [model-46M](https://drive.google.com/file/d/1AQkeNs4bxhvhDQ-5sEo_yvQUlfo73lsW/view?usp=sharing) |
| [PointRCNN](tools/cfgs/kitti_models/pointrcnn.yaml) | ~3 hours | 78.70 | 54.41 | 72.11 | [model-16M](https://drive.google.com/file/d/1BCX9wMn-GYAfSOPpyxf6Iv6fc0qKLSiU/view?usp=sharing)|
| [PointRCNN-IoU](tools/cfgs/kitti_models/pointrcnn_iou.yaml) | ~3 hours | 78.75 | 58.32 | 71.34 | [model-16M](https://drive.google.com/file/d/1V0vNZ3lAHpEEt0MlT80eL2f41K2tHm_D/view?usp=sharing)|
| [Part-A^2-Free](tools/cfgs/kitti_models/PartA2_free.yaml) | ~3.8 hours| 78.72 | 65.99 | 74.29 | [model-226M](https://drive.google.com/file/d/1lcUUxF8mJgZ_e-tZhP1XNQtTBuC-R0zr/view?usp=sharing) |
| [Part-A^2-Anchor](tools/cfgs/kitti_models/PartA2.yaml) | ~4.3 hours| 79.40 | 60.05 | 69.90 | [model-244M](https://drive.google.com/file/d/10GK1aCkLqxGNeX3lVu8cLZyE0G8002hY/view?usp=sharing) |
| [Part-A2-Free](tools/cfgs/kitti_models/PartA2_free.yaml) | ~3.8 hours| 78.72 | 65.99 | 74.29 | [model-226M](https://drive.google.com/file/d/1lcUUxF8mJgZ_e-tZhP1XNQtTBuC-R0zr/view?usp=sharing) |
| [Part-A2-Anchor](tools/cfgs/kitti_models/PartA2.yaml) | ~4.3 hours| 79.40 | 60.05 | 69.90 | [model-244M](https://drive.google.com/file/d/10GK1aCkLqxGNeX3lVu8cLZyE0G8002hY/view?usp=sharing) |
| [PV-RCNN](tools/cfgs/kitti_models/pv_rcnn.yaml) | ~5 hours| 83.61 | 57.90 | 70.47 | [model-50M](https://drive.google.com/file/d/1lIOq4Hxr0W3qsX83ilQv0nk1Cls6KAr-/view?usp=sharing) |
| [Voxel R-CNN (Car)](tools/cfgs/kitti_models/voxel_rcnn_car.yaml) | ~2.2 hours| 84.54 | - | - | [model-28M](https://drive.google.com/file/d/19_jiAeGLz7V0wNjSJw4cKmMjdm5EW5By/view?usp=sharing) |
| [CaDDN](tools/cfgs/kitti_models/CaDDN.yaml) |~15 hours| 21.38 | 13.02 | 9.76 | [model-774M](https://drive.google.com/file/d/1OQTO2PtXT8GGr35W9m2GZGuqgb6fyU1V/view?usp=sharing) |
### NuScenes 3D Object Detection Baselines
All models are trained with 8 GTX 1080Ti GPUs and are available for download.
| | mATE | mASE | mAOE | mAVE | mAAE | mAP | NDS | download |
|---------------------------------------------|----------:|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|:---------:|
| [PointPillar-MultiHead](tools/cfgs/nuscenes_models/cbgs_pp_multihead.yaml) | 33.87 | 26.00 | 32.07 | 28.74 | 20.15 | 44.63 | 58.23 | [model-23M](https://drive.google.com/file/d/1p-501mTWsq0G9RzroTWSXreIMyTUUpBM/view?usp=sharing) |
| [SECOND-MultiHead (CBGS)](tools/cfgs/nuscenes_models/cbgs_second_multihead.yaml) | 31.15 | 25.51 | 26.64 | 26.26 | 20.46 | 50.59 | 62.29 | [model-35M](https://drive.google.com/file/d/1bNzcOnE3u9iooBFMk2xK7HqhdeQ_nwTq/view?usp=sharing) |
| [CaDDN (Mono)](tools/cfgs/kitti_models/CaDDN.yaml) |~15 hours| 21.38 | 13.02 | 9.76 | [model-774M](https://drive.google.com/file/d/1OQTO2PtXT8GGr35W9m2GZGuqgb6fyU1V/view?usp=sharing) |
### Waymo Open Dataset Baselines
We provide the setting of [`DATA_CONFIG.SAMPLED_INTERVAL`](tools/cfgs/dataset_configs/waymo_dataset.yaml) on the Waymo Open Dataset (WOD) to subsample the training and evaluation frames (a minimal sketch of this subsampling is shown below the table),
......@@ -125,19 +126,30 @@ so you could also play with WOD by setting a smaller `DATA_CONFIG.SAMPLED_INTERV
By default, all models are trained with **20% data (~32k frames)** of all the training samples on 8 GTX 1080Ti GPUs, and the results of each cell here are mAP/mAPH calculated by the official Waymo evaluation metrics on the **whole** validation set (version 1.2).
| | Vec_L1 | Vec_L2 | Ped_L1 | Ped_L2 | Cyc_L1 | Cyc_L2 |
| Performance@(train with 20% Data) | Vec_L1 | Vec_L2 | Ped_L1 | Ped_L2 | Cyc_L1 | Cyc_L2 |
|---------------------------------------------|----------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| [SECOND](tools/cfgs/waymo_models/second.yaml) | 68.03/67.44 | 59.57/59.04 | 61.14/50.33 | 53.00/43.56 | 54.66/53.31 | 52.67/51.37 |
| [Part-A^2-Anchor](tools/cfgs/waymo_models/PartA2.yaml) | 71.82/71.29 | 64.33/63.82 | 63.15/54.96 | 54.24/47.11 | 65.23/63.92 | 62.61/61.35 |
| [PV-RCNN](tools/cfgs/waymo_models/pv_rcnn.yaml) | 74.06/73.38 | 64.99/64.38 | 62.66/52.68 | 53.80/45.14 | 63.32/61.71 | 60.72/59.18 |
| [SECOND](tools/cfgs/waymo_models/second.yaml) | 70.96/70.34|62.58/62.02|65.23/54.24 |57.22/47.49| 57.13/55.62 | 54.97/53.53 |
| [CenterPoint](tools/cfgs/waymo_models/centerpoint_without_resnet.yaml)| 71.33/70.76|63.16/62.65| 72.09/65.49 |64.27/58.23| 68.68/67.39 |66.11/64.87|
| [CenterPoint (ResNet)](tools/cfgs/waymo_models/centerpoint.yaml)|72.76/72.23|64.91/64.42 |74.19/67.96 |66.03/60.34| 71.04/69.79 |68.49/67.28 |
| [Part-A2-Anchor](tools/cfgs/waymo_models/PartA2.yaml) | 74.66/74.12 |65.82/65.32 |71.71/62.24 |62.46/54.06 |66.53/65.18 |64.05/62.75 |
| [PV-RCNN (AnchorHead)](tools/cfgs/waymo_models/pv_rcnn.yaml) | 75.41/74.74 |67.44/66.80 |71.98/61.24 |63.70/53.95 |65.88/64.25 |63.39/61.82 |
| [PV-RCNN (CenterHead)](tools/cfgs/waymo_models/pv_rcnn_with_centerhead_rpn.yaml) | 75.95/75.43 |68.02/67.54 |75.94/69.40 |67.66/61.62 |70.18/68.98 |67.73/66.57|
We could not provide the above pretrained models due to the [Waymo Dataset License Agreement](https://waymo.com/open/terms/),
but you could easily achieve similar performance by training with the default configs.
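For rough intuition, the interval subsampling amounts to a one-line slice over the frame infos; below is a minimal sketch with hypothetical values (not the exact dataset code):
```python
# minimal sketch of DATA_CONFIG.SAMPLED_INTERVAL: keep every k-th frame info
infos = [{'frame_id': i} for i in range(100)]  # toy stand-in for the waymo infos
sampled_interval = 5                           # hypothetical value
sampled_infos = infos[::sampled_interval]      # 20 of 100 frames kept
```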
### NuScenes 3D Object Detection Baselines
All models are trained with 8 GTX 1080Ti GPUs and are available for download.
| | mATE | mASE | mAOE | mAVE | mAAE | mAP | NDS | download |
|---------------------------------------------|----------:|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|:---------:|
| [PointPillar-MultiHead](tools/cfgs/nuscenes_models/cbgs_pp_multihead.yaml) | 33.87 | 26.00 | 32.07 | 28.74 | 20.15 | 44.63 | 58.23 | [model-23M](https://drive.google.com/file/d/1p-501mTWsq0G9RzroTWSXreIMyTUUpBM/view?usp=sharing) |
| [SECOND-MultiHead (CBGS)](tools/cfgs/nuscenes_models/cbgs_second_multihead.yaml) | 31.15 | 25.51 | 26.64 | 26.26 | 20.46 | 50.59 | 62.29 | [model-35M](https://drive.google.com/file/d/1bNzcOnE3u9iooBFMk2xK7HqhdeQ_nwTq/view?usp=sharing) |
### Other datasets
More datasets are on the way.
You are welcome to support other datasets by submitting a pull request.
## Installation
......
......@@ -6,8 +6,11 @@ We suppose you already followed the [INSTALL.md](INSTALL.md) to install the `Ope
1. Download the provided pretrained models as shown in the [README.md](../README.md).
2. Make sure you have already installed the `mayavi` visualization tools. If not, you could install it as follows:
2. Make sure you have already installed either [`Open3D`](https://github.com/isl-org/Open3D) (faster) or `mayavi` for visualization.
If not, you could install one of them as follows:
```
pip install open3d
# or
pip install mayavi
```
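As a quick sanity check that the (faster) Open3D dependency works, here is a minimal, self-contained snippet independent of the demo script; `points` is a hypothetical random array standing in for real LiDAR data:
```python
import numpy as np
import open3d as o3d

points = np.random.rand(1000, 3)  # hypothetical (N, 3) xyz array
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
o3d.visualization.draw_geometries([pcd])
```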
......
......@@ -71,11 +71,14 @@ OpenPCDet
│ │ │── raw_data
│ │ │ │── segment-xxxxxxxx.tfrecord
| | | |── ...
| | |── waymo_processed_data
| | |── waymo_processed_data_v0_5_0
│ │ │ │── segment-xxxxxxxx/
| | | |── ...
│ │ │── pcdet_gt_database_train_sampled_xx/
│ │ │── pcdet_waymo_dbinfos_train_sampled_xx.pkl
│ │ │── waymo_processed_data_v0_5_0_gt_database_train_sampled_1/
│ │ │── waymo_processed_data_v0_5_0_waymo_dbinfos_train_sampled_1.pkl
│ │ │── waymo_processed_data_v0_5_0_gt_database_train_sampled_1_global.npy (optional)
│ │ │── waymo_processed_data_v0_5_0_infos_train.pkl (optional)
│ │ │── waymo_processed_data_v0_5_0_infos_val.pkl (optional)
├── pcdet
├── tools
```
......@@ -87,7 +90,7 @@ pip3 install waymo-open-dataset-tf-2-0-0==1.2.0 --user
```
* Extract point cloud data from the tfrecord files and generate data infos by running the following command (it takes several hours,
and you could refer to `data/waymo/waymo_processed_data` to see how many records have been processed):
and you could refer to `data/waymo/waymo_processed_data_v0_5_0` to see how many records have been processed):
```python
python -m pcdet.datasets.waymo.waymo_dataset --func create_waymo_infos \
--cfg_file tools/cfgs/dataset_configs/waymo_dataset.yaml
......@@ -96,7 +99,7 @@ python -m pcdet.datasets.waymo.waymo_dataset --func create_waymo_infos \
Note that you do not need to install `waymo-open-dataset` if you have already processed the data before and do not need to evaluate with the official Waymo metrics.
## Pretrained Models
If you would like to train [CaDDN](../tools/cfgs/kitti_models/CaDDN.yaml), download the pretrained [DeepLabV3 model](https://download.pytorch.org/models/deeplabv3_resnet101_coco-586e9e4e.pth) and place within the `checkpoints` directory
If you would like to train [CaDDN](../tools/cfgs/kitti_models/CaDDN.yaml), download the pretrained [DeepLabV3 model](https://download.pytorch.org/models/deeplabv3_resnet101_coco-586e9e4e.pth) and place it within the `checkpoints` directory. Please make sure [kornia](https://github.com/kornia/kornia) is installed, since it is required by `CaDDN`.
```
OpenPCDet
├── checkpoints
......
......@@ -2,15 +2,15 @@
### Requirements
All the codes are tested in the following environment:
* Linux (tested on Ubuntu 14.04/16.04)
* Linux (tested on Ubuntu 14.04/16.04/18.04/20.04/21.04)
* Python 3.6+
* PyTorch 1.1 or higher (tested on PyTorch 1.1, 1.3, 1.5)
* PyTorch 1.1 or higher (tested on PyTorch 1.1, 1.3, 1.5~1.10)
* CUDA 9.0 or higher (PyTorch 1.3+ needs CUDA 9.2+)
* [`spconv v1.0 (commit 8da6f96)`](https://github.com/traveller59/spconv/tree/8da6f967fb9a054d8870c3515b1b44eca2103634) or [`spconv v1.2`](https://github.com/traveller59/spconv)
* [`spconv v1.0 (commit 8da6f96)`](https://github.com/traveller59/spconv/tree/8da6f967fb9a054d8870c3515b1b44eca2103634) or [`spconv v1.2`](https://github.com/traveller59/spconv) or [`spconv v2.x`](https://github.com/traveller59/spconv)
### Install `pcdet v0.3`
NOTE: Please re-install `pcdet v0.3` by running `python setup.py develop`, even if you have already installed a previous version.
### Install `pcdet v0.5`
NOTE: Please re-install `pcdet v0.5` by running `python setup.py develop`, even if you have already installed a previous version.
a. Clone this repository.
```shell
......@@ -19,16 +19,20 @@ git clone https://github.com/open-mmlab/OpenPCDet.git
b. Install the dependent libraries as follows:
* Install the dependent python libraries:
```
pip install -r requirements.txt
```
[comment]: <> (* Install the dependent python libraries: )
[comment]: <> (```)
[comment]: <> (pip install -r requirements.txt )
[comment]: <> (```)
* Install the SparseConv library; we use the implementation from [`[spconv]`](https://github.com/traveller59/spconv).
* If you use PyTorch 1.1, then make sure you install `spconv v1.0` ([commit 8da6f96](https://github.com/traveller59/spconv/tree/8da6f967fb9a054d8870c3515b1b44eca2103634)) instead of the latest one.
* If you use PyTorch 1.3+, then you need to install `spconv v1.2`. As mentioned by the author of [`spconv`](https://github.com/traveller59/spconv), you need to use their docker if you use PyTorch 1.4+.
* You could also install the latest `spconv v2.x` with pip; see the official documentation of [spconv](https://github.com/traveller59/spconv).
c. Install this `pcdet` library by running the following command:
c. Install this `pcdet` library and its dependent libraries by running the following command:
```shell
python setup.py develop
```
import pickle
import os
import copy
import numpy as np
import SharedArray
import torch.distributed as dist
from ...ops.iou3d_nms import iou3d_nms_utils
from ...utils import box_utils
from ...utils import box_utils, common_utils
class DataBaseSampler(object):
......@@ -25,9 +29,13 @@ class DataBaseSampler(object):
for func_name, val in sampler_cfg.PREPARE.items():
self.db_infos = getattr(self, func_name)(self.db_infos, val)
self.use_shared_memory = sampler_cfg.get('USE_SHARED_MEMORY', False)
self.gt_database_data_key = self.load_db_to_shared_memory() if self.use_shared_memory else None
self.sample_groups = {}
self.sample_class_num = {}
self.limit_whole_scene = sampler_cfg.get('LIMIT_WHOLE_SCENE', False)
for x in sampler_cfg.SAMPLE_GROUPS:
class_name, sample_num = x.split(':')
if class_name not in class_names:
......@@ -47,6 +55,35 @@ class DataBaseSampler(object):
def __setstate__(self, d):
self.__dict__.update(d)
def __del__(self):
if self.use_shared_memory:
self.logger.info('Deleting GT database from shared memory')
cur_rank, num_gpus = common_utils.get_dist_info()
sa_key = self.sampler_cfg.DB_DATA_PATH[0]
if cur_rank % num_gpus == 0 and os.path.exists(f"/dev/shm/{sa_key}"):
SharedArray.delete(f"shm://{sa_key}")
if num_gpus > 1:
dist.barrier()
self.logger.info('GT database has been removed from shared memory')
def load_db_to_shared_memory(self):
self.logger.info('Loading GT database to shared memory')
cur_rank, world_size, num_gpus = common_utils.get_dist_info(return_gpu_per_machine=True)
assert self.sampler_cfg.DB_DATA_PATH.__len__() == 1, 'Currently only a single DB_DATA_PATH is supported'
db_data_path = self.root_path.resolve() / self.sampler_cfg.DB_DATA_PATH[0]
sa_key = self.sampler_cfg.DB_DATA_PATH[0]
if cur_rank % num_gpus == 0 and not os.path.exists(f"/dev/shm/{sa_key}"):
gt_database_data = np.load(db_data_path)
common_utils.sa_create(f"shm://{sa_key}", gt_database_data)
if num_gpus > 1:
dist.barrier()
self.logger.info('GT database has been saved to shared memory')
return sa_key
def filter_by_difficulty(self, db_infos, removed_difficulty):
new_db_infos = {}
for key, dinfos in db_infos.items():
......@@ -128,7 +165,17 @@ class DataBaseSampler(object):
data_dict.pop('road_plane')
obj_points_list = []
if self.use_shared_memory:
gt_database_data = SharedArray.attach(f"shm://{self.gt_database_data_key}")
gt_database_data.setflags(write=0)
else:
gt_database_data = None
for idx, info in enumerate(total_valid_sampled_dict):
if self.use_shared_memory:
start_offset, end_offset = info['global_data_offset']
obj_points = copy.deepcopy(gt_database_data[start_offset:end_offset])
else:
file_path = self.root_path / info['path']
obj_points = np.fromfile(str(file_path), dtype=np.float32).reshape(
[-1, self.sampler_cfg.NUM_POINT_FEATURES])
......
......@@ -9,6 +9,8 @@ import copy
import numpy as np
import torch
import multiprocessing
import SharedArray
import torch.distributed as dist
from tqdm import tqdm
from pathlib import Path
from ...ops.roiaware_pool3d import roiaware_pool3d_utils
......@@ -29,6 +31,11 @@ class WaymoDataset(DatasetTemplate):
self.infos = []
self.include_waymo_data(self.mode)
self.use_shared_memory = self.dataset_cfg.get('USE_SHARED_MEMORY', False) and self.training
if self.use_shared_memory:
self.shared_memory_file_limit = self.dataset_cfg.get('SHARED_MEMORY_FILE_LIMIT', 0x7FFFFFFF)
self.load_data_to_shared_memory()
def set_split(self, split):
super().__init__(
dataset_cfg=self.dataset_cfg, class_names=self.class_names, training=self.training,
......@@ -67,17 +74,67 @@ class WaymoDataset(DatasetTemplate):
self.infos = sampled_waymo_infos
self.logger.info('Total sampled samples for Waymo dataset: %d' % len(self.infos))
def load_data_to_shared_memory(self):
self.logger.info(f'Loading training data to shared memory (file limit={self.shared_memory_file_limit})')
cur_rank, num_gpus = common_utils.get_dist_info()
all_infos = self.infos[:self.shared_memory_file_limit] \
if self.shared_memory_file_limit < len(self.infos) else self.infos
cur_infos = all_infos[cur_rank::num_gpus]
for info in cur_infos:
pc_info = info['point_cloud']
sequence_name = pc_info['lidar_sequence']
sample_idx = pc_info['sample_idx']
sa_key = f'{sequence_name}___{sample_idx}'
if os.path.exists(f"/dev/shm/{sa_key}"):
continue
points = self.get_lidar(sequence_name, sample_idx)
common_utils.sa_create(f"shm://{sa_key}", points)
dist.barrier()
self.logger.info('Training data has been saved to shared memory')
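# NOTE: the striding above (all_infos[cur_rank::num_gpus]) shards the loading work
# across ranks so each frame is written to /dev/shm exactly once per machine, e.g.
#   list(range(10))[1::4] -> [1, 5, 9]
# and the per-rank shards are disjoint while together covering every selected frame.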
def clean_shared_memory(self):
self.logger.info(f'Clean training data from shared memory (file limit={self.shared_memory_file_limit})')
cur_rank, num_gpus = common_utils.get_dist_info()
all_infos = self.infos[:self.shared_memory_file_limit] \
if self.shared_memory_file_limit < len(self.infos) else self.infos
cur_infos = all_infos[cur_rank::num_gpus]
for info in cur_infos:
pc_info = info['point_cloud']
sequence_name = pc_info['lidar_sequence']
sample_idx = pc_info['sample_idx']
sa_key = f'{sequence_name}___{sample_idx}'
if not os.path.exists(f"/dev/shm/{sa_key}"):
continue
SharedArray.delete(f"shm://{sa_key}")
if num_gpus > 1:
dist.barrier()
self.logger.info('Training data has been deleted from shared memory')
@staticmethod
def check_sequence_name_with_all_version(sequence_file):
if '_with_camera_labels' not in str(sequence_file) and not sequence_file.exists():
sequence_file = Path(str(sequence_file)[:-9] + '_with_camera_labels.tfrecord')
if '_with_camera_labels' in str(sequence_file) and not sequence_file.exists():
sequence_file = Path(str(sequence_file).replace('_with_camera_labels', ''))
if not sequence_file.exists():
found_sequence_file = sequence_file
for pre_text in ['training', 'validation', 'testing']:
if not sequence_file.exists():
temp_sequence_file = Path(str(sequence_file).replace('segment', pre_text + '_segment'))
if temp_sequence_file.exists():
found_sequence_file = temp_sequence_file
break
if not found_sequence_file.exists():
found_sequence_file = Path(str(sequence_file).replace('_with_camera_labels', ''))
if found_sequence_file.exists():
sequence_file = found_sequence_file
return sequence_file
def get_infos(self, raw_data_path, save_path, num_workers=multiprocessing.cpu_count(), has_label=True, sampled_interval=1):
import concurrent.futures as futures
from functools import partial
from . import waymo_utils
print('---------------The waymo sample interval is %d, total sequences is %d-----------------'
......@@ -92,10 +149,10 @@ class WaymoDataset(DatasetTemplate):
for sequence_file in self.sample_sequence_list
]
# process_single_sequence(sample_sequence_file_list[0])
with futures.ThreadPoolExecutor(num_workers) as executor:
sequence_infos = list(tqdm(executor.map(process_single_sequence, sample_sequence_file_list),
with multiprocessing.Pool(num_workers) as p:
sequence_infos = list(tqdm(p.imap(process_single_sequence, sample_sequence_file_list),
total=len(sample_sequence_file_list)))
all_sequences_infos = [item for infos in sequence_infos for item in infos]
return all_sequences_infos
......@@ -104,6 +161,7 @@ class WaymoDataset(DatasetTemplate):
point_features = np.load(lidar_file)  # (N, 6): [x, y, z, intensity, elongation, NLZ_flag]
points_all, NLZ_flag = point_features[:, 0:5], point_features[:, 5]
if not self.dataset_cfg.get('DISABLE_NLZ_FLAG_ON_POINTS', False):
points_all = points_all[NLZ_flag == -1]
points_all[:, 3] = np.tanh(points_all[:, 3])
return points_all
......@@ -122,6 +180,11 @@ class WaymoDataset(DatasetTemplate):
pc_info = info['point_cloud']
sequence_name = pc_info['lidar_sequence']
sample_idx = pc_info['sample_idx']
if self.use_shared_memory and index < self.shared_memory_file_limit:
sa_key = f'{sequence_name}___{sample_idx}'
points = SharedArray.attach(f"shm://{sa_key}").copy()
else:
points = self.get_lidar(sequence_name, sample_idx)
input_dict = {
......@@ -138,6 +201,12 @@ class WaymoDataset(DatasetTemplate):
else:
gt_boxes_lidar = annos['gt_boxes_lidar']
if self.training and self.dataset_cfg.get('FILTER_EMPTY_BOXES_FOR_TRAIN', False):
mask = (annos['num_points_in_gt'] > 0) # filter empty boxes
annos['name'] = annos['name'][mask]
gt_boxes_lidar = gt_boxes_lidar[mask]
annos['num_points_in_gt'] = annos['num_points_in_gt'][mask]
input_dict.update({
'gt_names': annos['name'],
'gt_boxes': gt_boxes_lidar,
......@@ -251,15 +320,16 @@ class WaymoDataset(DatasetTemplate):
def create_groundtruth_database(self, info_path, save_path, used_classes=None, split='train', sampled_interval=10,
processed_data_tag=None):
database_save_path = save_path / ('pcdet_gt_database_%s_sampled_%d' % (split, sampled_interval))
db_info_save_path = save_path / ('pcdet_waymo_dbinfos_%s_sampled_%d.pkl' % (split, sampled_interval))
database_save_path = save_path / ('%s_gt_database_%s_sampled_%d' % (processed_data_tag, split, sampled_interval))
db_info_save_path = save_path / ('%s_waymo_dbinfos_%s_sampled_%d.pkl' % (processed_data_tag, split, sampled_interval))
db_data_save_path = save_path / ('%s_gt_database_%s_sampled_%d_global.npy' % (processed_data_tag, split, sampled_interval))
database_save_path.mkdir(parents=True, exist_ok=True)
all_db_infos = {}
with open(info_path, 'rb') as f:
infos = pickle.load(f)
point_offset_cnt = 0
stacked_gt_points = []
for k in range(0, len(infos), sampled_interval):
print('gt_database sample: %d/%d' % (k + 1, len(infos)))
info = infos[k]
......@@ -274,7 +344,21 @@ class WaymoDataset(DatasetTemplate):
difficulty = annos['difficulty']
gt_boxes = annos['gt_boxes_lidar']
if k % 4 != 0 and len(names) > 0:
mask = (names == 'Vehicle')
names = names[~mask]
difficulty = difficulty[~mask]
gt_boxes = gt_boxes[~mask]
if k % 2 != 0 and len(names) > 0:
mask = (names == 'Pedestrian')
names = names[~mask]
difficulty = difficulty[~mask]
gt_boxes = gt_boxes[~mask]
num_obj = gt_boxes.shape[0]
if num_obj == 0:
continue
box_idxs_of_pts = roiaware_pool3d_utils.points_in_boxes_gpu(
torch.from_numpy(points[:, 0:3]).unsqueeze(dim=0).float().cuda(),
......@@ -295,6 +379,12 @@ class WaymoDataset(DatasetTemplate):
db_info = {'name': names[i], 'path': db_path, 'sequence_name': sequence_name,
'sample_idx': sample_idx, 'gt_idx': i, 'box3d_lidar': gt_boxes[i],
'num_points_in_gt': gt_points.shape[0], 'difficulty': difficulty[i]}
# it will be used if you choose to use shared memory for gt sampling
stacked_gt_points.append(gt_points)
db_info['global_data_offset'] = [point_offset_cnt, point_offset_cnt + gt_points.shape[0]]
point_offset_cnt += gt_points.shape[0]
if names[i] in all_db_infos:
all_db_infos[names[i]].append(db_info)
else:
......@@ -305,19 +395,24 @@ class WaymoDataset(DatasetTemplate):
with open(db_info_save_path, 'wb') as f:
pickle.dump(all_db_infos, f)
# it will be used if you choose to use shared memory for gt sampling
stacked_gt_points = np.concatenate(stacked_gt_points, axis=0)
np.save(db_data_save_path, stacked_gt_points)
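# NOTE: each db_info above records global_data_offset = [start, end) into this
# stacked array, so shared-memory gt sampling can slice object points without
# touching the per-object files, roughly:
#   gt_points_all = np.load(db_data_save_path)
#   start, end = db_info['global_data_offset']
#   obj_points = gt_points_all[start:end]  # same rows as np.fromfile(db_info['path'])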
def create_waymo_infos(dataset_cfg, class_names, data_path, save_path,
raw_data_tag='raw_data', processed_data_tag='waymo_processed_data',
workers=multiprocessing.cpu_count()):
workers=min(16, multiprocessing.cpu_count())):
dataset = WaymoDataset(
dataset_cfg=dataset_cfg, class_names=class_names, root_path=data_path,
training=False, logger=common_utils.create_logger()
)
train_split, val_split = 'train', 'val'
train_filename = save_path / ('waymo_infos_%s.pkl' % train_split)
val_filename = save_path / ('waymo_infos_%s.pkl' % val_split)
train_filename = save_path / ('%s_infos_%s.pkl' % (processed_data_tag, train_split))
val_filename = save_path / ('%s_infos_%s.pkl' % (processed_data_tag, val_split))
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
print('---------------Start to generate data infos---------------')
dataset.set_split(train_split)
......@@ -341,10 +436,11 @@ def create_waymo_infos(dataset_cfg, class_names, data_path, save_path,
print('----------------Waymo info val file is saved to %s----------------' % val_filename)
print('---------------Start create groundtruth database for data augmentation---------------')
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
dataset.set_split(train_split)
dataset.create_groundtruth_database(
info_path=train_filename, save_path=save_path, split='train', sampled_interval=10,
used_classes=['Vehicle', 'Pedestrian', 'Cyclist']
info_path=train_filename, save_path=save_path, split='train', sampled_interval=1,
used_classes=['Vehicle', 'Pedestrian', 'Cyclist'], processed_data_tag=processed_data_tag
)
print('---------------Data preparation Done---------------')
......@@ -355,18 +451,24 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser(description='arg parser')
parser.add_argument('--cfg_file', type=str, default=None, help='specify the config of dataset')
parser.add_argument('--func', type=str, default='create_waymo_infos', help='')
parser.add_argument('--processed_data_tag', type=str, default='waymo_processed_data_v0_5_0', help='')
args = parser.parse_args()
if args.func == 'create_waymo_infos':
import yaml
from easydict import EasyDict
dataset_cfg = EasyDict(yaml.load(open(args.cfg_file)))
try:
yaml_config = yaml.load(open(args.cfg_file), Loader=yaml.FullLoader)
except:
yaml_config = yaml.load(open(args.cfg_file))
dataset_cfg = EasyDict(yaml_config)
ROOT_DIR = (Path(__file__).resolve().parent / '../../../').resolve()
dataset_cfg.PROCESSED_DATA_TAG = args.processed_data_tag
create_waymo_infos(
dataset_cfg=dataset_cfg,
class_names=['Vehicle', 'Pedestrian', 'Cyclist'],
data_path=ROOT_DIR / 'data' / 'waymo',
save_path=ROOT_DIR / 'data' / 'waymo',
raw_data_tag='raw_data',
processed_data_tag=dataset_cfg.PROCESSED_DATA_TAG
processed_data_tag=args.processed_data_tag
)
......@@ -236,7 +236,7 @@ def main():
gt_infos_dst.append(cur_info)
waymo_AP = eval.waymo_evaluation(
pred_infos, gt_infos_dst, class_name=args.class_names, distance_thresh=1000, fake_gt_infos=True
pred_infos, gt_infos_dst, class_name=args.class_names, distance_thresh=1000, fake_gt_infos=False
)
print(waymo_AP)
......
......@@ -61,7 +61,7 @@ def generate_labels(frame):
return annotations
def convert_range_image_to_point_cloud(frame, range_images, camera_projections, range_image_top_pose, ri_index=0):
def convert_range_image_to_point_cloud(frame, range_images, camera_projections, range_image_top_pose, ri_index=(0, 1)):
"""
Modified from the codes of Waymo Open Dataset.
Convert range images to point cloud.
......@@ -97,8 +97,12 @@ def convert_range_image_to_point_cloud(frame, range_images, camera_projections,
range_image_top_pose_tensor = transform_utils.get_transform(
range_image_top_pose_tensor_rotation,
range_image_top_pose_tensor_translation)
for c in calibrations:
range_image = range_images[c.name][ri_index]
points_single, cp_points_single, points_NLZ_single, points_intensity_single, points_elongation_single \
= [], [], [], [], []
for cur_ri_index in ri_index:
range_image = range_images[c.name][cur_ri_index]
if len(c.beam_inclinations) == 0: # pylint: disable=g-explicit-length-test
beam_inclinations = range_image_utils.compute_inclination(
tf.constant([c.beam_inclination_min, c.beam_inclination_max]),
......@@ -137,21 +141,29 @@ def convert_range_image_to_point_cloud(frame, range_images, camera_projections,
cp = camera_projections[c.name][0]
cp_tensor = tf.reshape(tf.convert_to_tensor(cp.data), cp.shape.dims)
cp_points_tensor = tf.gather_nd(cp_tensor, tf.where(range_image_mask))
points.append(points_tensor.numpy())
cp_points.append(cp_points_tensor.numpy())
points_NLZ.append(points_NLZ_tensor.numpy())
points_intensity.append(points_intensity_tensor.numpy())
points_elongation.append(points_elongation_tensor.numpy())
points_single.append(points_tensor.numpy())
cp_points_single.append(cp_points_tensor.numpy())
points_NLZ_single.append(points_NLZ_tensor.numpy())
points_intensity_single.append(points_intensity_tensor.numpy())
points_elongation_single.append(points_elongation_tensor.numpy())
points.append(np.concatenate(points_single, axis=0))
cp_points.append(np.concatenate(cp_points_single, axis=0))
points_NLZ.append(np.concatenate(points_NLZ_single, axis=0))
points_intensity.append(np.concatenate(points_intensity_single, axis=0))
points_elongation.append(np.concatenate(points_elongation_single, axis=0))
return points, cp_points, points_NLZ, points_intensity, points_elongation
def save_lidar_points(frame, cur_save_path):
def save_lidar_points(frame, cur_save_path, use_two_returns=True):
range_images, camera_projections, range_image_top_pose = \
frame_utils.parse_range_image_and_camera_projection(frame)
points, cp_points, points_in_NLZ_flag, points_intensity, points_elongation = \
convert_range_image_to_point_cloud(frame, range_images, camera_projections, range_image_top_pose)
points, cp_points, points_in_NLZ_flag, points_intensity, points_elongation = convert_range_image_to_point_cloud(
frame, range_images, camera_projections, range_image_top_pose, ri_index=(0, 1) if use_two_returns else (0,)
)
# 3d points in vehicle frame.
points_all = np.concatenate(points, axis=0)
......@@ -169,7 +181,7 @@ def save_lidar_points(frame, cur_save_path):
return num_points_of_each_lidar
def process_single_sequence(sequence_file, save_path, sampled_interval, has_label=True):
def process_single_sequence(sequence_file, save_path, sampled_interval, has_label=True, use_two_returns=True):
sequence_name = os.path.splitext(os.path.basename(sequence_file))[0]
# print('Load record (sampled_interval=%d): %s' % (sampled_interval, sequence_name))
......@@ -200,6 +212,10 @@ def process_single_sequence(sequence_file, save_path, sampled_interval, has_labe
info['point_cloud'] = pc_info
info['frame_id'] = sequence_name + ('_%03d' % cnt)
info['metadata'] = {
'context_name': frame.context.name,
'timestamp_micros': frame.timestamp_micros
}
image_info = {}
for j in range(5):
width = frame.context.camera_calibrations[j].width
......@@ -214,7 +230,9 @@ def process_single_sequence(sequence_file, save_path, sampled_interval, has_labe
annotations = generate_labels(frame)
info['annos'] = annotations
num_points_of_each_lidar = save_lidar_points(frame, cur_save_dir / ('%04d.npy' % cnt))
num_points_of_each_lidar = save_lidar_points(
frame, cur_save_dir / ('%04d.npy' % cnt), use_two_returns=use_two_returns
)
info['num_points_of_each_lidar'] = num_points_of_each_lidar
sequence_infos.append(info)
......
......@@ -6,7 +6,8 @@ try:
from kornia.geometry.linalg import transform_points
except Exception as e:
# Note: Kornia team will fix this import issue to try to allow the usage of lower torch versions.
print('Warning: kornia is not installed correctly, please ignore this warning if you do not use CaDDN. Otherwise, it is recommended to use torch version greater than 1.2 to use kornia properly.')
# print('Warning: kornia is not installed correctly, please ignore this warning if you do not use CaDDN. Otherwise, it is recommended to use torch version greater than 1.2 to use kornia properly.')
pass
from pcdet.utils import transform_utils
......@@ -22,6 +23,14 @@ class FrustumGridGenerator(nn.Module):
disc_cfg: EasyDict, Depth discretiziation configuration
"""
super().__init__()
try:
import kornia
except Exception as e:
# Note: Kornia team will fix this import issue to try to allow the usage of lower torch versions.
print('Error: kornia is not installed correctly, and it is required by CaDDN. '
'It is recommended to use a torch version greater than 1.2 so that kornia can be installed properly.')
exit(-1)
self.dtype = torch.float32
self.grid_size = torch.as_tensor(grid_size, dtype=self.dtype)
self.pc_range = pc_range
......
import torchvision
from .ddn_template import DDNTemplate
try:
import torchvision
except:
pass
class DDNDeepLabV3(DDNTemplate):
......
......@@ -2,11 +2,9 @@ from collections import OrderedDict
from pathlib import Path
from torch import hub
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
try:
from kornia.enhance.normalize import normalize
......
......@@ -4,6 +4,7 @@ from .anchor_head_template import AnchorHeadTemplate
from .point_head_box import PointHeadBox
from .point_head_simple import PointHeadSimple
from .point_intra_part_head import PointIntraPartOffsetHead
from .center_head import CenterHead
__all__ = {
'AnchorHeadTemplate': AnchorHeadTemplate,
......@@ -12,4 +13,5 @@ __all__ = {
'PointHeadSimple': PointHeadSimple,
'PointHeadBox': PointHeadBox,
'AnchorHeadMulti': AnchorHeadMulti,
'CenterHead': CenterHead
}
import copy
import numpy as np
import torch
import torch.nn as nn
from torch.nn.init import kaiming_normal_
from ..model_utils import model_nms_utils
from ..model_utils import centernet_utils
from ...utils import loss_utils
class SeparateHead(nn.Module):
def __init__(self, input_channels, sep_head_dict, init_bias=-2.19, use_bias=False):
super().__init__()
self.sep_head_dict = sep_head_dict
for cur_name in self.sep_head_dict:
output_channels = self.sep_head_dict[cur_name]['out_channels']
num_conv = self.sep_head_dict[cur_name]['num_conv']
fc_list = []
for k in range(num_conv - 1):
fc_list.append(nn.Sequential(
nn.Conv2d(input_channels, input_channels, kernel_size=3, stride=1, padding=1, bias=use_bias),
nn.BatchNorm2d(input_channels),
nn.ReLU()
))
fc_list.append(nn.Conv2d(input_channels, output_channels, kernel_size=3, stride=1, padding=1, bias=True))
fc = nn.Sequential(*fc_list)
if 'hm' in cur_name:
fc[-1].bias.data.fill_(init_bias)
else:
for m in fc.modules():
if isinstance(m, nn.Conv2d):
kaiming_normal_(m.weight.data)
if hasattr(m, "bias") and m.bias is not None:
nn.init.constant_(m.bias, 0)
self.__setattr__(cur_name, fc)
def forward(self, x):
ret_dict = {}
for cur_name in self.sep_head_dict:
ret_dict[cur_name] = self.__getattr__(cur_name)(x)
return ret_dict
class CenterHead(nn.Module):
def __init__(self, model_cfg, input_channels, num_class, class_names, grid_size, point_cloud_range, voxel_size,
predict_boxes_when_training=True):
super().__init__()
self.model_cfg = model_cfg
self.num_class = num_class
self.grid_size = grid_size
self.point_cloud_range = point_cloud_range
self.voxel_size = voxel_size
self.feature_map_stride = self.model_cfg.TARGET_ASSIGNER_CONFIG.get('FEATURE_MAP_STRIDE', None)
self.class_names = class_names
self.class_names_each_head = []
self.class_id_mapping_each_head = []
for cur_class_names in self.model_cfg.CLASS_NAMES_EACH_HEAD:
self.class_names_each_head.append([x for x in cur_class_names if x in class_names])
cur_class_id_mapping = torch.from_numpy(np.array(
[self.class_names.index(x) for x in cur_class_names if x in class_names]
)).cuda()
self.class_id_mapping_each_head.append(cur_class_id_mapping)
total_classes = sum([len(x) for x in self.class_names_each_head])
assert total_classes == len(self.class_names), f'class_names_each_head={self.class_names_each_head}'
self.shared_conv = nn.Sequential(
nn.Conv2d(
input_channels, self.model_cfg.SHARED_CONV_CHANNEL, 3, stride=1, padding=1,
bias=self.model_cfg.get('USE_BIAS_BEFORE_NORM', False)
),
nn.BatchNorm2d(self.model_cfg.SHARED_CONV_CHANNEL),
nn.ReLU(),
)
self.heads_list = nn.ModuleList()
self.separate_head_cfg = self.model_cfg.SEPARATE_HEAD_CFG
for idx, cur_class_names in enumerate(self.class_names_each_head):
cur_head_dict = copy.deepcopy(self.separate_head_cfg.HEAD_DICT)
cur_head_dict['hm'] = dict(out_channels=len(cur_class_names), num_conv=self.model_cfg.NUM_HM_CONV)
self.heads_list.append(
SeparateHead(
input_channels=self.model_cfg.SHARED_CONV_CHANNEL,
sep_head_dict=cur_head_dict,
init_bias=-2.19,
use_bias=self.model_cfg.get('USE_BIAS_BEFORE_NORM', False)
)
)
self.predict_boxes_when_training = predict_boxes_when_training
self.forward_ret_dict = {}
self.build_losses()
def build_losses(self):
self.add_module('hm_loss_func', loss_utils.FocalLossCenterNet())
self.add_module('reg_loss_func', loss_utils.RegLossCenterNet())
def assign_target_of_single_head(
self, num_classes, gt_boxes, feature_map_size, feature_map_stride, num_max_objs=500,
gaussian_overlap=0.1, min_radius=2
):
"""
Args:
gt_boxes: (N, 8)
feature_map_size: (2), [x, y]
Returns:
"""
heatmap = gt_boxes.new_zeros(num_classes, feature_map_size[1], feature_map_size[0])
ret_boxes = gt_boxes.new_zeros((num_max_objs, gt_boxes.shape[-1] - 1 + 1))
inds = gt_boxes.new_zeros(num_max_objs).long()
mask = gt_boxes.new_zeros(num_max_objs).long()
x, y, z = gt_boxes[:, 0], gt_boxes[:, 1], gt_boxes[:, 2]
coord_x = (x - self.point_cloud_range[0]) / self.voxel_size[0] / feature_map_stride
coord_y = (y - self.point_cloud_range[1]) / self.voxel_size[1] / feature_map_stride
coord_x = torch.clamp(coord_x, min=0, max=feature_map_size[0] - 0.5) # bugfixed: 1e-6 does not work for center.int()
coord_y = torch.clamp(coord_y, min=0, max=feature_map_size[1] - 0.5)  # bugfixed: 1e-6 does not work for center.int()
center = torch.cat((coord_x[:, None], coord_y[:, None]), dim=-1)
center_int = center.int()
center_int_float = center_int.float()
dx, dy, dz = gt_boxes[:, 3], gt_boxes[:, 4], gt_boxes[:, 5]
dx = dx / self.voxel_size[0] / feature_map_stride
dy = dy / self.voxel_size[1] / feature_map_stride
radius = centernet_utils.gaussian_radius(dx, dy, min_overlap=gaussian_overlap)
radius = torch.clamp_min(radius.int(), min=min_radius)
for k in range(min(num_max_objs, gt_boxes.shape[0])):
if dx[k] <= 0 or dy[k] <= 0:
continue
if not (0 <= center_int[k][0] <= feature_map_size[0] and 0 <= center_int[k][1] <= feature_map_size[1]):
continue
cur_class_id = (gt_boxes[k, -1] - 1).long()
centernet_utils.draw_gaussian_to_heatmap(heatmap[cur_class_id], center[k], radius[k].item())
inds[k] = center_int[k, 1] * feature_map_size[0] + center_int[k, 0]
mask[k] = 1
ret_boxes[k, 0:2] = center[k] - center_int_float[k].float()
ret_boxes[k, 2] = z[k]
ret_boxes[k, 3:6] = gt_boxes[k, 3:6].log()
ret_boxes[k, 6] = torch.cos(gt_boxes[k, 6])
ret_boxes[k, 7] = torch.sin(gt_boxes[k, 6])
if gt_boxes.shape[1] > 8:
ret_boxes[k, 8:] = gt_boxes[k, 7:-1]
return heatmap, ret_boxes, inds, mask
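# NOTE: inds[k] above is the row-major flattened heatmap index y * W + x
# (with W = feature_map_size[0]), so x = ind % W and y = ind // W recover the cell;
# _topk() in centernet_utils inverts it in exactly this way.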
def assign_targets(self, gt_boxes, feature_map_size=None, **kwargs):
"""
Args:
gt_boxes: (B, M, 8)
range_image_polar: (B, 3, H, W)
feature_map_size: (2) [H, W]
spatial_cartesian: (B, 4, H, W)
Returns:
"""
feature_map_size = feature_map_size[::-1] # [H, W] ==> [x, y]
target_assigner_cfg = self.model_cfg.TARGET_ASSIGNER_CONFIG
# feature_map_size = self.grid_size[:2] // target_assigner_cfg.FEATURE_MAP_STRIDE
batch_size = gt_boxes.shape[0]
ret_dict = {
'heatmaps': [],
'target_boxes': [],
'inds': [],
'masks': [],
'heatmap_masks': []
}
all_names = np.array(['bg', *self.class_names])
for idx, cur_class_names in enumerate(self.class_names_each_head):
heatmap_list, target_boxes_list, inds_list, masks_list = [], [], [], []
for bs_idx in range(batch_size):
cur_gt_boxes = gt_boxes[bs_idx]
gt_class_names = all_names[cur_gt_boxes[:, -1].cpu().long().numpy()]
gt_boxes_single_head = []
for idx, name in enumerate(gt_class_names):
if name not in cur_class_names:
continue
temp_box = cur_gt_boxes[idx]
temp_box[-1] = cur_class_names.index(name) + 1
gt_boxes_single_head.append(temp_box[None, :])
if len(gt_boxes_single_head) == 0:
gt_boxes_single_head = cur_gt_boxes[:0, :]
else:
gt_boxes_single_head = torch.cat(gt_boxes_single_head, dim=0)
heatmap, ret_boxes, inds, mask = self.assign_target_of_single_head(
num_classes=len(cur_class_names), gt_boxes=gt_boxes_single_head.cpu(),
feature_map_size=feature_map_size, feature_map_stride=target_assigner_cfg.FEATURE_MAP_STRIDE,
num_max_objs=target_assigner_cfg.NUM_MAX_OBJS,
gaussian_overlap=target_assigner_cfg.GAUSSIAN_OVERLAP,
min_radius=target_assigner_cfg.MIN_RADIUS,
)
heatmap_list.append(heatmap.to(gt_boxes_single_head.device))
target_boxes_list.append(ret_boxes.to(gt_boxes_single_head.device))
inds_list.append(inds.to(gt_boxes_single_head.device))
masks_list.append(mask.to(gt_boxes_single_head.device))
ret_dict['heatmaps'].append(torch.stack(heatmap_list, dim=0))
ret_dict['target_boxes'].append(torch.stack(target_boxes_list, dim=0))
ret_dict['inds'].append(torch.stack(inds_list, dim=0))
ret_dict['masks'].append(torch.stack(masks_list, dim=0))
return ret_dict
def sigmoid(self, x):
y = torch.clamp(x.sigmoid(), min=1e-4, max=1 - 1e-4)
return y
def get_loss(self):
pred_dicts = self.forward_ret_dict['pred_dicts']
target_dicts = self.forward_ret_dict['target_dicts']
tb_dict = {}
loss = 0
for idx, pred_dict in enumerate(pred_dicts):
pred_dict['hm'] = self.sigmoid(pred_dict['hm'])
hm_loss = self.hm_loss_func(pred_dict['hm'], target_dicts['heatmaps'][idx])
target_boxes = target_dicts['target_boxes'][idx]
pred_boxes = torch.cat([pred_dict[head_name] for head_name in self.separate_head_cfg.HEAD_ORDER], dim=1)
reg_loss = self.reg_loss_func(
pred_boxes, target_dicts['masks'][idx], target_dicts['inds'][idx], target_boxes
)
loc_loss = (reg_loss * reg_loss.new_tensor(self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['code_weights'])).sum()
loc_loss = loc_loss * self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['loc_weight']
loss += hm_loss + loc_loss
tb_dict['hm_loss_head_%d' % idx] = hm_loss.item()
tb_dict['loc_loss_head_%d' % idx] = loc_loss.item()
tb_dict['rpn_loss'] = loss.item()
return loss, tb_dict
def generate_predicted_boxes(self, batch_size, pred_dicts):
post_process_cfg = self.model_cfg.POST_PROCESSING
post_center_limit_range = torch.tensor(post_process_cfg.POST_CENTER_LIMIT_RANGE).cuda().float()
ret_dict = [{
'pred_boxes': [],
'pred_scores': [],
'pred_labels': [],
} for k in range(batch_size)]
for idx, pred_dict in enumerate(pred_dicts):
batch_hm = pred_dict['hm'].sigmoid()
batch_center = pred_dict['center']
batch_center_z = pred_dict['center_z']
batch_dim = pred_dict['dim'].exp()
batch_rot_cos = pred_dict['rot'][:, 0].unsqueeze(dim=1)
batch_rot_sin = pred_dict['rot'][:, 1].unsqueeze(dim=1)
batch_vel = pred_dict['vel'] if 'vel' in self.separate_head_cfg.HEAD_ORDER else None
final_pred_dicts = centernet_utils.decode_bbox_from_heatmap(
heatmap=batch_hm, rot_cos=batch_rot_cos, rot_sin=batch_rot_sin,
center=batch_center, center_z=batch_center_z, dim=batch_dim, vel=batch_vel,
point_cloud_range=self.point_cloud_range, voxel_size=self.voxel_size,
feature_map_stride=self.feature_map_stride,
K=post_process_cfg.MAX_OBJ_PER_SAMPLE,
circle_nms=(post_process_cfg.NMS_CONFIG.NMS_TYPE == 'circle_nms'),
score_thresh=post_process_cfg.SCORE_THRESH,
post_center_limit_range=post_center_limit_range
)
for k, final_dict in enumerate(final_pred_dicts):
final_dict['pred_labels'] = self.class_id_mapping_each_head[idx][final_dict['pred_labels'].long()]
if post_process_cfg.NMS_CONFIG.NMS_TYPE != 'circle_nms':
selected, selected_scores = model_nms_utils.class_agnostic_nms(
box_scores=final_dict['pred_scores'], box_preds=final_dict['pred_boxes'],
nms_config=post_process_cfg.NMS_CONFIG,
score_thresh=None
)
final_dict['pred_boxes'] = final_dict['pred_boxes'][selected]
final_dict['pred_scores'] = selected_scores
final_dict['pred_labels'] = final_dict['pred_labels'][selected]
ret_dict[k]['pred_boxes'].append(final_dict['pred_boxes'])
ret_dict[k]['pred_scores'].append(final_dict['pred_scores'])
ret_dict[k]['pred_labels'].append(final_dict['pred_labels'])
for k in range(batch_size):
ret_dict[k]['pred_boxes'] = torch.cat(ret_dict[k]['pred_boxes'], dim=0)
ret_dict[k]['pred_scores'] = torch.cat(ret_dict[k]['pred_scores'], dim=0)
ret_dict[k]['pred_labels'] = torch.cat(ret_dict[k]['pred_labels'], dim=0) + 1
return ret_dict
@staticmethod
def reorder_rois_for_refining(batch_size, pred_dicts):
num_max_rois = max([len(cur_dict['pred_boxes']) for cur_dict in pred_dicts])
num_max_rois = max(1, num_max_rois)  # at least one faked roi to avoid errors
pred_boxes = pred_dicts[0]['pred_boxes']
rois = pred_boxes.new_zeros((batch_size, num_max_rois, pred_boxes.shape[-1]))
roi_scores = pred_boxes.new_zeros((batch_size, num_max_rois))
roi_labels = pred_boxes.new_zeros((batch_size, num_max_rois)).long()
for bs_idx in range(batch_size):
num_boxes = len(pred_dicts[bs_idx]['pred_boxes'])
rois[bs_idx, :num_boxes, :] = pred_dicts[bs_idx]['pred_boxes']
roi_scores[bs_idx, :num_boxes] = pred_dicts[bs_idx]['pred_scores']
roi_labels[bs_idx, :num_boxes] = pred_dicts[bs_idx]['pred_labels']
return rois, roi_scores, roi_labels
def forward(self, data_dict):
spatial_features_2d = data_dict['spatial_features_2d']
x = self.shared_conv(spatial_features_2d)
pred_dicts = []
for head in self.heads_list:
pred_dicts.append(head(x))
if self.training:
target_dict = self.assign_targets(
data_dict['gt_boxes'], feature_map_size=spatial_features_2d.size()[2:],
feature_map_stride=data_dict.get('spatial_features_2d_strides', None)
)
self.forward_ret_dict['target_dicts'] = target_dict
self.forward_ret_dict['pred_dicts'] = pred_dicts
if not self.training or self.predict_boxes_when_training:
pred_dicts = self.generate_predicted_boxes(
data_dict['batch_size'], pred_dicts
)
if self.predict_boxes_when_training:
rois, roi_scores, roi_labels = self.reorder_rois_for_refining(data_dict['batch_size'], pred_dicts)
data_dict['rois'] = rois
data_dict['roi_scores'] = roi_scores
data_dict['roi_labels'] = roi_labels
data_dict['has_class_labels'] = True
else:
data_dict['final_box_dicts'] = pred_dicts
return data_dict
......@@ -7,6 +7,7 @@ from .second_net import SECONDNet
from .second_net_iou import SECONDNetIoU
from .caddn import CaDDN
from .voxel_rcnn import VoxelRCNN
from .centerpoint import CenterPoint
__all__ = {
'Detector3DTemplate': Detector3DTemplate,
......@@ -17,7 +18,8 @@ __all__ = {
'PointRCNN': PointRCNN,
'SECONDNetIoU': SECONDNetIoU,
'CaDDN': CaDDN,
'VoxelRCNN': VoxelRCNN
'VoxelRCNN': VoxelRCNN,
'CenterPoint': CenterPoint
}
......
from .detector3d_template import Detector3DTemplate
class CenterPoint(Detector3DTemplate):
def __init__(self, model_cfg, num_class, dataset):
super().__init__(model_cfg=model_cfg, num_class=num_class, dataset=dataset)
self.module_list = self.build_networks()
def forward(self, batch_dict):
for cur_module in self.module_list:
batch_dict = cur_module(batch_dict)
if self.training:
loss, tb_dict, disp_dict = self.get_training_loss()
ret_dict = {
'loss': loss
}
return ret_dict, tb_dict, disp_dict
else:
pred_dicts, recall_dicts = self.post_processing(batch_dict)
return pred_dicts, recall_dicts
def get_training_loss(self):
disp_dict = {}
loss_rpn, tb_dict = self.dense_head.get_loss()
tb_dict = {
'loss_rpn': loss_rpn.item(),
**tb_dict
}
loss = loss_rpn
return loss, tb_dict, disp_dict
def post_processing(self, batch_dict):
post_process_cfg = self.model_cfg.POST_PROCESSING
batch_size = batch_dict['batch_size']
final_pred_dict = batch_dict['final_box_dicts']
recall_dict = {}
for index in range(batch_size):
pred_boxes = final_pred_dict[index]['pred_boxes']
recall_dict = self.generate_recall_record(
box_preds=pred_boxes,
recall_dict=recall_dict, batch_index=index, data_dict=batch_dict,
thresh_list=post_process_cfg.RECALL_THRESH_LIST
)
return final_pred_dict, recall_dict
......@@ -132,7 +132,8 @@ class Detector3DTemplate(nn.Module):
class_names=self.class_names,
grid_size=model_info_dict['grid_size'],
point_cloud_range=model_info_dict['point_cloud_range'],
predict_boxes_when_training=self.model_cfg.get('ROI_HEAD', False)
predict_boxes_when_training=self.model_cfg.get('ROI_HEAD', False),
voxel_size=model_info_dict.get('voxel_size', False)
)
model_info_dict['module_list'].append(dense_head_module)
return dense_head_module, model_info_dict
......
# This file is modified from https://github.com/tianweiy/CenterPoint
import torch
import torch.nn.functional as F
import numpy as np
import numba
def gaussian_radius(height, width, min_overlap=0.5):
"""
Args:
height: (N)
width: (N)
min_overlap:
Returns:
"""
a1 = 1
b1 = (height + width)
c1 = width * height * (1 - min_overlap) / (1 + min_overlap)
sq1 = (b1 ** 2 - 4 * a1 * c1).sqrt()
r1 = (b1 + sq1) / 2
a2 = 4
b2 = 2 * (height + width)
c2 = (1 - min_overlap) * width * height
sq2 = (b2 ** 2 - 4 * a2 * c2).sqrt()
r2 = (b2 + sq2) / 2
a3 = 4 * min_overlap
b3 = -2 * min_overlap * (height + width)
c3 = (min_overlap - 1) * width * height
sq3 = (b3 ** 2 - 4 * a3 * c3).sqrt()
r3 = (b3 + sq3) / 2
ret = torch.min(torch.min(r1, r2), r3)
return ret
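# NOTE: as in CornerNet, r1/r2/r3 solve the three quadratic overlap cases for the
# largest corner shift that still keeps IoU >= min_overlap, and the minimum is kept
# to be safe; e.g. (hypothetical feature-map sizes)
#   gaussian_radius(torch.tensor([10.]), torch.tensor([20.]), min_overlap=0.1)
# returns a larger radius than a 4 x 6 box would, since bigger boxes tolerate
# bigger center shifts.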
def gaussian2D(shape, sigma=1):
m, n = [(ss - 1.) / 2. for ss in shape]
y, x = np.ogrid[-m:m + 1, -n:n + 1]
h = np.exp(-(x * x + y * y) / (2 * sigma * sigma))
h[h < np.finfo(h.dtype).eps * h.max()] = 0
return h
def draw_gaussian_to_heatmap(heatmap, center, radius, k=1, valid_mask=None):
diameter = 2 * radius + 1
gaussian = gaussian2D((diameter, diameter), sigma=diameter / 6)
x, y = int(center[0]), int(center[1])
height, width = heatmap.shape[0:2]
left, right = min(x, radius), min(width - x, radius + 1)
top, bottom = min(y, radius), min(height - y, radius + 1)
masked_heatmap = heatmap[y - top:y + bottom, x - left:x + right]
masked_gaussian = torch.from_numpy(
gaussian[radius - top:radius + bottom, radius - left:radius + right]
).to(heatmap.device).float()
if min(masked_gaussian.shape) > 0 and min(masked_heatmap.shape) > 0: # TODO debug
if valid_mask is not None:
cur_valid_mask = valid_mask[y - top:y + bottom, x - left:x + right]
masked_gaussian = masked_gaussian * cur_valid_mask.float()
torch.max(masked_heatmap, masked_gaussian * k, out=masked_heatmap)
return heatmap
def _nms(heat, kernel=3):
pad = (kernel - 1) // 2
hmax = F.max_pool2d(heat, (kernel, kernel), stride=1, padding=pad)
keep = (hmax == heat).float()
return heat * keep
@numba.jit(nopython=True)
def circle_nms(dets, thresh):
x1 = dets[:, 0]
y1 = dets[:, 1]
scores = dets[:, 2]
order = scores.argsort()[::-1].astype(np.int32) # highest->lowest
ndets = dets.shape[0]
suppressed = np.zeros((ndets), dtype=np.int32)
keep = []
for _i in range(ndets):
i = order[_i] # start with highest score box
if suppressed[i] == 1: # if any box have enough iou with this, remove it
continue
keep.append(i)
for _j in range(_i + 1, ndets):
j = order[_j]
if suppressed[j] == 1:
continue
# calculate center distance between i and j box
dist = (x1[i] - x1[j]) ** 2 + (y1[i] - y1[j]) ** 2
# ovr = inter / areas[j]
if dist <= thresh:
suppressed[j] = 1
return keep
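# NOTE: thresh is compared against the SQUARED center distance, e.g. for
#   dets = np.array([[0., 0., .9], [.1, .1, .8], [5., 5., .7]], dtype=np.float32)
# circle_nms(dets, thresh=1.0) keeps [0, 2]: the second center lies within
# sqrt(1.0) of the first (dist^2 = 0.02) and is suppressed.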
def _circle_nms(boxes, min_radius, post_max_size=83):
"""
NMS according to center distance
"""
keep = np.array(circle_nms(boxes.cpu().numpy(), thresh=min_radius))[:post_max_size]
keep = torch.from_numpy(keep).long().to(boxes.device)
return keep
def _gather_feat(feat, ind, mask=None):
dim = feat.size(2)
ind = ind.unsqueeze(2).expand(ind.size(0), ind.size(1), dim)
feat = feat.gather(1, ind)
if mask is not None:
mask = mask.unsqueeze(2).expand_as(feat)
feat = feat[mask]
feat = feat.view(-1, dim)
return feat
def _transpose_and_gather_feat(feat, ind):
feat = feat.permute(0, 2, 3, 1).contiguous()
feat = feat.view(feat.size(0), -1, feat.size(3))
feat = _gather_feat(feat, ind)
return feat
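# NOTE: _transpose_and_gather_feat maps a (B, C, H, W) feature map plus flattened
# (B, K) indices (ind = y * W + x) to per-location (B, K, C) feature vectors, e.g.
#   feat = torch.zeros(2, 3, 2, 2); ind = torch.tensor([[0, 3], [1, 2]])
#   _transpose_and_gather_feat(feat, ind).shape -> torch.Size([2, 2, 3])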
def _topk(scores, K=40):
batch, num_class, height, width = scores.size()
topk_scores, topk_inds = torch.topk(scores.flatten(2, 3), K)
topk_inds = topk_inds % (height * width)
topk_ys = (topk_inds // width).float()
topk_xs = (topk_inds % width).int().float()
topk_score, topk_ind = torch.topk(topk_scores.view(batch, -1), K)
topk_classes = (topk_ind // K).int()
topk_inds = _gather_feat(topk_inds.view(batch, -1, 1), topk_ind).view(batch, K)
topk_ys = _gather_feat(topk_ys.view(batch, -1, 1), topk_ind).view(batch, K)
topk_xs = _gather_feat(topk_xs.view(batch, -1, 1), topk_ind).view(batch, K)
return topk_score, topk_inds, topk_classes, topk_ys, topk_xs
def decode_bbox_from_heatmap(heatmap, rot_cos, rot_sin, center, center_z, dim,
point_cloud_range=None, voxel_size=None, feature_map_stride=None, vel=None, K=100,
circle_nms=False, score_thresh=None, post_center_limit_range=None):
batch_size, num_class, _, _ = heatmap.size()
if circle_nms:
# TODO: not checked yet
assert False, 'not checked yet'
heatmap = _nms(heatmap)
scores, inds, class_ids, ys, xs = _topk(heatmap, K=K)
center = _transpose_and_gather_feat(center, inds).view(batch_size, K, 2)
rot_sin = _transpose_and_gather_feat(rot_sin, inds).view(batch_size, K, 1)
rot_cos = _transpose_and_gather_feat(rot_cos, inds).view(batch_size, K, 1)
center_z = _transpose_and_gather_feat(center_z, inds).view(batch_size, K, 1)
dim = _transpose_and_gather_feat(dim, inds).view(batch_size, K, 3)
angle = torch.atan2(rot_sin, rot_cos)
xs = xs.view(batch_size, K, 1) + center[:, :, 0:1]
ys = ys.view(batch_size, K, 1) + center[:, :, 1:2]
xs = xs * feature_map_stride * voxel_size[0] + point_cloud_range[0]
ys = ys * feature_map_stride * voxel_size[1] + point_cloud_range[1]
box_part_list = [xs, ys, center_z, dim, angle]
if vel is not None:
vel = _transpose_and_gather_feat(vel, inds).view(batch_size, K, 2)
box_part_list.append(vel)
final_box_preds = torch.cat((box_part_list), dim=-1)
final_scores = scores.view(batch_size, K)
final_class_ids = class_ids.view(batch_size, K)
assert post_center_limit_range is not None
mask = (final_box_preds[..., :3] >= post_center_limit_range[:3]).all(2)
mask &= (final_box_preds[..., :3] <= post_center_limit_range[3:]).all(2)
if score_thresh is not None:
mask &= (final_scores > score_thresh)
ret_pred_dicts = []
for k in range(batch_size):
cur_mask = mask[k]
cur_boxes = final_box_preds[k, cur_mask]
cur_scores = final_scores[k, cur_mask]
cur_labels = final_class_ids[k, cur_mask]
if circle_nms:
assert False, 'not checked yet'
centers = cur_boxes[:, [0, 1]]
boxes = torch.cat((centers, scores.view(-1, 1)), dim=1)
keep = _circle_nms(boxes, min_radius=min_radius, post_max_size=nms_post_max_size)
cur_boxes = cur_boxes[keep]
cur_scores = cur_scores[keep]
cur_labels = cur_labels[keep]
ret_pred_dicts.append({
'pred_boxes': cur_boxes,
'pred_scores': cur_scores,
'pred_labels': cur_labels
})
return ret_pred_dicts
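# NOTE: the back-projection above inverts the target encoding:
#   x_world = (x_cell + x_offset) * feature_map_stride * voxel_size[0] + point_cloud_range[0]
# e.g. with hypothetical values voxel_size[0] = 0.1, stride = 8 and range start 0.0,
# a peak at cell 25 with predicted offset 0.3 decodes to (25 + 0.3) * 8 * 0.1 = 20.24 m.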
......@@ -61,6 +61,9 @@ class RoIHeadTemplate(nn.Module):
roi_labels: (B, num_rois)
"""
if batch_dict.get('rois', None) is not None:
return batch_dict
batch_size = batch_dict['batch_size']
batch_box_preds = batch_dict['batch_box_preds']
batch_cls_preds = batch_dict['batch_cls_preds']
......
......@@ -4,6 +4,7 @@ import pickle
import random
import shutil
import subprocess
import SharedArray
import numpy as np
import torch
......@@ -172,7 +173,7 @@ def init_dist_pytorch(tcp_port, local_rank, backend='nccl'):
return num_gpus, rank
def get_dist_info():
def get_dist_info(return_gpu_per_machine=False):
if torch.__version__ < '1.0':
initialized = dist._initialized
else:
......@@ -186,6 +187,11 @@ def get_dist_info():
else:
rank = 0
world_size = 1
if return_gpu_per_machine:
gpu_per_machine = torch.cuda.device_count()
return rank, world_size, gpu_per_machine
return rank, world_size
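# NOTE: existing callers are unaffected, e.g.
#   rank, world_size = get_dist_info()
#   rank, world_size, gpu_per_machine = get_dist_info(return_gpu_per_machine=True)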
......@@ -233,3 +239,9 @@ def generate_voxel2pinds(sparse_tensor):
return v2pinds_tensor
def sa_create(name, var):
x = SharedArray.create(name, var.shape, dtype=var.dtype)
x[...] = var[...]
x.flags.writeable = False
return x
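# NOTE: sa_create pairs with the SharedArray.attach()/SharedArray.delete() calls
# used by the dataset and sampler (Linux only, backed by /dev/shm), e.g.
#   x = sa_create("shm://demo", np.zeros((4, 3)))
#   y = SharedArray.attach("shm://demo")  # zero-copy view from any process
#   SharedArray.delete("shm://demo")      # unlink when done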
......@@ -259,3 +259,128 @@ def compute_fg_mask(gt_boxes2d, shape, downsample_factor=1, device=torch.device(
fg_mask[b, v1:v2, u1:u2] = True
return fg_mask
def neg_loss_cornernet(pred, gt, mask=None):
"""
Refer to https://github.com/tianweiy/CenterPoint.
Modified focal loss. Exactly the same as CornerNet. Runs faster and costs a little bit more memory
Args:
pred: (batch x c x h x w)
gt: (batch x c x h x w)
mask: (batch x h x w)
Returns:
"""
pos_inds = gt.eq(1).float()
neg_inds = gt.lt(1).float()
neg_weights = torch.pow(1 - gt, 4)
loss = 0
pos_loss = torch.log(pred) * torch.pow(1 - pred, 2) * pos_inds
neg_loss = torch.log(1 - pred) * torch.pow(pred, 2) * neg_weights * neg_inds
if mask is not None:
mask = mask[:, None, :, :].float()
pos_loss = pos_loss * mask
neg_loss = neg_loss * mask
num_pos = (pos_inds.float() * mask).sum()
else:
num_pos = pos_inds.float().sum()
pos_loss = pos_loss.sum()
neg_loss = neg_loss.sum()
if num_pos == 0:
loss = loss - neg_loss
else:
loss = loss - (pos_loss + neg_loss) / num_pos
return loss
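# NOTE: pred must already be sigmoid probabilities strictly inside (0, 1)
# (CenterHead.sigmoid clamps to [1e-4, 1 - 1e-4] for exactly this reason); a toy call:
#   pred = torch.full((1, 3, 8, 8), 0.1); gt = torch.zeros_like(pred); gt[0, 0, 4, 4] = 1
#   loss = neg_loss_cornernet(pred, gt)  # scalar, normalized by the positive count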
class FocalLossCenterNet(nn.Module):
"""
Refer to https://github.com/tianweiy/CenterPoint
"""
def __init__(self):
super(FocalLossCenterNet, self).__init__()
self.neg_loss = neg_loss_cornernet
def forward(self, out, target, mask=None):
return self.neg_loss(out, target, mask=mask)
def _reg_loss(regr, gt_regr, mask):
"""
Refer to https://github.com/tianweiy/CenterPoint
L1 regression loss
Args:
regr (batch x max_objects x dim)
gt_regr (batch x max_objects x dim)
mask (batch x max_objects)
Returns:
"""
num = mask.float().sum()
mask = mask.unsqueeze(2).expand_as(gt_regr).float()
isnotnan = (~ torch.isnan(gt_regr)).float()
mask *= isnotnan
regr = regr * mask
gt_regr = gt_regr * mask
loss = torch.abs(regr - gt_regr)
loss = loss.transpose(2, 0)
loss = torch.sum(loss, dim=2)
loss = torch.sum(loss, dim=1)
# else:
# # D x M x B
# loss = loss.reshape(loss.shape[0], -1)
# loss = loss / (num + 1e-4)
loss = loss / torch.clamp_min(num, min=1.0)
# import pdb; pdb.set_trace()
return loss
def _gather_feat(feat, ind, mask=None):
dim = feat.size(2)
ind = ind.unsqueeze(2).expand(ind.size(0), ind.size(1), dim)
feat = feat.gather(1, ind)
if mask is not None:
mask = mask.unsqueeze(2).expand_as(feat)
feat = feat[mask]
feat = feat.view(-1, dim)
return feat
def _transpose_and_gather_feat(feat, ind):
feat = feat.permute(0, 2, 3, 1).contiguous()
feat = feat.view(feat.size(0), -1, feat.size(3))
feat = _gather_feat(feat, ind)
return feat
class RegLossCenterNet(nn.Module):
"""
Refer to https://github.com/tianweiy/CenterPoint
"""
def __init__(self):
super(RegLossCenterNet, self).__init__()
def forward(self, output, mask, ind=None, target=None):
"""
Args:
output: (batch x dim x h x w) or (batch x max_objects)
mask: (batch x max_objects)
ind: (batch x max_objects)
target: (batch x max_objects x dim)
Returns:
"""
if ind is None:
pred = output
else:
pred = _transpose_and_gather_feat(output, ind)
loss = _reg_loss(pred, target, mask)
return loss
\ No newline at end of file