Support multi-modal 3D detection on NuScenes #1339

Add support for multi-modal NuScenes Detection

Support multi-modal 3D detection on NuScenes #1339
Add support for multi-modal NuScenes Detection
02ac3e17 · Shaoshuai Shi · GitHub · ad9c25c0 · fcfa0773 · 02ac3e17
Unverified Commit 02ac3e17 authored May 13, 2023 by Shaoshuai Shi Committed by GitHub May 13, 2023
20 changed files
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
 * `OpenPCDet` has been updated to `v0.6.0` (Sep. 2022).
 * The codes of PV-RCNN++ has been supported.
 * The codes of MPPNet has been supported. 
+* The multi-modal 3D detection approaches on Nuscenes have been supported. 

 ## Overview
 - [Changelog](#changelog)
@@ -22,10 +23,15 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18


 ## Changelog
-[2023-04-02] Added support for [`VoxelNeXt`](https://github.com/dvlab-research/VoxelNeXt) on Nuscenes, Waymo, and Argoverse2 datasets. It is a fully sparse 3D object detection network, which is a clean sparse CNNs network and predicts 3D objects directly upon voxels.
+[2023-05-13] **NEW:** Added support for the multi-modal 3D object detection models on Nuscenes dataset.  
+* Support multi-modal Nuscenes detection (See the [GETTING_STARTED.md](docs/GETTING_STARTED.md) to process data).
+* Support [TransFusion-Lidar](https://arxiv.org/abs/2203.11496) head, which ahcieves 69.43% NDS on Nuscenes validation dataset.
+* Support [`BEVFusion`](https://arxiv.org/abs/2205.13542), which fuses multi-modal information on BEV space and reaches 70.98% NDS on Nuscenes validation dataset. (see the [guideline](docs/guidelines_of_approaches/bevfusion.md) on how to train/test with BEVFusion).
+
+[2023-04-02] Added support for [`VoxelNeXt`](https://arxiv.org/abs/2303.11301) on Nuscenes, Waymo, and Argoverse2 datasets. It is a fully sparse 3D object detection network, which is a clean sparse CNNs network and predicts 3D objects directly upon voxels.

 [2022-09-02] **NEW:** Update `OpenPCDet` to v0.6.0:
-* Official code release of [MPPNet](https://arxiv.org/abs/2205.05979) for temporal 3D object detection, which supports long-term multi-frame 3D object detection and ranks 1st place on [3D detection learderboard](https://waymo.com/open/challenges/2020/3d-detection) of Waymo Open Dataset on Sept. 2th, 2022. For validation dataset, MPPNet achieves 74.96%, 75.06% and 74.52% for vehicle, pedestrian and cyclist classes in terms of mAPH@Level_2. (see the [guideline](docs/guidelines_of_approaches/mppnet.md) on how to train/test with MPPNet).
+* Official code release of [`MPPNet`](https://arxiv.org/abs/2205.05979) for temporal 3D object detection, which supports long-term multi-frame 3D object detection and ranks 1st place on [3D detection learderboard](https://waymo.com/open/challenges/2020/3d-detection) of Waymo Open Dataset on Sept. 2th, 2022. For validation dataset, MPPNet achieves 74.96%, 75.06% and 74.52% for vehicle, pedestrian and cyclist classes in terms of mAPH@Level_2. (see the [guideline](docs/guidelines_of_approaches/mppnet.md) on how to train/test with MPPNet).
 * Support multi-frame training/testing on Waymo Open Dataset (see the [change log](docs/changelog.md) for more details on how to process data).
 * Support to save changing training details (e.g., loss, iter, epoch) to file (previous tqdm progress bar is still supported by using `--use_tqdm_to_record`). Please use `pip install gpustat` if you also want to log the GPU related information.
 * Support to save latest model every 5 mintues, so you can restore the model training from latest status instead of previous epoch.   
@@ -38,10 +44,10 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18

 [2022-02-07] Added support for Centerpoint models on Nuscenes Dataset.

-[2022-01-14] Added support for dynamic pillar voxelization, following the implementation proposed in [H^23D R-CNN](https://arxiv.org/abs/2107.14391) with unique operation and [`torch_scatter`](https://github.com/rusty1s/pytorch_scatter) package.
+[2022-01-14] Added support for dynamic pillar voxelization, following the implementation proposed in [`H^23D R-CNN`](https://arxiv.org/abs/2107.14391) with unique operation and [`torch_scatter`](https://github.com/rusty1s/pytorch_scatter) package.

 [2022-01-05] **NEW:** Update `OpenPCDet` to v0.5.2:
-* The code of [PV-RCNN++](https://arxiv.org/abs/2102.00463) has been released to this repo, with higher performance, faster training/inference speed and less memory consumption than PV-RCNN.
+* The code of [`PV-RCNN++`](https://arxiv.org/abs/2102.00463) has been released to this repo, with higher performance, faster training/inference speed and less memory consumption than PV-RCNN.
 * Add performance of several models trained with full training set of [Waymo Open Dataset](#waymo-open-dataset-baselines).
 * Support Lyft dataset, see the pull request [here](https://github.com/open-mmlab/OpenPCDet/pull/720).

@@ -199,7 +205,7 @@ We could not provide the above pretrained models due to [Waymo Dataset License A
 but you could easily achieve similar performance by training with the default configs.

 ### NuScenes 3D Object Detection Baselines
-All models are trained with 8 GTX 1080Ti GPUs and are available for download.
+All models are trained with 8 GPUs and are available for download. For training BEVFusion, please refer to the [guideline](docs/guidelines_of_approaches/bevfusion.md).

 |                                                                                                    |   mATE |  mASE  |  mAOE  | mAVE  | mAAE  |  mAP  |  NDS   |                                              download                                              | 
 |----------------------------------------------------------------------------------------------------|-------:|:------:|:------:|:-----:|:-----:|:-----:|:------:|:--------------------------------------------------------------------------------------------------:|
@@ -209,7 +215,10 @@ All models are trained with 8 GTX 1080Ti GPUs and are available for download.
 | [CenterPoint (voxel_size=0.1)](tools/cfgs/nuscenes_models/cbgs_voxel01_res3d_centerpoint.yaml)     |  30.11 | 	25.55 | 	38.28 | 21.94 | 18.87 | 56.03 | 64.54  |  [model-34M](https://drive.google.com/file/d/1Cz-J1c3dw7JAWc25KRG1XQj8yCaOlexQ/view?usp=sharing)   |
 | [CenterPoint (voxel_size=0.075)](tools/cfgs/nuscenes_models/cbgs_voxel0075_res3d_centerpoint.yaml) |  28.80 | 	25.43 | 	37.27 | 21.55 | 18.24 | 59.22 | 66.48  |  [model-34M](https://drive.google.com/file/d/1XOHAWm1MPkCKr1gqmc3TWi5AYZgPsgxU/view?usp=sharing)   |
 | [VoxelNeXt (voxel_size=0.075)](tools/cfgs/nuscenes_models/cbgs_voxel0075_voxelnext.yaml)   |  30.11 | 	25.23 | 	40.57 | 21.69 | 18.56 | 60.53 | 66.65  | [model-31M](https://drive.google.com/file/d/1IV7e7G9X-61KXSjMGtQo579pzDNbhwvf/view?usp=share_link) |
+| [TransFusion-L*](tools/cfgs/nuscenes_models/transfusion_lidar.yaml)   |  27.96 | 	25.37 | 	29.35 | 27.31 | 18.55 | 64.58 | 69.43  | [model-32M](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link) |
+| [BEVFusion](tools/cfgs/nuscenes_models/bevfusion.yaml)   |  28.03 | 	25.43 | 	30.19 | 26.76 | 18.48 | 67.75 | 70.98  | [model-157M](https://drive.google.com/file/d/1X50b-8immqlqD8VPAUkSKI0Ls-4k37g9/view?usp=share_link) |

+*: Use the fade strategy, which disables data augmentations in the last several epochs during training.

 ### ONCE 3D Object Detection Baselines
 All models are trained with 8 GPUs.

--- a/docs/GETTING_STARTED.md
+++ b/docs/GETTING_STARTED.md
@@ -53,9 +53,16 @@ pip install nuscenes-devkit==1.0.5

 * Generate the data infos by running the following command (it may take several hours): 
 ```python 
+# for lidar-only setting
 python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval
+
+# for multi-modal setting
+python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
+    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
+    --version v1.0-trainval \
+    --with_cam
 ```

 ### Waymo Open Dataset

--- a/docs/guidelines_of_approaches/bevfusion.md
+++ b/docs/guidelines_of_approaches/bevfusion.md
+
+## Installation
+
+Please refer to [INSTALL.md](../INSTALL.md) for the installation of `OpenPCDet`.
+* We recommend the users to check the version of pillow and use pillow==8.4.0 to avoid bug in bev pooling.
+
+## Data Preparation
+Please refer to [GETTING_STARTED.md](../GETTING_STARTED.md) to process the multi-modal Nuscenes Dataset.
+
+## Training
+
+1.  Train the lidar branch for BEVFusion:
+```shell
+bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/transfusion_lidar.yaml \
+```
+The ckpt will be saved in ../output/nuscenes_models/transfusion_lidar/default/ckpt, or you can download pretrained checkpoint directly form [here](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link).
+
+2.  To train BEVFusion, you need to download pretrained parameters for image backbone [here](https://drive.google.com/file/d/1v74WCt4_5ubjO7PciA5T0xhQc9bz_jZu/view?usp=share_link), and specify the path in [config](../../tools/cfgs/nuscenes_models/bevfusion.yaml#L88). Then run the following command:
+```shell
+bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/bevfusion.yaml \
+--pretrained_model path_to_pretrained_lidar_branch_ckpt \
+```
+## Evaluation
+* Test with a pretrained model:
+```shell
+bash scripts/dist_test.sh ${NUM_GPUS} --cfg_file  cfgs/nuscenes_models/bevfusion.yaml \
+--ckpt ../output/cfgs/nuscenes_models/bevfusion/default/ckpt/checkpoint_epoch_6.pth
+```
+
+## Performance
+All models are trained with spconv 1.0, but you can directly load them for testing regardless of the spconv version.
+|                                                                                                    |   mATE |  mASE  |  mAOE  | mAVE  | mAAE  |  mAP  |  NDS   |                                              download                                              | 
+|----------------------------------------------------------------------------------------------------|-------:|:------:|:------:|:-----:|:-----:|:-----:|:------:|:--------------------------------------------------------------------------------------------------:|
+| [TransFusion-L](../../tools/cfgs/nuscenes_models/transfusion_lidar.yaml)   |  27.96 | 	25.37 | 	29.35 | 27.31 | 18.55 | 64.58 | 69.43  | [model-32M](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link) |
+| [BEVFusion](../../tools/cfgs/nuscenes_models/bevfusion.yaml)   |  28.03 | 	25.43 | 	30.19 | 26.76 | 18.48 | 67.75 | 70.98  | [model-157M](https://drive.google.com/file/d/1X50b-8immqlqD8VPAUkSKI0Ls-4k37g9/view?usp=share_link) |
--- a/pcdet/datasets/augmentor/data_augmentor.py
+++ b/pcdet/datasets/augmentor/data_augmentor.py
 from functools import partial

 import numpy as np
+from PIL import Image

 from ...utils import common_utils
 from . import augmentor_utils, database_sampler
@@ -23,6 +24,18 @@ class DataAugmentor(object):
            cur_augmentor = getattr(self, cur_cfg.NAME)(config=cur_cfg)
            self.data_augmentor_queue.append(cur_augmentor)

+    def disable_augmentation(self, augmentor_configs):
+        self.data_augmentor_queue = []
+        aug_config_list = augmentor_configs if isinstance(augmentor_configs, list) \
+            else augmentor_configs.AUG_CONFIG_LIST
+
+        for cur_cfg in aug_config_list:
+            if not isinstance(augmentor_configs, list):
+                if cur_cfg.NAME in augmentor_configs.DISABLE_AUG_LIST:
+                    continue
+            cur_augmentor = getattr(self, cur_cfg.NAME)(config=cur_cfg)
+            self.data_augmentor_queue.append(cur_augmentor)
+             
    def gt_sampling(self, config=None):
        db_sampler = database_sampler.DataBaseSampler(
            root_path=self.root_path,
@@ -139,6 +152,7 @@ class DataAugmentor(object):
        
        data_dict['gt_boxes'] = gt_boxes
        data_dict['points'] = points
+        data_dict['noise_translate'] = noise_translate
        return data_dict

    def random_local_translation(self, data_dict=None, config=None):
@@ -251,6 +265,28 @@ class DataAugmentor(object):
        data_dict['points'] = points
        return data_dict

+    def imgaug(self, data_dict=None, config=None):
+        if data_dict is None:
+            return partial(self.imgaug, config=config)
+        imgs = data_dict["camera_imgs"]
+        img_process_infos = data_dict['img_process_infos']
+        new_imgs = []
+        for img, img_process_info in zip(imgs, img_process_infos):
+            flip = False
+            if config.RAND_FLIP and np.random.choice([0, 1]):
+                flip = True
+            rotate = np.random.uniform(*config.ROT_LIM)
+            # aug images
+            if flip:
+                img = img.transpose(method=Image.FLIP_LEFT_RIGHT)
+            img = img.rotate(rotate)
+            img_process_info[2] = flip
+            img_process_info[3] = rotate
+            new_imgs.append(img)
+
+        data_dict["camera_imgs"] = new_imgs
+        return data_dict
+
    def forward(self, data_dict):
        """
        Args:

--- a/pcdet/datasets/dataset.py
+++ b/pcdet/datasets/dataset.py
@@ -2,6 +2,7 @@ from collections import defaultdict
 from pathlib import Path

 import numpy as np
+import torch
 import torch.utils.data as torch_data

 from ..utils import common_utils
@@ -130,6 +131,30 @@ class DatasetTemplate(torch_data.Dataset):
        """
        raise NotImplementedError

+    def set_lidar_aug_matrix(self, data_dict):
+        """
+            Get lidar augment matrix (4 x 4), which are used to recover orig point coordinates.
+        """
+        lidar_aug_matrix = np.eye(4)
+        if 'flip_y' in data_dict.keys():
+            flip_x = data_dict['flip_x']
+            flip_y = data_dict['flip_y']
+            if flip_x:
+                lidar_aug_matrix[:3,:3] = np.array([[1, 0, 0], [0, -1, 0], [0, 0, 1]]) @ lidar_aug_matrix[:3,:3]
+            if flip_y:
+                lidar_aug_matrix[:3,:3] = np.array([[-1, 0, 0], [0, 1, 0], [0, 0, 1]]) @ lidar_aug_matrix[:3,:3]
+        if 'noise_rot' in data_dict.keys():
+            noise_rot = data_dict['noise_rot']
+            lidar_aug_matrix[:3,:3] = common_utils.angle2matrix(torch.tensor(noise_rot)) @ lidar_aug_matrix[:3,:3]
+        if 'noise_scale' in data_dict.keys():
+            noise_scale = data_dict['noise_scale']
+            lidar_aug_matrix[:3,:3] *= noise_scale
+        if 'noise_translate' in data_dict.keys():
+            noise_translate = data_dict['noise_translate']
+            lidar_aug_matrix[:3,3:4] = noise_translate.T
+        data_dict['lidar_aug_matrix'] = lidar_aug_matrix
+        return data_dict
+
    def prepare_data(self, data_dict):
        """
        Args:
@@ -165,6 +190,7 @@ class DatasetTemplate(torch_data.Dataset):
            )
            if 'calib' in data_dict:
                data_dict['calib'] = calib
+        data_dict = self.set_lidar_aug_matrix(data_dict)
        if data_dict.get('gt_boxes', None) is not None:
            selected = common_utils.keep_arrays_by_name(data_dict['gt_names'], self.class_names)
            data_dict['gt_boxes'] = data_dict['gt_boxes'][selected]
@@ -287,6 +313,8 @@ class DatasetTemplate(torch_data.Dataset):
                                constant_values=pad_value)
                        points.append(points_pad)
                    ret[key] = np.stack(points, axis=0)
+                elif key in ['camera_imgs']:
+                    ret[key] = torch.stack([torch.stack(imgs,dim=0) for imgs in val],dim=0)
                else:
                    ret[key] = np.stack(val, axis=0)
            except:

--- a/pcdet/datasets/nuscenes/nuscenes_dataset.py
+++ b/pcdet/datasets/nuscenes/nuscenes_dataset.py
@@ -8,6 +8,8 @@ from tqdm import tqdm
 from ...ops.roiaware_pool3d import roiaware_pool3d_utils
 from ...utils import common_utils
 from ..dataset import DatasetTemplate
+from pyquaternion import Quaternion
+from PIL import Image


 class NuScenesDataset(DatasetTemplate):
@@ -17,6 +19,13 @@ class NuScenesDataset(DatasetTemplate):
            dataset_cfg=dataset_cfg, class_names=class_names, training=training, root_path=root_path, logger=logger
        )
        self.infos = []
+        self.camera_config = self.dataset_cfg.get('CAMERA_CONFIG', None)
+        if self.camera_config is not None:
+            self.use_camera = self.camera_config.get('USE_CAMERA', True)
+            self.camera_image_config = self.camera_config.IMAGE
+        else:
+            self.use_camera = False
+
        self.include_nuscenes_data(self.mode)
        if self.training and self.dataset_cfg.get('BALANCED_RESAMPLING', False):
            self.infos = self.balanced_infos_resampling(self.infos)
@@ -108,6 +117,98 @@ class NuScenesDataset(DatasetTemplate):
        points = np.concatenate((points, times), axis=1)
        return points

+    def crop_image(self, input_dict):
+        W, H = input_dict["ori_shape"]
+        imgs = input_dict["camera_imgs"]
+        img_process_infos = []
+        crop_images = []
+        for img in imgs:
+            if self.training == True:
+                fH, fW = self.camera_image_config.FINAL_DIM
+                resize_lim = self.camera_image_config.RESIZE_LIM_TRAIN
+                resize = np.random.uniform(*resize_lim)
+                resize_dims = (int(W * resize), int(H * resize))
+                newW, newH = resize_dims
+                crop_h = newH - fH
+                crop_w = int(np.random.uniform(0, max(0, newW - fW)))
+                crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
+            else:
+                fH, fW = self.camera_image_config.FINAL_DIM
+                resize_lim = self.camera_image_config.RESIZE_LIM_TEST
+                resize = np.mean(resize_lim)
+                resize_dims = (int(W * resize), int(H * resize))
+                newW, newH = resize_dims
+                crop_h = newH - fH
+                crop_w = int(max(0, newW - fW) / 2)
+                crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
+            
+            # reisze and crop image
+            img = img.resize(resize_dims)
+            img = img.crop(crop)
+            crop_images.append(img)
+            img_process_infos.append([resize, crop, False, 0])
+        
+        input_dict['img_process_infos'] = img_process_infos
+        input_dict['camera_imgs'] = crop_images
+        return input_dict
+    
+    def load_camera_info(self, input_dict, info):
+        input_dict["image_paths"] = []
+        input_dict["lidar2camera"] = []
+        input_dict["lidar2image"] = []
+        input_dict["camera2ego"] = []
+        input_dict["camera_intrinsics"] = []
+        input_dict["camera2lidar"] = []
+
+        for _, camera_info in info["cams"].items():
+            input_dict["image_paths"].append(camera_info["data_path"])
+
+            # lidar to camera transform
+            lidar2camera_r = np.linalg.inv(camera_info["sensor2lidar_rotation"])
+            lidar2camera_t = (
+                camera_info["sensor2lidar_translation"] @ lidar2camera_r.T
+            )
+            lidar2camera_rt = np.eye(4).astype(np.float32)
+            lidar2camera_rt[:3, :3] = lidar2camera_r.T
+            lidar2camera_rt[3, :3] = -lidar2camera_t
+            input_dict["lidar2camera"].append(lidar2camera_rt.T)
+
+            # camera intrinsics
+            camera_intrinsics = np.eye(4).astype(np.float32)
+            camera_intrinsics[:3, :3] = camera_info["camera_intrinsics"]
+            input_dict["camera_intrinsics"].append(camera_intrinsics)
+
+            # lidar to image transform
+            lidar2image = camera_intrinsics @ lidar2camera_rt.T
+            input_dict["lidar2image"].append(lidar2image)
+
+            # camera to ego transform
+            camera2ego = np.eye(4).astype(np.float32)
+            camera2ego[:3, :3] = Quaternion(
+                camera_info["sensor2ego_rotation"]
+            ).rotation_matrix
+            camera2ego[:3, 3] = camera_info["sensor2ego_translation"]
+            input_dict["camera2ego"].append(camera2ego)
+
+            # camera to lidar transform
+            camera2lidar = np.eye(4).astype(np.float32)
+            camera2lidar[:3, :3] = camera_info["sensor2lidar_rotation"]
+            camera2lidar[:3, 3] = camera_info["sensor2lidar_translation"]
+            input_dict["camera2lidar"].append(camera2lidar)
+        # read image
+        filename = input_dict["image_paths"]
+        images = []
+        for name in filename:
+            images.append(Image.open(str(self.root_path / name)))
+        
+        input_dict["camera_imgs"] = images
+        input_dict["ori_shape"] = images[0].size
+        
+        # resize and crop image
+        input_dict = self.crop_image(input_dict)
+
+        return input_dict
+
    def __len__(self):
        if self._merge_all_iters_to_one_epoch:
            return len(self.infos) * self.total_epochs
@@ -137,6 +238,8 @@ class NuScenesDataset(DatasetTemplate):
                'gt_names': info['gt_names'] if mask is None else info['gt_names'][mask],
                'gt_boxes': info['gt_boxes'] if mask is None else info['gt_boxes'][mask]
            })
+        if self.use_camera:
+            input_dict = self.load_camera_info(input_dict, info)

        data_dict = self.prepare_data(data_dict=input_dict)

@@ -251,7 +354,7 @@ class NuScenesDataset(DatasetTemplate):
            pickle.dump(all_db_infos, f)


-def create_nuscenes_info(version, data_path, save_path, max_sweeps=10):
+def create_nuscenes_info(version, data_path, save_path, max_sweeps=10, with_cam=False):
    from nuscenes.nuscenes import NuScenes
    from nuscenes.utils import splits
    from . import nuscenes_utils
@@ -283,7 +386,7 @@ def create_nuscenes_info(version, data_path, save_path, max_sweeps=10):

    train_nusc_infos, val_nusc_infos = nuscenes_utils.fill_trainval_infos(
        data_path=data_path, nusc=nusc, train_scenes=train_scenes, val_scenes=val_scenes,
-        test='test' in version, max_sweeps=max_sweeps
+        test='test' in version, max_sweeps=max_sweeps, with_cam=with_cam
    )

    if version == 'v1.0-test':
@@ -308,6 +411,7 @@ if __name__ == '__main__':
    parser.add_argument('--cfg_file', type=str, default=None, help='specify the config of dataset')
    parser.add_argument('--func', type=str, default='create_nuscenes_infos', help='')
    parser.add_argument('--version', type=str, default='v1.0-trainval', help='')
+    parser.add_argument('--with_cam', action='store_true', default=False, help='use camera or not')
    args = parser.parse_args()

    if args.func == 'create_nuscenes_infos':
@@ -319,6 +423,7 @@ if __name__ == '__main__':
            data_path=ROOT_DIR / 'data' / 'nuscenes',
            save_path=ROOT_DIR / 'data' / 'nuscenes',
            max_sweeps=dataset_cfg.MAX_SWEEPS,
+            with_cam=args.with_cam
        )

        nuscenes_dataset = NuScenesDataset(

--- a/pcdet/datasets/nuscenes/nuscenes_utils.py
+++ b/pcdet/datasets/nuscenes/nuscenes_utils.py
@@ -249,7 +249,67 @@ def quaternion_yaw(q: Quaternion) -> float:
    return yaw
    

-def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, max_sweeps=10):
+def obtain_sensor2top(
+    nusc, sensor_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, sensor_type="lidar"
+):
+    """Obtain the info with RT matric from general sensor to Top LiDAR.
+
+    Args:
+        nusc (class): Dataset class in the nuScenes dataset.
+        sensor_token (str): Sample data token corresponding to the
+            specific sensor type.
+        l2e_t (np.ndarray): Translation from lidar to ego in shape (1, 3).
+        l2e_r_mat (np.ndarray): Rotation matrix from lidar to ego
+            in shape (3, 3).
+        e2g_t (np.ndarray): Translation from ego to global in shape (1, 3).
+        e2g_r_mat (np.ndarray): Rotation matrix from ego to global
+            in shape (3, 3).
+        sensor_type (str): Sensor to calibrate. Default: 'lidar'.
+
+    Returns:
+        sweep (dict): Sweep information after transformation.
+    """
+    sd_rec = nusc.get("sample_data", sensor_token)
+    cs_record = nusc.get("calibrated_sensor", sd_rec["calibrated_sensor_token"])
+    pose_record = nusc.get("ego_pose", sd_rec["ego_pose_token"])
+    data_path = str(nusc.get_sample_data_path(sd_rec["token"]))
+    # if os.getcwd() in data_path:  # path from lyftdataset is absolute path
+    #     data_path = data_path.split(f"{os.getcwd()}/")[-1]  # relative path
+    sweep = {
+        "data_path": data_path,
+        "type": sensor_type,
+        "sample_data_token": sd_rec["token"],
+        "sensor2ego_translation": cs_record["translation"],
+        "sensor2ego_rotation": cs_record["rotation"],
+        "ego2global_translation": pose_record["translation"],
+        "ego2global_rotation": pose_record["rotation"],
+        "timestamp": sd_rec["timestamp"],
+    }
+    l2e_r_s = sweep["sensor2ego_rotation"]
+    l2e_t_s = sweep["sensor2ego_translation"]
+    e2g_r_s = sweep["ego2global_rotation"]
+    e2g_t_s = sweep["ego2global_translation"]
+
+    # obtain the RT from sensor to Top LiDAR
+    # sweep->ego->global->ego'->lidar
+    l2e_r_s_mat = Quaternion(l2e_r_s).rotation_matrix
+    e2g_r_s_mat = Quaternion(e2g_r_s).rotation_matrix
+    R = (l2e_r_s_mat.T @ e2g_r_s_mat.T) @ (
+        np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T
+    )
+    T = (l2e_t_s @ e2g_r_s_mat.T + e2g_t_s) @ (
+        np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T
+    )
+    T -= (
+        e2g_t @ (np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T)
+        + l2e_t @ np.linalg.inv(l2e_r_mat).T
+    ).squeeze(0)
+    sweep["sensor2lidar_rotation"] = R.T  # points @ R.T + T
+    sweep["sensor2lidar_translation"] = T
+    return sweep
+
+
+def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, max_sweeps=10, with_cam=False):
    train_nusc_infos = []
    val_nusc_infos = []
    progress_bar = tqdm.tqdm(total=len(nusc.sample), desc='create_info', dynamic_ncols=True)
@@ -291,6 +351,34 @@ def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, m
            'car_from_global': car_from_global,
            'timestamp': ref_time,
        }
+        if with_cam:
+            info['cams'] = dict()
+            l2e_r = ref_cs_rec["rotation"]
+            l2e_t = ref_cs_rec["translation"],
+            e2g_r = ref_pose_rec["rotation"]
+            e2g_t = ref_pose_rec["translation"]
+            l2e_r_mat = Quaternion(l2e_r).rotation_matrix
+            e2g_r_mat = Quaternion(e2g_r).rotation_matrix
+
+            # obtain 6 image's information per frame
+            camera_types = [
+                "CAM_FRONT",
+                "CAM_FRONT_RIGHT",
+                "CAM_FRONT_LEFT",
+                "CAM_BACK",
+                "CAM_BACK_LEFT",
+                "CAM_BACK_RIGHT",
+            ]
+            for cam in camera_types:
+                cam_token = sample["data"][cam]
+                cam_path, _, camera_intrinsics = nusc.get_sample_data(cam_token)
+                cam_info = obtain_sensor2top(
+                    nusc, cam_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, cam
+                )
+                cam_info['data_path'] = Path(cam_info['data_path']).relative_to(data_path).__str__()
+                cam_info.update(camera_intrinsics=camera_intrinsics)
+                info["cams"].update({cam: cam_info})
+        

        sample_data_token = sample['data'][chan]
        curr_sd_rec = nusc.get('sample_data', sample_data_token)

--- a/pcdet/datasets/processor/data_processor.py
+++ b/pcdet/datasets/processor/data_processor.py
@@ -2,7 +2,8 @@ from functools import partial

 import numpy as np
 from skimage import transform
-
+import torch
+import torchvision
 from ...utils import box_utils, common_utils

 tv = None
@@ -229,6 +230,56 @@ class DataProcessor(object):
        )
        return data_dict
    
+    def image_normalize(self, data_dict=None, config=None):
+        if data_dict is None:
+            return partial(self.image_normalize, config=config)
+        mean = config.mean
+        std = config.std
+        compose = torchvision.transforms.Compose(
+            [
+                torchvision.transforms.ToTensor(),
+                torchvision.transforms.Normalize(mean=mean, std=std),
+            ]
+        )
+        data_dict["camera_imgs"] = [compose(img) for img in data_dict["camera_imgs"]]
+        return data_dict
+    
+    def image_calibrate(self,data_dict=None, config=None):
+        if data_dict is None:
+            return partial(self.image_calibrate, config=config)
+        img_process_infos = data_dict['img_process_infos']
+        transforms = []
+        for img_process_info in img_process_infos:
+            resize, crop, flip, rotate = img_process_info
+
+            rotation = torch.eye(2)
+            translation = torch.zeros(2)
+            # post-homography transformation
+            rotation *= resize
+            translation -= torch.Tensor(crop[:2])
+            if flip:
+                A = torch.Tensor([[-1, 0], [0, 1]])
+                b = torch.Tensor([crop[2] - crop[0], 0])
+                rotation = A.matmul(rotation)
+                translation = A.matmul(translation) + b
+            theta = rotate / 180 * np.pi
+            A = torch.Tensor(
+                [
+                    [np.cos(theta), np.sin(theta)],
+                    [-np.sin(theta), np.cos(theta)],
+                ]
+            )
+            b = torch.Tensor([crop[2] - crop[0], crop[3] - crop[1]]) / 2
+            b = A.matmul(-b) + b
+            rotation = A.matmul(rotation)
+            translation = A.matmul(translation) + b
+            transform = torch.eye(4)
+            transform[:2, :2] = rotation
+            transform[:2, 3] = translation
+            transforms.append(transform.numpy())
+        data_dict["img_aug_matrix"] = transforms
+        return data_dict
+
    def forward(self, data_dict):
        """
        Args:

--- a/pcdet/models/__init__.py
+++ b/pcdet/models/__init__.py
@@ -22,9 +22,11 @@ def build_network(model_cfg, num_class, dataset):

 def load_data_to_gpu(batch_dict):
    for key, val in batch_dict.items():
-        if not isinstance(val, np.ndarray):
+        if key == 'camera_imgs':
+            batch_dict[key] = val.cuda()
+        elif not isinstance(val, np.ndarray):
            continue
-        elif key in ['frame_id', 'metadata', 'calib']:
+        elif key in ['frame_id', 'metadata', 'calib', 'image_paths','ori_shape','img_process_infos']:
            continue
        elif key in ['images']:
            batch_dict[key] = kornia.image_to_tensor(val).float().cuda().contiguous()

--- a/pcdet/models/backbones_2d/base_bev_backbone.py
+++ b/pcdet/models/backbones_2d/base_bev_backbone.py
@@ -46,7 +46,7 @@ class BaseBEVBackbone(nn.Module):
            self.blocks.append(nn.Sequential(*cur_layers))
            if len(upsample_strides) > 0:
                stride = upsample_strides[idx]
-                if stride >= 1:
+                if stride > 1 or (stride == 1 and not self.model_cfg.get('USE_CONV_FOR_NO_STRIDE', False)):
                    self.deblocks.append(nn.Sequential(
                        nn.ConvTranspose2d(
                            num_filters[idx], num_upsample_filters[idx],

--- a/pcdet/models/backbones_2d/fuser/__init__.py
+++ b/pcdet/models/backbones_2d/fuser/__init__.py
+from .convfuser import ConvFuser
+__all__ = {
+    'ConvFuser':ConvFuser
+}
\ No newline at end of file
--- a/pcdet/models/backbones_2d/fuser/convfuser.py
+++ b/pcdet/models/backbones_2d/fuser/convfuser.py
+import torch
+from torch import nn
+
+
+class ConvFuser(nn.Module):
+    def __init__(self,model_cfg) -> None:
+        super().__init__()
+        self.model_cfg = model_cfg
+        in_channel = self.model_cfg.IN_CHANNEL
+        out_channel = self.model_cfg.OUT_CHANNEL
+        self.conv = nn.Sequential(
+            nn.Conv2d(in_channel, out_channel, 3, padding=1, bias=False),
+            nn.BatchNorm2d(out_channel),
+            nn.ReLU(True)
+            )
+        
+    def forward(self,batch_dict):
+        """
+        Args:
+            batch_dict:
+                spatial_features_img (tensor): Bev features from image modality
+                spatial_features (tensor): Bev features from lidar modality
+
+        Returns:
+            batch_dict:
+                spatial_features (tensor): Bev features after muli-modal fusion
+        """
+        img_bev = batch_dict['spatial_features_img']
+        lidar_bev = batch_dict['spatial_features']
+        cat_bev = torch.cat([img_bev,lidar_bev],dim=1)
+        mm_bev = self.conv(cat_bev)
+        batch_dict['spatial_features'] = mm_bev
+        return batch_dict
\ No newline at end of file
--- a/pcdet/models/backbones_3d/spconv_backbone.py
+++ b/pcdet/models/backbones_3d/spconv_backbone.py
@@ -30,10 +30,11 @@ def post_act_block(in_channels, out_channels, kernel_size, indice_key=None, stri
 class SparseBasicBlock(spconv.SparseModule):
    expansion = 1

-    def __init__(self, inplanes, planes, stride=1, norm_fn=None, downsample=None, indice_key=None):
+    def __init__(self, inplanes, planes, stride=1, bias=None, norm_fn=None, downsample=None, indice_key=None):
        super(SparseBasicBlock, self).__init__()

        assert norm_fn is not None
+        if bias is None:
            bias = norm_fn is not None
        self.conv1 = spconv.SubMConv3d(
            inplanes, planes, kernel_size=3, stride=stride, padding=1, bias=bias, indice_key=indice_key
@@ -184,6 +185,7 @@ class VoxelResBackBone8x(nn.Module):
    def __init__(self, model_cfg, input_channels, grid_size, **kwargs):
        super().__init__()
        self.model_cfg = model_cfg
+        use_bias = self.model_cfg.get('USE_BIAS', None)
        norm_fn = partial(nn.BatchNorm1d, eps=1e-3, momentum=0.01)

        self.sparse_shape = grid_size[::-1] + [1, 0, 0]
@@ -196,29 +198,29 @@ class VoxelResBackBone8x(nn.Module):
        block = post_act_block

        self.conv1 = spconv.SparseSequential(
-            SparseBasicBlock(16, 16, norm_fn=norm_fn, indice_key='res1'),
-            SparseBasicBlock(16, 16, norm_fn=norm_fn, indice_key='res1'),
+            SparseBasicBlock(16, 16, bias=use_bias, norm_fn=norm_fn, indice_key='res1'),
+            SparseBasicBlock(16, 16, bias=use_bias, norm_fn=norm_fn, indice_key='res1'),
        )

        self.conv2 = spconv.SparseSequential(
            # [1600, 1408, 41] <- [800, 704, 21]
            block(16, 32, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv2', conv_type='spconv'),
-            SparseBasicBlock(32, 32, norm_fn=norm_fn, indice_key='res2'),
-            SparseBasicBlock(32, 32, norm_fn=norm_fn, indice_key='res2'),
+            SparseBasicBlock(32, 32, bias=use_bias, norm_fn=norm_fn, indice_key='res2'),
+            SparseBasicBlock(32, 32, bias=use_bias, norm_fn=norm_fn, indice_key='res2'),
        )

        self.conv3 = spconv.SparseSequential(
            # [800, 704, 21] <- [400, 352, 11]
            block(32, 64, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv3', conv_type='spconv'),
-            SparseBasicBlock(64, 64, norm_fn=norm_fn, indice_key='res3'),
-            SparseBasicBlock(64, 64, norm_fn=norm_fn, indice_key='res3'),
+            SparseBasicBlock(64, 64, bias=use_bias, norm_fn=norm_fn, indice_key='res3'),
+            SparseBasicBlock(64, 64, bias=use_bias, norm_fn=norm_fn, indice_key='res3'),
        )

        self.conv4 = spconv.SparseSequential(
            # [400, 352, 11] <- [200, 176, 5]
            block(64, 128, 3, norm_fn=norm_fn, stride=2, padding=(0, 1, 1), indice_key='spconv4', conv_type='spconv'),
-            SparseBasicBlock(128, 128, norm_fn=norm_fn, indice_key='res4'),
-            SparseBasicBlock(128, 128, norm_fn=norm_fn, indice_key='res4'),
+            SparseBasicBlock(128, 128, bias=use_bias, norm_fn=norm_fn, indice_key='res4'),
+            SparseBasicBlock(128, 128, bias=use_bias, norm_fn=norm_fn, indice_key='res4'),
        )

        last_pad = 0

--- a/pcdet/models/backbones_image/__init__.py
+++ b/pcdet/models/backbones_image/__init__.py
+from .swin import SwinTransformer
+__all__ = {
+    'SwinTransformer':SwinTransformer,
+}
\ No newline at end of file
--- a/pcdet/models/backbones_image/img_neck/__init__.py
+++ b/pcdet/models/backbones_image/img_neck/__init__.py
+from .generalized_lss import GeneralizedLSSFPN
+__all__ = {
+    'GeneralizedLSSFPN':GeneralizedLSSFPN,
+}
\ No newline at end of file
--- a/pcdet/models/backbones_image/img_neck/generalized_lss.py
+++ b/pcdet/models/backbones_image/img_neck/generalized_lss.py
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from ...model_utils.basic_block_2d import BasicBlock2D
+
+
+class GeneralizedLSSFPN(nn.Module):
+    """
+        This module implements FPN, which creates pyramid features built on top of some input feature maps.
+        This code is adapted from https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/necks/fpn.py with minimal modifications.
+    """
+    def __init__(self, model_cfg):
+        super().__init__()
+        self.model_cfg = model_cfg
+        in_channels =  self.model_cfg.IN_CHANNELS
+        out_channels = self.model_cfg.OUT_CHANNELS
+        num_ins = len(in_channels)
+        num_outs = self.model_cfg.NUM_OUTS
+        start_level = self.model_cfg.START_LEVEL
+        end_level = self.model_cfg.END_LEVEL
+
+        self.in_channels = in_channels
+
+        if end_level == -1:
+            self.backbone_end_level = num_ins - 1
+        else:
+            self.backbone_end_level = end_level
+            assert end_level <= len(in_channels)
+            assert num_outs == end_level - start_level
+        self.start_level = start_level
+        self.end_level = end_level
+
+        self.lateral_convs = nn.ModuleList()
+        self.fpn_convs = nn.ModuleList()
+
+        for i in range(self.start_level, self.backbone_end_level):
+            l_conv = BasicBlock2D(
+                in_channels[i] + (in_channels[i + 1] if i == self.backbone_end_level - 1 else out_channels),
+                out_channels, kernel_size=1, bias = False
+            )
+            fpn_conv = BasicBlock2D(out_channels,out_channels, kernel_size=3, padding=1, bias = False)
+            self.lateral_convs.append(l_conv)
+            self.fpn_convs.append(fpn_conv)
+
+    def forward(self, batch_dict):
+        """
+        Args:
+            batch_dict:
+                image_features (list[tensor]): Multi-stage features from image backbone.
+        Returns:
+            batch_dict:
+                image_fpn (list(tensor)): FPN features.
+        """
+        # upsample -> cat -> conv1x1 -> conv3x3
+        inputs = batch_dict['image_features']
+        assert len(inputs) == len(self.in_channels)
+
+        # build laterals
+        laterals = [inputs[i + self.start_level] for i in range(len(inputs))]
+
+        # build top-down path
+        used_backbone_levels = len(laterals) - 1
+        for i in range(used_backbone_levels - 1, -1, -1):
+            x = F.interpolate(
+                laterals[i + 1],
+                size=laterals[i].shape[2:],
+                mode='bilinear', align_corners=False,
+            )
+            laterals[i] = torch.cat([laterals[i], x], dim=1)
+            laterals[i] = self.lateral_convs[i](laterals[i])
+            laterals[i] = self.fpn_convs[i](laterals[i])
+
+        # build outputs
+        outs = [laterals[i] for i in range(used_backbone_levels)]
+        batch_dict['image_fpn'] = tuple(outs)
+        return batch_dict
--- a/pcdet/models/backbones_image/swin.py
+++ b/pcdet/models/backbones_image/swin.py
+# Copyright (c) OpenMMLab. All rights reserved.
+"""
+Mostly copy-paste from 
+    https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/backbones/swin.py 
+    
+"""
+
+import warnings
+from collections import OrderedDict
+from copy import deepcopy
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.utils.checkpoint as cp
+
+from ..model_utils.swin_utils import swin_converter
+from ..model_utils.swin_utils import PatchEmbed, PatchMerging
+from ..model_utils.swin_utils import FFN, DropPath, to_2tuple, trunc_normal_, trunc_normal_init, constant_init
+
+
+class WindowMSA(nn.Module):
+    """Window based multi-head self-attention (W-MSA) module with relative
+    position bias.
+
+    Args:
+        embed_dims (int): Number of input channels.
+        num_heads (int): Number of attention heads.
+        window_size (tuple[int]): The height and width of the window.
+        qkv_bias (bool, optional):  If True, add a learnable bias to q, k, v.
+            Default: True.
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set. Default: None.
+        attn_drop_rate (float, optional): Dropout ratio of attention weight.
+            Default: 0.0
+        proj_drop_rate (float, optional): Dropout ratio of output. Default: 0.
+    """
+
+    def __init__(self,
+                 embed_dims,
+                 num_heads,
+                 window_size,
+                 qkv_bias=True,
+                 qk_scale=None,
+                 attn_drop_rate=0.,
+                 proj_drop_rate=0.):
+
+        super().__init__()
+        self._is_init = False
+
+        self.embed_dims = embed_dims
+        self.window_size = window_size  # Wh, Ww
+        self.num_heads = num_heads
+        head_embed_dims = embed_dims // num_heads
+        self.scale = qk_scale or head_embed_dims**-0.5
+
+        # define a parameter table of relative position bias
+        self.relative_position_bias_table = nn.Parameter(
+            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1),
+                        num_heads))  # 2*Wh-1 * 2*Ww-1, nH
+
+        # About 2x faster than original impl
+        Wh, Ww = self.window_size
+        rel_index_coords = self.double_step_seq(2 * Ww - 1, Wh, 1, Ww)
+        rel_position_index = rel_index_coords + rel_index_coords.T
+        rel_position_index = rel_position_index.flip(1).contiguous()
+        self.register_buffer('relative_position_index', rel_position_index)
+
+        self.qkv = nn.Linear(embed_dims, embed_dims * 3, bias=qkv_bias)
+        self.attn_drop = nn.Dropout(attn_drop_rate)
+        self.proj = nn.Linear(embed_dims, embed_dims)
+        self.proj_drop = nn.Dropout(proj_drop_rate)
+
+        self.softmax = nn.Softmax(dim=-1)
+
+    def init_weights(self):
+        trunc_normal_(self.relative_position_bias_table, std=0.02)
+
+    def forward(self, x, mask=None):
+        """
+        Args:
+
+            x (tensor): input features with shape of (num_windows*B, N, C)
+            mask (tensor | None, Optional): mask with shape of (num_windows,
+                Wh*Ww, Wh*Ww), value should be between (-inf, 0].
+        """
+        B, N, C = x.shape
+        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads,
+                                  C // self.num_heads).permute(2, 0, 3, 1, 4)
+        # make torchscript happy (cannot use tensor as tuple)
+        q, k, v = qkv[0], qkv[1], qkv[2]
+
+        q = q * self.scale
+        attn = (q @ k.transpose(-2, -1))
+
+        relative_position_bias = self.relative_position_bias_table[
+            self.relative_position_index.view(-1)].view(
+                self.window_size[0] * self.window_size[1],
+                self.window_size[0] * self.window_size[1],
+                -1)  # Wh*Ww,Wh*Ww,nH
+        relative_position_bias = relative_position_bias.permute(
+            2, 0, 1).contiguous()  # nH, Wh*Ww, Wh*Ww
+        attn = attn + relative_position_bias.unsqueeze(0)
+
+        if mask is not None:
+            nW = mask.shape[0]
+            attn = attn.view(B // nW, nW, self.num_heads, N,
+                             N) + mask.unsqueeze(1).unsqueeze(0)
+            attn = attn.view(-1, self.num_heads, N, N)
+        attn = self.softmax(attn)
+
+        attn = self.attn_drop(attn)
+
+        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
+        x = self.proj(x)
+        x = self.proj_drop(x)
+        return x
+
+    @staticmethod
+    def double_step_seq(step1, len1, step2, len2):
+        seq1 = torch.arange(0, step1 * len1, step1)
+        seq2 = torch.arange(0, step2 * len2, step2)
+        return (seq1[:, None] + seq2[None, :]).reshape(1, -1)
+
+
+class ShiftWindowMSA(nn.Module):
+    """Shifted Window Multihead Self-Attention Module.
+
+    Args:
+        embed_dims (int): Number of input channels.
+        num_heads (int): Number of attention heads.
+        window_size (int): The height and width of the window.
+        shift_size (int, optional): The shift step of each window towards
+            right-bottom. If zero, act as regular window-msa. Defaults to 0.
+        qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
+            Default: True
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set. Defaults: None.
+        attn_drop_rate (float, optional): Dropout ratio of attention weight.
+            Defaults: 0.
+        proj_drop_rate (float, optional): Dropout ratio of output.
+            Defaults: 0.
+        dropout_layer (dict, optional): The dropout_layer used before output.
+            Defaults: dict(type='DropPath', drop_prob=0.).
+    """
+
+    def __init__(self,
+                 embed_dims,
+                 num_heads,
+                 window_size,
+                 shift_size=0,
+                 qkv_bias=True,
+                 qk_scale=None,
+                 attn_drop_rate=0,
+                 proj_drop_rate=0,
+                 dropout_layer=dict(type='DropPath', drop_prob=0.)):
+        super().__init__()
+        self._is_init = False
+
+        self.window_size = window_size
+        self.shift_size = shift_size
+        assert 0 <= self.shift_size < self.window_size
+
+        self.w_msa = WindowMSA(
+            embed_dims=embed_dims,
+            num_heads=num_heads,
+            window_size=to_2tuple(window_size),
+            qkv_bias=qkv_bias,
+            qk_scale=qk_scale,
+            attn_drop_rate=attn_drop_rate,
+            proj_drop_rate=proj_drop_rate,)
+        self.drop = DropPath(dropout_layer['drop_prob'])
+
+    def forward(self, query, hw_shape):
+        B, L, C = query.shape
+        H, W = hw_shape
+        assert L == H * W, 'input feature has wrong size'
+        query = query.view(B, H, W, C)
+
+        # pad feature maps to multiples of window size
+        pad_r = (self.window_size - W % self.window_size) % self.window_size
+        pad_b = (self.window_size - H % self.window_size) % self.window_size
+        query = F.pad(query, (0, 0, 0, pad_r, 0, pad_b))
+        H_pad, W_pad = query.shape[1], query.shape[2]
+
+        # cyclic shift
+        if self.shift_size > 0:
+            shifted_query = torch.roll(
+                query,
+                shifts=(-self.shift_size, -self.shift_size),
+                dims=(1, 2))
+
+            # calculate attention mask for SW-MSA
+            img_mask = torch.zeros((1, H_pad, W_pad, 1), device=query.device)
+            h_slices = (slice(0, -self.window_size),
+                        slice(-self.window_size,
+                              -self.shift_size), slice(-self.shift_size, None))
+            w_slices = (slice(0, -self.window_size),
+                        slice(-self.window_size,
+                              -self.shift_size), slice(-self.shift_size, None))
+            cnt = 0
+            for h in h_slices:
+                for w in w_slices:
+                    img_mask[:, h, w, :] = cnt
+                    cnt += 1
+
+            # nW, window_size, window_size, 1
+            mask_windows = self.window_partition(img_mask)
+            mask_windows = mask_windows.view(
+                -1, self.window_size * self.window_size)
+            attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
+            attn_mask = attn_mask.masked_fill(attn_mask != 0,
+                                              float(-100.0)).masked_fill(
+                                                  attn_mask == 0, float(0.0))
+        else:
+            shifted_query = query
+            attn_mask = None
+
+        # nW*B, window_size, window_size, C
+        query_windows = self.window_partition(shifted_query)
+        # nW*B, window_size*window_size, C
+        query_windows = query_windows.view(-1, self.window_size**2, C)
+
+        # W-MSA/SW-MSA (nW*B, window_size*window_size, C)
+        attn_windows = self.w_msa(query_windows, mask=attn_mask)
+
+        # merge windows
+        attn_windows = attn_windows.view(-1, self.window_size,
+                                         self.window_size, C)
+
+        # B H' W' C
+        shifted_x = self.window_reverse(attn_windows, H_pad, W_pad)
+        # reverse cyclic shift
+        if self.shift_size > 0:
+            x = torch.roll(
+                shifted_x,
+                shifts=(self.shift_size, self.shift_size),
+                dims=(1, 2))
+        else:
+            x = shifted_x
+
+        if pad_r > 0 or pad_b:
+            x = x[:, :H, :W, :].contiguous()
+
+        x = x.view(B, H * W, C)
+
+        x = self.drop(x)
+        return x
+
+    def window_reverse(self, windows, H, W):
+        """
+        Args:
+            windows: (num_windows*B, window_size, window_size, C)
+            H (int): Height of image
+            W (int): Width of image
+        Returns:
+            x: (B, H, W, C)
+        """
+        window_size = self.window_size
+        B = int(windows.shape[0] / (H * W / window_size / window_size))
+        x = windows.view(B, H // window_size, W // window_size, window_size,
+                         window_size, -1)
+        x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
+        return x
+
+    def window_partition(self, x):
+        """
+        Args:
+            x: (B, H, W, C)
+        Returns:
+            windows: (num_windows*B, window_size, window_size, C)
+        """
+        B, H, W, C = x.shape
+        window_size = self.window_size
+        x = x.view(B, H // window_size, window_size, W // window_size,
+                   window_size, C)
+        windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
+        windows = windows.view(-1, window_size, window_size, C)
+        return windows
+
+
+class SwinBlock(nn.Module):
+    """"
+    Args:
+        embed_dims (int): The feature dimension.
+        num_heads (int): Parallel attention heads.
+        feedforward_channels (int): The hidden dimension for FFNs.
+        window_size (int, optional): The local window scale. Default: 7.
+        shift (bool, optional): whether to shift window or not. Default False.
+        qkv_bias (bool, optional): enable bias for qkv if True. Default: True.
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set. Default: None.
+        drop_rate (float, optional): Dropout rate. Default: 0.
+        attn_drop_rate (float, optional): Attention dropout rate. Default: 0.
+        drop_path_rate (float, optional): Stochastic depth rate. Default: 0.
+        act_cfg (dict, optional): The config dict of activation function.
+            Default: dict(type='GELU').
+        norm_cfg (dict, optional): The config dict of normalization.
+            Default: dict(type='LN').
+        with_cp (bool, optional): Use checkpoint or not. Using checkpoint
+            will save some memory while slowing down the training speed.
+            Default: False.
+    """
+
+    def __init__(self,
+                 embed_dims,
+                 num_heads,
+                 feedforward_channels,
+                 window_size=7,
+                 shift=False,
+                 qkv_bias=True,
+                 qk_scale=None,
+                 drop_rate=0.,
+                 attn_drop_rate=0.,
+                 drop_path_rate=0.,
+                 act_cfg=dict(type='GELU'),
+                 norm_cfg=dict(type='LN'),
+                 with_cp=False,):
+        super(SwinBlock, self).__init__()
+        self._is_init = False
+
+        self.with_cp = with_cp
+
+        self.norm1 = nn.LayerNorm(embed_dims)
+        self.attn = ShiftWindowMSA(
+            embed_dims=embed_dims,
+            num_heads=num_heads,
+            window_size=window_size,
+            shift_size=window_size // 2 if shift else 0,
+            qkv_bias=qkv_bias,
+            qk_scale=qk_scale,
+            attn_drop_rate=attn_drop_rate,
+            proj_drop_rate=drop_rate,
+            dropout_layer=dict(type='DropPath', drop_prob=drop_path_rate),)
+
+        self.norm2 = nn.LayerNorm(embed_dims)
+        self.ffn = FFN(
+            embed_dims=embed_dims,
+            feedforward_channels=feedforward_channels,
+            num_fcs=2,
+            ffn_drop=drop_rate,
+            dropout_layer=dict(type='DropPath', drop_prob=drop_path_rate),
+            act_cfg=act_cfg,
+            add_identity=True,)
+
+    def forward(self, x, hw_shape):
+
+        def _inner_forward(x):
+            identity = x
+            x = self.norm1(x)
+            x = self.attn(x, hw_shape)
+
+            x = x + identity
+
+            identity = x
+            x = self.norm2(x)
+            x = self.ffn(x, identity=identity)
+
+            return x
+
+        if self.with_cp and x.requires_grad:
+            x = cp.checkpoint(_inner_forward, x)
+        else:
+            x = _inner_forward(x)
+
+        return x
+
+
+class SwinBlockSequence(nn.Module):
+    """Implements one stage in Swin Transformer.
+
+    Args:
+        embed_dims (int): The feature dimension.
+        num_heads (int): Parallel attention heads.
+        feedforward_channels (int): The hidden dimension for FFNs.
+        depth (int): The number of blocks in this stage.
+        window_size (int, optional): The local window scale. Default: 7.
+        qkv_bias (bool, optional): enable bias for qkv if True. Default: True.
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set. Default: None.
+        drop_rate (float, optional): Dropout rate. Default: 0.
+        attn_drop_rate (float, optional): Attention dropout rate. Default: 0.
+        drop_path_rate (float | list[float], optional): Stochastic depth
+            rate. Default: 0.
+        downsample (BaseModule | None, optional): The downsample operation
+            module. Default: None.
+        act_cfg (dict, optional): The config dict of activation function.
+            Default: dict(type='GELU').
+        norm_cfg (dict, optional): The config dict of normalization.
+            Default: dict(type='LN').
+        with_cp (bool, optional): Use checkpoint or not. Using checkpoint
+            will save some memory while slowing down the training speed.
+            Default: False.
+    """
+
+    def __init__(self,
+                 embed_dims,
+                 num_heads,
+                 feedforward_channels,
+                 depth,
+                 window_size=7,
+                 qkv_bias=True,
+                 qk_scale=None,
+                 drop_rate=0.,
+                 attn_drop_rate=0.,
+                 drop_path_rate=0.,
+                 downsample=None,
+                 act_cfg=dict(type='GELU'),
+                 norm_cfg=dict(type='LN'),
+                 with_cp=False):
+        super().__init__()
+        self._is_init = False
+
+        if isinstance(drop_path_rate, list):
+            drop_path_rates = drop_path_rate
+            assert len(drop_path_rates) == depth
+        else:
+            drop_path_rates = [deepcopy(drop_path_rate) for _ in range(depth)]
+
+        self.blocks = nn.ModuleList()
+        for i in range(depth):
+            block = SwinBlock(
+                embed_dims=embed_dims,
+                num_heads=num_heads,
+                feedforward_channels=feedforward_channels,
+                window_size=window_size,
+                shift=False if i % 2 == 0 else True,
+                qkv_bias=qkv_bias,
+                qk_scale=qk_scale,
+                drop_rate=drop_rate,
+                attn_drop_rate=attn_drop_rate,
+                drop_path_rate=drop_path_rates[i],
+                act_cfg=act_cfg,
+                norm_cfg=norm_cfg,
+                with_cp=with_cp,)
+            self.blocks.append(block)
+
+        self.downsample = downsample
+
+    def forward(self, x, hw_shape):
+        for block in self.blocks:
+            x = block(x, hw_shape)
+
+        if self.downsample:
+            x_down, down_hw_shape = self.downsample(x, hw_shape)
+            return x_down, down_hw_shape, x, hw_shape
+        else:
+            return x, hw_shape, x, hw_shape
+
+
+class SwinTransformer(nn.Module):
+    """ Swin Transformer
+    A PyTorch implement of : `Swin Transformer:
+    Hierarchical Vision Transformer using Shifted Windows`  -
+        https://arxiv.org/abs/2103.14030
+
+    This code is adapted from https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/backbones/swin.py 
+    with minimal modifications.
+
+    Args:
+        pretrain_img_size (int | tuple[int]): The size of input image when
+            pretrain. Defaults: 224.
+        in_channels (int): The num of input channels.
+            Defaults: 3.
+        embed_dims (int): The feature dimension. Default: 96.
+        patch_size (int | tuple[int]): Patch size. Default: 4.
+        window_size (int): Window size. Default: 7.
+        mlp_ratio (int): Ratio of mlp hidden dim to embedding dim.
+            Default: 4.
+        depths (tuple[int]): Depths of each Swin Transformer stage.
+            Default: (2, 2, 6, 2).
+        num_heads (tuple[int]): Parallel attention heads of each Swin
+            Transformer stage. Default: (3, 6, 12, 24).
+        strides (tuple[int]): The patch merging or patch embedding stride of
+            each Swin Transformer stage. (In swin, we set kernel size equal to
+            stride.) Default: (4, 2, 2, 2).
+        out_indices (tuple[int]): Output from which stages.
+            Default: (0, 1, 2, 3).
+        qkv_bias (bool, optional): If True, add a learnable bias to query, key,
+            value. Default: True
+        qk_scale (float | None, optional): Override default qk scale of
+            head_dim ** -0.5 if set. Default: None.
+        patch_norm (bool): If add a norm layer for patch embed and patch
+            merging. Default: True.
+        drop_rate (float): Dropout rate. Defaults: 0.
+        attn_drop_rate (float): Attention dropout rate. Default: 0.
+        drop_path_rate (float): Stochastic depth rate. Defaults: 0.1.
+        use_abs_pos_embed (bool): If True, add absolute position embedding to
+            the patch embedding. Defaults: False.
+        act_cfg (dict): Config dict for activation layer.
+            Default: dict(type='GELU').
+        norm_cfg (dict): Config dict for normalization layer at
+            output of backone. Defaults: dict(type='LN').
+        with_cp (bool, optional): Use checkpoint or not. Using checkpoint
+            will save some memory while slowing down the training speed.
+            Default: False.
+        pretrained (str, optional): model pretrained path. Default: None.
+        convert_weights (bool): The flag indicates whether the
+            pre-trained model is from the original repo. We may need
+            to convert some keys to make it compatible.
+            Default: False.
+        frozen_stages (int): Stages to be frozen (stop grad and set eval mode).
+            Default: -1 (-1 means not freezing any parameters).
+        init_cfg (dict, optional): The Config for initialization.
+            Defaults to None.
+    """
+
+    def __init__(self, model_cfg):
+        
+        self.model_cfg = model_cfg
+        pretrain_img_size = self.model_cfg.get('PRETRAIN_IMG_SIZE', 224)
+        init_cfg = self.model_cfg.get('INIT_CFG', None)
+        depths = self.model_cfg.DEPTHS
+        in_channels = self.model_cfg.get('IN_CHANNELS', 3)
+        strides = self.model_cfg.get('STRIDES', (4, 2, 2, 2))
+        patch_size = self.model_cfg.get('PATCH_SIZE', 4)
+        embed_dims = self.model_cfg.EMBED_DIMS
+        num_heads = self.model_cfg.NUM_HEADS
+        window_size = self.model_cfg.WINDOW_SIZE
+        mlp_ratio = self.model_cfg.MLP_RATIO
+        qkv_bias = self.model_cfg.get('QKV_BIAS', True)
+        qk_scale = self.model_cfg.get('QK_SCALE', None)
+        drop_rate = self.model_cfg.DROP_RATE
+        attn_drop_rate = self.model_cfg.ATTN_DROP_RATE
+        drop_path_rate = self.model_cfg.DROP_PATH_RATE
+        patch_norm = self.model_cfg.get('PATCH_NORM', True)
+        out_indices = self.model_cfg.get('OUT_INDICES', [0, 1, 2, 3])
+        with_cp = self.model_cfg.get('WITH_CP', False)
+        use_abs_pos_embed = self.model_cfg.get('USE_ABS_POS_EMBED', False)
+        act_cfg=dict(type='GELU')
+        norm_cfg=dict(type='LN')
+
+        self.convert_weights = self.model_cfg.get('CONVERT_WEIGHTS', False)
+        self.frozen_stages = self.model_cfg.get('FROZEN_STAGES', -1)
+
+        if isinstance(pretrain_img_size, int):
+            pretrain_img_size = to_2tuple(pretrain_img_size)
+        elif isinstance(pretrain_img_size, tuple):
+            if len(pretrain_img_size) == 1:
+                pretrain_img_size = to_2tuple(pretrain_img_size[0])
+            assert len(pretrain_img_size) == 2, \
+                f'The size of image should have length 1 or 2, ' \
+                f'but got {len(pretrain_img_size)}'
+
+        super(SwinTransformer, self).__init__()
+        self.init_cfg = init_cfg
+
+        num_layers = len(depths)
+        self.out_indices = out_indices
+        self.use_abs_pos_embed = use_abs_pos_embed
+
+        assert strides[0] == patch_size, 'Use non-overlapping patch embed.'
+
+        self.patch_embed = PatchEmbed(
+            in_channels=in_channels,
+            embed_dims=embed_dims,
+            conv_type='Conv2d',
+            kernel_size=patch_size,
+            stride=strides[0],
+            norm_cfg=norm_cfg if patch_norm else None)
+
+        if self.use_abs_pos_embed:
+            patch_row = pretrain_img_size[0] // patch_size
+            patch_col = pretrain_img_size[1] // patch_size
+            num_patches = patch_row * patch_col
+            self.absolute_pos_embed = nn.Parameter(
+                torch.zeros((1, num_patches, embed_dims)))
+
+        self.drop_after_pos = nn.Dropout(p=drop_rate)
+
+        # set stochastic depth decay rule
+        total_depth = sum(depths)
+        dpr = [
+            x.item() for x in torch.linspace(0, drop_path_rate, total_depth)
+        ]
+
+        self.stages = nn.ModuleList()
+        in_channels = embed_dims
+        for i in range(num_layers):
+            if i < num_layers - 1:
+                downsample = PatchMerging(
+                    in_channels=in_channels,
+                    out_channels=2 * in_channels,
+                    stride=strides[i + 1],
+                    norm_cfg=norm_cfg if patch_norm else None)
+            else:
+                downsample = None
+
+            stage = SwinBlockSequence(
+                embed_dims=in_channels,
+                num_heads=num_heads[i],
+                feedforward_channels=mlp_ratio * in_channels,
+                depth=depths[i],
+                window_size=window_size,
+                qkv_bias=qkv_bias,
+                qk_scale=qk_scale,
+                drop_rate=drop_rate,
+                attn_drop_rate=attn_drop_rate,
+                drop_path_rate=dpr[sum(depths[:i]):sum(depths[:i + 1])],
+                downsample=downsample,
+                act_cfg=act_cfg,
+                norm_cfg=norm_cfg,
+                with_cp=with_cp)
+            self.stages.append(stage)
+            if downsample:
+                in_channels = downsample.out_channels
+
+        self.num_features = [int(embed_dims * 2**i) for i in range(num_layers)]
+        # Add a norm layer for each output
+        for i in out_indices:
+            layer =  nn.LayerNorm(self.num_features[i])
+            layer_name = f'norm{i}'
+            self.add_module(layer_name, layer)
+
+    def train(self, mode=True):
+        """Convert the model into training mode while keep layers freezed."""
+        super(SwinTransformer, self).train(mode)
+        self._freeze_stages()
+
+    def _freeze_stages(self):
+        if self.frozen_stages >= 0:
+            self.patch_embed.eval()
+            for param in self.patch_embed.parameters():
+                param.requires_grad = False
+            if self.use_abs_pos_embed:
+                self.absolute_pos_embed.requires_grad = False
+            self.drop_after_pos.eval()
+
+        for i in range(1, self.frozen_stages + 1):
+
+            if (i - 1) in self.out_indices:
+                norm_layer = getattr(self, f'norm{i-1}')
+                norm_layer.eval()
+                for param in norm_layer.parameters():
+                    param.requires_grad = False
+
+            m = self.stages[i - 1]
+            m.eval()
+            for param in m.parameters():
+                param.requires_grad = False
+
+    def init_weights(self):
+        if self.init_cfg is None:
+            print(f'No pre-trained weights for '
+                        f'{self.__class__.__name__}, '
+                        f'training start from scratch')
+            if self.use_abs_pos_embed:
+                trunc_normal_(self.absolute_pos_embed, std=0.02)
+            for m in self.modules():
+                if isinstance(m, nn.Linear):
+                    trunc_normal_init(m, std=.02, bias=0.)
+                elif isinstance(m, nn.LayerNorm):
+                    constant_init(m, 1.0)
+        else:
+            assert 'checkpoint' in self.init_cfg, f'Only support ' \
+                                                  f'specify `Pretrained` in ' \
+                                                  f'`init_cfg` in ' \
+                                                  f'{self.__class__.__name__} '
+            ckpt = torch.load(self.init_cfg.checkpoint, map_location='cpu')
+            if 'state_dict' in ckpt:
+                _state_dict = ckpt['state_dict']
+            elif 'model' in ckpt:
+                _state_dict = ckpt['model']
+            else:
+                _state_dict = ckpt
+            if self.convert_weights:
+                # supported loading weight from original repo,
+                _state_dict = swin_converter(_state_dict)
+
+            state_dict = OrderedDict()
+            for k, v in _state_dict.items():
+                if k.startswith('backbone.'):
+                    state_dict[k[9:]] = v
+
+            # strip prefix of state_dict
+            if list(state_dict.keys())[0].startswith('module.'):
+                state_dict = {k[7:]: v for k, v in state_dict.items()}
+
+            # reshape absolute position embedding
+            if state_dict.get('absolute_pos_embed') is not None:
+                absolute_pos_embed = state_dict['absolute_pos_embed']
+                N1, L, C1 = absolute_pos_embed.size()
+                N2, C2, H, W = self.absolute_pos_embed.size()
+                if N1 != N2 or C1 != C2 or L != H * W:
+                    print('Error in loading absolute_pos_embed, pass')
+                else:
+                    state_dict['absolute_pos_embed'] = absolute_pos_embed.view(
+                        N2, H, W, C2).permute(0, 3, 1, 2).contiguous()
+
+            # interpolate position bias table if needed
+            relative_position_bias_table_keys = [
+                k for k in state_dict.keys()
+                if 'relative_position_bias_table' in k
+            ]
+            for table_key in relative_position_bias_table_keys:
+                table_pretrained = state_dict[table_key]
+                table_current = self.state_dict()[table_key]
+                L1, nH1 = table_pretrained.size()
+                L2, nH2 = table_current.size()
+                if nH1 != nH2:
+                    print(f'Error in loading {table_key}, pass')
+                elif L1 != L2:
+                    S1 = int(L1**0.5)
+                    S2 = int(L2**0.5)
+                    table_pretrained_resized = F.interpolate(
+                        table_pretrained.permute(1, 0).reshape(1, nH1, S1, S1),
+                        size=(S2, S2),
+                        mode='bicubic')
+                    state_dict[table_key] = table_pretrained_resized.view(
+                        nH2, L2).permute(1, 0).contiguous()
+
+            # load state_dict
+            self.load_state_dict(state_dict, False)
+
+    def forward(self, batch_dict):
+        x = batch_dict['camera_imgs']
+        B, N, C, H, W = x.size()
+        x = x.view(B * N, C, H, W)
+        x, hw_shape = self.patch_embed(x)
+
+        if self.use_abs_pos_embed:
+            x = x + self.absolute_pos_embed
+        x = self.drop_after_pos(x)
+
+        outs = []
+        for i, stage in enumerate(self.stages):
+            x, hw_shape, out, out_hw_shape = stage(x, hw_shape)
+            if i in self.out_indices:
+                norm_layer = getattr(self, f'norm{i}')
+                out = norm_layer(out)
+                out = out.view(-1, *out_hw_shape,
+                               self.num_features[i]).permute(0, 3, 1,
+                                                             2).contiguous()
+                outs.append(out)
+        batch_dict['image_features'] = outs
+        return batch_dict
--- a/pcdet/models/dense_heads/__init__.py
+++ b/pcdet/models/dense_heads/__init__.py
@@ -6,6 +6,7 @@ from .point_head_simple import PointHeadSimple
 from .point_intra_part_head import PointIntraPartOffsetHead
 from .center_head import CenterHead
 from .voxelnext_head import VoxelNeXtHead
+from .transfusion_head import TransFusionHead

 __all__ = {
    'AnchorHeadTemplate': AnchorHeadTemplate,
@@ -16,4 +17,5 @@ __all__ = {
    'AnchorHeadMulti': AnchorHeadMulti,
    'CenterHead': CenterHead,
    'VoxelNeXtHead': VoxelNeXtHead,
+    'TransFusionHead': TransFusionHead,
 }
--- a/pcdet/models/dense_heads/target_assigner/hungarian_assigner.py
+++ b/pcdet/models/dense_heads/target_assigner/hungarian_assigner.py
+import torch
+from scipy.optimize import linear_sum_assignment
+from pcdet.ops.iou3d_nms import iou3d_nms_cuda
+
+
+def height_overlaps(boxes1, boxes2):
+    """
+    Calculate height overlaps of two boxes.
+    """
+    boxes1_top_height = (boxes1[:,2]+ boxes1[:,5]).view(-1, 1)
+    boxes1_bottom_height = boxes1[:,2].view(-1, 1)
+    boxes2_top_height = (boxes2[:,2]+boxes2[:,5]).view(1, -1)
+    boxes2_bottom_height = boxes2[:,2].view(1, -1)
+
+    heighest_of_bottom = torch.max(boxes1_bottom_height, boxes2_bottom_height)
+    lowest_of_top = torch.min(boxes1_top_height, boxes2_top_height)
+    overlaps_h = torch.clamp(lowest_of_top - heighest_of_bottom, min=0)
+    return overlaps_h
+
+
+def overlaps(boxes1, boxes2):
+    """
+    Calculate 3D overlaps of two boxes.
+    """
+    rows = len(boxes1)
+    cols = len(boxes2)
+    if rows * cols == 0:
+        return boxes1.new(rows, cols)
+
+    # height overlap
+    overlaps_h = height_overlaps(boxes1, boxes2)
+    boxes1_bev = boxes1[:,:7]
+    boxes2_bev = boxes2[:,:7]
+
+    # bev overlap
+    overlaps_bev = boxes1_bev.new_zeros(
+        (boxes1_bev.shape[0], boxes2_bev.shape[0])
+    ).cuda()  # (N, M)
+    iou3d_nms_cuda.boxes_overlap_bev_gpu(
+        boxes1_bev.contiguous().cuda(), boxes2_bev.contiguous().cuda(), overlaps_bev
+    )
+
+    # 3d overlaps
+    overlaps_3d = overlaps_bev.to(boxes1.device) * overlaps_h
+
+    volume1 = (boxes1[:, 3] * boxes1[:, 4] * boxes1[:, 5]).view(-1, 1)
+    volume2 = (boxes2[:, 3] * boxes2[:, 4] * boxes2[:, 5]).view(1, -1)
+        
+    iou3d = overlaps_3d / torch.clamp(volume1 + volume2 - overlaps_3d, min=1e-8)
+
+    return iou3d
+
+
+
+class HungarianAssigner3D:
+    def __init__(self, cls_cost, reg_cost, iou_cost):
+        self.cls_cost = cls_cost
+        self.reg_cost = reg_cost
+        self.iou_cost = iou_cost
+
+    def focal_loss_cost(self, cls_pred, gt_labels):
+        weight = self.cls_cost.get('weight', 0.15)
+        alpha = self.cls_cost.get('alpha', 0.25)
+        gamma = self.cls_cost.get('gamma', 2.0)
+        eps = self.cls_cost.get('eps', 1e-12)
+
+        cls_pred = cls_pred.sigmoid()
+        neg_cost = -(1 - cls_pred + eps).log() * (
+            1 - alpha) * cls_pred.pow(gamma)
+        pos_cost = -(cls_pred + eps).log() * alpha * (
+            1 - cls_pred).pow(gamma)
+
+        cls_cost = pos_cost[:, gt_labels] - neg_cost[:, gt_labels]
+        return cls_cost * weight
+    
+    def bevbox_cost(self, bboxes, gt_bboxes, point_cloud_range):
+        weight = self.reg_cost.get('weight', 0.25)
+
+        pc_start = bboxes.new(point_cloud_range[0:2])
+        pc_range = bboxes.new(point_cloud_range[3:5]) - bboxes.new(point_cloud_range[0:2])
+        # normalize the box center to [0, 1]
+        normalized_bboxes_xy = (bboxes[:, :2] - pc_start) / pc_range
+        normalized_gt_bboxes_xy = (gt_bboxes[:, :2] - pc_start) / pc_range
+        reg_cost = torch.cdist(normalized_bboxes_xy, normalized_gt_bboxes_xy, p=1)
+        return reg_cost * weight
+
+    def iou3d_cost(self, bboxes, gt_bboxes):
+        iou = overlaps(bboxes, gt_bboxes)
+        weight = self.iou_cost.get('weight', 0.25)
+        iou_cost = - iou
+        return iou_cost * weight, iou
+
+    def assign(self, bboxes, gt_bboxes, gt_labels, cls_pred, point_cloud_range):
+        num_gts, num_bboxes = gt_bboxes.size(0), bboxes.size(0)
+
+        # 1. assign -1 by default
+        assigned_gt_inds = bboxes.new_full((num_bboxes,), -1, dtype=torch.long)
+        assigned_labels = bboxes.new_full((num_bboxes,), -1, dtype=torch.long)
+        if num_gts == 0 or num_bboxes == 0:
+            # No ground truth or boxes, return empty assignment
+            if num_gts == 0:
+                # No ground truth, assign all to background
+                assigned_gt_inds[:] = 0
+            return num_gts, assigned_gt_inds, max_overlaps, assigned_labels
+
+        # 2. compute the weighted costs
+        cls_cost = self.focal_loss_cost(cls_pred[0].T, gt_labels)
+        reg_cost = self.bevbox_cost(bboxes, gt_bboxes, point_cloud_range)
+        iou_cost, iou = self.iou3d_cost(bboxes, gt_bboxes)
+        
+
+        # weighted sum of above three costs
+        cost = cls_cost + reg_cost + iou_cost
+
+        # 3. do Hungarian matching on CPU using linear_sum_assignment
+        cost = cost.detach().cpu()
+        matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
+        matched_row_inds = torch.from_numpy(matched_row_inds).to(bboxes.device)
+        matched_col_inds = torch.from_numpy(matched_col_inds).to(bboxes.device)
+
+        # 4. assign backgrounds and foregrounds
+        # assign all indices to backgrounds first
+        assigned_gt_inds[:] = 0
+        # assign foregrounds based on matching results
+        assigned_gt_inds[matched_row_inds] = matched_col_inds + 1
+        assigned_labels[matched_row_inds] = gt_labels[matched_col_inds]
+
+        max_overlaps = torch.zeros_like(iou.max(1).values)
+        max_overlaps[matched_row_inds] = iou[matched_row_inds, matched_col_inds]
+
+        return assigned_gt_inds, max_overlaps
\ No newline at end of file
--- a/pcdet/models/dense_heads/transfusion_head.py
+++ b/pcdet/models/dense_heads/transfusion_head.py
+import copy
+import numpy as np
+import torch
+from torch import nn
+import torch.nn.functional as F
+from torch.nn.init import kaiming_normal_
+from ..model_utils.transfusion_utils import clip_sigmoid
+from ..model_utils.basic_block_2d import BasicBlock2D
+from ..model_utils.transfusion_utils import PositionEmbeddingLearned, TransformerDecoderLayer
+from .target_assigner.hungarian_assigner import HungarianAssigner3D
+from ...utils import loss_utils
+from ..model_utils import centernet_utils
+
+
+class SeparateHead_Transfusion(nn.Module):
+    def __init__(self, input_channels, head_channels, kernel_size, sep_head_dict, init_bias=-2.19, use_bias=False):
+        super().__init__()
+        self.sep_head_dict = sep_head_dict
+
+        for cur_name in self.sep_head_dict:
+            output_channels = self.sep_head_dict[cur_name]['out_channels']
+            num_conv = self.sep_head_dict[cur_name]['num_conv']
+
+            fc_list = []
+            for k in range(num_conv - 1):
+                fc_list.append(nn.Sequential(
+                    nn.Conv1d(input_channels, head_channels, kernel_size, stride=1, padding=kernel_size//2, bias=use_bias),
+                    nn.BatchNorm1d(head_channels),
+                    nn.ReLU()
+                ))
+            fc_list.append(nn.Conv1d(head_channels, output_channels, kernel_size, stride=1, padding=kernel_size//2, bias=True))
+            fc = nn.Sequential(*fc_list)
+            if 'hm' in cur_name:
+                fc[-1].bias.data.fill_(init_bias)
+            else:
+                for m in fc.modules():
+                    if isinstance(m, nn.Conv2d):
+                        kaiming_normal_(m.weight.data)
+                        if hasattr(m, "bias") and m.bias is not None:
+                            nn.init.constant_(m.bias, 0)
+
+            self.__setattr__(cur_name, fc)
+
+    def forward(self, x):
+        ret_dict = {}
+        for cur_name in self.sep_head_dict:
+            ret_dict[cur_name] = self.__getattr__(cur_name)(x)
+
+        return ret_dict
+
+
+
+class TransFusionHead(nn.Module):
+    """
+        This module implements TransFusionHead.
+        The code is adapted from https://github.com/mit-han-lab/bevfusion/ with minimal modifications.
+    """
+    def __init__(
+        self,
+        model_cfg, input_channels, num_class, class_names, grid_size, point_cloud_range, voxel_size, predict_boxes_when_training=True,
+    ):
+        super(TransFusionHead, self).__init__()
+
+        self.grid_size = grid_size
+        self.point_cloud_range = point_cloud_range
+        self.voxel_size = voxel_size
+        self.num_classes = num_class
+
+        self.model_cfg = model_cfg
+        self.feature_map_stride = self.model_cfg.TARGET_ASSIGNER_CONFIG.get('FEATURE_MAP_STRIDE', None)
+        self.dataset_name = self.model_cfg.TARGET_ASSIGNER_CONFIG.get('DATASET', 'nuScenes')
+
+        hidden_channel=self.model_cfg.HIDDEN_CHANNEL
+        self.num_proposals = self.model_cfg.NUM_PROPOSALS
+        self.bn_momentum = self.model_cfg.BN_MOMENTUM
+        self.nms_kernel_size = self.model_cfg.NMS_KERNEL_SIZE
+
+        num_heads = self.model_cfg.NUM_HEADS
+        dropout = self.model_cfg.DROPOUT
+        activation = self.model_cfg.ACTIVATION
+        ffn_channel = self.model_cfg.FFN_CHANNEL
+        bias = self.model_cfg.get('USE_BIAS_BEFORE_NORM', False)
+
+        loss_cls = self.model_cfg.LOSS_CONFIG.LOSS_CLS
+        self.use_sigmoid_cls = loss_cls.get("use_sigmoid", False)
+        if not self.use_sigmoid_cls:
+            self.num_classes += 1
+        self.loss_cls = loss_utils.SigmoidFocalClassificationLoss(gamma=loss_cls.gamma,alpha=loss_cls.alpha)
+        self.loss_cls_weight = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['cls_weight']
+        self.loss_bbox = loss_utils.L1Loss()
+        self.loss_bbox_weight = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['bbox_weight']
+        self.loss_heatmap = loss_utils.GaussianFocalLoss()
+        self.loss_heatmap_weight = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['hm_weight']
+
+        self.code_size = 10
+
+        # a shared convolution
+        self.shared_conv = nn.Conv2d(in_channels=input_channels,out_channels=hidden_channel,kernel_size=3,padding=1)
+        layers = []
+        layers.append(BasicBlock2D(hidden_channel,hidden_channel, kernel_size=3,padding=1,bias=bias))
+        layers.append(nn.Conv2d(in_channels=hidden_channel,out_channels=num_class,kernel_size=3,padding=1))
+        self.heatmap_head = nn.Sequential(*layers)
+        self.class_encoding = nn.Conv1d(num_class, hidden_channel, 1)
+
+        # transformer decoder layers for object query with LiDAR feature
+        self.decoder = TransformerDecoderLayer(hidden_channel, num_heads, ffn_channel, dropout, activation,
+                self_posembed=PositionEmbeddingLearned(2, hidden_channel),
+                cross_posembed=PositionEmbeddingLearned(2, hidden_channel),
+            )
+        # Prediction Head
+        heads = copy.deepcopy(self.model_cfg.SEPARATE_HEAD_CFG.HEAD_DICT)
+        heads['heatmap'] = dict(out_channels=self.num_classes, num_conv=self.model_cfg.NUM_HM_CONV)
+        self.prediction_head = SeparateHead_Transfusion(hidden_channel, 64, 1, heads, use_bias=bias)
+
+        self.init_weights()
+        self.bbox_assigner = HungarianAssigner3D(**self.model_cfg.TARGET_ASSIGNER_CONFIG.HUNGARIAN_ASSIGNER)
+
+        # Position Embedding for Cross-Attention, which is re-used during training
+        x_size = self.grid_size[0] // self.feature_map_stride
+        y_size = self.grid_size[1] // self.feature_map_stride
+        self.bev_pos = self.create_2D_grid(x_size, y_size)
+
+        self.forward_ret_dict = {}
+
+    def create_2D_grid(self, x_size, y_size):
+        meshgrid = [[0, x_size - 1, x_size], [0, y_size - 1, y_size]]
+        # NOTE: modified
+        batch_x, batch_y = torch.meshgrid(
+            *[torch.linspace(it[0], it[1], it[2]) for it in meshgrid]
+        )
+        batch_x = batch_x + 0.5
+        batch_y = batch_y + 0.5
+        coord_base = torch.cat([batch_x[None], batch_y[None]], dim=0)[None]
+        coord_base = coord_base.view(1, 2, -1).permute(0, 2, 1)
+        return coord_base
+
+    def init_weights(self):
+        # initialize transformer
+        for m in self.decoder.parameters():
+            if m.dim() > 1:
+                nn.init.xavier_uniform_(m)
+        if hasattr(self, "query"):
+            nn.init.xavier_normal_(self.query)
+        self.init_bn_momentum()
+
+    def init_bn_momentum(self):
+        for m in self.modules():
+            if isinstance(m, (nn.BatchNorm2d, nn.BatchNorm1d)):
+                m.momentum = self.bn_momentum
+
+    def predict(self, inputs):
+        batch_size = inputs.shape[0]
+        lidar_feat = self.shared_conv(inputs)
+
+        lidar_feat_flatten = lidar_feat.view(
+            batch_size, lidar_feat.shape[1], -1
+        )
+        bev_pos = self.bev_pos.repeat(batch_size, 1, 1).to(lidar_feat.device)
+
+        # query initialization
+        dense_heatmap = self.heatmap_head(lidar_feat)
+        heatmap = dense_heatmap.detach().sigmoid()
+        padding = self.nms_kernel_size // 2
+        local_max = torch.zeros_like(heatmap)
+        local_max_inner = F.max_pool2d(
+            heatmap, kernel_size=self.nms_kernel_size, stride=1, padding=0
+        )
+        local_max[:, :, padding:(-padding), padding:(-padding)] = local_max_inner
+        # for Pedestrian & Traffic_cone in nuScenes
+        if self.dataset_name == "nuScenes":
+            local_max[ :, 8, ] = F.max_pool2d(heatmap[:, 8], kernel_size=1, stride=1, padding=0)
+            local_max[ :, 9, ] = F.max_pool2d(heatmap[:, 9], kernel_size=1, stride=1, padding=0)
+        # for Pedestrian & Cyclist in Waymo
+        elif self.dataset_name == "Waymo":
+            local_max[ :, 1, ] = F.max_pool2d(heatmap[:, 1], kernel_size=1, stride=1, padding=0)
+            local_max[ :, 2, ] = F.max_pool2d(heatmap[:, 2], kernel_size=1, stride=1, padding=0)
+        heatmap = heatmap * (heatmap == local_max)
+        heatmap = heatmap.view(batch_size, heatmap.shape[1], -1)
+ 
+        # top num_proposals among all classes
+        top_proposals = heatmap.view(batch_size, -1).argsort(dim=-1, descending=True)[
+            ..., : self.num_proposals
+        ]
+        top_proposals_class = top_proposals // heatmap.shape[-1]
+        top_proposals_index = top_proposals % heatmap.shape[-1]
+        query_feat = lidar_feat_flatten.gather(
+            index=top_proposals_index[:, None, :].expand(-1, lidar_feat_flatten.shape[1], -1),
+            dim=-1,
+        )
+        self.query_labels = top_proposals_class
+
+        # add category embedding
+        one_hot = F.one_hot(top_proposals_class, num_classes=self.num_classes).permute(0, 2, 1)
+        
+        query_cat_encoding = self.class_encoding(one_hot.float())
+        query_feat += query_cat_encoding
+
+        query_pos = bev_pos.gather(
+            index=top_proposals_index[:, None, :].permute(0, 2, 1).expand(-1, -1, bev_pos.shape[-1]),
+            dim=1,
+        )
+        # convert to xy
+        query_pos = query_pos.flip(dims=[-1])
+        bev_pos = bev_pos.flip(dims=[-1])
+
+        query_feat = self.decoder(
+            query_feat, lidar_feat_flatten, query_pos, bev_pos
+        )
+        res_layer = self.prediction_head(query_feat)
+        res_layer["center"] = res_layer["center"] + query_pos.permute(0, 2, 1)
+
+        res_layer["query_heatmap_score"] = heatmap.gather(
+            index=top_proposals_index[:, None, :].expand(-1, self.num_classes, -1),
+            dim=-1,
+        )
+        res_layer["dense_heatmap"] = dense_heatmap
+
+        return res_layer
+
+    def forward(self, batch_dict):
+        feats = batch_dict['spatial_features_2d']
+        res = self.predict(feats)
+        if not self.training:
+            bboxes = self.get_bboxes(res)
+            batch_dict['final_box_dicts'] = bboxes
+        else:
+            gt_boxes = batch_dict['gt_boxes']
+            gt_bboxes_3d = gt_boxes[...,:-1]
+            gt_labels_3d =  gt_boxes[...,-1].long() - 1
+            loss, tb_dict = self.loss(gt_bboxes_3d, gt_labels_3d, res)
+            batch_dict['loss'] = loss
+            batch_dict['tb_dict'] = tb_dict
+        return batch_dict
+
+    def get_targets(self, gt_bboxes_3d, gt_labels_3d, pred_dicts):
+        assign_results = []
+        for batch_idx in range(len(gt_bboxes_3d)):
+            pred_dict = {}
+            for key in pred_dicts.keys():
+                pred_dict[key] = pred_dicts[key][batch_idx : batch_idx + 1]
+            gt_bboxes = gt_bboxes_3d[batch_idx]
+            valid_idx = []
+            # filter empty boxes
+            for i in range(len(gt_bboxes)):
+                if gt_bboxes[i][3] > 0 and gt_bboxes[i][4] > 0:
+                    valid_idx.append(i)
+            assign_result = self.get_targets_single(gt_bboxes[valid_idx], gt_labels_3d[batch_idx][valid_idx], pred_dict)
+            assign_results.append(assign_result)
+
+        res_tuple = tuple(map(list, zip(*assign_results)))
+        labels = torch.cat(res_tuple[0], dim=0)
+        label_weights = torch.cat(res_tuple[1], dim=0)
+        bbox_targets = torch.cat(res_tuple[2], dim=0)
+        bbox_weights = torch.cat(res_tuple[3], dim=0)
+        num_pos = np.sum(res_tuple[4])
+        matched_ious = np.mean(res_tuple[5])
+        heatmap = torch.cat(res_tuple[6], dim=0)
+        return labels, label_weights, bbox_targets, bbox_weights, num_pos, matched_ious, heatmap
+        
+
+    def get_targets_single(self, gt_bboxes_3d, gt_labels_3d, preds_dict):
+        
+        num_proposals = preds_dict["center"].shape[-1]
+        score = copy.deepcopy(preds_dict["heatmap"].detach())
+        center = copy.deepcopy(preds_dict["center"].detach())
+        height = copy.deepcopy(preds_dict["height"].detach())
+        dim = copy.deepcopy(preds_dict["dim"].detach())
+        rot = copy.deepcopy(preds_dict["rot"].detach())
+        if "vel" in preds_dict.keys():
+            vel = copy.deepcopy(preds_dict["vel"].detach())
+        else:
+            vel = None
+
+        boxes_dict = self.decode_bbox(score, rot, dim, center, height, vel)
+        bboxes_tensor = boxes_dict[0]["pred_boxes"]
+        gt_bboxes_tensor = gt_bboxes_3d.to(score.device)
+
+        assigned_gt_inds, ious = self.bbox_assigner.assign(
+            bboxes_tensor, gt_bboxes_tensor, gt_labels_3d,
+            score, self.point_cloud_range,
+        )
+        pos_inds = torch.nonzero(assigned_gt_inds > 0, as_tuple=False).squeeze(-1).unique()
+        neg_inds = torch.nonzero(assigned_gt_inds == 0, as_tuple=False).squeeze(-1).unique()
+        pos_assigned_gt_inds = assigned_gt_inds[pos_inds] - 1
+        if gt_bboxes_3d.numel() == 0:
+            assert pos_inds.numel() == 0
+            pos_gt_bboxes = torch.empty_like(gt_bboxes_3d).view(-1, 9)
+        else:
+            pos_gt_bboxes = gt_bboxes_3d[pos_assigned_gt_inds.long(), :]
+
+        # create target for loss computation
+        bbox_targets = torch.zeros([num_proposals, self.code_size]).to(center.device)
+        bbox_weights = torch.zeros([num_proposals, self.code_size]).to(center.device)
+        ious = torch.clamp(ious, min=0.0, max=1.0)
+        labels = bboxes_tensor.new_zeros(num_proposals, dtype=torch.long)
+        label_weights = bboxes_tensor.new_zeros(num_proposals, dtype=torch.long)
+
+        if gt_labels_3d is not None:  # default label is -1
+            labels += self.num_classes
+
+        # both pos and neg have classification loss, only pos has regression and iou loss
+        if len(pos_inds) > 0:
+            pos_bbox_targets = self.encode_bbox(pos_gt_bboxes)
+            bbox_targets[pos_inds, :] = pos_bbox_targets
+            bbox_weights[pos_inds, :] = 1.0
+
+            if gt_labels_3d is None:
+                labels[pos_inds] = 1
+            else:
+                labels[pos_inds] = gt_labels_3d[pos_assigned_gt_inds]
+            label_weights[pos_inds] = 1.0
+
+        if len(neg_inds) > 0:
+            label_weights[neg_inds] = 1.0
+
+        # compute dense heatmap targets
+        device = labels.device
+        target_assigner_cfg = self.model_cfg.TARGET_ASSIGNER_CONFIG
+        feature_map_size = (self.grid_size[:2] // self.feature_map_stride) 
+        heatmap = gt_bboxes_3d.new_zeros(self.num_classes, feature_map_size[1], feature_map_size[0])
+        for idx in range(len(gt_bboxes_3d)):
+            width = gt_bboxes_3d[idx][3]
+            length = gt_bboxes_3d[idx][4]
+            width = width / self.voxel_size[0] / self.feature_map_stride
+            length = length / self.voxel_size[1] / self.feature_map_stride
+            if width > 0 and length > 0:
+                radius = centernet_utils.gaussian_radius(length.view(-1), width.view(-1), target_assigner_cfg.GAUSSIAN_OVERLAP)[0]
+                radius = max(target_assigner_cfg.MIN_RADIUS, int(radius))
+                x, y = gt_bboxes_3d[idx][0], gt_bboxes_3d[idx][1]
+
+                coor_x = (x - self.point_cloud_range[0]) / self.voxel_size[0] / self.feature_map_stride
+                coor_y = (y - self.point_cloud_range[1]) / self.voxel_size[1] / self.feature_map_stride
+
+                center = torch.tensor([coor_x, coor_y], dtype=torch.float32, device=device)
+                center_int = center.to(torch.int32)
+                centernet_utils.draw_gaussian_to_heatmap(heatmap[gt_labels_3d[idx]], center_int, radius)
+
+
+        mean_iou = ious[pos_inds].sum() / max(len(pos_inds), 1)
+        return (labels[None], label_weights[None], bbox_targets[None], bbox_weights[None], int(pos_inds.shape[0]), float(mean_iou), heatmap[None])
+
+    def loss(self, gt_bboxes_3d, gt_labels_3d, pred_dicts, **kwargs):
+
+        labels, label_weights, bbox_targets, bbox_weights, num_pos, matched_ious, heatmap = \
+            self.get_targets(gt_bboxes_3d, gt_labels_3d, pred_dicts)
+        loss_dict = dict()
+        loss_all = 0
+
+        # compute heatmap loss
+        loss_heatmap = self.loss_heatmap(
+            clip_sigmoid(pred_dicts["dense_heatmap"]),
+            heatmap,
+        ).sum() / max(heatmap.eq(1).float().sum().item(), 1)
+        loss_dict["loss_heatmap"] = loss_heatmap.item() * self.loss_heatmap_weight
+        loss_all += loss_heatmap * self.loss_heatmap_weight
+
+        labels = labels.reshape(-1)
+        label_weights = label_weights.reshape(-1)
+        cls_score = pred_dicts["heatmap"].permute(0, 2, 1).reshape(-1, self.num_classes)
+
+        one_hot_targets = torch.zeros(*list(labels.shape), self.num_classes+1, dtype=cls_score.dtype, device=labels.device)
+        one_hot_targets.scatter_(-1, labels.unsqueeze(dim=-1).long(), 1.0)
+        one_hot_targets = one_hot_targets[..., :-1]
+        loss_cls = self.loss_cls(
+            cls_score, one_hot_targets, label_weights
+        ).sum() / max(num_pos, 1)
+
+        preds = torch.cat([pred_dicts[head_name] for head_name in self.model_cfg.SEPARATE_HEAD_CFG.HEAD_ORDER], dim=1).permute(0, 2, 1)
+        code_weights = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['code_weights']
+        reg_weights = bbox_weights * bbox_weights.new_tensor(code_weights)
+
+        loss_bbox = self.loss_bbox(preds, bbox_targets) 
+        loss_bbox = (loss_bbox * reg_weights).sum() / max(num_pos, 1)
+
+        loss_dict["loss_cls"] = loss_cls.item() * self.loss_cls_weight
+        loss_dict["loss_bbox"] = loss_bbox.item() * self.loss_bbox_weight
+        loss_all = loss_all + loss_cls * self.loss_cls_weight + loss_bbox * self.loss_bbox_weight
+
+        loss_dict[f"matched_ious"] = loss_cls.new_tensor(matched_ious)
+        loss_dict['loss_trans'] = loss_all
+
+        return loss_all,loss_dict
+
+    def encode_bbox(self, bboxes):
+        code_size = 10
+        targets = torch.zeros([bboxes.shape[0], code_size]).to(bboxes.device)
+        targets[:, 0] = (bboxes[:, 0] - self.point_cloud_range[0]) / (self.feature_map_stride * self.voxel_size[0])
+        targets[:, 1] = (bboxes[:, 1] - self.point_cloud_range[1]) / (self.feature_map_stride * self.voxel_size[1])
+        targets[:, 3:6] = bboxes[:, 3:6].log()
+        targets[:, 2] = bboxes[:, 2]
+        targets[:, 6] = torch.sin(bboxes[:, 6])
+        targets[:, 7] = torch.cos(bboxes[:, 6])
+        if code_size == 10:
+            targets[:, 8:10] = bboxes[:, 7:]
+        return targets
+
+    def decode_bbox(self, heatmap, rot, dim, center, height, vel, filter=False):
+        
+        post_process_cfg = self.model_cfg.POST_PROCESSING
+        score_thresh = post_process_cfg.SCORE_THRESH
+        post_center_range = post_process_cfg.POST_CENTER_RANGE
+        post_center_range = torch.tensor(post_center_range).cuda().float()
+        # class label
+        final_preds = heatmap.max(1, keepdims=False).indices
+        final_scores = heatmap.max(1, keepdims=False).values
+
+        center[:, 0, :] = center[:, 0, :] * self.feature_map_stride * self.voxel_size[0] + self.point_cloud_range[0]
+        center[:, 1, :] = center[:, 1, :] * self.feature_map_stride * self.voxel_size[1] + self.point_cloud_range[1]
+        dim = dim.exp()
+        rots, rotc = rot[:, 0:1, :], rot[:, 1:2, :]
+        rot = torch.atan2(rots, rotc)
+
+        if vel is None:
+            final_box_preds = torch.cat([center, height, dim, rot], dim=1).permute(0, 2, 1)
+        else:
+            final_box_preds = torch.cat([center, height, dim, rot, vel], dim=1).permute(0, 2, 1)
+
+        predictions_dicts = []
+        for i in range(heatmap.shape[0]):
+            boxes3d = final_box_preds[i]
+            scores = final_scores[i]
+            labels = final_preds[i]
+            predictions_dict = {
+                'pred_boxes': boxes3d,
+                'pred_scores': scores,
+                'pred_labels': labels
+            }
+            predictions_dicts.append(predictions_dict)
+
+        if filter is False:
+            return predictions_dicts
+
+        thresh_mask = final_scores > score_thresh        
+        mask = (final_box_preds[..., :3] >= post_center_range[:3]).all(2)
+        mask &= (final_box_preds[..., :3] <= post_center_range[3:]).all(2)
+
+        predictions_dicts = []
+        for i in range(heatmap.shape[0]):
+            cmask = mask[i, :]
+            cmask &= thresh_mask[i]
+
+            boxes3d = final_box_preds[i, cmask]
+            scores = final_scores[i, cmask]
+            labels = final_preds[i, cmask]
+            predictions_dict = {
+                'pred_boxes': boxes3d,
+                'pred_scores': scores,
+                'pred_labels': labels,
+            }
+
+            predictions_dicts.append(predictions_dict)
+
+        return predictions_dicts
+
+    def get_bboxes(self, preds_dicts):
+
+        batch_size = preds_dicts["heatmap"].shape[0]
+        batch_score = preds_dicts["heatmap"].sigmoid()
+        one_hot = F.one_hot(
+            self.query_labels, num_classes=self.num_classes
+        ).permute(0, 2, 1)
+        batch_score = batch_score * preds_dicts["query_heatmap_score"] * one_hot
+        batch_center = preds_dicts["center"]
+        batch_height = preds_dicts["height"]
+        batch_dim = preds_dicts["dim"]
+        batch_rot = preds_dicts["rot"]
+        batch_vel = None
+        if "vel" in preds_dicts:
+            batch_vel = preds_dicts["vel"]
+
+        ret_dict = self.decode_bbox(
+            batch_score, batch_rot, batch_dim,
+            batch_center, batch_height, batch_vel,
+            filter=True,
+        )
+        for k in range(batch_size):
+            ret_dict[k]['pred_labels'] = ret_dict[k]['pred_labels'].int() + 1
+
+        return ret_dict