Unverified Commit 02ac3e17 authored by Shaoshuai Shi, committed by GitHub

Support multi-modal 3D detection on NuScenes #1339

Add support for multi-modal NuScenes Detection
parents ad9c25c0 fcfa0773
...@@ -10,6 +10,7 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
* `OpenPCDet` has been updated to `v0.6.0` (Sep. 2022).
* The code of PV-RCNN++ has been supported.
* The code of MPPNet has been supported.
* The multi-modal 3D detection approaches on Nuscenes have been supported.
## Overview
- [Changelog](#changelog)
...@@ -22,10 +23,15 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
## Changelog
[2023-05-13] **NEW:** Added support for the multi-modal 3D object detection models on Nuscenes dataset.
* Support multi-modal Nuscenes detection (See the [GETTING_STARTED.md](docs/GETTING_STARTED.md) to process data).
* Support [TransFusion-Lidar](https://arxiv.org/abs/2203.11496) head, which achieves 69.43% NDS on the Nuscenes validation dataset.
* Support [`BEVFusion`](https://arxiv.org/abs/2205.13542), which fuses multi-modal information on BEV space and reaches 70.98% NDS on Nuscenes validation dataset. (see the [guideline](docs/guidelines_of_approaches/bevfusion.md) on how to train/test with BEVFusion).
[2023-04-02] Added support for [`VoxelNeXt`](https://arxiv.org/abs/2303.11301) on Nuscenes, Waymo, and Argoverse2 datasets. It is a fully sparse 3D object detection network, which is a clean sparse CNNs network and predicts 3D objects directly upon voxels.
[2022-09-02] **NEW:** Update `OpenPCDet` to v0.6.0:
* Official code release of [`MPPNet`](https://arxiv.org/abs/2205.05979) for temporal 3D object detection, which supports long-term multi-frame 3D object detection and ranks 1st place on the [3D detection leaderboard](https://waymo.com/open/challenges/2020/3d-detection) of Waymo Open Dataset on Sept. 2nd, 2022. On the validation set, MPPNet achieves 74.96%, 75.06% and 74.52% for the vehicle, pedestrian and cyclist classes in terms of mAPH@Level_2. (See the [guideline](docs/guidelines_of_approaches/mppnet.md) on how to train/test with MPPNet.)
* Support multi-frame training/testing on Waymo Open Dataset (see the [change log](docs/changelog.md) for more details on how to process data).
* Support saving training details (e.g., loss, iter, epoch) to file (the previous tqdm progress bar is still supported by using `--use_tqdm_to_record`). Please run `pip install gpustat` if you also want to log GPU-related information.
* Support saving the latest model every 5 minutes, so you can restore model training from the latest status instead of the previous epoch.
...@@ -38,10 +44,10 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
[2022-02-07] Added support for Centerpoint models on Nuscenes Dataset.
[2022-01-14] Added support for dynamic pillar voxelization, following the implementation proposed in [`H^23D R-CNN`](https://arxiv.org/abs/2107.14391) with unique operation and [`torch_scatter`](https://github.com/rusty1s/pytorch_scatter) package.
[2022-01-05] **NEW:** Update `OpenPCDet` to v0.5.2:
* The code of [`PV-RCNN++`](https://arxiv.org/abs/2102.00463) has been released to this repo, with higher performance, faster training/inference speed and less memory consumption than PV-RCNN.
* Add performance of several models trained with full training set of [Waymo Open Dataset](#waymo-open-dataset-baselines).
* Support Lyft dataset, see the pull request [here](https://github.com/open-mmlab/OpenPCDet/pull/720).
...@@ -199,7 +205,7 @@ We could not provide the above pretrained models due to [Waymo Dataset License A
but you could easily achieve similar performance by training with the default configs.
### NuScenes 3D Object Detection Baselines
All models are trained with 8 GPUs and are available for download. For training BEVFusion, please refer to the [guideline](docs/guidelines_of_approaches/bevfusion.md).
| | mATE | mASE | mAOE | mAVE | mAAE | mAP | NDS | download |
|----------------------------------------------------------------------------------------------------|-------:|:------:|:------:|:-----:|:-----:|:-----:|:------:|:--------------------------------------------------------------------------------------------------:|
...@@ -209,7 +215,10 @@ All models are trained with 8 GTX 1080Ti GPUs and are available for download.
| [CenterPoint (voxel_size=0.1)](tools/cfgs/nuscenes_models/cbgs_voxel01_res3d_centerpoint.yaml) | 30.11 | 25.55 | 38.28 | 21.94 | 18.87 | 56.03 | 64.54 | [model-34M](https://drive.google.com/file/d/1Cz-J1c3dw7JAWc25KRG1XQj8yCaOlexQ/view?usp=sharing) |
| [CenterPoint (voxel_size=0.075)](tools/cfgs/nuscenes_models/cbgs_voxel0075_res3d_centerpoint.yaml) | 28.80 | 25.43 | 37.27 | 21.55 | 18.24 | 59.22 | 66.48 | [model-34M](https://drive.google.com/file/d/1XOHAWm1MPkCKr1gqmc3TWi5AYZgPsgxU/view?usp=sharing) |
| [VoxelNeXt (voxel_size=0.075)](tools/cfgs/nuscenes_models/cbgs_voxel0075_voxelnext.yaml) | 30.11 | 25.23 | 40.57 | 21.69 | 18.56 | 60.53 | 66.65 | [model-31M](https://drive.google.com/file/d/1IV7e7G9X-61KXSjMGtQo579pzDNbhwvf/view?usp=share_link) |
| [TransFusion-L*](tools/cfgs/nuscenes_models/transfusion_lidar.yaml) | 27.96 | 25.37 | 29.35 | 27.31 | 18.55 | 64.58 | 69.43 | [model-32M](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link) |
| [BEVFusion](tools/cfgs/nuscenes_models/bevfusion.yaml) | 28.03 | 25.43 | 30.19 | 26.76 | 18.48 | 67.75 | 70.98 | [model-157M](https://drive.google.com/file/d/1X50b-8immqlqD8VPAUkSKI0Ls-4k37g9/view?usp=share_link) |
*: Use the fade strategy, which disables data augmentations in the last several epochs during training.
### ONCE 3D Object Detection Baselines
All models are trained with 8 GPUs.
......
...@@ -53,9 +53,16 @@ pip install nuscenes-devkit==1.0.5
* Generate the data infos by running the following command (it may take several hours):
```python
# for lidar-only setting
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
--cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
--version v1.0-trainval
# for multi-modal setting
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
--cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
--version v1.0-trainval \
--with_cam
```
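After the multi-modal infos are generated, it can be useful to sanity-check that the camera entries were written. A minimal sketch (the info path below assumes the default `MAX_SWEEPS=10` setting and output location; adjust it to your setup):
```python
import pickle
from pathlib import Path

# assumed default output location; change this if your infos were saved elsewhere
info_path = Path('data/nuscenes/v1.0-trainval/nuscenes_infos_10sweeps_train.pkl')
with open(info_path, 'rb') as f:
    infos = pickle.load(f)

print(len(infos), 'samples')
# Infos created with --with_cam should carry a per-sample 'cams' dict with one
# entry per camera (CAM_FRONT, CAM_BACK, ...), each holding paths and poses.
print('cams' in infos[0], list(infos[0].get('cams', {}).keys()))
```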
### Waymo Open Dataset
......
## Installation
Please refer to [INSTALL.md](../INSTALL.md) for the installation of `OpenPCDet`.
* We recommend checking your pillow version and using pillow==8.4.0 to avoid a bug in BEV pooling.
## Data Preparation
Please refer to [GETTING_STARTED.md](../GETTING_STARTED.md) to process the multi-modal Nuscenes Dataset.
## Training
1. Train the lidar branch for BEVFusion:
```shell
bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/transfusion_lidar.yaml
```
The checkpoint will be saved in ../output/nuscenes_models/transfusion_lidar/default/ckpt, or you can download the pretrained checkpoint directly from [here](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link).
2. To train BEVFusion, you need to download the pretrained parameters for the image backbone from [here](https://drive.google.com/file/d/1v74WCt4_5ubjO7PciA5T0xhQc9bz_jZu/view?usp=share_link), and specify the path in the [config](../../tools/cfgs/nuscenes_models/bevfusion.yaml#L88). Then run the following command:
```shell
bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/bevfusion.yaml \
--pretrained_model path_to_pretrained_lidar_branch_ckpt
```
## Evaluation
* Test with a pretrained model:
```shell
bash scripts/dist_test.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/bevfusion.yaml \
--ckpt ../output/cfgs/nuscenes_models/bevfusion/default/ckpt/checkpoint_epoch_6.pth
```
## Performance
All models are trained with spconv 1.0, but you can directly load them for testing regardless of the spconv version.
| | mATE | mASE | mAOE | mAVE | mAAE | mAP | NDS | download |
|----------------------------------------------------------------------------------------------------|-------:|:------:|:------:|:-----:|:-----:|:-----:|:------:|:--------------------------------------------------------------------------------------------------:|
| [TransFusion-L](../../tools/cfgs/nuscenes_models/transfusion_lidar.yaml) | 27.96 | 25.37 | 29.35 | 27.31 | 18.55 | 64.58 | 69.43 | [model-32M](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link) |
| [BEVFusion](../../tools/cfgs/nuscenes_models/bevfusion.yaml) | 28.03 | 25.43 | 30.19 | 26.76 | 18.48 | 67.75 | 70.98 | [model-157M](https://drive.google.com/file/d/1X50b-8immqlqD8VPAUkSKI0Ls-4k37g9/view?usp=share_link) |
from functools import partial
import numpy as np
from PIL import Image
from ...utils import common_utils
from . import augmentor_utils, database_sampler
...@@ -23,6 +24,18 @@ class DataAugmentor(object):
cur_augmentor = getattr(self, cur_cfg.NAME)(config=cur_cfg)
self.data_augmentor_queue.append(cur_augmentor)
def disable_augmentation(self, augmentor_configs):
self.data_augmentor_queue = []
aug_config_list = augmentor_configs if isinstance(augmentor_configs, list) \
else augmentor_configs.AUG_CONFIG_LIST
for cur_cfg in aug_config_list:
if not isinstance(augmentor_configs, list):
if cur_cfg.NAME in augmentor_configs.DISABLE_AUG_LIST:
continue
cur_augmentor = getattr(self, cur_cfg.NAME)(config=cur_cfg)
self.data_augmentor_queue.append(cur_augmentor)
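# Illustrative sketch of how disable_augmentation supports the "fade strategy"
# mentioned in the README: late in training, rebuild the augmentor queue while
# skipping everything listed in DISABLE_AUG_LIST. The config below is made up;
# the real values come from the model yaml, and the call is normally issued by
# the training loop rather than user code.
from easydict import EasyDict

fade_cfg = EasyDict({
    'DISABLE_AUG_LIST': ['gt_sampling', 'random_world_flip'],
    'AUG_CONFIG_LIST': [
        EasyDict({'NAME': 'gt_sampling'}),
        EasyDict({'NAME': 'random_world_flip', 'ALONG_AXIS_LIST': ['x', 'y']}),
        EasyDict({'NAME': 'random_world_rotation', 'WORLD_ROT_ANGLE': [-0.785, 0.785]}),
    ],
})
# e.g. once cur_epoch reaches total_epochs - num_fade_epochs (hypothetical names):
#     train_set.data_augmentor.disable_augmentation(fade_cfg)
# afterwards only 'random_world_rotation' remains in data_augmentor_queue.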
def gt_sampling(self, config=None):
db_sampler = database_sampler.DataBaseSampler(
root_path=self.root_path,
...@@ -139,6 +152,7 @@ class DataAugmentor(object):
data_dict['gt_boxes'] = gt_boxes
data_dict['points'] = points
data_dict['noise_translate'] = noise_translate
return data_dict
def random_local_translation(self, data_dict=None, config=None):
...@@ -251,6 +265,28 @@ class DataAugmentor(object):
data_dict['points'] = points
return data_dict
def imgaug(self, data_dict=None, config=None):
if data_dict is None:
return partial(self.imgaug, config=config)
imgs = data_dict["camera_imgs"]
img_process_infos = data_dict['img_process_infos']
new_imgs = []
for img, img_process_info in zip(imgs, img_process_infos):
flip = False
if config.RAND_FLIP and np.random.choice([0, 1]):
flip = True
rotate = np.random.uniform(*config.ROT_LIM)
# aug images
if flip:
img = img.transpose(method=Image.FLIP_LEFT_RIGHT)
img = img.rotate(rotate)
img_process_info[2] = flip
img_process_info[3] = rotate
new_imgs.append(img)
data_dict["camera_imgs"] = new_imgs
return data_dict
def forward(self, data_dict):
"""
Args:
......
...@@ -2,6 +2,7 @@ from collections import defaultdict
from pathlib import Path
import numpy as np
import torch
import torch.utils.data as torch_data
from ..utils import common_utils
...@@ -130,6 +131,30 @@ class DatasetTemplate(torch_data.Dataset):
"""
raise NotImplementedError
def set_lidar_aug_matrix(self, data_dict):
"""
Get the lidar augmentation matrix (4 x 4), which is used to recover the original point coordinates.
"""
lidar_aug_matrix = np.eye(4)
if 'flip_y' in data_dict.keys():
flip_x = data_dict['flip_x']
flip_y = data_dict['flip_y']
if flip_x:
lidar_aug_matrix[:3,:3] = np.array([[1, 0, 0], [0, -1, 0], [0, 0, 1]]) @ lidar_aug_matrix[:3,:3]
if flip_y:
lidar_aug_matrix[:3,:3] = np.array([[-1, 0, 0], [0, 1, 0], [0, 0, 1]]) @ lidar_aug_matrix[:3,:3]
if 'noise_rot' in data_dict.keys():
noise_rot = data_dict['noise_rot']
lidar_aug_matrix[:3,:3] = common_utils.angle2matrix(torch.tensor(noise_rot)) @ lidar_aug_matrix[:3,:3]
if 'noise_scale' in data_dict.keys():
noise_scale = data_dict['noise_scale']
lidar_aug_matrix[:3,:3] *= noise_scale
if 'noise_translate' in data_dict.keys():
noise_translate = data_dict['noise_translate']
lidar_aug_matrix[:3,3:4] = noise_translate.T
data_dict['lidar_aug_matrix'] = lidar_aug_matrix
return data_dict
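# Standalone numeric sketch of what set_lidar_aug_matrix encodes: the applied
# point-cloud augmentations collapsed into one 4x4 matrix, whose inverse maps
# augmented points back to the original frame. Noise values below are made up.
import numpy as np

noise_rot, noise_scale = 0.2, 1.05
noise_translate = np.array([[0.5, -0.3, 0.1]])

M = np.eye(4)
M[:3, :3] = np.array([[1, 0, 0], [0, -1, 0], [0, 0, 1]]) @ M[:3, :3]  # flip_x (y -> -y)
c, s = np.cos(noise_rot), np.sin(noise_rot)
M[:3, :3] = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]]) @ M[:3, :3]  # world rotation
M[:3, :3] *= noise_scale                                              # world scaling
M[:3, 3:4] = noise_translate.T                                        # world translation

p_orig = np.array([10.0, 4.0, -1.5])
p_aug = M[:3, :3] @ p_orig + M[:3, 3]                # original -> augmented
p_back = np.linalg.inv(M) @ np.append(p_aug, 1.0)    # augmented -> original
assert np.allclose(p_back[:3], p_orig)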
def prepare_data(self, data_dict):
"""
Args:
...@@ -165,6 +190,7 @@ class DatasetTemplate(torch_data.Dataset):
)
if 'calib' in data_dict:
data_dict['calib'] = calib
data_dict = self.set_lidar_aug_matrix(data_dict)
if data_dict.get('gt_boxes', None) is not None:
selected = common_utils.keep_arrays_by_name(data_dict['gt_names'], self.class_names)
data_dict['gt_boxes'] = data_dict['gt_boxes'][selected]
...@@ -287,6 +313,8 @@ class DatasetTemplate(torch_data.Dataset):
constant_values=pad_value)
points.append(points_pad)
ret[key] = np.stack(points, axis=0)
elif key in ['camera_imgs']:
ret[key] = torch.stack([torch.stack(imgs,dim=0) for imgs in val],dim=0)
else:
ret[key] = np.stack(val, axis=0)
except:
......
...@@ -8,6 +8,8 @@ from tqdm import tqdm
from ...ops.roiaware_pool3d import roiaware_pool3d_utils
from ...utils import common_utils
from ..dataset import DatasetTemplate
from pyquaternion import Quaternion
from PIL import Image
class NuScenesDataset(DatasetTemplate):
...@@ -17,6 +19,13 @@ class NuScenesDataset(DatasetTemplate):
dataset_cfg=dataset_cfg, class_names=class_names, training=training, root_path=root_path, logger=logger
)
self.infos = []
self.camera_config = self.dataset_cfg.get('CAMERA_CONFIG', None)
if self.camera_config is not None:
self.use_camera = self.camera_config.get('USE_CAMERA', True)
self.camera_image_config = self.camera_config.IMAGE
else:
self.use_camera = False
self.include_nuscenes_data(self.mode)
if self.training and self.dataset_cfg.get('BALANCED_RESAMPLING', False):
self.infos = self.balanced_infos_resampling(self.infos)
...@@ -108,6 +117,98 @@ class NuScenesDataset(DatasetTemplate):
points = np.concatenate((points, times), axis=1)
return points
def crop_image(self, input_dict):
W, H = input_dict["ori_shape"]
imgs = input_dict["camera_imgs"]
img_process_infos = []
crop_images = []
for img in imgs:
if self.training:
fH, fW = self.camera_image_config.FINAL_DIM
resize_lim = self.camera_image_config.RESIZE_LIM_TRAIN
resize = np.random.uniform(*resize_lim)
resize_dims = (int(W * resize), int(H * resize))
newW, newH = resize_dims
crop_h = newH - fH
crop_w = int(np.random.uniform(0, max(0, newW - fW)))
crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
else:
fH, fW = self.camera_image_config.FINAL_DIM
resize_lim = self.camera_image_config.RESIZE_LIM_TEST
resize = np.mean(resize_lim)
resize_dims = (int(W * resize), int(H * resize))
newW, newH = resize_dims
crop_h = newH - fH
crop_w = int(max(0, newW - fW) / 2)
crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
# resize and crop image
img = img.resize(resize_dims)
img = img.crop(crop)
crop_images.append(img)
img_process_infos.append([resize, crop, False, 0])
input_dict['img_process_infos'] = img_process_infos
input_dict['camera_imgs'] = crop_images
return input_dict
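# Worked example of the eval-time branch above. The config numbers are
# assumptions for illustration (FINAL_DIM=[256, 704], RESIZE_LIM_TEST=[0.5, 0.5]),
# not values read from the committed yaml.
import numpy as np

W, H = 1600, 900                                  # native NuScenes image size
fH, fW = 256, 704
resize = float(np.mean([0.5, 0.5]))               # eval uses the midpoint of RESIZE_LIM_TEST
newW, newH = int(W * resize), int(H * resize)     # 800, 450
crop_h = newH - fH                                # 194 -> drop the top of the image
crop_w = int(max(0, newW - fW) / 2)               # 48  -> center the horizontal crop
crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
print(crop)                                       # (48, 194, 752, 450): a 704x256 crop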
def load_camera_info(self, input_dict, info):
input_dict["image_paths"] = []
input_dict["lidar2camera"] = []
input_dict["lidar2image"] = []
input_dict["camera2ego"] = []
input_dict["camera_intrinsics"] = []
input_dict["camera2lidar"] = []
for _, camera_info in info["cams"].items():
input_dict["image_paths"].append(camera_info["data_path"])
# lidar to camera transform
lidar2camera_r = np.linalg.inv(camera_info["sensor2lidar_rotation"])
lidar2camera_t = (
camera_info["sensor2lidar_translation"] @ lidar2camera_r.T
)
lidar2camera_rt = np.eye(4).astype(np.float32)
lidar2camera_rt[:3, :3] = lidar2camera_r.T
lidar2camera_rt[3, :3] = -lidar2camera_t
input_dict["lidar2camera"].append(lidar2camera_rt.T)
# camera intrinsics
camera_intrinsics = np.eye(4).astype(np.float32)
camera_intrinsics[:3, :3] = camera_info["camera_intrinsics"]
input_dict["camera_intrinsics"].append(camera_intrinsics)
# lidar to image transform
lidar2image = camera_intrinsics @ lidar2camera_rt.T
input_dict["lidar2image"].append(lidar2image)
# camera to ego transform
camera2ego = np.eye(4).astype(np.float32)
camera2ego[:3, :3] = Quaternion(
camera_info["sensor2ego_rotation"]
).rotation_matrix
camera2ego[:3, 3] = camera_info["sensor2ego_translation"]
input_dict["camera2ego"].append(camera2ego)
# camera to lidar transform
camera2lidar = np.eye(4).astype(np.float32)
camera2lidar[:3, :3] = camera_info["sensor2lidar_rotation"]
camera2lidar[:3, 3] = camera_info["sensor2lidar_translation"]
input_dict["camera2lidar"].append(camera2lidar)
# read image
filename = input_dict["image_paths"]
images = []
for name in filename:
images.append(Image.open(str(self.root_path / name)))
input_dict["camera_imgs"] = images
input_dict["ori_shape"] = images[0].size
# resize and crop image
input_dict = self.crop_image(input_dict)
return input_dict
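# Projection sketch: the 4x4 'lidar2image' assembled above maps a homogeneous
# lidar-frame point to pixel coordinates via a perspective divide. The matrices
# below are synthetic (identity extrinsics, rough NuScenes-like intrinsics).
import numpy as np

lidar2camera = np.eye(4, dtype=np.float32)
camera_intrinsics = np.eye(4, dtype=np.float32)
camera_intrinsics[:3, :3] = np.array([[1266.0, 0.0, 816.0],
                                      [0.0, 1266.0, 491.0],
                                      [0.0, 0.0, 1.0]])
lidar2image = camera_intrinsics @ lidar2camera

p_lidar = np.array([2.0, 1.0, 20.0, 1.0])          # homogeneous point, depth along camera z
uvd = lidar2image @ p_lidar
u, v = uvd[0] / uvd[2], uvd[1] / uvd[2]            # divide by depth
print(u, v)                                        # pixel position in the full-resolution image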
def __len__(self):
if self._merge_all_iters_to_one_epoch:
return len(self.infos) * self.total_epochs
...@@ -137,6 +238,8 @@ class NuScenesDataset(DatasetTemplate):
'gt_names': info['gt_names'] if mask is None else info['gt_names'][mask],
'gt_boxes': info['gt_boxes'] if mask is None else info['gt_boxes'][mask]
})
if self.use_camera:
input_dict = self.load_camera_info(input_dict, info)
data_dict = self.prepare_data(data_dict=input_dict)
...@@ -251,7 +354,7 @@ class NuScenesDataset(DatasetTemplate):
pickle.dump(all_db_infos, f)
def create_nuscenes_info(version, data_path, save_path, max_sweeps=10, with_cam=False):
from nuscenes.nuscenes import NuScenes
from nuscenes.utils import splits
from . import nuscenes_utils
...@@ -283,7 +386,7 @@ def create_nuscenes_info(version, data_path, save_path, max_sweeps=10):
train_nusc_infos, val_nusc_infos = nuscenes_utils.fill_trainval_infos(
data_path=data_path, nusc=nusc, train_scenes=train_scenes, val_scenes=val_scenes,
test='test' in version, max_sweeps=max_sweeps, with_cam=with_cam
)
if version == 'v1.0-test':
...@@ -308,6 +411,7 @@ if __name__ == '__main__':
parser.add_argument('--cfg_file', type=str, default=None, help='specify the config of dataset')
parser.add_argument('--func', type=str, default='create_nuscenes_infos', help='')
parser.add_argument('--version', type=str, default='v1.0-trainval', help='')
parser.add_argument('--with_cam', action='store_true', default=False, help='use camera or not')
args = parser.parse_args()
if args.func == 'create_nuscenes_infos':
...@@ -319,6 +423,7 @@ if __name__ == '__main__':
data_path=ROOT_DIR / 'data' / 'nuscenes',
save_path=ROOT_DIR / 'data' / 'nuscenes',
max_sweeps=dataset_cfg.MAX_SWEEPS,
with_cam=args.with_cam
)
nuscenes_dataset = NuScenesDataset(
......
...@@ -247,9 +247,69 @@ def quaternion_yaw(q: Quaternion) -> float:
yaw = np.arctan2(v[1], v[0])
return yaw
def obtain_sensor2top(
nusc, sensor_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, sensor_type="lidar"
):
"""Obtain the info with RT matric from general sensor to Top LiDAR.
def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, max_sweeps=10): Args:
nusc (class): Dataset class in the nuScenes dataset.
sensor_token (str): Sample data token corresponding to the
specific sensor type.
l2e_t (np.ndarray): Translation from lidar to ego in shape (1, 3).
l2e_r_mat (np.ndarray): Rotation matrix from lidar to ego
in shape (3, 3).
e2g_t (np.ndarray): Translation from ego to global in shape (1, 3).
e2g_r_mat (np.ndarray): Rotation matrix from ego to global
in shape (3, 3).
sensor_type (str): Sensor to calibrate. Default: 'lidar'.
Returns:
sweep (dict): Sweep information after transformation.
"""
sd_rec = nusc.get("sample_data", sensor_token)
cs_record = nusc.get("calibrated_sensor", sd_rec["calibrated_sensor_token"])
pose_record = nusc.get("ego_pose", sd_rec["ego_pose_token"])
data_path = str(nusc.get_sample_data_path(sd_rec["token"]))
# if os.getcwd() in data_path: # path from lyftdataset is absolute path
# data_path = data_path.split(f"{os.getcwd()}/")[-1] # relative path
sweep = {
"data_path": data_path,
"type": sensor_type,
"sample_data_token": sd_rec["token"],
"sensor2ego_translation": cs_record["translation"],
"sensor2ego_rotation": cs_record["rotation"],
"ego2global_translation": pose_record["translation"],
"ego2global_rotation": pose_record["rotation"],
"timestamp": sd_rec["timestamp"],
}
l2e_r_s = sweep["sensor2ego_rotation"]
l2e_t_s = sweep["sensor2ego_translation"]
e2g_r_s = sweep["ego2global_rotation"]
e2g_t_s = sweep["ego2global_translation"]
# obtain the RT from sensor to Top LiDAR
# sweep->ego->global->ego'->lidar
l2e_r_s_mat = Quaternion(l2e_r_s).rotation_matrix
e2g_r_s_mat = Quaternion(e2g_r_s).rotation_matrix
R = (l2e_r_s_mat.T @ e2g_r_s_mat.T) @ (
np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T
)
T = (l2e_t_s @ e2g_r_s_mat.T + e2g_t_s) @ (
np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T
)
T -= (
e2g_t @ (np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T)
+ l2e_t @ np.linalg.inv(l2e_r_mat).T
).squeeze(0)
sweep["sensor2lidar_rotation"] = R.T # points @ R.T + T
sweep["sensor2lidar_translation"] = T
return sweep
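# Numerical sanity check of the closed-form R, T above: composing the chain
# sensor -> ego' -> global -> ego -> lidar step by step gives the same result
# as p_sensor @ R + T. All poses below are synthetic.
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

l2e_r_mat, l2e_t = rot_z(0.10), np.array([0.9, 0.0, 1.8])       # ref lidar -> ref ego
e2g_r_mat, e2g_t = rot_z(0.70), np.array([100.0, 50.0, 0.0])    # ref ego -> global
l2e_r_s_mat, l2e_t_s = rot_z(-0.30), np.array([1.5, 0.2, 1.6])  # sensor -> its own ego
e2g_r_s_mat, e2g_t_s = rot_z(0.72), np.array([101.0, 50.5, 0.0])

inv = np.linalg.inv
R = (l2e_r_s_mat.T @ e2g_r_s_mat.T) @ (inv(e2g_r_mat).T @ inv(l2e_r_mat).T)
T = (l2e_t_s @ e2g_r_s_mat.T + e2g_t_s) @ (inv(e2g_r_mat).T @ inv(l2e_r_mat).T)
T -= e2g_t @ (inv(e2g_r_mat).T @ inv(l2e_r_mat).T) + l2e_t @ inv(l2e_r_mat).T

p_s = np.array([3.0, -2.0, 0.5])
p_g = (p_s @ l2e_r_s_mat.T + l2e_t_s) @ e2g_r_s_mat.T + e2g_t_s      # sensor -> global
p_l = ((p_g - e2g_t) @ inv(e2g_r_mat).T - l2e_t) @ inv(l2e_r_mat).T  # global -> ref lidar
assert np.allclose(p_s @ R + T, p_l)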
def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, max_sweeps=10, with_cam=False):
train_nusc_infos = []
val_nusc_infos = []
progress_bar = tqdm.tqdm(total=len(nusc.sample), desc='create_info', dynamic_ncols=True)
...@@ -291,6 +351,34 @@ def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, m
'car_from_global': car_from_global,
'timestamp': ref_time,
}
if with_cam:
info['cams'] = dict()
l2e_r = ref_cs_rec["rotation"]
l2e_t = ref_cs_rec["translation"],  # trailing comma wraps the list in a tuple so it broadcasts as shape (1, 3) inside obtain_sensor2top (needed for the .squeeze(0) there)
e2g_r = ref_pose_rec["rotation"]
e2g_t = ref_pose_rec["translation"]
l2e_r_mat = Quaternion(l2e_r).rotation_matrix
e2g_r_mat = Quaternion(e2g_r).rotation_matrix
# obtain information for the 6 camera images of each frame
camera_types = [
"CAM_FRONT",
"CAM_FRONT_RIGHT",
"CAM_FRONT_LEFT",
"CAM_BACK",
"CAM_BACK_LEFT",
"CAM_BACK_RIGHT",
]
for cam in camera_types:
cam_token = sample["data"][cam]
cam_path, _, camera_intrinsics = nusc.get_sample_data(cam_token)
cam_info = obtain_sensor2top(
nusc, cam_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, cam
)
cam_info['data_path'] = Path(cam_info['data_path']).relative_to(data_path).__str__()
cam_info.update(camera_intrinsics=camera_intrinsics)
info["cams"].update({cam: cam_info})
sample_data_token = sample['data'][chan]
curr_sd_rec = nusc.get('sample_data', sample_data_token)
......
...@@ -2,7 +2,8 @@ from functools import partial
import numpy as np
from skimage import transform
import torch
import torchvision
from ...utils import box_utils, common_utils
tv = None
...@@ -228,6 +229,56 @@ class DataProcessor(object):
factors=(self.depth_downsample_factor, self.depth_downsample_factor)
)
return data_dict
def image_normalize(self, data_dict=None, config=None):
if data_dict is None:
return partial(self.image_normalize, config=config)
mean = config.mean
std = config.std
compose = torchvision.transforms.Compose(
[
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=mean, std=std),
]
)
data_dict["camera_imgs"] = [compose(img) for img in data_dict["camera_imgs"]]
return data_dict
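# Minimal usage sketch of the same ToTensor + Normalize pipeline on a dummy
# image. The mean/std values are common ImageNet statistics and are only an
# assumption about what the config provides.
import torchvision
from PIL import Image

compose = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
])
dummy = Image.new('RGB', (704, 256))          # (width, height)
print(compose(dummy).shape)                   # torch.Size([3, 256, 704])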
def image_calibrate(self,data_dict=None, config=None):
if data_dict is None:
return partial(self.image_calibrate, config=config)
img_process_infos = data_dict['img_process_infos']
transforms = []
for img_process_info in img_process_infos:
resize, crop, flip, rotate = img_process_info
rotation = torch.eye(2)
translation = torch.zeros(2)
# post-homography transformation
rotation *= resize
translation -= torch.Tensor(crop[:2])
if flip:
A = torch.Tensor([[-1, 0], [0, 1]])
b = torch.Tensor([crop[2] - crop[0], 0])
rotation = A.matmul(rotation)
translation = A.matmul(translation) + b
theta = rotate / 180 * np.pi
A = torch.Tensor(
[
[np.cos(theta), np.sin(theta)],
[-np.sin(theta), np.cos(theta)],
]
)
b = torch.Tensor([crop[2] - crop[0], crop[3] - crop[1]]) / 2
b = A.matmul(-b) + b
rotation = A.matmul(rotation)
translation = A.matmul(translation) + b
transform = torch.eye(4)
transform[:2, :2] = rotation
transform[:2, 3] = translation
transforms.append(transform.numpy())
data_dict["img_aug_matrix"] = transforms
return data_dict
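# Sketch of what 'img_aug_matrix' stores: a 4x4 mapping original-image pixel
# coordinates to coordinates in the resized/cropped image (flip and rotation
# disabled for clarity; the resize/crop values are illustrative).
import numpy as np
import torch

resize, crop = 0.5, (48, 194, 752, 450)
transform = torch.eye(4)
transform[:2, :2] = torch.eye(2) * resize
transform[:2, 3] = -torch.Tensor(crop[:2])
M = transform.numpy()

u, v = 800.0, 450.0                                # pixel in the original 1600x900 image
uv_aug = M[:2, :2] @ np.array([u, v]) + M[:2, 3]
print(uv_aug)                                      # [352. 31.] = [u*0.5 - 48, v*0.5 - 194]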
def forward(self, data_dict):
"""
......
...@@ -22,9 +22,11 @@ def build_network(model_cfg, num_class, dataset):
def load_data_to_gpu(batch_dict):
for key, val in batch_dict.items():
if key == 'camera_imgs':
batch_dict[key] = val.cuda()
elif not isinstance(val, np.ndarray):
continue
elif key in ['frame_id', 'metadata', 'calib', 'image_paths','ori_shape','img_process_infos']:
continue
elif key in ['images']:
batch_dict[key] = kornia.image_to_tensor(val).float().cuda().contiguous()
......
...@@ -46,7 +46,7 @@ class BaseBEVBackbone(nn.Module):
self.blocks.append(nn.Sequential(*cur_layers))
if len(upsample_strides) > 0:
stride = upsample_strides[idx]
if stride > 1 or (stride == 1 and not self.model_cfg.get('USE_CONV_FOR_NO_STRIDE', False)):
self.deblocks.append(nn.Sequential(
nn.ConvTranspose2d(
num_filters[idx], num_upsample_filters[idx],
......
from .convfuser import ConvFuser
__all__ = {
'ConvFuser':ConvFuser
}
import torch
from torch import nn
class ConvFuser(nn.Module):
def __init__(self,model_cfg) -> None:
super().__init__()
self.model_cfg = model_cfg
in_channel = self.model_cfg.IN_CHANNEL
out_channel = self.model_cfg.OUT_CHANNEL
self.conv = nn.Sequential(
nn.Conv2d(in_channel, out_channel, 3, padding=1, bias=False),
nn.BatchNorm2d(out_channel),
nn.ReLU(True)
)
def forward(self,batch_dict):
"""
Args:
batch_dict:
spatial_features_img (tensor): Bev features from image modality
spatial_features (tensor): Bev features from lidar modality
Returns:
batch_dict:
spatial_features (tensor): Bev features after multi-modal fusion
"""
img_bev = batch_dict['spatial_features_img']
lidar_bev = batch_dict['spatial_features']
cat_bev = torch.cat([img_bev,lidar_bev],dim=1)
mm_bev = self.conv(cat_bev)
batch_dict['spatial_features'] = mm_bev
return batch_dict
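# Minimal usage sketch for the fuser above with dummy BEV feature maps; the
# channel sizes and spatial resolution are illustrative, the real ones come
# from the BEVFusion model config.
import torch
from easydict import EasyDict

fuser = ConvFuser(EasyDict(IN_CHANNEL=80 + 256, OUT_CHANNEL=256))
batch_dict = {
    'spatial_features_img': torch.randn(2, 80, 180, 180),   # camera BEV features
    'spatial_features': torch.randn(2, 256, 180, 180),      # lidar BEV features
}
print(fuser(batch_dict)['spatial_features'].shape)          # torch.Size([2, 256, 180, 180])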
...@@ -30,11 +30,12 @@ def post_act_block(in_channels, out_channels, kernel_size, indice_key=None, stri
class SparseBasicBlock(spconv.SparseModule):
expansion = 1
def __init__(self, inplanes, planes, stride=1, bias=None, norm_fn=None, downsample=None, indice_key=None):
super(SparseBasicBlock, self).__init__()
assert norm_fn is not None
if bias is None:
bias = norm_fn is not None
self.conv1 = spconv.SubMConv3d(
inplanes, planes, kernel_size=3, stride=stride, padding=1, bias=bias, indice_key=indice_key
)
...@@ -184,6 +185,7 @@ class VoxelResBackBone8x(nn.Module):
def __init__(self, model_cfg, input_channels, grid_size, **kwargs):
super().__init__()
self.model_cfg = model_cfg
use_bias = self.model_cfg.get('USE_BIAS', None)
norm_fn = partial(nn.BatchNorm1d, eps=1e-3, momentum=0.01)
self.sparse_shape = grid_size[::-1] + [1, 0, 0]
...@@ -196,29 +198,29 @@ class VoxelResBackBone8x(nn.Module):
block = post_act_block
self.conv1 = spconv.SparseSequential(
SparseBasicBlock(16, 16, bias=use_bias, norm_fn=norm_fn, indice_key='res1'),
SparseBasicBlock(16, 16, bias=use_bias, norm_fn=norm_fn, indice_key='res1'),
)
self.conv2 = spconv.SparseSequential(
# [1600, 1408, 41] <- [800, 704, 21]
block(16, 32, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv2', conv_type='spconv'),
SparseBasicBlock(32, 32, bias=use_bias, norm_fn=norm_fn, indice_key='res2'),
SparseBasicBlock(32, 32, bias=use_bias, norm_fn=norm_fn, indice_key='res2'),
)
self.conv3 = spconv.SparseSequential(
# [800, 704, 21] <- [400, 352, 11]
block(32, 64, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv3', conv_type='spconv'),
SparseBasicBlock(64, 64, bias=use_bias, norm_fn=norm_fn, indice_key='res3'),
SparseBasicBlock(64, 64, bias=use_bias, norm_fn=norm_fn, indice_key='res3'),
)
self.conv4 = spconv.SparseSequential(
# [400, 352, 11] <- [200, 176, 5]
block(64, 128, 3, norm_fn=norm_fn, stride=2, padding=(0, 1, 1), indice_key='spconv4', conv_type='spconv'),
SparseBasicBlock(128, 128, bias=use_bias, norm_fn=norm_fn, indice_key='res4'),
SparseBasicBlock(128, 128, bias=use_bias, norm_fn=norm_fn, indice_key='res4'),
)
last_pad = 0
......
from .swin import SwinTransformer
__all__ = {
'SwinTransformer':SwinTransformer,
}
from .generalized_lss import GeneralizedLSSFPN
__all__ = {
'GeneralizedLSSFPN':GeneralizedLSSFPN,
}
import torch
import torch.nn as nn
import torch.nn.functional as F
from ...model_utils.basic_block_2d import BasicBlock2D
class GeneralizedLSSFPN(nn.Module):
"""
This module implements FPN, which creates pyramid features built on top of some input feature maps.
This code is adapted from https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/necks/fpn.py with minimal modifications.
"""
def __init__(self, model_cfg):
super().__init__()
self.model_cfg = model_cfg
in_channels = self.model_cfg.IN_CHANNELS
out_channels = self.model_cfg.OUT_CHANNELS
num_ins = len(in_channels)
num_outs = self.model_cfg.NUM_OUTS
start_level = self.model_cfg.START_LEVEL
end_level = self.model_cfg.END_LEVEL
self.in_channels = in_channels
if end_level == -1:
self.backbone_end_level = num_ins - 1
else:
self.backbone_end_level = end_level
assert end_level <= len(in_channels)
assert num_outs == end_level - start_level
self.start_level = start_level
self.end_level = end_level
self.lateral_convs = nn.ModuleList()
self.fpn_convs = nn.ModuleList()
for i in range(self.start_level, self.backbone_end_level):
l_conv = BasicBlock2D(
in_channels[i] + (in_channels[i + 1] if i == self.backbone_end_level - 1 else out_channels),
out_channels, kernel_size=1, bias = False
)
fpn_conv = BasicBlock2D(out_channels,out_channels, kernel_size=3, padding=1, bias = False)
self.lateral_convs.append(l_conv)
self.fpn_convs.append(fpn_conv)
def forward(self, batch_dict):
"""
Args:
batch_dict:
image_features (list[tensor]): Multi-stage features from image backbone.
Returns:
batch_dict:
image_fpn (list(tensor)): FPN features.
"""
# upsample -> cat -> conv1x1 -> conv3x3
inputs = batch_dict['image_features']
assert len(inputs) == len(self.in_channels)
# build laterals
laterals = [inputs[i + self.start_level] for i in range(len(inputs))]
# build top-down path
used_backbone_levels = len(laterals) - 1
for i in range(used_backbone_levels - 1, -1, -1):
x = F.interpolate(
laterals[i + 1],
size=laterals[i].shape[2:],
mode='bilinear', align_corners=False,
)
laterals[i] = torch.cat([laterals[i], x], dim=1)
laterals[i] = self.lateral_convs[i](laterals[i])
laterals[i] = self.fpn_convs[i](laterals[i])
# build outputs
outs = [laterals[i] for i in range(used_backbone_levels)]
batch_dict['image_fpn'] = tuple(outs)
return batch_dict
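# Shape walkthrough for the top-down fusion above, using an illustrative config
# (three backbone stages with 192/384/768 channels fused down to 256; the real
# values come from the model yaml). Assumes the GeneralizedLSSFPN class above
# is importable in the current scope.
import torch
from easydict import EasyDict

cfg = EasyDict(IN_CHANNELS=[192, 384, 768], OUT_CHANNELS=256,
               NUM_OUTS=2, START_LEVEL=0, END_LEVEL=2)
fpn = GeneralizedLSSFPN(cfg)
batch_dict = {'image_features': [
    torch.randn(6, 192, 32, 88),   # stride-8 map of a 256x704 input
    torch.randn(6, 384, 16, 44),   # stride-16
    torch.randn(6, 768, 8, 22),    # stride-32
]}
outs = fpn(batch_dict)['image_fpn']
print([tuple(o.shape) for o in outs])   # [(6, 256, 32, 88), (6, 256, 16, 44)]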
...@@ -6,6 +6,7 @@ from .point_head_simple import PointHeadSimple
from .point_intra_part_head import PointIntraPartOffsetHead
from .center_head import CenterHead
from .voxelnext_head import VoxelNeXtHead
from .transfusion_head import TransFusionHead
__all__ = {
'AnchorHeadTemplate': AnchorHeadTemplate,
...@@ -16,4 +17,5 @@ __all__ = {
'AnchorHeadMulti': AnchorHeadMulti,
'CenterHead': CenterHead,
'VoxelNeXtHead': VoxelNeXtHead,
'TransFusionHead': TransFusionHead,
}
import torch
from scipy.optimize import linear_sum_assignment
from pcdet.ops.iou3d_nms import iou3d_nms_cuda
def height_overlaps(boxes1, boxes2):
"""
Calculate height overlaps of two boxes.
"""
boxes1_top_height = (boxes1[:,2]+ boxes1[:,5]).view(-1, 1)
boxes1_bottom_height = boxes1[:,2].view(-1, 1)
boxes2_top_height = (boxes2[:,2]+boxes2[:,5]).view(1, -1)
boxes2_bottom_height = boxes2[:,2].view(1, -1)
heighest_of_bottom = torch.max(boxes1_bottom_height, boxes2_bottom_height)
lowest_of_top = torch.min(boxes1_top_height, boxes2_top_height)
overlaps_h = torch.clamp(lowest_of_top - heighest_of_bottom, min=0)
return overlaps_h
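# Worked example for height_overlaps: box 1 spans z in [0, 2] and box 2 spans
# z in [1, 4], so their vertical overlap is 1. Columns follow the layout the
# function above assumes (index 2 = bottom z, index 5 = height).
import torch

b1 = torch.tensor([[0.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0]])
b2 = torch.tensor([[0.5, 0.0, 1.0, 4.0, 2.0, 3.0, 0.0]])
print(height_overlaps(b1, b2))   # tensor([[1.]])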
def overlaps(boxes1, boxes2):
"""
Calculate 3D overlaps of two boxes.
"""
rows = len(boxes1)
cols = len(boxes2)
if rows * cols == 0:
return boxes1.new(rows, cols)
# height overlap
overlaps_h = height_overlaps(boxes1, boxes2)
boxes1_bev = boxes1[:,:7]
boxes2_bev = boxes2[:,:7]
# bev overlap
overlaps_bev = boxes1_bev.new_zeros(
(boxes1_bev.shape[0], boxes2_bev.shape[0])
).cuda() # (N, M)
iou3d_nms_cuda.boxes_overlap_bev_gpu(
boxes1_bev.contiguous().cuda(), boxes2_bev.contiguous().cuda(), overlaps_bev
)
# 3d overlaps
overlaps_3d = overlaps_bev.to(boxes1.device) * overlaps_h
volume1 = (boxes1[:, 3] * boxes1[:, 4] * boxes1[:, 5]).view(-1, 1)
volume2 = (boxes2[:, 3] * boxes2[:, 4] * boxes2[:, 5]).view(1, -1)
iou3d = overlaps_3d / torch.clamp(volume1 + volume2 - overlaps_3d, min=1e-8)
return iou3d
class HungarianAssigner3D:
def __init__(self, cls_cost, reg_cost, iou_cost):
self.cls_cost = cls_cost
self.reg_cost = reg_cost
self.iou_cost = iou_cost
def focal_loss_cost(self, cls_pred, gt_labels):
weight = self.cls_cost.get('weight', 0.15)
alpha = self.cls_cost.get('alpha', 0.25)
gamma = self.cls_cost.get('gamma', 2.0)
eps = self.cls_cost.get('eps', 1e-12)
cls_pred = cls_pred.sigmoid()
neg_cost = -(1 - cls_pred + eps).log() * (
1 - alpha) * cls_pred.pow(gamma)
pos_cost = -(cls_pred + eps).log() * alpha * (
1 - cls_pred).pow(gamma)
cls_cost = pos_cost[:, gt_labels] - neg_cost[:, gt_labels]
return cls_cost * weight
def bevbox_cost(self, bboxes, gt_bboxes, point_cloud_range):
weight = self.reg_cost.get('weight', 0.25)
pc_start = bboxes.new(point_cloud_range[0:2])
pc_range = bboxes.new(point_cloud_range[3:5]) - bboxes.new(point_cloud_range[0:2])
# normalize the box center to [0, 1]
normalized_bboxes_xy = (bboxes[:, :2] - pc_start) / pc_range
normalized_gt_bboxes_xy = (gt_bboxes[:, :2] - pc_start) / pc_range
reg_cost = torch.cdist(normalized_bboxes_xy, normalized_gt_bboxes_xy, p=1)
return reg_cost * weight
def iou3d_cost(self, bboxes, gt_bboxes):
iou = overlaps(bboxes, gt_bboxes)
weight = self.iou_cost.get('weight', 0.25)
iou_cost = - iou
return iou_cost * weight, iou
def assign(self, bboxes, gt_bboxes, gt_labels, cls_pred, point_cloud_range):
num_gts, num_bboxes = gt_bboxes.size(0), bboxes.size(0)
# 1. assign -1 by default
assigned_gt_inds = bboxes.new_full((num_bboxes,), -1, dtype=torch.long)
assigned_labels = bboxes.new_full((num_bboxes,), -1, dtype=torch.long)
if num_gts == 0 or num_bboxes == 0:
# No ground truth or boxes, return empty assignment
max_overlaps = bboxes.new_zeros((num_bboxes, ))
if num_gts == 0:
# No ground truth, assign all to background
assigned_gt_inds[:] = 0
return assigned_gt_inds, max_overlaps
# 2. compute the weighted costs
cls_cost = self.focal_loss_cost(cls_pred[0].T, gt_labels)
reg_cost = self.bevbox_cost(bboxes, gt_bboxes, point_cloud_range)
iou_cost, iou = self.iou3d_cost(bboxes, gt_bboxes)
# weighted sum of above three costs
cost = cls_cost + reg_cost + iou_cost
# 3. do Hungarian matching on CPU using linear_sum_assignment
cost = cost.detach().cpu()
matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
matched_row_inds = torch.from_numpy(matched_row_inds).to(bboxes.device)
matched_col_inds = torch.from_numpy(matched_col_inds).to(bboxes.device)
# 4. assign backgrounds and foregrounds
# assign all indices to backgrounds first
assigned_gt_inds[:] = 0
# assign foregrounds based on matching results
assigned_gt_inds[matched_row_inds] = matched_col_inds + 1
assigned_labels[matched_row_inds] = gt_labels[matched_col_inds]
max_overlaps = torch.zeros_like(iou.max(1).values)
max_overlaps[matched_row_inds] = iou[matched_row_inds, matched_col_inds]
return assigned_gt_inds, max_overlaps
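# Toy illustration of the matching step only: the full assign() needs the
# compiled iou3d_nms_cuda op and a GPU, but the Hungarian solve itself is a
# plain CPU call on the summed cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.9, 0.1, 0.8],    # rows: 4 predicted boxes
                 [0.2, 0.7, 0.6],    # cols: 3 ground-truth boxes
                 [0.5, 0.4, 0.05],
                 [0.3, 0.9, 0.7]])
rows, cols = linear_sum_assignment(cost)
print(rows, cols)   # [0 1 2] [1 0 2]; prediction 3 stays assigned to background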