"vscode:/vscode.git/clone" did not exist on "f71c4308c62b97658bdef8dd516ae958b613f38f"
Unverified commit 4d77b4c8, authored by Jingwei Zhang and committed by GitHub

[Feature] Support BEVFusion in `projects/` (#2236)

* add bevfusion models

* refactor

* build successfully

* update ImageAug3D

* support inference

* update the format of final bboxes

* add new loading func

* align test precision

* polish docstring

* refactor transformer decoder

* polish code

* fix table in readme

* fix table in readme

* fix table in readme

* update pre-commit-config

* minor changes

* revert the changes of file_client_args in LoadAnnotation3D

* remove unnecessary functions in BEVFusion

* fix loading bug

* fix docstring
parent c6a8eb1f
@@ -193,10 +193,10 @@ class LoadMultiViewImageFromFiles(BaseTransform):
# unravel to list, see `DefaultFormatBundle` in formating.py
# which will transpose each image separately and then stack into array
results['img'] = [img[..., i] for i in range(img.shape[-1])]
- results['img_shape'] = img.shape
- results['ori_shape'] = img.shape
+ results['img_shape'] = img.shape[:2]
+ results['ori_shape'] = img.shape[:2]
# Set initial values for default meta_keys
- results['pad_shape'] = img.shape
+ results['pad_shape'] = img.shape[:2]
if self.set_default_scale:
results['scale_factor'] = 1.0
num_channels = 1 if len(img.shape) < 3 else img.shape[2]
@@ -297,9 +297,13 @@ class LoadPointsFromMultiSweeps(BaseTransform):
test_mode: bool = False) -> None:
self.load_dim = load_dim
self.sweeps_num = sweeps_num
+ if isinstance(use_dim, int):
+     use_dim = list(range(use_dim))
+ assert max(use_dim) < load_dim, \
+     f'Expect all used dimensions < {load_dim}, got {use_dim}'
self.use_dim = use_dim
self.file_client_args = file_client_args.copy()
- self.file_client = None
+ self.file_client = mmengine.FileClient(**self.file_client_args)
self.pad_empty_sweeps = pad_empty_sweeps
self.remove_close = remove_close
self.test_mode = test_mode
@@ -761,6 +765,7 @@ class LoadAnnotations3D(LoadAnnotations):
self.with_mask_3d = with_mask_3d
self.with_seg_3d = with_seg_3d
self.seg_3d_dtype = seg_3d_dtype
+ self.file_client = None

def _load_bboxes_3d(self, results: dict) -> dict:
"""Private function to move the 3D bounding box annotation from
@@ -6,11 +6,12 @@ from .misc import replace_ceph_backend
from .setup_env import register_all_modules, setup_multi_processes
from .typing_utils import (ConfigType, InstanceList, MultiConfig,
OptConfigType, OptInstanceList, OptMultiConfig,
- OptSamplingResultList)
+ OptSampleList, OptSamplingResultList)
__all__ = [
'collect_env', 'setup_multi_processes', 'compat_cfg',
'register_all_modules', 'array_converter', 'ArrayConverter', 'ConfigType',
'OptConfigType', 'MultiConfig', 'OptMultiConfig', 'InstanceList',
- 'OptInstanceList', 'OptSamplingResultList', 'replace_ceph_backend'
+ 'OptInstanceList', 'OptSamplingResultList', 'replace_ceph_backend',
+ 'OptSampleList'
]
@@ -23,3 +23,4 @@ SamplingResultList = List[SamplingResult]
OptSamplingResultList = Optional[SamplingResultList]
SampleList = List[Det3DDataSample]
+ OptSampleList = Optional[SampleList]
# BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
> [BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation](https://arxiv.org/abs/2205.13542)
<!-- [ALGORITHM] -->
## Abstract
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost. Code to reproduce our
results is available at https://github.com/mit-han-lab/bevfusion.
<div align=center>
<img src="https://user-images.githubusercontent.com/34888372/215313913-4b43f8a1-e2e2-49ba-b631-992155351922.png" width="800"/>
</div>
## Introduction
We implement BEVFusion and provide the results and pretrained checkpoints on the NuScenes dataset.
## Usage
<!-- For a typical model, this section should contain the commands for training and testing. You are also suggested to dump your environment specification to env.yml by `conda env export > env.yml`. -->
### Compiling operations on CUDA
**Note** that the voxelization OP in the original implementation of `BEVFusion` is different from the one in MMCV. If you want to use the original pretrained model [here](https://github.com/mit-han-lab/bevfusion/blob/main/README.md), you need to compile and use the original voxelization OP shipped with this project:
```bash
python projects/BEVFusion/setup.py develop
```
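If the compilation succeeds, the custom ops should be importable from Python. As an optional sanity check (a sketch that assumes the ops live under `projects/BEVFusion/bevfusion/ops`, matching the imports in this project; adjust the path if your checkout differs):

```bash
python -c "from projects.BEVFusion.bevfusion.ops import Voxelization, bev_pool; print('BEVFusion ops built successfully')"
```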
### Training commands
In MMDetection3D's root directory, run the following command to train the model:
```bash
python tools/train.py projects/BEVFusion/configs/bevfusion_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py
```
For multi-GPU training, run:
```bash
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=${NUM_GPUS} --master_port=29506 --master_addr="127.0.0.1" tools/train.py projects/BEVFusion/configs/bevfusion_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py
```
### Testing commands
In MMDetection3D's root directory, run the following command to test the model:
```bash
python tools/test.py projects/BEVFusion/configs/bevfusion_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py ${CHECKPOINT_PATH}
```
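For multi-GPU testing, the same `torch.distributed.launch` pattern as in the training section should work (a sketch; it additionally assumes MMDetection3D's standard `--launcher pytorch` option):

```bash
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=${NUM_GPUS} --master_port=29506 --master_addr="127.0.0.1" tools/test.py projects/BEVFusion/configs/bevfusion_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py ${CHECKPOINT_PATH} --launcher pytorch
```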
## Results and models
### NuScenes
| Backbone | Voxel type (voxel size) | NMS | Mem (GB) | Inf time (fps) | NDS | mAP | Download |
| :-----------------------------------------------------------------------------: | :---------------------: | :-: | :------: | :------------: | :---: | :---: | :------------------------------------------------------------------------------------------------------: |
| [SECFPN](./configs/bevfusion_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py) | voxel (0.075) | × | - | - | 71.62 | 68.77 | [converted_model](https://drive.google.com/file/d/1QkvbYDk4G2d6SZoeJqish13qSyXA4lp3/view?usp=share_link) |
## Citation
```latex
@inproceedings{liu2022bevfusion,
title={BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation},
author={Liu, Zhijian and Tang, Haotian and Amini, Alexander and Yang, Xingyu and Mao, Huizi and Rus, Daniela and Han, Song},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2023}
}
```
## Checklist
<!-- Here is a checklist illustrating a usual development workflow of a successful project, and also serves as an overview of this project's progress. The PIC (person in charge) or contributors of this project should check all the items that they believe have been finished, which will further be verified by codebase maintainers via a PR.
OpenMMLab's maintainer will review the code to ensure the project's quality. Reaching the first milestone means that this project satisfies the minimum requirement of being merged into 'projects/'. But this project is only eligible to become a part of the core package upon attaining the last milestone.
Note that keeping this section up-to-date is crucial not only for this project's developers but also for the entire community, since there might be some other contributors joining this project and deciding their starting point from this list. It also helps maintainers accurately estimate time and effort on further code polishing, if needed.
A project does not necessarily have to be finished in a single PR, but it's essential for the project to at least reach the first milestone in its very first PR. -->
- [x] Milestone 1: PR-ready, and acceptable to be one of the `projects/`.
- [x] Finish the code
<!-- The code's design shall follow existing interfaces and convention. For example, each model component should be registered into `mmdet3d.registry.MODELS` and configurable via a config file. -->
- [x] Basic docstrings & proper citation
<!-- Each major object should contain a docstring, describing its functionality and arguments. If you have adapted the code from other open-source projects, don't forget to cite the source project in docstring and make sure your behavior is not against its license. Typically, we do not accept any code snippet under GPL license. [A Short Guide to Open Source Licenses](https://medium.com/nationwide-technology/a-short-guide-to-open-source-licenses-cf5b1c329edd) -->
- [x] Test-time correctness
<!-- If you are reproducing the result from a paper, make sure your model's inference-time performance matches that in the original paper. The weights usually could be obtained by simply renaming the keys in the official pre-trained weights. This test could be skipped though, if you are able to prove the training-time correctness and check the second milestone. -->
- [x] A full README
<!-- As this template does. -->
- [ ] Milestone 2: Indicates a successful model implementation.
- [ ] Training-time correctness
<!-- If you are reproducing the result from a paper, checking this item means that you should have trained your model from scratch based on the original paper's specification and verified that the final result matches the report within a minor error range. -->
- [ ] Milestone 3: Good to be a part of our core package!
- [ ] Type hints and docstrings
<!-- Ideally *all* the methods should have [type hints](https://www.pythontutorial.net/python-basics/python-type-hints/) and [docstrings](https://google.github.io/styleguide/pyguide.html#381-docstrings). [Example](https://github.com/open-mmlab/mmdetection3d/blob/dev-1.x/mmdet3d/models/detectors/fcos_mono3d.py) -->
- [ ] Unit tests
<!-- Unit tests for each module are required. [Example](https://github.com/open-mmlab/mmdetection3d/blob/dev-1.x/tests/test_models/test_dense_heads/test_fcos_mono3d_head.py) -->
- [ ] Code polishing
<!-- Refactor your code according to reviewer's comment. -->
- [ ] Metafile.yml
<!-- It will be parsed by MIM and Inferencer. [Example](https://github.com/open-mmlab/mmdetection3d/blob/dev-1.x/configs/fcos3d/metafile.yml) -->
- [ ] Move your modules into the core package following the codebase's file hierarchy structure.
<!-- In particular, you may have to refactor this README into a standard one. [Example](/configs/textdet/dbnet/README.md) -->
from .bevfusion import BEVFusion
from .bevfusion_necks import GeneralizedLSSFPN
from .depth_lss import DepthLSSTransform
from .loading import BEVLoadMultiViewImageFromFiles
from .sparse_encoder import BEVFusionSparseEncoder
from .transformer import TransformerDecoderLayer
from .transforms_3d import GridMask, ImageAug3D
from .transfusion_head import ConvFuser, TransFusionHead
from .utils import (BBoxBEVL1Cost, HeuristicAssigner3D, HungarianAssigner3D,
IoU3DCost)
__all__ = [
'BEVFusion', 'TransFusionHead', 'ConvFuser', 'ImageAug3D', 'GridMask',
'GeneralizedLSSFPN', 'HungarianAssigner3D', 'BBoxBEVL1Cost', 'IoU3DCost',
'HeuristicAssigner3D', 'DepthLSSTransform',
'BEVLoadMultiViewImageFromFiles', 'BEVFusionSparseEncoder',
'TransformerDecoderLayer'
]
from typing import Dict, List, Optional
import numpy as np
import torch
from torch import Tensor
from torch.nn import functional as F
from mmdet3d.models import Base3DDetector
from mmdet3d.registry import MODELS
from mmdet3d.structures import Det3DDataSample
from mmdet3d.utils import OptConfigType, OptMultiConfig, OptSampleList
from .ops import Voxelization
@MODELS.register_module()
class BEVFusion(Base3DDetector):
def __init__(
self,
data_preprocessor: OptConfigType = None,
pts_voxel_encoder: Optional[dict] = None,
pts_middle_encoder: Optional[dict] = None,
fusion_layer: Optional[dict] = None,
img_backbone: Optional[dict] = None,
pts_backbone: Optional[dict] = None,
vtransform: Optional[dict] = None,
img_neck: Optional[dict] = None,
pts_neck: Optional[dict] = None,
bbox_head: Optional[dict] = None,
init_cfg: OptMultiConfig = None,
seg_head: Optional[dict] = None,
**kwargs,
) -> None:
voxelize_cfg = data_preprocessor.pop('voxelize_cfg')
super().__init__(
data_preprocessor=data_preprocessor, init_cfg=init_cfg)
self.voxelize_reduce = voxelize_cfg.pop('voxelize_reduce')
self.pts_voxel_layer = Voxelization(**voxelize_cfg)
self.pts_voxel_encoder = MODELS.build(pts_voxel_encoder)
self.img_backbone = MODELS.build(img_backbone)
self.img_neck = MODELS.build(img_neck)
self.vtransform = MODELS.build(vtransform)
self.pts_middle_encoder = MODELS.build(pts_middle_encoder)
self.fusion_layer = MODELS.build(fusion_layer)
self.pts_backbone = MODELS.build(pts_backbone)
self.pts_neck = MODELS.build(pts_neck)
self.bbox_head = MODELS.build(bbox_head)
# hard-coded here because we use the checkpoint converted from the original
# implementation of `BEVFusion`
self.use_converted_checkpoint = True
self.init_weights()
def _forward(self,
batch_inputs: Tensor,
batch_data_samples: OptSampleList = None):
"""Network forward process.
Usually includes backbone, neck and head forward without any post-
processing.
"""
pass
def init_weights(self) -> None:
if self.img_backbone is not None:
self.img_backbone.init_weights()
@property
def with_bbox_head(self):
"""bool: Whether the detector has a box head."""
return hasattr(self, 'bbox_head') and self.bbox_head is not None
@property
def with_seg_head(self):
"""bool: Whether the detector has a segmentation head.
"""
return hasattr(self, 'seg_head') and self.seg_head is not None
def extract_img_feat(
self,
x,
points,
lidar2image,
camera_intrinsics,
camera2lidar,
img_aug_matrix,
lidar_aug_matrix,
img_metas,
) -> torch.Tensor:
B, N, C, H, W = x.size()
x = x.view(B * N, C, H, W)
x = self.img_backbone(x)
x = self.img_neck(x)
if not isinstance(x, torch.Tensor):
x = x[0]
BN, C, H, W = x.size()
x = x.view(B, int(BN / B), C, H, W)
x = self.vtransform(
x,
points,
lidar2image,
camera_intrinsics,
camera2lidar,
img_aug_matrix,
lidar_aug_matrix,
img_metas,
)
return x
def extract_pts_feat(self, batch_inputs_dict) -> torch.Tensor:
points = batch_inputs_dict['points']
feats, coords, sizes = self.voxelize(points)
batch_size = coords[-1, 0] + 1
x = self.pts_middle_encoder(feats, coords, batch_size)
return x
@torch.no_grad()
def voxelize(self, points):
feats, coords, sizes = [], [], []
for k, res in enumerate(points):
ret = self.pts_voxel_layer(res)
if len(ret) == 3:
# hard voxelize
f, c, n = ret
else:
assert len(ret) == 2
f, c = ret
n = None
feats.append(f)
coords.append(F.pad(c, (1, 0), mode='constant', value=k))
if n is not None:
sizes.append(n)
feats = torch.cat(feats, dim=0)
coords = torch.cat(coords, dim=0)
if len(sizes) > 0:
sizes = torch.cat(sizes, dim=0)
if self.voxelize_reduce:
feats = feats.sum(
dim=1, keepdim=False) / sizes.type_as(feats).view(-1, 1)
feats = feats.contiguous()
return feats, coords, sizes
def predict(self, batch_inputs_dict: Dict[str, Optional[Tensor]],
batch_data_samples: List[Det3DDataSample],
**kwargs) -> List[Det3DDataSample]:
"""Forward of testing.
Args:
batch_inputs_dict (dict): The model input dict which include
'points' keys.
- points (list[torch.Tensor]): Point cloud of each sample.
batch_data_samples (List[:obj:`Det3DDataSample`]): The Data
Samples. It usually includes information such as
`gt_instance_3d`.
Returns:
list[:obj:`Det3DDataSample`]: Detection results of the
input sample. Each Det3DDataSample usually contain
'pred_instances_3d'. And the ``pred_instances_3d`` usually
contains following keys.
- scores_3d (Tensor): Classification scores, has a shape
(num_instances, )
- labels_3d (Tensor): Labels of bboxes, has a shape
(num_instances, ).
- bbox_3d (:obj:`BaseInstance3DBoxes`): Prediction of bboxes,
contains a tensor with shape (num_instances, 7).
"""
batch_input_metas = [item.metainfo for item in batch_data_samples]
feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
if self.with_bbox_head:
outputs = self.bbox_head.predict(feats, batch_input_metas)
if self.use_converted_checkpoint:
outputs[0]['bboxes_3d'].tensor[:, 6] = -outputs[0][
'bboxes_3d'].tensor[:, 6] - np.pi / 2
outputs[0]['bboxes_3d'].tensor[:, 3:5] = outputs[0][
'bboxes_3d'].tensor[:, [4, 3]]
res = self.add_pred_to_datasample(batch_data_samples, outputs)
return res
def extract_feat(
self,
batch_inputs_dict,
batch_input_metas,
**kwargs,
):
imgs = batch_inputs_dict.get('imgs', None)
points = batch_inputs_dict.get('points', None)
lidar2image, camera_intrinsics, camera2lidar = [], [], []
img_aug_matrix, lidar_aug_matrix = [], []
for i, meta in enumerate(batch_input_metas):
lidar2image.append(meta['lidar2img'])
camera_intrinsics.append(meta['cam2img'])
camera2lidar.append(meta['cam2lidar'])
img_aug_matrix.append(meta.get('img_aug_matrix', np.eye(4)))
lidar_aug_matrix.append(meta.get('lidar_aug_matrix', np.eye(4)))
lidar2image = imgs.new_tensor(np.asarray(lidar2image))
camera_intrinsics = imgs.new_tensor(np.array(camera_intrinsics))
camera2lidar = imgs.new_tensor(np.asarray(camera2lidar))
img_aug_matrix = imgs.new_tensor(np.asarray(img_aug_matrix))
lidar_aug_matrix = imgs.new_tensor(np.asarray(lidar_aug_matrix))
img_feature = self.extract_img_feat(imgs, points, lidar2image,
camera_intrinsics, camera2lidar,
img_aug_matrix, lidar_aug_matrix,
batch_input_metas)
pts_feature = self.extract_pts_feat(batch_inputs_dict)
features = [img_feature, pts_feature]
if self.fusion_layer is not None:
x = self.fusion_layer(features)
else:
assert len(features) == 1, features
x = features[0]
x = self.pts_backbone(x)
x = self.pts_neck(x)
return x
def loss(self, batch_inputs_dict: Dict[str, Optional[Tensor]],
batch_data_samples: List[Det3DDataSample],
**kwargs) -> List[Det3DDataSample]:
pass
# modified from https://github.com/mit-han-lab/bevfusion
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import ConvModule
from mmengine.model import BaseModule
from mmdet3d.registry import MODELS
@MODELS.register_module()
class GeneralizedLSSFPN(BaseModule):
def __init__(
self,
in_channels,
out_channels,
num_outs,
start_level=0,
end_level=-1,
no_norm_on_lateral=False,
conv_cfg=None,
norm_cfg=dict(type='BN2d'),
act_cfg=dict(type='ReLU'),
upsample_cfg=dict(mode='bilinear', align_corners=True),
) -> None:
super().__init__()
assert isinstance(in_channels, list)
self.in_channels = in_channels
self.out_channels = out_channels
self.num_ins = len(in_channels)
self.num_outs = num_outs
self.no_norm_on_lateral = no_norm_on_lateral
self.fp16_enabled = False
self.upsample_cfg = upsample_cfg.copy()
if end_level == -1:
self.backbone_end_level = self.num_ins - 1
# assert num_outs >= self.num_ins - start_level
else:
# if end_level < inputs, no extra level is allowed
self.backbone_end_level = end_level
assert end_level <= len(in_channels)
assert num_outs == end_level - start_level
self.start_level = start_level
self.end_level = end_level
self.lateral_convs = nn.ModuleList()
self.fpn_convs = nn.ModuleList()
for i in range(self.start_level, self.backbone_end_level):
l_conv = ConvModule(
in_channels[i] +
(in_channels[i + 1] if i == self.backbone_end_level -
1 else out_channels),
out_channels,
1,
conv_cfg=conv_cfg,
norm_cfg=norm_cfg if not self.no_norm_on_lateral else None,
act_cfg=act_cfg,
inplace=False,
)
fpn_conv = ConvModule(
out_channels,
out_channels,
3,
padding=1,
conv_cfg=conv_cfg,
norm_cfg=norm_cfg,
act_cfg=act_cfg,
inplace=False,
)
self.lateral_convs.append(l_conv)
self.fpn_convs.append(fpn_conv)
def forward(self, inputs):
"""Forward function."""
# upsample -> cat -> conv1x1 -> conv3x3
assert len(inputs) == len(self.in_channels)
# build laterals
laterals = [inputs[i + self.start_level] for i in range(len(inputs))]
# build top-down path
used_backbone_levels = len(laterals) - 1
for i in range(used_backbone_levels - 1, -1, -1):
x = F.interpolate(
laterals[i + 1],
size=laterals[i].shape[2:],
**self.upsample_cfg,
)
laterals[i] = torch.cat([laterals[i], x], dim=1)
laterals[i] = self.lateral_convs[i](laterals[i])
laterals[i] = self.fpn_convs[i](laterals[i])
# build outputs
outs = [laterals[i] for i in range(used_backbone_levels)]
return tuple(outs)
# modified from https://github.com/mit-han-lab/bevfusion
from typing import Tuple
import torch
from torch import nn
from mmdet3d.registry import MODELS
from .ops import bev_pool
def gen_dx_bx(xbound, ybound, zbound):
dx = torch.Tensor([row[2] for row in [xbound, ybound, zbound]])
bx = torch.Tensor(
[row[0] + row[2] / 2.0 for row in [xbound, ybound, zbound]])
nx = torch.LongTensor([(row[1] - row[0]) / row[2]
for row in [xbound, ybound, zbound]])
return dx, bx, nx
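# Illustrative example (not part of the original file; the bounds below are
# typical BEVFusion-style nuScenes settings, assumed only for demonstration):
#   xbound = [-54.0, 54.0, 0.3], ybound = [-54.0, 54.0, 0.3],
#   zbound = [-10.0, 10.0, 20.0]
# gen_dx_bx(xbound, ybound, zbound) then returns
#   dx = [0.3, 0.3, 20.0]       # cell size along x / y / z
#   bx = [-53.85, -53.85, 0.0]  # center of the first cell along each axis
#   nx = [360, 360, 1]          # number of cells along each axis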
class BaseTransform(nn.Module):
def __init__(
self,
in_channels: int,
out_channels: int,
image_size: Tuple[int, int],
feature_size: Tuple[int, int],
xbound: Tuple[float, float, float],
ybound: Tuple[float, float, float],
zbound: Tuple[float, float, float],
dbound: Tuple[float, float, float],
) -> None:
super().__init__()
self.in_channels = in_channels
self.image_size = image_size
self.feature_size = feature_size
self.xbound = xbound
self.ybound = ybound
self.zbound = zbound
self.dbound = dbound
dx, bx, nx = gen_dx_bx(self.xbound, self.ybound, self.zbound)
self.dx = nn.Parameter(dx, requires_grad=False)
self.bx = nn.Parameter(bx, requires_grad=False)
self.nx = nn.Parameter(nx, requires_grad=False)
self.C = out_channels
self.frustum = self.create_frustum()
self.D = self.frustum.shape[0]
self.fp16_enabled = False
def create_frustum(self):
iH, iW = self.image_size
fH, fW = self.feature_size
ds = (
torch.arange(*self.dbound,
dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW))
D, _, _ = ds.shape
xs = (
torch.linspace(0, iW - 1, fW,
dtype=torch.float).view(1, 1, fW).expand(D, fH, fW))
ys = (
torch.linspace(0, iH - 1, fH,
dtype=torch.float).view(1, fH, 1).expand(D, fH, fW))
frustum = torch.stack((xs, ys, ds), -1)
return nn.Parameter(frustum, requires_grad=False)
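# Illustrative note (not part of the original file): assuming, e.g.,
# dbound = [1.0, 60.0, 0.5] and feature_size = (fH, fW), the frustum has
# shape (D, fH, fW, 3) with D = len(torch.arange(1.0, 60.0, 0.5)) = 118,
# and the last dimension stores (pixel_x, pixel_y, depth) for every
# feature-map location at every candidate depth.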
def get_geometry(
self,
camera2lidar_rots,
camera2lidar_trans,
intrins,
post_rots,
post_trans,
**kwargs,
):
B, N, _ = camera2lidar_trans.shape
# undo post-transformation
# B x N x D x H x W x 3
points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
points = (
torch.inverse(post_rots).view(B, N, 1, 1, 1, 3,
3).matmul(points.unsqueeze(-1)))
# cam_to_lidar
points = torch.cat(
(
points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
points[:, :, :, :, :, 2:3],
),
5,
)
combine = camera2lidar_rots.matmul(torch.inverse(intrins))
points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
points += camera2lidar_trans.view(B, N, 1, 1, 1, 3)
if 'extra_rots' in kwargs:
extra_rots = kwargs['extra_rots']
points = (
extra_rots.view(B, 1, 1, 1, 1, 3,
3).repeat(1, N, 1, 1, 1, 1, 1).matmul(
points.unsqueeze(-1)).squeeze(-1))
if 'extra_trans' in kwargs:
extra_trans = kwargs['extra_trans']
points += extra_trans.view(B, 1, 1, 1, 1,
3).repeat(1, N, 1, 1, 1, 1)
return points
def get_cam_feats(self, x):
raise NotImplementedError
def bev_pool(self, geom_feats, x):
B, N, D, H, W, C = x.shape
Nprime = B * N * D * H * W
# flatten x
x = x.reshape(Nprime, C)
# flatten indices
geom_feats = ((geom_feats - (self.bx - self.dx / 2.0)) /
self.dx).long()
geom_feats = geom_feats.view(Nprime, 3)
batch_ix = torch.cat([
torch.full([Nprime // B, 1], ix, device=x.device, dtype=torch.long)
for ix in range(B)
])
geom_feats = torch.cat((geom_feats, batch_ix), 1)
# filter out points that are outside box
kept = ((geom_feats[:, 0] >= 0)
& (geom_feats[:, 0] < self.nx[0])
& (geom_feats[:, 1] >= 0)
& (geom_feats[:, 1] < self.nx[1])
& (geom_feats[:, 2] >= 0)
& (geom_feats[:, 2] < self.nx[2]))
x = x[kept]
geom_feats = geom_feats[kept]
x = bev_pool(x, geom_feats, B, self.nx[2], self.nx[0], self.nx[1])
# collapse Z
final = torch.cat(x.unbind(dim=2), 1)
return final
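# Illustrative note (not part of the original file): `bev_pool` returns a
# tensor of shape (B, C, nz, nx, ny); `x.unbind(dim=2)` followed by
# `torch.cat` along dim 1 collapses the vertical bins into channels, so
# `final` has shape (B, C * nz, nx, ny).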
def forward(
self,
img,
points,
lidar2image,
camera_intrinsics,
camera2lidar,
img_aug_matrix,
lidar_aug_matrix,
**kwargs,
):
intrins = camera_intrinsics[..., :3, :3]
post_rots = img_aug_matrix[..., :3, :3]
post_trans = img_aug_matrix[..., :3, 3]
camera2lidar_rots = camera2lidar[..., :3, :3]
camera2lidar_trans = camera2lidar[..., :3, 3]
extra_rots = lidar_aug_matrix[..., :3, :3]
extra_trans = lidar_aug_matrix[..., :3, 3]
geom = self.get_geometry(
camera2lidar_rots,
camera2lidar_trans,
intrins,
post_rots,
post_trans,
extra_rots=extra_rots,
extra_trans=extra_trans,
)
x = self.get_cam_feats(img)
x = self.bev_pool(geom, x)
return x
class BaseDepthTransform(BaseTransform):
def forward(
self,
img,
points,
lidar2image,
cam_intrinsic,
camera2lidar,
img_aug_matrix,
lidar_aug_matrix,
metas,
**kwargs,
):
intrins = cam_intrinsic[..., :3, :3]
post_rots = img_aug_matrix[..., :3, :3]
post_trans = img_aug_matrix[..., :3, 3]
camera2lidar_rots = camera2lidar[..., :3, :3]
camera2lidar_trans = camera2lidar[..., :3, 3]
# print(img.shape, self.image_size, self.feature_size)
batch_size = len(points)
depth = torch.zeros(batch_size, img.shape[1], 1,
*self.image_size).to(points[0].device)
for b in range(batch_size):
cur_coords = points[b][:, :3]
cur_img_aug_matrix = img_aug_matrix[b]
cur_lidar_aug_matrix = lidar_aug_matrix[b]
cur_lidar2image = lidar2image[b]
# inverse aug
cur_coords -= cur_lidar_aug_matrix[:3, 3]
cur_coords = torch.inverse(cur_lidar_aug_matrix[:3, :3]).matmul(
cur_coords.transpose(1, 0))
# lidar2image
cur_coords = cur_lidar2image[:, :3, :3].matmul(cur_coords)
cur_coords += cur_lidar2image[:, :3, 3].reshape(-1, 3, 1)
# get 2d coords
dist = cur_coords[:, 2, :]
cur_coords[:, 2, :] = torch.clamp(cur_coords[:, 2, :], 1e-5, 1e5)
cur_coords[:, :2, :] /= cur_coords[:, 2:3, :]
# imgaug
cur_coords = cur_img_aug_matrix[:, :3, :3].matmul(cur_coords)
cur_coords += cur_img_aug_matrix[:, :3, 3].reshape(-1, 3, 1)
cur_coords = cur_coords[:, :2, :].transpose(1, 2)
# normalize coords for grid sample
cur_coords = cur_coords[..., [1, 0]]
on_img = ((cur_coords[..., 0] < self.image_size[0])
& (cur_coords[..., 0] >= 0)
& (cur_coords[..., 1] < self.image_size[1])
& (cur_coords[..., 1] >= 0))
for c in range(on_img.shape[0]):
masked_coords = cur_coords[c, on_img[c]].long()
masked_dist = dist[c, on_img[c]]
depth[b, c, 0, masked_coords[:, 0],
masked_coords[:, 1]] = masked_dist
extra_rots = lidar_aug_matrix[..., :3, :3]
extra_trans = lidar_aug_matrix[..., :3, 3]
geom = self.get_geometry(
camera2lidar_rots,
camera2lidar_trans,
intrins,
post_rots,
post_trans,
extra_rots=extra_rots,
extra_trans=extra_trans,
)
x = self.get_cam_feats(img, depth)
x = self.bev_pool(geom, x)
return x
@MODELS.register_module()
class DepthLSSTransform(BaseDepthTransform):
def __init__(
self,
in_channels: int,
out_channels: int,
image_size: Tuple[int, int],
feature_size: Tuple[int, int],
xbound: Tuple[float, float, float],
ybound: Tuple[float, float, float],
zbound: Tuple[float, float, float],
dbound: Tuple[float, float, float],
downsample: int = 1,
) -> None:
super().__init__(
in_channels=in_channels,
out_channels=out_channels,
image_size=image_size,
feature_size=feature_size,
xbound=xbound,
ybound=ybound,
zbound=zbound,
dbound=dbound,
)
self.dtransform = nn.Sequential(
nn.Conv2d(1, 8, 1),
nn.BatchNorm2d(8),
nn.ReLU(True),
nn.Conv2d(8, 32, 5, stride=4, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(True),
nn.Conv2d(32, 64, 5, stride=2, padding=2),
nn.BatchNorm2d(64),
nn.ReLU(True),
)
self.depthnet = nn.Sequential(
nn.Conv2d(in_channels + 64, in_channels, 3, padding=1),
nn.BatchNorm2d(in_channels),
nn.ReLU(True),
nn.Conv2d(in_channels, in_channels, 3, padding=1),
nn.BatchNorm2d(in_channels),
nn.ReLU(True),
nn.Conv2d(in_channels, self.D + self.C, 1),
)
if downsample > 1:
assert downsample == 2, downsample
self.downsample = nn.Sequential(
nn.Conv2d(
out_channels, out_channels, 3, padding=1, bias=False),
nn.BatchNorm2d(out_channels),
nn.ReLU(True),
nn.Conv2d(
out_channels,
out_channels,
3,
stride=downsample,
padding=1,
bias=False,
),
nn.BatchNorm2d(out_channels),
nn.ReLU(True),
nn.Conv2d(
out_channels, out_channels, 3, padding=1, bias=False),
nn.BatchNorm2d(out_channels),
nn.ReLU(True),
)
else:
self.downsample = nn.Identity()
def get_cam_feats(self, x, d):
B, N, C, fH, fW = x.shape
d = d.view(B * N, *d.shape[2:])
x = x.view(B * N, C, fH, fW)
d = self.dtransform(d)
x = torch.cat([d, x], dim=1)
x = self.depthnet(x)
depth = x[:, :self.D].softmax(dim=1)
x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)
x = x.view(B, N, self.C, self.D, fH, fW)
x = x.permute(0, 1, 3, 4, 5, 2)
return x
def forward(self, *args, **kwargs):
x = super().forward(*args, **kwargs)
x = self.downsample(x)
return x
# Copyright (c) OpenMMLab. All rights reserved.
import copy
from typing import Optional
import mmcv
import mmengine
import numpy as np
from mmdet3d.datasets.transforms import LoadMultiViewImageFromFiles
from mmdet3d.registry import TRANSFORMS
@TRANSFORMS.register_module()
class BEVLoadMultiViewImageFromFiles(LoadMultiViewImageFromFiles):
"""Load multi channel images from a list of separate channel files.
``BEVLoadMultiViewImageFromFiles`` adds the following keys for the
convenience of view transforms in the forward:
- 'cam2lidar'
- 'lidar2img'
Args:
to_float32 (bool): Whether to convert the img to float32.
Defaults to False.
color_type (str): Color type of the file. Defaults to 'unchanged'.
file_client_args (dict): Arguments to instantiate a FileClient.
See :class:`mmengine.fileio.FileClient` for details.
Defaults to dict(backend='disk').
num_views (int): Number of views in a frame. Defaults to 5.
num_ref_frames (int): Number of reference frames to load. Defaults to -1.
test_mode (bool): Whether it is test mode during loading. Defaults to False.
set_default_scale (bool): Whether to set default scale.
Defaults to True.
"""
def transform(self, results: dict) -> Optional[dict]:
"""Call function to load multi-view image from files.
Args:
results (dict): Result dict containing multi-view image filenames.
Returns:
dict: The result dict containing the multi-view image data.
Added keys and values are described below.
- filename (str): Multi-view image filenames.
- img (np.ndarray): Multi-view image arrays.
- img_shape (tuple[int]): Shape of multi-view image arrays.
- ori_shape (tuple[int]): Shape of original image arrays.
- pad_shape (tuple[int]): Shape of padded image arrays.
- scale_factor (float): Scale factor.
- img_norm_cfg (dict): Normalization configuration of images.
"""
# TODO: consider split the multi-sweep part out of this pipeline
# Derive the mask and transform for loading of multi-sweep data
if self.num_ref_frames > 0:
# init choice with the current frame
init_choice = np.array([0], dtype=np.int64)
num_frames = len(results['img_filename']) // self.num_views - 1
if num_frames == 0: # no previous frame, then copy cur frames
choices = np.random.choice(
1, self.num_ref_frames, replace=True)
elif num_frames >= self.num_ref_frames:
# NOTE: suppose the info is saved following the order
# from latest to earlier frames
if self.test_mode:
choices = np.arange(num_frames - self.num_ref_frames,
num_frames) + 1
# NOTE: +1 is for selecting previous frames
else:
choices = np.random.choice(
num_frames, self.num_ref_frames, replace=False) + 1
elif num_frames > 0 and num_frames < self.num_ref_frames:
if self.test_mode:
base_choices = np.arange(num_frames) + 1
random_choices = np.random.choice(
num_frames,
self.num_ref_frames - num_frames,
replace=True) + 1
choices = np.concatenate([base_choices, random_choices])
else:
choices = np.random.choice(
num_frames, self.num_ref_frames, replace=True) + 1
else:
raise NotImplementedError
choices = np.concatenate([init_choice, choices])
select_filename = []
for choice in choices:
select_filename += results['img_filename'][choice *
self.num_views:
(choice + 1) *
self.num_views]
results['img_filename'] = select_filename
for key in ['cam2img', 'lidar2cam']:
if key in results:
select_results = []
for choice in choices:
select_results += results[key][choice *
self.num_views:(choice +
1) *
self.num_views]
results[key] = select_results
for key in ['ego2global']:
if key in results:
select_results = []
for choice in choices:
select_results += [results[key][choice]]
results[key] = select_results
# Transform lidar2cam to
# [cur_lidar]2[prev_img] and [cur_lidar]2[prev_cam]
for key in ['lidar2cam']:
if key in results:
# only change matrices of previous frames
for choice_idx in range(1, len(choices)):
pad_prev_ego2global = np.eye(4)
prev_ego2global = results['ego2global'][choice_idx]
pad_prev_ego2global[:prev_ego2global.
shape[0], :prev_ego2global.
shape[1]] = prev_ego2global
pad_cur_ego2global = np.eye(4)
cur_ego2global = results['ego2global'][0]
pad_cur_ego2global[:cur_ego2global.
shape[0], :cur_ego2global.
shape[1]] = cur_ego2global
cur2prev = np.linalg.inv(pad_prev_ego2global).dot(
pad_cur_ego2global)
for result_idx in range(choice_idx * self.num_views,
(choice_idx + 1) *
self.num_views):
results[key][result_idx] = \
results[key][result_idx].dot(cur2prev)
# Support multi-view images with different shapes
# TODO: record the origin shape and padded shape
filename, cam2img, lidar2cam, cam2lidar, lidar2img = [], [], [], [], []
for _, cam_item in results['images'].items():
filename.append(cam_item['img_path'])
lidar2cam.append(cam_item['lidar2cam'])
lidar2cam_array = np.array(cam_item['lidar2cam']).astype(
np.float32)
lidar2cam_rot = lidar2cam_array[:3, :3]
lidar2cam_trans = lidar2cam_array[:3, 3:4]
camera2lidar = np.eye(4)
camera2lidar[:3, :3] = lidar2cam_rot.T
camera2lidar[:3, 3:4] = -1 * np.matmul(
lidar2cam_rot.T, lidar2cam_trans.reshape(3, 1))
cam2lidar.append(camera2lidar)
cam2img_array = np.eye(4).astype(np.float32)
cam2img_array[:3, :3] = np.array(cam_item['cam2img']).astype(
np.float32)
cam2img.append(cam2img_array)
lidar2img.append(cam2img_array @ lidar2cam_array)
results['img_path'] = filename
results['cam2img'] = np.stack(cam2img, axis=0)
results['lidar2cam'] = np.stack(lidar2cam, axis=0)
results['cam2lidar'] = np.stack(cam2lidar, axis=0)
results['lidar2img'] = np.stack(lidar2img, axis=0)
results['ori_cam2img'] = copy.deepcopy(results['cam2img'])
if self.file_client is None:
self.file_client = mmengine.FileClient(**self.file_client_args)
# img is of shape (h, w, c, num_views)
# h and w can be different for different views
img_bytes = [self.file_client.get(name) for name in filename]
imgs = [
mmcv.imfrombytes(
img_byte,
flag=self.color_type,
backend='pillow',
channel_order='rgb') for img_byte in img_bytes
]
# handle the image with different shape
img_shapes = np.stack([img.shape for img in imgs], axis=0)
img_shape_max = np.max(img_shapes, axis=0)
img_shape_min = np.min(img_shapes, axis=0)
assert img_shape_min[-1] == img_shape_max[-1]
if not np.all(img_shape_max == img_shape_min):
pad_shape = img_shape_max[:2]
else:
pad_shape = None
if pad_shape is not None:
imgs = [
mmcv.impad(img, shape=pad_shape, pad_val=0) for img in imgs
]
img = np.stack(imgs, axis=-1)
if self.to_float32:
img = img.astype(np.float32)
results['filename'] = filename
# unravel to list, see `DefaultFormatBundle` in formating.py
# which will transpose each image separately and then stack into array
results['img'] = [img[..., i] for i in range(img.shape[-1])]
results['img_shape'] = img.shape[:2]
results['ori_shape'] = img.shape[:2]
# Set initial values for default meta_keys
results['pad_shape'] = img.shape[:2]
if self.set_default_scale:
results['scale_factor'] = 1.0
num_channels = 1 if len(img.shape) < 3 else img.shape[2]
results['img_norm_cfg'] = dict(
mean=np.zeros(num_channels, dtype=np.float32),
std=np.ones(num_channels, dtype=np.float32),
to_rgb=False)
results['num_views'] = self.num_views
results['num_ref_frames'] = self.num_ref_frames
return results
from .bev_pool import bev_pool
from .voxel import DynamicScatter, Voxelization, dynamic_scatter, voxelization
__all__ = [
'bev_pool', 'Voxelization', 'voxelization', 'dynamic_scatter',
'DynamicScatter'
]
from .bev_pool import bev_pool
__all__ = ['bev_pool']
import torch
from . import bev_pool_ext
class QuickCumsum(torch.autograd.Function):
@staticmethod
def forward(ctx, x, geom_feats, ranks):
x = x.cumsum(0)
kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
kept[:-1] = ranks[1:] != ranks[:-1]
x, geom_feats = x[kept], geom_feats[kept]
x = torch.cat((x[:1], x[1:] - x[:-1]))
# save kept for backward
ctx.save_for_backward(kept)
# no gradient for geom_feats
ctx.mark_non_differentiable(geom_feats)
return x, geom_feats
@staticmethod
def backward(ctx, gradx, gradgeom):
(kept, ) = ctx.saved_tensors
back = torch.cumsum(kept, 0)
back[kept] -= 1
val = gradx[back]
return val, None, None
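# Illustrative note (not part of the original file): QuickCumsum implements
# the "cumsum trick" -- a running sum over all points followed by
# differencing at the boundaries of equal `ranks` yields one summed feature
# per voxel. For example, with ranks = [0, 0, 1, 1, 1, 2] and
# x = [a, b, c, d, e, f]:
#   kept = [F, T, F, F, T, T]            # last entry of each rank group
#   x.cumsum(0)[kept] = [a+b, a+b+c+d+e, a+b+c+d+e+f]
#   after differencing -> [a+b, c+d+e, f]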
class QuickCumsumCuda(torch.autograd.Function):
@staticmethod
def forward(ctx, x, geom_feats, ranks, B, D, H, W):
kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
kept[1:] = ranks[1:] != ranks[:-1]
interval_starts = torch.where(kept)[0].int()
interval_lengths = torch.zeros_like(interval_starts)
interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]
interval_lengths[-1] = x.shape[0] - interval_starts[-1]
geom_feats = geom_feats.int()
out = bev_pool_ext.bev_pool_forward(
x,
geom_feats,
interval_lengths,
interval_starts,
B,
D,
H,
W,
)
ctx.save_for_backward(interval_starts, interval_lengths, geom_feats)
ctx.saved_shapes = B, D, H, W
return out
@staticmethod
def backward(ctx, out_grad):
interval_starts, interval_lengths, geom_feats = ctx.saved_tensors
B, D, H, W = ctx.saved_shapes
out_grad = out_grad.contiguous()
x_grad = bev_pool_ext.bev_pool_backward(
out_grad,
geom_feats,
interval_lengths,
interval_starts,
B,
D,
H,
W,
)
return x_grad, None, None, None, None, None, None
def bev_pool(feats, coords, B, D, H, W):
assert feats.shape[0] == coords.shape[0]
ranks = (
coords[:, 0] * (W * D * B) + coords[:, 1] * (D * B) +
coords[:, 2] * B + coords[:, 3])
indices = ranks.argsort()
feats, coords, ranks = feats[indices], coords[indices], ranks[indices]
x = QuickCumsumCuda.apply(feats, coords, ranks, B, D, H, W)
x = x.permute(0, 4, 1, 2, 3).contiguous()
return x
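# Shape sketch for the wrapper above (illustrative, not part of the original
# file; it assumes the `bev_pool_ext` CUDA extension has been built):
#   feats:  (N, C) float features, one row per frustum point
#   coords: (N, 4) int indices, columns (x, y, z, batch) into the BEV grid
#   bev_pool(feats, coords, B, D, H, W) -> (B, C, D, H, W), where features of
#   points that share the same (x, y, z, batch) cell are summed.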
#include <torch/torch.h>
#include <c10/cuda/CUDAGuard.h>
// CUDA function declarations
void bev_pool(int b, int d, int h, int w, int n, int c, int n_intervals, const float* x,
const int* geom_feats, const int* interval_starts, const int* interval_lengths, float* out);
void bev_pool_grad(int b, int d, int h, int w, int n, int c, int n_intervals, const float* out_grad,
const int* geom_feats, const int* interval_starts, const int* interval_lengths, float* x_grad);
/*
Function: pillar pooling (forward, cuda)
Args:
x : input features, FloatTensor[n, c]
geom_feats : input coordinates, IntTensor[n, 4]
interval_lengths : number of points in each pooled interval, IntTensor[n_intervals]
interval_starts : starting position of each pooled interval, IntTensor[n_intervals]
Return:
out : output features, FloatTensor[b, d, h, w, c]
*/
at::Tensor bev_pool_forward(
const at::Tensor _x,
const at::Tensor _geom_feats,
const at::Tensor _interval_lengths,
const at::Tensor _interval_starts,
int b, int d, int h, int w
) {
int n = _x.size(0);
int c = _x.size(1);
int n_intervals = _interval_lengths.size(0);
const at::cuda::OptionalCUDAGuard device_guard(device_of(_x));
const float* x = _x.data_ptr<float>();
const int* geom_feats = _geom_feats.data_ptr<int>();
const int* interval_lengths = _interval_lengths.data_ptr<int>();
const int* interval_starts = _interval_starts.data_ptr<int>();
auto options =
torch::TensorOptions().dtype(_x.dtype()).device(_x.device());
at::Tensor _out = torch::zeros({b, d, h, w, c}, options);
float* out = _out.data_ptr<float>();
bev_pool(
b, d, h, w, n, c, n_intervals, x,
geom_feats, interval_starts, interval_lengths, out
);
return _out;
}
/*
Function: pillar pooling (backward, cuda)
Args:
out_grad : gradient of the output features, FloatTensor[b, d, h, w, c]
geom_feats : input coordinates, IntTensor[n, 4]
interval_lengths : number of points in each pooled interval, IntTensor[n_intervals]
interval_starts : starting position of each pooled interval, IntTensor[n_intervals]
Return:
x_grad : gradient of the input features, FloatTensor[n, c]
*/
at::Tensor bev_pool_backward(
const at::Tensor _out_grad,
const at::Tensor _geom_feats,
const at::Tensor _interval_lengths,
const at::Tensor _interval_starts,
int b, int d, int h, int w
) {
int n = _geom_feats.size(0);
int c = _out_grad.size(4);
int n_intervals = _interval_lengths.size(0);
const at::cuda::OptionalCUDAGuard device_guard(device_of(_out_grad));
const float* out_grad = _out_grad.data_ptr<float>();
const int* geom_feats = _geom_feats.data_ptr<int>();
const int* interval_lengths = _interval_lengths.data_ptr<int>();
const int* interval_starts = _interval_starts.data_ptr<int>();
auto options =
torch::TensorOptions().dtype(_out_grad.dtype()).device(_out_grad.device());
at::Tensor _x_grad = torch::zeros({n, c}, options);
float* x_grad = _x_grad.data_ptr<float>();
bev_pool_grad(
b, d, h, w, n, c, n_intervals, out_grad,
geom_feats, interval_starts, interval_lengths, x_grad
);
return _x_grad;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("bev_pool_forward", &bev_pool_forward,
"bev_pool_forward");
m.def("bev_pool_backward", &bev_pool_backward,
"bev_pool_backward");
}
#include <stdio.h>
#include <stdlib.h>
/*
Function: pillar pooling
Args:
b : batch size
d : depth of the feature map
h : height of pooled feature map
w : width of pooled feature map
n : number of input points
c : number of channels
n_intervals : number of unique points
x : input features, FloatTensor[n, c]
geom_feats : input coordinates, IntTensor[n, 4]
interval_lengths : number of points in each pooled interval, IntTensor[n_intervals]
interval_starts : starting position of each pooled interval, IntTensor[n_intervals]
out : output features, FloatTensor[b, d, h, w, c]
*/
__global__ void bev_pool_kernel(int b, int d, int h, int w, int n, int c, int n_intervals,
const float *__restrict__ x,
const int *__restrict__ geom_feats,
const int *__restrict__ interval_starts,
const int *__restrict__ interval_lengths,
float* __restrict__ out) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int index = idx / c;
int cur_c = idx % c;
if (index >= n_intervals) return;
int interval_start = interval_starts[index];
int interval_length = interval_lengths[index];
const int* cur_geom_feats = geom_feats + interval_start * 4;
const float* cur_x = x + interval_start * c + cur_c;
float* cur_out = out + cur_geom_feats[3] * d * h * w * c +
cur_geom_feats[2] * h * w * c + cur_geom_feats[0] * w * c +
cur_geom_feats[1] * c + cur_c;
float psum = 0;
for(int i = 0; i < interval_length; i++){
psum += cur_x[i * c];
}
*cur_out = psum;
}
/*
Function: pillar pooling backward
Args:
b : batch size
d : depth of the feature map
h : height of pooled feature map
w : width of pooled feature map
n : number of input points
c : number of channels
n_intervals : number of unique points
out_grad : gradient of the BEV fmap from top, FloatTensor[b, d, h, w, c]
geom_feats : input coordinates, IntTensor[n, 4]
interval_lengths : number of points in each pooled interval, IntTensor[n_intervals]
interval_starts : starting position of each pooled interval, IntTensor[n_intervals]
x_grad : gradient of the image fmap, FloatTensor
*/
__global__ void bev_pool_grad_kernel(int b, int d, int h, int w, int n, int c, int n_intervals,
const float *__restrict__ out_grad,
const int *__restrict__ geom_feats,
const int *__restrict__ interval_starts,
const int *__restrict__ interval_lengths,
float* __restrict__ x_grad) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int index = idx / c;
int cur_c = idx % c;
if (index >= n_intervals) return;
int interval_start = interval_starts[index];
int interval_length = interval_lengths[index];
const int* cur_geom_feats = geom_feats + interval_start * 4;
float* cur_x_grad = x_grad + interval_start * c + cur_c;
const float* cur_out_grad = out_grad + cur_geom_feats[3] * d * h * w * c +
cur_geom_feats[2] * h * w * c + cur_geom_feats[0] * w * c +
cur_geom_feats[1] * c + cur_c;
for(int i = 0; i < interval_length; i++){
cur_x_grad[i * c] = *cur_out_grad;
}
}
void bev_pool(int b, int d, int h, int w, int n, int c, int n_intervals, const float* x,
const int* geom_feats, const int* interval_starts, const int* interval_lengths, float* out) {
bev_pool_kernel<<<(int)ceil(((double)n_intervals * c / 256)), 256>>>(
b, d, h, w, n, c, n_intervals, x, geom_feats, interval_starts, interval_lengths, out
);
}
void bev_pool_grad(int b, int d, int h, int w, int n, int c, int n_intervals, const float* out_grad,
const int* geom_feats, const int* interval_starts, const int* interval_lengths, float* x_grad) {
bev_pool_grad_kernel<<<(int)ceil(((double)n_intervals * c / 256)), 256>>>(
b, d, h, w, n, c, n_intervals, out_grad, geom_feats, interval_starts, interval_lengths, x_grad
);
}
from .scatter_points import DynamicScatter, dynamic_scatter
from .voxelize import Voxelization, voxelization
__all__ = ['Voxelization', 'voxelization', 'dynamic_scatter', 'DynamicScatter']
import torch
from torch import nn
from torch.autograd import Function
from .voxel_layer import (dynamic_point_to_voxel_backward,
dynamic_point_to_voxel_forward)
class _dynamic_scatter(Function):
@staticmethod
def forward(ctx, feats, coors, reduce_type='max'):
"""convert kitti points(N, >=3) to voxels.
Args:
feats: [N, C] float tensor. points features to be reduced
into voxels.
coors: [N, ndim] int tensor. corresponding voxel coordinates
(specifically multi-dim voxel index) of each points.
reduce_type: str. reduce op. support 'max', 'sum' and 'mean'
Returns:
tuple
voxel_feats: [M, C] float tensor. reduced features. input features
that shares the same voxel coordinates are reduced to one row
coordinates: [M, ndim] int tensor, voxel coordinates.
"""
results = dynamic_point_to_voxel_forward(feats, coors, reduce_type)
(voxel_feats, voxel_coors, point2voxel_map,
voxel_points_count) = results
ctx.reduce_type = reduce_type
ctx.save_for_backward(feats, voxel_feats, point2voxel_map,
voxel_points_count)
ctx.mark_non_differentiable(voxel_coors)
return voxel_feats, voxel_coors
@staticmethod
def backward(ctx, grad_voxel_feats, grad_voxel_coors=None):
(feats, voxel_feats, point2voxel_map,
voxel_points_count) = ctx.saved_tensors
grad_feats = torch.zeros_like(feats)
# TODO: whether to use index put or use cuda_backward
# To use index put, need point to voxel index
dynamic_point_to_voxel_backward(
grad_feats,
grad_voxel_feats.contiguous(),
feats,
voxel_feats,
point2voxel_map,
voxel_points_count,
ctx.reduce_type,
)
return grad_feats, None, None
dynamic_scatter = _dynamic_scatter.apply
class DynamicScatter(nn.Module):
def __init__(self, voxel_size, point_cloud_range, average_points: bool):
super(DynamicScatter, self).__init__()
"""Scatters points into voxels, used in the voxel encoder with
dynamic voxelization
**Note**: The CPU and GPU implementations give the same output but
have small numerical differences after summation and division (e.g., 5e-7).
Args:
voxel_size (list): [x, y, z] size of a voxel along the three
dimensions.
point_cloud_range (list):
[x_min, y_min, z_min, x_max, y_max, z_max]
average_points (bool): whether to use average pooling when
scattering points into a voxel.
"""
self.voxel_size = voxel_size
self.point_cloud_range = point_cloud_range
self.average_points = average_points
def forward_single(self, points, coors):
reduce = 'mean' if self.average_points else 'max'
return dynamic_scatter(points.contiguous(), coors.contiguous(), reduce)
def forward(self, points, coors):
"""
Args:
input: NC points
"""
if coors.size(-1) == 3:
return self.forward_single(points, coors)
else:
batch_size = coors[-1, 0] + 1
voxels, voxel_coors = [], []
for i in range(batch_size):
inds = torch.where(coors[:, 0] == i)
voxel, voxel_coor = self.forward_single(
points[inds], coors[inds][:, 1:])
coor_pad = nn.functional.pad(
voxel_coor, (1, 0), mode='constant', value=i)
voxel_coors.append(coor_pad)
voxels.append(voxel)
features = torch.cat(voxels, dim=0)
feature_coors = torch.cat(voxel_coors, dim=0)
return features, feature_coors
def __repr__(self):
tmpstr = self.__class__.__name__ + '('
tmpstr += 'voxel_size=' + str(self.voxel_size)
tmpstr += ', point_cloud_range=' + str(self.point_cloud_range)
tmpstr += ', average_points=' + str(self.average_points)
tmpstr += ')'
return tmpstr
#include <ATen/TensorUtils.h>
#include <torch/extension.h>
// #include "voxelization.h"
namespace {
template <typename T_int>
void determin_max_points_kernel(
torch::TensorAccessor<T_int, 2> coor,
torch::TensorAccessor<T_int, 1> point_to_voxelidx,
torch::TensorAccessor<T_int, 1> num_points_per_voxel,
torch::TensorAccessor<T_int, 3> coor_to_voxelidx, int& voxel_num,
int& max_points, const int num_points) {
int voxelidx, num;
for (int i = 0; i < num_points; ++i) {
if (coor[i][0] == -1) continue;
voxelidx = coor_to_voxelidx[coor[i][0]][coor[i][1]][coor[i][2]];
// record voxel
if (voxelidx == -1) {
voxelidx = voxel_num;
voxel_num += 1;
coor_to_voxelidx[coor[i][0]][coor[i][1]][coor[i][2]] = voxelidx;
}
// put points into voxel
num = num_points_per_voxel[voxelidx];
point_to_voxelidx[i] = num;
num_points_per_voxel[voxelidx] += 1;
// update max points per voxel
max_points = std::max(max_points, num + 1);
}
return;
}
template <typename T, typename T_int>
void scatter_point_to_voxel_kernel(
const torch::TensorAccessor<T, 2> points,
torch::TensorAccessor<T_int, 2> coor,
torch::TensorAccessor<T_int, 1> point_to_voxelidx,
torch::TensorAccessor<T_int, 3> coor_to_voxelidx,
torch::TensorAccessor<T, 3> voxels,
torch::TensorAccessor<T_int, 2> voxel_coors, const int num_features,
const int num_points, const int NDim) {
for (int i = 0; i < num_points; ++i) {
int num = point_to_voxelidx[i];
int voxelidx = coor_to_voxelidx[coor[i][0]][coor[i][1]][coor[i][2]];
for (int k = 0; k < num_features; ++k) {
voxels[voxelidx][num][k] = points[i][k];
}
for (int k = 0; k < NDim; ++k) {
voxel_coors[voxelidx][k] = coor[i][k];
}
}
}
} // namespace
namespace voxelization {
std::vector<at::Tensor> dynamic_point_to_voxel_cpu(
const at::Tensor& points, const at::Tensor& voxel_mapping,
const std::vector<float> voxel_size, const std::vector<float> coors_range) {
// the current version takes about 0.02s-0.03s per frame on CPU
// check device
AT_ASSERTM(points.device().is_cpu(), "points must be a CPU tensor");
const int NDim = voxel_mapping.size(1);
const int num_points = points.size(0);
const int num_features = points.size(1);
std::vector<int> grid_size(NDim);
for (int i = 0; i < NDim; ++i) {
grid_size[i] =
round((coors_range[NDim + i] - coors_range[i]) / voxel_size[i]);
}
at::Tensor num_points_per_voxel = at::zeros(
{
num_points,
},
voxel_mapping.options());
at::Tensor coor_to_voxelidx = -at::ones(
{grid_size[2], grid_size[1], grid_size[0]}, voxel_mapping.options());
at::Tensor point_to_voxelidx = -at::ones(
{
num_points,
},
voxel_mapping.options());
int voxel_num = 0;
int max_points = 0;
AT_DISPATCH_ALL_TYPES(voxel_mapping.scalar_type(), "determin_max_point", [&] {
determin_max_points_kernel<scalar_t>(
voxel_mapping.accessor<scalar_t, 2>(),
point_to_voxelidx.accessor<scalar_t, 1>(),
num_points_per_voxel.accessor<scalar_t, 1>(),
coor_to_voxelidx.accessor<scalar_t, 3>(), voxel_num, max_points,
num_points);
});
at::Tensor voxels =
at::zeros({voxel_num, max_points, num_features}, points.options());
at::Tensor voxel_coors =
at::zeros({voxel_num, NDim}, points.options().dtype(at::kInt));
AT_DISPATCH_ALL_TYPES(points.scalar_type(), "scatter_point_to_voxel", [&] {
scatter_point_to_voxel_kernel<scalar_t, int>(
points.accessor<scalar_t, 2>(), voxel_mapping.accessor<int, 2>(),
point_to_voxelidx.accessor<int, 1>(),
coor_to_voxelidx.accessor<int, 3>(), voxels.accessor<scalar_t, 3>(),
voxel_coors.accessor<int, 2>(), num_features, num_points, NDim);
});
at::Tensor num_points_per_voxel_out =
num_points_per_voxel.slice(/*dim=*/0, /*start=*/0, /*end=*/voxel_num);
return {voxels, voxel_coors, num_points_per_voxel_out};
}
} // namespace voxelization
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <torch/types.h>
#include <ATen/cuda/CUDAApplyUtils.cuh>
typedef enum { SUM = 0, MEAN = 1, MAX = 2 } reduce_t;
#define CHECK_CUDA(x) \
TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) \
TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) \
CHECK_CUDA(x); \
CHECK_CONTIGUOUS(x)
namespace {
int const threadsPerBlock = 512;
int const maxGridDim = 50000;
} // namespace
__device__ __forceinline__ static void reduceMax(float *address, float val) {
int *address_as_i = reinterpret_cast<int *>(address);
int old = *address_as_i, assumed;
do {
assumed = old;
old = atomicCAS(address_as_i, assumed,
__float_as_int(fmaxf(val, __int_as_float(assumed))));
} while (assumed != old || __int_as_float(old) < val);
}
__device__ __forceinline__ static void reduceMax(double *address, double val) {
unsigned long long *address_as_ull =
reinterpret_cast<unsigned long long *>(address);
unsigned long long old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(
address_as_ull, assumed,
__double_as_longlong(fmax(val, __longlong_as_double(assumed))));
} while (assumed != old || __longlong_as_double(old) < val);
}
// get rid of meaningless warnings when compiling host code
#ifdef __CUDA_ARCH__
__device__ __forceinline__ static void reduceAdd(float *address, float val) {
#if (__CUDA_ARCH__ < 200)
#warning \
"compute capability lower than 2.x. fall back to use CAS version of atomicAdd for float32"
int *address_as_i = reinterpret_cast<int *>(address);
int old = *address_as_i, assumed;
do {
assumed = old;
old = atomicCAS(address_as_i, assumed,
__float_as_int(val + __int_as_float(assumed)));
} while (assumed != old);
#else
atomicAdd(address, val);
#endif
}
__device__ __forceinline__ static void reduceAdd(double *address, double val) {
#if (__CUDA_ARCH__ < 600)
#warning \
"compute capability lower than 6.x. fall back to use CAS version of atomicAdd for float64"
unsigned long long *address_as_ull =
reinterpret_cast<unsigned long long *>(address);
unsigned long long old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
#else
atomicAdd(address, val);
#endif
}
#endif
template <typename T>
__global__ void
feats_reduce_kernel(const T *feats, const int32_t *coors_map,
T *reduced_feats, // shall be 0 at initialization
const int num_input, const int num_feats,
const reduce_t reduce_type) {
for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < num_input;
x += gridDim.x * blockDim.x) {
int32_t reduce_to = coors_map[x];
if (reduce_to == -1) continue;
const T *feats_offset = feats + x * num_feats;
T *reduced_feats_offset = reduced_feats + reduce_to * num_feats;
if (reduce_type == reduce_t::MAX) {
for (int i = 0; i < num_feats; i++) {
reduceMax(&reduced_feats_offset[i], feats_offset[i]);
}
} else {
for (int i = 0; i < num_feats; i++) {
reduceAdd(&reduced_feats_offset[i], feats_offset[i]);
}
}
}
}
template <typename T>
__global__ void add_reduce_traceback_grad_kernel(
T *grad_feats, const T *grad_reduced_feats, const int32_t *coors_map,
const int32_t *reduce_count, const int num_input, const int num_feats,
const reduce_t reduce_type) {
for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < num_input;
x += gridDim.x * blockDim.x) {
int32_t reduce_to = coors_map[x];
if (reduce_to == -1) {
continue;
}
const int input_offset = x * num_feats;
T *grad_feats_offset = grad_feats + input_offset;
const int reduced_offset = reduce_to * num_feats;
const T *grad_reduced_feats_offset = grad_reduced_feats + reduced_offset;
if (reduce_type == reduce_t::SUM) {
for (int i = 0; i < num_feats; i++) {
grad_feats_offset[i] = grad_reduced_feats_offset[i];
}
} else if (reduce_type == reduce_t::MEAN) {
for (int i = 0; i < num_feats; i++) {
grad_feats_offset[i] = grad_reduced_feats_offset[i] /
static_cast<T>(reduce_count[reduce_to]);
}
}
}
}
template <typename T>
__global__ void max_reduce_traceback_scatter_idx_kernel(
const T *feats, const T *reduced_feats, int32_t *reduce_from,
const int32_t *coors_map, const int num_input, const int num_feats) {
for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < num_input;
x += gridDim.x * blockDim.x) {
int32_t reduce_to = coors_map[x];
const int input_offset = x * num_feats;
const T *feats_offset = feats + input_offset;
if (reduce_to == -1) {
continue;
}
const int reduced_offset = reduce_to * num_feats;
const T *reduced_feats_offset = reduced_feats + reduced_offset;
int32_t *reduce_from_offset = reduce_from + reduced_offset;
for (int i = 0; i < num_feats; i++) {
if (feats_offset[i] == reduced_feats_offset[i]) {
atomicMin(&reduce_from_offset[i], static_cast<int32_t>(x));
}
}
}
}
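// Second stage of the MAX backward pass: route each voxel channel's gradient to
// the single argmax point recorded in `reduce_from`.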
template <typename T>
__global__ void max_reduce_scatter_grad_kernel(T *grad_feats,
const T *grad_reduced_feats,
const int32_t *reduce_from,
const int num_reduced,
const int num_feats) {
for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < num_reduced;
x += gridDim.x * blockDim.x) {
const int reduced_offset = x * num_feats;
const int32_t *scatter_to_offset = reduce_from + reduced_offset;
const T *grad_reduced_feats_offset = grad_reduced_feats + reduced_offset;
for (int i = 0; i < num_feats; i++) {
grad_feats[scatter_to_offset[i] * num_feats + i] =
grad_reduced_feats_offset[i];
}
}
}
namespace voxelization {
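// Forward pass of dynamic scatter on GPU: mark coordinates with any negative
// component as invalid, deduplicate the rest with at::unique_dim, then reduce
// point features into the resulting voxels.
// Returns {reduced_feats, out_coors, coors_map, reduce_count}.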
std::vector<at::Tensor> dynamic_point_to_voxel_forward_gpu(
const at::Tensor &feats, const at::Tensor &coors,
const reduce_t reduce_type) {
CHECK_INPUT(feats);
CHECK_INPUT(coors);
const int num_input = feats.size(0);
const int num_feats = feats.size(1);
if (num_input == 0)
return {feats.clone().detach(),
coors.clone().detach(),
coors.new_empty({0}, torch::kInt32),
coors.new_empty({0}, torch::kInt32)};
at::Tensor out_coors;
at::Tensor coors_map;
at::Tensor reduce_count;
auto coors_clean = coors.masked_fill(coors.lt(0).any(-1, true), -1);
std::tie(out_coors, coors_map, reduce_count) =
at::unique_dim(coors_clean, 0, true, true, true);
if (out_coors.index({0, 0}).lt(0).item<bool>()) {
    // the first element of out_coors is (-1, -1, -1) (invalid voxel) and should be removed
out_coors = out_coors.slice(0, 1);
reduce_count = reduce_count.slice(0, 1);
coors_map = coors_map - 1;
}
coors_map = coors_map.to(torch::kInt32);
reduce_count = reduce_count.to(torch::kInt32);
auto reduced_feats =
at::empty({out_coors.size(0), num_feats}, feats.options());
AT_DISPATCH_FLOATING_TYPES(
feats.scalar_type(), "feats_reduce_kernel", ([&] {
if (reduce_type == reduce_t::MAX)
reduced_feats.fill_(-std::numeric_limits<scalar_t>::infinity());
else
reduced_feats.fill_(static_cast<scalar_t>(0));
dim3 blocks(std::min(at::cuda::ATenCeilDiv(num_input, threadsPerBlock),
maxGridDim));
dim3 threads(threadsPerBlock);
feats_reduce_kernel<<<blocks, threads>>>(
feats.data_ptr<scalar_t>(), coors_map.data_ptr<int32_t>(),
reduced_feats.data_ptr<scalar_t>(), num_input, num_feats, reduce_type);
if (reduce_type == reduce_t::MEAN)
reduced_feats /= reduce_count.unsqueeze(-1).to(reduced_feats.dtype());
}));
AT_CUDA_CHECK(cudaGetLastError());
return {reduced_feats, out_coors, coors_map, reduce_count};
}
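// Backward pass of dynamic scatter on GPU: for SUM / MEAN the voxel gradients
// are traced back to every contributing point; for MAX an argmax index map is
// built first and the gradient is scattered only to the winning point of each
// channel.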
void dynamic_point_to_voxel_backward_gpu(at::Tensor &grad_feats,
const at::Tensor &grad_reduced_feats,
const at::Tensor &feats,
const at::Tensor &reduced_feats,
const at::Tensor &coors_map,
const at::Tensor &reduce_count,
const reduce_t reduce_type) {
CHECK_INPUT(grad_feats);
CHECK_INPUT(grad_reduced_feats);
CHECK_INPUT(feats);
CHECK_INPUT(reduced_feats);
CHECK_INPUT(coors_map);
CHECK_INPUT(reduce_count);
const int num_input = feats.size(0);
const int num_reduced = reduced_feats.size(0);
const int num_feats = feats.size(1);
grad_feats.fill_(0);
// copy voxel grad to points
if (num_input == 0 || num_reduced == 0) return;
if (reduce_type == reduce_t::MEAN || reduce_type == reduce_t::SUM) {
AT_DISPATCH_FLOATING_TYPES(
grad_reduced_feats.scalar_type(), "add_reduce_traceback_grad_kernel",
([&] {
dim3 blocks(std::min(
at::cuda::ATenCeilDiv(num_input, threadsPerBlock), maxGridDim));
dim3 threads(threadsPerBlock);
add_reduce_traceback_grad_kernel<<<blocks, threads>>>(
grad_feats.data_ptr<scalar_t>(),
grad_reduced_feats.data_ptr<scalar_t>(),
coors_map.data_ptr<int32_t>(), reduce_count.data_ptr<int32_t>(),
num_input, num_feats, reduce_type);
}));
AT_CUDA_CHECK(cudaGetLastError());
} else {
auto reduce_from = at::full({num_reduced, num_feats}, num_input,
coors_map.options().dtype(torch::kInt32));
AT_DISPATCH_FLOATING_TYPES(
grad_reduced_feats.scalar_type(),
"max_reduce_traceback_scatter_idx_kernel", ([&] {
dim3 blocks(std::min(
at::cuda::ATenCeilDiv(num_input, threadsPerBlock), maxGridDim));
dim3 threads(threadsPerBlock);
max_reduce_traceback_scatter_idx_kernel<<<blocks, threads>>>(
feats.data_ptr<scalar_t>(), reduced_feats.data_ptr<scalar_t>(),
reduce_from.data_ptr<int32_t>(), coors_map.data_ptr<int32_t>(),
num_input, num_feats);
}));
AT_CUDA_CHECK(cudaGetLastError());
AT_DISPATCH_FLOATING_TYPES(
grad_reduced_feats.scalar_type(),
"max_reduce_traceback_scatter_idx_kernel", ([&] {
dim3 blocks(std::min(
at::cuda::ATenCeilDiv(num_reduced, threadsPerBlock), maxGridDim));
dim3 threads(threadsPerBlock);
max_reduce_scatter_grad_kernel<<<blocks, threads>>>(
grad_feats.data_ptr<scalar_t>(),
grad_reduced_feats.data_ptr<scalar_t>(),
reduce_from.data_ptr<int32_t>(), num_reduced, num_feats);
}));
AT_CUDA_CHECK(cudaGetLastError());
}
return;
}
} // namespace voxelization
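// pybind11 bindings (a separate extension source file): expose the voxelization
// and dynamic scatter ops declared in voxelization.h to Python.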
#include <torch/extension.h>
#include "voxelization.h"
namespace voxelization {
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("hard_voxelize", &hard_voxelize, "hard voxelize");
m.def("dynamic_voxelize", &dynamic_voxelize, "dynamic voxelization");
m.def("dynamic_point_to_voxel_forward", &dynamic_point_to_voxel_forward, "dynamic point to voxel forward");
m.def("dynamic_point_to_voxel_backward", &dynamic_point_to_voxel_backward, "dynamic point to voxel backward");
}
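// A minimal usage sketch from Python, assuming the extension has been built
// under the hypothetical module name `voxel_ext`:
//
//   import torch, voxel_ext
//   feats = torch.rand(1000, 4, device='cuda')
//   coors = torch.randint(0, 100, (1000, 3), dtype=torch.int32, device='cuda')
//   reduced, out_coors, coors_map, count = \
//       voxel_ext.dynamic_point_to_voxel_forward(feats, coors, 'mean')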
} // namespace voxelization
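// Shared header (included above as "voxelization.h"): declares the CPU / CUDA
// implementations and the device-dispatching wrappers bound to Python.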
#pragma once
#include <torch/extension.h>
typedef enum { SUM = 0, MEAN = 1, MAX = 2 } reduce_t;
namespace voxelization {
int hard_voxelize_cpu(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors, at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim = 3);
void dynamic_voxelize_cpu(const at::Tensor &points, at::Tensor &coors,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int NDim = 3);
std::vector<at::Tensor> dynamic_point_to_voxel_cpu(
const at::Tensor &points, const at::Tensor &voxel_mapping,
const std::vector<float> voxel_size, const std::vector<float> coors_range);
#ifdef WITH_CUDA
int hard_voxelize_gpu(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors, at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim = 3);
int nondisterministic_hard_voxelize_gpu(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors, at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim = 3);
void dynamic_voxelize_gpu(const at::Tensor &points, at::Tensor &coors,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int NDim = 3);
std::vector<torch::Tensor> dynamic_point_to_voxel_forward_gpu(const torch::Tensor &feats,
const torch::Tensor &coors,
const reduce_t reduce_type);
void dynamic_point_to_voxel_backward_gpu(torch::Tensor &grad_feats,
const torch::Tensor &grad_reduced_feats,
const torch::Tensor &feats,
const torch::Tensor &reduced_feats,
const torch::Tensor &coors_idx,
const torch::Tensor &reduce_count,
const reduce_t reduce_type);
#endif
// Interface for Python
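// hard_voxelize dispatches to the deterministic or non-deterministic CUDA
// implementation when the input lives on the GPU, otherwise to the CPU path.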
inline int hard_voxelize(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors, at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim = 3, const bool deterministic = true) {
if (points.device().is_cuda()) {
#ifdef WITH_CUDA
if (deterministic) {
return hard_voxelize_gpu(points, voxels, coors, num_points_per_voxel,
voxel_size, coors_range, max_points, max_voxels,
NDim);
}
return nondisterministic_hard_voxelize_gpu(points, voxels, coors, num_points_per_voxel,
voxel_size, coors_range, max_points, max_voxels,
NDim);
#else
AT_ERROR("Not compiled with GPU support");
#endif
}
return hard_voxelize_cpu(points, voxels, coors, num_points_per_voxel,
voxel_size, coors_range, max_points, max_voxels,
NDim);
}
inline void dynamic_voxelize(const at::Tensor &points, at::Tensor &coors,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int NDim = 3) {
if (points.device().is_cuda()) {
#ifdef WITH_CUDA
return dynamic_voxelize_gpu(points, coors, voxel_size, coors_range, NDim);
#else
AT_ERROR("Not compiled with GPU support");
#endif
}
return dynamic_voxelize_cpu(points, coors, voxel_size, coors_range, NDim);
}
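// Map the reduce mode string used on the Python side to the reduce_t enum.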
inline reduce_t convert_reduce_type(const std::string &reduce_type) {
if (reduce_type == "max")
return reduce_t::MAX;
else if (reduce_type == "sum")
return reduce_t::SUM;
else if (reduce_type == "mean")
return reduce_t::MEAN;
  else TORCH_CHECK(false, "unsupported reduce type: " + reduce_type);
return reduce_t::SUM;
}
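// Dynamic scatter wrappers: CUDA-only for now; the CPU branch raises until a
// CPU implementation is added.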
inline std::vector<torch::Tensor> dynamic_point_to_voxel_forward(const torch::Tensor &feats,
const torch::Tensor &coors,
const std::string &reduce_type) {
if (feats.device().is_cuda()) {
#ifdef WITH_CUDA
return dynamic_point_to_voxel_forward_gpu(feats, coors, convert_reduce_type(reduce_type));
#else
TORCH_CHECK(false, "Not compiled with GPU support");
#endif
}
TORCH_CHECK(false, "do not support cpu yet");
return std::vector<torch::Tensor>();
}
inline void dynamic_point_to_voxel_backward(torch::Tensor &grad_feats,
const torch::Tensor &grad_reduced_feats,
const torch::Tensor &feats,
const torch::Tensor &reduced_feats,
const torch::Tensor &coors_idx,
const torch::Tensor &reduce_count,
const std::string &reduce_type) {
if (grad_feats.device().is_cuda()) {
#ifdef WITH_CUDA
dynamic_point_to_voxel_backward_gpu(
grad_feats, grad_reduced_feats, feats, reduced_feats, coors_idx, reduce_count,
convert_reduce_type(reduce_type));
return;
#else
TORCH_CHECK(false, "Not compiled with GPU support");
#endif
}
TORCH_CHECK(false, "do not support cpu yet");
}
} // namespace voxelization