Unverified commit aaf9cbeb authored by Cody Reading, committed by GitHub

Support Monocular 3D Detector CaDDN (#538)

* Added CaDDN detector and support for image, depth map, and 2D GT box
dataloading

* Moved image flip augmentation to augmentor_utils

* Updated default get item list to include points

* Moved utils functions into transform_utils

* Combined FFE + F2V into ImageVFE, renamed FFE to FFN, moved depth downsample into data_processor

* Updated README with updated CaDDN weights

* Updated comments for image vfe
parent e3bec15f
@@ -13,3 +13,5 @@ venv/
*.pkl
*.zip
*.bin
output
version.py
\ No newline at end of file
@@ -18,6 +18,8 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
## Changelog
[2021-05-14] Added support for the monocular 3D object detection model [`CaDDN`](#KITTI-3D-Object-Detection-Baselines)
[2020-11-27] **Bug fixed:** Please re-prepare the validation infos of Waymo dataset (version 1.2) if you would like to
use our provided Waymo evaluation tool (see [PR](https://github.com/open-mmlab/OpenPCDet/pull/383)).
Note that you do not need to re-prepare the training data and ground-truth database.
@@ -104,6 +106,7 @@ Selected supported methods are shown in the below table. The results are the 3D
| [Part-A^2-Free](tools/cfgs/kitti_models/PartA2_free.yaml) | ~3.8 hours| 78.72 | 65.99 | 74.29 | [model-226M](https://drive.google.com/file/d/1lcUUxF8mJgZ_e-tZhP1XNQtTBuC-R0zr/view?usp=sharing) |
| [Part-A^2-Anchor](tools/cfgs/kitti_models/PartA2.yaml) | ~4.3 hours| 79.40 | 60.05 | 69.90 | [model-244M](https://drive.google.com/file/d/10GK1aCkLqxGNeX3lVu8cLZyE0G8002hY/view?usp=sharing) |
| [PV-RCNN](tools/cfgs/kitti_models/pv_rcnn.yaml) | ~5 hours| 83.61 | 57.90 | 70.47 | [model-50M](https://drive.google.com/file/d/1lIOq4Hxr0W3qsX83ilQv0nk1Cls6KAr-/view?usp=sharing) |
| [CaDDN](tools/cfgs/kitti_models/CaDDN.yaml) |~15 hours| 21.38 | 13.02 | 9.76 | [model-774M](https://drive.google.com/file/d/1OQTO2PtXT8GGr35W9m2GZGuqgb6fyU1V/view?usp=sharing) |
### NuScenes 3D Object Detection Baselines
All models are trained with 8 GTX 1080Ti GPUs and are available for download.
...
@@ -9,6 +9,7 @@ Currently we provide the dataloader of KITTI dataset and NuScenes dataset, and t
### KITTI Dataset
* Please download the official [KITTI 3D object detection](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d) dataset and organize the downloaded files as follows (the road planes could be downloaded from [[road plane]](https://drive.google.com/file/d/1d5mq0RXRnvHPVeKx6Q612z0YRO1t2wAp/view?usp=sharing), which are optional for data augmentation in the training):
* If you would like to train [CaDDN](../tools/cfgs/kitti_models/CaDDN.yaml), download the precomputed [depth maps](https://drive.google.com/file/d/1qFZux7KC_gJ0UHEg-qGJKqteE9Ivojin/view?usp=sharing) for the KITTI training set.
* NOTE: if you already have the data infos from `pcdet v0.1`, you can choose to use the old infos and set the DATABASE_WITH_FAKELIDAR option in tools/cfgs/dataset_configs/kitti_dataset.yaml as True. The second choice is that you can create the infos and gt database again and leave the config unchanged.
```
@@ -17,7 +18,7 @@ OpenPCDet
│ ├── kitti
│ │ │── ImageSets
│ │ │── training
│ │ │ ├──calib & velodyne & label_2 & image_2 & (optional: planes) & (optional: depth_2)
│ │ │── testing
│ │ │ ├──calib & velodyne & image_2
├── pcdet
@@ -94,6 +95,17 @@ python -m pcdet.datasets.waymo.waymo_dataset --func create_waymo_infos \
Note that you do not need to install `waymo-open-dataset` if you have already processed the data before and do not need to evaluate with official Waymo Metrics.
## Pretrained Models
If you would like to train [CaDDN](../tools/cfgs/kitti_models/CaDDN.yaml), download the pretrained [DeepLabV3 model](https://download.pytorch.org/models/deeplabv3_resnet101_coco-586e9e4e.pth) and place it within the `checkpoints` directory:
```
OpenPCDet
├── checkpoints
│ ├── deeplabv3_resnet101_coco-586e9e4e.pth
├── data
├── pcdet
├── tools
```
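If you prefer to script the download, here is a minimal sketch (not part of the repo) using `torch.hub`, which stores the checkpoint under `checkpoints/` with its URL-derived file name:
```
import torch

# Downloads deeplabv3_resnet101_coco-586e9e4e.pth into ./checkpoints/
url = "https://download.pytorch.org/models/deeplabv3_resnet101_coco-586e9e4e.pth"
state_dict = torch.hub.load_state_dict_from_url(url, model_dir="checkpoints")
print(len(state_dict), "tensors in checkpoint")
```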
## Training & Testing
...
import copy
import numpy as np
from ...utils import common_utils
@@ -76,3 +77,42 @@ def global_scaling(gt_boxes, points, scale_range):
points[:, :3] *= noise_scale
gt_boxes[:, :6] *= noise_scale
return gt_boxes, points
def random_image_flip_horizontal(image, depth_map, gt_boxes, calib):
"""
Performs random horizontal flip augmentation
Args:
image: (H_image, W_image, 3), Image
depth_map: (H_depth, W_depth), Depth map
gt_boxes: (N, 7), 3D box labels in LiDAR coordinates [x, y, z, w, l, h, ry]
calib: calibration.Calibration, Calibration object
Returns:
aug_image: (H_image, W_image, 3), Augmented image
aug_depth_map: (H_depth, W_depth), Augmented depth map
aug_gt_boxes: (N, 7), Augmented 3D box labels in LiDAR coordinates [x, y, z, w, l, h, ry]
"""
# Randomly augment with 50% chance
enable = np.random.choice([False, True], replace=False, p=[0.5, 0.5])
if enable:
# Flip images
aug_image = np.fliplr(image)
aug_depth_map = np.fliplr(depth_map)
# Flip 3D gt_boxes by flipping the centroids in image space
aug_gt_boxes = copy.copy(gt_boxes)
locations = aug_gt_boxes[:, :3]
img_pts, img_depth = calib.lidar_to_img(locations)
W = image.shape[1]
img_pts[:, 0] = W - img_pts[:, 0]
pts_rect = calib.img_to_rect(u=img_pts[:, 0], v=img_pts[:, 1], depth_rect=img_depth)
pts_lidar = calib.rect_to_lidar(pts_rect)
aug_gt_boxes[:, :3] = pts_lidar
aug_gt_boxes[:, 6] = -1 * aug_gt_boxes[:, 6]
else:
aug_image = image
aug_depth_map = depth_map
aug_gt_boxes = gt_boxes
return aug_image, aug_depth_map, aug_gt_boxes
\ No newline at end of file
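# --- Editor's aside: hedged sketch, not part of the commit ---
# Why flipping boxes via image space works: mirroring the u-coordinate and
# unprojecting at the same depth mirrors the camera-frame x-coordinate.
# Hypothetical pinhole intrinsics below, not the repo's calibration.Calibration.
fu, cu, W = 721.5, 620.5, 1242.0     # focal length, principal point, image width
x, z = 2.0, 20.0                     # centroid in the camera frame
u = fu * x / z + cu                  # project to image
x_flipped = ((W - u) - cu) * z / fu  # mirror u, then unproject at the same depth
print(x_flipped)                     # ~ -2.0 when cu ~ W / 2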
@@ -78,6 +78,25 @@ class DataAugmentor(object):
data_dict['points'] = points
return data_dict
def random_image_flip(self, data_dict=None, config=None):
if data_dict is None:
return partial(self.random_image_flip, config=config)
images = data_dict["images"]
depth_maps = data_dict["depth_maps"]
gt_boxes = data_dict['gt_boxes']
gt_boxes2d = data_dict["gt_boxes2d"]
calib = data_dict["calib"]
for cur_axis in config['ALONG_AXIS_LIST']:
assert cur_axis in ['horizontal']
images, depth_maps, gt_boxes = getattr(augmentor_utils, 'random_image_flip_%s' % cur_axis)(
images, depth_maps, gt_boxes, calib,
)
data_dict['images'] = images
data_dict['depth_maps'] = depth_maps
data_dict['gt_boxes'] = gt_boxes
return data_dict
def forward(self, data_dict):
"""
Args:
@@ -103,5 +122,8 @@ class DataAugmentor(object):
gt_boxes_mask = data_dict['gt_boxes_mask']
data_dict['gt_boxes'] = data_dict['gt_boxes'][gt_boxes_mask]
data_dict['gt_names'] = data_dict['gt_names'][gt_boxes_mask]
if 'gt_boxes2d' in data_dict:
data_dict['gt_boxes2d'] = data_dict['gt_boxes2d'][gt_boxes_mask]
data_dict.pop('gt_boxes_mask')
return data_dict
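# --- Editor's aside: hedged sketch, not part of the commit ---
# The augmentor queue is built by calling each method with data_dict=None, which
# returns a partial bound to its config; forward() later applies the partials.
from functools import partial

class _Demo:
    def step(self, data_dict=None, config=None):
        if data_dict is None:                           # build time
            return partial(self.step, config=config)
        data_dict['along'] = config['ALONG_AXIS_LIST']  # call time
        return data_dict

queued = _Demo().step(config={'ALONG_AXIS_LIST': ['horizontal']})
print(queued({'frame_id': '000000'}))  # {'frame_id': '000000', 'along': ['horizontal']}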
@@ -39,6 +39,11 @@ class DatasetTemplate(torch_data.Dataset):
self.total_epochs = 0
self._merge_all_iters_to_one_epoch = False
if hasattr(self.data_processor, "depth_downsample_factor"):
self.depth_downsample_factor = self.data_processor.depth_downsample_factor
else:
self.depth_downsample_factor = None
@property
def mode(self):
return 'train' if self.training else 'test'
@@ -97,7 +102,7 @@ class DatasetTemplate(torch_data.Dataset):
"""
Args:
data_dict:
points: optional, (N, 3 + C_in)
gt_boxes: optional, (N, 7 + C) [x, y, z, dx, dy, dz, heading, ...]
gt_names: optional, (N), string
...
@@ -133,6 +138,10 @@ class DatasetTemplate(torch_data.Dataset):
gt_boxes = np.concatenate((data_dict['gt_boxes'], gt_classes.reshape(-1, 1).astype(np.float32)), axis=1)
data_dict['gt_boxes'] = gt_boxes
if data_dict.get('gt_boxes2d', None) is not None:
data_dict['gt_boxes2d'] = data_dict['gt_boxes2d'][selected]
if data_dict.get('points', None) is not None:
data_dict = self.point_feature_encoder.forward(data_dict)
data_dict = self.data_processor.forward(
@@ -172,6 +181,43 @@ class DatasetTemplate(torch_data.Dataset):
for k in range(batch_size):
batch_gt_boxes3d[k, :val[k].__len__(), :] = val[k]
ret[key] = batch_gt_boxes3d
elif key in ['gt_boxes2d']:
max_boxes = max([len(x) for x in val])
batch_boxes2d = np.zeros((batch_size, max_boxes, val[0].shape[-1]), dtype=np.float32)
for k in range(batch_size):
if val[k].size > 0:
batch_boxes2d[k, :val[k].__len__(), :] = val[k]
ret[key] = batch_boxes2d
elif key in ["images", "depth_maps"]:
# Get largest image size (H, W)
max_h = 0
max_w = 0
for image in val:
max_h = max(max_h, image.shape[0])
max_w = max(max_w, image.shape[1])
# Change size of images
images = []
for image in val:
pad_h = common_utils.get_pad_params(desired_size=max_h, cur_size=image.shape[0])
pad_w = common_utils.get_pad_params(desired_size=max_w, cur_size=image.shape[1])
pad_width = (pad_h, pad_w)
# Pad with nan, to be replaced later in the pipeline.
pad_value = np.nan
if key == "images":
pad_width = (pad_h, pad_w, (0, 0))
elif key == "depth_maps":
pad_width = (pad_h, pad_w)
image_pad = np.pad(image,
pad_width=pad_width,
mode='constant',
constant_values=pad_value)
images.append(image_pad)
ret[key] = np.stack(images, axis=0)
else:
ret[key] = np.stack(val, axis=0)
except:
...
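# --- Editor's aside: hedged sketch, not part of the commit ---
# What the image branch above does: pad every image in the batch to the largest
# (H, W) with NaN so later stages can recognize padded pixels, then stack.
# get_pad_params presumably returns a (pad_before, pad_after) tuple per axis.
import numpy as np

imgs = [np.ones((370, 1224, 3), np.float32), np.ones((375, 1242, 3), np.float32)]
max_h = max(im.shape[0] for im in imgs)
max_w = max(im.shape[1] for im in imgs)
padded = [np.pad(im, ((0, max_h - im.shape[0]), (0, max_w - im.shape[1]), (0, 0)),
                 mode='constant', constant_values=np.nan) for im in imgs]
print(np.stack(padded, axis=0).shape)  # (2, 375, 1242, 3)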
@@ -4,6 +4,7 @@ import pickle
import numpy as np
from skimage import io
from . import kitti_utils
from ...ops.roiaware_pool3d import roiaware_pool3d_utils
from ...utils import box_utils, calibration_kitti, common_utils, object3d_kitti
from ..dataset import DatasetTemplate
@@ -64,6 +65,21 @@ class KittiDataset(DatasetTemplate):
assert lidar_file.exists()
return np.fromfile(str(lidar_file), dtype=np.float32).reshape(-1, 4)
def get_image(self, idx):
"""
Loads image for a sample
Args:
idx: str, Sample index
Returns:
image: (H, W, 3), RGB Image
"""
img_file = self.root_split_path / 'image_2' / ('%s.png' % idx)
assert img_file.exists()
image = io.imread(img_file)
image = image.astype(np.float32)
image /= 255.0
return image
def get_image_shape(self, idx):
img_file = self.root_split_path / 'image_2' / ('%s.png' % idx)
assert img_file.exists()
@@ -74,6 +90,21 @@ class KittiDataset(DatasetTemplate):
assert label_file.exists()
return object3d_kitti.get_objects_from_label(label_file)
def get_depth_map(self, idx):
"""
Loads depth map for a sample
Args:
idx: str, Sample index
Returns:
depth: (H, W), Depth map
"""
depth_file = self.root_split_path / 'depth_2' / ('%s.png' % idx)
assert depth_file.exists()
depth = io.imread(depth_file)
depth = depth.astype(np.float32)
depth /= 256.0
return depth
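# Editor's note (hedged): KITTI-style depth PNGs are uint16 with 1/256 m
# resolution, so the inverse of the load above is roughly
# io.imsave(path, (depth_m * 256.0).astype(np.uint16)); zero-valued pixels
# mean "no ground-truth depth" and decode to 0.0 here.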
def get_calib(self, idx):
calib_file = self.root_split_path / 'calib' / ('%s.txt' % idx)
assert calib_file.exists()
@@ -277,7 +308,7 @@ class KittiDataset(DatasetTemplate):
return pred_dict
calib = batch_dict['calib'][batch_index]
image_shape = batch_dict['image_shape'][batch_index].cpu().numpy()
pred_boxes_camera = box_utils.boxes3d_lidar_to_kitti_camera(pred_boxes, calib)
pred_boxes_img = box_utils.boxes3d_kitti_camera_to_imageboxes(
pred_boxes_camera, calib, image_shape=image_shape
@@ -345,18 +376,11 @@ class KittiDataset(DatasetTemplate):
info = copy.deepcopy(self.kitti_infos[index])
sample_idx = info['point_cloud']['lidar_idx']
img_shape = info['image']['image_shape']
calib = self.get_calib(sample_idx)
get_item_list = self.dataset_cfg.get('GET_ITEM_LIST', ['points'])
input_dict = {
'frame_id': sample_idx,
'calib': calib,
}
@@ -373,10 +397,30 @@ class KittiDataset(DatasetTemplate):
'gt_names': gt_names,
'gt_boxes': gt_boxes_lidar
})
if "gt_boxes2d" in get_item_list:
input_dict['gt_boxes2d'] = annos["bbox"]
road_plane = self.get_road_plane(sample_idx)
if road_plane is not None:
input_dict['road_plane'] = road_plane
if "points" in get_item_list:
points = self.get_lidar(sample_idx)
if self.dataset_cfg.FOV_POINTS_ONLY:
pts_rect = calib.lidar_to_rect(points[:, 0:3])
fov_flag = self.get_fov_flag(pts_rect, img_shape, calib)
points = points[fov_flag]
input_dict['points'] = points
if "images" in get_item_list:
input_dict['images'] = self.get_image(sample_idx)
if "depth_maps" in get_item_list:
input_dict['depth_maps'] = self.get_depth_map(sample_idx)
if "calib_matricies" in get_item_list:
input_dict["trans_lidar_to_cam"], input_dict["trans_cam_to_img"] = kitti_utils.calib_to_matricies(calib)
data_dict = self.prepare_data(data_dict=input_dict)
data_dict['image_shape'] = img_shape
...
@@ -42,3 +42,20 @@ def transform_annotations_to_kitti_format(annos, map_name_to_kitti=None, info_wi
anno['rotation_y'] = anno['alpha'] = np.zeros(0)
return annos
def calib_to_matricies(calib):
"""
Converts calibration object to transformation matrices
Args:
calib: calibration.Calibration, Calibration object
Returns:
V2R: (4, 4), Lidar to rectified camera transformation matrix
P2: (3, 4), Camera projection matrix
"""
V2C = np.vstack((calib.V2C, np.array([0, 0, 0, 1], dtype=np.float32))) # (4, 4)
R0 = np.hstack((calib.R0, np.zeros((3, 1), dtype=np.float32))) # (3, 4)
R0 = np.vstack((R0, np.array([0, 0, 0, 1], dtype=np.float32))) # (4, 4)
V2R = R0 @ V2C
P2 = calib.P2
return V2R, P2
\ No newline at end of file
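# --- Editor's aside: hedged sketch, not part of the commit ---
# Numeric check of calib_to_matricies with an illustrative (non-KITTI) calib:
# V2R composes rectification after the lidar->camera extrinsics; P2 then projects.
import numpy as np

class _FakeCalib:  # hypothetical stand-in for calibration.Calibration
    V2C = np.eye(3, 4, dtype=np.float32)  # (3, 4) lidar -> camera
    R0 = np.eye(3, dtype=np.float32)      # (3, 3) rectification
    P2 = np.array([[700., 0., 621., 0.],
                   [0., 700., 187., 0.],
                   [0., 0., 1., 0.]], dtype=np.float32)  # (3, 4) projection

V2R, P2 = calib_to_matricies(_FakeCalib())
print(V2R.shape, P2.shape)  # (4, 4) (3, 4)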
from functools import partial
import numpy as np
from skimage import transform
from ...utils import box_utils, common_utils
@@ -19,8 +20,11 @@ class DataProcessor(object):
def mask_points_and_boxes_outside_range(self, data_dict=None, config=None):
if data_dict is None:
return partial(self.mask_points_and_boxes_outside_range, config=config)
if data_dict.get('points', None) is not None:
mask = common_utils.mask_points_by_range(data_dict['points'], self.point_cloud_range)
data_dict['points'] = data_dict['points'][mask]
if data_dict.get('gt_boxes', None) is not None and config.REMOVE_OUTSIDE_BOXES and self.training:
mask = box_utils.mask_boxes_outside_range_numpy(
data_dict['gt_boxes'], self.point_cloud_range, min_num_corners=config.get('min_num_corners', 1)
@@ -106,6 +110,25 @@ class DataProcessor(object):
data_dict['points'] = points[choice]
return data_dict
def calculate_grid_size(self, data_dict=None, config=None):
if data_dict is None:
grid_size = (self.point_cloud_range[3:6] - self.point_cloud_range[0:3]) / np.array(config.VOXEL_SIZE)
self.grid_size = np.round(grid_size).astype(np.int64)
self.voxel_size = config.VOXEL_SIZE
return partial(self.calculate_grid_size, config=config)
return data_dict
def downsample_depth_map(self, data_dict=None, config=None):
if data_dict is None:
self.depth_downsample_factor = config.DOWNSAMPLE_FACTOR
return partial(self.downsample_depth_map, config=config)
data_dict['depth_maps'] = transform.downscale_local_mean(
image=data_dict['depth_maps'],
factors=(self.depth_downsample_factor, self.depth_downsample_factor)
)
return data_dict
def forward(self, data_dict):
"""
Args:
...
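# --- Editor's aside: hedged sketch, not part of the commit ---
# downscale_local_mean block-averages the depth map; with DOWNSAMPLE_FACTOR = 4
# a (376, 1248) map becomes (94, 312). Non-divisible sizes are zero-padded first.
import numpy as np
from skimage import transform

depth = np.random.rand(376, 1248).astype(np.float32)
print(transform.downscale_local_mean(depth, factors=(4, 4)).shape)  # (94, 312)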
@@ -2,6 +2,7 @@ from collections import namedtuple
import numpy as np
import torch
import kornia
from .detectors import build_detector
@@ -17,8 +18,13 @@ def load_data_to_gpu(batch_dict):
for key, val in batch_dict.items():
if not isinstance(val, np.ndarray):
continue
elif key in ['frame_id', 'metadata', 'calib']:
continue
elif key in ['images']:
batch_dict[key] = kornia.image_to_tensor(val).float().cuda().contiguous()
elif key in ['image_shape']:
batch_dict[key] = torch.from_numpy(val).int().cuda()
else:
batch_dict[key] = torch.from_numpy(val).float().cuda()
...
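# --- Editor's aside: hedged sketch, not part of the commit ---
# Why 'images' gets its own branch: kornia.image_to_tensor reorders the batched
# HWC numpy layout from the dataloader into the BCHW layout conv nets expect.
import numpy as np
import kornia

batch_images = np.zeros((2, 375, 1242, 3), dtype=np.float32)  # (B, H, W, C)
print(kornia.image_to_tensor(batch_images).shape)             # (2, 3, 375, 1242)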
from .height_compression import HeightCompression
from .pointpillar_scatter import PointPillarScatter
from .conv2d_collapse import Conv2DCollapse
__all__ = {
'HeightCompression': HeightCompression,
'PointPillarScatter': PointPillarScatter,
'Conv2DCollapse': Conv2DCollapse
}
import torch
import torch.nn as nn
from pcdet.models.model_utils.basic_block_2d import BasicBlock2D
class Conv2DCollapse(nn.Module):
def __init__(self, model_cfg, grid_size):
"""
Initializes 2D convolution collapse module
Args:
model_cfg: EasyDict, Model configuration
grid_size: (X, Y, Z) Voxel grid size
"""
super().__init__()
self.model_cfg = model_cfg
self.num_heights = grid_size[-1]
self.num_bev_features = self.model_cfg.NUM_BEV_FEATURES
self.block = BasicBlock2D(in_channels=self.num_bev_features * self.num_heights,
out_channels=self.num_bev_features,
**self.model_cfg.ARGS)
def forward(self, batch_dict):
"""
Collapses voxel features to BEV via concatenation and channel reduction
Args:
batch_dict:
voxel_features: (B, C, Z, Y, X), Voxel feature representation
Returns:
batch_dict:
spatial_features: (B, C, Y, X), BEV feature representation
"""
voxel_features = batch_dict["voxel_features"]
bev_features = voxel_features.flatten(start_dim=1, end_dim=2) # (B, C, Z, Y, X) -> (B, C*Z, Y, X)
bev_features = self.block(bev_features) # (B, C*Z, Y, X) -> (B, C, Y, X)
batch_dict["spatial_features"] = bev_features
return batch_dict
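# --- Editor's aside: hedged sketch, not part of the commit ---
# Shape walk-through of the collapse: a 1x1 Conv2d stands in for BasicBlock2D.
import torch

B, C, Z, Y, X = 2, 64, 5, 94, 312
voxel_features = torch.zeros(B, C, Z, Y, X)
bev_in = voxel_features.flatten(start_dim=1, end_dim=2)  # (B, C*Z, Y, X)
bev_out = torch.nn.Conv2d(C * Z, C, kernel_size=1)(bev_in)
print(bev_in.shape, bev_out.shape)  # (2, 320, 94, 312) (2, 64, 94, 312)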
from .mean_vfe import MeanVFE
from .pillar_vfe import PillarVFE
from .image_vfe import ImageVFE
from .vfe_template import VFETemplate
__all__ = {
'VFETemplate': VFETemplate,
'MeanVFE': MeanVFE,
'PillarVFE': PillarVFE,
'ImageVFE': ImageVFE
}
import torch
from .vfe_template import VFETemplate
from .image_vfe_modules import ffn, f2v
class ImageVFE(VFETemplate):
def __init__(self, model_cfg, grid_size, point_cloud_range, depth_downsample_factor, **kwargs):
super().__init__(model_cfg=model_cfg)
self.grid_size = grid_size
self.pc_range = point_cloud_range
self.downsample_factor = depth_downsample_factor
self.module_topology = [
'ffn', 'f2v'
]
self.build_modules()
def build_modules(self):
"""
Builds modules
"""
for module_name in self.module_topology:
module = getattr(self, 'build_%s' % module_name)()
self.add_module(module_name, module)
def build_ffn(self):
"""
Builds frustum feature network
Returns:
ffn_module: nn.Module, Frustum feature network
"""
ffn_module = ffn.__all__[self.model_cfg.FFN.NAME](
model_cfg=self.model_cfg.FFN,
downsample_factor=self.downsample_factor
)
self.disc_cfg = ffn_module.disc_cfg
return ffn_module
def build_f2v(self):
"""
Builds frustum to voxel transformation
Returns:
f2v_module: nn.Module, Frustum to voxel transformation
"""
f2v_module = f2v.__all__[self.model_cfg.F2V.NAME](
model_cfg=self.model_cfg.F2V,
grid_size=self.grid_size,
pc_range=self.pc_range,
disc_cfg=self.disc_cfg
)
return f2v_module
def get_output_feature_dim(self):
"""
Gets number of output channels
Returns:
out_feature_dim: int, Number of output channels
"""
out_feature_dim = self.ffn.get_output_feature_dim()
return out_feature_dim
def forward(self, batch_dict, **kwargs):
"""
Args:
batch_dict:
images: (N, 3, H_in, W_in), Input images
**kwargs:
Returns:
batch_dict:
voxel_features: (B, C, Z, Y, X), Image voxel features
"""
batch_dict = self.ffn(batch_dict)
batch_dict = self.f2v(batch_dict)
return batch_dict
def get_loss(self):
"""
Gets DDN loss
Returns:
loss: (1), Depth distribution network loss
tb_dict: dict[float], All losses to log in tensorboard
"""
loss, tb_dict = self.ffn.get_loss()
return loss, tb_dict
from .frustum_to_voxel import FrustumToVoxel
__all__ = {
'FrustumToVoxel': FrustumToVoxel
}
import torch
import torch.nn as nn
import kornia
from pcdet.utils import transform_utils
class FrustumGridGenerator(nn.Module):
def __init__(self, grid_size, pc_range, disc_cfg):
"""
Initializes Grid Generator for frustum features
Args:
grid_size: [X, Y, Z], Voxel grid size
pc_range: [x_min, y_min, z_min, x_max, y_max, z_max], Voxelization point cloud range (m)
disc_cfg: EasyDict, Depth discretization configuration
"""
super().__init__()
self.dtype = torch.float32
self.grid_size = torch.as_tensor(grid_size)
self.pc_range = pc_range
self.out_of_bounds_val = -2
self.disc_cfg = disc_cfg
# Calculate voxel size
pc_range = torch.as_tensor(pc_range).reshape(2, 3)
self.pc_min = pc_range[0]
self.pc_max = pc_range[1]
self.voxel_size = (self.pc_max - self.pc_min) / self.grid_size
# Create voxel grid
self.depth, self.width, self.height = self.grid_size.int()
self.voxel_grid = kornia.utils.create_meshgrid3d(depth=self.depth,
height=self.height,
width=self.width,
normalized_coordinates=False)
self.voxel_grid = self.voxel_grid.permute(0, 1, 3, 2, 4) # XZY-> XYZ
# Add offsets to center of voxel
self.voxel_grid += 0.5
self.grid_to_lidar = self.grid_to_lidar_unproject(pc_min=self.pc_min,
voxel_size=self.voxel_size)
def grid_to_lidar_unproject(self, pc_min, voxel_size):
"""
Calculate grid to LiDAR unprojection for each plane
Args:
pc_min: [x_min, y_min, z_min], Minimum of point cloud range (m)
voxel_size: [x, y, z], Size of each voxel (m)
Returns:
unproject: (4, 4), Voxel grid to LiDAR unprojection matrix
"""
x_size, y_size, z_size = voxel_size
x_min, y_min, z_min = pc_min
unproject = torch.tensor([[x_size, 0, 0, x_min],
[0, y_size, 0, y_min],
[0, 0, z_size, z_min],
[0, 0, 0, 1]],
dtype=self.dtype) # (4, 4)
return unproject
def transform_grid(self, voxel_grid, grid_to_lidar, lidar_to_cam, cam_to_img):
"""
Transforms voxel sampling grid into frustum sampling grid
Args:
voxel_grid: (B, X, Y, Z, 3), Voxel sampling grid
grid_to_lidar: (4, 4), Voxel grid to LiDAR unprojection matrix
lidar_to_cam: (B, 4, 4), LiDAR to camera frame transformation
cam_to_img: (B, 3, 4), Camera projection matrix
Returns:
frustum_grid: (B, X, Y, Z, 3), Frustum sampling grid
"""
B = lidar_to_cam.shape[0]
# Create transformation matrices
V_G = grid_to_lidar # Voxel Grid -> LiDAR (4, 4)
C_V = lidar_to_cam # LiDAR -> Camera (B, 4, 4)
I_C = cam_to_img # Camera -> Image (B, 3, 4)
trans = C_V @ V_G
# Reshape to match dimensions
trans = trans.reshape(B, 1, 1, 4, 4)
voxel_grid = voxel_grid.repeat_interleave(repeats=B, dim=0)
# Transform to camera frame
camera_grid = kornia.transform_points(trans_01=trans, points_1=voxel_grid)
# Project to image
I_C = I_C.reshape(B, 1, 1, 3, 4)
image_grid, image_depths = transform_utils.project_to_image(project=I_C, points=camera_grid)
# Convert depths to depth bins
image_depths = transform_utils.bin_depths(depth_map=image_depths, **self.disc_cfg)
# Stack to form frustum grid
image_depths = image_depths.unsqueeze(-1)
frustum_grid = torch.cat((image_grid, image_depths), dim=-1)
return frustum_grid
def forward(self, lidar_to_cam, cam_to_img, image_shape):
"""
Generates sampling grid for frustum features
Args:
lidar_to_cam: (B, 4, 4), LiDAR to camera frame transformation
cam_to_img: (B, 3, 4), Camera projection matrix
image_shape: (B, 2), Image shape [H, W]
Returns:
frustum_grid: (B, X, Y, Z, 3), Sampling grids for frustum features
"""
frustum_grid = self.transform_grid(voxel_grid=self.voxel_grid.to(lidar_to_cam.device),
grid_to_lidar=self.grid_to_lidar.to(lidar_to_cam.device),
lidar_to_cam=lidar_to_cam,
cam_to_img=cam_to_img)
# Normalize grid
image_shape, _ = torch.max(image_shape, dim=0)
image_depth = torch.tensor([self.disc_cfg["num_bins"]],
device=image_shape.device,
dtype=image_shape.dtype)
frustum_shape = torch.cat((image_depth, image_shape))
frustum_grid = transform_utils.normalize_coords(coords=frustum_grid, shape=frustum_shape)
# Replace any NaNs or infinites with out of bounds
mask = ~torch.isfinite(frustum_grid)
frustum_grid[mask] = self.out_of_bounds_val
return frustum_grid
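# --- Editor's aside: hedged sketch, not part of the commit ---
# bin_depths lives in pcdet/utils/transform_utils.py (not shown in this diff).
# CaDDN's LID discretization grows bin width linearly with depth; inverting the
# cumulative bin edges gives the index formula below. Details may differ from
# the actual implementation.
import torch

def lid_bin_index(depth, depth_min, depth_max, num_bins):
    bin_size = 2 * (depth_max - depth_min) / (num_bins * (1 + num_bins))
    return -0.5 + 0.5 * torch.sqrt(1 + 8 * (depth - depth_min) / bin_size)

d = torch.tensor([2.0, 20.0, 46.8])
print(lid_bin_index(d, depth_min=2.0, depth_max=46.8, num_bins=80))  # ~[0.0, 50.5, 80.0]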
import torch
import torch.nn as nn
from .frustum_grid_generator import FrustumGridGenerator
from .sampler import Sampler
class FrustumToVoxel(nn.Module):
def __init__(self, model_cfg, grid_size, pc_range, disc_cfg):
"""
Initializes module to transform frustum features to voxel features via 3D transformation and sampling
Args:
model_cfg: EasyDict, Module configuration
grid_size: [X, Y, Z], Voxel grid size
pc_range: [x_min, y_min, z_min, x_max, y_max, z_max], Voxelization point cloud range (m)
disc_cfg: EasyDict, Depth discretization configuration
"""
super().__init__()
self.model_cfg = model_cfg
self.grid_size = grid_size
self.pc_range = pc_range
self.disc_cfg = disc_cfg
self.grid_generator = FrustumGridGenerator(grid_size=grid_size,
pc_range=pc_range,
disc_cfg=disc_cfg)
self.sampler = Sampler(**model_cfg.SAMPLER)
def forward(self, batch_dict):
"""
Generates voxel features via 3D transformation and sampling
Args:
batch_dict:
frustum_features: (B, C, D, H_image, W_image), Image frustum features
lidar_to_cam: (B, 4, 4), LiDAR to camera frame transformation
cam_to_img: (B, 3, 4), Camera projection matrix
image_shape: (B, 2), Image shape [H, W]
Returns:
batch_dict:
voxel_features: (B, C, Z, Y, X), Image voxel features
"""
# Generate sampling grid for frustum volume
grid = self.grid_generator(lidar_to_cam=batch_dict["trans_lidar_to_cam"],
cam_to_img=batch_dict["trans_cam_to_img"],
image_shape=batch_dict["image_shape"]) # (B, X, Y, Z, 3)
# Sample frustum volume to generate voxel volume
voxel_features = self.sampler(input_features=batch_dict["frustum_features"],
grid=grid) # (B, C, X, Y, Z)
# (B, C, X, Y, Z) -> (B, C, Z, Y, X)
voxel_features = voxel_features.permute(0, 1, 4, 3, 2)
batch_dict["voxel_features"] = voxel_features
return batch_dict
import torch
import torch.nn as nn
import torch.nn.functional as F
class Sampler(nn.Module):
def __init__(self, mode="bilinear", padding_mode="zeros"):
"""
Initializes module
Args:
mode: string, Sampling mode [bilinear/nearest]
padding_mode: string, Padding mode for outside grid values [zeros/border/reflection]
"""
super().__init__()
self.mode = mode
self.padding_mode = padding_mode
def forward(self, input_features, grid):
"""
Samples input using sampling grid
Args:
input_features: (B, C, D, H, W), Input frustum features
grid: (B, X, Y, Z, 3), Sampling grids for input features
Returns:
output_features: (B, C, X, Y, Z), Output voxel features
"""
# Sample from grid
output = F.grid_sample(input=input_features, grid=grid, mode=self.mode, padding_mode=self.padding_mode)
return output
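# --- Editor's aside: hedged sketch, not part of the commit ---
# With 5D inputs, F.grid_sample maps input (B, C, D, H, W) plus a grid of
# normalized (x, y, z) coordinates shaped (B, X, Y, Z, 3) to (B, C, X, Y, Z),
# which FrustumToVoxel then permutes to (B, C, Z, Y, X).
import torch
import torch.nn.functional as F

frustum = torch.rand(1, 32, 80, 47, 156)               # (B, C, D, H, W)
grid = torch.rand(1, 280, 280, 20, 3) * 2 - 1          # normalized to [-1, 1]
print(F.grid_sample(frustum, grid, align_corners=False).shape)  # (1, 32, 280, 280, 20)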
from .depth_ffn import DepthFFN
__all__ = {
'DepthFFN': DepthFFN
}
from .ddn_deeplabv3 import DDNDeepLabV3
__all__ = {
'DDNDeepLabV3': DDNDeepLabV3
}