Commit 2dad86c2 authored by YirongYan, committed by GitHub

[Feature]Support NeRF-Det (#2732)

# NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
> [NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection](https://arxiv.org/abs/2307.14620)
<!-- [ALGORITHM] -->
## Abstract
NeRF-Det is a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, NeRF-Det makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, NeRF-Det introduces sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, it subtly connects the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. NeRF-Det outperforms the state of the art by 3.9 mAP and 3.1 mAP on the ScanNet and ARKITScenes benchmarks, respectively. The authors provide extensive analysis to shed light on how NeRF-Det works. Thanks to the joint-training design, NeRF-Det generalizes well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at https://github.com/facebookresearch/NeRF-Det
<div align=center>
<img src="https://chenfengxu714.github.io/nerfdet/static/images/method-cropped_1.png" width="800"/>
</div>
## Introduction
This directory contains the implementation of NeRF-Det (https://arxiv.org/abs/2307.14620). Our implementation is built on top of MMDetection3D. We have updated NeRF-Det to be compatible with the latest mmdet3d version; the codebase and config files have been adapted accordingly. All previously pretrained models have been verified against the results listed below. However, newly trained models are yet to be uploaded.
<!-- Share any information you would like others to know. For example:
Author: @xxx.
This is an implementation of \[XXX\]. -->
## Dataset
The ScanNet dataset format in the latest version of mmdet3d only supports LiDAR-based tasks, so NeRF-Det requires a new ScanNet dataset format.
Please follow the ScanNet data preparation instructions in mmdet3d to prepare the raw data. After that, use the following command to generate the pkl files used by nerfdet:
```bash
python projects/NeRF-Det/prepare_infos.py --root-path ./data/scannet --out-dir ./data/scannet
```
The new pkl files are organized as follows (a short loading sketch is given after the list):
- scannet_infos_train.pkl: The training data infos. The detailed info of each scan is as follows:
  - info\['instances'\]: A list of dicts containing all annotations; each dict holds the annotation information of a single instance. For the i-th instance:
    - info\['instances'\]\[i\]\['bbox_3d'\]: A list of 6 numbers representing the axis-aligned 3D bounding box in the depth coordinate system, in (x, y, z, l, w, h) order.
    - info\['instances'\]\[i\]\['bbox_label_3d'\]: The label of the 3D bounding box.
  - info\['cam2img'\]: The intrinsic matrix. Every scene has one matrix.
  - info\['lidar2cam'\]: The extrinsic matrices. Every scene has 300 matrices.
  - info\['img_paths'\]: The paths of the 300 RGB images.
  - info\['axis_align_matrix'\]: The axis-alignment matrix. Every scene has one matrix.
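Below is a minimal, illustrative sketch for sanity-checking the generated pkl. It assumes the standard mmdet3d v2 info layout with a top-level `data_list` key; the exact file name and keys may differ in your setup (the configs reference `scannet_infos_train_new.pkl`), so treat it as a starting point rather than part of the project.

```python
# Illustrative only: inspect one scan from the generated infos.
# Assumes the mmdet3d v2 layout: {'metainfo': ..., 'data_list': [...]}.
import mmengine

infos = mmengine.load('./data/scannet/scannet_infos_train.pkl')
scan = infos['data_list'][0]
print(scan['cam2img'])            # one intrinsic matrix per scene
print(len(scan['lidar2cam']))     # extrinsic matrices (300 per scene)
print(len(scan['img_paths']))     # paths of the RGB images
for inst in scan['instances'][:3]:
    print(inst['bbox_3d'], inst['bbox_label_3d'])
```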
After preparing your ScanNet dataset pkls, please change the paths in the configs to fit your project.
## Train
In MMDet3D's root directory, run the following command to train the model:
```bash
python tools/train.py projects/NeRF-Det/configs/nerfdet_res50_2x_low_res.py --work-dir ${WORK_DIR}
```
## Results and Models
### NeRF-Det
| Backbone | mAP@25 | mAP@50 | Log |
| :-------------------------------------------------------------: | :----: | :----: | :-------: |
| [NeRF-Det-R50](./configs/nerfdet_res50_2x_low_res.py) | 53.0 | 26.8 | [log](<>) |
| [NeRF-Det-R50\*](./configs/nerfdet_res50_2x_low_res_depth.py) | 52.2 | 28.5 | [log](<>) |
| [NeRF-Det-R101\*](./configs/nerfdet_res101_2x_low_res_depth.py) | 52.3 | 28.5 | [log](<>) |
(Here NeRF-Det-R50\* means the model uses depth information during training.)
### Notes
- The values shown in the table are the best mAP obtained during training.
- Since the model's behavior involves considerable randomness, we ran three experiments for each config and took the average. The mAP values shown in the table above are these averages.
- We also conducted the same experiments with the original code; the results are shown below.
| Backbone | mAP@25 | mAP@50 |
| :-------------: | :----: | :----: |
| NeRF-Det-R50 | 52.8 | 26.8 |
| NeRF-Det-R50\* | 52.4 | 27.5 |
| NeRF-Det-R101\* | 52.8 | 28.6 |
- Attention: Because of the randomness in the construction of the ScanNet dataset itself and in the behavior of the model, training results can fluctuate considerably. In our experience, results typically vary by about ±1.5 mAP.
## Evaluation using pretrained models
1. Download the pretrained checkpoints through the links in the table above.
2. Testing
To test, use:
```bash
python tools/test.py projects/NeRF-Det/configs/nerfdet_res50_2x_low_res.py ${CHECKPOINT_PATH}
```
## Citation
<!-- You may remove this section if not applicable. -->
```latex
@inproceedings{
xu2023nerfdet,
title={NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection},
author={Xu, Chenfeng and Wu, Bichen and Hou, Ji and Tsai, Sam and Li, Ruilong and Wang, Jialiang and Zhan, Wei and He, Zijian and Vajda, Peter and Keutzer, Kurt and Tomizuka, Masayoshi},
booktitle={ICCV},
year={2023},
}
@inproceedings{
park2023time,
title={Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection},
author={Jinhyung Park and Chenfeng Xu and Shijia Yang and Kurt Keutzer and Kris M. Kitani and Masayoshi Tomizuka and Wei Zhan},
booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=H3HcEJA2Um}
}
```
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(imports=['projects.NeRF-Det.nerfdet'])
prior_generator = dict(
type='AlignedAnchor3DRangeGenerator',
ranges=[[-3.2, -3.2, -1.28, 3.2, 3.2, 1.28]],
rotations=[.0])
model = dict(
type='NerfDet',
data_preprocessor=dict(
type='NeRFDetDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True,
pad_size_divisor=10),
backbone=dict(
type='mmdet.ResNet',
depth=101,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet101'),
style='pytorch'),
neck=dict(
type='mmdet.FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
num_outs=4),
neck_3d=dict(
type='IndoorImVoxelNeck',
in_channels=256,
out_channels=128,
n_blocks=[1, 1, 1]),
bbox_head=dict(
type='NerfDetHead',
bbox_loss=dict(type='AxisAlignedIoULoss', loss_weight=1.0),
n_classes=18,
n_levels=3,
n_channels=128,
n_reg_outs=6,
pts_assign_threshold=27,
pts_center_threshold=18,
prior_generator=prior_generator),
prior_generator=prior_generator,
voxel_size=[.16, .16, .2],
n_voxels=[40, 40, 16],
aabb=([-2.7, -2.7, -0.78], [3.7, 3.7, 1.78]),
near_far_range=[0.2, 8.0],
N_samples=64,
N_rand=2048,
nerf_mode='image',
depth_supervise=True,
use_nerf_mask=True,
nerf_sample_view=20,
squeeze_scale=4,
nerf_density=True,
train_cfg=dict(),
test_cfg=dict(nms_pre=1000, iou_thr=.25, score_thr=.01))
dataset_type = 'MultiViewScanNetDataset'
data_root = 'data/scannet/'
class_names = [
'cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window', 'bookshelf',
'picture', 'counter', 'desk', 'curtain', 'refrigerator', 'showercurtrain',
'toilet', 'sink', 'bathtub', 'garbagebin'
]
metainfo = dict(CLASSES=class_names)
file_client_args = dict(backend='disk')
input_modality = dict(
use_camera=True,
use_depth=True,
use_lidar=False,
use_neuralrecon_depth=False,
use_ray=True)
backend_args = None
train_collect_keys = [
'img', 'gt_bboxes_3d', 'gt_labels_3d', 'depth', 'lightpos', 'nerf_sizes',
'raydirs', 'gt_images', 'gt_depths', 'denorm_images'
]
test_collect_keys = [
'img',
'depth',
'lightpos',
'nerf_sizes',
'raydirs',
'gt_images',
'gt_depths',
'denorm_images',
]
train_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=48,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=10),
dict(type='RandomShiftOrigin', std=(.7, .7, .0)),
dict(type='PackNeRFDetInputs', keys=train_collect_keys)
]
test_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=101,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=1),
dict(type='PackNeRFDetInputs', keys=test_collect_keys)
]
train_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='RepeatDataset',
times=6,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_train_new.pkl',
pipeline=train_pipeline,
modality=input_modality,
test_mode=False,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo)))
val_dataloader = dict(
batch_size=1,
num_workers=5,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_val_new.pkl',
pipeline=test_pipeline,
modality=input_modality,
test_mode=True,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo))
test_dataloader = val_dataloader
val_evaluator = dict(type='IndoorMetric')
test_evaluator = val_evaluator
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
test_cfg = dict()
val_cfg = dict()
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
paramwise_cfg=dict(
custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}),
clip_grad=dict(max_norm=35., norm_type=2))
param_scheduler = [
dict(
type='MultiStepLR',
begin=0,
end=12,
by_epoch=True,
milestones=[8, 11],
gamma=0.1)
]
# hooks
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=12))
# runtime
find_unused_parameters = True # only 1 of 4 FPN outputs is used
_base_ = ['./nerfdet_res50_2x_low_res_depth.py']
model = dict(depth_supervise=False)
dataset_type = 'MultiViewScanNetDataset'
data_root = 'data/scannet/'
class_names = [
'cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window', 'bookshelf',
'picture', 'counter', 'desk', 'curtain', 'refrigerator', 'showercurtrain',
'toilet', 'sink', 'bathtub', 'garbagebin'
]
metainfo = dict(CLASSES=class_names)
file_client_args = dict(backend='disk')
input_modality = dict(use_depth=False)
backend_args = None
train_collect_keys = [
'img', 'gt_bboxes_3d', 'gt_labels_3d', 'lightpos', 'nerf_sizes', 'raydirs',
'gt_images', 'gt_depths', 'denorm_images'
]
test_collect_keys = [
'img',
'lightpos',
'nerf_sizes',
'raydirs',
'gt_images',
'gt_depths',
'denorm_images',
]
train_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=50,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=10),
dict(type='RandomShiftOrigin', std=(.7, .7, .0)),
dict(type='PackNeRFDetInputs', keys=train_collect_keys)
]
test_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=101,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=1),
dict(type='PackNeRFDetInputs', keys=test_collect_keys)
]
train_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='RepeatDataset',
times=6,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_train_new.pkl',
pipeline=train_pipeline,
modality=input_modality,
test_mode=False,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo)))
val_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_val_new.pkl',
pipeline=test_pipeline,
modality=input_modality,
test_mode=True,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo))
test_dataloader = val_dataloader
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(imports=['projects.NeRF-Det.nerfdet'])
prior_generator = dict(
type='AlignedAnchor3DRangeGenerator',
ranges=[[-3.2, -3.2, -1.28, 3.2, 3.2, 1.28]],
rotations=[.0])
model = dict(
type='NerfDet',
data_preprocessor=dict(
type='NeRFDetDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True,
pad_size_divisor=10),
backbone=dict(
type='mmdet.ResNet',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50'),
style='pytorch'),
neck=dict(
type='mmdet.FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
num_outs=4),
neck_3d=dict(
type='IndoorImVoxelNeck',
in_channels=256,
out_channels=128,
n_blocks=[1, 1, 1]),
bbox_head=dict(
type='NerfDetHead',
bbox_loss=dict(type='AxisAlignedIoULoss', loss_weight=1.0),
n_classes=18,
n_levels=3,
n_channels=128,
n_reg_outs=6,
pts_assign_threshold=27,
pts_center_threshold=18,
prior_generator=prior_generator),
prior_generator=prior_generator,
voxel_size=[.16, .16, .2],
n_voxels=[40, 40, 16],
aabb=([-2.7, -2.7, -0.78], [3.7, 3.7, 1.78]),
near_far_range=[0.2, 8.0],
N_samples=64,
N_rand=2048,
nerf_mode='image',
depth_supervise=True,
use_nerf_mask=True,
nerf_sample_view=20,
squeeze_scale=4,
nerf_density=True,
train_cfg=dict(),
test_cfg=dict(nms_pre=1000, iou_thr=.25, score_thr=.01))
dataset_type = 'MultiViewScanNetDataset'
data_root = 'data/scannet/'
class_names = [
'cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window', 'bookshelf',
'picture', 'counter', 'desk', 'curtain', 'refrigerator', 'showercurtrain',
'toilet', 'sink', 'bathtub', 'garbagebin'
]
metainfo = dict(CLASSES=class_names)
file_client_args = dict(backend='disk')
input_modality = dict(
use_camera=True,
use_depth=True,
use_lidar=False,
use_neuralrecon_depth=False,
use_ray=True)
backend_args = None
train_collect_keys = [
'img', 'gt_bboxes_3d', 'gt_labels_3d', 'depth', 'lightpos', 'nerf_sizes',
'raydirs', 'gt_images', 'gt_depths', 'denorm_images'
]
test_collect_keys = [
'img',
'depth',
'lightpos',
'nerf_sizes',
'raydirs',
'gt_images',
'gt_depths',
'denorm_images',
]
train_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=50,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=10),
dict(type='RandomShiftOrigin', std=(.7, .7, .0)),
dict(type='PackNeRFDetInputs', keys=train_collect_keys)
]
test_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=101,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=1),
dict(type='PackNeRFDetInputs', keys=test_collect_keys)
]
train_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='RepeatDataset',
times=6,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_train_new.pkl',
pipeline=train_pipeline,
modality=input_modality,
test_mode=False,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo)))
val_dataloader = dict(
batch_size=1,
num_workers=5,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_val_new.pkl',
pipeline=test_pipeline,
modality=input_modality,
test_mode=True,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo))
test_dataloader = val_dataloader
val_evaluator = dict(type='IndoorMetric')
test_evaluator = val_evaluator
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
test_cfg = dict()
val_cfg = dict()
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
paramwise_cfg=dict(
custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}),
clip_grad=dict(max_norm=35., norm_type=2))
param_scheduler = [
dict(
type='MultiStepLR',
begin=0,
end=12,
by_epoch=True,
milestones=[8, 11],
gamma=0.1)
]
# hooks
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=12))
# runtime
find_unused_parameters = True # only 1 of 4 FPN outputs is used
from .data_preprocessor import NeRFDetDataPreprocessor
from .formating import PackNeRFDetInputs
from .multiview_pipeline import MultiViewPipeline, RandomShiftOrigin
from .nerfdet import NerfDet
from .nerfdet_head import NerfDetHead
from .scannet_multiview_dataset import MultiViewScanNetDataset
__all__ = [
'MultiViewScanNetDataset', 'MultiViewPipeline', 'RandomShiftOrigin',
'PackNeRFDetInputs', 'NeRFDetDataPreprocessor', 'NerfDetHead', 'NerfDet'
]
# Copyright (c) OpenMMLab. All rights reserved.
from typing import List, Sequence, Union
import mmengine
import numpy as np
import torch
from mmcv import BaseTransform
from mmengine.structures import InstanceData
from numpy import dtype
from mmdet3d.registry import TRANSFORMS
from mmdet3d.structures import BaseInstance3DBoxes, PointData
from mmdet3d.structures.points import BasePoints
# from .det3d_data_sample import Det3DDataSample
from .nerf_det3d_data_sample import NeRFDet3DDataSample
def to_tensor(
data: Union[torch.Tensor, np.ndarray, Sequence, int,
float]) -> torch.Tensor:
"""Convert objects of various python types to :obj:`torch.Tensor`.
Supported types are: :class:`numpy.ndarray`, :class:`torch.Tensor`,
:class:`Sequence`, :class:`int` and :class:`float`.
Args:
data (torch.Tensor | numpy.ndarray | Sequence | int | float): Data to
be converted.
Returns:
torch.Tensor: the converted data.
"""
if isinstance(data, torch.Tensor):
return data
elif isinstance(data, np.ndarray):
if data.dtype is dtype('float64'):
data = data.astype(np.float32)
return torch.from_numpy(data)
elif isinstance(data, Sequence) and not mmengine.is_str(data):
return torch.tensor(data)
elif isinstance(data, int):
return torch.LongTensor([data])
elif isinstance(data, float):
return torch.FloatTensor([data])
else:
raise TypeError(f'type {type(data)} cannot be converted to tensor.')
@TRANSFORMS.register_module()
class PackNeRFDetInputs(BaseTransform):
INPUTS_KEYS = ['points', 'img']
NERF_INPUT_KEYS = [
'img', 'denorm_images', 'depth', 'lightpos', 'nerf_sizes', 'raydirs'
]
INSTANCEDATA_3D_KEYS = [
'gt_bboxes_3d', 'gt_labels_3d', 'attr_labels', 'depths', 'centers_2d'
]
INSTANCEDATA_2D_KEYS = [
'gt_bboxes',
'gt_bboxes_labels',
]
NERF_3D_KEYS = ['gt_images', 'gt_depths']
SEG_KEYS = [
'gt_seg_map', 'pts_instance_mask', 'pts_semantic_mask',
'gt_semantic_seg'
]
def __init__(
self,
keys: tuple,
meta_keys: tuple = ('img_path', 'ori_shape', 'img_shape', 'lidar2img',
'depth2img', 'cam2img', 'pad_shape',
'scale_factor', 'flip', 'pcd_horizontal_flip',
'pcd_vertical_flip', 'box_mode_3d', 'box_type_3d',
'img_norm_cfg', 'num_pts_feats', 'pcd_trans',
'sample_idx', 'pcd_scale_factor', 'pcd_rotation',
'pcd_rotation_angle', 'lidar_path',
'transformation_3d_flow', 'trans_mat',
'affine_aug', 'sweep_img_metas', 'ori_cam2img',
'cam2global', 'crop_offset', 'img_crop_offset',
'resize_img_shape', 'lidar2cam', 'ori_lidar2img',
'num_ref_frames', 'num_views', 'ego2global',
'axis_align_matrix')
) -> None:
self.keys = keys
self.meta_keys = meta_keys
def _remove_prefix(self, key: str) -> str:
if key.startswith('gt_'):
key = key[3:]
return key
def transform(self, results: Union[dict,
List[dict]]) -> Union[dict, List[dict]]:
"""Method to pack the input data. when the value in this dict is a
list, it usually is in Augmentations Testing.
Args:
results (dict | list[dict]): Result dict from the data pipeline.
Returns:
dict | List[dict]:
- 'inputs' (dict): The forward data of models. It usually contains
following keys:
- points
- img
- 'data_samples' (:obj:`NeRFDet3DDataSample`): The annotation info
of the sample.
"""
# augtest
if isinstance(results, list):
if len(results) == 1:
# simple test
return self.pack_single_results(results[0])
pack_results = []
for single_result in results:
pack_results.append(self.pack_single_results(single_result))
return pack_results
# norm training and simple testing
elif isinstance(results, dict):
return self.pack_single_results(results)
else:
raise NotImplementedError
def pack_single_results(self, results: dict) -> dict:
"""Method to pack the single input data. when the value in this dict is
a list, it usually is in Augmentations Testing.
Args:
results (dict): Result dict from the data pipeline.
Returns:
dict: A dict containing
- 'inputs' (dict): The forward data of models. It usually contains
following keys:
- points
- img
- 'data_samples' (:obj:`NeRFDet3DDataSample`): The annotation info
of the sample.
"""
# Format 3D data
if 'points' in results:
if isinstance(results['points'], BasePoints):
results['points'] = results['points'].tensor
if 'img' in results:
if isinstance(results['img'], list):
# process multiple imgs in single frame
imgs = np.stack(results['img'], axis=0)
if imgs.flags.c_contiguous:
imgs = to_tensor(imgs).permute(0, 3, 1, 2).contiguous()
else:
imgs = to_tensor(
np.ascontiguousarray(imgs.transpose(0, 3, 1, 2)))
results['img'] = imgs
else:
img = results['img']
if len(img.shape) < 3:
img = np.expand_dims(img, -1)
# To improve the computational speed by 3-5 times, apply:
# `torch.permute()` rather than `np.transpose()`.
# Refer to https://github.com/open-mmlab/mmdetection/pull/9533
# for more details
if img.flags.c_contiguous:
img = to_tensor(img).permute(2, 0, 1).contiguous()
else:
img = to_tensor(
np.ascontiguousarray(img.transpose(2, 0, 1)))
results['img'] = img
if 'depth' in results:
if isinstance(results['depth'], list):
# process multiple depth imgs in single frame
depth_imgs = np.stack(results['depth'], axis=0)
if depth_imgs.flags.c_contiguous:
depth_imgs = to_tensor(depth_imgs).contiguous()
else:
depth_imgs = to_tensor(np.ascontiguousarray(depth_imgs))
results['depth'] = depth_imgs
else:
depth_img = results['depth']
if len(depth_img.shape) < 3:
depth_img = np.expand_dims(depth_img, -1)
if depth_img.flags.c_contiguous:
depth_img = to_tensor(depth_img).contiguous()
else:
depth_img = to_tensor(np.ascontiguousarray(depth_img))
results['depth'] = depth_img
if 'ray_info' in results:
if isinstance(results['raydirs'], list):
raydirs = np.stack(results['raydirs'], axis=0)
if raydirs.flags.c_contiguous:
raydirs = to_tensor(raydirs).contiguous()
else:
raydirs = to_tensor(np.ascontiguousarray(raydirs))
results['raydirs'] = raydirs
if isinstance(results['lightpos'], list):
lightposes = np.stack(results['lightpos'], axis=0)
if lightposes.flags.c_contiguous:
lightposes = to_tensor(lightposes).contiguous()
else:
lightposes = to_tensor(np.ascontiguousarray(lightposes))
lightposes = lightposes.unsqueeze(1).repeat(
1, raydirs.shape[1], 1)
results['lightpos'] = lightposes
if isinstance(results['gt_images'], list):
gt_images = np.stack(results['gt_images'], axis=0)
if gt_images.flags.c_contiguous:
gt_images = to_tensor(gt_images).contiguous()
else:
gt_images = to_tensor(np.ascontiguousarray(gt_images))
results['gt_images'] = gt_images
if isinstance(results['gt_depths'],
list) and len(results['gt_depths']) != 0:
gt_depths = np.stack(results['gt_depths'], axis=0)
if gt_depths.flags.c_contiguous:
gt_depths = to_tensor(gt_depths).contiguous()
else:
gt_depths = to_tensor(np.ascontiguousarray(gt_depths))
results['gt_depths'] = gt_depths
if isinstance(results['denorm_images'], list):
denorm_imgs = np.stack(results['denorm_images'], axis=0)
if denorm_imgs.flags.c_contiguous:
denorm_imgs = to_tensor(denorm_imgs).permute(
0, 3, 1, 2).contiguous()
else:
denorm_imgs = to_tensor(
np.ascontiguousarray(
denorm_imgs.transpose(0, 3, 1, 2)))
results['denorm_images'] = denorm_imgs
for key in [
'proposals', 'gt_bboxes', 'gt_bboxes_ignore', 'gt_labels',
'gt_bboxes_labels', 'attr_labels', 'pts_instance_mask',
'pts_semantic_mask', 'centers_2d', 'depths', 'gt_labels_3d'
]:
if key not in results:
continue
if isinstance(results[key], list):
results[key] = [to_tensor(res) for res in results[key]]
else:
results[key] = to_tensor(results[key])
if 'gt_bboxes_3d' in results:
if not isinstance(results['gt_bboxes_3d'], BaseInstance3DBoxes):
results['gt_bboxes_3d'] = to_tensor(results['gt_bboxes_3d'])
if 'gt_semantic_seg' in results:
results['gt_semantic_seg'] = to_tensor(
results['gt_semantic_seg'][None])
if 'gt_seg_map' in results:
results['gt_seg_map'] = results['gt_seg_map'][None, ...]
if 'gt_images' in results:
results['gt_images'] = to_tensor(results['gt_images'])
if 'gt_depths' in results:
results['gt_depths'] = to_tensor(results['gt_depths'])
data_sample = NeRFDet3DDataSample()
gt_instances_3d = InstanceData()
gt_instances = InstanceData()
gt_pts_seg = PointData()
gt_nerf_images = InstanceData()
gt_nerf_depths = InstanceData()
data_metas = {}
for key in self.meta_keys:
if key in results:
data_metas[key] = results[key]
elif 'images' in results:
if len(results['images'].keys()) == 1:
cam_type = list(results['images'].keys())[0]
# single-view image
if key in results['images'][cam_type]:
data_metas[key] = results['images'][cam_type][key]
else:
# multi-view image
img_metas = []
cam_types = list(results['images'].keys())
for cam_type in cam_types:
if key in results['images'][cam_type]:
img_metas.append(results['images'][cam_type][key])
if len(img_metas) > 0:
data_metas[key] = img_metas
elif 'lidar_points' in results:
if key in results['lidar_points']:
data_metas[key] = results['lidar_points'][key]
data_sample.set_metainfo(data_metas)
inputs = {}
for key in self.keys:
if key in results:
# if key in self.INPUTS_KEYS:
if key in self.NERF_INPUT_KEYS:
inputs[key] = results[key]
elif key in self.NERF_3D_KEYS:
if key == 'gt_images':
gt_nerf_images[self._remove_prefix(key)] = results[key]
else:
gt_nerf_depths[self._remove_prefix(key)] = results[key]
elif key in self.INSTANCEDATA_3D_KEYS:
gt_instances_3d[self._remove_prefix(key)] = results[key]
elif key in self.INSTANCEDATA_2D_KEYS:
if key == 'gt_bboxes_labels':
gt_instances['labels'] = results[key]
else:
gt_instances[self._remove_prefix(key)] = results[key]
elif key in self.SEG_KEYS:
gt_pts_seg[self._remove_prefix(key)] = results[key]
else:
raise NotImplementedError(f'Please modify '
f'`PackNeRFDetInputs` '
f'to put {key} into its '
f'corresponding field')
data_sample.gt_instances_3d = gt_instances_3d
data_sample.gt_instances = gt_instances
data_sample.gt_pts_seg = gt_pts_seg
data_sample.gt_nerf_images = gt_nerf_images
data_sample.gt_nerf_depths = gt_nerf_depths
if 'eval_ann_info' in results:
data_sample.eval_ann_info = results['eval_ann_info']
else:
data_sample.eval_ann_info = None
packed_results = dict()
packed_results['data_samples'] = data_sample
packed_results['inputs'] = inputs
return packed_results
def __repr__(self) -> str:
"""str: Return a string that describes the module."""
repr_str = self.__class__.__name__
repr_str += f'(keys={self.keys})'
repr_str += f'(meta_keys={self.meta_keys})'
return repr_str
# Copyright (c) OpenMMLab. All rights reserved.
import mmcv
import numpy as np
from mmcv.transforms import BaseTransform, Compose
from PIL import Image
from mmdet3d.registry import TRANSFORMS
def get_dtu_raydir(pixelcoords, intrinsic, rot, dir_norm=None):
"""Compute world-frame ray directions for the given pixel coordinates.
Args:
pixelcoords: Pixel coordinates of shape H x W x 2.
intrinsic: Camera intrinsic matrix.
rot: Camera-to-world rotation matrix (c2w).
dir_norm: If truthy, normalize the ray directions to unit length.
"""
x = (pixelcoords[..., 0] + 0.5 - intrinsic[0, 2]) / intrinsic[0, 0]
y = (pixelcoords[..., 1] + 0.5 - intrinsic[1, 2]) / intrinsic[1, 1]
z = np.ones_like(x)
dirs = np.stack([x, y, z], axis=-1)
# rotate the camera-frame directions into the world frame
# (equivalent to np.sum(dirs[..., None, :] * rot, axis=-1))
dirs = dirs @ rot[:, :].T
if dir_norm:
dirs = dirs / (np.linalg.norm(dirs, axis=-1, keepdims=True) + 1e-5)
return dirs
@TRANSFORMS.register_module()
class MultiViewPipeline(BaseTransform):
"""MultiViewPipeline used in nerfdet.
Required Keys:
- depth_info
- img_prefix
- img_info
- lidar2img
- c2w
- cammrotc2w
- lightpos
- ray_info
Modified Keys:
- lidar2img
Added Keys:
- img
- denorm_images
- depth
- c2w
- camrotc2w
- lightpos
- pixels
- raydirs
- gt_images
- gt_depths
- nerf_sizes
- depth_range
Args:
transforms (list[dict]): The transform pipeline
used to process the imgs.
n_images (int): The number of sampled views.
mean (array): The mean values used in normalization.
std (array): The standard deviation values used in normalization.
margin (int): The margin value. Defaults to 10.
depth_range (array): The range of the depth.
Defaults to [0.5, 5.5].
loading (str): The mode of loading. Defaults to 'random'.
nerf_target_views (int): The number of novel views.
sample_freq (int): The frequency of sampling.
"""
def __init__(self,
transforms: dict,
n_images: int,
mean: tuple = [123.675, 116.28, 103.53],
std: tuple = [58.395, 57.12, 57.375],
margin: int = 10,
depth_range: tuple = [0.5, 5.5],
loading: str = 'random',
nerf_target_views: int = 0,
sample_freq: int = 3):
self.transforms = Compose(transforms)
self.depth_transforms = Compose(transforms[1])
self.n_images = n_images
self.mean = np.array(mean, dtype=np.float32)
self.std = np.array(std, dtype=np.float32)
self.margin = margin
self.depth_range = depth_range
self.loading = loading
self.sample_freq = sample_freq
self.nerf_target_views = nerf_target_views
def transform(self, results: dict) -> dict:
"""Nerfdet transform function.
Args:
results (dict): Result dict from loading pipeline
Returns:
dict: The result dict containing the processed results.
Updated key and value are described below.
- img (list): The loaded origin image.
- denorm_images (list): The denormalized image.
- depth (list): The origin depth image.
- c2w (list): The camera-to-world matrices.
- camrotc2w (list): The camera-to-world rotation matrices.
- lightpos (list): The camera positions (the translation part of c2w).
- pixels (list): The sampled pixel coordinate grids.
- raydirs (list): The ray directions.
- gt_images (list): The groundtruth images.
- gt_depths (list): The groundtruth depth images.
- nerf_sizes (array): The size of the groundtruth images.
- depth_range (array): The range of the depth.
Here we give a detailed explanation of some keys mentioned above.
Let P_c be a point in the camera coordinate system and P_w the same
point in the world coordinate system; they are related by
P_w = R @ P_c + T. The 'camrotc2w' mentioned above corresponds to the
rotation R, and 'lightpos' corresponds to the translation T (the camera
position in world coordinates). Putting R and T together gives the
camera-to-world extrinsic matrix, which corresponds to the 'c2w' above.
"""
imgs = []
depths = []
extrinsics = []
c2ws = []
camrotc2ws = []
lightposes = []
pixels = []
raydirs = []
gt_images = []
gt_depths = []
denorm_imgs_list = []
nerf_sizes = []
if self.loading == 'random':
ids = np.arange(len(results['img_info']))
replace = True if self.n_images > len(ids) else False
ids = np.random.choice(ids, self.n_images, replace=replace)
if self.nerf_target_views != 0:
target_id = np.random.choice(
ids, self.nerf_target_views, replace=False)
ids = np.setdiff1d(ids, target_id)
ids = ids.tolist()
target_id = target_id.tolist()
else:
ids = np.arange(len(results['img_info']))
begin_id = 0
ids = np.arange(begin_id,
begin_id + self.n_images * self.sample_freq,
self.sample_freq)
if self.nerf_target_views != 0:
target_id = ids
ratio = 0
size = (240, 320)
for i in ids:
_results = dict()
_results['img_path'] = results['img_info'][i]['filename']
_results = self.transforms(_results)
imgs.append(_results['img'])
# normalize
for key in _results.get('img_fields', ['img']):
_results[key] = mmcv.imnormalize(_results[key], self.mean,
self.std, True)
_results['img_norm_cfg'] = dict(
mean=self.mean, std=self.std, to_rgb=True)
# pad
for key in _results.get('img_fields', ['img']):
padded_img = mmcv.impad(_results[key], shape=size, pad_val=0)
_results[key] = padded_img
_results['pad_shape'] = padded_img.shape
_results['pad_fixed_size'] = size
ori_shape = _results['ori_shape']
aft_shape = _results['img_shape']
ratio = ori_shape[0] / aft_shape[0]
# prepare the depth information
if 'depth_info' in results.keys():
if '.npy' in results['depth_info'][i]['filename']:
_results['depth'] = np.load(
results['depth_info'][i]['filename'])
else:
_results['depth'] = np.asarray((Image.open(
results['depth_info'][i]['filename']))) / 1000
_results['depth'] = mmcv.imresize(
_results['depth'], (aft_shape[1], aft_shape[0]))
depths.append(_results['depth'])
denorm_img = mmcv.imdenormalize(
_results['img'], self.mean, self.std, to_bgr=True).astype(
np.uint8) / 255.0
denorm_imgs_list.append(denorm_img)
height, width = padded_img.shape[:2]
extrinsics.append(results['lidar2img']['extrinsic'][i])
# prepare the nerf information
if 'ray_info' in results.keys():
intrinsics_nerf = results['lidar2img']['intrinsic'].copy()
intrinsics_nerf[:2] = intrinsics_nerf[:2] / ratio
assert self.nerf_target_views > 0
for i in target_id:
c2ws.append(results['c2w'][i])
camrotc2ws.append(results['camrotc2w'][i])
lightposes.append(results['lightpos'][i])
px, py = np.meshgrid(
np.arange(self.margin,
width - self.margin).astype(np.float32),
np.arange(self.margin,
height - self.margin).astype(np.float32))
pixelcoords = np.stack((px, py),
axis=-1).astype(np.float32) # H x W x 2
pixels.append(pixelcoords)
raydir = get_dtu_raydir(pixelcoords, intrinsics_nerf,
results['camrotc2w'][i])
raydirs.append(np.reshape(raydir.astype(np.float32), (-1, 3)))
# read target images
temp_results = dict()
temp_results['img_path'] = results['img_info'][i]['filename']
temp_results_ = self.transforms(temp_results)
# normalize
for key in temp_results.get('img_fields', ['img']):
temp_results[key] = mmcv.imnormalize(
temp_results[key], self.mean, self.std, True)
temp_results['img_norm_cfg'] = dict(
mean=self.mean, std=self.std, to_rgb=True)
# pad
for key in temp_results.get('img_fields', ['img']):
padded_img = mmcv.impad(
temp_results[key], shape=size, pad_val=0)
temp_results[key] = padded_img
temp_results['pad_shape'] = padded_img.shape
temp_results['pad_fixed_size'] = size
# denormalize target_images.
denorm_imgs = mmcv.imdenormalize(
temp_results_['img'], self.mean, self.std,
to_bgr=True).astype(np.uint8)
gt_rgb_shape = denorm_imgs.shape
gt_image = denorm_imgs[py.astype(np.int32),
px.astype(np.int32), :]
nerf_sizes.append(np.array(gt_image.shape))
gt_image = np.reshape(gt_image, (-1, 3))
gt_images.append(gt_image / 255.0)
if 'depth_info' in results.keys():
if '.npy' in results['depth_info'][i]['filename']:
_results['depth'] = np.load(
results['depth_info'][i]['filename'])
else:
depth_image = Image.open(
results['depth_info'][i]['filename'])
_results['depth'] = np.asarray(depth_image) / 1000
_results['depth'] = mmcv.imresize(
_results['depth'],
(gt_rgb_shape[1], gt_rgb_shape[0]))
_results['depth'] = _results['depth']
gt_depth = _results['depth'][py.astype(np.int32),
px.astype(np.int32)]
gt_depths.append(gt_depth)
for key in _results.keys():
if key not in ['img', 'img_info']:
results[key] = _results[key]
results['img'] = imgs
if 'ray_info' in results.keys():
results['c2w'] = c2ws
results['camrotc2w'] = camrotc2ws
results['lightpos'] = lightposes
results['pixels'] = pixels
results['raydirs'] = raydirs
results['gt_images'] = gt_images
results['gt_depths'] = gt_depths
results['nerf_sizes'] = nerf_sizes
results['denorm_images'] = denorm_imgs_list
results['depth_range'] = np.array([self.depth_range])
if len(depths) != 0:
results['depth'] = depths
results['lidar2img']['extrinsic'] = extrinsics
return results
@TRANSFORMS.register_module()
class RandomShiftOrigin(BaseTransform):
def __init__(self, std):
self.std = std
def transform(self, results):
shift = np.random.normal(.0, self.std, 3)
results['lidar2img']['origin'] += shift
return results
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Dict, List, Optional, Tuple, Union
import torch
from mmengine.structures import InstanceData
from mmdet3d.structures import Det3DDataSample
class NeRFDet3DDataSample(Det3DDataSample):
"""A data structure interface inheirted from Det3DDataSample. Some new
attributes are added to match the NeRF-Det project.
The attributes added in ``NeRFDet3DDataSample`` are divided into two parts:
- ``gt_nerf_images`` (InstanceData): Ground truth of the images which
will be used in the NeRF branch.
- ``gt_nerf_depths`` (InstanceData): Ground truth of the depth images
which will be used in the NeRF branch if needed.
For more details and examples, please refer to the 'Det3DDataSample' file.
"""
@property
def gt_nerf_images(self) -> InstanceData:
return self._gt_nerf_images
@gt_nerf_images.setter
def gt_nerf_images(self, value: InstanceData) -> None:
self.set_field(value, '_gt_nerf_images', dtype=InstanceData)
@gt_nerf_images.deleter
def gt_nerf_images(self) -> None:
del self._gt_nerf_images
@property
def gt_nerf_depths(self) -> InstanceData:
return self._gt_nerf_depths
@gt_nerf_depths.setter
def gt_nerf_depths(self, value: InstanceData) -> None:
self.set_field(value, '_gt_nerf_depths', dtype=InstanceData)
@gt_nerf_depths.deleter
def gt_nerf_depths(self) -> None:
del self._gt_nerf_depths
SampleList = List[NeRFDet3DDataSample]
OptSampleList = Optional[SampleList]
ForwardResults = Union[Dict[str, torch.Tensor], List[NeRFDet3DDataSample],
Tuple[torch.Tensor], torch.Tensor]
# Copyright (c) OpenMMLab. All rights reserved.
import math
from typing import Callable, Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLP(nn.Module):
"""The MLP module used in NerfDet.
Args:
input_dim (int): The number of input tensor channels.
output_dim (int): The number of output tensor channels.
net_depth (int): The depth of the MLP. Defaults to 8.
net_width (int): The width of the MLP. Defaults to 256.
skip_layer (int): The layer to add skip layers to. Defaults to 4.
hidden_init (Callable): The initialize method of the hidden layers.
hidden_activation (Callable): The activation function of hidden
layers, defaults to ReLU.
output_enabled (bool): If true, the output layers will be used.
Defaults to True.
output_init (Optional): The initialize method of the output layer.
output_activation (Optional): The activation function of the output layer.
bias_enabled (bool): If True, bias will be used.
bias_init (Callable): The initialization method of the bias.
Defaults to ``nn.init.zeros_``.
"""
def __init__(
self,
input_dim: int,
output_dim: int = None,
net_depth: int = 8,
net_width: int = 256,
skip_layer: int = 4,
hidden_init: Callable = nn.init.xavier_uniform_,
hidden_activation: Callable = nn.ReLU(),
output_enabled: bool = True,
output_init: Optional[Callable] = nn.init.xavier_uniform_,
output_activation: Optional[Callable] = nn.Identity(),
bias_enabled: bool = True,
bias_init: Callable = nn.init.zeros_,
):
super().__init__()
self.input_dim = input_dim
self.output_dim = output_dim
self.net_depth = net_depth
self.net_width = net_width
self.skip_layer = skip_layer
self.hidden_init = hidden_init
self.hidden_activation = hidden_activation
self.output_enabled = output_enabled
self.output_init = output_init
self.output_activation = output_activation
self.bias_enabled = bias_enabled
self.bias_init = bias_init
self.hidden_layers = nn.ModuleList()
in_features = self.input_dim
for i in range(self.net_depth):
self.hidden_layers.append(
nn.Linear(in_features, self.net_width, bias=bias_enabled))
if (self.skip_layer is not None) and (i % self.skip_layer
== 0) and (i > 0):
in_features = self.net_width + self.input_dim
else:
in_features = self.net_width
if self.output_enabled:
self.output_layer = nn.Linear(
in_features, self.output_dim, bias=bias_enabled)
else:
self.output_dim = in_features
self.initialize()
def initialize(self):
def init_func_hidden(m):
if isinstance(m, nn.Linear):
if self.hidden_init is not None:
self.hidden_init(m.weight)
if self.bias_enabled and self.bias_init is not None:
self.bias_init(m.bias)
self.hidden_layers.apply(init_func_hidden)
if self.output_enabled:
def init_func_output(m):
if isinstance(m, nn.Linear):
if self.output_init is not None:
self.output_init(m.weight)
if self.bias_enabled and self.bias_init is not None:
self.bias_init(m.bias)
self.output_layer.apply(init_func_output)
def forward(self, x):
inputs = x
for i in range(self.net_depth):
x = self.hidden_layers[i](x)
x = self.hidden_activation(x)
if (self.skip_layer is not None) and (i % self.skip_layer
== 0) and (i > 0):
x = torch.cat([x, inputs], dim=-1)
if self.output_enabled:
x = self.output_layer(x)
x = self.output_activation(x)
return x
class DenseLayer(MLP):
def __init__(self, input_dim, output_dim, **kwargs):
super().__init__(
input_dim=input_dim,
output_dim=output_dim,
net_depth=0, # no hidden layers
**kwargs,
)
class NerfMLP(nn.Module):
"""The Nerf-MLP Module.
Args:
input_dim (int): The number of input tensor channels.
condition_dim (int): The number of condition tensor channels.
feature_dim (int): The number of feature channels. Defaults to 0.
net_depth (int): The depth of the MLP. Defaults to 8.
net_width (int): The width of the MLP. Defaults to 256.
skip_layer (int): The layer to add skip layers to. Defaults to 4.
net_depth_condition (int): The depth of the second part of MLP.
Defaults to 1.
net_width_condition (int): The width of the second part of MLP.
Defaults to 128.
"""
def __init__(
self,
input_dim: int,
condition_dim: int,
feature_dim: int = 0,
net_depth: int = 8,
net_width: int = 256,
skip_layer: int = 4,
net_depth_condition: int = 1,
net_width_condition: int = 128,
):
super().__init__()
self.base = MLP(
input_dim=input_dim + feature_dim,
net_depth=net_depth,
net_width=net_width,
skip_layer=skip_layer,
output_enabled=False,
)
hidden_features = self.base.output_dim
self.sigma_layer = DenseLayer(hidden_features, 1)
if condition_dim > 0:
self.bottleneck_layer = DenseLayer(hidden_features, net_width)
self.rgb_layer = MLP(
input_dim=net_width + condition_dim,
output_dim=3,
net_depth=net_depth_condition,
net_width=net_width_condition,
skip_layer=None,
)
else:
self.rgb_layer = DenseLayer(hidden_features, 3)
def query_density(self, x, features=None):
"""Calculate the raw sigma."""
if features is not None:
x = self.base(torch.cat([x, features], dim=-1))
else:
x = self.base(x)
raw_sigma = self.sigma_layer(x)
return raw_sigma
def forward(self, x, condition=None, features=None):
if features is not None:
x = self.base(torch.cat([x, features], dim=-1))
else:
x = self.base(x)
raw_sigma = self.sigma_layer(x)
if condition is not None:
if condition.shape[:-1] != x.shape[:-1]:
num_rays, n_dim = condition.shape
condition = condition.view(
[num_rays] + [1] * (x.dim() - condition.dim()) +
[n_dim]).expand(list(x.shape[:-1]) + [n_dim])
bottleneck = self.bottleneck_layer(x)
x = torch.cat([bottleneck, condition], dim=-1)
raw_rgb = self.rgb_layer(x)
return raw_rgb, raw_sigma
class SinusoidalEncoder(nn.Module):
"""Sinusodial Positional Encoder used in NeRF."""
def __init__(self, x_dim, min_deg, max_deg, use_identity: bool = True):
super().__init__()
self.x_dim = x_dim
self.min_deg = min_deg
self.max_deg = max_deg
self.use_identity = use_identity
self.register_buffer(
'scales', torch.tensor([2**i for i in range(min_deg, max_deg)]))
@property
def latent_dim(self) -> int:
return (int(self.use_identity) +
(self.max_deg - self.min_deg) * 2) * self.x_dim
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.max_deg == self.min_deg:
return x
xb = torch.reshape(
(x[Ellipsis, None, :] * self.scales[:, None]),
list(x.shape[:-1]) + [(self.max_deg - self.min_deg) * self.x_dim],
)
latent = torch.sin(torch.cat([xb, xb + 0.5 * math.pi], dim=-1))
if self.use_identity:
latent = torch.cat([x] + [latent], dim=-1)
return latent
class VanillaNeRF(nn.Module):
"""The Nerf-MLP with the positional encoder.
Args:
net_depth (int): The depth of the MLP. Defaults to 8.
net_width (int): The width of the MLP. Defaults to 256.
skip_layer (int): The layer to add skip layers to. Defaults to 4.
feature_dim (int): The number of feature channels. Defaults to 0.
net_depth_condition (int): The depth of the second part of MLP.
Defaults to 1.
net_width_condition (int): The width of the second part of MLP.
Defaults to 128.
"""
def __init__(self,
net_depth: int = 8,
net_width: int = 256,
skip_layer: int = 4,
feature_dim: int = 0,
net_depth_condition: int = 1,
net_width_condition: int = 128):
super().__init__()
self.posi_encoder = SinusoidalEncoder(3, 0, 10, True)
self.view_encoder = SinusoidalEncoder(3, 0, 4, True)
self.mlp = NerfMLP(
input_dim=self.posi_encoder.latent_dim,
condition_dim=self.view_encoder.latent_dim,
feature_dim=feature_dim,
net_depth=net_depth,
net_width=net_width,
skip_layer=skip_layer,
net_depth_condition=net_depth_condition,
net_width_condition=net_width_condition,
)
def query_density(self, x, features=None):
x = self.posi_encoder(x)
sigma = self.mlp.query_density(x, features)
return F.relu(sigma)
def forward(self, x, condition=None, features=None):
x = self.posi_encoder(x)
if condition is not None:
condition = self.view_encoder(condition)
rgb, sigma = self.mlp(x, condition=condition, features=features)
return torch.sigmoid(rgb), F.relu(sigma)
# Copyright (c) OpenMMLab. All rights reserved.
# Attention: This file is mainly adapted from the file with the same name
# in the original project. For more details, please refer to the
# original project.
import torch
import torch.nn.functional as F
class Projector():
def __init__(self, device='cuda'):
self.device = device
def inbound(self, pixel_locations, h, w):
"""check if the pixel locations are in valid range."""
return (pixel_locations[..., 0] <= w - 1.) & \
(pixel_locations[..., 0] >= 0) & \
(pixel_locations[..., 1] <= h - 1.) &\
(pixel_locations[..., 1] >= 0)
def normalize(self, pixel_locations, h, w):
resize_factor = torch.tensor([w - 1., h - 1.
]).to(pixel_locations.device)[None,
None, :]
normalized_pixel_locations = 2 * pixel_locations / resize_factor - 1.
return normalized_pixel_locations
def compute_projections(self, xyz, train_cameras):
"""project 3D points into cameras."""
original_shape = xyz.shape[:2]
xyz = xyz.reshape(-1, 3)
num_views = len(train_cameras)
train_intrinsics = train_cameras[:, 2:18].reshape(-1, 4, 4)
train_poses = train_cameras[:, -16:].reshape(-1, 4, 4)
xyz_h = torch.cat([xyz, torch.ones_like(xyz[..., :1])], dim=-1)
# projections = train_intrinsics.bmm(torch.inverse(train_poses))
# the poses have already been inverted in the dataloader, so there is
# no need to invert them here.
projections = train_intrinsics.bmm(train_poses) \
.bmm(xyz_h.t()[None, ...].repeat(num_views, 1, 1))
projections = projections.permute(0, 2, 1)
pixel_locations = projections[..., :2] / torch.clamp(
projections[..., 2:3], min=1e-8)
pixel_locations = torch.clamp(pixel_locations, min=-1e6, max=1e6)
mask = projections[..., 2] > 0
return pixel_locations.reshape((num_views, ) + original_shape + (2, )), \
mask.reshape((num_views, ) + original_shape) # noqa
def compute_angle(self, xyz, query_camera, train_cameras):
original_shape = xyz.shape[:2]
xyz = xyz.reshape(-1, 3)
train_poses = train_cameras[:, -16:].reshape(-1, 4, 4)
num_views = len(train_poses)
query_pose = query_camera[-16:].reshape(-1, 4,
4).repeat(num_views, 1, 1)
ray2tar_pose = (query_pose[:, :3, 3].unsqueeze(1) - xyz.unsqueeze(0))
ray2tar_pose /= (torch.norm(ray2tar_pose, dim=-1, keepdim=True) + 1e-6)
ray2train_pose = (
train_poses[:, :3, 3].unsqueeze(1) - xyz.unsqueeze(0))
ray2train_pose /= (
torch.norm(ray2train_pose, dim=-1, keepdim=True) + 1e-6)
ray_diff = ray2tar_pose - ray2train_pose
ray_diff_norm = torch.norm(ray_diff, dim=-1, keepdim=True)
ray_diff_dot = torch.sum(
ray2tar_pose * ray2train_pose, dim=-1, keepdim=True)
ray_diff_direction = ray_diff / torch.clamp(ray_diff_norm, min=1e-6)
ray_diff = torch.cat([ray_diff_direction, ray_diff_dot], dim=-1)
ray_diff = ray_diff.reshape((num_views, ) + original_shape + (4, ))
return ray_diff
def compute(self,
xyz,
train_imgs,
train_cameras,
featmaps=None,
grid_sample=True):
assert (train_imgs.shape[0] == 1) \
and (train_cameras.shape[0] == 1)
# only support batch_size=1 for now
train_imgs = train_imgs.squeeze(0)
train_cameras = train_cameras.squeeze(0)
train_imgs = train_imgs.permute(0, 3, 1, 2)
h, w = train_cameras[0][:2]
# compute the projection of the query points to each reference image
pixel_locations, mask_in_front = self.compute_projections(
xyz, train_cameras)
normalized_pixel_locations = self.normalize(pixel_locations, h, w)
# rgb sampling
rgbs_sampled = F.grid_sample(
train_imgs, normalized_pixel_locations, align_corners=True)
rgb_sampled = rgbs_sampled.permute(2, 3, 0, 1)
# deep feature sampling
if featmaps is not None:
if grid_sample:
feat_sampled = F.grid_sample(
featmaps, normalized_pixel_locations, align_corners=True)
feat_sampled = feat_sampled.permute(
2, 3, 0, 1) # [n_rays, n_samples, n_views, d]
rgb_feat_sampled = torch.cat(
[rgb_sampled, feat_sampled],
dim=-1) # [n_rays, n_samples, n_views, d+3]
# rgb_feat_sampled = feat_sampled
else:
n_images, n_channels, f_h, f_w = featmaps.shape
resize_factor = torch.tensor([f_w / w - 1., f_h / h - 1.]).to(
pixel_locations.device)[None, None, :]
sample_location = (pixel_locations *
resize_factor).round().long()
n_images, n_ray, n_sample, _ = sample_location.shape
sample_x = sample_location[..., 0].view(n_images, -1)
sample_y = sample_location[..., 1].view(n_images, -1)
valid = (sample_x >= 0) & (sample_y >=
0) & (sample_x < f_w) & (
sample_y < f_h)
valid = valid * mask_in_front.view(n_images, -1)
feat_sampled = torch.zeros(
(n_images, n_channels, sample_x.shape[-1]),
device=featmaps.device)
for i in range(n_images):
# gather features at the integer (y, x) locations of the valid samples
feat_sampled[i, :, valid[i]] = featmaps[i, :, sample_y[i, valid[i]], sample_x[i, valid[i]]]
feat_sampled = feat_sampled.view(n_images, n_channels, n_ray,
n_sample)
rgb_feat_sampled = feat_sampled.permute(2, 3, 0, 1)
else:
rgb_feat_sampled = None
inbound = self.inbound(pixel_locations, h, w)
mask = (inbound * mask_in_front).float().permute(
1, 2, 0)[..., None] # [n_rays, n_samples, n_views, 1]
return rgb_feat_sampled, mask
# Copyright (c) OpenMMLab. All rights reserved.
# Attention: This file is mainly adapted from the file with the same name
# in the original project. For more details, please refer to the
# original project.
from collections import OrderedDict
import numpy as np
import torch
import torch.nn.functional as F
rng = np.random.RandomState(234)
# helper functions for nerf ray rendering
def volume_sampling(sample_pts, features, aabb):
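# Trilinearly sample the voxel feature volume at the given 3D points: the
# points are normalized into the aabb to [-1, 1] for F.grid_sample, and a
# mask marks which sampled points actually fall inside the aabb.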
B, C, D, W, H = features.shape
assert B == 1
aabb = torch.Tensor(aabb).to(sample_pts.device)
N_rays, N_samples, coords = sample_pts.shape
sample_pts = sample_pts.view(1, N_rays * N_samples, 1, 1,
3).repeat(B, 1, 1, 1, 1)
aabbSize = aabb[1] - aabb[0]
invgridSize = 1.0 / aabbSize * 2
norm_pts = (sample_pts - aabb[0]) * invgridSize - 1
sample_features = F.grid_sample(
features, norm_pts, align_corners=True, padding_mode='border')
masks = ((norm_pts < 1) & (norm_pts > -1)).float().sum(dim=-1)
masks = (masks.view(N_rays, N_samples) == 3)
return sample_features.view(C, N_rays,
N_samples).permute(1, 2, 0).contiguous(), masks
def _compute_projection(img_meta):
views = len(img_meta['lidar2img']['extrinsic'])
intrinsic = torch.tensor(img_meta['lidar2img']['intrinsic'][:4, :4])
ratio = img_meta['ori_shape'][0] / img_meta['img_shape'][0]
intrinsic[:2] /= ratio
intrinsic = intrinsic.unsqueeze(0).view(1, 16).repeat(views, 1)
img_size = torch.Tensor(img_meta['img_shape'][:2]).to(intrinsic.device)
img_size = img_size.unsqueeze(0).repeat(views, 1)
extrinsics = []
for v in range(views):
extrinsics.append(
torch.Tensor(img_meta['lidar2img']['extrinsic'][v]).to(
intrinsic.device))
extrinsic = torch.stack(extrinsics).view(views, 16)
train_cameras = torch.cat([img_size, intrinsic, extrinsic], dim=-1)
return train_cameras.unsqueeze(0)
def compute_mask_points(feature, mask):
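# Masked statistics across views: `mean` is the visibility-weighted mean of
# the per-view features, and the variance of the features is mapped through
# exp(-var), so points whose views agree closely yield values near 1.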
weight = mask / (torch.sum(mask, dim=2, keepdim=True) + 1e-8)
mean = torch.sum(feature * weight, dim=2, keepdim=True)
var = torch.sum((feature - mean)**2, dim=2, keepdim=True)
var = var / (torch.sum(mask, dim=2, keepdim=True) + 1e-8)
var = torch.exp(-var)
return mean, var
def sample_pdf(bins, weights, N_samples, det=False):
"""Helper function used for sampling.
Args:
bins (tensor): Tensor of shape [N_rays, M+1], where M is the number of bins.
weights (tensor): Tensor of shape [N_rays, M].
N_samples (int): Number of samples along each ray.
det (bool): If True, perform deterministic (evenly spaced) sampling.
Returns:
samples (tensor): Tensor of shape [N_rays, N_samples].
"""
M = weights.shape[1]
weights += 1e-5
# Get pdf
pdf = weights / torch.sum(weights, dim=-1, keepdim=True)
cdf = torch.cumsum(pdf, dim=-1)
cdf = torch.cat([torch.zeros_like(cdf[:, 0:1]), cdf], dim=-1)
# Take uniform samples
if det:
u = torch.linspace(0., 1., N_samples, device=bins.device)
u = u.unsqueeze(0).repeat(bins.shape[0], 1)
else:
u = torch.rand(bins.shape[0], N_samples, device=bins.device)
# Invert CDF
above_inds = torch.zeros_like(u, dtype=torch.long)
for i in range(M):
above_inds += (u >= cdf[:, i:i + 1]).long()
# random sample inside each bin
below_inds = torch.clamp(above_inds - 1, min=0)
inds_g = torch.stack((below_inds, above_inds), dim=2)
cdf = cdf.unsqueeze(1).repeat(1, N_samples, 1)
cdf_g = torch.gather(input=cdf, dim=-1, index=inds_g)
bins = bins.unsqueeze(1).repeat(1, N_samples, 1)
bins_g = torch.gather(input=bins, dim=-1, index=inds_g)
denom = cdf_g[:, :, 1] - cdf_g[:, :, 0]
denom = torch.where(denom < 1e-5, torch.ones_like(denom), denom)
t = (u - cdf_g[:, :, 0]) / denom
samples = bins_g[:, :, 0] + t * (bins_g[:, :, 1] - bins_g[:, :, 0])
return samples
def sample_along_camera_ray(ray_o,
ray_d,
depth_range,
N_samples,
inv_uniform=False,
det=False):
"""Sampling along the camera ray.
Args:
ray_o (tensor): Origin of the ray in scene coordinate system;
tensor of shape [N_rays, 3]
ray_d (tensor): Homogeneous ray direction vectors in
scene coordinate system; tensor of shape [N_rays, 3]
depth_range (tuple): [near_depth, far_depth]
inv_uniform (bool): If True, sample uniformly in inverse depth.
det (bool): If True, will perform deterministic sampling.
Returns:
pts (tensor): Tensor of shape [N_rays, N_samples, 3]
z_vals (tensor): Tensor of shape [N_rays, N_samples]
"""
# will sample inside [near_depth, far_depth]
# assume the nearest possible depth is at least (min_ratio * depth)
near_depth_value = depth_range[0]
far_depth_value = depth_range[1]
assert near_depth_value > 0 and far_depth_value > 0 \
and far_depth_value > near_depth_value
near_depth = near_depth_value * torch.ones_like(ray_d[..., 0])
far_depth = far_depth_value * torch.ones_like(ray_d[..., 0])
if inv_uniform:
start = 1. / near_depth
step = (1. / far_depth - start) / (N_samples - 1)
inv_z_vals = torch.stack([start + i * step for i in range(N_samples)],
dim=1)
z_vals = 1. / inv_z_vals
else:
start = near_depth
step = (far_depth - near_depth) / (N_samples - 1)
z_vals = torch.stack([start + i * step for i in range(N_samples)],
dim=1)
if not det:
# get intervals between samples
mids = .5 * (z_vals[:, 1:] + z_vals[:, :-1])
upper = torch.cat([mids, z_vals[:, -1:]], dim=-1)
lower = torch.cat([z_vals[:, 0:1], mids], dim=-1)
# uniform samples in those intervals
t_rand = torch.rand_like(z_vals)
z_vals = lower + (upper - lower) * t_rand
ray_d = ray_d.unsqueeze(1).repeat(1, N_samples, 1)
ray_o = ray_o.unsqueeze(1).repeat(1, N_samples, 1)
pts = z_vals.unsqueeze(2) * ray_d + ray_o # [N_rays, N_samples, 3]
return pts, z_vals
# ray rendering of nerf
def raw2outputs(raw, z_vals, mask, white_bkgd=False):
"""Transform raw data to outputs:
Args:
raw(tensor):Raw network output.Tensor of shape [N_rays, N_samples, 4]
z_vals(tensor):Depth of point samples along rays.
Tensor of shape [N_rays, N_samples]
ray_d(tensor):[N_rays, 3]
Returns:
ret(dict):
-rgb(tensor):[N_rays, 3]
-depth(tensor):[N_rays,]
-weights(tensor):[N_rays,]
-depth_std(tensor):[N_rays,]
"""
rgb = raw[:, :, :3] # [N_rays, N_samples, 3]
sigma = raw[:, :, 3] # [N_rays, N_samples]
# note: we do not use the sample intervals here,
# because in practice different scenes from COLMAP can have
# very different scales, and using the intervals can hurt
# the model's generalization ability.
# Therefore the intervals are ignored in both training and evaluation.
sigma2alpha = lambda sigma, dists: 1. - torch.exp(-sigma) # noqa
# point samples are ordered with increasing depth
# interval between samples
dists = z_vals[:, 1:] - z_vals[:, :-1]
dists = torch.cat((dists, dists[:, -1:]), dim=-1)
alpha = sigma2alpha(sigma, dists)
T = torch.cumprod(1. - alpha + 1e-10, dim=-1)[:, :-1]
T = torch.cat((torch.ones_like(T[:, 0:1]), T), dim=-1)
# the weights, and the sum of the weights along a ray,
# always lie inside [0, 1]
weights = alpha * T
rgb_map = torch.sum(weights.unsqueeze(2) * rgb, dim=1)
if white_bkgd:
rgb_map = rgb_map + (1. - torch.sum(weights, dim=-1, keepdim=True))
if mask is not None:
mask = mask.float().sum(dim=1) > 8
depth_map = torch.sum(
weights * z_vals, dim=-1) / (
torch.sum(weights, dim=-1) + 1e-8)
depth_map = torch.clamp(depth_map, z_vals.min(), z_vals.max())
ret = OrderedDict([('rgb', rgb_map), ('depth', depth_map),
('weights', weights), ('mask', mask), ('alpha', alpha),
('z_vals', z_vals), ('transparency', T)])
return ret
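# For reference, raw2outputs implements the standard discrete volume-rendering
# rule (with the per-sample interval deliberately dropped, as noted above):
#     alpha_i = 1 - exp(-sigma_i)
#     T_i     = prod_{j < i} (1 - alpha_j)
#     w_i     = alpha_i * T_i
#     rgb     = sum_i w_i * c_i
#     depth   = sum_i w_i * z_i / (sum_i w_i + eps)
# e.g. two samples with alpha = (0.5, 0.5) give weights (0.5, 0.25) and an
# accumulated opacity of 0.75.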
def render_rays_func(
ray_o,
ray_d,
mean_volume,
cov_volume,
features_2D,
img,
aabb,
near_far_range,
N_samples,
N_rand=4096,
nerf_mlp=None,
img_meta=None,
projector=None,
mode='volume', # volume and image
nerf_sample_view=3,
inv_uniform=False,
N_importance=0,
det=False,
is_train=True,
white_bkgd=False,
gt_rgb=None,
gt_depth=None):
ret = {
'outputs_coarse': None,
'outputs_fine': None,
'gt_rgb': gt_rgb,
'gt_depth': gt_depth
}
# pts: [N_rays, N_samples, 3]
# z_vals: [N_rays, N_samples]
pts, z_vals = sample_along_camera_ray(
ray_o=ray_o,
ray_d=ray_d,
depth_range=near_far_range,
N_samples=N_samples,
inv_uniform=inv_uniform,
det=det)
N_rays, N_samples = pts.shape[:2]
if mode == 'image':
img = img.permute(0, 2, 3, 1).unsqueeze(0)
train_camera = _compute_projection(img_meta).to(img.device)
rgb_feat, mask = projector.compute(
pts, img, train_camera, features_2D, grid_sample=True)
pixel_mask = mask[..., 0].sum(dim=2) > 1
mean, var = compute_mask_points(rgb_feat, mask)
globalfeat = torch.cat([mean, var], dim=-1).squeeze(2)
rgb_pts, density_pts = nerf_mlp(pts, ray_d, globalfeat)
raw_coarse = torch.cat([rgb_pts, density_pts], dim=-1)
ret['sigma'] = density_pts
elif mode == 'volume':
mean_pts, inbound_masks = volume_sampling(pts, mean_volume, aabb)
cov_pts, inbound_masks = volume_sampling(pts, cov_volume, aabb)
# This mask indicates which points lie outside of the aabb
img = img.permute(0, 2, 3, 1).unsqueeze(0)
train_camera = _compute_projection(img_meta).to(img.device)
_, view_mask = projector.compute(pts, img, train_camera, None)
pixel_mask = view_mask[..., 0].sum(dim=2) > 1
# plot_3D_vis(pts, aabb, img, train_camera)
# pixel_mask: [N_rays, N_samples]; a point is kept only if it has
# at least 2 projected observations
globalpts = torch.cat([mean_pts, cov_pts], dim=-1)
rgb_pts, density_pts = nerf_mlp(pts, ray_d, globalpts)
density_pts = density_pts * inbound_masks.unsqueeze(dim=-1)
raw_coarse = torch.cat([rgb_pts, density_pts], dim=-1)
outputs_coarse = raw2outputs(
raw_coarse, z_vals, pixel_mask, white_bkgd=white_bkgd)
ret['outputs_coarse'] = outputs_coarse
return ret
def render_rays(
ray_batch,
mean_volume,
cov_volume,
features_2D,
img,
aabb,
near_far_range,
N_samples,
N_rand=4096,
nerf_mlp=None,
img_meta=None,
projector=None,
mode='volume', # volume and image
nerf_sample_view=3,
inv_uniform=False,
N_importance=0,
det=False,
is_train=True,
white_bkgd=False,
render_testing=False):
"""The function of the nerf rendering."""
ray_o = ray_batch['ray_o']
ray_d = ray_batch['ray_d']
gt_rgb = ray_batch['gt_rgb']
gt_depth = ray_batch['gt_depth']
nerf_sizes = ray_batch['nerf_sizes']
if is_train:
ray_o = ray_o.view(-1, 3)
ray_d = ray_d.view(-1, 3)
gt_rgb = gt_rgb.view(-1, 3)
if gt_depth.shape[1] != 0:
gt_depth = gt_depth.view(-1, 1)
non_zero_depth = (gt_depth > 0).squeeze(-1)
ray_o = ray_o[non_zero_depth]
ray_d = ray_d[non_zero_depth]
gt_rgb = gt_rgb[non_zero_depth]
gt_depth = gt_depth[non_zero_depth]
else:
gt_depth = None
total_rays = ray_d.shape[0]
select_inds = rng.choice(total_rays, size=(N_rand, ), replace=False)
ray_o = ray_o[select_inds]
ray_d = ray_d[select_inds]
gt_rgb = gt_rgb[select_inds]
if gt_depth is not None:
gt_depth = gt_depth[select_inds]
rets = render_rays_func(
ray_o,
ray_d,
mean_volume,
cov_volume,
features_2D,
img,
aabb,
near_far_range,
N_samples,
N_rand,
nerf_mlp,
img_meta,
projector,
mode, # volume and image
nerf_sample_view,
inv_uniform,
N_importance,
det,
is_train,
white_bkgd,
gt_rgb,
gt_depth)
elif render_testing:
nerf_size = nerf_sizes[0]
view_num = ray_o.shape[1]
H = nerf_size[0][0]
W = nerf_size[0][1]
ray_o = ray_o.view(-1, 3)
ray_d = ray_d.view(-1, 3)
gt_rgb = gt_rgb.view(-1, 3)
if len(gt_depth) != 0:
gt_depth = gt_depth.view(-1, 1)
else:
gt_depth = None
assert view_num * H * W == ray_o.shape[0]
num_rays = ray_o.shape[0]
results = []
rgbs = []
for i in range(0, num_rays, N_rand):
ray_o_chunck = ray_o[i:i + N_rand, :]
ray_d_chunck = ray_d[i:i + N_rand, :]
ret = render_rays_func(ray_o_chunck, ray_d_chunck, mean_volume,
cov_volume, features_2D, img, aabb,
near_far_range, N_samples, N_rand, nerf_mlp,
img_meta, projector, mode, nerf_sample_view,
inv_uniform, N_importance, True, is_train,
white_bkgd, gt_rgb, gt_depth)
results.append(ret)
rgbs = []
depths = []
if results[0]['outputs_coarse'] is not None:
for i in range(len(results)):
rgb = results[i]['outputs_coarse']['rgb']
rgbs.append(rgb)
depth = results[i]['outputs_coarse']['depth']
depths.append(depth)
rets = {
'outputs_coarse': {
'rgb': torch.cat(rgbs, dim=0).view(view_num, H, W, 3),
'depth': torch.cat(depths, dim=0).view(view_num, H, W, 1),
},
'gt_rgb':
gt_rgb.view(view_num, H, W, 3),
'gt_depth':
gt_depth.view(view_num, H, W, 1) if gt_depth is not None else None,
}
else:
rets = None
return rets
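# A minimal sketch of the ray_batch consumed by render_rays (the key names are
# taken from the code above; the exact tensor layouts are an assumption):
#   ray_batch = dict(
#       ray_o=...,       # ray origins, flattened to [num_rays, 3]
#       ray_d=...,       # ray directions, flattened to [num_rays, 3]
#       gt_rgb=...,      # RGB targets matching the rays
#       gt_depth=...,    # optional depth targets; may be empty
#       nerf_sizes=...)  # rendered image sizes, used to reshape test outputs
# During training, N_rand rays with valid depth are randomly selected; during
# testing, all view_num * H * W rays are rendered in chunks of N_rand.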
# Copyright (c) OpenMMLab. All rights reserved.
import os
import cv2
import numpy as np
import torch
from skimage.metrics import structural_similarity
def compute_psnr_from_mse(mse):
return -10.0 * torch.log(mse) / np.log(10.0)
def compute_psnr(pred, target, mask=None):
"""Compute psnr value (we assume the maximum pixel value is 1)."""
if mask is not None:
pred, target = pred[mask], target[mask]
mse = ((pred - target)**2).mean()
return compute_psnr_from_mse(mse).cpu().numpy()
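# Since pixel values are assumed to lie in [0, 1], PSNR = -10 * log10(MSE);
# for example, an MSE of 1e-2 corresponds to 20 dB and 1e-4 to 40 dB.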
def compute_ssim(pred, target, mask=None):
"""Computes Masked SSIM following the neuralbody paper."""
assert pred.shape == target.shape and pred.shape[-1] == 3
if mask is not None:
x, y, w, h = cv2.boundingRect(mask.cpu().numpy().astype(np.uint8))
pred = pred[y:y + h, x:x + w]
target = target[y:y + h, x:x + w]
try:
ssim = structural_similarity(
pred.cpu().numpy(), target.cpu().numpy(), channel_axis=-1)
except ValueError:
ssim = structural_similarity(
pred.cpu().numpy(), target.cpu().numpy(), multichannel=True)
return ssim
def save_rendered_img(img_meta, rendered_results):
filename = img_meta[0]['filename']
scenes = filename.split('/')[-2]
for ret in rendered_results:
depth = ret['outputs_coarse']['depth']
rgb = ret['outputs_coarse']['rgb']
gt = ret['gt_rgb']
gt_depth = ret['gt_depth']
# save images
psnr_total = 0
ssim_total = 0
rmse = 0
for v in range(gt.shape[0]):
rmse += ((depth[v] - gt_depth[v])**2).cpu().numpy()
depth_ = ((depth[v] - depth[v].min()) /
(depth[v].max() - depth[v].min() + 1e-8)).repeat(1, 1, 3)
img_to_save = torch.cat([rgb[v], gt[v], depth_], dim=1)
image_path = os.path.join('nerf_vs_rebuttal', scenes)
if not os.path.exists(image_path):
os.makedirs(image_path)
save_dir = os.path.join(image_path, 'view_' + str(v) + '.png')
font = cv2.FONT_HERSHEY_SIMPLEX
org = (50, 50)
fontScale = 1
color = (255, 0, 0)
thickness = 2
image = np.uint8(img_to_save.cpu().numpy() * 255.0)
psnr = compute_psnr(rgb[v], gt[v], mask=None)
psnr_total += psnr
ssim = compute_ssim(rgb[v], gt[v], mask=None)
ssim_total += ssim
image = cv2.putText(image, 'PSNR: ' + '%.2f' % psnr, org, font,
fontScale, color, thickness, cv2.LINE_AA)
cv2.imwrite(save_dir, image)
return psnr_total / gt.shape[0], ssim_total / gt.shape[0], rmse / gt.shape[0]
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from os import path as osp
from typing import Callable, List, Optional, Union
import numpy as np
from mmdet3d.datasets import Det3DDataset
from mmdet3d.registry import DATASETS
from mmdet3d.structures import DepthInstance3DBoxes
@DATASETS.register_module()
class MultiViewScanNetDataset(Det3DDataset):
r"""Multi-View ScanNet Dataset for NeRF-detection Task
This class serves as the API for experiments on the ScanNet Dataset.
Please refer to the `github repo <https://github.com/ScanNet/ScanNet>`_
for data downloading.
Args:
data_root (str): Path of dataset root.
ann_file (str): Path of annotation file.
metainfo (dict, optional): Meta information for dataset, such as class
information. Defaults to None.
pipeline (List[dict]): Pipeline used for data processing.
Defaults to [].
modality (dict): Modality to specify the sensor data used as input.
Defaults to dict(use_camera=True, use_lidar=False).
box_type_3d (str): Type of 3D box of this dataset.
Based on the `box_type_3d`, the dataset will encapsulate the box
in its original format and then convert it to `box_type_3d`.
Defaults to 'Depth' in this dataset. Available options include:
- 'LiDAR': Box in LiDAR coordinates.
- 'Depth': Box in depth coordinates, usually for indoor dataset.
- 'Camera': Box in camera coordinates.
filter_empty_gt (bool): Whether to filter the data with empty GT.
If it's set to be True, the example with empty annotations after
data pipeline will be dropped and a random example will be chosen
in `__getitem__`. Defaults to True.
test_mode (bool): Whether the dataset is in test mode.
Defaults to False.
"""
METAINFO = {
'classes':
('cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window',
'bookshelf', 'picture', 'counter', 'desk', 'curtain', 'refrigerator',
'showercurtrain', 'toilet', 'sink', 'bathtub', 'garbagebin')
}
def __init__(self,
data_root: str,
ann_file: str,
metainfo: Optional[dict] = None,
pipeline: List[Union[dict, Callable]] = [],
modality: dict = dict(use_camera=True, use_lidar=False),
box_type_3d: str = 'Depth',
filter_empty_gt: bool = True,
remove_dontcare: bool = False,
test_mode: bool = False,
**kwargs) -> None:
self.remove_dontcare = remove_dontcare
super().__init__(
data_root=data_root,
ann_file=ann_file,
metainfo=metainfo,
pipeline=pipeline,
modality=modality,
box_type_3d=box_type_3d,
filter_empty_gt=filter_empty_gt,
test_mode=test_mode,
**kwargs)
assert 'use_camera' in self.modality and \
'use_lidar' in self.modality
assert self.modality['use_camera'] or self.modality['use_lidar']
@staticmethod
def _get_axis_align_matrix(info: dict) -> np.ndarray:
"""Get axis_align_matrix from info. If not exist, return identity mat.
Args:
info (dict): Info of a single sample data.
Returns:
np.ndarray: 4x4 transformation matrix.
"""
if 'axis_align_matrix' in info:
return np.array(info['axis_align_matrix'])
else:
warnings.warn(
'axis_align_matrix is not found in ScanNet data info, please '
'use the new pre-processing scripts to re-generate ScanNet data')
return np.eye(4).astype(np.float32)
def parse_data_info(self, info: dict) -> dict:
"""Process the raw data info.
Convert all relative path of needed modality data file to
the absolute path.
Args:
info (dict): Raw info dict.
Returns:
dict: Has `ann_info` in training stage. And
all path has been converted to absolute path.
"""
if self.modality['use_depth']:
info['depth_info'] = []
if self.modality['use_neuralrecon_depth']:
info['depth_info'] = []
if self.modality['use_lidar']:
# implement lidar processing in the future
raise NotImplementedError(
'Please modify `MultiViewPipeline` to support lidar processing')
info['axis_align_matrix'] = self._get_axis_align_matrix(info)
info['img_info'] = []
info['lidar2img'] = []
info['c2w'] = []
info['camrotc2w'] = []
info['lightpos'] = []
# load img and depth_img
for i in range(len(info['img_paths'])):
img_filename = osp.join(self.data_root, info['img_paths'][i])
info['img_info'].append(dict(filename=img_filename))
if 'depth_info' in info.keys():
if self.modality['use_neuralrecon_depth']:
info['depth_info'].append(
dict(filename=img_filename[:-4] + '.npy'))
else:
info['depth_info'].append(
dict(filename=img_filename[:-4] + '.png'))
# lidar_info in info.keys() may be implemented in the future.
extrinsic = np.linalg.inv(
info['axis_align_matrix'] @ info['lidar2cam'][i])
info['lidar2img'].append(extrinsic.astype(np.float32))
if self.modality['use_ray']:
c2w = (
info['axis_align_matrix'] @ info['lidar2cam'][i]).astype(
np.float32) # noqa
info['c2w'].append(c2w)
info['camrotc2w'].append(c2w[0:3, 0:3])
info['lightpos'].append(c2w[0:3, 3])
origin = np.array([.0, .0, .5])
info['lidar2img'] = dict(
extrinsic=info['lidar2img'],
intrinsic=info['cam2img'].astype(np.float32),
origin=origin.astype(np.float32))
if self.modality['use_ray']:
info['ray_info'] = []
if not self.test_mode:
info['ann_info'] = self.parse_ann_info(info)
if self.test_mode and self.load_eval_anns:
info['ann_info'] = self.parse_ann_info(info)
info['eval_ann_info'] = self._remove_dontcare(info['ann_info'])
return info
def parse_ann_info(self, info: dict) -> dict:
"""Process the `instances` in data info to `ann_info`.
Args:
info (dict): Info dict.
Returns:
dict: Processed `ann_info`.
"""
ann_info = super().parse_ann_info(info)
if self.remove_dontcare:
ann_info = self._remove_dontcare(ann_info)
# empty gt
if ann_info is None:
ann_info = dict()
ann_info['gt_bboxes_3d'] = np.zeros((0, 6), dtype=np.float32)
ann_info['gt_labels_3d'] = np.zeros((0, ), dtype=np.int64)
ann_info['gt_bboxes_3d'] = DepthInstance3DBoxes(
ann_info['gt_bboxes_3d'],
box_dim=ann_info['gt_bboxes_3d'].shape[-1],
with_yaw=False,
origin=(0.5, 0.5, 0.5)).convert_to(self.box_mode_3d)
# count the number of instances per category
for label in ann_info['gt_labels_3d']:
if label != -1:
cat_name = self.metainfo['classes'][label]
self.num_ins_per_cat[cat_name] += 1
return ann_info
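# A minimal, hypothetical config sketch for this dataset (paths and the
# pipeline are placeholders, not the project's actual config). Note that
# parse_data_info above also reads `use_depth`, `use_neuralrecon_depth` and
# `use_ray` from the modality dict, so configs need to provide them:
#   train_dataset = dict(
#       type='MultiViewScanNetDataset',
#       data_root='data/scannet/',
#       ann_file='scannet_infos_train.pkl',
#       pipeline=train_pipeline,
#       modality=dict(
#           use_camera=True,
#           use_lidar=False,
#           use_depth=False,
#           use_neuralrecon_depth=False,
#           use_ray=True),
#       box_type_3d='Depth',
#       filter_empty_gt=True,
#       test_mode=False)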
# Copyright (c) OpenMMLab. All rights reserved.
"""Prepare the dataset for NeRF-Det.
Example:
python projects/NeRF-Det/prepare_infos.py
--root-path ./data/scannet
--out-dir ./data/scannet
"""
import argparse
import time
from os import path as osp
from pathlib import Path
import mmengine
from ...tools.dataset_converters import indoor_converter as indoor
from ...tools.dataset_converters.update_infos_to_v2 import (
clear_data_info_unused_keys, clear_instance_unused_keys,
get_empty_instance, get_empty_standard_data_info)
def update_scannet_infos_nerfdet(pkl_path, out_dir):
"""Update the origin pkl to the new format which will be used in nerf-det.
Args:
pkl_path (str): Path of the origin pkl.
out_dir (str): Output directory of the generated info file.
Returns:
The pkl will be overwritTen.
The new pkl is a dict containing two keys:
metainfo: Some base information of the pkl
data_list (list): A list containing all the information of the scenes.
"""
print('The new refactored process is running.')
print(f'{pkl_path} will be modified.')
if out_dir in pkl_path:
print(f'Warning: you may be overwriting '
f'the original data {pkl_path}.')
time.sleep(5)
METAINFO = {
'classes':
('cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window',
'bookshelf', 'picture', 'counter', 'desk', 'curtain', 'refrigerator',
'showercurtrain', 'toilet', 'sink', 'bathtub', 'garbagebin')
}
print(f'Reading from input file: {pkl_path}.')
data_list = mmengine.load(pkl_path)
print('Start updating:')
converted_list = []
# collect ignored class names across all scenes
ignore_class_name = set()
for ori_info_dict in mmengine.track_iter_progress(data_list):
temp_data_info = get_empty_standard_data_info()
# intrinsics, extrinsics and imgs
temp_data_info['cam2img'] = ori_info_dict['intrinsics']
temp_data_info['lidar2cam'] = ori_info_dict['extrinsics']
temp_data_info['img_paths'] = ori_info_dict['img_paths']
# annotation information
anns = ori_info_dict.get('annos', None)
if anns is not None:
temp_data_info['axis_align_matrix'] = anns[
'axis_align_matrix'].tolist()
if anns['gt_num'] == 0:
instance_list = []
else:
num_instances = len(anns['name'])
instance_list = []
for instance_id in range(num_instances):
empty_instance = get_empty_instance()
empty_instance['bbox_3d'] = anns['gt_boxes_upright_depth'][
instance_id].tolist()
if anns['name'][instance_id] in METAINFO['classes']:
empty_instance['bbox_label_3d'] = METAINFO[
'classes'].index(anns['name'][instance_id])
else:
ignore_class_name.add(anns['name'][instance_id])
empty_instance['bbox_label_3d'] = -1
empty_instance = clear_instance_unused_keys(empty_instance)
instance_list.append(empty_instance)
temp_data_info['instances'] = instance_list
temp_data_info, _ = clear_data_info_unused_keys(temp_data_info)
converted_list.append(temp_data_info)
pkl_name = Path(pkl_path).name
out_path = osp.join(out_dir, pkl_name)
print(f'Writing to output file: {out_path}.')
print(f'ignored classes: {ignore_class_name}')
# dataset metainfo
metainfo = dict()
metainfo['categories'] = {k: i for i, k in enumerate(METAINFO['classes'])}
if ignore_class_name:
for ignore_class in ignore_class_name:
metainfo['categories'][ignore_class] = -1
metainfo['dataset'] = 'scannet'
metainfo['info_version'] = '1.1'
converted_data_info = dict(metainfo=metainfo, data_list=converted_list)
mmengine.dump(converted_data_info, out_path, 'pkl')
def scannet_data_prep(root_path, info_prefix, out_dir, workers):
"""Prepare the info file for scannet dataset.
Args:
root_path (str): Path of dataset root.
info_prefix (str): The prefix of info filenames.
out_dir (str): Output directory of the generated info file.
workers (int): Number of threads to be used.
"""
indoor.create_indoor_info_file(
root_path, info_prefix, out_dir, workers=workers)
info_train_path = osp.join(out_dir, f'{info_prefix}_infos_train.pkl')
info_val_path = osp.join(out_dir, f'{info_prefix}_infos_val.pkl')
info_test_path = osp.join(out_dir, f'{info_prefix}_infos_test.pkl')
update_scannet_infos_nerfdet(out_dir=out_dir, pkl_path=info_train_path)
update_scannet_infos_nerfdet(out_dir=out_dir, pkl_path=info_val_path)
update_scannet_infos_nerfdet(out_dir=out_dir, pkl_path=info_test_path)
parser = argparse.ArgumentParser(description='Data converter arg parser')
parser.add_argument(
'--root-path',
type=str,
default='./data/scannet',
help='specify the root path of dataset')
parser.add_argument(
'--out-dir',
type=str,
default='./data/scannet',
required=False,
help='output directory of the generated info files')
parser.add_argument('--extra-tag', type=str, default='scannet')
parser.add_argument(
'--workers', type=int, default=4, help='number of threads to be used')
args = parser.parse_args()
if __name__ == '__main__':
from mmdet3d.utils import register_all_modules
register_all_modules()
scannet_data_prep(
root_path=args.root_path,
info_prefix=args.extra_tag,
out_dir=args.out_dir,
workers=args.workers)