Commit 2dad86c2 authored by YirongYan, committed by GitHub

[Feature]Support NeRF-Det (#2732)

# NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
> [NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection](https://arxiv.org/abs/2307.14620)
<!-- [ALGORITHM] -->
## Abstract
NeRF-Det is a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, NeRF-Det makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, NeRF-Det introduces sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, it subtly connects the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. NeRF-Det outperforms the state of the art by 3.9 mAP and 3.1 mAP on the ScanNet and ARKITScenes benchmarks, respectively. The authors provide extensive analysis to shed light on how NeRF-Det works. Thanks to the joint-training design, NeRF-Det generalizes well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at https://github.com/facebookresearch/NeRF-Det
<div align=center>
<img src="https://chenfengxu714.github.io/nerfdet/static/images/method-cropped_1.png" width="800"/>
</div>
## Introduction
This directory contains the implementation of NeRF-Det (https://arxiv.org/abs/2307.14620). Our implementation is built on top of MMDetection3D. We have updated NeRF-Det to be compatible with the latest mmdet3d version; the codebase and config files have been adapted accordingly. All previously pretrained models have been verified against the results listed below. However, newly trained models are yet to be uploaded.
<!-- Share any information you would like others to know. For example:
Author: @xxx.
This is an implementation of \[XXX\]. -->
## Dataset
The ScanNet dataset format in the latest version of mmdet3d only supports LiDAR-based tasks, so NeRF-Det requires a new ScanNet dataset format.
Please follow the ScanNet data preparation instructions in mmdet3d to prepare the raw data. After that, use the following command to generate the pkl files used by nerfdet:
```bash
python projects/NeRF-Det/prepare_infos.py --root-path ./data/scannet --out-dir ./data/scannet
```
The new pkl files are organized as follows (a short loading sketch is given after the list):
- scannet_infos_train.pkl: The training data infos. The detailed info of each scan is as follows:
  - info\['instances'\]: A list of dicts containing all annotations; each dict holds the annotation information of a single instance. For the i-th instance:
    - info\['instances'\]\[i\]\['bbox_3d'\]: A list of 6 numbers representing the axis-aligned 3D bounding box in the depth coordinate system, in (x, y, z, l, w, h) order.
    - info\['instances'\]\[i\]\['bbox_label_3d'\]: The label of the 3D bounding box.
  - info\['cam2img'\]: The intrinsic matrix. Every scene has one matrix.
  - info\['lidar2cam'\]: The extrinsic matrices. Every scene has 300 matrices.
  - info\['img_paths'\]: The paths of the 300 RGB images.
  - info\['axis_align_matrix'\]: The axis-alignment matrix. Every scene has one matrix.
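Below is a minimal, illustrative sketch for sanity-checking the generated pkl. It assumes the standard mmdet3d v2 info layout with a top-level `data_list` key; the exact file name and keys may differ in your setup (the configs reference `scannet_infos_train_new.pkl`), so treat it as a starting point rather than part of the project.

```python
# Illustrative only: inspect one scan from the generated infos.
# Assumes the mmdet3d v2 layout: {'metainfo': ..., 'data_list': [...]}.
import mmengine

infos = mmengine.load('./data/scannet/scannet_infos_train.pkl')
scan = infos['data_list'][0]
print(scan['cam2img'])            # one intrinsic matrix per scene
print(len(scan['lidar2cam']))     # extrinsic matrices (300 per scene)
print(len(scan['img_paths']))     # paths of the RGB images
for inst in scan['instances'][:3]:
    print(inst['bbox_3d'], inst['bbox_label_3d'])
```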
After preparing your ScanNet dataset pkls, please change the paths in the configs to fit your project.
## Train
In MMDet3D's root directory, run the following command to train the model:
```bash
python tools/train.py projects/NeRF-Det/configs/nerfdet_res50_2x_low_res.py --work-dir ${WORK_DIR}
```
## Results and Models
### NeRF-Det
| Backbone | mAP@25 | mAP@50 | Log |
| :-------------------------------------------------------------: | :----: | :----: | :-------: |
| [NeRF-Det-R50](./configs/nerfdet_res50_2x_low_res.py) | 53.0 | 26.8 | [log](<>) |
| [NeRF-Det-R50\*](./configs/nerfdet_res50_2x_low_res_depth.py) | 52.2 | 28.5 | [log](<>) |
| [NeRF-Det-R101\*](./configs/nerfdet_res101_2x_low_res_depth.py) | 52.3 | 28.5 | [log](<>) |
(Here NeRF-Det-R50\* means the model uses depth information during training.)
### Notes
- The values shown in the table are the best mAP obtained during training.
- Since the model's behavior involves considerable randomness, we ran three experiments for each config and took the average. The mAP values shown in the table above are these averages.
- We also conducted the same experiments with the original code; the results are shown below.
| Backbone | mAP@25 | mAP@50 |
| :-------------: | :----: | :----: |
| NeRF-Det-R50 | 52.8 | 26.8 |
| NeRF-Det-R50\* | 52.4 | 27.5 |
| NeRF-Det-R101\* | 52.8 | 28.6 |
- Attention: Because of the randomness in the construction of the ScanNet dataset itself and in the behavior of the model, training results can fluctuate considerably. In our experience, results typically vary by about ±1.5 mAP.
## Evaluation using pretrained models
1. Download the pretrained checkpoints through the links in the table above.
2. Testing
To test, use:
```bash
python tools/test.py projects/NeRF-Det/configs/nerfdet_res50_2x_low_res.py ${CHECKPOINT_PATH}
```
## Citation
<!-- You may remove this section if not applicable. -->
```latex
@inproceedings{
xu2023nerfdet,
title={NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection},
author={Xu, Chenfeng and Wu, Bichen and Hou, Ji and Tsai, Sam and Li, Ruilong and Wang, Jialiang and Zhan, Wei and He, Zijian and Vajda, Peter and Keutzer, Kurt and Tomizuka, Masayoshi},
booktitle={ICCV},
year={2023},
}
@inproceedings{
park2023time,
title={Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection},
author={Jinhyung Park and Chenfeng Xu and Shijia Yang and Kurt Keutzer and Kris M. Kitani and Masayoshi Tomizuka and Wei Zhan},
booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=H3HcEJA2Um}
}
```
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(imports=['projects.NeRF-Det.nerfdet'])
prior_generator = dict(
type='AlignedAnchor3DRangeGenerator',
ranges=[[-3.2, -3.2, -1.28, 3.2, 3.2, 1.28]],
rotations=[.0])
model = dict(
type='NerfDet',
data_preprocessor=dict(
type='NeRFDetDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True,
pad_size_divisor=10),
backbone=dict(
type='mmdet.ResNet',
depth=101,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet101'),
style='pytorch'),
neck=dict(
type='mmdet.FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
num_outs=4),
neck_3d=dict(
type='IndoorImVoxelNeck',
in_channels=256,
out_channels=128,
n_blocks=[1, 1, 1]),
bbox_head=dict(
type='NerfDetHead',
bbox_loss=dict(type='AxisAlignedIoULoss', loss_weight=1.0),
n_classes=18,
n_levels=3,
n_channels=128,
n_reg_outs=6,
pts_assign_threshold=27,
pts_center_threshold=18,
prior_generator=prior_generator),
prior_generator=prior_generator,
voxel_size=[.16, .16, .2],
n_voxels=[40, 40, 16],
aabb=([-2.7, -2.7, -0.78], [3.7, 3.7, 1.78]),
near_far_range=[0.2, 8.0],
N_samples=64,
N_rand=2048,
nerf_mode='image',
depth_supervise=True,
use_nerf_mask=True,
nerf_sample_view=20,
squeeze_scale=4,
nerf_density=True,
train_cfg=dict(),
test_cfg=dict(nms_pre=1000, iou_thr=.25, score_thr=.01))
dataset_type = 'MultiViewScanNetDataset'
data_root = 'data/scannet/'
class_names = [
'cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window', 'bookshelf',
'picture', 'counter', 'desk', 'curtain', 'refrigerator', 'showercurtrain',
'toilet', 'sink', 'bathtub', 'garbagebin'
]
metainfo = dict(CLASSES=class_names)
file_client_args = dict(backend='disk')
input_modality = dict(
use_camera=True,
use_depth=True,
use_lidar=False,
use_neuralrecon_depth=False,
use_ray=True)
backend_args = None
train_collect_keys = [
'img', 'gt_bboxes_3d', 'gt_labels_3d', 'depth', 'lightpos', 'nerf_sizes',
'raydirs', 'gt_images', 'gt_depths', 'denorm_images'
]
test_collect_keys = [
'img',
'depth',
'lightpos',
'nerf_sizes',
'raydirs',
'gt_images',
'gt_depths',
'denorm_images',
]
train_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=48,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=10),
dict(type='RandomShiftOrigin', std=(.7, .7, .0)),
dict(type='PackNeRFDetInputs', keys=train_collect_keys)
]
test_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=101,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=1),
dict(type='PackNeRFDetInputs', keys=test_collect_keys)
]
train_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='RepeatDataset',
times=6,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_train_new.pkl',
pipeline=train_pipeline,
modality=input_modality,
test_mode=False,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo)))
val_dataloader = dict(
batch_size=1,
num_workers=5,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_val_new.pkl',
pipeline=test_pipeline,
modality=input_modality,
test_mode=True,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo))
test_dataloader = val_dataloader
val_evaluator = dict(type='IndoorMetric')
test_evaluator = val_evaluator
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
test_cfg = dict()
val_cfg = dict()
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
paramwise_cfg=dict(
custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}),
clip_grad=dict(max_norm=35., norm_type=2))
param_scheduler = [
dict(
type='MultiStepLR',
begin=0,
end=12,
by_epoch=True,
milestones=[8, 11],
gamma=0.1)
]
# hooks
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=12))
# runtime
find_unused_parameters = True # only 1 of 4 FPN outputs is used
_base_ = ['./nerfdet_res50_2x_low_res_depth.py']
model = dict(depth_supervise=False)
dataset_type = 'MultiViewScanNetDataset'
data_root = 'data/scannet/'
class_names = [
'cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window', 'bookshelf',
'picture', 'counter', 'desk', 'curtain', 'refrigerator', 'showercurtrain',
'toilet', 'sink', 'bathtub', 'garbagebin'
]
metainfo = dict(CLASSES=class_names)
file_client_args = dict(backend='disk')
input_modality = dict(use_depth=False)
backend_args = None
train_collect_keys = [
'img', 'gt_bboxes_3d', 'gt_labels_3d', 'lightpos', 'nerf_sizes', 'raydirs',
'gt_images', 'gt_depths', 'denorm_images'
]
test_collect_keys = [
'img',
'lightpos',
'nerf_sizes',
'raydirs',
'gt_images',
'gt_depths',
'denorm_images',
]
train_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=50,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=10),
dict(type='RandomShiftOrigin', std=(.7, .7, .0)),
dict(type='PackNeRFDetInputs', keys=train_collect_keys)
]
test_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=101,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=1),
dict(type='PackNeRFDetInputs', keys=test_collect_keys)
]
train_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='RepeatDataset',
times=6,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_train_new.pkl',
pipeline=train_pipeline,
modality=input_modality,
test_mode=False,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo)))
val_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_val_new.pkl',
pipeline=test_pipeline,
modality=input_modality,
test_mode=True,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo))
test_dataloader = val_dataloader
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(imports=['projects.NeRF-Det.nerfdet'])
prior_generator = dict(
type='AlignedAnchor3DRangeGenerator',
ranges=[[-3.2, -3.2, -1.28, 3.2, 3.2, 1.28]],
rotations=[.0])
model = dict(
type='NerfDet',
data_preprocessor=dict(
type='NeRFDetDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True,
pad_size_divisor=10),
backbone=dict(
type='mmdet.ResNet',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50'),
style='pytorch'),
neck=dict(
type='mmdet.FPN',
in_channels=[256, 512, 1024, 2048],
out_channels=256,
num_outs=4),
neck_3d=dict(
type='IndoorImVoxelNeck',
in_channels=256,
out_channels=128,
n_blocks=[1, 1, 1]),
bbox_head=dict(
type='NerfDetHead',
bbox_loss=dict(type='AxisAlignedIoULoss', loss_weight=1.0),
n_classes=18,
n_levels=3,
n_channels=128,
n_reg_outs=6,
pts_assign_threshold=27,
pts_center_threshold=18,
prior_generator=prior_generator),
prior_generator=prior_generator,
voxel_size=[.16, .16, .2],
n_voxels=[40, 40, 16],
aabb=([-2.7, -2.7, -0.78], [3.7, 3.7, 1.78]),
near_far_range=[0.2, 8.0],
N_samples=64,
N_rand=2048,
nerf_mode='image',
depth_supervise=True,
use_nerf_mask=True,
nerf_sample_view=20,
squeeze_scale=4,
nerf_density=True,
train_cfg=dict(),
test_cfg=dict(nms_pre=1000, iou_thr=.25, score_thr=.01))
dataset_type = 'MultiViewScanNetDataset'
data_root = 'data/scannet/'
class_names = [
'cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window', 'bookshelf',
'picture', 'counter', 'desk', 'curtain', 'refrigerator', 'showercurtrain',
'toilet', 'sink', 'bathtub', 'garbagebin'
]
metainfo = dict(CLASSES=class_names)
file_client_args = dict(backend='disk')
input_modality = dict(
use_camera=True,
use_depth=True,
use_lidar=False,
use_neuralrecon_depth=False,
use_ray=True)
backend_args = None
train_collect_keys = [
'img', 'gt_bboxes_3d', 'gt_labels_3d', 'depth', 'lightpos', 'nerf_sizes',
'raydirs', 'gt_images', 'gt_depths', 'denorm_images'
]
test_collect_keys = [
'img',
'depth',
'lightpos',
'nerf_sizes',
'raydirs',
'gt_images',
'gt_depths',
'denorm_images',
]
train_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=50,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=10),
dict(type='RandomShiftOrigin', std=(.7, .7, .0)),
dict(type='PackNeRFDetInputs', keys=train_collect_keys)
]
test_pipeline = [
dict(type='LoadAnnotations3D'),
dict(
type='MultiViewPipeline',
n_images=101,
transforms=[
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(320, 240), keep_ratio=True),
],
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
margin=10,
depth_range=[0.5, 5.5],
loading='random',
nerf_target_views=1),
dict(type='PackNeRFDetInputs', keys=test_collect_keys)
]
train_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='RepeatDataset',
times=6,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_train_new.pkl',
pipeline=train_pipeline,
modality=input_modality,
test_mode=False,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo)))
val_dataloader = dict(
batch_size=1,
num_workers=5,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='scannet_infos_val_new.pkl',
pipeline=test_pipeline,
modality=input_modality,
test_mode=True,
filter_empty_gt=True,
box_type_3d='Depth',
metainfo=metainfo))
test_dataloader = val_dataloader
val_evaluator = dict(type='IndoorMetric')
test_evaluator = val_evaluator
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
test_cfg = dict()
val_cfg = dict()
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
paramwise_cfg=dict(
custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}),
clip_grad=dict(max_norm=35., norm_type=2))
param_scheduler = [
dict(
type='MultiStepLR',
begin=0,
end=12,
by_epoch=True,
milestones=[8, 11],
gamma=0.1)
]
# hooks
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=12))
# runtime
find_unused_parameters = True # only 1 of 4 FPN outputs is used
from .data_preprocessor import NeRFDetDataPreprocessor
from .formating import PackNeRFDetInputs
from .multiview_pipeline import MultiViewPipeline, RandomShiftOrigin
from .nerfdet import NerfDet
from .nerfdet_head import NerfDetHead
from .scannet_multiview_dataset import MultiViewScanNetDataset
__all__ = [
'MultiViewScanNetDataset', 'MultiViewPipeline', 'RandomShiftOrigin',
'PackNeRFDetInputs', 'NeRFDetDataPreprocessor', 'NerfDetHead', 'NerfDet'
]
# Copyright (c) OpenMMLab. All rights reserved.
from typing import List, Sequence, Union
import mmengine
import numpy as np
import torch
from mmcv import BaseTransform
from mmengine.structures import InstanceData
from numpy import dtype
from mmdet3d.registry import TRANSFORMS
from mmdet3d.structures import BaseInstance3DBoxes, PointData
from mmdet3d.structures.points import BasePoints
# from .det3d_data_sample import Det3DDataSample
from .nerf_det3d_data_sample import NeRFDet3DDataSample
def to_tensor(
data: Union[torch.Tensor, np.ndarray, Sequence, int,
float]) -> torch.Tensor:
"""Convert objects of various python types to :obj:`torch.Tensor`.
Supported types are: :class:`numpy.ndarray`, :class:`torch.Tensor`,
:class:`Sequence`, :class:`int` and :class:`float`.
Args:
data (torch.Tensor | numpy.ndarray | Sequence | int | float): Data to
be converted.
Returns:
torch.Tensor: the converted data.
"""
if isinstance(data, torch.Tensor):
return data
elif isinstance(data, np.ndarray):
if data.dtype is dtype('float64'):
data = data.astype(np.float32)
return torch.from_numpy(data)
elif isinstance(data, Sequence) and not mmengine.is_str(data):
return torch.tensor(data)
elif isinstance(data, int):
return torch.LongTensor([data])
elif isinstance(data, float):
return torch.FloatTensor([data])
else:
raise TypeError(f'type {type(data)} cannot be converted to tensor.')
@TRANSFORMS.register_module()
class PackNeRFDetInputs(BaseTransform):
INPUTS_KEYS = ['points', 'img']
NERF_INPUT_KEYS = [
'img', 'denorm_images', 'depth', 'lightpos', 'nerf_sizes', 'raydirs'
]
INSTANCEDATA_3D_KEYS = [
'gt_bboxes_3d', 'gt_labels_3d', 'attr_labels', 'depths', 'centers_2d'
]
INSTANCEDATA_2D_KEYS = [
'gt_bboxes',
'gt_bboxes_labels',
]
NERF_3D_KEYS = ['gt_images', 'gt_depths']
SEG_KEYS = [
'gt_seg_map', 'pts_instance_mask', 'pts_semantic_mask',
'gt_semantic_seg'
]
def __init__(
self,
keys: tuple,
meta_keys: tuple = ('img_path', 'ori_shape', 'img_shape', 'lidar2img',
'depth2img', 'cam2img', 'pad_shape',
'scale_factor', 'flip', 'pcd_horizontal_flip',
'pcd_vertical_flip', 'box_mode_3d', 'box_type_3d',
'img_norm_cfg', 'num_pts_feats', 'pcd_trans',
'sample_idx', 'pcd_scale_factor', 'pcd_rotation',
'pcd_rotation_angle', 'lidar_path',
'transformation_3d_flow', 'trans_mat',
'affine_aug', 'sweep_img_metas', 'ori_cam2img',
'cam2global', 'crop_offset', 'img_crop_offset',
'resize_img_shape', 'lidar2cam', 'ori_lidar2img',
'num_ref_frames', 'num_views', 'ego2global',
'axis_align_matrix')
) -> None:
self.keys = keys
self.meta_keys = meta_keys
def _remove_prefix(self, key: str) -> str:
if key.startswith('gt_'):
key = key[3:]
return key
def transform(self, results: Union[dict,
List[dict]]) -> Union[dict, List[dict]]:
"""Method to pack the input data. when the value in this dict is a
list, it usually is in Augmentations Testing.
Args:
results (dict | list[dict]): Result dict from the data pipeline.
Returns:
dict | List[dict]:
- 'inputs' (dict): The forward data of models. It usually contains
following keys:
- points
- img
- 'data_samples' (:obj:`NeRFDet3DDataSample`): The annotation info
of the sample.
"""
# augtest
if isinstance(results, list):
if len(results) == 1:
# simple test
return self.pack_single_results(results[0])
pack_results = []
for single_result in results:
pack_results.append(self.pack_single_results(single_result))
return pack_results
# norm training and simple testing
elif isinstance(results, dict):
return self.pack_single_results(results)
else:
raise NotImplementedError
def pack_single_results(self, results: dict) -> dict:
"""Method to pack the single input data. when the value in this dict is
a list, it usually is in Augmentations Testing.
Args:
results (dict): Result dict from the data pipeline.
Returns:
dict: A dict containing
- 'inputs' (dict): The forward data of models. It usually contains
following keys:
- points
- img
- 'data_samples' (:obj:`NeRFDet3DDataSample`): The annotation info
of the sample.
"""
# Format 3D data
if 'points' in results:
if isinstance(results['points'], BasePoints):
results['points'] = results['points'].tensor
if 'img' in results:
if isinstance(results['img'], list):
# process multiple imgs in single frame
imgs = np.stack(results['img'], axis=0)
if imgs.flags.c_contiguous:
imgs = to_tensor(imgs).permute(0, 3, 1, 2).contiguous()
else:
imgs = to_tensor(
np.ascontiguousarray(imgs.transpose(0, 3, 1, 2)))
results['img'] = imgs
else:
img = results['img']
if len(img.shape) < 3:
img = np.expand_dims(img, -1)
# To improve the computational speed by 3-5 times, apply:
# `torch.permute()` rather than `np.transpose()`.
# Refer to https://github.com/open-mmlab/mmdetection/pull/9533
# for more details
if img.flags.c_contiguous:
img = to_tensor(img).permute(2, 0, 1).contiguous()
else:
img = to_tensor(
np.ascontiguousarray(img.transpose(2, 0, 1)))
results['img'] = img
if 'depth' in results:
if isinstance(results['depth'], list):
# process multiple depth imgs in single frame
depth_imgs = np.stack(results['depth'], axis=0)
if depth_imgs.flags.c_contiguous:
depth_imgs = to_tensor(depth_imgs).contiguous()
else:
depth_imgs = to_tensor(np.ascontiguousarray(depth_imgs))
results['depth'] = depth_imgs
else:
depth_img = results['depth']
if len(depth_img.shape) < 3:
depth_img = np.expand_dims(depth_img, -1)
if depth_img.flags.c_contiguous:
depth_img = to_tensor(depth_img).contiguous()
else:
depth_img = to_tensor(np.ascontiguousarray(depth_img))
results['depth'] = depth_img
if 'ray_info' in results:
if isinstance(results['raydirs'], list):
raydirs = np.stack(results['raydirs'], axis=0)
if raydirs.flags.c_contiguous:
raydirs = to_tensor(raydirs).contiguous()
else:
raydirs = to_tensor(np.ascontiguousarray(raydirs))
results['raydirs'] = raydirs
if isinstance(results['lightpos'], list):
lightposes = np.stack(results['lightpos'], axis=0)
if lightposes.flags.c_contiguous:
lightposes = to_tensor(lightposes).contiguous()
else:
lightposes = to_tensor(np.ascontiguousarray(lightposes))
lightposes = lightposes.unsqueeze(1).repeat(
1, raydirs.shape[1], 1)
results['lightpos'] = lightposes
if isinstance(results['gt_images'], list):
gt_images = np.stack(results['gt_images'], axis=0)
if gt_images.flags.c_contiguous:
gt_images = to_tensor(gt_images).contiguous()
else:
gt_images = to_tensor(np.ascontiguousarray(gt_images))
results['gt_images'] = gt_images
if isinstance(results['gt_depths'],
list) and len(results['gt_depths']) != 0:
gt_depths = np.stack(results['gt_depths'], axis=0)
if gt_depths.flags.c_contiguous:
gt_depths = to_tensor(gt_depths).contiguous()
else:
gt_depths = to_tensor(np.ascontiguousarray(gt_depths))
results['gt_depths'] = gt_depths
if isinstance(results['denorm_images'], list):
denorm_imgs = np.stack(results['denorm_images'], axis=0)
if denorm_imgs.flags.c_contiguous:
denorm_imgs = to_tensor(denorm_imgs).permute(
0, 3, 1, 2).contiguous()
else:
denorm_imgs = to_tensor(
np.ascontiguousarray(
denorm_imgs.transpose(0, 3, 1, 2)))
results['denorm_images'] = denorm_imgs
for key in [
'proposals', 'gt_bboxes', 'gt_bboxes_ignore', 'gt_labels',
'gt_bboxes_labels', 'attr_labels', 'pts_instance_mask',
'pts_semantic_mask', 'centers_2d', 'depths', 'gt_labels_3d'
]:
if key not in results:
continue
if isinstance(results[key], list):
results[key] = [to_tensor(res) for res in results[key]]
else:
results[key] = to_tensor(results[key])
if 'gt_bboxes_3d' in results:
if not isinstance(results['gt_bboxes_3d'], BaseInstance3DBoxes):
results['gt_bboxes_3d'] = to_tensor(results['gt_bboxes_3d'])
if 'gt_semantic_seg' in results:
results['gt_semantic_seg'] = to_tensor(
results['gt_semantic_seg'][None])
if 'gt_seg_map' in results:
results['gt_seg_map'] = results['gt_seg_map'][None, ...]
if 'gt_images' in results:
results['gt_images'] = to_tensor(results['gt_images'])
if 'gt_depths' in results:
results['gt_depths'] = to_tensor(results['gt_depths'])
data_sample = NeRFDet3DDataSample()
gt_instances_3d = InstanceData()
gt_instances = InstanceData()
gt_pts_seg = PointData()
gt_nerf_images = InstanceData()
gt_nerf_depths = InstanceData()
data_metas = {}
for key in self.meta_keys:
if key in results:
data_metas[key] = results[key]
elif 'images' in results:
if len(results['images'].keys()) == 1:
cam_type = list(results['images'].keys())[0]
# single-view image
if key in results['images'][cam_type]:
data_metas[key] = results['images'][cam_type][key]
else:
# multi-view image
img_metas = []
cam_types = list(results['images'].keys())
for cam_type in cam_types:
if key in results['images'][cam_type]:
img_metas.append(results['images'][cam_type][key])
if len(img_metas) > 0:
data_metas[key] = img_metas
elif 'lidar_points' in results:
if key in results['lidar_points']:
data_metas[key] = results['lidar_points'][key]
data_sample.set_metainfo(data_metas)
inputs = {}
for key in self.keys:
if key in results:
# if key in self.INPUTS_KEYS:
if key in self.NERF_INPUT_KEYS:
inputs[key] = results[key]
elif key in self.NERF_3D_KEYS:
if key == 'gt_images':
gt_nerf_images[self._remove_prefix(key)] = results[key]
else:
gt_nerf_depths[self._remove_prefix(key)] = results[key]
elif key in self.INSTANCEDATA_3D_KEYS:
gt_instances_3d[self._remove_prefix(key)] = results[key]
elif key in self.INSTANCEDATA_2D_KEYS:
if key == 'gt_bboxes_labels':
gt_instances['labels'] = results[key]
else:
gt_instances[self._remove_prefix(key)] = results[key]
elif key in self.SEG_KEYS:
gt_pts_seg[self._remove_prefix(key)] = results[key]
else:
raise NotImplementedError(f'Please modify '
f'`PackNeRFDetInputs` '
f'to put {key} into its '
f'corresponding field')
data_sample.gt_instances_3d = gt_instances_3d
data_sample.gt_instances = gt_instances
data_sample.gt_pts_seg = gt_pts_seg
data_sample.gt_nerf_images = gt_nerf_images
data_sample.gt_nerf_depths = gt_nerf_depths
if 'eval_ann_info' in results:
data_sample.eval_ann_info = results['eval_ann_info']
else:
data_sample.eval_ann_info = None
packed_results = dict()
packed_results['data_samples'] = data_sample
packed_results['inputs'] = inputs
return packed_results
def __repr__(self) -> str:
"""str: Return a string that describes the module."""
repr_str = self.__class__.__name__
repr_str += f'(keys={self.keys})'
repr_str += f'(meta_keys={self.meta_keys})'
return repr_str
# Copyright (c) OpenMMLab. All rights reserved.
import mmcv
import numpy as np
from mmcv.transforms import BaseTransform, Compose
from PIL import Image
from mmdet3d.registry import TRANSFORMS
def get_dtu_raydir(pixelcoords, intrinsic, rot, dir_norm=None):
"""Compute world-frame ray directions for the given pixel coordinates.
Args:
pixelcoords: Pixel coordinates of shape H x W x 2.
intrinsic: Camera intrinsic matrix.
rot: Camera-to-world rotation matrix (c2w).
dir_norm: If truthy, normalize the ray directions to unit length.
"""
x = (pixelcoords[..., 0] + 0.5 - intrinsic[0, 2]) / intrinsic[0, 0]
y = (pixelcoords[..., 1] + 0.5 - intrinsic[1, 2]) / intrinsic[1, 1]
z = np.ones_like(x)
dirs = np.stack([x, y, z], axis=-1)
# rotate the camera-frame directions into the world frame
# (equivalent to np.sum(dirs[..., None, :] * rot, axis=-1))
dirs = dirs @ rot[:, :].T
if dir_norm:
dirs = dirs / (np.linalg.norm(dirs, axis=-1, keepdims=True) + 1e-5)
return dirs
@TRANSFORMS.register_module()
class MultiViewPipeline(BaseTransform):
"""MultiViewPipeline used in nerfdet.
Required Keys:
- depth_info
- img_prefix
- img_info
- lidar2img
- c2w
- cammrotc2w
- lightpos
- ray_info
Modified Keys:
- lidar2img
Added Keys:
- img
- denorm_images
- depth
- c2w
- camrotc2w
- lightpos
- pixels
- raydirs
- gt_images
- gt_depths
- nerf_sizes
- depth_range
Args:
transforms (list[dict]): The transform pipeline
used to process the imgs.
n_images (int): The number of sampled views.
mean (array): The mean values used in normalization.
std (array): The standard deviation values used in normalization.
margin (int): The margin value. Defaults to 10.
depth_range (array): The range of the depth.
Defaults to [0.5, 5.5].
loading (str): The mode of loading. Defaults to 'random'.
nerf_target_views (int): The number of novel views.
sample_freq (int): The frequency of sampling.
"""
def __init__(self,
transforms: dict,
n_images: int,
mean: tuple = [123.675, 116.28, 103.53],
std: tuple = [58.395, 57.12, 57.375],
margin: int = 10,
depth_range: tuple = [0.5, 5.5],
loading: str = 'random',
nerf_target_views: int = 0,
sample_freq: int = 3):
self.transforms = Compose(transforms)
self.depth_transforms = Compose(transforms[1])
self.n_images = n_images
self.mean = np.array(mean, dtype=np.float32)
self.std = np.array(std, dtype=np.float32)
self.margin = margin
self.depth_range = depth_range
self.loading = loading
self.sample_freq = sample_freq
self.nerf_target_views = nerf_target_views
def transform(self, results: dict) -> dict:
"""Nerfdet transform function.
Args:
results (dict): Result dict from loading pipeline
Returns:
dict: The result dict containing the processed results.
Updated key and value are described below.
- img (list): The loaded origin image.
- denorm_images (list): The denormalized image.
- depth (list): The origin depth image.
- c2w (list): The camera-to-world matrices.
- camrotc2w (list): The camera-to-world rotation matrices.
- lightpos (list): The camera positions (the translation part of c2w).
- pixels (list): The sampled pixel coordinate grids.
- raydirs (list): The ray directions.
- gt_images (list): The groundtruth images.
- gt_depths (list): The groundtruth depth images.
- nerf_sizes (array): The size of the groundtruth images.
- depth_range (array): The range of the depth.
Here we give a detailed explanation of some keys mentioned above.
Let P_c be a point in the camera coordinate system and P_w the same
point in the world coordinate system; they are related by
P_w = R @ P_c + T. The 'camrotc2w' mentioned above corresponds to the
rotation R, and 'lightpos' corresponds to the translation T (the camera
position in world coordinates). Putting R and T together gives the
camera-to-world extrinsic matrix, which corresponds to the 'c2w' above.
"""
imgs = []
depths = []
extrinsics = []
c2ws = []
camrotc2ws = []
lightposes = []
pixels = []
raydirs = []
gt_images = []
gt_depths = []
denorm_imgs_list = []
nerf_sizes = []
if self.loading == 'random':
ids = np.arange(len(results['img_info']))
replace = True if self.n_images > len(ids) else False
ids = np.random.choice(ids, self.n_images, replace=replace)
if self.nerf_target_views != 0:
target_id = np.random.choice(
ids, self.nerf_target_views, replace=False)
ids = np.setdiff1d(ids, target_id)
ids = ids.tolist()
target_id = target_id.tolist()
else:
ids = np.arange(len(results['img_info']))
begin_id = 0
ids = np.arange(begin_id,
begin_id + self.n_images * self.sample_freq,
self.sample_freq)
if self.nerf_target_views != 0:
target_id = ids
ratio = 0
size = (240, 320)
for i in ids:
_results = dict()
_results['img_path'] = results['img_info'][i]['filename']
_results = self.transforms(_results)
imgs.append(_results['img'])
# normalize
for key in _results.get('img_fields', ['img']):
_results[key] = mmcv.imnormalize(_results[key], self.mean,
self.std, True)
_results['img_norm_cfg'] = dict(
mean=self.mean, std=self.std, to_rgb=True)
# pad
for key in _results.get('img_fields', ['img']):
padded_img = mmcv.impad(_results[key], shape=size, pad_val=0)
_results[key] = padded_img
_results['pad_shape'] = padded_img.shape
_results['pad_fixed_size'] = size
ori_shape = _results['ori_shape']
aft_shape = _results['img_shape']
ratio = ori_shape[0] / aft_shape[0]
# prepare the depth information
if 'depth_info' in results.keys():
if '.npy' in results['depth_info'][i]['filename']:
_results['depth'] = np.load(
results['depth_info'][i]['filename'])
else:
_results['depth'] = np.asarray((Image.open(
results['depth_info'][i]['filename']))) / 1000
_results['depth'] = mmcv.imresize(
_results['depth'], (aft_shape[1], aft_shape[0]))
depths.append(_results['depth'])
denorm_img = mmcv.imdenormalize(
_results['img'], self.mean, self.std, to_bgr=True).astype(
np.uint8) / 255.0
denorm_imgs_list.append(denorm_img)
height, width = padded_img.shape[:2]
extrinsics.append(results['lidar2img']['extrinsic'][i])
# prepare the nerf information
if 'ray_info' in results.keys():
intrinsics_nerf = results['lidar2img']['intrinsic'].copy()
intrinsics_nerf[:2] = intrinsics_nerf[:2] / ratio
assert self.nerf_target_views > 0
for i in target_id:
c2ws.append(results['c2w'][i])
camrotc2ws.append(results['camrotc2w'][i])
lightposes.append(results['lightpos'][i])
px, py = np.meshgrid(
np.arange(self.margin,
width - self.margin).astype(np.float32),
np.arange(self.margin,
height - self.margin).astype(np.float32))
pixelcoords = np.stack((px, py),
axis=-1).astype(np.float32) # H x W x 2
pixels.append(pixelcoords)
raydir = get_dtu_raydir(pixelcoords, intrinsics_nerf,
results['camrotc2w'][i])
raydirs.append(np.reshape(raydir.astype(np.float32), (-1, 3)))
# read target images
temp_results = dict()
temp_results['img_path'] = results['img_info'][i]['filename']
temp_results_ = self.transforms(temp_results)
# normalize
for key in temp_results.get('img_fields', ['img']):
temp_results[key] = mmcv.imnormalize(
temp_results[key], self.mean, self.std, True)
temp_results['img_norm_cfg'] = dict(
mean=self.mean, std=self.std, to_rgb=True)
# pad
for key in temp_results.get('img_fields', ['img']):
padded_img = mmcv.impad(
temp_results[key], shape=size, pad_val=0)
temp_results[key] = padded_img
temp_results['pad_shape'] = padded_img.shape
temp_results['pad_fixed_size'] = size
# denormalize target_images.
denorm_imgs = mmcv.imdenormalize(
temp_results_['img'], self.mean, self.std,
to_bgr=True).astype(np.uint8)
gt_rgb_shape = denorm_imgs.shape
gt_image = denorm_imgs[py.astype(np.int32),
px.astype(np.int32), :]
nerf_sizes.append(np.array(gt_image.shape))
gt_image = np.reshape(gt_image, (-1, 3))
gt_images.append(gt_image / 255.0)
if 'depth_info' in results.keys():
if '.npy' in results['depth_info'][i]['filename']:
_results['depth'] = np.load(
results['depth_info'][i]['filename'])
else:
depth_image = Image.open(
results['depth_info'][i]['filename'])
_results['depth'] = np.asarray(depth_image) / 1000
_results['depth'] = mmcv.imresize(
_results['depth'],
(gt_rgb_shape[1], gt_rgb_shape[0]))
_results['depth'] = _results['depth']
gt_depth = _results['depth'][py.astype(np.int32),
px.astype(np.int32)]
gt_depths.append(gt_depth)
for key in _results.keys():
if key not in ['img', 'img_info']:
results[key] = _results[key]
results['img'] = imgs
if 'ray_info' in results.keys():
results['c2w'] = c2ws
results['camrotc2w'] = camrotc2ws
results['lightpos'] = lightposes
results['pixels'] = pixels
results['raydirs'] = raydirs
results['gt_images'] = gt_images
results['gt_depths'] = gt_depths
results['nerf_sizes'] = nerf_sizes
results['denorm_images'] = denorm_imgs_list
results['depth_range'] = np.array([self.depth_range])
if len(depths) != 0:
results['depth'] = depths
results['lidar2img']['extrinsic'] = extrinsics
return results
@TRANSFORMS.register_module()
class RandomShiftOrigin(BaseTransform):
def __init__(self, std):
self.std = std
def transform(self, results):
shift = np.random.normal(.0, self.std, 3)
results['lidar2img']['origin'] += shift
return results
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Dict, List, Optional, Tuple, Union
import torch
from mmengine.structures import InstanceData
from mmdet3d.structures import Det3DDataSample
class NeRFDet3DDataSample(Det3DDataSample):
"""A data structure interface inheirted from Det3DDataSample. Some new
attributes are added to match the NeRF-Det project.
The attributes added in ``NeRFDet3DDataSample`` are divided into two parts:
- ``gt_nerf_images`` (InstanceData): Ground truth of the images which
will be used in the NeRF branch.
- ``gt_nerf_depths`` (InstanceData): Ground truth of the depth images
which will be used in the NeRF branch if needed.
For more details and examples, please refer to the 'Det3DDataSample' file.
"""
@property
def gt_nerf_images(self) -> InstanceData:
return self._gt_nerf_images
@gt_nerf_images.setter
def gt_nerf_images(self, value: InstanceData) -> None:
self.set_field(value, '_gt_nerf_images', dtype=InstanceData)
@gt_nerf_images.deleter
def gt_nerf_images(self) -> None:
del self._gt_nerf_images
@property
def gt_nerf_depths(self) -> InstanceData:
return self._gt_nerf_depths
@gt_nerf_depths.setter
def gt_nerf_depths(self, value: InstanceData) -> None:
self.set_field(value, '_gt_nerf_depths', dtype=InstanceData)
@gt_nerf_depths.deleter
def gt_nerf_depths(self) -> None:
del self._gt_nerf_depths
SampleList = List[NeRFDet3DDataSample]
OptSampleList = Optional[SampleList]
ForwardResults = Union[Dict[str, torch.Tensor], List[NeRFDet3DDataSample],
Tuple[torch.Tensor], torch.Tensor]
# Copyright (c) OpenMMLab. All rights reserved.
import math
from typing import Callable, Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLP(nn.Module):
"""The MLP module used in NerfDet.
Args:
input_dim (int): The number of input tensor channels.
output_dim (int): The number of output tensor channels.
net_depth (int): The depth of the MLP. Defaults to 8.
net_width (int): The width of the MLP. Defaults to 256.
skip_layer (int): The layer to add skip layers to. Defaults to 4.
hidden_init (Callable): The initialize method of the hidden layers.
hidden_activation (Callable): The activation function of hidden
layers, defaults to ReLU.
output_enabled (bool): If true, the output layers will be used.
Defaults to True.
output_init (Optional): The initialize method of the output layer.
output_activation (Optional): The activation function of the output layer.
bias_enabled (bool): If True, bias will be used.
bias_init (Callable): The initialization method of the bias.
Defaults to ``nn.init.zeros_``.
"""
def __init__(
self,
input_dim: int,
output_dim: int = None,
net_depth: int = 8,
net_width: int = 256,
skip_layer: int = 4,
hidden_init: Callable = nn.init.xavier_uniform_,
hidden_activation: Callable = nn.ReLU(),
output_enabled: bool = True,
output_init: Optional[Callable] = nn.init.xavier_uniform_,
output_activation: Optional[Callable] = nn.Identity(),
bias_enabled: bool = True,
bias_init: Callable = nn.init.zeros_,
):
super().__init__()
self.input_dim = input_dim
self.output_dim = output_dim
self.net_depth = net_depth
self.net_width = net_width
self.skip_layer = skip_layer
self.hidden_init = hidden_init
self.hidden_activation = hidden_activation
self.output_enabled = output_enabled
self.output_init = output_init
self.output_activation = output_activation
self.bias_enabled = bias_enabled
self.bias_init = bias_init
self.hidden_layers = nn.ModuleList()
in_features = self.input_dim
for i in range(self.net_depth):
self.hidden_layers.append(
nn.Linear(in_features, self.net_width, bias=bias_enabled))
if (self.skip_layer is not None) and (i % self.skip_layer
== 0) and (i > 0):
in_features = self.net_width + self.input_dim
else:
in_features = self.net_width
if self.output_enabled:
self.output_layer = nn.Linear(
in_features, self.output_dim, bias=bias_enabled)
else:
self.output_dim = in_features
self.initialize()
def initialize(self):
def init_func_hidden(m):
if isinstance(m, nn.Linear):
if self.hidden_init is not None:
self.hidden_init(m.weight)
if self.bias_enabled and self.bias_init is not None:
self.bias_init(m.bias)
self.hidden_layers.apply(init_func_hidden)
if self.output_enabled:
def init_func_output(m):
if isinstance(m, nn.Linear):
if self.output_init is not None:
self.output_init(m.weight)
if self.bias_enabled and self.bias_init is not None:
self.bias_init(m.bias)
self.output_layer.apply(init_func_output)
def forward(self, x):
inputs = x
for i in range(self.net_depth):
x = self.hidden_layers[i](x)
x = self.hidden_activation(x)
if (self.skip_layer is not None) and (i % self.skip_layer
== 0) and (i > 0):
x = torch.cat([x, inputs], dim=-1)
if self.output_enabled:
x = self.output_layer(x)
x = self.output_activation(x)
return x
class DenseLayer(MLP):
def __init__(self, input_dim, output_dim, **kwargs):
super().__init__(
input_dim=input_dim,
output_dim=output_dim,
net_depth=0, # no hidden layers
**kwargs,
)
class NerfMLP(nn.Module):
"""The Nerf-MLP Module.
Args:
input_dim (int): The number of input tensor channels.
condition_dim (int): The number of condition tensor channels.
feature_dim (int): The number of feature channels. Defaults to 0.
net_depth (int): The depth of the MLP. Defaults to 8.
net_width (int): The width of the MLP. Defaults to 256.
skip_layer (int): The layer to add skip layers to. Defaults to 4.
net_depth_condition (int): The depth of the second part of MLP.
Defaults to 1.
net_width_condition (int): The width of the second part of MLP.
Defaults to 128.
"""
def __init__(
self,
input_dim: int,
condition_dim: int,
feature_dim: int = 0,
net_depth: int = 8,
net_width: int = 256,
skip_layer: int = 4,
net_depth_condition: int = 1,
net_width_condition: int = 128,
):
super().__init__()
self.base = MLP(
input_dim=input_dim + feature_dim,
net_depth=net_depth,
net_width=net_width,
skip_layer=skip_layer,
output_enabled=False,
)
hidden_features = self.base.output_dim
self.sigma_layer = DenseLayer(hidden_features, 1)
if condition_dim > 0:
self.bottleneck_layer = DenseLayer(hidden_features, net_width)
self.rgb_layer = MLP(
input_dim=net_width + condition_dim,
output_dim=3,
net_depth=net_depth_condition,
net_width=net_width_condition,
skip_layer=None,
)
else:
self.rgb_layer = DenseLayer(hidden_features, 3)
def query_density(self, x, features=None):
"""Calculate the raw sigma."""
if features is not None:
x = self.base(torch.cat([x, features], dim=-1))
else:
x = self.base(x)
raw_sigma = self.sigma_layer(x)
return raw_sigma
def forward(self, x, condition=None, features=None):
if features is not None:
x = self.base(torch.cat([x, features], dim=-1))
else:
x = self.base(x)
raw_sigma = self.sigma_layer(x)
if condition is not None:
if condition.shape[:-1] != x.shape[:-1]:
num_rays, n_dim = condition.shape
condition = condition.view(
[num_rays] + [1] * (x.dim() - condition.dim()) +
[n_dim]).expand(list(x.shape[:-1]) + [n_dim])
bottleneck = self.bottleneck_layer(x)
x = torch.cat([bottleneck, condition], dim=-1)
raw_rgb = self.rgb_layer(x)
return raw_rgb, raw_sigma
class SinusoidalEncoder(nn.Module):
"""Sinusodial Positional Encoder used in NeRF."""
def __init__(self, x_dim, min_deg, max_deg, use_identity: bool = True):
super().__init__()
self.x_dim = x_dim
self.min_deg = min_deg
self.max_deg = max_deg
self.use_identity = use_identity
self.register_buffer(
'scales', torch.tensor([2**i for i in range(min_deg, max_deg)]))
@property
def latent_dim(self) -> int:
return (int(self.use_identity) +
(self.max_deg - self.min_deg) * 2) * self.x_dim
def forward(self, x: torch.Tensor) -> torch.Tensor:
if self.max_deg == self.min_deg:
return x
xb = torch.reshape(
(x[Ellipsis, None, :] * self.scales[:, None]),
list(x.shape[:-1]) + [(self.max_deg - self.min_deg) * self.x_dim],
)
latent = torch.sin(torch.cat([xb, xb + 0.5 * math.pi], dim=-1))
if self.use_identity:
latent = torch.cat([x] + [latent], dim=-1)
return latent
class VanillaNeRF(nn.Module):
"""The Nerf-MLP with the positional encoder.
Args:
net_depth (int): The depth of the MLP. Defaults to 8.
net_width (int): The width of the MLP. Defaults to 256.
skip_layer (int): The layer to add skip layers to. Defaults to 4.
feature_dim (int): The number of feature channels. Defaults to 0.
net_depth_condition (int): The depth of the second part of MLP.
Defaults to 1.
net_width_condition (int): The width of the second part of MLP.
Defaults to 128.
"""
def __init__(self,
net_depth: int = 8,
net_width: int = 256,
skip_layer: int = 4,
feature_dim: int = 0,
net_depth_condition: int = 1,
net_width_condition: int = 128):
super().__init__()
self.posi_encoder = SinusoidalEncoder(3, 0, 10, True)
self.view_encoder = SinusoidalEncoder(3, 0, 4, True)
self.mlp = NerfMLP(
input_dim=self.posi_encoder.latent_dim,
condition_dim=self.view_encoder.latent_dim,
feature_dim=feature_dim,
net_depth=net_depth,
net_width=net_width,
skip_layer=skip_layer,
net_depth_condition=net_depth_condition,
net_width_condition=net_width_condition,
)
def query_density(self, x, features=None):
x = self.posi_encoder(x)
sigma = self.mlp.query_density(x, features)
return F.relu(sigma)
def forward(self, x, condition=None, features=None):
x = self.posi_encoder(x)
if condition is not None:
condition = self.view_encoder(condition)
rgb, sigma = self.mlp(x, condition=condition, features=features)
return torch.sigmoid(rgb), F.relu(sigma)
# Copyright (c) OpenMMLab. All rights reserved.
# Attention: This file is mainly adapted from the file with the same name
# in the original project. For more details, please refer to the
# original project.
import torch
import torch.nn.functional as F
class Projector():
def __init__(self, device='cuda'):
self.device = device
def inbound(self, pixel_locations, h, w):
"""check if the pixel locations are in valid range."""
return (pixel_locations[..., 0] <= w - 1.) & \
(pixel_locations[..., 0] >= 0) & \
(pixel_locations[..., 1] <= h - 1.) &\
(pixel_locations[..., 1] >= 0)
def normalize(self, pixel_locations, h, w):
resize_factor = torch.tensor([w - 1., h - 1.
]).to(pixel_locations.device)[None,
None, :]
normalized_pixel_locations = 2 * pixel_locations / resize_factor - 1.
return normalized_pixel_locations
def compute_projections(self, xyz, train_cameras):
"""project 3D points into cameras."""
original_shape = xyz.shape[:2]
xyz = xyz.reshape(-1, 3)
num_views = len(train_cameras)
train_intrinsics = train_cameras[:, 2:18].reshape(-1, 4, 4)
train_poses = train_cameras[:, -16:].reshape(-1, 4, 4)
xyz_h = torch.cat([xyz, torch.ones_like(xyz[..., :1])], dim=-1)
# projections = train_intrinsics.bmm(torch.inverse(train_poses))
# the poses have already been inverted in the dataloader, so there is
# no need to invert them here.
projections = train_intrinsics.bmm(train_poses) \
.bmm(xyz_h.t()[None, ...].repeat(num_views, 1, 1))
projections = projections.permute(0, 2, 1)
pixel_locations = projections[..., :2] / torch.clamp(
projections[..., 2:3], min=1e-8)
pixel_locations = torch.clamp(pixel_locations, min=-1e6, max=1e6)
mask = projections[..., 2] > 0
return pixel_locations.reshape((num_views, ) + original_shape + (2, )), \
mask.reshape((num_views, ) + original_shape) # noqa
def compute_angle(self, xyz, query_camera, train_cameras):
original_shape = xyz.shape[:2]
xyz = xyz.reshape(-1, 3)
train_poses = train_cameras[:, -16:].reshape(-1, 4, 4)
num_views = len(train_poses)
query_pose = query_camera[-16:].reshape(-1, 4,
4).repeat(num_views, 1, 1)
ray2tar_pose = (query_pose[:, :3, 3].unsqueeze(1) - xyz.unsqueeze(0))
ray2tar_pose /= (torch.norm(ray2tar_pose, dim=-1, keepdim=True) + 1e-6)
ray2train_pose = (
train_poses[:, :3, 3].unsqueeze(1) - xyz.unsqueeze(0))
ray2train_pose /= (
torch.norm(ray2train_pose, dim=-1, keepdim=True) + 1e-6)
ray_diff = ray2tar_pose - ray2train_pose
ray_diff_norm = torch.norm(ray_diff, dim=-1, keepdim=True)
ray_diff_dot = torch.sum(
ray2tar_pose * ray2train_pose, dim=-1, keepdim=True)
ray_diff_direction = ray_diff / torch.clamp(ray_diff_norm, min=1e-6)
ray_diff = torch.cat([ray_diff_direction, ray_diff_dot], dim=-1)
ray_diff = ray_diff.reshape((num_views, ) + original_shape + (4, ))
return ray_diff
def compute(self,
xyz,
train_imgs,
train_cameras,
featmaps=None,
grid_sample=True):
assert (train_imgs.shape[0] == 1) \
and (train_cameras.shape[0] == 1)
# only support batch_size=1 for now
train_imgs = train_imgs.squeeze(0)
train_cameras = train_cameras.squeeze(0)
train_imgs = train_imgs.permute(0, 3, 1, 2)
h, w = train_cameras[0][:2]
# compute the projection of the query points to each reference image
pixel_locations, mask_in_front = self.compute_projections(
xyz, train_cameras)
normalized_pixel_locations = self.normalize(pixel_locations, h, w)
# rgb sampling
rgbs_sampled = F.grid_sample(
train_imgs, normalized_pixel_locations, align_corners=True)
rgb_sampled = rgbs_sampled.permute(2, 3, 0, 1)
# deep feature sampling
if featmaps is not None:
if grid_sample:
feat_sampled = F.grid_sample(
featmaps, normalized_pixel_locations, align_corners=True)
feat_sampled = feat_sampled.permute(
2, 3, 0, 1) # [n_rays, n_samples, n_views, d]
rgb_feat_sampled = torch.cat(
[rgb_sampled, feat_sampled],
dim=-1) # [n_rays, n_samples, n_views, d+3]
# rgb_feat_sampled = feat_sampled
else:
n_images, n_channels, f_h, f_w = featmaps.shape
resize_factor = torch.tensor([f_w / w - 1., f_h / h - 1.]).to(
pixel_locations.device)[None, None, :]
sample_location = (pixel_locations *
resize_factor).round().long()
n_images, n_ray, n_sample, _ = sample_location.shape
sample_x = sample_location[..., 0].view(n_images, -1)
sample_y = sample_location[..., 1].view(n_images, -1)
valid = (sample_x >= 0) & (sample_y >=
0) & (sample_x < f_w) & (
sample_y < f_h)
valid = valid * mask_in_front.view(n_images, -1)
feat_sampled = torch.zeros(
(n_images, n_channels, sample_x.shape[-1]),
device=featmaps.device)
for i in range(n_images):
# gather features at the integer (y, x) locations of the valid samples
feat_sampled[i, :, valid[i]] = featmaps[i, :, sample_y[i, valid[i]], sample_x[i, valid[i]]]
feat_sampled = feat_sampled.view(n_images, n_channels, n_ray,
n_sample)
rgb_feat_sampled = feat_sampled.permute(2, 3, 0, 1)
else:
rgb_feat_sampled = None
inbound = self.inbound(pixel_locations, h, w)
mask = (inbound * mask_in_front).float().permute(
1, 2, 0)[..., None] # [n_rays, n_samples, n_views, 1]
return rgb_feat_sampled, mask
# Copyright (c) OpenMMLab. All rights reserved.
# Attention: This file is mainly adapted from the file with the same name
# in the original project. For more details, please refer to the
# original project.
from collections import OrderedDict
import numpy as np
import torch
import torch.nn.functional as F
rng = np.random.RandomState(234)
# helper functions for nerf ray rendering
def volume_sampling(sample_pts, features, aabb):
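# Trilinearly sample the voxel feature volume at the given 3D points: the
# points are normalized into the aabb to [-1, 1] for F.grid_sample, and a
# mask marks which sampled points actually fall inside the aabb.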
B, C, D, W, H = features.shape
assert B == 1
aabb = torch.Tensor(aabb).to(sample_pts.device)
N_rays, N_samples, coords = sample_pts.shape
sample_pts = sample_pts.view(1, N_rays * N_samples, 1, 1,
3).repeat(B, 1, 1, 1, 1)
aabbSize = aabb[1] - aabb[0]
invgridSize = 1.0 / aabbSize * 2
norm_pts = (sample_pts - aabb[0]) * invgridSize - 1
sample_features = F.grid_sample(
features, norm_pts, align_corners=True, padding_mode='border')
masks = ((norm_pts < 1) & (norm_pts > -1)).float().sum(dim=-1)
masks = (masks.view(N_rays, N_samples) == 3)
return sample_features.view(C, N_rays,
N_samples).permute(1, 2, 0).contiguous(), masks
def _compute_projection(img_meta):
views = len(img_meta['lidar2img']['extrinsic'])
intrinsic = torch.tensor(img_meta['lidar2img']['intrinsic'][:4, :4])
ratio = img_meta['ori_shape'][0] / img_meta['img_shape'][0]
intrinsic[:2] /= ratio
intrinsic = intrinsic.unsqueeze(0).view(1, 16).repeat(views, 1)
img_size = torch.Tensor(img_meta['img_shape'][:2]).to(intrinsic.device)
img_size = img_size.unsqueeze(0).repeat(views, 1)
extrinsics = []
for v in range(views):
extrinsics.append(
torch.Tensor(img_meta['lidar2img']['extrinsic'][v]).to(
intrinsic.device))
extrinsic = torch.stack(extrinsics).view(views, 16)
train_cameras = torch.cat([img_size, intrinsic, extrinsic], dim=-1)
return train_cameras.unsqueeze(0)
def compute_mask_points(feature, mask):
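# Masked statistics across views: `mean` is the visibility-weighted mean of
# the per-view features, and the variance of the features is mapped through
# exp(-var), so points whose views agree closely yield values near 1.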
weight = mask / (torch.sum(mask, dim=2, keepdim=True) + 1e-8)
mean = torch.sum(feature * weight, dim=2, keepdim=True)
var = torch.sum((feature - mean)**2, dim=2, keepdim=True)
var = var / (torch.sum(mask, dim=2, keepdim=True) + 1e-8)
var = torch.exp(-var)
return mean, var
def sample_pdf(bins, weights, N_samples, det=False):
"""Helper function used for sampling.
Args:
bins (tensor): Tensor of shape [N_rays, M+1], where M is the number of bins.
weights (tensor): Tensor of shape [N_rays, M].
N_samples (int): Number of samples along each ray.
det (bool): If True, perform deterministic (evenly spaced) sampling.
Returns:
samples (tensor): Tensor of shape [N_rays, N_samples].
"""
M = weights.shape[1]
weights += 1e-5
# Get pdf
pdf = weights / torch.sum(weights, dim=-1, keepdim=True)
cdf = torch.cumsum(pdf, dim=-1)
cdf = torch.cat([torch.zeros_like(cdf[:, 0:1]), cdf], dim=-1)
# Take uniform samples
if det:
u = torch.linspace(0., 1., N_samples, device=bins.device)
u = u.unsqueeze(0).repeat(bins.shape[0], 1)
else:
u = torch.rand(bins.shape[0], N_samples, device=bins.device)
# Invert CDF
above_inds = torch.zeros_like(u, dtype=torch.long)
for i in range(M):
above_inds += (u >= cdf[:, i:i + 1]).long()
# random sample inside each bin
below_inds = torch.clamp(above_inds - 1, min=0)
inds_g = torch.stack((below_inds, above_inds), dim=2)
cdf = cdf.unsqueeze(1).repeat(1, N_samples, 1)
cdf_g = torch.gather(input=cdf, dim=-1, index=inds_g)
bins = bins.unsqueeze(1).repeat(1, N_samples, 1)
bins_g = torch.gather(input=bins, dim=-1, index=inds_g)
denom = cdf_g[:, :, 1] - cdf_g[:, :, 0]
denom = torch.where(denom < 1e-5, torch.ones_like(denom), denom)
t = (u - cdf_g[:, :, 0]) / denom
samples = bins_g[:, :, 0] + t * (bins_g[:, :, 1] - bins_g[:, :, 0])
return samples
def sample_along_camera_ray(ray_o,
ray_d,
depth_range,
N_samples,
inv_uniform=False,
det=False):
"""Sampling along the camera ray.
Args:
ray_o (tensor): Origin of the ray in scene coordinate system;
tensor of shape [N_rays, 3]
ray_d (tensor): Homogeneous ray direction vectors in
scene coordinate system; tensor of shape [N_rays, 3]
depth_range (tuple): [near_depth, far_depth]
inv_uniform (bool): If True, sample uniformly in inverse depth.
det (bool): If True, will perform deterministic sampling.
Returns:
pts (tensor): Tensor of shape [N_rays, N_samples, 3]
z_vals (tensor): Tensor of shape [N_rays, N_samples]
"""
# will sample inside [near_depth, far_depth]
# assume the nearest possible depth is at least (min_ratio * depth)
near_depth_value = depth_range[0]
far_depth_value = depth_range[1]
assert near_depth_value > 0 and far_depth_value > 0 \
and far_depth_value > near_depth_value
near_depth = near_depth_value * torch.ones_like(ray_d[..., 0])
far_depth = far_depth_value * torch.ones_like(ray_d[..., 0])
if inv_uniform:
start = 1. / near_depth
step = (1. / far_depth - start) / (N_samples - 1)
inv_z_vals = torch.stack([start + i * step for i in range(N_samples)],
dim=1)
z_vals = 1. / inv_z_vals
else:
start = near_depth
step = (far_depth - near_depth) / (N_samples - 1)
z_vals = torch.stack([start + i * step for i in range(N_samples)],
dim=1)
if not det:
# get intervals between samples
mids = .5 * (z_vals[:, 1:] + z_vals[:, :-1])
upper = torch.cat([mids, z_vals[:, -1:]], dim=-1)
lower = torch.cat([z_vals[:, 0:1], mids], dim=-1)
# uniform samples in those intervals
t_rand = torch.rand_like(z_vals)
z_vals = lower + (upper - lower) * t_rand
ray_d = ray_d.unsqueeze(1).repeat(1, N_samples, 1)
ray_o = ray_o.unsqueeze(1).repeat(1, N_samples, 1)
pts = z_vals.unsqueeze(2) * ray_d + ray_o # [N_rays, N_samples, 3]
return pts, z_vals
# ray rendering of nerf
def raw2outputs(raw, z_vals, mask, white_bkgd=False):
"""Transform raw data to outputs:
Args:
raw(tensor):Raw network output.Tensor of shape [N_rays, N_samples, 4]
z_vals(tensor):Depth of point samples along rays.
Tensor of shape [N_rays, N_samples]
ray_d(tensor):[N_rays, 3]
Returns:
ret(dict):
-rgb(tensor):[N_rays, 3]
-depth(tensor):[N_rays,]
-weights(tensor):[N_rays,]
-depth_std(tensor):[N_rays,]
"""
rgb = raw[:, :, :3] # [N_rays, N_samples, 3]
sigma = raw[:, :, 3] # [N_rays, N_samples]
# note: we do not use the sample intervals here,
# because in practice different scenes from COLMAP can have
# very different scales, and using the intervals can hurt
# the model's generalization ability.
# Therefore the intervals are ignored in both training and evaluation.
sigma2alpha = lambda sigma, dists: 1. - torch.exp(-sigma) # noqa
# point samples are ordered with increasing depth
# interval between samples
dists = z_vals[:, 1:] - z_vals[:, :-1]
dists = torch.cat((dists, dists[:, -1:]), dim=-1)
alpha = sigma2alpha(sigma, dists)
T = torch.cumprod(1. - alpha + 1e-10, dim=-1)[:, :-1]
T = torch.cat((torch.ones_like(T[:, 0:1]), T), dim=-1)
# the weights, and the sum of the weights along a ray,
# always lie inside [0, 1]
weights = alpha * T
rgb_map = torch.sum(weights.unsqueeze(2) * rgb, dim=1)
if white_bkgd:
rgb_map = rgb_map + (1. - torch.sum(weights, dim=-1, keepdim=True))
if mask is not None:
mask = mask.float().sum(dim=1) > 8
depth_map = torch.sum(
weights * z_vals, dim=-1) / (
torch.sum(weights, dim=-1) + 1e-8)
depth_map = torch.clamp(depth_map, z_vals.min(), z_vals.max())
ret = OrderedDict([('rgb', rgb_map), ('depth', depth_map),
('weights', weights), ('mask', mask), ('alpha', alpha),
('z_vals', z_vals), ('transparency', T)])
return ret
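# For reference, raw2outputs implements the standard discrete volume-rendering
# rule (with the per-sample interval deliberately dropped, as noted above):
#     alpha_i = 1 - exp(-sigma_i)
#     T_i     = prod_{j < i} (1 - alpha_j)
#     w_i     = alpha_i * T_i
#     rgb     = sum_i w_i * c_i
#     depth   = sum_i w_i * z_i / (sum_i w_i + eps)
# e.g. two samples with alpha = (0.5, 0.5) give weights (0.5, 0.25) and an
# accumulated opacity of 0.75.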
def render_rays_func(
ray_o,
ray_d,
mean_volume,
cov_volume,
features_2D,
img,
aabb,
near_far_range,
N_samples,
N_rand=4096,
nerf_mlp=None,
img_meta=None,
projector=None,
mode='volume', # volume and image
nerf_sample_view=3,
inv_uniform=False,
N_importance=0,
det=False,
is_train=True,
white_bkgd=False,
gt_rgb=None,
gt_depth=None):
ret = {
'outputs_coarse': None,
'outputs_fine': None,
'gt_rgb': gt_rgb,
'gt_depth': gt_depth
}
# pts: [N_rays, N_samples, 3]
# z_vals: [N_rays, N_samples]
pts, z_vals = sample_along_camera_ray(
ray_o=ray_o,
ray_d=ray_d,
depth_range=near_far_range,
N_samples=N_samples,
inv_uniform=inv_uniform,
det=det)
N_rays, N_samples = pts.shape[:2]
if mode == 'image':
img = img.permute(0, 2, 3, 1).unsqueeze(0)
train_camera = _compute_projection(img_meta).to(img.device)
rgb_feat, mask = projector.compute(
pts, img, train_camera, features_2D, grid_sample=True)
pixel_mask = mask[..., 0].sum(dim=2) > 1
mean, var = compute_mask_points(rgb_feat, mask)
globalfeat = torch.cat([mean, var], dim=-1).squeeze(2)
rgb_pts, density_pts = nerf_mlp(pts, ray_d, globalfeat)
raw_coarse = torch.cat([rgb_pts, density_pts], dim=-1)
ret['sigma'] = density_pts
elif mode == 'volume':
mean_pts, inbound_masks = volume_sampling(pts, mean_volume, aabb)
cov_pts, inbound_masks = volume_sampling(pts, cov_volume, aabb)
# This mask indicates which points lie outside of the aabb
img = img.permute(0, 2, 3, 1).unsqueeze(0)
train_camera = _compute_projection(img_meta).to(img.device)
_, view_mask = projector.compute(pts, img, train_camera, None)
pixel_mask = view_mask[..., 0].sum(dim=2) > 1
# plot_3D_vis(pts, aabb, img, train_camera)
# pixel_mask: [N_rays, N_samples]; a point is kept only if it has
# at least 2 projected observations
globalpts = torch.cat([mean_pts, cov_pts], dim=-1)
rgb_pts, density_pts = nerf_mlp(pts, ray_d, globalpts)
density_pts = density_pts * inbound_masks.unsqueeze(dim=-1)
raw_coarse = torch.cat([rgb_pts, density_pts], dim=-1)
outputs_coarse = raw2outputs(
raw_coarse, z_vals, pixel_mask, white_bkgd=white_bkgd)
ret['outputs_coarse'] = outputs_coarse
return ret
def render_rays(
ray_batch,
mean_volume,
cov_volume,
features_2D,
img,
aabb,
near_far_range,
N_samples,
N_rand=4096,
nerf_mlp=None,
img_meta=None,
projector=None,
mode='volume', # volume and image
nerf_sample_view=3,
inv_uniform=False,
N_importance=0,
det=False,
is_train=True,
white_bkgd=False,
render_testing=False):
"""The function of the nerf rendering."""
ray_o = ray_batch['ray_o']
ray_d = ray_batch['ray_d']
gt_rgb = ray_batch['gt_rgb']
gt_depth = ray_batch['gt_depth']
nerf_sizes = ray_batch['nerf_sizes']
if is_train:
ray_o = ray_o.view(-1, 3)
ray_d = ray_d.view(-1, 3)
gt_rgb = gt_rgb.view(-1, 3)
if gt_depth.shape[1] != 0:
gt_depth = gt_depth.view(-1, 1)
non_zero_depth = (gt_depth > 0).squeeze(-1)
ray_o = ray_o[non_zero_depth]
ray_d = ray_d[non_zero_depth]
gt_rgb = gt_rgb[non_zero_depth]
gt_depth = gt_depth[non_zero_depth]
else:
gt_depth = None
total_rays = ray_d.shape[0]
select_inds = rng.choice(total_rays, size=(N_rand, ), replace=False)
ray_o = ray_o[select_inds]
ray_d = ray_d[select_inds]
gt_rgb = gt_rgb[select_inds]
if gt_depth is not None:
gt_depth = gt_depth[select_inds]
rets = render_rays_func(
ray_o,
ray_d,
mean_volume,
cov_volume,
features_2D,
img,
aabb,
near_far_range,
N_samples,
N_rand,
nerf_mlp,
img_meta,
projector,
mode, # volume and image
nerf_sample_view,
inv_uniform,
N_importance,
det,
is_train,
white_bkgd,
gt_rgb,
gt_depth)
elif render_testing:
nerf_size = nerf_sizes[0]
view_num = ray_o.shape[1]
H = nerf_size[0][0]
W = nerf_size[0][1]
ray_o = ray_o.view(-1, 3)
ray_d = ray_d.view(-1, 3)
gt_rgb = gt_rgb.view(-1, 3)
if len(gt_depth) != 0:
gt_depth = gt_depth.view(-1, 1)
else:
gt_depth = None
assert view_num * H * W == ray_o.shape[0]
num_rays = ray_o.shape[0]
results = []
rgbs = []
for i in range(0, num_rays, N_rand):
ray_o_chunck = ray_o[i:i + N_rand, :]
ray_d_chunck = ray_d[i:i + N_rand, :]
ret = render_rays_func(ray_o_chunck, ray_d_chunck, mean_volume,
cov_volume, features_2D, img, aabb,
near_far_range, N_samples, N_rand, nerf_mlp,
img_meta, projector, mode, nerf_sample_view,
inv_uniform, N_importance, True, is_train,
white_bkgd, gt_rgb, gt_depth)
results.append(ret)
rgbs = []
depths = []
if results[0]['outputs_coarse'] is not None:
for i in range(len(results)):
rgb = results[i]['outputs_coarse']['rgb']
rgbs.append(rgb)
depth = results[i]['outputs_coarse']['depth']
depths.append(depth)
rets = {
'outputs_coarse': {
'rgb': torch.cat(rgbs, dim=0).view(view_num, H, W, 3),
'depth': torch.cat(depths, dim=0).view(view_num, H, W, 1),
},
'gt_rgb':
gt_rgb.view(view_num, H, W, 3),
'gt_depth':
gt_depth.view(view_num, H, W, 1) if gt_depth is not None else None,
}
else:
rets = None
return rets
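# A minimal sketch of the ray_batch consumed by render_rays (the key names are
# taken from the code above; the exact tensor layouts are an assumption):
#   ray_batch = dict(
#       ray_o=...,       # ray origins, flattened to [num_rays, 3]
#       ray_d=...,       # ray directions, flattened to [num_rays, 3]
#       gt_rgb=...,      # RGB targets matching the rays
#       gt_depth=...,    # optional depth targets; may be empty
#       nerf_sizes=...)  # rendered image sizes, used to reshape test outputs
# During training, N_rand rays with valid depth are randomly selected; during
# testing, all view_num * H * W rays are rendered in chunks of N_rand.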
# Copyright (c) OpenMMLab. All rights reserved.
import os
import cv2
import numpy as np
import torch
from skimage.metrics import structural_similarity
def compute_psnr_from_mse(mse):
return -10.0 * torch.log(mse) / np.log(10.0)
def compute_psnr(pred, target, mask=None):
"""Compute psnr value (we assume the maximum pixel value is 1)."""
if mask is not None:
pred, target = pred[mask], target[mask]
mse = ((pred - target)**2).mean()
return compute_psnr_from_mse(mse).cpu().numpy()
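# Since pixel values are assumed to lie in [0, 1], PSNR = -10 * log10(MSE);
# for example, an MSE of 1e-2 corresponds to 20 dB and 1e-4 to 40 dB.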
def compute_ssim(pred, target, mask=None):
"""Computes Masked SSIM following the neuralbody paper."""
assert pred.shape == target.shape and pred.shape[-1] == 3
if mask is not None:
x, y, w, h = cv2.boundingRect(mask.cpu().numpy().astype(np.uint8))
pred = pred[y:y + h, x:x + w]
target = target[y:y + h, x:x + w]
try:
ssim = structural_similarity(
pred.cpu().numpy(), target.cpu().numpy(), channel_axis=-1)
except ValueError:
ssim = structural_similarity(
pred.cpu().numpy(), target.cpu().numpy(), multichannel=True)
return ssim
def save_rendered_img(img_meta, rendered_results):
filename = img_meta[0]['filename']
scenes = filename.split('/')[-2]
for ret in rendered_results:
depth = ret['outputs_coarse']['depth']
rgb = ret['outputs_coarse']['rgb']
gt = ret['gt_rgb']
gt_depth = ret['gt_depth']
# save images
psnr_total = 0
ssim_total = 0
rmse = 0
for v in range(gt.shape[0]):
rmse += ((depth[v] - gt_depth[v])**2).cpu().numpy()
depth_ = ((depth[v] - depth[v].min()) /
(depth[v].max() - depth[v].min() + 1e-8)).repeat(1, 1, 3)
img_to_save = torch.cat([rgb[v], gt[v], depth_], dim=1)
image_path = os.path.join('nerf_vs_rebuttal', scenes)
if not os.path.exists(image_path):
os.makedirs(image_path)
save_dir = os.path.join(image_path, 'view_' + str(v) + '.png')
font = cv2.FONT_HERSHEY_SIMPLEX
org = (50, 50)
fontScale = 1
color = (255, 0, 0)
thickness = 2
image = np.uint8(img_to_save.cpu().numpy() * 255.0)
psnr = compute_psnr(rgb[v], gt[v], mask=None)
psnr_total += psnr
ssim = compute_ssim(rgb[v], gt[v], mask=None)
ssim_total += ssim
image = cv2.putText(image, 'PSNR: ' + '%.2f' % psnr, org, font,
fontScale, color, thickness, cv2.LINE_AA)
cv2.imwrite(save_dir, image)
return psnr_total / gt.shape[0], ssim_total / gt.shape[0], rmse / gt.shape[0]
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from os import path as osp
from typing import Callable, List, Optional, Union
import numpy as np
from mmdet3d.datasets import Det3DDataset
from mmdet3d.registry import DATASETS
from mmdet3d.structures import DepthInstance3DBoxes
@DATASETS.register_module()
class MultiViewScanNetDataset(Det3DDataset):
r"""Multi-View ScanNet Dataset for NeRF-detection Task
This class serves as the API for experiments on the ScanNet Dataset.
Please refer to the `github repo <https://github.com/ScanNet/ScanNet>`_
for data downloading.
Args:
data_root (str): Path of dataset root.
ann_file (str): Path of annotation file.
metainfo (dict, optional): Meta information for dataset, such as class
information. Defaults to None.
pipeline (List[dict]): Pipeline used for data processing.
Defaults to [].
modality (dict): Modality to specify the sensor data used as input.
Defaults to dict(use_camera=True, use_lidar=False).
box_type_3d (str): Type of 3D box of this dataset.
Based on the `box_type_3d`, the dataset will encapsulate the box
in its original format and then convert it to `box_type_3d`.
Defaults to 'Depth' in this dataset. Available options include:
- 'LiDAR': Box in LiDAR coordinates.
- 'Depth': Box in depth coordinates, usually for indoor dataset.
- 'Camera': Box in camera coordinates.
filter_empty_gt (bool): Whether to filter the data with empty GT.
If it's set to be True, the example with empty annotations after
data pipeline will be dropped and a random example will be chosen
in `__getitem__`. Defaults to True.
test_mode (bool): Whether the dataset is in test mode.
Defaults to False.
"""
METAINFO = {
'classes':
('cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window',
'bookshelf', 'picture', 'counter', 'desk', 'curtain', 'refrigerator',
'showercurtrain', 'toilet', 'sink', 'bathtub', 'garbagebin')
}
def __init__(self,
data_root: str,
ann_file: str,
metainfo: Optional[dict] = None,
pipeline: List[Union[dict, Callable]] = [],
modality: dict = dict(use_camera=True, use_lidar=False),
box_type_3d: str = 'Depth',
filter_empty_gt: bool = True,
remove_dontcare: bool = False,
test_mode: bool = False,
**kwargs) -> None:
self.remove_dontcare = remove_dontcare
super().__init__(
data_root=data_root,
ann_file=ann_file,
metainfo=metainfo,
pipeline=pipeline,
modality=modality,
box_type_3d=box_type_3d,
filter_empty_gt=filter_empty_gt,
test_mode=test_mode,
**kwargs)
assert 'use_camera' in self.modality and \
'use_lidar' in self.modality
assert self.modality['use_camera'] or self.modality['use_lidar']
@staticmethod
def _get_axis_align_matrix(info: dict) -> np.ndarray:
"""Get axis_align_matrix from info. If not exist, return identity mat.
Args:
info (dict): Info of a single sample data.
Returns:
np.ndarray: 4x4 transformation matrix.
"""
if 'axis_align_matrix' in info:
return np.array(info['axis_align_matrix'])
else:
warnings.warn(
'axis_align_matrix is not found in ScanNet data info, please '
'use the new pre-processing scripts to re-generate ScanNet data')
return np.eye(4).astype(np.float32)
def parse_data_info(self, info: dict) -> dict:
"""Process the raw data info.
Convert all relative path of needed modality data file to
the absolute path.
Args:
info (dict): Raw info dict.
Returns:
dict: Has `ann_info` in training stage. And
all path has been converted to absolute path.
"""
if self.modality['use_depth']:
info['depth_info'] = []
if self.modality['use_neuralrecon_depth']:
info['depth_info'] = []
if self.modality['use_lidar']:
# implement lidar processing in the future
raise NotImplementedError(
'Please modify `MultiViewPipeline` to support lidar processing')
info['axis_align_matrix'] = self._get_axis_align_matrix(info)
info['img_info'] = []
info['lidar2img'] = []
info['c2w'] = []
info['camrotc2w'] = []
info['lightpos'] = []
# load img and depth_img
for i in range(len(info['img_paths'])):
img_filename = osp.join(self.data_root, info['img_paths'][i])
info['img_info'].append(dict(filename=img_filename))
if 'depth_info' in info.keys():
if self.modality['use_neuralrecon_depth']:
info['depth_info'].append(
dict(filename=img_filename[:-4] + '.npy'))
else:
info['depth_info'].append(
dict(filename=img_filename[:-4] + '.png'))
# lidar_info in info.keys() may be implemented in the future.
extrinsic = np.linalg.inv(
info['axis_align_matrix'] @ info['lidar2cam'][i])
info['lidar2img'].append(extrinsic.astype(np.float32))
if self.modality['use_ray']:
c2w = (
info['axis_align_matrix'] @ info['lidar2cam'][i]).astype(
np.float32) # noqa
info['c2w'].append(c2w)
info['camrotc2w'].append(c2w[0:3, 0:3])
info['lightpos'].append(c2w[0:3, 3])
origin = np.array([.0, .0, .5])
info['lidar2img'] = dict(
extrinsic=info['lidar2img'],
intrinsic=info['cam2img'].astype(np.float32),
origin=origin.astype(np.float32))
if self.modality['use_ray']:
info['ray_info'] = []
if not self.test_mode:
info['ann_info'] = self.parse_ann_info(info)
if self.test_mode and self.load_eval_anns:
info['ann_info'] = self.parse_ann_info(info)
info['eval_ann_info'] = self._remove_dontcare(info['ann_info'])
return info
def parse_ann_info(self, info: dict) -> dict:
"""Process the `instances` in data info to `ann_info`.
Args:
info (dict): Info dict.
Returns:
dict: Processed `ann_info`.
"""
ann_info = super().parse_ann_info(info)
if self.remove_dontcare:
ann_info = self._remove_dontcare(ann_info)
# empty gt
if ann_info is None:
ann_info = dict()
ann_info['gt_bboxes_3d'] = np.zeros((0, 6), dtype=np.float32)
ann_info['gt_labels_3d'] = np.zeros((0, ), dtype=np.int64)
ann_info['gt_bboxes_3d'] = DepthInstance3DBoxes(
ann_info['gt_bboxes_3d'],
box_dim=ann_info['gt_bboxes_3d'].shape[-1],
with_yaw=False,
origin=(0.5, 0.5, 0.5)).convert_to(self.box_mode_3d)
# count the number of instances per category
for label in ann_info['gt_labels_3d']:
if label != -1:
cat_name = self.metainfo['classes'][label]
self.num_ins_per_cat[cat_name] += 1
return ann_info
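# A minimal, hypothetical config sketch for this dataset (paths and the
# pipeline are placeholders, not the project's actual config). Note that
# parse_data_info above also reads `use_depth`, `use_neuralrecon_depth` and
# `use_ray` from the modality dict, so configs need to provide them:
#   train_dataset = dict(
#       type='MultiViewScanNetDataset',
#       data_root='data/scannet/',
#       ann_file='scannet_infos_train.pkl',
#       pipeline=train_pipeline,
#       modality=dict(
#           use_camera=True,
#           use_lidar=False,
#           use_depth=False,
#           use_neuralrecon_depth=False,
#           use_ray=True),
#       box_type_3d='Depth',
#       filter_empty_gt=True,
#       test_mode=False)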
# Copyright (c) OpenMMLab. All rights reserved.
"""Prepare the dataset for NeRF-Det.
Example:
python projects/NeRF-Det/prepare_infos.py
--root-path ./data/scannet
--out-dir ./data/scannet
"""
import argparse
import time
from os import path as osp
from pathlib import Path
import mmengine
from ...tools.dataset_converters import indoor_converter as indoor
from ...tools.dataset_converters.update_infos_to_v2 import (
clear_data_info_unused_keys, clear_instance_unused_keys,
get_empty_instance, get_empty_standard_data_info)
def update_scannet_infos_nerfdet(pkl_path, out_dir):
"""Update the origin pkl to the new format which will be used in nerf-det.
Args:
pkl_path (str): Path of the origin pkl.
out_dir (str): Output directory of the generated info file.
Returns:
The pkl will be overwritTen.
The new pkl is a dict containing two keys:
metainfo: Some base information of the pkl
data_list (list): A list containing all the information of the scenes.
"""
print('The new refactored process is running.')
print(f'{pkl_path} will be modified.')
if out_dir in pkl_path:
print(f'Warning: you may be overwriting '
f'the original data {pkl_path}.')
time.sleep(5)
METAINFO = {
'classes':
('cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window',
'bookshelf', 'picture', 'counter', 'desk', 'curtain', 'refrigerator',
'showercurtrain', 'toilet', 'sink', 'bathtub', 'garbagebin')
}
print(f'Reading from input file: {pkl_path}.')
data_list = mmengine.load(pkl_path)
print('Start updating:')
converted_list = []
# collect ignored class names across all scenes
ignore_class_name = set()
for ori_info_dict in mmengine.track_iter_progress(data_list):
temp_data_info = get_empty_standard_data_info()
# intrinsics, extrinsics and imgs
temp_data_info['cam2img'] = ori_info_dict['intrinsics']
temp_data_info['lidar2cam'] = ori_info_dict['extrinsics']
temp_data_info['img_paths'] = ori_info_dict['img_paths']
# annotation information
anns = ori_info_dict.get('annos', None)
if anns is not None:
temp_data_info['axis_align_matrix'] = anns[
'axis_align_matrix'].tolist()
if anns['gt_num'] == 0:
instance_list = []
else:
num_instances = len(anns['name'])
instance_list = []
for instance_id in range(num_instances):
empty_instance = get_empty_instance()
empty_instance['bbox_3d'] = anns['gt_boxes_upright_depth'][
instance_id].tolist()
if anns['name'][instance_id] in METAINFO['classes']:
empty_instance['bbox_label_3d'] = METAINFO[
'classes'].index(anns['name'][instance_id])
else:
ignore_class_name.add(anns['name'][instance_id])
empty_instance['bbox_label_3d'] = -1
empty_instance = clear_instance_unused_keys(empty_instance)
instance_list.append(empty_instance)
temp_data_info['instances'] = instance_list
temp_data_info, _ = clear_data_info_unused_keys(temp_data_info)
converted_list.append(temp_data_info)
pkl_name = Path(pkl_path).name
out_path = osp.join(out_dir, pkl_name)
print(f'Writing to output file: {out_path}.')
print(f'ignored classes: {ignore_class_name}')
# dataset metainfo
metainfo = dict()
metainfo['categories'] = {k: i for i, k in enumerate(METAINFO['classes'])}
if ignore_class_name:
for ignore_class in ignore_class_name:
metainfo['categories'][ignore_class] = -1
metainfo['dataset'] = 'scannet'
metainfo['info_version'] = '1.1'
converted_data_info = dict(metainfo=metainfo, data_list=converted_list)
mmengine.dump(converted_data_info, out_path, 'pkl')
def scannet_data_prep(root_path, info_prefix, out_dir, workers):
"""Prepare the info file for scannet dataset.
Args:
root_path (str): Path of dataset root.
info_prefix (str): The prefix of info filenames.
out_dir (str): Output directory of the generated info file.
workers (int): Number of threads to be used.
"""
indoor.create_indoor_info_file(
root_path, info_prefix, out_dir, workers=workers)
info_train_path = osp.join(out_dir, f'{info_prefix}_infos_train.pkl')
info_val_path = osp.join(out_dir, f'{info_prefix}_infos_val.pkl')
info_test_path = osp.join(out_dir, f'{info_prefix}_infos_test.pkl')
update_scannet_infos_nerfdet(out_dir=out_dir, pkl_path=info_train_path)
update_scannet_infos_nerfdet(out_dir=out_dir, pkl_path=info_val_path)
update_scannet_infos_nerfdet(out_dir=out_dir, pkl_path=info_test_path)
parser = argparse.ArgumentParser(description='Data converter arg parser')
parser.add_argument(
'--root-path',
type=str,
default='./data/scannet',
help='specify the root path of dataset')
parser.add_argument(
'--out-dir',
type=str,
default='./data/scannet',
required=False,
help='output directory of the generated info files')
parser.add_argument('--extra-tag', type=str, default='scannet')
parser.add_argument(
'--workers', type=int, default=4, help='number of threads to be used')
args = parser.parse_args()
if __name__ == '__main__':
from mmdet3d.utils import register_all_modules
register_all_modules()
scannet_data_prep(
root_path=args.root_path,
info_prefix=args.extra_tag,
out_dir=args.out_dir,
workers=args.workers)