Commit 7aa442d5, authored Apr 01, 2026 by raojy

raw_mmdetection

parent 9c03eaa8

Showing 20 of 465 changed files, with 3586 additions and 0 deletions (+3586, -0).
mmdetection3d/docs/zh_cn/user_guides/visualization.md (+202, -0)
mmdetection3d/mmdet3d/__init__.py (+38, -0)
mmdetection3d/mmdet3d/apis/__init__.py (+16, -0)
mmdetection3d/mmdet3d/apis/inference.py (+416, -0)
mmdetection3d/mmdet3d/apis/inferencers/__init__.py (+11, -0)
mmdetection3d/mmdet3d/apis/inferencers/base_3d_inferencer.py (+346, -0)
mmdetection3d/mmdet3d/apis/inferencers/lidar_det3d_inferencer.py (+242, -0)
mmdetection3d/mmdet3d/apis/inferencers/lidar_seg3d_inferencer.py (+209, -0)
mmdetection3d/mmdet3d/apis/inferencers/mono_det3d_inferencer.py (+251, -0)
mmdetection3d/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py (+315, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/kitti_3d_3class.py (+181, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/kitti_3d_car.py (+179, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/kitti_mono3d.py (+113, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/lyft_3d.py (+176, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/lyft_3d_range100.py (+166, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/nuim_instance.py (+74, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/nus_3d.py (+183, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/nus_mono3d.py (+132, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/s3dis_3d.py (+150, -0)
mmdetection3d/mmdet3d/configs/_base_/datasets/s3dis_seg.py (+186, -0)
mmdetection3d/docs/zh_cn/user_guides/visualization.md (new file, mode 100644)
# Visualization

MMDetection3D provides `Det3DLocalVisualizer` to visualize and store the state and results of the model during training and testing, with the following features:

1. Support the basic drawing interface for multi-modality data and multi-task.
2. Support multiple backends such as local and TensorBoard, to write training status such as `loss` and `lr`, or model evaluation metrics, to one or more specified backends (a minimal config sketch is shown after this list).
3. Support ground truth visualization for multi-modality data, and cross-modal visualization of 3D detection results.
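As a hedged illustration of feature 2, the sketch below enables a TensorBoard backend next to the default local one. It is not part of this commit; the backend type names follow standard MMEngine/MMDetection3D conventions and should be checked against your installed versions.

```python
# A minimal sketch: write visualizer outputs to both the local backend and
# TensorBoard. Backend names assume the standard MMEngine registrations.
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend'),
]
visualizer = dict(
    type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer')
```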
## Basic Drawing Interface

Inherited from `DetLocalVisualizer`, `Det3DLocalVisualizer` provides an interface for drawing common objects on 2D images, such as detection boxes, points, text, lines, circles, polygons, and binary masks. More details about 2D drawing can be found in the [visualization documentation](https://mmengine.readthedocs.io/zh_CN/latest/advanced_tutorials/visualization.html) of MMDetection. Here we introduce the 3D drawing interface.
### Drawing Point Clouds on Images

Using `draw_points_on_image`, we support drawing point clouds on images.
```python
import mmcv
import numpy as np
from mmengine import load

from mmdet3d.visualization import Det3DLocalVisualizer

info_file = load('demo/data/kitti/000008.pkl')
points = np.fromfile('demo/data/kitti/000008.bin', dtype=np.float32)
points = points.reshape(-1, 4)[:, :3]
lidar2img = np.array(
    info_file['data_list'][0]['images']['CAM2']['lidar2img'],
    dtype=np.float32)
visualizer = Det3DLocalVisualizer()
img = mmcv.imread('demo/data/kitti/000008.png')
img = mmcv.imconvert(img, 'bgr', 'rgb')
visualizer.set_image(img)
visualizer.draw_points_on_image(points, lidar2img)
visualizer.show()
```

### Drawing 3D Boxes on Point Clouds

Using `draw_bboxes_3d`, we support drawing 3D boxes on point clouds.
```python
import torch
import numpy as np

from mmdet3d.visualization import Det3DLocalVisualizer
from mmdet3d.structures import LiDARInstance3DBoxes

points = np.fromfile('demo/data/kitti/000008.bin', dtype=np.float32)
points = points.reshape(-1, 4)
visualizer = Det3DLocalVisualizer()
# set point cloud in visualizer
visualizer.set_points(points)
bboxes_3d = LiDARInstance3DBoxes(
    torch.tensor([[8.7314, -1.8559, -1.5997, 4.2000, 3.4800, 1.8900,
                   -1.5808]]))
# Draw 3D bboxes
visualizer.draw_bboxes_3d(bboxes_3d)
visualizer.show()
```

### Drawing Projected 3D Boxes on Images

Using `draw_proj_bboxes_3d`, we support drawing projected 3D boxes on images.
```python
import mmcv
import numpy as np
from mmengine import load

from mmdet3d.visualization import Det3DLocalVisualizer
from mmdet3d.structures import CameraInstance3DBoxes

info_file = load('demo/data/kitti/000008.pkl')
cam2img = np.array(
    info_file['data_list'][0]['images']['CAM2']['cam2img'],
    dtype=np.float32)
bboxes_3d = []
for instance in info_file['data_list'][0]['instances']:
    bboxes_3d.append(instance['bbox_3d'])
gt_bboxes_3d = np.array(bboxes_3d, dtype=np.float32)
gt_bboxes_3d = CameraInstance3DBoxes(gt_bboxes_3d)
input_meta = {'cam2img': cam2img}
visualizer = Det3DLocalVisualizer()
img = mmcv.imread('demo/data/kitti/000008.png')
img = mmcv.imconvert(img, 'bgr', 'rgb')
visualizer.set_image(img)
# project 3D bboxes to image
visualizer.draw_proj_bboxes_3d(gt_bboxes_3d, input_meta)
visualizer.show()
```
### Drawing BEV Boxes

Using `draw_bev_bboxes`, we support drawing boxes in the BEV (bird's-eye view).
```python
import numpy as np
from mmengine import load

from mmdet3d.visualization import Det3DLocalVisualizer
from mmdet3d.structures import CameraInstance3DBoxes

info_file = load('demo/data/kitti/000008.pkl')
bboxes_3d = []
for instance in info_file['data_list'][0]['instances']:
    bboxes_3d.append(instance['bbox_3d'])
gt_bboxes_3d = np.array(bboxes_3d, dtype=np.float32)
gt_bboxes_3d = CameraInstance3DBoxes(gt_bboxes_3d)
visualizer = Det3DLocalVisualizer()
# set bev image in visualizer
visualizer.set_bev_image()
# draw bev bboxes
visualizer.draw_bev_bboxes(gt_bboxes_3d, edge_colors='orange')
visualizer.show()
```
### Drawing 3D Semantic Masks

Using `draw_seg_mask`, we support drawing segmentation masks via per-point colorization.
```python
import numpy as np

from mmdet3d.visualization import Det3DLocalVisualizer

points = np.fromfile('demo/data/sunrgbd/000017.bin', dtype=np.float32)
points = points.reshape(-1, 3)
visualizer = Det3DLocalVisualizer()
mask = np.random.rand(points.shape[0], 3)
points_with_mask = np.concatenate((points, mask), axis=-1)
# Draw 3D points with mask
visualizer.set_points(points, pcd_mode=2, vis_mode='add')
visualizer.draw_seg_mask(points_with_mask)
visualizer.show()
```
## Results

To visualize the prediction results of a trained model, you can run the following command:
```bash
python tools/test.py ${CONFIG_FILE} ${CKPT_PATH} --show --show-dir ${SHOW_DIR}
```
After running this command, the plotted results, i.e. the input data and the visualization of the network outputs and ground truth labels on the input (e.g. `***_gt.png` and `***_pred.png` in multi-modality detection tasks and vision-based detection tasks), will be saved in `${SHOW_DIR}`. When `show` is enabled, [Open3D](http://www.open3d.org/) will be used to visualize the results online. Online visualization is not supported when testing on a remote server without a GUI; in that case, you can download the `results.pkl` from the remote server and visualize the prediction results offline on your local machine.
To visualize the results with the `Open3D` backend offline, you can run the following command:
```bash
python tools/misc/visualize_results.py ${CONFIG_FILE} --result ${RESULTS_PATH} --show-dir ${SHOW_DIR}
```

This allows inference and result generation to be done on the remote server, after which the user can open the results with a GUI on the local machine.
## Dataset

We also provide scripts to visualize the dataset without inference. You can use `tools/misc/browse_dataset.py` to show loaded ground truth labels online and save them on the disk. Currently we support single-modality 3D detection and 3D segmentation on all datasets, multi-modality 3D detection on KITTI and SUN RGB-D, as well as monocular 3D detection on nuScenes. To browse the KITTI dataset, you can run the following command:
```shell
python tools/misc/browse_dataset.py configs/_base_/datasets/kitti-3d-3class.py --task lidar_det --output-dir ${OUTPUT_DIR}
```
**Note**: Once `--output-dir` is specified, the images of the views specified by the user will be saved when pressing `_ESC_` in the open3d window. If you want to zoom in or out on the point cloud to observe more details, you can specify `--show-interval=0` in the command.
To verify the data consistency and the effect of data augmentation, you can also add the `--aug` flag to visualize the data after data augmentation, with the following command:
```shell
python tools/misc/browse_dataset.py configs/_base_/datasets/kitti-3d-3class.py --task lidar_det --aug --output-dir ${OUTPUT_DIR}
```
If you also want to display 2D images with projected 3D bounding boxes, you need a config file that supports multi-modality data loading, and change the `--task` argument to `multi-modality_det`. An example is as follows:
```shell
python tools/misc/browse_dataset.py configs/mvxnet/mvxnet_fpn_dv_second_secfpn_8xb2-80e_kitti-3d-3class.py --task multi-modality_det --output-dir ${OUTPUT_DIR}
```

You can browse different datasets with different configs, e.g. visualizing the ScanNet dataset in the 3D semantic segmentation task:
```shell
python tools/misc/browse_dataset.py configs/_base_/datasets/scannet-seg.py --task lidar_seg --output-dir ${OUTPUT_DIR}
```

And browsing the nuScenes dataset in the monocular 3D detection task:
```shell
python tools/misc/browse_dataset.py configs/_base_/datasets/nus-mono3d.py --task mono_det --output-dir ${OUTPUT_DIR}
```

mmdetection3d/mmdet3d/__init__.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
import mmcv
import mmdet
import mmengine
from mmengine.utils import digit_version

from .version import __version__, version_info

mmcv_minimum_version = '2.0.0rc4'
mmcv_maximum_version = '2.2.0'
mmcv_version = digit_version(mmcv.__version__)

mmengine_minimum_version = '0.8.0'
mmengine_maximum_version = '1.0.0'
mmengine_version = digit_version(mmengine.__version__)

mmdet_minimum_version = '3.0.0rc5'
mmdet_maximum_version = '3.4.0'
mmdet_version = digit_version(mmdet.__version__)

assert (mmcv_version >= digit_version(mmcv_minimum_version)
        and mmcv_version < digit_version(mmcv_maximum_version)), \
    f'MMCV=={mmcv.__version__} is used but incompatible. ' \
    f'Please install mmcv>={mmcv_minimum_version}, <{mmcv_maximum_version}.'

assert (mmengine_version >= digit_version(mmengine_minimum_version)
        and mmengine_version < digit_version(mmengine_maximum_version)), \
    f'MMEngine=={mmengine.__version__} is used but incompatible. ' \
    f'Please install mmengine>={mmengine_minimum_version}, ' \
    f'<{mmengine_maximum_version}.'

assert (mmdet_version >= digit_version(mmdet_minimum_version)
        and mmdet_version < digit_version(mmdet_maximum_version)), \
    f'MMDET=={mmdet.__version__} is used but incompatible. ' \
    f'Please install mmdet>={mmdet_minimum_version}, ' \
    f'<{mmdet_maximum_version}.'

__all__ = ['__version__', 'version_info', 'digit_version']
```
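The assertions above depend on `digit_version` turning version strings into comparable tuples, with pre-release suffixes ordered below the final release. A small sketch of that ordering (behavior as implemented in MMEngine):

```python
from mmengine.utils import digit_version

# 'rc' pre-releases compare below the corresponding final release, so the
# '2.0.0rc4' lower bound also admits 2.0.x and 2.1.x final releases while
# staying below the '2.2.0' upper bound.
assert digit_version('2.0.0rc4') < digit_version('2.0.0')
assert digit_version('2.0.0rc4') < digit_version('2.1.0') < digit_version('2.2.0')
```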
mmdetection3d/mmdet3d/apis/__init__.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
from .inference import (convert_SyncBN, inference_detector,
                        inference_mono_3d_detector,
                        inference_multi_modality_detector,
                        inference_segmentor, init_model)
from .inferencers import (Base3DInferencer, LidarDet3DInferencer,
                          LidarSeg3DInferencer, MonoDet3DInferencer,
                          MultiModalityDet3DInferencer)

__all__ = [
    'inference_detector', 'init_model', 'inference_mono_3d_detector',
    'convert_SyncBN', 'inference_multi_modality_detector',
    'inference_segmentor', 'Base3DInferencer', 'MonoDet3DInferencer',
    'LidarDet3DInferencer', 'LidarSeg3DInferencer',
    'MultiModalityDet3DInferencer'
]
```
mmdetection3d/mmdet3d/apis/inference.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from copy import deepcopy
from os import path as osp
from pathlib import Path
from typing import Optional, Sequence, Union

import mmengine
import numpy as np
import torch
import torch.nn as nn
from mmengine.config import Config
from mmengine.dataset import Compose, pseudo_collate
from mmengine.registry import init_default_scope
from mmengine.runner import load_checkpoint

from mmdet3d.registry import DATASETS, MODELS
from mmdet3d.structures import Box3DMode, Det3DDataSample, get_box_type
from mmdet3d.structures.det3d_data_sample import SampleList


def convert_SyncBN(config):
    """Convert config's naiveSyncBN to BN.

    Args:
        config (str or :obj:`mmengine.Config`): Config file path or the config
            object.
    """
    if isinstance(config, dict):
        for item in config:
            if item == 'norm_cfg':
                config[item]['type'] = config[item]['type'].\
                    replace('naiveSyncBN', 'BN')
            else:
                convert_SyncBN(config[item])


def init_model(config: Union[str, Path, Config],
               checkpoint: Optional[str] = None,
               device: str = 'cuda:0',
               palette: str = 'none',
               cfg_options: Optional[dict] = None):
    """Initialize a model from config file, which could be a 3D detector or a
    3D segmentor.

    Args:
        config (str, :obj:`Path`, or :obj:`mmengine.Config`): Config file
            path, :obj:`Path`, or the config object.
        checkpoint (str, optional): Checkpoint path. If left as None, the
            model will not load any weights.
        device (str): Device to use.
        palette (str): Color palette used for visualization.
            Defaults to 'none'.
        cfg_options (dict, optional): Options to override some settings in
            the used config.

    Returns:
        nn.Module: The constructed detector.
    """
    if isinstance(config, (str, Path)):
        config = Config.fromfile(config)
    elif not isinstance(config, Config):
        raise TypeError('config must be a filename or Config object, '
                        f'but got {type(config)}')
    if cfg_options is not None:
        config.merge_from_dict(cfg_options)

    convert_SyncBN(config.model)
    config.model.train_cfg = None
    init_default_scope(config.get('default_scope', 'mmdet3d'))
    model = MODELS.build(config.model)

    if checkpoint is not None:
        checkpoint = load_checkpoint(model, checkpoint, map_location='cpu')
        # save the dataset_meta in the model for convenience
        if 'dataset_meta' in checkpoint.get('meta', {}):
            # mmdet3d 1.x
            model.dataset_meta = checkpoint['meta']['dataset_meta']
        elif 'CLASSES' in checkpoint.get('meta', {}):
            # < mmdet3d 1.x
            classes = checkpoint['meta']['CLASSES']
            model.dataset_meta = {'classes': classes}

            if 'PALETTE' in checkpoint.get('meta', {}):  # 3D Segmentor
                model.dataset_meta['palette'] = checkpoint['meta']['PALETTE']
        else:
            # < mmdet3d 1.x
            model.dataset_meta = {'classes': config.class_names}

            if 'PALETTE' in checkpoint.get('meta', {}):  # 3D Segmentor
                model.dataset_meta['palette'] = checkpoint['meta']['PALETTE']

        test_dataset_cfg = deepcopy(config.test_dataloader.dataset)
        # lazy init. We only need the metainfo.
        test_dataset_cfg['lazy_init'] = True
        metainfo = DATASETS.build(test_dataset_cfg).metainfo
        cfg_palette = metainfo.get('palette', None)
        if cfg_palette is not None:
            model.dataset_meta['palette'] = cfg_palette
        else:
            if 'palette' not in model.dataset_meta:
                warnings.warn(
                    'palette does not exist, random is used by default. '
                    'You can also set the palette to customize.')
                model.dataset_meta['palette'] = 'random'

    model.cfg = config  # save the config in the model for convenience
    if device != 'cpu':
        torch.cuda.set_device(device)
    else:
        warnings.warn('Don\'t suggest using CPU device. '
                      'Some functions are not supported for now.')

    model.to(device)
    model.eval()
    return model


PointsType = Union[str, np.ndarray, Sequence[str], Sequence[np.ndarray]]
ImagesType = Union[str, np.ndarray, Sequence[str], Sequence[np.ndarray]]


def inference_detector(
        model: nn.Module,
        pcds: PointsType) -> Union[Det3DDataSample, SampleList]:
    """Inference point cloud with the detector.

    Args:
        model (nn.Module): The loaded detector.
        pcds (str, ndarray, Sequence[str/ndarray]):
            Either point cloud files or loaded point cloud.

    Returns:
        :obj:`Det3DDataSample` or list[:obj:`Det3DDataSample`]:
        If pcds is a list or tuple, the same length list type results
        will be returned, otherwise return the detection results directly.
    """
    if isinstance(pcds, (list, tuple)):
        is_batch = True
    else:
        pcds = [pcds]
        is_batch = False

    cfg = model.cfg

    if not isinstance(pcds[0], str):
        cfg = cfg.copy()
        # set loading pipeline type
        cfg.test_dataloader.dataset.pipeline[0].type = 'LoadPointsFromDict'

    # build the data pipeline
    test_pipeline = deepcopy(cfg.test_dataloader.dataset.pipeline)
    test_pipeline = Compose(test_pipeline)
    box_type_3d, box_mode_3d = \
        get_box_type(cfg.test_dataloader.dataset.box_type_3d)

    data = []
    for pcd in pcds:
        # prepare data
        if isinstance(pcd, str):
            # load from point cloud file
            data_ = dict(
                lidar_points=dict(lidar_path=pcd),
                timestamp=1,
                # for ScanNet demo we need axis_align_matrix
                axis_align_matrix=np.eye(4),
                box_type_3d=box_type_3d,
                box_mode_3d=box_mode_3d)
        else:
            # directly use loaded point cloud
            data_ = dict(
                points=pcd,
                timestamp=1,
                # for ScanNet demo we need axis_align_matrix
                axis_align_matrix=np.eye(4),
                box_type_3d=box_type_3d,
                box_mode_3d=box_mode_3d)
        data_ = test_pipeline(data_)
        data.append(data_)

    collate_data = pseudo_collate(data)

    # forward the model
    with torch.no_grad():
        results = model.test_step(collate_data)

    if not is_batch:
        return results[0], data[0]
    else:
        return results, data


def inference_multi_modality_detector(model: nn.Module,
                                      pcds: Union[str, Sequence[str]],
                                      imgs: Union[str, Sequence[str]],
                                      ann_file: Union[str, Sequence[str]],
                                      cam_type: str = 'CAM2'):
    """Inference point cloud with the multi-modality detector. Now we only
    support multi-modality detector for KITTI and SUNRGBD datasets since the
    multi-view image loading is not supported yet in this inference function.

    Args:
        model (nn.Module): The loaded detector.
        pcds (str, Sequence[str]):
            Either point cloud files or loaded point cloud.
        imgs (str, Sequence[str]):
            Either image files or loaded images.
        ann_file (str, Sequence[str]): Annotation files.
        cam_type (str): The camera view used for inference. When the detector
            only uses a single-view image, we need to specify a camera view.
            For the kitti dataset, it should be 'CAM2'. For sunrgbd, it
            should be 'CAM0'. When the detector uses multi-view images, set
            it to 'all'.

    Returns:
        :obj:`Det3DDataSample` or list[:obj:`Det3DDataSample`]:
        If pcds is a list or tuple, the same length list type results
        will be returned, otherwise return the detection results directly.
    """
    if isinstance(pcds, (list, tuple)):
        is_batch = True
        assert isinstance(imgs, (list, tuple))
        assert len(pcds) == len(imgs)
    else:
        pcds = [pcds]
        imgs = [imgs]
        is_batch = False

    cfg = model.cfg

    # build the data pipeline
    test_pipeline = deepcopy(cfg.test_dataloader.dataset.pipeline)
    test_pipeline = Compose(test_pipeline)
    box_type_3d, box_mode_3d = \
        get_box_type(cfg.test_dataloader.dataset.box_type_3d)

    data_list = mmengine.load(ann_file)['data_list']

    data = []
    for index, pcd in enumerate(pcds):
        # get data info containing calib
        data_info = data_list[index]
        img = imgs[index]

        if cam_type != 'all':
            assert osp.isfile(img), f'{img} must be a file.'
            img_path = data_info['images'][cam_type]['img_path']
            if osp.basename(img_path) != osp.basename(img):
                raise ValueError(
                    f'the info file of {img_path} is not provided.')
            data_ = dict(
                lidar_points=dict(lidar_path=pcd),
                img_path=img,
                box_type_3d=box_type_3d,
                box_mode_3d=box_mode_3d)
            data_info['images'][cam_type]['img_path'] = img
            if 'cam2img' in data_info['images'][cam_type]:
                # The data annotation in the SUNRGBD dataset does not contain
                # `cam2img`
                data_['cam2img'] = np.array(
                    data_info['images'][cam_type]['cam2img'])

            # LiDAR to image conversion for KITTI dataset
            if box_mode_3d == Box3DMode.LIDAR:
                if 'lidar2img' in data_info['images'][cam_type]:
                    data_['lidar2img'] = np.array(
                        data_info['images'][cam_type]['lidar2img'])
            # Depth to image conversion for SUNRGBD dataset
            elif box_mode_3d == Box3DMode.DEPTH:
                data_['depth2img'] = np.array(
                    data_info['images'][cam_type]['depth2img'])
        else:
            assert osp.isdir(img), f'{img} must be a file directory'
            for _, img_info in data_info['images'].items():
                img_info['img_path'] = osp.join(img, img_info['img_path'])
                assert osp.isfile(img_info['img_path']
                                  ), f'{img_info["img_path"]} does not exist.'
            data_ = dict(
                lidar_points=dict(lidar_path=pcd),
                images=data_info['images'],
                box_type_3d=box_type_3d,
                box_mode_3d=box_mode_3d)

        if 'timestamp' in data_info:
            # Using multi-sweeps need `timestamp`
            data_['timestamp'] = data_info['timestamp']

        data_ = test_pipeline(data_)
        data.append(data_)

    collate_data = pseudo_collate(data)

    # forward the model
    with torch.no_grad():
        results = model.test_step(collate_data)

    if not is_batch:
        return results[0], data[0]
    else:
        return results, data


def inference_mono_3d_detector(model: nn.Module,
                               imgs: ImagesType,
                               ann_file: Union[str, Sequence[str]],
                               cam_type: str = 'CAM_FRONT'):
    """Inference image with the monocular 3D detector.

    Args:
        model (nn.Module): The loaded detector.
        imgs (str, Sequence[str]):
            Either image files or loaded images.
        ann_file (str, Sequence[str]): Annotation files.
        cam_type (str): The camera view used for inference. For the kitti
            dataset, it should be 'CAM_2'; for the nuscenes dataset, it
            should be 'CAM_FRONT'. Defaults to 'CAM_FRONT'.

    Returns:
        :obj:`Det3DDataSample` or list[:obj:`Det3DDataSample`]:
        If imgs is a list or tuple, the same length list type results
        will be returned, otherwise return the detection results directly.
    """
    if isinstance(imgs, (list, tuple)):
        is_batch = True
    else:
        imgs = [imgs]
        is_batch = False

    cfg = model.cfg

    # build the data pipeline
    test_pipeline = deepcopy(cfg.test_dataloader.dataset.pipeline)
    test_pipeline = Compose(test_pipeline)
    box_type_3d, box_mode_3d = \
        get_box_type(cfg.test_dataloader.dataset.box_type_3d)

    data_list = mmengine.load(ann_file)['data_list']
    assert len(imgs) == len(data_list)

    data = []
    for index, img in enumerate(imgs):
        # get data info containing calib
        data_info = data_list[index]
        img_path = data_info['images'][cam_type]['img_path']
        if osp.basename(img_path) != osp.basename(img):
            raise ValueError(f'the info file of {img_path} is not provided.')

        # replace the img_path in data_info with img
        data_info['images'][cam_type]['img_path'] = img
        # avoid data_info['images'] having multiple keys about camera views
        mono_img_info = {f'{cam_type}': data_info['images'][cam_type]}
        data_ = dict(
            images=mono_img_info,
            box_type_3d=box_type_3d,
            box_mode_3d=box_mode_3d)

        data_ = test_pipeline(data_)
        data.append(data_)

    collate_data = pseudo_collate(data)

    # forward the model
    with torch.no_grad():
        results = model.test_step(collate_data)

    if not is_batch:
        return results[0]
    else:
        return results


def inference_segmentor(model: nn.Module, pcds: PointsType):
    """Inference point cloud with the segmentor.

    Args:
        model (nn.Module): The loaded segmentor.
        pcds (str, Sequence[str]):
            Either point cloud files or loaded point cloud.

    Returns:
        :obj:`Det3DDataSample` or list[:obj:`Det3DDataSample`]:
        If pcds is a list or tuple, the same length list type results
        will be returned, otherwise return the segmentation results directly.
    """
    if isinstance(pcds, (list, tuple)):
        is_batch = True
    else:
        pcds = [pcds]
        is_batch = False

    cfg = model.cfg

    # build the data pipeline
    test_pipeline = deepcopy(cfg.test_dataloader.dataset.pipeline)

    new_test_pipeline = []
    for pipeline in test_pipeline:
        if pipeline['type'] != 'LoadAnnotations3D' and \
                pipeline['type'] != 'PointSegClassMapping':
            new_test_pipeline.append(pipeline)
    test_pipeline = Compose(new_test_pipeline)

    data = []
    # TODO: support load points array
    for pcd in pcds:
        data_ = dict(lidar_points=dict(lidar_path=pcd))
        data_ = test_pipeline(data_)
        data.append(data_)

    collate_data = pseudo_collate(data)

    # forward the model
    with torch.no_grad():
        results = model.test_step(collate_data)

    if not is_batch:
        return results[0], data[0]
    else:
        return results, data
```
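A minimal usage sketch for the APIs above; the config and checkpoint paths are illustrative and must be replaced with a real, compatible pair:

```python
from mmdet3d.apis import init_model, inference_detector

# Illustrative paths; substitute any compatible config/checkpoint pair.
config_file = 'configs/pointpillars/pointpillars_hv_secfpn_8xb6-160e_kitti-3d-3class.py'
checkpoint_file = 'checkpoints/pointpillars_kitti_3class.pth'

model = init_model(config_file, checkpoint_file, device='cuda:0')
# For a single input, inference_detector returns (Det3DDataSample, dict),
# as defined above.
result, data = inference_detector(model, 'demo/data/kitti/000008.bin')
print(result.pred_instances_3d.scores_3d)
```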
mmdetection3d/mmdet3d/apis/inferencers/__init__.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
from .base_3d_inferencer import Base3DInferencer
from .lidar_det3d_inferencer import LidarDet3DInferencer
from .lidar_seg3d_inferencer import LidarSeg3DInferencer
from .mono_det3d_inferencer import MonoDet3DInferencer
from .multi_modality_det3d_inferencer import MultiModalityDet3DInferencer

__all__ = [
    'Base3DInferencer', 'MonoDet3DInferencer', 'LidarDet3DInferencer',
    'LidarSeg3DInferencer', 'MultiModalityDet3DInferencer'
]
```
mmdetection3d/mmdet3d/apis/inferencers/base_3d_inferencer.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
import logging
import os.path as osp
from copy import deepcopy
from typing import Dict, List, Optional, Sequence, Tuple, Union

import numpy as np
import torch.nn as nn
from mmengine import dump, print_log
from mmengine.infer.infer import BaseInferencer, ModelType
from mmengine.model.utils import revert_sync_batchnorm
from mmengine.registry import init_default_scope
from mmengine.runner import load_checkpoint
from mmengine.structures import InstanceData
from mmengine.visualization import Visualizer
from rich.progress import track

from mmdet3d.registry import DATASETS, MODELS
from mmdet3d.structures import Box3DMode, Det3DDataSample
from mmdet3d.utils import ConfigType

InstanceList = List[InstanceData]
InputType = Union[str, np.ndarray]
InputsType = Union[InputType, Sequence[InputType]]
PredType = Union[InstanceData, InstanceList]
ImgType = Union[np.ndarray, Sequence[np.ndarray]]
ResType = Union[Dict, List[Dict], InstanceData, List[InstanceData]]


class Base3DInferencer(BaseInferencer):
    """Base 3D model inferencer.

    Args:
        model (str, optional): Path to the config file or the model name
            defined in metafile. For example, it could be
            "pgd-kitti" or
            "configs/pgd/pgd_r101-caffe_fpn_head-gn_4xb3-4x_kitti-mono3d.py".
            If model is not specified, user must provide the
            `weights` saved by MMEngine which contains the config string.
            Defaults to None.
        weights (str, optional): Path to the checkpoint. If it is not
            specified and model is a model name of metafile, the weights
            will be loaded from metafile. Defaults to None.
        device (str, optional): Device to run inference. If None, the
            available device will be automatically used. Defaults to None.
        scope (str): The scope of the model. Defaults to 'mmdet3d'.
        palette (str): Color palette used for visualization. The order of
            priority is palette -> config -> checkpoint. Defaults to 'none'.
    """

    preprocess_kwargs: set = {'cam_type'}
    forward_kwargs: set = set()
    visualize_kwargs: set = {
        'return_vis', 'show', 'wait_time', 'draw_pred', 'pred_score_thr',
        'img_out_dir', 'no_save_vis', 'cam_type_dir'
    }
    postprocess_kwargs: set = {
        'print_result', 'pred_out_dir', 'return_datasample', 'no_save_pred'
    }

    def __init__(self,
                 model: Union[ModelType, str, None] = None,
                 weights: Optional[str] = None,
                 device: Optional[str] = None,
                 scope: str = 'mmdet3d',
                 palette: str = 'none') -> None:
        # A global counter tracking the number of frames processed, for
        # naming of the output results
        self.num_predicted_frames = 0
        self.palette = palette
        init_default_scope(scope)
        super().__init__(
            model=model, weights=weights, device=device, scope=scope)
        self.model = revert_sync_batchnorm(self.model)

    def _convert_syncbn(self, cfg: ConfigType):
        """Convert config's naiveSyncBN to BN.

        Args:
            cfg (str or :obj:`mmengine.Config`): Config file path
                or the config object.
        """
        if isinstance(cfg, dict):
            for item in cfg:
                if item == 'norm_cfg':
                    cfg[item]['type'] = cfg[item]['type'].\
                        replace('naiveSyncBN', 'BN')
                else:
                    self._convert_syncbn(cfg[item])

    def _init_model(
        self,
        cfg: ConfigType,
        weights: str,
        device: str = 'cpu',
    ) -> nn.Module:
        self._convert_syncbn(cfg.model)
        cfg.model.train_cfg = None
        model = MODELS.build(cfg.model)

        checkpoint = load_checkpoint(model, weights, map_location='cpu')
        if 'dataset_meta' in checkpoint.get('meta', {}):
            # mmdet3d 1.x
            model.dataset_meta = checkpoint['meta']['dataset_meta']
        elif 'CLASSES' in checkpoint.get('meta', {}):
            # < mmdet3d 1.x
            classes = checkpoint['meta']['CLASSES']
            model.dataset_meta = {'classes': classes}

            if 'PALETTE' in checkpoint.get('meta', {}):  # 3D Segmentor
                model.dataset_meta['palette'] = checkpoint['meta']['PALETTE']
        else:
            # < mmdet3d 1.x
            model.dataset_meta = {'classes': cfg.class_names}

            if 'PALETTE' in checkpoint.get('meta', {}):  # 3D Segmentor
                model.dataset_meta['palette'] = checkpoint['meta']['PALETTE']

        test_dataset_cfg = deepcopy(cfg.test_dataloader.dataset)
        # lazy init. We only need the metainfo.
        test_dataset_cfg['lazy_init'] = True
        metainfo = DATASETS.build(test_dataset_cfg).metainfo
        cfg_palette = metainfo.get('palette', None)
        if cfg_palette is not None:
            model.dataset_meta['palette'] = cfg_palette

        model.cfg = cfg  # save the config in the model for convenience
        model.to(device)
        model.eval()
        return model

    def _get_transform_idx(self, pipeline_cfg: ConfigType, name: str) -> int:
        """Return the index of the transform in a pipeline.

        If the transform is not found, returns -1.
        """
        for i, transform in enumerate(pipeline_cfg):
            if transform['type'] == name:
                return i
        return -1

    def _init_visualizer(self, cfg: ConfigType) -> Optional[Visualizer]:
        visualizer = super()._init_visualizer(cfg)
        visualizer.dataset_meta = self.model.dataset_meta
        return visualizer

    def _dispatch_kwargs(self,
                         out_dir: str = '',
                         cam_type: str = '',
                         **kwargs) -> Tuple[Dict, Dict, Dict, Dict]:
        """Dispatch kwargs to preprocess(), forward(), visualize() and
        postprocess() according to the actual demands.

        Args:
            out_dir (str): Dir to save the inference results.
            cam_type (str): Camera type. Defaults to ''.
            **kwargs (dict): Key words arguments passed to :meth:`preprocess`,
                :meth:`forward`, :meth:`visualize` and :meth:`postprocess`.
                Each key in kwargs should be in the corresponding set of
                ``preprocess_kwargs``, ``forward_kwargs``,
                ``visualize_kwargs`` and ``postprocess_kwargs``.

        Returns:
            Tuple[Dict, Dict, Dict, Dict]: kwargs passed to preprocess,
            forward, visualize and postprocess respectively.
        """
        kwargs['img_out_dir'] = out_dir
        kwargs['pred_out_dir'] = out_dir
        if cam_type != '':
            kwargs['cam_type_dir'] = cam_type
        return super()._dispatch_kwargs(**kwargs)

    def __call__(self,
                 inputs: InputsType,
                 batch_size: int = 1,
                 return_datasamples: bool = False,
                 **kwargs) -> Optional[dict]:
        """Call the inferencer.

        Args:
            inputs (InputsType): Inputs for the inferencer.
            batch_size (int): Batch size. Defaults to 1.
            return_datasamples (bool): Whether to return results as
                :obj:`BaseDataElement`. Defaults to False.
            **kwargs: Key words arguments passed to :meth:`preprocess`,
                :meth:`forward`, :meth:`visualize` and :meth:`postprocess`.
                Each key in kwargs should be in the corresponding set of
                ``preprocess_kwargs``, ``forward_kwargs``,
                ``visualize_kwargs`` and ``postprocess_kwargs``.

        Returns:
            dict: Inference and visualization results.
        """
        (
            preprocess_kwargs,
            forward_kwargs,
            visualize_kwargs,
            postprocess_kwargs,
        ) = self._dispatch_kwargs(**kwargs)

        cam_type = preprocess_kwargs.pop('cam_type', 'CAM2')
        ori_inputs = self._inputs_to_list(inputs, cam_type=cam_type)
        inputs = self.preprocess(
            ori_inputs, batch_size=batch_size, **preprocess_kwargs)
        preds = []

        results_dict = {'predictions': [], 'visualization': []}
        for data in (track(inputs, description='Inference')
                     if self.show_progress else inputs):
            preds.extend(self.forward(data, **forward_kwargs))
            visualization = self.visualize(ori_inputs, preds,
                                           **visualize_kwargs)
            results = self.postprocess(preds, visualization,
                                       return_datasamples,
                                       **postprocess_kwargs)
            results_dict['predictions'].extend(results['predictions'])
            if results['visualization'] is not None:
                results_dict['visualization'].extend(
                    results['visualization'])
        return results_dict

    def postprocess(
        self,
        preds: PredType,
        visualization: Optional[List[np.ndarray]] = None,
        return_datasample: bool = False,
        print_result: bool = False,
        no_save_pred: bool = False,
        pred_out_dir: str = '',
    ) -> Union[ResType, Tuple[ResType, np.ndarray]]:
        """Process the predictions and visualization results from ``forward``
        and ``visualize``.

        This method should be responsible for the following tasks:

        1. Convert datasamples into a json-serializable dict if needed.
        2. Pack the predictions and visualization results and return them.
        3. Dump or log the predictions.

        Args:
            preds (List[Dict]): Predictions of the model.
            visualization (np.ndarray, optional): Visualized predictions.
                Defaults to None.
            return_datasample (bool): Whether to use Datasample to store
                inference results. If False, dict will be used.
                Defaults to False.
            print_result (bool): Whether to print the inference result w/o
                visualization to the console. Defaults to False.
            no_save_pred (bool): Whether to force not to save prediction
                results. Defaults to False.
            pred_out_dir (str): Directory to save the inference results w/o
                visualization. If left as empty, no file will be saved.
                Defaults to ''.

        Returns:
            dict: Inference and visualization results with key
            ``predictions`` and ``visualization``.

            - ``visualization`` (Any): Returned by :meth:`visualize`.
            - ``predictions`` (dict or DataSample): Returned by
              :meth:`forward` and processed in :meth:`postprocess`.
              If ``return_datasample=False``, it usually should be a
              json-serializable dict containing only basic data elements
              such as strings and numbers.
        """
        if no_save_pred is True:
            pred_out_dir = ''

        result_dict = {}
        results = preds
        if not return_datasample:
            results = []
            for pred in preds:
                result = self.pred2dict(pred, pred_out_dir)
                results.append(result)
        elif pred_out_dir != '':
            print_log(
                'Currently does not support saving datasample '
                'when return_datasample is set to True. '
                'Prediction results are not saved!',
                level=logging.WARNING)
        # Add img to the results after printing and dumping
        result_dict['predictions'] = results
        if print_result:
            print(result_dict)
        result_dict['visualization'] = visualization
        return result_dict

    # TODO: The data format and fields saved in json need further discussion.
    # Maybe should include model name, timestamp, filename, image info etc.
    def pred2dict(self,
                  data_sample: Det3DDataSample,
                  pred_out_dir: str = '') -> Dict:
        """Extract elements necessary to represent a prediction into a
        dictionary.

        It's better to contain only basic data elements such as strings and
        numbers in order to guarantee it's json-serializable.

        Args:
            data_sample (:obj:`DetDataSample`): Predictions of the model.
            pred_out_dir (str): Dir to save the inference results w/o
                visualization. If left as empty, no file will be saved.
                Defaults to ''.

        Returns:
            dict: Prediction results.
        """
        result = {}
        if 'pred_instances_3d' in data_sample:
            pred_instances_3d = data_sample.pred_instances_3d.numpy()
            result = {
                'labels_3d': pred_instances_3d.labels_3d.tolist(),
                'scores_3d': pred_instances_3d.scores_3d.tolist(),
                'bboxes_3d': pred_instances_3d.bboxes_3d.tensor.cpu().tolist()
            }

        if 'pred_pts_seg' in data_sample:
            pred_pts_seg = data_sample.pred_pts_seg.numpy()
            result['pts_semantic_mask'] = \
                pred_pts_seg.pts_semantic_mask.tolist()

        if data_sample.box_mode_3d == Box3DMode.LIDAR:
            result['box_type_3d'] = 'LiDAR'
        elif data_sample.box_mode_3d == Box3DMode.CAM:
            result['box_type_3d'] = 'Camera'
        elif data_sample.box_mode_3d == Box3DMode.DEPTH:
            result['box_type_3d'] = 'Depth'

        if pred_out_dir != '':
            if 'lidar_path' in data_sample:
                lidar_path = osp.basename(data_sample.lidar_path)
                lidar_path = osp.splitext(lidar_path)[0]
                out_json_path = osp.join(pred_out_dir, 'preds',
                                         lidar_path + '.json')
            elif 'img_path' in data_sample:
                img_path = osp.basename(data_sample.img_path)
                img_path = osp.splitext(img_path)[0]
                out_json_path = osp.join(pred_out_dir, 'preds',
                                         img_path + '.json')
            else:
                out_json_path = osp.join(
                    pred_out_dir, 'preds',
                    f'{str(self.num_visualized_imgs).zfill(8)}.json')
            dump(result, out_json_path)

        return result
```
mmdetection3d/mmdet3d/apis/inferencers/lidar_det3d_inferencer.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
from typing import Dict, List, Optional, Sequence, Union

import mmengine
import numpy as np
import torch
from mmengine.dataset import Compose
from mmengine.fileio import (get_file_backend, isdir, join_path,
                             list_dir_or_file)
from mmengine.infer.infer import ModelType
from mmengine.structures import InstanceData

from mmdet3d.registry import INFERENCERS
from mmdet3d.structures import (CameraInstance3DBoxes, DepthInstance3DBoxes,
                                Det3DDataSample, LiDARInstance3DBoxes)
from mmdet3d.utils import ConfigType
from .base_3d_inferencer import Base3DInferencer

InstanceList = List[InstanceData]
InputType = Union[str, np.ndarray]
InputsType = Union[InputType, Sequence[InputType]]
PredType = Union[InstanceData, InstanceList]
ImgType = Union[np.ndarray, Sequence[np.ndarray]]
ResType = Union[Dict, List[Dict], InstanceData, List[InstanceData]]


@INFERENCERS.register_module(name='det3d-lidar')
@INFERENCERS.register_module()
class LidarDet3DInferencer(Base3DInferencer):
    """The inferencer of LiDAR-based detection.

    Args:
        model (str, optional): Path to the config file or the model name
            defined in metafile. For example, it could be
            "pointpillars_kitti-3class" or
            "configs/pointpillars/pointpillars_hv_secfpn_8xb6-160e_kitti-3d-3class.py". # noqa: E501
            If model is not specified, user must provide the
            `weights` saved by MMEngine which contains the config string.
            Defaults to None.
        weights (str, optional): Path to the checkpoint. If it is not
            specified and model is a model name of metafile, the weights
            will be loaded from metafile. Defaults to None.
        device (str, optional): Device to run inference. If None, the
            available device will be automatically used. Defaults to None.
        scope (str): The scope of the model. Defaults to 'mmdet3d'.
        palette (str): Color palette used for visualization. The order of
            priority is palette -> config -> checkpoint. Defaults to 'none'.
    """

    def __init__(self,
                 model: Union[ModelType, str, None] = None,
                 weights: Optional[str] = None,
                 device: Optional[str] = None,
                 scope: str = 'mmdet3d',
                 palette: str = 'none') -> None:
        # A global counter tracking the number of frames processed, for
        # naming of the output results
        self.num_visualized_frames = 0
        super(LidarDet3DInferencer, self).__init__(
            model=model,
            weights=weights,
            device=device,
            scope=scope,
            palette=palette)

    def _inputs_to_list(self, inputs: Union[dict, list], **kwargs) -> list:
        """Preprocess the inputs to a list.

        Preprocess inputs to a list according to its type:

        - list or tuple: return inputs
        - dict: the value with key 'points' is
            - Directory path: return all files in the directory
            - other cases: return a list containing the string. The string
              could be a path to file, a url or other types of string
              according to the task.

        Args:
            inputs (Union[dict, list]): Inputs for the inferencer.

        Returns:
            list: List of input for the :meth:`preprocess`.
        """
        if isinstance(inputs, dict) and isinstance(inputs['points'], str):
            pcd = inputs['points']
            backend = get_file_backend(pcd)
            if hasattr(backend, 'isdir') and isdir(pcd):
                # Backends like HttpsBackend do not implement `isdir`, so
                # only those backends that implement `isdir` could accept
                # the inputs as a directory
                filename_list = list_dir_or_file(pcd, list_dir=False)
                inputs = [{
                    'points': join_path(pcd, filename)
                } for filename in filename_list]

        if not isinstance(inputs, (list, tuple)):
            inputs = [inputs]
        return list(inputs)

    def _init_pipeline(self, cfg: ConfigType) -> Compose:
        """Initialize the test pipeline."""
        pipeline_cfg = cfg.test_dataloader.dataset.pipeline

        load_point_idx = self._get_transform_idx(pipeline_cfg,
                                                 'LoadPointsFromFile')
        if load_point_idx == -1:
            raise ValueError(
                'LoadPointsFromFile is not found in the test pipeline')

        load_cfg = pipeline_cfg[load_point_idx]
        self.coord_type, self.load_dim = load_cfg['coord_type'], load_cfg[
            'load_dim']
        self.use_dim = list(range(load_cfg['use_dim'])) if isinstance(
            load_cfg['use_dim'], int) else load_cfg['use_dim']

        pipeline_cfg[load_point_idx]['type'] = 'LidarDet3DInferencerLoader'
        return Compose(pipeline_cfg)

    def visualize(self,
                  inputs: InputsType,
                  preds: PredType,
                  return_vis: bool = False,
                  show: bool = False,
                  wait_time: int = -1,
                  draw_pred: bool = True,
                  pred_score_thr: float = 0.3,
                  no_save_vis: bool = False,
                  img_out_dir: str = '') -> Union[List[np.ndarray], None]:
        """Visualize predictions.

        Args:
            inputs (InputsType): Inputs for the inferencer.
            preds (PredType): Predictions of the model.
            return_vis (bool): Whether to return the visualization result.
                Defaults to False.
            show (bool): Whether to display the image in a popup window.
                Defaults to False.
            wait_time (float): The interval of show (s). Defaults to -1.
            draw_pred (bool): Whether to draw predicted bounding boxes.
                Defaults to True.
            pred_score_thr (float): Minimum score of bboxes to draw.
                Defaults to 0.3.
            no_save_vis (bool): Whether to force not to save prediction
                vis results. Defaults to False.
            img_out_dir (str): Output directory of visualization results.
                If left as empty, no file will be saved. Defaults to ''.

        Returns:
            List[np.ndarray] or None: Returns visualization results only if
            applicable.
        """
        if no_save_vis is True:
            img_out_dir = ''

        if not show and img_out_dir == '' and not return_vis:
            return None

        if getattr(self, 'visualizer') is None:
            raise ValueError('Visualization needs the "visualizer" term '
                             'defined in the config, but got None.')

        results = []

        for single_input, pred in zip(inputs, preds):
            single_input = single_input['points']
            if isinstance(single_input, str):
                pts_bytes = mmengine.fileio.get(single_input)
                points = np.frombuffer(pts_bytes, dtype=np.float32)
                points = points.reshape(-1, self.load_dim)
                points = points[:, self.use_dim]
                pc_name = osp.basename(single_input).split('.bin')[0]
                pc_name = f'{pc_name}.png'
            elif isinstance(single_input, np.ndarray):
                points = single_input.copy()
                pc_num = str(self.num_visualized_frames).zfill(8)
                pc_name = f'{pc_num}.png'
            else:
                raise ValueError('Unsupported input type: '
                                 f'{type(single_input)}')

            if img_out_dir != '' and show:
                o3d_save_path = osp.join(img_out_dir, 'vis_lidar', pc_name)
                mmengine.mkdir_or_exist(osp.dirname(o3d_save_path))
            else:
                o3d_save_path = None

            data_input = dict(points=points)
            self.visualizer.add_datasample(
                pc_name,
                data_input,
                pred,
                show=show,
                wait_time=wait_time,
                draw_gt=False,
                draw_pred=draw_pred,
                pred_score_thr=pred_score_thr,
                o3d_save_path=o3d_save_path,
                vis_task='lidar_det',
            )
            results.append(points)
            self.num_visualized_frames += 1

        return results

    def visualize_preds_fromfile(self, inputs: InputsType, preds: PredType,
                                 **kwargs) -> Union[List[np.ndarray], None]:
        """Visualize predictions from `*.json` files.

        Args:
            inputs (InputsType): Inputs for the inferencer.
            preds (PredType): Predictions of the model.

        Returns:
            List[np.ndarray] or None: Returns visualization results only if
            applicable.
        """
        data_samples = []
        for pred in preds:
            pred = mmengine.load(pred)
            data_sample = Det3DDataSample()
            data_sample.pred_instances_3d = InstanceData()
            data_sample.pred_instances_3d.labels_3d = torch.tensor(
                pred['labels_3d'])
            data_sample.pred_instances_3d.scores_3d = torch.tensor(
                pred['scores_3d'])
            if pred['box_type_3d'] == 'LiDAR':
                data_sample.pred_instances_3d.bboxes_3d = \
                    LiDARInstance3DBoxes(pred['bboxes_3d'])
            elif pred['box_type_3d'] == 'Camera':
                data_sample.pred_instances_3d.bboxes_3d = \
                    CameraInstance3DBoxes(pred['bboxes_3d'])
            elif pred['box_type_3d'] == 'Depth':
                data_sample.pred_instances_3d.bboxes_3d = \
                    DepthInstance3DBoxes(pred['bboxes_3d'])
            else:
                raise ValueError('Unsupported box type: '
                                 f'{pred["box_type_3d"]}')
            data_samples.append(data_sample)
        return self.visualize(inputs=inputs, preds=data_samples, **kwargs)
```
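A usage sketch for `LidarDet3DInferencer`, assuming the metafile alias `pointpillars_kitti-3class` (named in the docstring above; still illustrative) resolves to downloadable weights:

```python
from mmdet3d.apis import LidarDet3DInferencer

# The alias and demo file are illustrative.
inferencer = LidarDet3DInferencer('pointpillars_kitti-3class')
results = inferencer(
    dict(points='demo/data/kitti/000008.bin'),
    out_dir='outputs',  # dispatched to both img_out_dir and pred_out_dir
    pred_score_thr=0.3)
# With return_datasamples=False (the default), predictions are plain dicts
# produced by pred2dict(), with 'bboxes_3d', 'scores_3d' and 'labels_3d'.
print(results['predictions'][0]['labels_3d'])
```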
mmdetection3d/mmdet3d/apis/inferencers/lidar_seg3d_inferencer.py (new file, mode 100644)
```python
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
from typing import Dict, List, Optional, Sequence, Union

import mmengine
import numpy as np
from mmengine.dataset import Compose
from mmengine.fileio import (get_file_backend, isdir, join_path,
                             list_dir_or_file)
from mmengine.infer.infer import ModelType
from mmengine.structures import InstanceData

from mmdet3d.registry import INFERENCERS
from mmdet3d.utils import ConfigType
from .base_3d_inferencer import Base3DInferencer

InstanceList = List[InstanceData]
InputType = Union[str, np.ndarray]
InputsType = Union[InputType, Sequence[InputType]]
PredType = Union[InstanceData, InstanceList]
ImgType = Union[np.ndarray, Sequence[np.ndarray]]
ResType = Union[Dict, List[Dict], InstanceData, List[InstanceData]]


@INFERENCERS.register_module(name='seg3d-lidar')
@INFERENCERS.register_module()
class LidarSeg3DInferencer(Base3DInferencer):
    """The inferencer of LiDAR-based segmentation.

    Args:
        model (str, optional): Path to the config file or the model name
            defined in metafile. For example, it could be
            "pointnet2-ssg_s3dis-seg" or
            "configs/pointnet2/pointnet2_ssg_2xb16-cosine-50e_s3dis-seg.py".
            If model is not specified, user must provide the
            `weights` saved by MMEngine which contains the config string.
            Defaults to None.
        weights (str, optional): Path to the checkpoint. If it is not
            specified and model is a model name of metafile, the weights
            will be loaded from metafile. Defaults to None.
        device (str, optional): Device to run inference. If None, the
            available device will be automatically used. Defaults to None.
        scope (str): The scope of the model. Defaults to 'mmdet3d'.
        palette (str): Color palette used for visualization. The order of
            priority is palette -> config -> checkpoint. Defaults to 'none'.
    """

    def __init__(self,
                 model: Union[ModelType, str, None] = None,
                 weights: Optional[str] = None,
                 device: Optional[str] = None,
                 scope: str = 'mmdet3d',
                 palette: str = 'none') -> None:
        # A global counter tracking the number of frames processed, for
        # naming of the output results
        self.num_visualized_frames = 0
        super(LidarSeg3DInferencer, self).__init__(
            model=model,
            weights=weights,
            device=device,
            scope=scope,
            palette=palette)

    def _inputs_to_list(self, inputs: Union[dict, list], **kwargs) -> list:
        """Preprocess the inputs to a list.

        Preprocess inputs to a list according to its type:

        - list or tuple: return inputs
        - dict: the value with key 'points' is
            - Directory path: return all files in the directory
            - other cases: return a list containing the string. The string
              could be a path to file, a url or other types of string
              according to the task.

        Args:
            inputs (Union[dict, list]): Inputs for the inferencer.

        Returns:
            list: List of input for the :meth:`preprocess`.
        """
        if isinstance(inputs, dict) and isinstance(inputs['points'], str):
            pcd = inputs['points']
            backend = get_file_backend(pcd)
            if hasattr(backend, 'isdir') and isdir(pcd):
                # Backends like HttpsBackend do not implement `isdir`, so
                # only those backends that implement `isdir` could accept
                # the inputs as a directory
                filename_list = list_dir_or_file(pcd, list_dir=False)
                inputs = [{
                    'points': join_path(pcd, filename)
                } for filename in filename_list]

        if not isinstance(inputs, (list, tuple)):
            inputs = [inputs]
        return list(inputs)

    def _init_pipeline(self, cfg: ConfigType) -> Compose:
        """Initialize the test pipeline."""
        pipeline_cfg = cfg.test_dataloader.dataset.pipeline

        # Load annotation is also not applicable
        idx = self._get_transform_idx(pipeline_cfg, 'LoadAnnotations3D')
        if idx != -1:
            del pipeline_cfg[idx]

        idx = self._get_transform_idx(pipeline_cfg, 'PointSegClassMapping')
        if idx != -1:
            del pipeline_cfg[idx]

        load_point_idx = self._get_transform_idx(pipeline_cfg,
                                                 'LoadPointsFromFile')
        if load_point_idx == -1:
            raise ValueError(
                'LoadPointsFromFile is not found in the test pipeline')

        load_cfg = pipeline_cfg[load_point_idx]
        self.coord_type, self.load_dim = load_cfg['coord_type'], load_cfg[
            'load_dim']
        self.use_dim = list(range(load_cfg['use_dim'])) if isinstance(
            load_cfg['use_dim'], int) else load_cfg['use_dim']

        pipeline_cfg[load_point_idx]['type'] = 'LidarDet3DInferencerLoader'
        return Compose(pipeline_cfg)

    def visualize(self,
                  inputs: InputsType,
                  preds: PredType,
                  return_vis: bool = False,
                  show: bool = False,
                  wait_time: int = 0,
                  draw_pred: bool = True,
                  pred_score_thr: float = 0.3,
                  no_save_vis: bool = False,
                  img_out_dir: str = '') -> Union[List[np.ndarray], None]:
        """Visualize predictions.

        Args:
            inputs (InputsType): Inputs for the inferencer.
            preds (PredType): Predictions of the model.
            return_vis (bool): Whether to return the visualization result.
                Defaults to False.
            show (bool): Whether to display the image in a popup window.
                Defaults to False.
            wait_time (float): The interval of show (s). Defaults to 0.
            draw_pred (bool): Whether to draw predicted bounding boxes.
                Defaults to True.
            pred_score_thr (float): Minimum score of bboxes to draw.
                Defaults to 0.3.
            no_save_vis (bool): Whether to force not to save visualization
                results. Defaults to False.
            img_out_dir (str): Output directory of visualization results.
                If left as empty, no file will be saved. Defaults to ''.

        Returns:
            List[np.ndarray] or None: Returns visualization results only if
            applicable.
        """
        if no_save_vis is True:
            img_out_dir = ''

        if not show and img_out_dir == '' and not return_vis:
            return None

        if getattr(self, 'visualizer') is None:
            raise ValueError('Visualization needs the "visualizer" term '
                             'defined in the config, but got None.')

        results = []

        for single_input, pred in zip(inputs, preds):
            single_input = single_input['points']
            if isinstance(single_input, str):
                pts_bytes = mmengine.fileio.get(single_input)
                points = np.frombuffer(pts_bytes, dtype=np.float32)
                points = points.reshape(-1, self.load_dim)
                points = points[:, self.use_dim]
                pc_name = osp.basename(single_input).split('.bin')[0]
                pc_name = f'{pc_name}.png'
            elif isinstance(single_input, np.ndarray):
                points = single_input.copy()
                pc_num = str(self.num_visualized_frames).zfill(8)
                pc_name = f'{pc_num}.png'
            else:
                raise ValueError('Unsupported input type: '
                                 f'{type(single_input)}')

            if img_out_dir != '' and show:
                o3d_save_path = osp.join(img_out_dir, 'vis_lidar', pc_name)
                mmengine.mkdir_or_exist(osp.dirname(o3d_save_path))
            else:
                o3d_save_path = None

            data_input = dict(points=points)
            self.visualizer.add_datasample(
                pc_name,
                data_input,
                pred,
                show=show,
                wait_time=wait_time,
                draw_gt=False,
                draw_pred=draw_pred,
                pred_score_thr=pred_score_thr,
                o3d_save_path=o3d_save_path,
                vis_task='lidar_seg',
            )
            results.append(points)
            self.num_visualized_frames += 1

        return results
```
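And a parallel sketch for `LidarSeg3DInferencer`; the alias comes from the docstring above and the demo path is illustrative. Segmentation predictions carry a per-point `pts_semantic_mask` rather than boxes:

```python
from mmdet3d.apis import LidarSeg3DInferencer

# Illustrative alias and path; any registered seg config/checkpoint works.
inferencer = LidarSeg3DInferencer('pointnet2-ssg_s3dis-seg')
results = inferencer(dict(points='demo/data/s3dis/Area_1_office_2.bin'))
# pred2dict() exposes the per-point labels as 'pts_semantic_mask'.
mask = results['predictions'][0]['pts_semantic_mask']
```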
mmdetection3d/mmdet3d/apis/inferencers/mono_det3d_inferencer.py (new file, mode 100644)
# Copyright (c) OpenMMLab. All rights reserved.
import
os.path
as
osp
from
typing
import
Dict
,
List
,
Optional
,
Sequence
,
Union
import
mmcv
import
mmengine
import
numpy
as
np
from
mmengine.dataset
import
Compose
from
mmengine.fileio
import
(
get_file_backend
,
isdir
,
join_path
,
list_dir_or_file
)
from
mmengine.infer.infer
import
ModelType
from
mmengine.structures
import
InstanceData
from
mmdet3d.registry
import
INFERENCERS
from
mmdet3d.utils
import
ConfigType
from
.base_3d_inferencer
import
Base3DInferencer
InstanceList
=
List
[
InstanceData
]
InputType
=
Union
[
str
,
np
.
ndarray
]
InputsType
=
Union
[
InputType
,
Sequence
[
InputType
]]
PredType
=
Union
[
InstanceData
,
InstanceList
]
ImgType
=
Union
[
np
.
ndarray
,
Sequence
[
np
.
ndarray
]]
ResType
=
Union
[
Dict
,
List
[
Dict
],
InstanceData
,
List
[
InstanceData
]]
@
INFERENCERS
.
register_module
(
name
=
'det3d-mono'
)
@
INFERENCERS
.
register_module
()
class
MonoDet3DInferencer
(
Base3DInferencer
):
"""MMDet3D Monocular 3D object detection inferencer.
Args:
model (str, optional): Path to the config file or the model name
defined in metafile. For example, it could be
"pgd_kitti" or
"configs/pgd/pgd_r101-caffe_fpn_head-gn_4xb3-4x_kitti-mono3d.py".
If model is not specified, user must provide the
            `weights` saved by MMEngine which contains the config string.
            Defaults to None.
        weights (str, optional): Path to the checkpoint. If it is not
            specified and model is a model name of metafile, the weights
            will be loaded from metafile. Defaults to None.
        device (str, optional): Device to run inference. If None, the
            available device will be automatically used. Defaults to None.
        scope (str): The scope of the model. Defaults to 'mmdet3d'.
        palette (str): Color palette used for visualization. The order of
            priority is palette -> config -> checkpoint. Defaults to 'none'.
    """

    def __init__(self,
                 model: Union[ModelType, str, None] = None,
                 weights: Optional[str] = None,
                 device: Optional[str] = None,
                 scope: str = 'mmdet3d',
                 palette: str = 'none') -> None:
        # A global counter tracking the number of images processed, for
        # naming of the output images
        self.num_visualized_imgs = 0
        super(MonoDet3DInferencer, self).__init__(
            model=model,
            weights=weights,
            device=device,
            scope=scope,
            palette=palette)

    def _inputs_to_list(self,
                        inputs: Union[dict, list],
                        cam_type='CAM2',
                        **kwargs) -> list:
        """Preprocess the inputs to a list.

        Preprocess inputs to a list according to its type:

        - list or tuple: return inputs
        - dict: the value with key 'img' is
            - Directory path: return all files in the directory
            - other cases: return a list containing the string. The string
              could be a path to file, a url or other types of string
              according to the task.

        Args:
            inputs (Union[dict, list]): Inputs for the inferencer.

        Returns:
            list: List of input for the :meth:`preprocess`.
        """
        if isinstance(inputs, dict):
            assert 'infos' in inputs
            infos = inputs.pop('infos')

            if isinstance(inputs['img'], str):
                img = inputs['img']
                backend = get_file_backend(img)
                if hasattr(backend, 'isdir') and isdir(img):
                    # Backends like HttpsBackend do not implement `isdir`, so
                    # only those backends that implement `isdir` could accept
                    # the inputs as a directory
                    filename_list = list_dir_or_file(img, list_dir=False)
                    inputs = [{
                        'img': join_path(img, filename)
                    } for filename in filename_list]

            if not isinstance(inputs, (list, tuple)):
                inputs = [inputs]

            # get cam2img, lidar2cam and lidar2img from infos
            info_list = mmengine.load(infos)['data_list']
            assert len(info_list) == len(inputs)
            for index, input in enumerate(inputs):
                data_info = info_list[index]
                img_path = data_info['images'][cam_type]['img_path']
                if isinstance(input['img'], str) and \
                        osp.basename(img_path) != osp.basename(input['img']):
                    raise ValueError(
                        f'the info file of {img_path} is not provided.')
                cam2img = np.asarray(
                    data_info['images'][cam_type]['cam2img'],
                    dtype=np.float32)
                lidar2cam = np.asarray(
                    data_info['images'][cam_type]['lidar2cam'],
                    dtype=np.float32)
                if 'lidar2img' in data_info['images'][cam_type]:
                    lidar2img = np.asarray(
                        data_info['images'][cam_type]['lidar2img'],
                        dtype=np.float32)
                else:
                    lidar2img = cam2img @ lidar2cam
                input['cam2img'] = cam2img
                input['lidar2cam'] = lidar2cam
                input['lidar2img'] = lidar2img
        elif isinstance(inputs, (list, tuple)):
            # get cam2img, lidar2cam and lidar2img from infos
            for input in inputs:
                assert 'infos' in input
                infos = input.pop('infos')
                info_list = mmengine.load(infos)['data_list']
                assert len(info_list) == 1, \
                    'Only support single sample info in `.pkl`, ' \
                    'when inputs is a list.'
                data_info = info_list[0]
                img_path = data_info['images'][cam_type]['img_path']
                if isinstance(input['img'], str) and \
                        osp.basename(img_path) != osp.basename(input['img']):
                    raise ValueError(
                        f'the info file of {img_path} is not provided.')
                cam2img = np.asarray(
                    data_info['images'][cam_type]['cam2img'],
                    dtype=np.float32)
                lidar2cam = np.asarray(
                    data_info['images'][cam_type]['lidar2cam'],
                    dtype=np.float32)
                if 'lidar2img' in data_info['images'][cam_type]:
                    lidar2img = np.asarray(
                        data_info['images'][cam_type]['lidar2img'],
                        dtype=np.float32)
                else:
                    lidar2img = cam2img @ lidar2cam
                input['cam2img'] = cam2img
                input['lidar2cam'] = lidar2cam
                input['lidar2img'] = lidar2img

        return list(inputs)

    def _init_pipeline(self, cfg: ConfigType) -> Compose:
        """Initialize the test pipeline."""
        pipeline_cfg = cfg.test_dataloader.dataset.pipeline

        load_img_idx = self._get_transform_idx(pipeline_cfg,
                                               'LoadImageFromFileMono3D')
        if load_img_idx == -1:
            raise ValueError(
                'LoadImageFromFileMono3D is not found in the test pipeline')
        pipeline_cfg[load_img_idx]['type'] = 'MonoDet3DInferencerLoader'
        return Compose(pipeline_cfg)

    def visualize(self,
                  inputs: InputsType,
                  preds: PredType,
                  return_vis: bool = False,
                  show: bool = False,
                  wait_time: int = 0,
                  draw_pred: bool = True,
                  pred_score_thr: float = 0.3,
                  no_save_vis: bool = False,
                  img_out_dir: str = '',
                  cam_type_dir: str = 'CAM2'
                  ) -> Union[List[np.ndarray], None]:
        """Visualize predictions.

        Args:
            inputs (List[Dict]): Inputs for the inferencer.
            preds (List[Dict]): Predictions of the model.
            return_vis (bool): Whether to return the visualization result.
                Defaults to False.
            show (bool): Whether to display the image in a popup window.
                Defaults to False.
            wait_time (float): The interval of show (s). Defaults to 0.
            draw_pred (bool): Whether to draw predicted bounding boxes.
                Defaults to True.
            pred_score_thr (float): Minimum score of bboxes to draw.
                Defaults to 0.3.
            no_save_vis (bool): Whether to skip saving visualization
                results. Defaults to False.
            img_out_dir (str): Output directory of visualization results.
                If left as empty, no file will be saved. Defaults to ''.
            cam_type_dir (str): Camera type directory. Defaults to 'CAM2'.

        Returns:
            List[np.ndarray] or None: Returns visualization results only if
            applicable.
        """
        if no_save_vis is True:
            img_out_dir = ''

        if not show and img_out_dir == '' and not return_vis:
            return None

        if getattr(self, 'visualizer') is None:
            raise ValueError('Visualization needs the "visualizer" term '
                             'defined in the config, but got None.')

        results = []

        for single_input, pred in zip(inputs, preds):
            if isinstance(single_input['img'], str):
                img_bytes = mmengine.fileio.get(single_input['img'])
                img = mmcv.imfrombytes(img_bytes)
                img = img[:, :, ::-1]
                img_name = osp.basename(single_input['img'])
            elif isinstance(single_input['img'], np.ndarray):
                img = single_input['img'].copy()
                img_num = str(self.num_visualized_imgs).zfill(8)
                img_name = f'{img_num}.jpg'
            else:
                raise ValueError('Unsupported input type: '
                                 f"{type(single_input['img'])}")

            out_file = osp.join(img_out_dir, 'vis_camera', cam_type_dir,
                                img_name) if img_out_dir != '' else None

            data_input = dict(img=img)
            self.visualizer.add_datasample(
                img_name,
                data_input,
                pred,
                show=show,
                wait_time=wait_time,
                draw_gt=False,
                draw_pred=draw_pred,
                pred_score_thr=pred_score_thr,
                out_file=out_file,
                vis_task='mono_det',
            )
            results.append(img)
            self.num_visualized_imgs += 1

        return results
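
For orientation, here is a minimal usage sketch of the inferencer above. The model alias and both file paths are hypothetical placeholders, not values from this commit; any metafile model name or config path accepted by `Base3DInferencer`, together with an info `.pkl` produced by the mmdetection3d data converters, would work the same way.

from mmdet3d.apis import MonoDet3DInferencer

# 'pgd_kitti' and the demo paths below are placeholders for illustration.
inferencer = MonoDet3DInferencer(model='pgd_kitti')
inputs = dict(
    img='demo/data/kitti/000008.png',    # input image
    infos='demo/data/kitti/000008.pkl')  # info file carrying cam2img etc.
results = inferencer(inputs)
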
mmdetection3d/mmdet3d/apis/inferencers/multi_modality_det3d_inferencer.py
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
import warnings
from typing import Dict, List, Optional, Sequence, Union

import mmcv
import mmengine
import numpy as np
from mmengine.dataset import Compose
from mmengine.fileio import (get_file_backend, isdir, join_path,
                             list_dir_or_file)
from mmengine.infer.infer import ModelType
from mmengine.structures import InstanceData

from mmdet3d.registry import INFERENCERS
from mmdet3d.utils import ConfigType
from .base_3d_inferencer import Base3DInferencer

InstanceList = List[InstanceData]
InputType = Union[str, np.ndarray]
InputsType = Union[InputType, Sequence[InputType]]
PredType = Union[InstanceData, InstanceList]
ImgType = Union[np.ndarray, Sequence[np.ndarray]]
ResType = Union[Dict, List[Dict], InstanceData, List[InstanceData]]


@INFERENCERS.register_module(name='det3d-multi_modality')
@INFERENCERS.register_module()
class MultiModalityDet3DInferencer(Base3DInferencer):
    """The inferencer of multi-modality detection.

    Args:
        model (str, optional): Path to the config file or the model name
            defined in metafile. For example, it could be
            "pointpillars_kitti-3class" or
            "configs/pointpillars/pointpillars_hv_secfpn_8xb6-160e_kitti-3d-3class.py".  # noqa: E501
            If model is not specified, user must provide the
            `weights` saved by MMEngine which contains the config string.
            Defaults to None.
        weights (str, optional): Path to the checkpoint. If it is not
            specified and model is a model name of metafile, the weights
            will be loaded from metafile. Defaults to None.
        device (str, optional): Device to run inference. If None, the
            available device will be automatically used. Defaults to None.
        scope (str): The scope of registry. Defaults to 'mmdet3d'.
        palette (str): The palette of visualization. Defaults to 'none'.
    """

    def __init__(self,
                 model: Union[ModelType, str, None] = None,
                 weights: Optional[str] = None,
                 device: Optional[str] = None,
                 scope: str = 'mmdet3d',
                 palette: str = 'none') -> None:
        # A global counter tracking the number of frames processed, for
        # naming of the output results
        self.num_visualized_frames = 0
        super(MultiModalityDet3DInferencer, self).__init__(
            model=model,
            weights=weights,
            device=device,
            scope=scope,
            palette=palette)

    def _inputs_to_list(self,
                        inputs: Union[dict, list],
                        cam_type: str = 'CAM2',
                        **kwargs) -> list:
        """Preprocess the inputs to a list.

        Preprocess inputs to a list according to its type:

        - list or tuple: return inputs
        - dict: the value with key 'points' is
            - Directory path: return all files in the directory
            - other cases: return a list containing the string. The string
              could be a path to file, a url or other types of string
              according to the task.

        Args:
            inputs (Union[dict, list]): Inputs for the inferencer.

        Returns:
            list: List of input for the :meth:`preprocess`.
        """
        if isinstance(inputs, dict):
            assert 'infos' in inputs
            infos = inputs.pop('infos')

            if isinstance(inputs['img'], str):
                img, pcd = inputs['img'], inputs['points']
                backend = get_file_backend(img)
                if hasattr(backend, 'isdir') and isdir(img) and isdir(pcd):
                    # Backends like HttpsBackend do not implement `isdir`, so
                    # only those backends that implement `isdir` could accept
                    # the inputs as a directory
                    img_filename_list = list_dir_or_file(
                        img, list_dir=False, suffix=['.png', '.jpg'])
                    pcd_filename_list = list_dir_or_file(
                        pcd, list_dir=False, suffix='.bin')
                    assert len(img_filename_list) == len(pcd_filename_list)
                    inputs = [{
                        'img': join_path(img, img_filename),
                        'points': join_path(pcd, pcd_filename)
                    } for pcd_filename, img_filename in zip(
                        pcd_filename_list, img_filename_list)]

            if not isinstance(inputs, (list, tuple)):
                inputs = [inputs]

            # get cam2img, lidar2cam and lidar2img from infos
            info_list = mmengine.load(infos)['data_list']
            assert len(info_list) == len(inputs)
            for index, input in enumerate(inputs):
                data_info = info_list[index]
                img_path = data_info['images'][cam_type]['img_path']
                if isinstance(input['img'], str) and \
                        osp.basename(img_path) != osp.basename(input['img']):
                    raise ValueError(
                        f'the info file of {img_path} is not provided.')
                cam2img = np.asarray(
                    data_info['images'][cam_type]['cam2img'],
                    dtype=np.float32)
                lidar2cam = np.asarray(
                    data_info['images'][cam_type]['lidar2cam'],
                    dtype=np.float32)
                if 'lidar2img' in data_info['images'][cam_type]:
                    lidar2img = np.asarray(
                        data_info['images'][cam_type]['lidar2img'],
                        dtype=np.float32)
                else:
                    lidar2img = cam2img @ lidar2cam
                input['cam2img'] = cam2img
                input['lidar2cam'] = lidar2cam
                input['lidar2img'] = lidar2img
        elif isinstance(inputs, (list, tuple)):
            # get cam2img, lidar2cam and lidar2img from infos
            for input in inputs:
                assert 'infos' in input
                infos = input.pop('infos')
                info_list = mmengine.load(infos)['data_list']
                assert len(info_list) == 1, \
                    'Only support single sample info in `.pkl`, ' \
                    'when input is a list.'
                data_info = info_list[0]
                img_path = data_info['images'][cam_type]['img_path']
                if isinstance(input['img'], str) and \
                        osp.basename(img_path) != osp.basename(input['img']):
                    raise ValueError(
                        f'the info file of {img_path} is not provided.')
                cam2img = np.asarray(
                    data_info['images'][cam_type]['cam2img'],
                    dtype=np.float32)
                lidar2cam = np.asarray(
                    data_info['images'][cam_type]['lidar2cam'],
                    dtype=np.float32)
                if 'lidar2img' in data_info['images'][cam_type]:
                    lidar2img = np.asarray(
                        data_info['images'][cam_type]['lidar2img'],
                        dtype=np.float32)
                else:
                    lidar2img = cam2img @ lidar2cam
                input['cam2img'] = cam2img
                input['lidar2cam'] = lidar2cam
                input['lidar2img'] = lidar2img

        return list(inputs)

    def _init_pipeline(self, cfg: ConfigType) -> Compose:
        """Initialize the test pipeline."""
        pipeline_cfg = cfg.test_dataloader.dataset.pipeline

        load_point_idx = self._get_transform_idx(pipeline_cfg,
                                                 'LoadPointsFromFile')
        load_mv_img_idx = self._get_transform_idx(
            pipeline_cfg, 'LoadMultiViewImageFromFiles')
        if load_mv_img_idx != -1:
            warnings.warn(
                'LoadMultiViewImageFromFiles is not supported yet in the '
                'multi-modality inferencer. Please remove it.')
        # Now, we only support ``LoadImageFromFile`` as the image loader in
        # the original pipeline. `LoadMultiViewImageFromFiles` is not
        # supported yet.
        load_img_idx = self._get_transform_idx(pipeline_cfg,
                                               'LoadImageFromFile')

        if load_point_idx == -1 or load_img_idx == -1:
            raise ValueError(
                'Both LoadPointsFromFile and LoadImageFromFile must '
                'be specified in the pipeline, but LoadPointsFromFile is '
                f'{load_point_idx == -1} and LoadImageFromFile is '
                f'{load_img_idx}')

        load_cfg = pipeline_cfg[load_point_idx]
        self.coord_type, self.load_dim = load_cfg['coord_type'], load_cfg[
            'load_dim']
        self.use_dim = list(range(load_cfg['use_dim'])) if isinstance(
            load_cfg['use_dim'], int) else load_cfg['use_dim']

        load_point_args = pipeline_cfg[load_point_idx]
        load_point_args.pop('type')
        load_img_args = pipeline_cfg[load_img_idx]
        load_img_args.pop('type')

        load_idx = min(load_point_idx, load_img_idx)
        pipeline_cfg.pop(max(load_point_idx, load_img_idx))

        pipeline_cfg[load_idx] = dict(
            type='MultiModalityDet3DInferencerLoader',
            load_point_args=load_point_args,
            load_img_args=load_img_args)
        return Compose(pipeline_cfg)

    def visualize(self,
                  inputs: InputsType,
                  preds: PredType,
                  return_vis: bool = False,
                  show: bool = False,
                  wait_time: int = 0,
                  draw_pred: bool = True,
                  pred_score_thr: float = 0.3,
                  no_save_vis: bool = False,
                  img_out_dir: str = '',
                  cam_type_dir: str = 'CAM2'
                  ) -> Union[List[np.ndarray], None]:
        """Visualize predictions.

        Args:
            inputs (InputsType): Inputs for the inferencer.
            preds (PredType): Predictions of the model.
            return_vis (bool): Whether to return the visualization result.
                Defaults to False.
            show (bool): Whether to display the image in a popup window.
                Defaults to False.
            wait_time (float): The interval of show (s). Defaults to 0.
            draw_pred (bool): Whether to draw predicted bounding boxes.
                Defaults to True.
            no_save_vis (bool): Whether to skip saving visualization
                results. Defaults to False.
            pred_score_thr (float): Minimum score of bboxes to draw.
                Defaults to 0.3.
            img_out_dir (str): Output directory of visualization results.
                If left as empty, no file will be saved. Defaults to ''.
            cam_type_dir (str): Camera type directory. Defaults to 'CAM2'.

        Returns:
            List[np.ndarray] or None: Returns visualization results only if
            applicable.
        """
        if no_save_vis is True:
            img_out_dir = ''

        if not show and img_out_dir == '' and not return_vis:
            return None

        if getattr(self, 'visualizer') is None:
            raise ValueError('Visualization needs the "visualizer" term '
                             'defined in the config, but got None.')

        results = []

        for single_input, pred in zip(inputs, preds):
            points_input = single_input['points']
            if isinstance(points_input, str):
                pts_bytes = mmengine.fileio.get(points_input)
                points = np.frombuffer(pts_bytes, dtype=np.float32)
                points = points.reshape(-1, self.load_dim)
                points = points[:, self.use_dim]
                pc_name = osp.basename(points_input).split('.bin')[0]
                pc_name = f'{pc_name}.png'
            elif isinstance(points_input, np.ndarray):
                points = points_input.copy()
                pc_num = str(self.num_visualized_frames).zfill(8)
                pc_name = f'{pc_num}.png'
            else:
                raise ValueError('Unsupported input type: '
                                 f'{type(points_input)}')

            if img_out_dir != '' and show:
                o3d_save_path = osp.join(img_out_dir, 'vis_lidar', pc_name)
                mmengine.mkdir_or_exist(osp.dirname(o3d_save_path))
            else:
                o3d_save_path = None

            img_input = single_input['img']
            if isinstance(single_input['img'], str):
                img_bytes = mmengine.fileio.get(img_input)
                img = mmcv.imfrombytes(img_bytes)
                img = img[:, :, ::-1]
                img_name = osp.basename(img_input)
            elif isinstance(img_input, np.ndarray):
                img = img_input.copy()
                img_num = str(self.num_visualized_frames).zfill(8)
                img_name = f'{img_num}.jpg'
            else:
                raise ValueError('Unsupported input type: '
                                 f'{type(img_input)}')

            out_file = osp.join(img_out_dir, 'vis_camera', cam_type_dir,
                                img_name) if img_out_dir != '' else None

            data_input = dict(points=points, img=img)
            self.visualizer.add_datasample(
                pc_name,
                data_input,
                pred,
                show=show,
                wait_time=wait_time,
                draw_gt=False,
                draw_pred=draw_pred,
                pred_score_thr=pred_score_thr,
                o3d_save_path=o3d_save_path,
                out_file=out_file,
                vis_task='multi-modality_det',
            )
            results.append(points)
            self.num_visualized_frames += 1

        return results
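
A matching usage sketch for the multi-modality inferencer; again the model alias and all paths are placeholders, and the `.pkl` must contain the calibration entries (`cam2img`, `lidar2cam`) that `_inputs_to_list` reads.

from mmdet3d.apis import MultiModalityDet3DInferencer

# Placeholder alias and paths for illustration only.
inferencer = MultiModalityDet3DInferencer(model='mvxnet_kitti-3class')
inputs = dict(
    points='demo/data/kitti/000008.bin',  # LiDAR point cloud
    img='demo/data/kitti/000008.png',     # paired camera image
    infos='demo/data/kitti/000008.pkl')   # info file with calibration
results = inferencer(inputs)
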
mmdetection3d/mmdet3d/configs/_base_/datasets/kitti_3d_3class.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset.dataset_wrapper import RepeatDataset
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.kitti_dataset import KittiDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadPointsFromFile)
from mmdet3d.datasets.transforms.test_time_aug import MultiScaleFlipAug3D
from mmdet3d.datasets.transforms.transforms_3d import (  # noqa
    GlobalRotScaleTrans, ObjectNoise, ObjectRangeFilter, ObjectSample,
    PointShuffle, PointsRangeFilter, RandomFlip3D)
from mmdet3d.evaluation.metrics.kitti_metric import KittiMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

# dataset settings
dataset_type = 'KittiDataset'
data_root = 'data/kitti/'
class_names = ['Pedestrian', 'Cyclist', 'Car']
point_cloud_range = [0, -40, -3, 70.4, 40, 1]
input_modality = dict(use_lidar=True, use_camera=False)
metainfo = dict(classes=class_names)

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/kitti/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

db_sampler = dict(
    data_root=data_root,
    info_path=data_root + 'kitti_dbinfos_train.pkl',
    rate=1.0,
    prepare=dict(
        filter_by_difficulty=[-1],
        filter_by_min_points=dict(Car=5, Pedestrian=10, Cyclist=10)),
    classes=class_names,
    sample_groups=dict(Car=12, Pedestrian=6, Cyclist=6),
    points_loader=dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,
        use_dim=4,
        backend_args=backend_args),
    backend_args=backend_args)

train_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,  # x, y, z, intensity
        use_dim=4,
        backend_args=backend_args),
    dict(type=LoadAnnotations3D, with_bbox_3d=True, with_label_3d=True),
    dict(type=ObjectSample, db_sampler=db_sampler),
    dict(
        type=ObjectNoise,
        num_try=100,
        translation_std=[1.0, 1.0, 0.5],
        global_rot_range=[0.0, 0.0],
        rot_range=[-0.78539816, 0.78539816]),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(
        type=GlobalRotScaleTrans,
        rot_range=[-0.78539816, 0.78539816],
        scale_ratio_range=[0.95, 1.05]),
    dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=ObjectRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=PointShuffle),
    dict(type=Pack3DDetInputs, keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,
        use_dim=4,
        backend_args=backend_args),
    dict(
        type=MultiScaleFlipAug3D,
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type=GlobalRotScaleTrans,
                rot_range=[0, 0],
                scale_ratio_range=[1., 1.],
                translation_std=[0, 0, 0]),
            dict(type=RandomFlip3D),
            dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range)
        ]),
    dict(type=Pack3DDetInputs, keys=['points'])
]
# construct a pipeline for data and gt loading in show function
# please keep its loading function consistent with test_pipeline (e.g. client)
eval_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,
        use_dim=4,
        backend_args=backend_args),
    dict(type=Pack3DDetInputs, keys=['points'])
]
train_dataloader = dict(
    batch_size=6,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=RepeatDataset,
        times=2,
        dataset=dict(
            type=KittiDataset,
            data_root=data_root,
            ann_file='kitti_infos_train.pkl',
            data_prefix=dict(pts='training/velodyne_reduced'),
            pipeline=train_pipeline,
            modality=input_modality,
            test_mode=False,
            metainfo=metainfo,
            # we use box_type_3d='LiDAR' in kitti and nuscenes dataset
            # and box_type_3d='Depth' in sunrgbd and scannet dataset.
            box_type_3d='LiDAR',
            backend_args=backend_args)))
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=KittiDataset,
        data_root=data_root,
        data_prefix=dict(pts='training/velodyne_reduced'),
        ann_file='kitti_infos_val.pkl',
        pipeline=test_pipeline,
        modality=input_modality,
        test_mode=True,
        metainfo=metainfo,
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=KittiDataset,
        data_root=data_root,
        data_prefix=dict(pts='training/velodyne_reduced'),
        ann_file='kitti_infos_val.pkl',
        pipeline=test_pipeline,
        modality=input_modality,
        test_mode=True,
        metainfo=metainfo,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_evaluator = dict(
    type=KittiMetric,
    ann_file=data_root + 'kitti_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
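
Since this is one of the pure-Python `_base_` configs, a downstream model config would normally pull it in with `mmengine.config.read_base` instead of the string-based `_base_` mechanism. A minimal sketch, assuming the derived file lives two package levels below `_base_`; the override at the end is illustrative only.

from mmengine.config import read_base

with read_base():
    from .._base_.datasets.kitti_3d_3class import *

# Imported names behave like ordinary module-level variables and can be
# tweaked in place, e.g. a smaller training batch size (illustrative):
train_dataloader.update(batch_size=4)
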
mmdetection3d/mmdet3d/configs/_base_/datasets/kitti_3d_car.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset.dataset_wrapper import RepeatDataset
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.kitti_dataset import KittiDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadPointsFromFile)
from mmdet3d.datasets.transforms.test_time_aug import MultiScaleFlipAug3D
from mmdet3d.datasets.transforms.transforms_3d import (  # noqa
    GlobalRotScaleTrans, ObjectNoise, ObjectRangeFilter, ObjectSample,
    PointShuffle, PointsRangeFilter, RandomFlip3D)
from mmdet3d.evaluation.metrics.kitti_metric import KittiMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

# dataset settings
dataset_type = 'KittiDataset'
data_root = 'data/kitti/'
class_names = ['Car']
point_cloud_range = [0, -40, -3, 70.4, 40, 1]
input_modality = dict(use_lidar=True, use_camera=False)
metainfo = dict(classes=class_names)

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/kitti/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

db_sampler = dict(
    data_root=data_root,
    info_path=data_root + 'kitti_dbinfos_train.pkl',
    rate=1.0,
    prepare=dict(
        filter_by_difficulty=[-1], filter_by_min_points=dict(Car=5)),
    classes=class_names,
    sample_groups=dict(Car=15),
    points_loader=dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,
        use_dim=4,
        backend_args=backend_args),
    backend_args=backend_args)

train_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,  # x, y, z, intensity
        use_dim=4,
        backend_args=backend_args),
    dict(type=LoadAnnotations3D, with_bbox_3d=True, with_label_3d=True),
    dict(type=ObjectSample, db_sampler=db_sampler),
    dict(
        type=ObjectNoise,
        num_try=100,
        translation_std=[1.0, 1.0, 0.5],
        global_rot_range=[0.0, 0.0],
        rot_range=[-0.78539816, 0.78539816]),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(
        type=GlobalRotScaleTrans,
        rot_range=[-0.78539816, 0.78539816],
        scale_ratio_range=[0.95, 1.05]),
    dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=ObjectRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=PointShuffle),
    dict(type=Pack3DDetInputs, keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,
        use_dim=4,
        backend_args=backend_args),
    dict(
        type=MultiScaleFlipAug3D,
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type=GlobalRotScaleTrans,
                rot_range=[0, 0],
                scale_ratio_range=[1., 1.],
                translation_std=[0, 0, 0]),
            dict(type=RandomFlip3D),
            dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range)
        ]),
    dict(type=Pack3DDetInputs, keys=['points'])
]
# construct a pipeline for data and gt loading in show function
# please keep its loading function consistent with test_pipeline (e.g. client)
eval_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=4,
        use_dim=4,
        backend_args=backend_args),
    dict(type=Pack3DDetInputs, keys=['points'])
]
train_dataloader = dict(
    batch_size=6,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=RepeatDataset,
        times=2,
        dataset=dict(
            type=KittiDataset,
            data_root=data_root,
            ann_file='kitti_infos_train.pkl',
            data_prefix=dict(pts='training/velodyne_reduced'),
            pipeline=train_pipeline,
            modality=input_modality,
            test_mode=False,
            metainfo=metainfo,
            # we use box_type_3d='LiDAR' in kitti and nuscenes dataset
            # and box_type_3d='Depth' in sunrgbd and scannet dataset.
            box_type_3d='LiDAR',
            backend_args=backend_args)))
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=KittiDataset,
        data_root=data_root,
        data_prefix=dict(pts='training/velodyne_reduced'),
        ann_file='kitti_infos_val.pkl',
        pipeline=test_pipeline,
        modality=input_modality,
        test_mode=True,
        metainfo=metainfo,
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=KittiDataset,
        data_root=data_root,
        data_prefix=dict(pts='training/velodyne_reduced'),
        ann_file='kitti_infos_val.pkl',
        pipeline=test_pipeline,
        modality=input_modality,
        test_mode=True,
        metainfo=metainfo,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_evaluator = dict(
    type=KittiMetric,
    ann_file=data_root + 'kitti_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
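
The `LoadPointsFromFile(load_dim=4, use_dim=4)` entries in the two KITTI configs above boil down to a plain NumPy decode of the velodyne `.bin` format. A sketch with a placeholder path:

import numpy as np

load_dim, use_dim = 4, 4  # matches the pipeline settings above
raw = np.fromfile('data/kitti/training/velodyne_reduced/000008.bin',
                  dtype=np.float32)
points = raw.reshape(-1, load_dim)  # columns: x, y, z, intensity
points = points[:, :use_dim]        # keep the first `use_dim` columns
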
mmdetection3d/mmdet3d/configs/_base_/datasets/kitti_mono3d.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmcv.transforms.processing import Resize
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.kitti_dataset import KittiDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadImageFromFileMono3D)
from mmdet3d.datasets.transforms.transforms_3d import RandomFlip3D
from mmdet3d.evaluation.metrics.kitti_metric import KittiMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

dataset_type = 'KittiDataset'
data_root = 'data/kitti/'
class_names = ['Pedestrian', 'Cyclist', 'Car']
input_modality = dict(use_lidar=False, use_camera=True)
metainfo = dict(classes=class_names)

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/kitti/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

train_pipeline = [
    dict(type=LoadImageFromFileMono3D, backend_args=backend_args),
    dict(
        type=LoadAnnotations3D,
        with_bbox=True,
        with_label=True,
        with_attr_label=False,
        with_bbox_3d=True,
        with_label_3d=True,
        with_bbox_depth=True),
    dict(type=Resize, scale=(1242, 375), keep_ratio=True),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(
        type=Pack3DDetInputs,
        keys=[
            'img', 'gt_bboxes', 'gt_bboxes_labels', 'gt_bboxes_3d',
            'gt_labels_3d', 'centers_2d', 'depths'
        ]),
]
test_pipeline = [
    dict(type=LoadImageFromFileMono3D, backend_args=backend_args),
    dict(type=Resize, scale=(1242, 375), keep_ratio=True),
    dict(type=Pack3DDetInputs, keys=['img'])
]
eval_pipeline = [
    dict(type=LoadImageFromFileMono3D, backend_args=backend_args),
    dict(type=Pack3DDetInputs, keys=['img'])
]

train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=KittiDataset,
        data_root=data_root,
        ann_file='kitti_infos_train.pkl',
        data_prefix=dict(img='training/image_2'),
        pipeline=train_pipeline,
        modality=input_modality,
        load_type='fov_image_based',
        test_mode=False,
        metainfo=metainfo,
        # we use box_type_3d='Camera' in monocular 3d
        # detection task
        box_type_3d='Camera',
        backend_args=backend_args))
val_dataloader = dict(
    batch_size=1,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=KittiDataset,
        data_root=data_root,
        data_prefix=dict(img='training/image_2'),
        ann_file='kitti_infos_val.pkl',
        pipeline=test_pipeline,
        modality=input_modality,
        load_type='fov_image_based',
        metainfo=metainfo,
        test_mode=True,
        box_type_3d='Camera',
        backend_args=backend_args))
test_dataloader = val_dataloader

val_evaluator = dict(
    type=KittiMetric,
    ann_file=data_root + 'kitti_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
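
The monocular setting above relies on the `cam2img` intrinsic to relate 3D boxes (and the `centers_2d`/`depths` targets) to image pixels. Projecting a camera-frame point works as sketched below; the intrinsic values are made up for illustration:

import numpy as np

cam2img = np.array([[721.5, 0.0, 609.6],
                    [0.0, 721.5, 172.9],
                    [0.0, 0.0, 1.0]], dtype=np.float32)  # illustrative only
xyz = np.array([2.0, 1.0, 10.0], dtype=np.float32)  # point in camera frame
uvw = cam2img @ xyz
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]  # pixel coordinates
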
mmdetection3d/mmdet3d/configs/_base_/datasets/lyft_3d.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.lyft_dataset import LyftDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadPointsFromFile,
                                                 LoadPointsFromMultiSweeps)
from mmdet3d.datasets.transforms.test_time_aug import MultiScaleFlipAug3D
from mmdet3d.datasets.transforms.transforms_3d import (GlobalRotScaleTrans,
                                                       ObjectRangeFilter,
                                                       PointShuffle,
                                                       PointsRangeFilter,
                                                       RandomFlip3D)
from mmdet3d.evaluation.metrics.lyft_metric import LyftMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

# If point cloud range is changed, the models should also change their point
# cloud range accordingly
point_cloud_range = [-80, -80, -5, 80, 80, 3]
# For Lyft we usually do 9-class detection
class_names = [
    'car', 'truck', 'bus', 'emergency_vehicle', 'other_vehicle', 'motorcycle',
    'bicycle', 'pedestrian', 'animal'
]
dataset_type = 'LyftDataset'
data_root = 'data/lyft/'
# Input modality for Lyft dataset, this is consistent with the submission
# format which requires the information in input_modality.
input_modality = dict(use_lidar=True, use_camera=False)
data_prefix = dict(
    pts='v1.01-train/lidar', img='', sweeps='v1.01-train/lidar')

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/lyft/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

train_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(type=LoadAnnotations3D, with_bbox_3d=True, with_label_3d=True),
    dict(
        type=GlobalRotScaleTrans,
        rot_range=[-0.3925, 0.3925],
        scale_ratio_range=[0.95, 1.05],
        translation_std=[0, 0, 0]),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=ObjectRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=PointShuffle),
    dict(type=Pack3DDetInputs, keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(
        type=MultiScaleFlipAug3D,
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type=GlobalRotScaleTrans,
                rot_range=[0, 0],
                scale_ratio_range=[1., 1.],
                translation_std=[0, 0, 0]),
            dict(type=RandomFlip3D),
            dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range)
        ]),
    dict(type=Pack3DDetInputs, keys=['points'])
]
# construct a pipeline for data and gt loading in show function
# please keep its loading function consistent with test_pipeline (e.g. client)
eval_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(type=Pack3DDetInputs, keys=['points'])
]
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=LyftDataset,
        data_root=data_root,
        ann_file='lyft_infos_train.pkl',
        pipeline=train_pipeline,
        metainfo=dict(classes=class_names),
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=False,
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=LyftDataset,
        data_root=data_root,
        ann_file='lyft_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=dict(classes=class_names),
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=True,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=LyftDataset,
        data_root=data_root,
        ann_file='lyft_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=dict(classes=class_names),
        modality=input_modality,
        test_mode=True,
        data_prefix=data_prefix,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_evaluator = dict(
    type=LyftMetric,
    data_root=data_root,
    ann_file='lyft_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
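
`LoadPointsFromMultiSweeps(sweeps_num=10)` densifies the keyframe cloud with preceding sweeps. Conceptually that amounts to the sketch below; the function, its inputs, and the use of the fifth channel as a relative timestamp are stand-ins for illustration, not the transform's actual API.

import numpy as np

def merge_sweeps(key_points, sweeps):
    # key_points: (N, 5) array; sweeps: list of
    # (points (M, 5), 4x4 sweep-to-key transform, time offset) tuples.
    merged = [key_points]
    for pts, sweep2key, dt in sweeps:
        xyz1 = np.hstack([pts[:, :3], np.ones((len(pts), 1), pts.dtype)])
        pts = pts.copy()
        pts[:, :3] = (xyz1 @ sweep2key.T)[:, :3]  # into keyframe coordinates
        pts[:, 4] = dt                            # relative timestamp channel
        merged.append(pts)
    return np.concatenate(merged, axis=0)
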
mmdetection3d/mmdet3d/configs/_base_/datasets/lyft_3d_range100.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.lyft_dataset import LyftDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadPointsFromFile,
                                                 LoadPointsFromMultiSweeps)
from mmdet3d.datasets.transforms.test_time_aug import MultiScaleFlipAug3D
from mmdet3d.datasets.transforms.transforms_3d import (GlobalRotScaleTrans,
                                                       ObjectRangeFilter,
                                                       PointShuffle,
                                                       PointsRangeFilter,
                                                       RandomFlip3D)
from mmdet3d.evaluation.metrics.lyft_metric import LyftMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

# If point cloud range is changed, the models should also change their point
# cloud range accordingly
point_cloud_range = [-100, -100, -5, 100, 100, 3]
# For Lyft we usually do 9-class detection
class_names = [
    'car', 'truck', 'bus', 'emergency_vehicle', 'other_vehicle', 'motorcycle',
    'bicycle', 'pedestrian', 'animal'
]
dataset_type = 'LyftDataset'
data_root = 'data/lyft/'
data_prefix = dict(
    pts='v1.01-train/lidar', img='', sweeps='v1.01-train/lidar')
# Input modality for Lyft dataset, this is consistent with the submission
# format which requires the information in input_modality.
input_modality = dict(
    use_lidar=True,
    use_camera=False,
    use_radar=False,
    use_map=False,
    use_external=False)

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/lyft/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

train_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(type=LoadAnnotations3D, with_bbox_3d=True, with_label_3d=True),
    dict(
        type=GlobalRotScaleTrans,
        rot_range=[-0.3925, 0.3925],
        scale_ratio_range=[0.95, 1.05],
        translation_std=[0, 0, 0]),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=ObjectRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=PointShuffle),
    dict(type=Pack3DDetInputs, keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(
        type=MultiScaleFlipAug3D,
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type=GlobalRotScaleTrans,
                rot_range=[0, 0],
                scale_ratio_range=[1., 1.],
                translation_std=[0, 0, 0]),
            dict(type=RandomFlip3D),
            dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range),
        ]),
    dict(type=Pack3DDetInputs, keys=['points'])
]
# construct a pipeline for data and gt loading in show function
# please keep its loading function consistent with test_pipeline (e.g. client)
eval_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(type=Pack3DDetInputs, keys=['points'])
]
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=LyftDataset,
        data_root=data_root,
        ann_file='lyft_infos_train.pkl',
        pipeline=train_pipeline,
        metainfo=dict(classes=class_names),
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=False,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=LyftDataset,
        data_root=data_root,
        ann_file='lyft_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=dict(classes=class_names),
        modality=input_modality,
        test_mode=True,
        data_prefix=data_prefix,
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = val_dataloader

val_evaluator = dict(
    type=LyftMetric,
    data_root=data_root,
    ann_file='lyft_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
mmdetection3d/mmdet3d/configs/_base_/datasets/nuim_instance.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmcv.transforms.loading import LoadAnnotations, LoadImageFromFile
from mmcv.transforms.processing import MultiScaleFlipAug, RandomFlip, Resize

dataset_type = 'CocoDataset'
data_root = 'data/nuimages/'
class_names = [
    'car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle',
    'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
]

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/nuimages/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

train_pipeline = [
    dict(type=LoadImageFromFile, backend_args=backend_args),
    dict(type=LoadAnnotations, with_bbox=True, with_mask=True),
    dict(
        type=Resize,
        img_scale=[(1280, 720), (1920, 1080)],
        multiscale_mode='range',
        keep_ratio=True),
    dict(type=RandomFlip, flip_ratio=0.5),
    dict(type='PackDetInputs'),
]
test_pipeline = [
    dict(type=LoadImageFromFile, backend_args=backend_args),
    dict(
        type=MultiScaleFlipAug,
        img_scale=(1600, 900),
        flip=False,
        transforms=[
            dict(type=Resize, keep_ratio=True),
            dict(type=RandomFlip),
        ]),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor')),
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/nuimages_v1.0-train.json',
        img_prefix=data_root,
        classes=class_names,
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/nuimages_v1.0-val.json',
        img_prefix=data_root,
        classes=class_names,
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/nuimages_v1.0-val.json',
        img_prefix=data_root,
        classes=class_names,
        pipeline=test_pipeline))
evaluation = dict(metric=['bbox', 'segm'])
mmdetection3d/mmdet3d/configs/_base_/datasets/nus_3d.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.nuscenes_dataset import NuScenesDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadPointsFromFile,
                                                 LoadPointsFromMultiSweeps)
from mmdet3d.datasets.transforms.test_time_aug import MultiScaleFlipAug3D
from mmdet3d.datasets.transforms.transforms_3d import (  # noqa
    GlobalRotScaleTrans, ObjectNameFilter, ObjectRangeFilter, PointShuffle,
    PointsRangeFilter, RandomFlip3D)
from mmdet3d.evaluation.metrics.nuscenes_metric import NuScenesMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

# If point cloud range is changed, the models should also change their point
# cloud range accordingly
point_cloud_range = [-50, -50, -5, 50, 50, 3]
# Using calibration info convert the Lidar-coordinate point cloud range to the
# ego-coordinate point cloud range could bring a little promotion in nuScenes.
# point_cloud_range = [-50, -50.8, -5, 50, 49.2, 3]
# For nuScenes we usually do 10-class detection
class_names = [
    'car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle',
    'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
]
metainfo = dict(classes=class_names)
dataset_type = 'NuScenesDataset'
data_root = 'data/nuscenes/'
# Input modality for nuScenes dataset, this is consistent with the submission
# format which requires the information in input_modality.
input_modality = dict(use_lidar=True, use_camera=False)
data_prefix = dict(pts='samples/LIDAR_TOP', img='', sweeps='sweeps/LIDAR_TOP')

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/nuscenes/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

train_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        backend_args=backend_args),
    dict(type=LoadAnnotations3D, with_bbox_3d=True, with_label_3d=True),
    dict(
        type=GlobalRotScaleTrans,
        rot_range=[-0.3925, 0.3925],
        scale_ratio_range=[0.95, 1.05],
        translation_std=[0, 0, 0]),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=ObjectRangeFilter, point_cloud_range=point_cloud_range),
    dict(type=ObjectNameFilter, classes=class_names),
    dict(type=PointShuffle),
    dict(type=Pack3DDetInputs, keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        test_mode=True,
        backend_args=backend_args),
    dict(
        type=MultiScaleFlipAug3D,
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type=GlobalRotScaleTrans,
                rot_range=[0, 0],
                scale_ratio_range=[1., 1.],
                translation_std=[0, 0, 0]),
            dict(type=RandomFlip3D),
            dict(type=PointsRangeFilter, point_cloud_range=point_cloud_range)
        ]),
    dict(type=Pack3DDetInputs, keys=['points'])
]
# construct a pipeline for data and gt loading in show function
# please keep its loading function consistent with test_pipeline (e.g. client)
eval_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        backend_args=backend_args),
    dict(
        type=LoadPointsFromMultiSweeps,
        sweeps_num=10,
        test_mode=True,
        backend_args=backend_args),
    dict(type=Pack3DDetInputs, keys=['points'])
]
train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=NuScenesDataset,
        data_root=data_root,
        ann_file='nuscenes_infos_train.pkl',
        pipeline=train_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        test_mode=False,
        data_prefix=data_prefix,
        # we use box_type_3d='LiDAR' in kitti and nuscenes dataset
        # and box_type_3d='Depth' in sunrgbd and scannet dataset.
        box_type_3d='LiDAR',
        backend_args=backend_args))
test_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=NuScenesDataset,
        data_root=data_root,
        ann_file='nuscenes_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        data_prefix=data_prefix,
        test_mode=True,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=NuScenesDataset,
        data_root=data_root,
        ann_file='nuscenes_infos_val.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        test_mode=True,
        data_prefix=data_prefix,
        box_type_3d='LiDAR',
        backend_args=backend_args))
val_evaluator = dict(
    type=NuScenesMetric,
    data_root=data_root,
    ann_file=data_root + 'nuscenes_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
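
As the comment at the top of this file warns, a new `point_cloud_range` must be mirrored wherever the old value was captured, on both the data and the model side. A hypothetical derived config illustrating the data-side half; the relative import and the new range are illustrative, and transforms nested inside `MultiScaleFlipAug3D` (as well as model-side anchor or voxel ranges) would need the same treatment:

from mmengine.config import read_base

with read_base():
    from .nus_3d import *  # assumed to sit next to this file

point_cloud_range = [-54, -54, -5, 54, 54, 3]  # hypothetical new range
for pipeline in (train_pipeline, test_pipeline, eval_pipeline):
    for transform in pipeline:
        if 'point_cloud_range' in transform:
            transform['point_cloud_range'] = point_cloud_range
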
mmdetection3d/mmdet3d/configs/_base_/datasets/nus_mono3d.py
# Copyright (c) OpenMMLab. All rights reserved.
from mmcv.transforms.processing import Resize
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.nuscenes_dataset import NuScenesDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadImageFromFileMono3D)
from mmdet3d.datasets.transforms.transforms_3d import RandomFlip3D
from mmdet3d.evaluation.metrics.nuscenes_metric import NuScenesMetric
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

dataset_type = 'NuScenesDataset'
data_root = 'data/nuscenes/'
class_names = [
    'car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle',
    'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
]
metainfo = dict(classes=class_names)
# Input modality for nuScenes dataset, this is consistent with the submission
# format which requires the information in input_modality.
input_modality = dict(use_lidar=False, use_camera=True)

# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)

# data_root = 's3://openmmlab/datasets/detection3d/nuscenes/'

# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

train_pipeline = [
    dict(type=LoadImageFromFileMono3D, backend_args=backend_args),
    dict(
        type=LoadAnnotations3D,
        with_bbox=True,
        with_label=True,
        with_attr_label=True,
        with_bbox_3d=True,
        with_label_3d=True,
        with_bbox_depth=True),
    dict(type=Resize, scale=(1600, 900), keep_ratio=True),
    dict(type=RandomFlip3D, flip_ratio_bev_horizontal=0.5),
    dict(
        type=Pack3DDetInputs,
        keys=[
            'img', 'gt_bboxes', 'gt_bboxes_labels', 'attr_labels',
            'gt_bboxes_3d', 'gt_labels_3d', 'centers_2d', 'depths'
        ]),
]
test_pipeline = [
    dict(type=LoadImageFromFileMono3D, backend_args=backend_args),
    dict(type=Resize, scale=(1600, 900), keep_ratio=True),
    dict(type=Pack3DDetInputs, keys=['img'])
]
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=NuScenesDataset,
        data_root=data_root,
        data_prefix=dict(
            pts='',
            CAM_FRONT='samples/CAM_FRONT',
            CAM_FRONT_LEFT='samples/CAM_FRONT_LEFT',
            CAM_FRONT_RIGHT='samples/CAM_FRONT_RIGHT',
            CAM_BACK='samples/CAM_BACK',
            CAM_BACK_RIGHT='samples/CAM_BACK_RIGHT',
            CAM_BACK_LEFT='samples/CAM_BACK_LEFT'),
        ann_file='nuscenes_infos_train.pkl',
        load_type='mv_image_based',
        pipeline=train_pipeline,
        metainfo=metainfo,
        modality=input_modality,
        test_mode=False,
        # we use box_type_3d='Camera' in monocular 3d
        # detection task
        box_type_3d='Camera',
        use_valid_flag=True,
        backend_args=backend_args))
val_dataloader = dict(
    batch_size=1,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=NuScenesDataset,
        data_root=data_root,
        data_prefix=dict(
            pts='',
            CAM_FRONT='samples/CAM_FRONT',
            CAM_FRONT_LEFT='samples/CAM_FRONT_LEFT',
            CAM_FRONT_RIGHT='samples/CAM_FRONT_RIGHT',
            CAM_BACK='samples/CAM_BACK',
            CAM_BACK_RIGHT='samples/CAM_BACK_RIGHT',
            CAM_BACK_LEFT='samples/CAM_BACK_LEFT'),
        ann_file='nuscenes_infos_val.pkl',
        load_type='mv_image_based',
        pipeline=test_pipeline,
        modality=input_modality,
        metainfo=metainfo,
        test_mode=True,
        box_type_3d='Camera',
        use_valid_flag=True,
        backend_args=backend_args))
test_dataloader = val_dataloader

val_evaluator = dict(
    type=NuScenesMetric,
    data_root=data_root,
    ann_file=data_root + 'nuscenes_infos_val.pkl',
    metric='bbox',
    backend_args=backend_args)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
mmdetection3d/mmdet3d/configs/_base_/datasets/s3dis_3d.py
# Copyright (c) OpenMMLab. All rights reserved.
from
mmengine.dataset.dataset_wrapper
import
ConcatDataset
,
RepeatDataset
from
mmengine.dataset.sampler
import
DefaultSampler
from
mmengine.visualization.vis_backend
import
LocalVisBackend
from
mmdet3d.datasets.s3dis_dataset
import
S3DISDataset
from
mmdet3d.datasets.transforms.formating
import
Pack3DDetInputs
from
mmdet3d.datasets.transforms.loading
import
(
LoadAnnotations3D
,
LoadPointsFromFile
,
NormalizePointsColor
)
from
mmdet3d.datasets.transforms.test_time_aug
import
MultiScaleFlipAug3D
from
mmdet3d.datasets.transforms.transforms_3d
import
(
GlobalRotScaleTrans
,
PointSample
,
RandomFlip3D
)
from
mmdet3d.evaluation.metrics.indoor_metric
import
IndoorMetric
from
mmdet3d.visualization.local_visualizer
import
Det3DLocalVisualizer
# dataset settings
dataset_type
=
'S3DISDataset'
data_root
=
'data/s3dis/'
# Example to use different file client
# Method 1: simply set the data root and let the file I/O module
# automatically infer from prefix (not support LMDB and Memcache yet)
# data_root = 's3://openmmlab/datasets/detection3d/s3dis/'
# Method 2: Use backend_args, file_client_args in versions before 1.1.0
# backend_args = dict(
# backend='petrel',
# path_mapping=dict({
# './data/': 's3://openmmlab/datasets/detection3d/',
# 'data/': 's3://openmmlab/datasets/detection3d/'
# }))
backend_args
=
None
metainfo
=
dict
(
classes
=
(
'table'
,
'chair'
,
'sofa'
,
'bookcase'
,
'board'
))
train_area
=
[
1
,
2
,
3
,
4
,
6
]
test_area
=
5
train_pipeline
=
[
dict
(
type
=
LoadPointsFromFile
,
coord_type
=
'DEPTH'
,
shift_height
=
False
,
use_color
=
True
,
load_dim
=
6
,
use_dim
=
[
0
,
1
,
2
,
3
,
4
,
5
],
backend_args
=
backend_args
),
dict
(
type
=
LoadAnnotations3D
,
with_bbox_3d
=
True
,
with_label_3d
=
True
),
dict
(
type
=
PointSample
,
num_points
=
100000
),
dict
(
type
=
RandomFlip3D
,
sync_2d
=
False
,
flip_ratio_bev_horizontal
=
0.5
,
flip_ratio_bev_vertical
=
0.5
),
dict
(
type
=
GlobalRotScaleTrans
,
rot_range
=
[
-
0.087266
,
0.087266
],
scale_ratio_range
=
[
0.9
,
1.1
],
translation_std
=
[.
1
,
.
1
,
.
1
],
shift_height
=
False
),
dict
(
type
=
NormalizePointsColor
,
color_mean
=
None
),
dict
(
type
=
Pack3DDetInputs
,
keys
=
[
'points'
,
'gt_bboxes_3d'
,
'gt_labels_3d'
])
]
test_pipeline
=
[
dict
(
type
=
LoadPointsFromFile
,
coord_type
=
'DEPTH'
,
shift_height
=
False
,
use_color
=
True
,
load_dim
=
6
,
use_dim
=
[
0
,
1
,
2
,
3
,
4
,
5
],
backend_args
=
backend_args
),
dict
(
type
=
MultiScaleFlipAug3D
,
img_scale
=
(
1333
,
800
),
pts_scale_ratio
=
1
,
flip
=
False
,
transforms
=
[
dict
(
type
=
GlobalRotScaleTrans
,
rot_range
=
[
0
,
0
],
scale_ratio_range
=
[
1.
,
1.
],
translation_std
=
[
0
,
0
,
0
]),
dict
(
type
=
RandomFlip3D
,
sync_2d
=
False
,
flip_ratio_bev_horizontal
=
0.5
,
flip_ratio_bev_vertical
=
0.5
),
dict
(
type
=
PointSample
,
num_points
=
100000
),
dict
(
type
=
NormalizePointsColor
,
color_mean
=
None
),
]),
dict
(
type
=
Pack3DDetInputs
,
keys
=
[
'points'
])
]
train_dataloader = dict(
    batch_size=8,
    num_workers=4,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=RepeatDataset,
        times=13,
        dataset=dict(
            type=ConcatDataset,
            datasets=[
                dict(
                    type=S3DISDataset,
                    data_root=data_root,
                    ann_file=f's3dis_infos_Area_{i}.pkl',
                    pipeline=train_pipeline,
                    filter_empty_gt=True,
                    metainfo=metainfo,
                    box_type_3d='Depth',
                    backend_args=backend_args) for i in train_area
            ])))
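# RepeatDataset iterates the concatenated five training areas 13 times within
# one epoch, which avoids repeatedly restarting dataloader workers between
# otherwise very short epochs (see mmengine's dataset wrappers).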
val_dataloader = dict(
    batch_size=1,
    num_workers=1,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=S3DISDataset,
        data_root=data_root,
        ann_file=f's3dis_infos_Area_{test_area}.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        test_mode=True,
        box_type_3d='Depth',
        backend_args=backend_args))
test_dataloader = dict(
    batch_size=1,
    num_workers=1,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=S3DISDataset,
        data_root=data_root,
        ann_file=f's3dis_infos_Area_{test_area}.pkl',
        pipeline=test_pipeline,
        metainfo=metainfo,
        test_mode=True,
        box_type_3d='Depth',
        backend_args=backend_args))
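# Area 5 serves as both the validation and the test set here, hence the
# identical val/test dataloaders.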
val_evaluator = dict(type=IndoorMetric)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')
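# Usage sketch (illustrative, not part of this file): configs in the
# pure-Python style typically inherit this base via mmengine's read_base,
# e.g.
#   from mmengine.config import read_base
#   with read_base():
#       from .._base_.datasets.s3dis_3d import *
# after which any symbol above (train_dataloader, visualizer, ...) can be
# overridden in the child config.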
mmdetection3d/mmdet3d/configs/_base_/datasets/s3dis_seg.py
new file mode 100644
# Copyright (c) OpenMMLab. All rights reserved.
from mmcv.transforms.processing import TestTimeAug
from mmengine.dataset.sampler import DefaultSampler
from mmengine.visualization.vis_backend import LocalVisBackend

from mmdet3d.datasets.s3dis_dataset import S3DISSegDataset
from mmdet3d.datasets.transforms.formating import Pack3DDetInputs
from mmdet3d.datasets.transforms.loading import (LoadAnnotations3D,
                                                 LoadPointsFromFile,
                                                 NormalizePointsColor,
                                                 PointSegClassMapping)
from mmdet3d.datasets.transforms.transforms_3d import (IndoorPatchPointSample,
                                                       RandomFlip3D)
from mmdet3d.evaluation.metrics.seg_metric import SegMetric
from mmdet3d.models.segmentors.seg3d_tta import Seg3DTTAModel
from mmdet3d.visualization.local_visualizer import Det3DLocalVisualizer

# For S3DIS seg we usually do 13-class segmentation
class_names = ('ceiling', 'floor', 'wall', 'beam', 'column', 'window', 'door',
               'table', 'chair', 'sofa', 'bookcase', 'board', 'clutter')
metainfo = dict(classes=class_names)
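# len(class_names) == 13, so the extra index 13 is reserved as the ignore
# label for points that should not contribute to the loss or metrics (see the
# ignore_index arguments below).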
dataset_type = 'S3DISSegDataset'
data_root = 'data/s3dis/'

input_modality = dict(use_lidar=True, use_camera=False)
data_prefix = dict(
    pts='points',
    pts_instance_mask='instance_mask',
    pts_semantic_mask='semantic_mask')

# Example of using a different file client.
# Method 1: simply set the data root and let the file I/O module
# infer the backend from the path prefix (LMDB and Memcached are
# not supported yet).
# data_root = 's3://openmmlab/datasets/detection3d/s3dis/'
# Method 2: use backend_args (file_client_args in versions before 1.1.0):
# backend_args = dict(
#     backend='petrel',
#     path_mapping=dict({
#         './data/': 's3://openmmlab/datasets/detection3d/',
#         'data/': 's3://openmmlab/datasets/detection3d/'
#     }))
backend_args = None

num_points = 4096
train_area = [1, 2, 3, 4, 6]
test_area = 5
train_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5],
        backend_args=backend_args),
    dict(
        type=LoadAnnotations3D,
        with_bbox_3d=False,
        with_label_3d=False,
        with_mask_3d=False,
        with_seg_3d=True,
        backend_args=backend_args),
    dict(type=PointSegClassMapping),
    dict(
        type=IndoorPatchPointSample,
        num_points=num_points,
        block_size=1.0,
        ignore_index=len(class_names),
        use_normalized_coord=True,
        enlarge_size=0.2,
        min_unique_num=None),
    dict(type=NormalizePointsColor, color_mean=None),
    dict(type=Pack3DDetInputs, keys=['points', 'pts_semantic_mask'])
]
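# IndoorPatchPointSample crops a block_size x block_size (1 m x 1 m) patch in
# the ground plane and samples num_points (4096) points from it; with
# use_normalized_coord=True the normalized xyz coordinates are appended as
# extra point features.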
test_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5],
        backend_args=backend_args),
    dict(
        type=LoadAnnotations3D,
        with_bbox_3d=False,
        with_label_3d=False,
        with_mask_3d=False,
        with_seg_3d=True,
        backend_args=backend_args),
    dict(type=NormalizePointsColor, color_mean=None),
    dict(type=Pack3DDetInputs, keys=['points'])
]
# Construct a pipeline for data and GT loading in the show function.
# Keep its loading behaviour consistent with test_pipeline (e.g. the same
# file client); note that the GT seg mask is needed there as well.
eval_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5],
        backend_args=backend_args),
    dict(type=NormalizePointsColor, color_mean=None),
    dict(type=Pack3DDetInputs, keys=['points'])
]
tta_pipeline = [
    dict(
        type=LoadPointsFromFile,
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5],
        backend_args=backend_args),
    dict(
        type=LoadAnnotations3D,
        with_bbox_3d=False,
        with_label_3d=False,
        with_mask_3d=False,
        with_seg_3d=True,
        backend_args=backend_args),
    dict(type=NormalizePointsColor, color_mean=None),
    dict(
        type=TestTimeAug,
        transforms=[[
            dict(
                type=RandomFlip3D,
                sync_2d=False,
                flip_ratio_bev_horizontal=0.,
                flip_ratio_bev_vertical=0.)
        ], [dict(type=Pack3DDetInputs, keys=['points'])]])
]
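# TestTimeAug yields one output per combination of the nested transform lists;
# with both flip ratios set to 0. this base config effectively performs a
# single un-flipped pass, and task-specific configs may override the list to
# add real augmentations.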
# train on areas 1, 2, 3, 4, 6; test on area 5
train_dataloader = dict(
    batch_size=8,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type=DefaultSampler, shuffle=True),
    dataset=dict(
        type=S3DISSegDataset,
        data_root=data_root,
        ann_files=[f's3dis_infos_Area_{i}.pkl' for i in train_area],
        metainfo=metainfo,
        data_prefix=data_prefix,
        pipeline=train_pipeline,
        modality=input_modality,
        ignore_index=len(class_names),
        scene_idxs=[
            f'seg_info/Area_{i}_resampled_scene_idxs.npy' for i in train_area
        ],
        test_mode=False,
        backend_args=backend_args))
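# The scene_idxs files, produced by the S3DIS data-preparation scripts,
# resample scenes (roughly in proportion to their point counts) so that patch
# sampling visits large rooms more often.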
test_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type=DefaultSampler, shuffle=False),
    dataset=dict(
        type=S3DISSegDataset,
        data_root=data_root,
        ann_files=f's3dis_infos_Area_{test_area}.pkl',
        metainfo=metainfo,
        data_prefix=data_prefix,
        pipeline=test_pipeline,
        modality=input_modality,
        ignore_index=len(class_names),
        scene_idxs=f'seg_info/Area_{test_area}_resampled_scene_idxs.npy',
        test_mode=True,
        backend_args=backend_args))
val_dataloader = test_dataloader
val_evaluator = dict(type=SegMetric)
test_evaluator = val_evaluator

vis_backends = [dict(type=LocalVisBackend)]
visualizer = dict(
    type=Det3DLocalVisualizer, vis_backends=vis_backends, name='visualizer')

tta_model = dict(type=Seg3DTTAModel)
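# Usage sketch (illustrative): with tta_model and tta_pipeline defined,
# test-time augmentation is typically switched on from the command line,
# e.g.
#   python tools/test.py ${CONFIG} ${CHECKPOINT} --tta
# (flag name as in recent MMDetection3D releases; check your version's
# tools/test.py if it differs).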