Unverified Commit 02ac3e17 authored by Shaoshuai Shi's avatar Shaoshuai Shi Committed by GitHub

Support multi-modal 3D detection on NuScenes #1339

Add support for multi-modal NuScenes Detection
parents ad9c25c0 fcfa0773
......@@ -10,6 +10,7 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
* `OpenPCDet` has been updated to `v0.6.0` (Sep. 2022).
* The code of PV-RCNN++ has been supported.
* The code of MPPNet has been supported.
* Multi-modal 3D detection approaches on NuScenes have been supported.
## Overview
- [Changelog](#changelog)
......@@ -22,10 +23,15 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
## Changelog
[2023-05-13] **NEW:** Added support for multi-modal 3D object detection models on the NuScenes dataset.
* Support multi-modal NuScenes detection (see [GETTING_STARTED.md](docs/GETTING_STARTED.md) for how to process the data).
* Support the [TransFusion-Lidar](https://arxiv.org/abs/2203.11496) head, which achieves 69.43% NDS on the NuScenes validation set.
* Support [`BEVFusion`](https://arxiv.org/abs/2205.13542), which fuses multi-modal information in BEV space and reaches 70.98% NDS on the NuScenes validation set (see the [guideline](docs/guidelines_of_approaches/bevfusion.md) on how to train/test with BEVFusion).
[2023-04-02] Added support for [`VoxelNeXt`](https://arxiv.org/abs/2303.11301) on Nuscenes, Waymo, and Argoverse2 datasets. It is a fully sparse 3D object detection network, which is a clean sparse CNNs network and predicts 3D objects directly upon voxels.
[2022-09-02] **NEW:** Update `OpenPCDet` to v0.6.0:
* Official code release of [`MPPNet`](https://arxiv.org/abs/2205.05979) for temporal 3D object detection, which supports long-term multi-frame 3D object detection and ranks 1st place on the [3D detection leaderboard](https://waymo.com/open/challenges/2020/3d-detection) of the Waymo Open Dataset as of Sept. 2nd, 2022. On the validation set, MPPNet achieves 74.96%, 75.06% and 74.52% mAPH@Level_2 for the vehicle, pedestrian and cyclist classes, respectively (see the [guideline](docs/guidelines_of_approaches/mppnet.md) on how to train/test with MPPNet).
* Support multi-frame training/testing on Waymo Open Dataset (see the [change log](docs/changelog.md) for more details on how to process data).
* Support saving training details (e.g., loss, iter, epoch) to file (the previous tqdm progress bar is still supported via `--use_tqdm_to_record`). Please run `pip install gpustat` if you also want to log GPU-related information.
* Support saving the latest model every 5 minutes, so you can resume training from the latest status instead of the previous epoch.
......@@ -38,10 +44,10 @@ It is also the official code release of [`[PointRCNN]`](https://arxiv.org/abs/18
[2022-02-07] Added support for Centerpoint models on Nuscenes Dataset.
[2022-01-14] Added support for dynamic pillar voxelization, following the implementation proposed in [`H^23D R-CNN`](https://arxiv.org/abs/2107.14391) with unique operation and [`torch_scatter`](https://github.com/rusty1s/pytorch_scatter) package.
[2022-01-05] **NEW:** Update `OpenPCDet` to v0.5.2:
* The code of [`PV-RCNN++`](https://arxiv.org/abs/2102.00463) has been released to this repo, with higher performance, faster training/inference speed and less memory consumption than PV-RCNN.
* Add performance of several models trained with full training set of [Waymo Open Dataset](#waymo-open-dataset-baselines).
* Support Lyft dataset, see the pull request [here](https://github.com/open-mmlab/OpenPCDet/pull/720).
......@@ -199,7 +205,7 @@ We could not provide the above pretrained models due to [Waymo Dataset License A
but you could easily achieve similar performance by training with the default configs.
### NuScenes 3D Object Detection Baselines
All models are trained with 8 GPUs and are available for download. For training BEVFusion, please refer to the [guideline](docs/guidelines_of_approaches/bevfusion.md).
| | mATE | mASE | mAOE | mAVE | mAAE | mAP | NDS | download |
|----------------------------------------------------------------------------------------------------|-------:|:------:|:------:|:-----:|:-----:|:-----:|:------:|:--------------------------------------------------------------------------------------------------:|
......@@ -209,7 +215,10 @@ All models are trained with 8 GTX 1080Ti GPUs and are available for download.
| [CenterPoint (voxel_size=0.1)](tools/cfgs/nuscenes_models/cbgs_voxel01_res3d_centerpoint.yaml) | 30.11 | 25.55 | 38.28 | 21.94 | 18.87 | 56.03 | 64.54 | [model-34M](https://drive.google.com/file/d/1Cz-J1c3dw7JAWc25KRG1XQj8yCaOlexQ/view?usp=sharing) |
| [CenterPoint (voxel_size=0.075)](tools/cfgs/nuscenes_models/cbgs_voxel0075_res3d_centerpoint.yaml) | 28.80 | 25.43 | 37.27 | 21.55 | 18.24 | 59.22 | 66.48 | [model-34M](https://drive.google.com/file/d/1XOHAWm1MPkCKr1gqmc3TWi5AYZgPsgxU/view?usp=sharing) |
| [VoxelNeXt (voxel_size=0.075)](tools/cfgs/nuscenes_models/cbgs_voxel0075_voxelnext.yaml) | 30.11 | 25.23 | 40.57 | 21.69 | 18.56 | 60.53 | 66.65 | [model-31M](https://drive.google.com/file/d/1IV7e7G9X-61KXSjMGtQo579pzDNbhwvf/view?usp=share_link) |
| [TransFusion-L*](tools/cfgs/nuscenes_models/transfusion_lidar.yaml) | 27.96 | 25.37 | 29.35 | 27.31 | 18.55 | 64.58 | 69.43 | [model-32M](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link) |
| [BEVFusion](tools/cfgs/nuscenes_models/bevfusion.yaml) | 28.03 | 25.43 | 30.19 | 26.76 | 18.48 | 67.75 | 70.98 | [model-157M](https://drive.google.com/file/d/1X50b-8immqlqD8VPAUkSKI0Ls-4k37g9/view?usp=share_link) |
*: Trained with the fade strategy, which disables data augmentations in the last several epochs of training.
### ONCE 3D Object Detection Baselines
All models are trained with 8 GPUs.
......
......@@ -53,9 +53,16 @@ pip install nuscenes-devkit==1.0.5
* Generate the data infos by running the following command (it may take several hours):
```shell
# for lidar-only setting
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
--cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
--version v1.0-trainval
# for multi-modal setting
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
--cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
--version v1.0-trainval \
--with_cam
```
### Waymo Open Dataset
......
## Installation
Please refer to [INSTALL.md](../INSTALL.md) for the installation of `OpenPCDet`.
* We recommend that users check their pillow version and use pillow==8.4.0 to avoid a bug in BEV pooling.
## Data Preparation
Please refer to [GETTING_STARTED.md](../GETTING_STARTED.md) to process the multi-modal Nuscenes Dataset.
## Training
1. Train the lidar branch for BEVFusion:
```shell
bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/transfusion_lidar.yaml \
```
The ckpt will be saved in ../output/nuscenes_models/transfusion_lidar/default/ckpt, or you can download the pretrained checkpoint directly from [here](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link).
2. To train BEVFusion, you need to download the pretrained parameters for the image backbone from [here](https://drive.google.com/file/d/1v74WCt4_5ubjO7PciA5T0xhQc9bz_jZu/view?usp=share_link), and specify the path in the [config](../../tools/cfgs/nuscenes_models/bevfusion.yaml#L88). Then run the following command:
```shell
bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/bevfusion.yaml \
--pretrained_model path_to_pretrained_lidar_branch_ckpt \
```
## Evaluation
* Test with a pretrained model:
```shell
bash scripts/dist_test.sh ${NUM_GPUS} --cfg_file cfgs/nuscenes_models/bevfusion.yaml \
--ckpt ../output/cfgs/nuscenes_models/bevfusion/default/ckpt/checkpoint_epoch_6.pth
```
## Performance
All models are trained with spconv 1.0, but you can directly load them for testing regardless of the spconv version.
| | mATE | mASE | mAOE | mAVE | mAAE | mAP | NDS | download |
|----------------------------------------------------------------------------------------------------|-------:|:------:|:------:|:-----:|:-----:|:-----:|:------:|:--------------------------------------------------------------------------------------------------:|
| [TransFusion-L](../../tools/cfgs/nuscenes_models/transfusion_lidar.yaml) | 27.96 | 25.37 | 29.35 | 27.31 | 18.55 | 64.58 | 69.43 | [model-32M](https://drive.google.com/file/d/1cuZ2qdDnxSwTCsiXWwbqCGF-uoazTXbz/view?usp=share_link) |
| [BEVFusion](../../tools/cfgs/nuscenes_models/bevfusion.yaml) | 28.03 | 25.43 | 30.19 | 26.76 | 18.48 | 67.75 | 70.98 | [model-157M](https://drive.google.com/file/d/1X50b-8immqlqD8VPAUkSKI0Ls-4k37g9/view?usp=share_link) |
from functools import partial
import numpy as np
from PIL import Image
from ...utils import common_utils
from . import augmentor_utils, database_sampler
......@@ -23,6 +24,18 @@ class DataAugmentor(object):
cur_augmentor = getattr(self, cur_cfg.NAME)(config=cur_cfg)
self.data_augmentor_queue.append(cur_augmentor)
def disable_augmentation(self, augmentor_configs):
self.data_augmentor_queue = []
aug_config_list = augmentor_configs if isinstance(augmentor_configs, list) \
else augmentor_configs.AUG_CONFIG_LIST
for cur_cfg in aug_config_list:
if not isinstance(augmentor_configs, list):
if cur_cfg.NAME in augmentor_configs.DISABLE_AUG_LIST:
continue
cur_augmentor = getattr(self, cur_cfg.NAME)(config=cur_cfg)
self.data_augmentor_queue.append(cur_augmentor)
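# Hypothetical usage sketch (`train_set`, `cur_epoch`, `total_epochs` and `num_fade_epochs` are
# assumed names, not defined here): disable_augmentation() rebuilds the augmentor queue while
# skipping every entry named in DISABLE_AUG_LIST, which is how a "fade" schedule can switch off
# augmentations for the last few training epochs.
from easydict import EasyDict

fade_cfg = EasyDict({
    'AUG_CONFIG_LIST': train_set.dataset_cfg.DATA_AUGMENTOR.AUG_CONFIG_LIST,
    'DISABLE_AUG_LIST': ['gt_sampling', 'random_world_flip'],  # augmentors to drop
})
if cur_epoch >= total_epochs - num_fade_epochs:  # e.g. disable for the last 5 epochs
    train_set.data_augmentor.disable_augmentation(fade_cfg)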
def gt_sampling(self, config=None):
db_sampler = database_sampler.DataBaseSampler(
root_path=self.root_path,
......@@ -139,6 +152,7 @@ class DataAugmentor(object):
data_dict['gt_boxes'] = gt_boxes
data_dict['points'] = points
data_dict['noise_translate'] = noise_translate
return data_dict
def random_local_translation(self, data_dict=None, config=None):
......@@ -251,6 +265,28 @@ class DataAugmentor(object):
data_dict['points'] = points
return data_dict
def imgaug(self, data_dict=None, config=None):
if data_dict is None:
return partial(self.imgaug, config=config)
imgs = data_dict["camera_imgs"]
img_process_infos = data_dict['img_process_infos']
new_imgs = []
for img, img_process_info in zip(imgs, img_process_infos):
flip = False
if config.RAND_FLIP and np.random.choice([0, 1]):
flip = True
rotate = np.random.uniform(*config.ROT_LIM)
# aug images
if flip:
img = img.transpose(method=Image.FLIP_LEFT_RIGHT)
img = img.rotate(rotate)
img_process_info[2] = flip
img_process_info[3] = rotate
new_imgs.append(img)
data_dict["camera_imgs"] = new_imgs
return data_dict
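# Hypothetical config sketch (field names follow the code above; the values are illustrative
# assumptions rather than a shipped YAML): imgaug expects RAND_FLIP (bool) and ROT_LIM (rotation
# range in degrees) and writes the sampled flip/rotation back into slots 2 and 3 of each image's
# img_process_info, so image_calibrate() can later fold them into img_aug_matrix.
from easydict import EasyDict

imgaug_cfg = EasyDict({
    'NAME': 'imgaug',
    'ROT_LIM': [-5.4, 5.4],   # random rotation range in degrees
    'RAND_FLIP': True,        # random horizontal flip
})
# DataAugmentor resolves the augmentor via getattr(self, cfg.NAME)(config=cfg), so an entry like
# this simply goes into AUG_CONFIG_LIST of the dataset's DATA_AUGMENTOR config.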
def forward(self, data_dict):
"""
Args:
......
......@@ -2,6 +2,7 @@ from collections import defaultdict
from pathlib import Path
import numpy as np
import torch
import torch.utils.data as torch_data
from ..utils import common_utils
......@@ -130,6 +131,30 @@ class DatasetTemplate(torch_data.Dataset):
"""
raise NotImplementedError
def set_lidar_aug_matrix(self, data_dict):
"""
Get lidar augment matrix (4 x 4), which is used to recover the original point coordinates.
"""
lidar_aug_matrix = np.eye(4)
if 'flip_y' in data_dict.keys():
flip_x = data_dict['flip_x']
flip_y = data_dict['flip_y']
if flip_x:
lidar_aug_matrix[:3,:3] = np.array([[1, 0, 0], [0, -1, 0], [0, 0, 1]]) @ lidar_aug_matrix[:3,:3]
if flip_y:
lidar_aug_matrix[:3,:3] = np.array([[-1, 0, 0], [0, 1, 0], [0, 0, 1]]) @ lidar_aug_matrix[:3,:3]
if 'noise_rot' in data_dict.keys():
noise_rot = data_dict['noise_rot']
lidar_aug_matrix[:3,:3] = common_utils.angle2matrix(torch.tensor(noise_rot)) @ lidar_aug_matrix[:3,:3]
if 'noise_scale' in data_dict.keys():
noise_scale = data_dict['noise_scale']
lidar_aug_matrix[:3,:3] *= noise_scale
if 'noise_translate' in data_dict.keys():
noise_translate = data_dict['noise_translate']
lidar_aug_matrix[:3,3:4] = noise_translate.T
data_dict['lidar_aug_matrix'] = lidar_aug_matrix
return data_dict
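# Illustrative sketch (hypothetical point values): lidar_aug_matrix maps the ORIGINAL lidar
# coordinates to the augmented ones (p_aug = R @ p_orig + t), so the original coordinates can be
# recovered with its inverse.
import numpy as np

aug_xyz = data_dict['points'][:, :3]                      # point coordinates after augmentation
M = data_dict['lidar_aug_matrix']                         # 4 x 4 matrix built above
aug_h = np.concatenate([aug_xyz, np.ones((aug_xyz.shape[0], 1))], axis=1)  # homogeneous coords
orig_xyz = (np.linalg.inv(M) @ aug_h.T).T[:, :3]          # undo flip / rotation / scale / translation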
def prepare_data(self, data_dict):
"""
Args:
......@@ -165,6 +190,7 @@ class DatasetTemplate(torch_data.Dataset):
)
if 'calib' in data_dict:
data_dict['calib'] = calib
data_dict = self.set_lidar_aug_matrix(data_dict)
if data_dict.get('gt_boxes', None) is not None:
selected = common_utils.keep_arrays_by_name(data_dict['gt_names'], self.class_names)
data_dict['gt_boxes'] = data_dict['gt_boxes'][selected]
......@@ -287,6 +313,8 @@ class DatasetTemplate(torch_data.Dataset):
constant_values=pad_value)
points.append(points_pad)
ret[key] = np.stack(points, axis=0)
elif key in ['camera_imgs']:
ret[key] = torch.stack([torch.stack(imgs,dim=0) for imgs in val],dim=0)
else:
ret[key] = np.stack(val, axis=0)
except:
......
......@@ -8,6 +8,8 @@ from tqdm import tqdm
from ...ops.roiaware_pool3d import roiaware_pool3d_utils
from ...utils import common_utils
from ..dataset import DatasetTemplate
from pyquaternion import Quaternion
from PIL import Image
class NuScenesDataset(DatasetTemplate):
......@@ -17,6 +19,13 @@ class NuScenesDataset(DatasetTemplate):
dataset_cfg=dataset_cfg, class_names=class_names, training=training, root_path=root_path, logger=logger
)
self.infos = []
self.camera_config = self.dataset_cfg.get('CAMERA_CONFIG', None)
if self.camera_config is not None:
self.use_camera = self.camera_config.get('USE_CAMERA', True)
self.camera_image_config = self.camera_config.IMAGE
else:
self.use_camera = False
self.include_nuscenes_data(self.mode)
if self.training and self.dataset_cfg.get('BALANCED_RESAMPLING', False):
self.infos = self.balanced_infos_resampling(self.infos)
......@@ -108,6 +117,98 @@ class NuScenesDataset(DatasetTemplate):
points = np.concatenate((points, times), axis=1)
return points
def crop_image(self, input_dict):
W, H = input_dict["ori_shape"]
imgs = input_dict["camera_imgs"]
img_process_infos = []
crop_images = []
for img in imgs:
if self.training:
fH, fW = self.camera_image_config.FINAL_DIM
resize_lim = self.camera_image_config.RESIZE_LIM_TRAIN
resize = np.random.uniform(*resize_lim)
resize_dims = (int(W * resize), int(H * resize))
newW, newH = resize_dims
crop_h = newH - fH
crop_w = int(np.random.uniform(0, max(0, newW - fW)))
crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
else:
fH, fW = self.camera_image_config.FINAL_DIM
resize_lim = self.camera_image_config.RESIZE_LIM_TEST
resize = np.mean(resize_lim)
resize_dims = (int(W * resize), int(H * resize))
newW, newH = resize_dims
crop_h = newH - fH
crop_w = int(max(0, newW - fW) / 2)
crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)
# resize and crop the image
img = img.resize(resize_dims)
img = img.crop(crop)
crop_images.append(img)
img_process_infos.append([resize, crop, False, 0])
input_dict['img_process_infos'] = img_process_infos
input_dict['camera_imgs'] = crop_images
return input_dict
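# Worked test-time example of the arithmetic above (numbers assume BEVFusion-style settings
# FINAL_DIM = [256, 704] and RESIZE_LIM_TEST = [0.48, 0.48]; check the actual YAML):
#   original nuScenes image:  W, H = 1600, 900
#   resize = mean([0.48, 0.48]) = 0.48   ->  resize_dims = (768, 432)
#   crop_h = 432 - 256 = 176             (crops away the top of the resized image)
#   crop_w = int((768 - 704) / 2) = 32   (centered horizontally at test time)
#   crop   = (32, 176, 736, 432)         ->  final image is 704 x 256 (W x H)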
def load_camera_info(self, input_dict, info):
input_dict["image_paths"] = []
input_dict["lidar2camera"] = []
input_dict["lidar2image"] = []
input_dict["camera2ego"] = []
input_dict["camera_intrinsics"] = []
input_dict["camera2lidar"] = []
for _, camera_info in info["cams"].items():
input_dict["image_paths"].append(camera_info["data_path"])
# lidar to camera transform
lidar2camera_r = np.linalg.inv(camera_info["sensor2lidar_rotation"])
lidar2camera_t = (
camera_info["sensor2lidar_translation"] @ lidar2camera_r.T
)
lidar2camera_rt = np.eye(4).astype(np.float32)
lidar2camera_rt[:3, :3] = lidar2camera_r.T
lidar2camera_rt[3, :3] = -lidar2camera_t
input_dict["lidar2camera"].append(lidar2camera_rt.T)
# camera intrinsics
camera_intrinsics = np.eye(4).astype(np.float32)
camera_intrinsics[:3, :3] = camera_info["camera_intrinsics"]
input_dict["camera_intrinsics"].append(camera_intrinsics)
# lidar to image transform
lidar2image = camera_intrinsics @ lidar2camera_rt.T
input_dict["lidar2image"].append(lidar2image)
# camera to ego transform
camera2ego = np.eye(4).astype(np.float32)
camera2ego[:3, :3] = Quaternion(
camera_info["sensor2ego_rotation"]
).rotation_matrix
camera2ego[:3, 3] = camera_info["sensor2ego_translation"]
input_dict["camera2ego"].append(camera2ego)
# camera to lidar transform
camera2lidar = np.eye(4).astype(np.float32)
camera2lidar[:3, :3] = camera_info["sensor2lidar_rotation"]
camera2lidar[:3, 3] = camera_info["sensor2lidar_translation"]
input_dict["camera2lidar"].append(camera2lidar)
# read image
filename = input_dict["image_paths"]
images = []
for name in filename:
images.append(Image.open(str(self.root_path / name)))
input_dict["camera_imgs"] = images
input_dict["ori_shape"] = images[0].size
# resize and crop image
input_dict = self.crop_image(input_dict)
return input_dict
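# Illustrative sketch (hypothetical point value): lidar2image collapses the lidar->camera
# extrinsics and the camera intrinsics into one 4x4 matrix, so a lidar point is projected to
# pixel coordinates of the ORIGINAL (uncropped) image by a single matmul plus the perspective
# division; the crop/resize applied above is tracked separately in img_aug_matrix.
import numpy as np

pt_lidar = np.array([10.0, 2.0, -1.0, 1.0])        # homogeneous point in the lidar frame
lidar2image = input_dict["lidar2image"][0]         # 4 x 4 matrix of the first camera
proj = lidar2image @ pt_lidar
depth = proj[2]
if depth > 0:                                      # the point must be in front of the camera
    u, v = proj[0] / depth, proj[1] / depth        # pixel coordinates in the original image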
def __len__(self):
if self._merge_all_iters_to_one_epoch:
return len(self.infos) * self.total_epochs
......@@ -137,6 +238,8 @@ class NuScenesDataset(DatasetTemplate):
'gt_names': info['gt_names'] if mask is None else info['gt_names'][mask],
'gt_boxes': info['gt_boxes'] if mask is None else info['gt_boxes'][mask]
})
if self.use_camera:
input_dict = self.load_camera_info(input_dict, info)
data_dict = self.prepare_data(data_dict=input_dict)
......@@ -251,7 +354,7 @@ class NuScenesDataset(DatasetTemplate):
pickle.dump(all_db_infos, f)
def create_nuscenes_info(version, data_path, save_path, max_sweeps=10):
def create_nuscenes_info(version, data_path, save_path, max_sweeps=10, with_cam=False):
from nuscenes.nuscenes import NuScenes
from nuscenes.utils import splits
from . import nuscenes_utils
......@@ -283,7 +386,7 @@ def create_nuscenes_info(version, data_path, save_path, max_sweeps=10):
train_nusc_infos, val_nusc_infos = nuscenes_utils.fill_trainval_infos(
data_path=data_path, nusc=nusc, train_scenes=train_scenes, val_scenes=val_scenes,
test='test' in version, max_sweeps=max_sweeps
test='test' in version, max_sweeps=max_sweeps, with_cam=with_cam
)
if version == 'v1.0-test':
......@@ -308,6 +411,7 @@ if __name__ == '__main__':
parser.add_argument('--cfg_file', type=str, default=None, help='specify the config of dataset')
parser.add_argument('--func', type=str, default='create_nuscenes_infos', help='')
parser.add_argument('--version', type=str, default='v1.0-trainval', help='')
parser.add_argument('--with_cam', action='store_true', default=False, help='use camera or not')
args = parser.parse_args()
if args.func == 'create_nuscenes_infos':
......@@ -319,6 +423,7 @@ if __name__ == '__main__':
data_path=ROOT_DIR / 'data' / 'nuscenes',
save_path=ROOT_DIR / 'data' / 'nuscenes',
max_sweeps=dataset_cfg.MAX_SWEEPS,
with_cam=args.with_cam
)
nuscenes_dataset = NuScenesDataset(
......
......@@ -247,9 +247,69 @@ def quaternion_yaw(q: Quaternion) -> float:
yaw = np.arctan2(v[1], v[0])
return yaw
def obtain_sensor2top(
nusc, sensor_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, sensor_type="lidar"
):
"""Obtain the info with RT matric from general sensor to Top LiDAR.
Args:
nusc (class): Dataset class in the nuScenes dataset.
sensor_token (str): Sample data token corresponding to the
specific sensor type.
l2e_t (np.ndarray): Translation from lidar to ego in shape (1, 3).
l2e_r_mat (np.ndarray): Rotation matrix from lidar to ego
in shape (3, 3).
e2g_t (np.ndarray): Translation from ego to global in shape (1, 3).
e2g_r_mat (np.ndarray): Rotation matrix from ego to global
in shape (3, 3).
sensor_type (str): Sensor to calibrate. Default: 'lidar'.
Returns:
sweep (dict): Sweep information after transformation.
"""
sd_rec = nusc.get("sample_data", sensor_token)
cs_record = nusc.get("calibrated_sensor", sd_rec["calibrated_sensor_token"])
pose_record = nusc.get("ego_pose", sd_rec["ego_pose_token"])
data_path = str(nusc.get_sample_data_path(sd_rec["token"]))
# if os.getcwd() in data_path: # path from lyftdataset is absolute path
# data_path = data_path.split(f"{os.getcwd()}/")[-1] # relative path
sweep = {
"data_path": data_path,
"type": sensor_type,
"sample_data_token": sd_rec["token"],
"sensor2ego_translation": cs_record["translation"],
"sensor2ego_rotation": cs_record["rotation"],
"ego2global_translation": pose_record["translation"],
"ego2global_rotation": pose_record["rotation"],
"timestamp": sd_rec["timestamp"],
}
l2e_r_s = sweep["sensor2ego_rotation"]
l2e_t_s = sweep["sensor2ego_translation"]
e2g_r_s = sweep["ego2global_rotation"]
e2g_t_s = sweep["ego2global_translation"]
# obtain the RT from sensor to Top LiDAR
# sweep->ego->global->ego'->lidar
l2e_r_s_mat = Quaternion(l2e_r_s).rotation_matrix
e2g_r_s_mat = Quaternion(e2g_r_s).rotation_matrix
R = (l2e_r_s_mat.T @ e2g_r_s_mat.T) @ (
np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T
)
T = (l2e_t_s @ e2g_r_s_mat.T + e2g_t_s) @ (
np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T
)
T -= (
e2g_t @ (np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(l2e_r_mat).T)
+ l2e_t @ np.linalg.inv(l2e_r_mat).T
).squeeze(0)
sweep["sensor2lidar_rotation"] = R.T # points @ R.T + T
sweep["sensor2lidar_translation"] = T
return sweep
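# Illustrative sketch (hypothetical point cloud): per the "points @ R.T + T" comment above, the
# returned rotation/translation map points from the queried sensor frame (e.g. a camera) into
# the reference top-LiDAR frame.
import numpy as np

pts_sensor = np.random.rand(100, 3)                # points expressed in the sensor frame
R = sweep["sensor2lidar_rotation"]                 # (3, 3)
T = sweep["sensor2lidar_translation"]              # (3,)
pts_lidar = pts_sensor @ R.T + T                   # same points expressed in the lidar frame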
def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, max_sweeps=10, with_cam=False):
train_nusc_infos = []
val_nusc_infos = []
progress_bar = tqdm.tqdm(total=len(nusc.sample), desc='create_info', dynamic_ncols=True)
......@@ -291,6 +351,34 @@ def fill_trainval_infos(data_path, nusc, train_scenes, val_scenes, test=False, m
'car_from_global': car_from_global,
'timestamp': ref_time,
}
if with_cam:
info['cams'] = dict()
l2e_r = ref_cs_rec["rotation"]
l2e_t = ref_cs_rec["translation"],
e2g_r = ref_pose_rec["rotation"]
e2g_t = ref_pose_rec["translation"]
l2e_r_mat = Quaternion(l2e_r).rotation_matrix
e2g_r_mat = Quaternion(e2g_r).rotation_matrix
# obtain 6 image's information per frame
camera_types = [
"CAM_FRONT",
"CAM_FRONT_RIGHT",
"CAM_FRONT_LEFT",
"CAM_BACK",
"CAM_BACK_LEFT",
"CAM_BACK_RIGHT",
]
for cam in camera_types:
cam_token = sample["data"][cam]
cam_path, _, camera_intrinsics = nusc.get_sample_data(cam_token)
cam_info = obtain_sensor2top(
nusc, cam_token, l2e_t, l2e_r_mat, e2g_t, e2g_r_mat, cam
)
cam_info['data_path'] = Path(cam_info['data_path']).relative_to(data_path).__str__()
cam_info.update(camera_intrinsics=camera_intrinsics)
info["cams"].update({cam: cam_info})
sample_data_token = sample['data'][chan]
curr_sd_rec = nusc.get('sample_data', sample_data_token)
......
......@@ -2,7 +2,8 @@ from functools import partial
import numpy as np
from skimage import transform
import torch
import torchvision
from ...utils import box_utils, common_utils
tv = None
......@@ -228,6 +229,56 @@ class DataProcessor(object):
factors=(self.depth_downsample_factor, self.depth_downsample_factor)
)
return data_dict
def image_normalize(self, data_dict=None, config=None):
if data_dict is None:
return partial(self.image_normalize, config=config)
mean = config.mean
std = config.std
compose = torchvision.transforms.Compose(
[
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(mean=mean, std=std),
]
)
data_dict["camera_imgs"] = [compose(img) for img in data_dict["camera_imgs"]]
return data_dict
def image_calibrate(self,data_dict=None, config=None):
if data_dict is None:
return partial(self.image_calibrate, config=config)
img_process_infos = data_dict['img_process_infos']
transforms = []
for img_process_info in img_process_infos:
resize, crop, flip, rotate = img_process_info
rotation = torch.eye(2)
translation = torch.zeros(2)
# post-homography transformation
rotation *= resize
translation -= torch.Tensor(crop[:2])
if flip:
A = torch.Tensor([[-1, 0], [0, 1]])
b = torch.Tensor([crop[2] - crop[0], 0])
rotation = A.matmul(rotation)
translation = A.matmul(translation) + b
theta = rotate / 180 * np.pi
A = torch.Tensor(
[
[np.cos(theta), np.sin(theta)],
[-np.sin(theta), np.cos(theta)],
]
)
b = torch.Tensor([crop[2] - crop[0], crop[3] - crop[1]]) / 2
b = A.matmul(-b) + b
rotation = A.matmul(rotation)
translation = A.matmul(translation) + b
transform = torch.eye(4)
transform[:2, :2] = rotation
transform[:2, 3] = translation
transforms.append(transform.numpy())
data_dict["img_aug_matrix"] = transforms
return data_dict
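# Illustrative sketch (hypothetical pixel value): img_aug_matrix records the resize/crop/flip/
# rotation applied to each camera image, so a pixel (u, v) of the ORIGINAL image maps into the
# augmented image as  uv_aug = A[:2, :2] @ [u, v] + A[:2, 3].
import numpy as np

A = data_dict["img_aug_matrix"][0]                 # 4 x 4 matrix of the first camera
uv_orig = np.array([800.0, 450.0])                 # pixel in the original 1600 x 900 image
uv_aug = A[:2, :2] @ uv_orig + A[:2, 3]            # pixel in the resized/cropped/flipped image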
def forward(self, data_dict):
"""
......
......@@ -22,9 +22,11 @@ def build_network(model_cfg, num_class, dataset):
def load_data_to_gpu(batch_dict):
for key, val in batch_dict.items():
if not isinstance(val, np.ndarray):
if key == 'camera_imgs':
batch_dict[key] = val.cuda()
elif not isinstance(val, np.ndarray):
continue
elif key in ['frame_id', 'metadata', 'calib']:
elif key in ['frame_id', 'metadata', 'calib', 'image_paths','ori_shape','img_process_infos']:
continue
elif key in ['images']:
batch_dict[key] = kornia.image_to_tensor(val).float().cuda().contiguous()
......
......@@ -46,7 +46,7 @@ class BaseBEVBackbone(nn.Module):
self.blocks.append(nn.Sequential(*cur_layers))
if len(upsample_strides) > 0:
stride = upsample_strides[idx]
if stride >= 1:
if stride > 1 or (stride == 1 and not self.model_cfg.get('USE_CONV_FOR_NO_STRIDE', False)):
self.deblocks.append(nn.Sequential(
nn.ConvTranspose2d(
num_filters[idx], num_upsample_filters[idx],
......
from .convfuser import ConvFuser
__all__ = {
'ConvFuser':ConvFuser
}
\ No newline at end of file
import torch
from torch import nn
class ConvFuser(nn.Module):
def __init__(self,model_cfg) -> None:
super().__init__()
self.model_cfg = model_cfg
in_channel = self.model_cfg.IN_CHANNEL
out_channel = self.model_cfg.OUT_CHANNEL
self.conv = nn.Sequential(
nn.Conv2d(in_channel, out_channel, 3, padding=1, bias=False),
nn.BatchNorm2d(out_channel),
nn.ReLU(True)
)
def forward(self,batch_dict):
"""
Args:
batch_dict:
spatial_features_img (tensor): Bev features from image modality
spatial_features (tensor): Bev features from lidar modality
Returns:
batch_dict:
spatial_features (tensor): Bev features after multi-modal fusion
"""
img_bev = batch_dict['spatial_features_img']
lidar_bev = batch_dict['spatial_features']
cat_bev = torch.cat([img_bev,lidar_bev],dim=1)
mm_bev = self.conv(cat_bev)
batch_dict['spatial_features'] = mm_bev
return batch_dict
\ No newline at end of file
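# Hypothetical usage sketch (the channel numbers are assumptions; the real ones come from the
# model config): ConvFuser concatenates the camera and lidar BEV maps along the channel
# dimension and fuses them with a 3x3 conv + BN + ReLU.
import torch
from easydict import EasyDict

fuser = ConvFuser(EasyDict({'IN_CHANNEL': 80 + 256, 'OUT_CHANNEL': 256}))
batch_dict = {
    'spatial_features_img': torch.rand(2, 80, 180, 180),   # camera BEV features
    'spatial_features': torch.rand(2, 256, 180, 180),      # lidar BEV features
}
batch_dict = fuser(batch_dict)                             # fused 'spatial_features': (2, 256, 180, 180)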
......@@ -30,11 +30,12 @@ def post_act_block(in_channels, out_channels, kernel_size, indice_key=None, stri
class SparseBasicBlock(spconv.SparseModule):
expansion = 1
def __init__(self, inplanes, planes, stride=1, norm_fn=None, downsample=None, indice_key=None):
def __init__(self, inplanes, planes, stride=1, bias=None, norm_fn=None, downsample=None, indice_key=None):
super(SparseBasicBlock, self).__init__()
assert norm_fn is not None
bias = norm_fn is not None
if bias is None:
bias = norm_fn is not None
self.conv1 = spconv.SubMConv3d(
inplanes, planes, kernel_size=3, stride=stride, padding=1, bias=bias, indice_key=indice_key
)
......@@ -184,6 +185,7 @@ class VoxelResBackBone8x(nn.Module):
def __init__(self, model_cfg, input_channels, grid_size, **kwargs):
super().__init__()
self.model_cfg = model_cfg
use_bias = self.model_cfg.get('USE_BIAS', None)
norm_fn = partial(nn.BatchNorm1d, eps=1e-3, momentum=0.01)
self.sparse_shape = grid_size[::-1] + [1, 0, 0]
......@@ -196,29 +198,29 @@ class VoxelResBackBone8x(nn.Module):
block = post_act_block
self.conv1 = spconv.SparseSequential(
SparseBasicBlock(16, 16, norm_fn=norm_fn, indice_key='res1'),
SparseBasicBlock(16, 16, norm_fn=norm_fn, indice_key='res1'),
SparseBasicBlock(16, 16, bias=use_bias, norm_fn=norm_fn, indice_key='res1'),
SparseBasicBlock(16, 16, bias=use_bias, norm_fn=norm_fn, indice_key='res1'),
)
self.conv2 = spconv.SparseSequential(
# [1600, 1408, 41] <- [800, 704, 21]
block(16, 32, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv2', conv_type='spconv'),
SparseBasicBlock(32, 32, norm_fn=norm_fn, indice_key='res2'),
SparseBasicBlock(32, 32, norm_fn=norm_fn, indice_key='res2'),
SparseBasicBlock(32, 32, bias=use_bias, norm_fn=norm_fn, indice_key='res2'),
SparseBasicBlock(32, 32, bias=use_bias, norm_fn=norm_fn, indice_key='res2'),
)
self.conv3 = spconv.SparseSequential(
# [800, 704, 21] <- [400, 352, 11]
block(32, 64, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv3', conv_type='spconv'),
SparseBasicBlock(64, 64, norm_fn=norm_fn, indice_key='res3'),
SparseBasicBlock(64, 64, norm_fn=norm_fn, indice_key='res3'),
SparseBasicBlock(64, 64, bias=use_bias, norm_fn=norm_fn, indice_key='res3'),
SparseBasicBlock(64, 64, bias=use_bias, norm_fn=norm_fn, indice_key='res3'),
)
self.conv4 = spconv.SparseSequential(
# [400, 352, 11] <- [200, 176, 5]
block(64, 128, 3, norm_fn=norm_fn, stride=2, padding=(0, 1, 1), indice_key='spconv4', conv_type='spconv'),
SparseBasicBlock(128, 128, norm_fn=norm_fn, indice_key='res4'),
SparseBasicBlock(128, 128, norm_fn=norm_fn, indice_key='res4'),
SparseBasicBlock(128, 128, bias=use_bias, norm_fn=norm_fn, indice_key='res4'),
SparseBasicBlock(128, 128, bias=use_bias, norm_fn=norm_fn, indice_key='res4'),
)
last_pad = 0
......
from .swin import SwinTransformer
__all__ = {
'SwinTransformer':SwinTransformer,
}
\ No newline at end of file
from .generalized_lss import GeneralizedLSSFPN
__all__ = {
'GeneralizedLSSFPN':GeneralizedLSSFPN,
}
\ No newline at end of file
import torch
import torch.nn as nn
import torch.nn.functional as F
from ...model_utils.basic_block_2d import BasicBlock2D
class GeneralizedLSSFPN(nn.Module):
"""
This module implements FPN, which creates pyramid features built on top of some input feature maps.
This code is adapted from https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/necks/fpn.py with minimal modifications.
"""
def __init__(self, model_cfg):
super().__init__()
self.model_cfg = model_cfg
in_channels = self.model_cfg.IN_CHANNELS
out_channels = self.model_cfg.OUT_CHANNELS
num_ins = len(in_channels)
num_outs = self.model_cfg.NUM_OUTS
start_level = self.model_cfg.START_LEVEL
end_level = self.model_cfg.END_LEVEL
self.in_channels = in_channels
if end_level == -1:
self.backbone_end_level = num_ins - 1
else:
self.backbone_end_level = end_level
assert end_level <= len(in_channels)
assert num_outs == end_level - start_level
self.start_level = start_level
self.end_level = end_level
self.lateral_convs = nn.ModuleList()
self.fpn_convs = nn.ModuleList()
for i in range(self.start_level, self.backbone_end_level):
l_conv = BasicBlock2D(
in_channels[i] + (in_channels[i + 1] if i == self.backbone_end_level - 1 else out_channels),
out_channels, kernel_size=1, bias = False
)
fpn_conv = BasicBlock2D(out_channels,out_channels, kernel_size=3, padding=1, bias = False)
self.lateral_convs.append(l_conv)
self.fpn_convs.append(fpn_conv)
def forward(self, batch_dict):
"""
Args:
batch_dict:
image_features (list[tensor]): Multi-stage features from image backbone.
Returns:
batch_dict:
image_fpn (list(tensor)): FPN features.
"""
# upsample -> cat -> conv1x1 -> conv3x3
inputs = batch_dict['image_features']
assert len(inputs) == len(self.in_channels)
# build laterals
laterals = [inputs[i + self.start_level] for i in range(len(inputs))]
# build top-down path
used_backbone_levels = len(laterals) - 1
for i in range(used_backbone_levels - 1, -1, -1):
x = F.interpolate(
laterals[i + 1],
size=laterals[i].shape[2:],
mode='bilinear', align_corners=False,
)
laterals[i] = torch.cat([laterals[i], x], dim=1)
laterals[i] = self.lateral_convs[i](laterals[i])
laterals[i] = self.fpn_convs[i](laterals[i])
# build outputs
outs = [laterals[i] for i in range(used_backbone_levels)]
batch_dict['image_fpn'] = tuple(outs)
return batch_dict
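# Hypothetical usage sketch (config values chosen only to satisfy the assertions above; the
# shipped YAML may differ): the neck consumes the multi-scale image features from the backbone
# and emits NUM_OUTS fused levels, each with OUT_CHANNELS channels.
import torch
from easydict import EasyDict

fpn = GeneralizedLSSFPN(EasyDict({
    'IN_CHANNELS': [192, 384, 768],   # channels of the backbone stages fed to the neck
    'OUT_CHANNELS': 256,
    'NUM_OUTS': 2,
    'START_LEVEL': 0,
    'END_LEVEL': 2,
}))
# e.g. features of 6 camera images at strides 8 / 16 / 32 of a 256 x 704 input
feats = [torch.rand(6, c, 32 // 2 ** i, 88 // 2 ** i) for i, c in enumerate([192, 384, 768])]
outs = fpn({'image_features': feats})['image_fpn']   # tuple of 2 maps, each with 256 channels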
# Copyright (c) OpenMMLab. All rights reserved.
"""
Mostly copy-paste from
https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/backbones/swin.py
"""
import warnings
from collections import OrderedDict
from copy import deepcopy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp
from ..model_utils.swin_utils import swin_converter
from ..model_utils.swin_utils import PatchEmbed, PatchMerging
from ..model_utils.swin_utils import FFN, DropPath, to_2tuple, trunc_normal_, trunc_normal_init, constant_init
class WindowMSA(nn.Module):
"""Window based multi-head self-attention (W-MSA) module with relative
position bias.
Args:
embed_dims (int): Number of input channels.
num_heads (int): Number of attention heads.
window_size (tuple[int]): The height and width of the window.
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: True.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
attn_drop_rate (float, optional): Dropout ratio of attention weight.
Default: 0.0
proj_drop_rate (float, optional): Dropout ratio of output. Default: 0.
"""
def __init__(self,
embed_dims,
num_heads,
window_size,
qkv_bias=True,
qk_scale=None,
attn_drop_rate=0.,
proj_drop_rate=0.):
super().__init__()
self._is_init = False
self.embed_dims = embed_dims
self.window_size = window_size # Wh, Ww
self.num_heads = num_heads
head_embed_dims = embed_dims // num_heads
self.scale = qk_scale or head_embed_dims**-0.5
# define a parameter table of relative position bias
self.relative_position_bias_table = nn.Parameter(
torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1),
num_heads)) # 2*Wh-1 * 2*Ww-1, nH
# About 2x faster than original impl
Wh, Ww = self.window_size
rel_index_coords = self.double_step_seq(2 * Ww - 1, Wh, 1, Ww)
rel_position_index = rel_index_coords + rel_index_coords.T
rel_position_index = rel_position_index.flip(1).contiguous()
self.register_buffer('relative_position_index', rel_position_index)
self.qkv = nn.Linear(embed_dims, embed_dims * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop_rate)
self.proj = nn.Linear(embed_dims, embed_dims)
self.proj_drop = nn.Dropout(proj_drop_rate)
self.softmax = nn.Softmax(dim=-1)
def init_weights(self):
trunc_normal_(self.relative_position_bias_table, std=0.02)
def forward(self, x, mask=None):
"""
Args:
x (tensor): input features with shape of (num_windows*B, N, C)
mask (tensor | None, Optional): mask with shape of (num_windows,
Wh*Ww, Wh*Ww), value should be between (-inf, 0].
"""
B, N, C = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads,
C // self.num_heads).permute(2, 0, 3, 1, 4)
# make torchscript happy (cannot use tensor as tuple)
q, k, v = qkv[0], qkv[1], qkv[2]
q = q * self.scale
attn = (q @ k.transpose(-2, -1))
relative_position_bias = self.relative_position_bias_table[
self.relative_position_index.view(-1)].view(
self.window_size[0] * self.window_size[1],
self.window_size[0] * self.window_size[1],
-1) # Wh*Ww,Wh*Ww,nH
relative_position_bias = relative_position_bias.permute(
2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
attn = attn + relative_position_bias.unsqueeze(0)
if mask is not None:
nW = mask.shape[0]
attn = attn.view(B // nW, nW, self.num_heads, N,
N) + mask.unsqueeze(1).unsqueeze(0)
attn = attn.view(-1, self.num_heads, N, N)
attn = self.softmax(attn)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
@staticmethod
def double_step_seq(step1, len1, step2, len2):
seq1 = torch.arange(0, step1 * len1, step1)
seq2 = torch.arange(0, step2 * len2, step2)
return (seq1[:, None] + seq2[None, :]).reshape(1, -1)
class ShiftWindowMSA(nn.Module):
"""Shifted Window Multihead Self-Attention Module.
Args:
embed_dims (int): Number of input channels.
num_heads (int): Number of attention heads.
window_size (int): The height and width of the window.
shift_size (int, optional): The shift step of each window towards
right-bottom. If zero, act as regular window-msa. Defaults to 0.
qkv_bias (bool, optional): If True, add a learnable bias to q, k, v.
Default: True
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Defaults: None.
attn_drop_rate (float, optional): Dropout ratio of attention weight.
Defaults: 0.
proj_drop_rate (float, optional): Dropout ratio of output.
Defaults: 0.
dropout_layer (dict, optional): The dropout_layer used before output.
Defaults: dict(type='DropPath', drop_prob=0.).
"""
def __init__(self,
embed_dims,
num_heads,
window_size,
shift_size=0,
qkv_bias=True,
qk_scale=None,
attn_drop_rate=0,
proj_drop_rate=0,
dropout_layer=dict(type='DropPath', drop_prob=0.)):
super().__init__()
self._is_init = False
self.window_size = window_size
self.shift_size = shift_size
assert 0 <= self.shift_size < self.window_size
self.w_msa = WindowMSA(
embed_dims=embed_dims,
num_heads=num_heads,
window_size=to_2tuple(window_size),
qkv_bias=qkv_bias,
qk_scale=qk_scale,
attn_drop_rate=attn_drop_rate,
proj_drop_rate=proj_drop_rate,)
self.drop = DropPath(dropout_layer['drop_prob'])
def forward(self, query, hw_shape):
B, L, C = query.shape
H, W = hw_shape
assert L == H * W, 'input feature has wrong size'
query = query.view(B, H, W, C)
# pad feature maps to multiples of window size
pad_r = (self.window_size - W % self.window_size) % self.window_size
pad_b = (self.window_size - H % self.window_size) % self.window_size
query = F.pad(query, (0, 0, 0, pad_r, 0, pad_b))
H_pad, W_pad = query.shape[1], query.shape[2]
# cyclic shift
if self.shift_size > 0:
shifted_query = torch.roll(
query,
shifts=(-self.shift_size, -self.shift_size),
dims=(1, 2))
# calculate attention mask for SW-MSA
img_mask = torch.zeros((1, H_pad, W_pad, 1), device=query.device)
h_slices = (slice(0, -self.window_size),
slice(-self.window_size,
-self.shift_size), slice(-self.shift_size, None))
w_slices = (slice(0, -self.window_size),
slice(-self.window_size,
-self.shift_size), slice(-self.shift_size, None))
cnt = 0
for h in h_slices:
for w in w_slices:
img_mask[:, h, w, :] = cnt
cnt += 1
# nW, window_size, window_size, 1
mask_windows = self.window_partition(img_mask)
mask_windows = mask_windows.view(
-1, self.window_size * self.window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0,
float(-100.0)).masked_fill(
attn_mask == 0, float(0.0))
else:
shifted_query = query
attn_mask = None
# nW*B, window_size, window_size, C
query_windows = self.window_partition(shifted_query)
# nW*B, window_size*window_size, C
query_windows = query_windows.view(-1, self.window_size**2, C)
# W-MSA/SW-MSA (nW*B, window_size*window_size, C)
attn_windows = self.w_msa(query_windows, mask=attn_mask)
# merge windows
attn_windows = attn_windows.view(-1, self.window_size,
self.window_size, C)
# B H' W' C
shifted_x = self.window_reverse(attn_windows, H_pad, W_pad)
# reverse cyclic shift
if self.shift_size > 0:
x = torch.roll(
shifted_x,
shifts=(self.shift_size, self.shift_size),
dims=(1, 2))
else:
x = shifted_x
if pad_r > 0 or pad_b:
x = x[:, :H, :W, :].contiguous()
x = x.view(B, H * W, C)
x = self.drop(x)
return x
def window_reverse(self, windows, H, W):
"""
Args:
windows: (num_windows*B, window_size, window_size, C)
H (int): Height of image
W (int): Width of image
Returns:
x: (B, H, W, C)
"""
window_size = self.window_size
B = int(windows.shape[0] / (H * W / window_size / window_size))
x = windows.view(B, H // window_size, W // window_size, window_size,
window_size, -1)
x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
return x
def window_partition(self, x):
"""
Args:
x: (B, H, W, C)
Returns:
windows: (num_windows*B, window_size, window_size, C)
"""
B, H, W, C = x.shape
window_size = self.window_size
x = x.view(B, H // window_size, window_size, W // window_size,
window_size, C)
windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
windows = windows.view(-1, window_size, window_size, C)
return windows
class SwinBlock(nn.Module):
""""
Args:
embed_dims (int): The feature dimension.
num_heads (int): Parallel attention heads.
feedforward_channels (int): The hidden dimension for FFNs.
window_size (int, optional): The local window scale. Default: 7.
shift (bool, optional): whether to shift window or not. Default False.
qkv_bias (bool, optional): enable bias for qkv if True. Default: True.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
drop_rate (float, optional): Dropout rate. Default: 0.
attn_drop_rate (float, optional): Attention dropout rate. Default: 0.
drop_path_rate (float, optional): Stochastic depth rate. Default: 0.
act_cfg (dict, optional): The config dict of activation function.
Default: dict(type='GELU').
norm_cfg (dict, optional): The config dict of normalization.
Default: dict(type='LN').
with_cp (bool, optional): Use checkpoint or not. Using checkpoint
will save some memory while slowing down the training speed.
Default: False.
"""
def __init__(self,
embed_dims,
num_heads,
feedforward_channels,
window_size=7,
shift=False,
qkv_bias=True,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.,
act_cfg=dict(type='GELU'),
norm_cfg=dict(type='LN'),
with_cp=False,):
super(SwinBlock, self).__init__()
self._is_init = False
self.with_cp = with_cp
self.norm1 = nn.LayerNorm(embed_dims)
self.attn = ShiftWindowMSA(
embed_dims=embed_dims,
num_heads=num_heads,
window_size=window_size,
shift_size=window_size // 2 if shift else 0,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
attn_drop_rate=attn_drop_rate,
proj_drop_rate=drop_rate,
dropout_layer=dict(type='DropPath', drop_prob=drop_path_rate),)
self.norm2 = nn.LayerNorm(embed_dims)
self.ffn = FFN(
embed_dims=embed_dims,
feedforward_channels=feedforward_channels,
num_fcs=2,
ffn_drop=drop_rate,
dropout_layer=dict(type='DropPath', drop_prob=drop_path_rate),
act_cfg=act_cfg,
add_identity=True,)
def forward(self, x, hw_shape):
def _inner_forward(x):
identity = x
x = self.norm1(x)
x = self.attn(x, hw_shape)
x = x + identity
identity = x
x = self.norm2(x)
x = self.ffn(x, identity=identity)
return x
if self.with_cp and x.requires_grad:
x = cp.checkpoint(_inner_forward, x)
else:
x = _inner_forward(x)
return x
class SwinBlockSequence(nn.Module):
"""Implements one stage in Swin Transformer.
Args:
embed_dims (int): The feature dimension.
num_heads (int): Parallel attention heads.
feedforward_channels (int): The hidden dimension for FFNs.
depth (int): The number of blocks in this stage.
window_size (int, optional): The local window scale. Default: 7.
qkv_bias (bool, optional): enable bias for qkv if True. Default: True.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
drop_rate (float, optional): Dropout rate. Default: 0.
attn_drop_rate (float, optional): Attention dropout rate. Default: 0.
drop_path_rate (float | list[float], optional): Stochastic depth
rate. Default: 0.
downsample (BaseModule | None, optional): The downsample operation
module. Default: None.
act_cfg (dict, optional): The config dict of activation function.
Default: dict(type='GELU').
norm_cfg (dict, optional): The config dict of normalization.
Default: dict(type='LN').
with_cp (bool, optional): Use checkpoint or not. Using checkpoint
will save some memory while slowing down the training speed.
Default: False.
"""
def __init__(self,
embed_dims,
num_heads,
feedforward_channels,
depth,
window_size=7,
qkv_bias=True,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.,
downsample=None,
act_cfg=dict(type='GELU'),
norm_cfg=dict(type='LN'),
with_cp=False):
super().__init__()
self._is_init = False
if isinstance(drop_path_rate, list):
drop_path_rates = drop_path_rate
assert len(drop_path_rates) == depth
else:
drop_path_rates = [deepcopy(drop_path_rate) for _ in range(depth)]
self.blocks = nn.ModuleList()
for i in range(depth):
block = SwinBlock(
embed_dims=embed_dims,
num_heads=num_heads,
feedforward_channels=feedforward_channels,
window_size=window_size,
shift=False if i % 2 == 0 else True,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
drop_rate=drop_rate,
attn_drop_rate=attn_drop_rate,
drop_path_rate=drop_path_rates[i],
act_cfg=act_cfg,
norm_cfg=norm_cfg,
with_cp=with_cp,)
self.blocks.append(block)
self.downsample = downsample
def forward(self, x, hw_shape):
for block in self.blocks:
x = block(x, hw_shape)
if self.downsample:
x_down, down_hw_shape = self.downsample(x, hw_shape)
return x_down, down_hw_shape, x, hw_shape
else:
return x, hw_shape, x, hw_shape
class SwinTransformer(nn.Module):
""" Swin Transformer
A PyTorch implementation of `Swin Transformer:
Hierarchical Vision Transformer using Shifted Windows` -
https://arxiv.org/abs/2103.14030
This code is adapted from https://github.com/open-mmlab/mmdetection/blob/main/mmdet/models/backbones/swin.py
with minimal modifications.
Args:
pretrain_img_size (int | tuple[int]): The size of input image when
pretrain. Defaults: 224.
in_channels (int): The num of input channels.
Defaults: 3.
embed_dims (int): The feature dimension. Default: 96.
patch_size (int | tuple[int]): Patch size. Default: 4.
window_size (int): Window size. Default: 7.
mlp_ratio (int): Ratio of mlp hidden dim to embedding dim.
Default: 4.
depths (tuple[int]): Depths of each Swin Transformer stage.
Default: (2, 2, 6, 2).
num_heads (tuple[int]): Parallel attention heads of each Swin
Transformer stage. Default: (3, 6, 12, 24).
strides (tuple[int]): The patch merging or patch embedding stride of
each Swin Transformer stage. (In swin, we set kernel size equal to
stride.) Default: (4, 2, 2, 2).
out_indices (tuple[int]): Output from which stages.
Default: (0, 1, 2, 3).
qkv_bias (bool, optional): If True, add a learnable bias to query, key,
value. Default: True
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
patch_norm (bool): If add a norm layer for patch embed and patch
merging. Default: True.
drop_rate (float): Dropout rate. Defaults: 0.
attn_drop_rate (float): Attention dropout rate. Default: 0.
drop_path_rate (float): Stochastic depth rate. Defaults: 0.1.
use_abs_pos_embed (bool): If True, add absolute position embedding to
the patch embedding. Defaults: False.
act_cfg (dict): Config dict for activation layer.
Default: dict(type='GELU').
norm_cfg (dict): Config dict for normalization layer at
output of backbone. Defaults: dict(type='LN').
with_cp (bool, optional): Use checkpoint or not. Using checkpoint
will save some memory while slowing down the training speed.
Default: False.
pretrained (str, optional): model pretrained path. Default: None.
convert_weights (bool): The flag indicates whether the
pre-trained model is from the original repo. We may need
to convert some keys to make it compatible.
Default: False.
frozen_stages (int): Stages to be frozen (stop grad and set eval mode).
Default: -1 (-1 means not freezing any parameters).
init_cfg (dict, optional): The Config for initialization.
Defaults to None.
"""
def __init__(self, model_cfg):
self.model_cfg = model_cfg
pretrain_img_size = self.model_cfg.get('PRETRAIN_IMG_SIZE', 224)
init_cfg = self.model_cfg.get('INIT_CFG', None)
depths = self.model_cfg.DEPTHS
in_channels = self.model_cfg.get('IN_CHANNELS', 3)
strides = self.model_cfg.get('STRIDES', (4, 2, 2, 2))
patch_size = self.model_cfg.get('PATCH_SIZE', 4)
embed_dims = self.model_cfg.EMBED_DIMS
num_heads = self.model_cfg.NUM_HEADS
window_size = self.model_cfg.WINDOW_SIZE
mlp_ratio = self.model_cfg.MLP_RATIO
qkv_bias = self.model_cfg.get('QKV_BIAS', True)
qk_scale = self.model_cfg.get('QK_SCALE', None)
drop_rate = self.model_cfg.DROP_RATE
attn_drop_rate = self.model_cfg.ATTN_DROP_RATE
drop_path_rate = self.model_cfg.DROP_PATH_RATE
patch_norm = self.model_cfg.get('PATCH_NORM', True)
out_indices = self.model_cfg.get('OUT_INDICES', [0, 1, 2, 3])
with_cp = self.model_cfg.get('WITH_CP', False)
use_abs_pos_embed = self.model_cfg.get('USE_ABS_POS_EMBED', False)
act_cfg=dict(type='GELU')
norm_cfg=dict(type='LN')
self.convert_weights = self.model_cfg.get('CONVERT_WEIGHTS', False)
self.frozen_stages = self.model_cfg.get('FROZEN_STAGES', -1)
if isinstance(pretrain_img_size, int):
pretrain_img_size = to_2tuple(pretrain_img_size)
elif isinstance(pretrain_img_size, tuple):
if len(pretrain_img_size) == 1:
pretrain_img_size = to_2tuple(pretrain_img_size[0])
assert len(pretrain_img_size) == 2, \
f'The size of image should have length 1 or 2, ' \
f'but got {len(pretrain_img_size)}'
super(SwinTransformer, self).__init__()
self.init_cfg = init_cfg
num_layers = len(depths)
self.out_indices = out_indices
self.use_abs_pos_embed = use_abs_pos_embed
assert strides[0] == patch_size, 'Use non-overlapping patch embed.'
self.patch_embed = PatchEmbed(
in_channels=in_channels,
embed_dims=embed_dims,
conv_type='Conv2d',
kernel_size=patch_size,
stride=strides[0],
norm_cfg=norm_cfg if patch_norm else None)
if self.use_abs_pos_embed:
patch_row = pretrain_img_size[0] // patch_size
patch_col = pretrain_img_size[1] // patch_size
num_patches = patch_row * patch_col
self.absolute_pos_embed = nn.Parameter(
torch.zeros((1, num_patches, embed_dims)))
self.drop_after_pos = nn.Dropout(p=drop_rate)
# set stochastic depth decay rule
total_depth = sum(depths)
dpr = [
x.item() for x in torch.linspace(0, drop_path_rate, total_depth)
]
self.stages = nn.ModuleList()
in_channels = embed_dims
for i in range(num_layers):
if i < num_layers - 1:
downsample = PatchMerging(
in_channels=in_channels,
out_channels=2 * in_channels,
stride=strides[i + 1],
norm_cfg=norm_cfg if patch_norm else None)
else:
downsample = None
stage = SwinBlockSequence(
embed_dims=in_channels,
num_heads=num_heads[i],
feedforward_channels=mlp_ratio * in_channels,
depth=depths[i],
window_size=window_size,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
drop_rate=drop_rate,
attn_drop_rate=attn_drop_rate,
drop_path_rate=dpr[sum(depths[:i]):sum(depths[:i + 1])],
downsample=downsample,
act_cfg=act_cfg,
norm_cfg=norm_cfg,
with_cp=with_cp)
self.stages.append(stage)
if downsample:
in_channels = downsample.out_channels
self.num_features = [int(embed_dims * 2**i) for i in range(num_layers)]
# Add a norm layer for each output
for i in out_indices:
layer = nn.LayerNorm(self.num_features[i])
layer_name = f'norm{i}'
self.add_module(layer_name, layer)
def train(self, mode=True):
"""Convert the model into training mode while keep layers freezed."""
super(SwinTransformer, self).train(mode)
self._freeze_stages()
def _freeze_stages(self):
if self.frozen_stages >= 0:
self.patch_embed.eval()
for param in self.patch_embed.parameters():
param.requires_grad = False
if self.use_abs_pos_embed:
self.absolute_pos_embed.requires_grad = False
self.drop_after_pos.eval()
for i in range(1, self.frozen_stages + 1):
if (i - 1) in self.out_indices:
norm_layer = getattr(self, f'norm{i-1}')
norm_layer.eval()
for param in norm_layer.parameters():
param.requires_grad = False
m = self.stages[i - 1]
m.eval()
for param in m.parameters():
param.requires_grad = False
def init_weights(self):
if self.init_cfg is None:
print(f'No pre-trained weights for '
f'{self.__class__.__name__}, '
f'training start from scratch')
if self.use_abs_pos_embed:
trunc_normal_(self.absolute_pos_embed, std=0.02)
for m in self.modules():
if isinstance(m, nn.Linear):
trunc_normal_init(m, std=.02, bias=0.)
elif isinstance(m, nn.LayerNorm):
constant_init(m, 1.0)
else:
assert 'checkpoint' in self.init_cfg, f'Only support ' \
f'specify `Pretrained` in ' \
f'`init_cfg` in ' \
f'{self.__class__.__name__} '
ckpt = torch.load(self.init_cfg.checkpoint, map_location='cpu')
if 'state_dict' in ckpt:
_state_dict = ckpt['state_dict']
elif 'model' in ckpt:
_state_dict = ckpt['model']
else:
_state_dict = ckpt
if self.convert_weights:
# supported loading weight from original repo,
_state_dict = swin_converter(_state_dict)
state_dict = OrderedDict()
for k, v in _state_dict.items():
if k.startswith('backbone.'):
state_dict[k[9:]] = v
# strip prefix of state_dict
if list(state_dict.keys())[0].startswith('module.'):
state_dict = {k[7:]: v for k, v in state_dict.items()}
# reshape absolute position embedding
if state_dict.get('absolute_pos_embed') is not None:
absolute_pos_embed = state_dict['absolute_pos_embed']
N1, L, C1 = absolute_pos_embed.size()
N2, C2, H, W = self.absolute_pos_embed.size()
if N1 != N2 or C1 != C2 or L != H * W:
print('Error in loading absolute_pos_embed, pass')
else:
state_dict['absolute_pos_embed'] = absolute_pos_embed.view(
N2, H, W, C2).permute(0, 3, 1, 2).contiguous()
# interpolate position bias table if needed
relative_position_bias_table_keys = [
k for k in state_dict.keys()
if 'relative_position_bias_table' in k
]
for table_key in relative_position_bias_table_keys:
table_pretrained = state_dict[table_key]
table_current = self.state_dict()[table_key]
L1, nH1 = table_pretrained.size()
L2, nH2 = table_current.size()
if nH1 != nH2:
print(f'Error in loading {table_key}, pass')
elif L1 != L2:
S1 = int(L1**0.5)
S2 = int(L2**0.5)
table_pretrained_resized = F.interpolate(
table_pretrained.permute(1, 0).reshape(1, nH1, S1, S1),
size=(S2, S2),
mode='bicubic')
state_dict[table_key] = table_pretrained_resized.view(
nH2, L2).permute(1, 0).contiguous()
# load state_dict
self.load_state_dict(state_dict, False)
def forward(self, batch_dict):
x = batch_dict['camera_imgs']
B, N, C, H, W = x.size()
x = x.view(B * N, C, H, W)
x, hw_shape = self.patch_embed(x)
if self.use_abs_pos_embed:
x = x + self.absolute_pos_embed
x = self.drop_after_pos(x)
outs = []
for i, stage in enumerate(self.stages):
x, hw_shape, out, out_hw_shape = stage(x, hw_shape)
if i in self.out_indices:
norm_layer = getattr(self, f'norm{i}')
out = norm_layer(out)
out = out.view(-1, *out_hw_shape,
self.num_features[i]).permute(0, 3, 1,
2).contiguous()
outs.append(out)
batch_dict['image_features'] = outs
return batch_dict
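# Illustrative shape sketch (values assume a Swin-T style config with EMBED_DIMS=96,
# DEPTHS=(2, 2, 6, 2) and OUT_INDICES=(1, 2, 3); the real values come from the model YAML):
# batch_dict['camera_imgs'] arrives stacked as (B, N_cams, 3, H, W), e.g. (2, 6, 3, 256, 704).
# The backbone flattens it to (B*N, 3, H, W) and returns one feature map per selected stage:
#   image_features[0]: (12, 192, 32, 88)   # stride 8
#   image_features[1]: (12, 384, 16, 44)   # stride 16
#   image_features[2]: (12, 768,  8, 22)   # stride 32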
......@@ -6,6 +6,7 @@ from .point_head_simple import PointHeadSimple
from .point_intra_part_head import PointIntraPartOffsetHead
from .center_head import CenterHead
from .voxelnext_head import VoxelNeXtHead
from .transfusion_head import TransFusionHead
__all__ = {
'AnchorHeadTemplate': AnchorHeadTemplate,
......@@ -16,4 +17,5 @@ __all__ = {
'AnchorHeadMulti': AnchorHeadMulti,
'CenterHead': CenterHead,
'VoxelNeXtHead': VoxelNeXtHead,
'TransFusionHead': TransFusionHead,
}
import torch
from scipy.optimize import linear_sum_assignment
from pcdet.ops.iou3d_nms import iou3d_nms_cuda
def height_overlaps(boxes1, boxes2):
"""
Calculate height overlaps of two boxes.
"""
boxes1_top_height = (boxes1[:,2]+ boxes1[:,5]).view(-1, 1)
boxes1_bottom_height = boxes1[:,2].view(-1, 1)
boxes2_top_height = (boxes2[:,2]+boxes2[:,5]).view(1, -1)
boxes2_bottom_height = boxes2[:,2].view(1, -1)
highest_of_bottom = torch.max(boxes1_bottom_height, boxes2_bottom_height)
lowest_of_top = torch.min(boxes1_top_height, boxes2_top_height)
overlaps_h = torch.clamp(lowest_of_top - highest_of_bottom, min=0)
return overlaps_h
def overlaps(boxes1, boxes2):
"""
Calculate 3D overlaps of two boxes.
"""
rows = len(boxes1)
cols = len(boxes2)
if rows * cols == 0:
return boxes1.new(rows, cols)
# height overlap
overlaps_h = height_overlaps(boxes1, boxes2)
boxes1_bev = boxes1[:,:7]
boxes2_bev = boxes2[:,:7]
# bev overlap
overlaps_bev = boxes1_bev.new_zeros(
(boxes1_bev.shape[0], boxes2_bev.shape[0])
).cuda() # (N, M)
iou3d_nms_cuda.boxes_overlap_bev_gpu(
boxes1_bev.contiguous().cuda(), boxes2_bev.contiguous().cuda(), overlaps_bev
)
# 3d overlaps
overlaps_3d = overlaps_bev.to(boxes1.device) * overlaps_h
volume1 = (boxes1[:, 3] * boxes1[:, 4] * boxes1[:, 5]).view(-1, 1)
volume2 = (boxes2[:, 3] * boxes2[:, 4] * boxes2[:, 5]).view(1, -1)
iou3d = overlaps_3d / torch.clamp(volume1 + volume2 - overlaps_3d, min=1e-8)
return iou3d
class HungarianAssigner3D:
def __init__(self, cls_cost, reg_cost, iou_cost):
self.cls_cost = cls_cost
self.reg_cost = reg_cost
self.iou_cost = iou_cost
def focal_loss_cost(self, cls_pred, gt_labels):
weight = self.cls_cost.get('weight', 0.15)
alpha = self.cls_cost.get('alpha', 0.25)
gamma = self.cls_cost.get('gamma', 2.0)
eps = self.cls_cost.get('eps', 1e-12)
cls_pred = cls_pred.sigmoid()
neg_cost = -(1 - cls_pred + eps).log() * (
1 - alpha) * cls_pred.pow(gamma)
pos_cost = -(cls_pred + eps).log() * alpha * (
1 - cls_pred).pow(gamma)
cls_cost = pos_cost[:, gt_labels] - neg_cost[:, gt_labels]
return cls_cost * weight
def bevbox_cost(self, bboxes, gt_bboxes, point_cloud_range):
weight = self.reg_cost.get('weight', 0.25)
pc_start = bboxes.new(point_cloud_range[0:2])
pc_range = bboxes.new(point_cloud_range[3:5]) - bboxes.new(point_cloud_range[0:2])
# normalize the box center to [0, 1]
normalized_bboxes_xy = (bboxes[:, :2] - pc_start) / pc_range
normalized_gt_bboxes_xy = (gt_bboxes[:, :2] - pc_start) / pc_range
reg_cost = torch.cdist(normalized_bboxes_xy, normalized_gt_bboxes_xy, p=1)
return reg_cost * weight
def iou3d_cost(self, bboxes, gt_bboxes):
iou = overlaps(bboxes, gt_bboxes)
weight = self.iou_cost.get('weight', 0.25)
iou_cost = - iou
return iou_cost * weight, iou
def assign(self, bboxes, gt_bboxes, gt_labels, cls_pred, point_cloud_range):
num_gts, num_bboxes = gt_bboxes.size(0), bboxes.size(0)
# 1. assign -1 by default
assigned_gt_inds = bboxes.new_full((num_bboxes,), -1, dtype=torch.long)
assigned_labels = bboxes.new_full((num_bboxes,), -1, dtype=torch.long)
if num_gts == 0 or num_bboxes == 0:
# No ground truth or boxes, return empty assignment
max_overlaps = bboxes.new_zeros((num_bboxes,))
if num_gts == 0:
# No ground truth, assign all to background
assigned_gt_inds[:] = 0
return assigned_gt_inds, max_overlaps
# 2. compute the weighted costs
cls_cost = self.focal_loss_cost(cls_pred[0].T, gt_labels)
reg_cost = self.bevbox_cost(bboxes, gt_bboxes, point_cloud_range)
iou_cost, iou = self.iou3d_cost(bboxes, gt_bboxes)
# weighted sum of above three costs
cost = cls_cost + reg_cost + iou_cost
# 3. do Hungarian matching on CPU using linear_sum_assignment
cost = cost.detach().cpu()
matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
matched_row_inds = torch.from_numpy(matched_row_inds).to(bboxes.device)
matched_col_inds = torch.from_numpy(matched_col_inds).to(bboxes.device)
# 4. assign backgrounds and foregrounds
# assign all indices to backgrounds first
assigned_gt_inds[:] = 0
# assign foregrounds based on matching results
assigned_gt_inds[matched_row_inds] = matched_col_inds + 1
assigned_labels[matched_row_inds] = gt_labels[matched_col_inds]
max_overlaps = torch.zeros_like(iou.max(1).values)
max_overlaps[matched_row_inds] = iou[matched_row_inds, matched_col_inds]
return assigned_gt_inds, max_overlaps
\ No newline at end of file
import copy
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn.init import kaiming_normal_
from ..model_utils.transfusion_utils import clip_sigmoid
from ..model_utils.basic_block_2d import BasicBlock2D
from ..model_utils.transfusion_utils import PositionEmbeddingLearned, TransformerDecoderLayer
from .target_assigner.hungarian_assigner import HungarianAssigner3D
from ...utils import loss_utils
from ..model_utils import centernet_utils
class SeparateHead_Transfusion(nn.Module):
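"""
Separate Conv1d prediction branches (center, height, dim, rot, vel, heatmap, ...) built
from sep_head_dict, each a small stack of Conv1d-BN-ReLU blocks plus a final Conv1d.
"""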
def __init__(self, input_channels, head_channels, kernel_size, sep_head_dict, init_bias=-2.19, use_bias=False):
super().__init__()
self.sep_head_dict = sep_head_dict
for cur_name in self.sep_head_dict:
output_channels = self.sep_head_dict[cur_name]['out_channels']
num_conv = self.sep_head_dict[cur_name]['num_conv']
fc_list = []
for k in range(num_conv - 1):
fc_list.append(nn.Sequential(
nn.Conv1d(input_channels, head_channels, kernel_size, stride=1, padding=kernel_size//2, bias=use_bias),
nn.BatchNorm1d(head_channels),
nn.ReLU()
))
fc_list.append(nn.Conv1d(head_channels, output_channels, kernel_size, stride=1, padding=kernel_size//2, bias=True))
fc = nn.Sequential(*fc_list)
if 'hm' in cur_name:
fc[-1].bias.data.fill_(init_bias)
else:
for m in fc.modules():
if isinstance(m, nn.Conv2d):
kaiming_normal_(m.weight.data)
if hasattr(m, "bias") and m.bias is not None:
nn.init.constant_(m.bias, 0)
self.__setattr__(cur_name, fc)
def forward(self, x):
ret_dict = {}
for cur_name in self.sep_head_dict:
ret_dict[cur_name] = self.__getattr__(cur_name)(x)
return ret_dict
class TransFusionHead(nn.Module):
"""
This module implements TransFusionHead.
The code is adapted from https://github.com/mit-han-lab/bevfusion/ with minimal modifications.
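Object queries are initialized from a class-specific BEV heatmap and refined by a
transformer decoder that cross-attends to the flattened LiDAR BEV features.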
"""
def __init__(
self,
model_cfg, input_channels, num_class, class_names, grid_size, point_cloud_range, voxel_size, predict_boxes_when_training=True,
):
super(TransFusionHead, self).__init__()
self.grid_size = grid_size
self.point_cloud_range = point_cloud_range
self.voxel_size = voxel_size
self.num_classes = num_class
self.model_cfg = model_cfg
self.feature_map_stride = self.model_cfg.TARGET_ASSIGNER_CONFIG.get('FEATURE_MAP_STRIDE', None)
self.dataset_name = self.model_cfg.TARGET_ASSIGNER_CONFIG.get('DATASET', 'nuScenes')
hidden_channel=self.model_cfg.HIDDEN_CHANNEL
self.num_proposals = self.model_cfg.NUM_PROPOSALS
self.bn_momentum = self.model_cfg.BN_MOMENTUM
self.nms_kernel_size = self.model_cfg.NMS_KERNEL_SIZE
num_heads = self.model_cfg.NUM_HEADS
dropout = self.model_cfg.DROPOUT
activation = self.model_cfg.ACTIVATION
ffn_channel = self.model_cfg.FFN_CHANNEL
bias = self.model_cfg.get('USE_BIAS_BEFORE_NORM', False)
loss_cls = self.model_cfg.LOSS_CONFIG.LOSS_CLS
self.use_sigmoid_cls = loss_cls.get("use_sigmoid", False)
if not self.use_sigmoid_cls:
self.num_classes += 1
self.loss_cls = loss_utils.SigmoidFocalClassificationLoss(gamma=loss_cls.gamma,alpha=loss_cls.alpha)
self.loss_cls_weight = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['cls_weight']
self.loss_bbox = loss_utils.L1Loss()
self.loss_bbox_weight = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['bbox_weight']
self.loss_heatmap = loss_utils.GaussianFocalLoss()
self.loss_heatmap_weight = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['hm_weight']
self.code_size = 10
# a shared convolution
self.shared_conv = nn.Conv2d(in_channels=input_channels,out_channels=hidden_channel,kernel_size=3,padding=1)
layers = []
layers.append(BasicBlock2D(hidden_channel,hidden_channel, kernel_size=3,padding=1,bias=bias))
layers.append(nn.Conv2d(in_channels=hidden_channel,out_channels=num_class,kernel_size=3,padding=1))
self.heatmap_head = nn.Sequential(*layers)
self.class_encoding = nn.Conv1d(num_class, hidden_channel, 1)
# transformer decoder layers for object query with LiDAR feature
self.decoder = TransformerDecoderLayer(hidden_channel, num_heads, ffn_channel, dropout, activation,
self_posembed=PositionEmbeddingLearned(2, hidden_channel),
cross_posembed=PositionEmbeddingLearned(2, hidden_channel),
)
# Prediction Head
heads = copy.deepcopy(self.model_cfg.SEPARATE_HEAD_CFG.HEAD_DICT)
heads['heatmap'] = dict(out_channels=self.num_classes, num_conv=self.model_cfg.NUM_HM_CONV)
self.prediction_head = SeparateHead_Transfusion(hidden_channel, 64, 1, heads, use_bias=bias)
self.init_weights()
self.bbox_assigner = HungarianAssigner3D(**self.model_cfg.TARGET_ASSIGNER_CONFIG.HUNGARIAN_ASSIGNER)
# Position Embedding for Cross-Attention, which is re-used during training
x_size = self.grid_size[0] // self.feature_map_stride
y_size = self.grid_size[1] // self.feature_map_stride
self.bev_pos = self.create_2D_grid(x_size, y_size)
self.forward_ret_dict = {}
def create_2D_grid(self, x_size, y_size):
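# build a (1, x_size*y_size, 2) grid of BEV cell-center coordinates (+0.5 shifts each
# index to the cell center); used as the dense key positions for the decoder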
meshgrid = [[0, x_size - 1, x_size], [0, y_size - 1, y_size]]
# NOTE: modified
batch_x, batch_y = torch.meshgrid(
*[torch.linspace(it[0], it[1], it[2]) for it in meshgrid]
)
batch_x = batch_x + 0.5
batch_y = batch_y + 0.5
coord_base = torch.cat([batch_x[None], batch_y[None]], dim=0)[None]
coord_base = coord_base.view(1, 2, -1).permute(0, 2, 1)
return coord_base
def init_weights(self):
# initialize transformer
for m in self.decoder.parameters():
if m.dim() > 1:
nn.init.xavier_uniform_(m)
if hasattr(self, "query"):
nn.init.xavier_normal_(self.query)
self.init_bn_momentum()
def init_bn_momentum(self):
for m in self.modules():
if isinstance(m, (nn.BatchNorm2d, nn.BatchNorm1d)):
m.momentum = self.bn_momentum
def predict(self, inputs):
batch_size = inputs.shape[0]
lidar_feat = self.shared_conv(inputs)
lidar_feat_flatten = lidar_feat.view(
batch_size, lidar_feat.shape[1], -1
)
bev_pos = self.bev_pos.repeat(batch_size, 1, 1).to(lidar_feat.device)
# query initialization
dense_heatmap = self.heatmap_head(lidar_feat)
heatmap = dense_heatmap.detach().sigmoid()
padding = self.nms_kernel_size // 2
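# max-pooling "NMS" on the class heatmaps: a cell survives only if it equals the maximum
# of its nms_kernel_size neighbourhood (border cells stay zero and are never selected);
# the 1x1 kernels below effectively disable suppression for small classes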
local_max = torch.zeros_like(heatmap)
local_max_inner = F.max_pool2d(
heatmap, kernel_size=self.nms_kernel_size, stride=1, padding=0
)
local_max[:, :, padding:(-padding), padding:(-padding)] = local_max_inner
# for Pedestrian & Traffic_cone in nuScenes
if self.dataset_name == "nuScenes":
local_max[:, 8] = F.max_pool2d(heatmap[:, 8], kernel_size=1, stride=1, padding=0)
local_max[:, 9] = F.max_pool2d(heatmap[:, 9], kernel_size=1, stride=1, padding=0)
# for Pedestrian & Cyclist in Waymo
elif self.dataset_name == "Waymo":
local_max[:, 1] = F.max_pool2d(heatmap[:, 1], kernel_size=1, stride=1, padding=0)
local_max[:, 2] = F.max_pool2d(heatmap[:, 2], kernel_size=1, stride=1, padding=0)
heatmap = heatmap * (heatmap == local_max)
heatmap = heatmap.view(batch_size, heatmap.shape[1], -1)
# top num_proposals among all classes
top_proposals = heatmap.view(batch_size, -1).argsort(dim=-1, descending=True)[
..., : self.num_proposals
]
top_proposals_class = top_proposals // heatmap.shape[-1]
top_proposals_index = top_proposals % heatmap.shape[-1]
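# the flattened argsort index encodes both the class (quotient) and the BEV location
# (remainder) of each selected proposal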
query_feat = lidar_feat_flatten.gather(
index=top_proposals_index[:, None, :].expand(-1, lidar_feat_flatten.shape[1], -1),
dim=-1,
)
self.query_labels = top_proposals_class
# add category embedding
one_hot = F.one_hot(top_proposals_class, num_classes=self.num_classes).permute(0, 2, 1)
query_cat_encoding = self.class_encoding(one_hot.float())
query_feat += query_cat_encoding
query_pos = bev_pos.gather(
index=top_proposals_index[:, None, :].permute(0, 2, 1).expand(-1, -1, bev_pos.shape[-1]),
dim=1,
)
# convert to xy
query_pos = query_pos.flip(dims=[-1])
bev_pos = bev_pos.flip(dims=[-1])
query_feat = self.decoder(
query_feat, lidar_feat_flatten, query_pos, bev_pos
)
res_layer = self.prediction_head(query_feat)
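# the head regresses center offsets relative to each query's BEV position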
res_layer["center"] = res_layer["center"] + query_pos.permute(0, 2, 1)
res_layer["query_heatmap_score"] = heatmap.gather(
index=top_proposals_index[:, None, :].expand(-1, self.num_classes, -1),
dim=-1,
)
res_layer["dense_heatmap"] = dense_heatmap
return res_layer
def forward(self, batch_dict):
feats = batch_dict['spatial_features_2d']
res = self.predict(feats)
if not self.training:
bboxes = self.get_bboxes(res)
batch_dict['final_box_dicts'] = bboxes
else:
gt_boxes = batch_dict['gt_boxes']
gt_bboxes_3d = gt_boxes[...,:-1]
gt_labels_3d = gt_boxes[...,-1].long() - 1
loss, tb_dict = self.loss(gt_bboxes_3d, gt_labels_3d, res)
batch_dict['loss'] = loss
batch_dict['tb_dict'] = tb_dict
return batch_dict
def get_targets(self, gt_bboxes_3d, gt_labels_3d, pred_dicts):
assign_results = []
for batch_idx in range(len(gt_bboxes_3d)):
pred_dict = {}
for key in pred_dicts.keys():
pred_dict[key] = pred_dicts[key][batch_idx : batch_idx + 1]
gt_bboxes = gt_bboxes_3d[batch_idx]
valid_idx = []
# filter empty boxes
for i in range(len(gt_bboxes)):
if gt_bboxes[i][3] > 0 and gt_bboxes[i][4] > 0:
valid_idx.append(i)
assign_result = self.get_targets_single(gt_bboxes[valid_idx], gt_labels_3d[batch_idx][valid_idx], pred_dict)
assign_results.append(assign_result)
res_tuple = tuple(map(list, zip(*assign_results)))
labels = torch.cat(res_tuple[0], dim=0)
label_weights = torch.cat(res_tuple[1], dim=0)
bbox_targets = torch.cat(res_tuple[2], dim=0)
bbox_weights = torch.cat(res_tuple[3], dim=0)
num_pos = np.sum(res_tuple[4])
matched_ious = np.mean(res_tuple[5])
heatmap = torch.cat(res_tuple[6], dim=0)
return labels, label_weights, bbox_targets, bbox_weights, num_pos, matched_ious, heatmap
def get_targets_single(self, gt_bboxes_3d, gt_labels_3d, preds_dict):
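# decode the current predictions into boxes, Hungarian-match them to the GT boxes of this
# sample, then build per-proposal classification/regression targets and a dense GT heatmap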
num_proposals = preds_dict["center"].shape[-1]
score = copy.deepcopy(preds_dict["heatmap"].detach())
center = copy.deepcopy(preds_dict["center"].detach())
height = copy.deepcopy(preds_dict["height"].detach())
dim = copy.deepcopy(preds_dict["dim"].detach())
rot = copy.deepcopy(preds_dict["rot"].detach())
if "vel" in preds_dict.keys():
vel = copy.deepcopy(preds_dict["vel"].detach())
else:
vel = None
boxes_dict = self.decode_bbox(score, rot, dim, center, height, vel)
bboxes_tensor = boxes_dict[0]["pred_boxes"]
gt_bboxes_tensor = gt_bboxes_3d.to(score.device)
assigned_gt_inds, ious = self.bbox_assigner.assign(
bboxes_tensor, gt_bboxes_tensor, gt_labels_3d,
score, self.point_cloud_range,
)
pos_inds = torch.nonzero(assigned_gt_inds > 0, as_tuple=False).squeeze(-1).unique()
neg_inds = torch.nonzero(assigned_gt_inds == 0, as_tuple=False).squeeze(-1).unique()
pos_assigned_gt_inds = assigned_gt_inds[pos_inds] - 1
if gt_bboxes_3d.numel() == 0:
assert pos_inds.numel() == 0
pos_gt_bboxes = torch.empty_like(gt_bboxes_3d).view(-1, 9)
else:
pos_gt_bboxes = gt_bboxes_3d[pos_assigned_gt_inds.long(), :]
# create target for loss computation
bbox_targets = torch.zeros([num_proposals, self.code_size]).to(center.device)
bbox_weights = torch.zeros([num_proposals, self.code_size]).to(center.device)
ious = torch.clamp(ious, min=0.0, max=1.0)
labels = bboxes_tensor.new_zeros(num_proposals, dtype=torch.long)
label_weights = bboxes_tensor.new_zeros(num_proposals, dtype=torch.long)
if gt_labels_3d is not None: # default to the background label (num_classes)
labels += self.num_classes
# both pos and neg have classification loss, only pos has regression and iou loss
if len(pos_inds) > 0:
pos_bbox_targets = self.encode_bbox(pos_gt_bboxes)
bbox_targets[pos_inds, :] = pos_bbox_targets
bbox_weights[pos_inds, :] = 1.0
if gt_labels_3d is None:
labels[pos_inds] = 1
else:
labels[pos_inds] = gt_labels_3d[pos_assigned_gt_inds]
label_weights[pos_inds] = 1.0
if len(neg_inds) > 0:
label_weights[neg_inds] = 1.0
# compute dense heatmap targets
device = labels.device
target_assigner_cfg = self.model_cfg.TARGET_ASSIGNER_CONFIG
feature_map_size = (self.grid_size[:2] // self.feature_map_stride)
heatmap = gt_bboxes_3d.new_zeros(self.num_classes, feature_map_size[1], feature_map_size[0])
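# draw one Gaussian per GT box (CenterNet-style), with the radius derived from the box
# footprint projected onto the BEV feature map and clamped to MIN_RADIUS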
for idx in range(len(gt_bboxes_3d)):
width = gt_bboxes_3d[idx][3]
length = gt_bboxes_3d[idx][4]
width = width / self.voxel_size[0] / self.feature_map_stride
length = length / self.voxel_size[1] / self.feature_map_stride
if width > 0 and length > 0:
radius = centernet_utils.gaussian_radius(length.view(-1), width.view(-1), target_assigner_cfg.GAUSSIAN_OVERLAP)[0]
radius = max(target_assigner_cfg.MIN_RADIUS, int(radius))
x, y = gt_bboxes_3d[idx][0], gt_bboxes_3d[idx][1]
coor_x = (x - self.point_cloud_range[0]) / self.voxel_size[0] / self.feature_map_stride
coor_y = (y - self.point_cloud_range[1]) / self.voxel_size[1] / self.feature_map_stride
center = torch.tensor([coor_x, coor_y], dtype=torch.float32, device=device)
center_int = center.to(torch.int32)
centernet_utils.draw_gaussian_to_heatmap(heatmap[gt_labels_3d[idx]], center_int, radius)
mean_iou = ious[pos_inds].sum() / max(len(pos_inds), 1)
return (labels[None], label_weights[None], bbox_targets[None], bbox_weights[None], int(pos_inds.shape[0]), float(mean_iou), heatmap[None])
def loss(self, gt_bboxes_3d, gt_labels_3d, pred_dicts, **kwargs):
labels, label_weights, bbox_targets, bbox_weights, num_pos, matched_ious, heatmap = \
self.get_targets(gt_bboxes_3d, gt_labels_3d, pred_dicts)
loss_dict = dict()
loss_all = 0
# compute heatmap loss
loss_heatmap = self.loss_heatmap(
clip_sigmoid(pred_dicts["dense_heatmap"]),
heatmap,
).sum() / max(heatmap.eq(1).float().sum().item(), 1)
loss_dict["loss_heatmap"] = loss_heatmap.item() * self.loss_heatmap_weight
loss_all += loss_heatmap * self.loss_heatmap_weight
labels = labels.reshape(-1)
label_weights = label_weights.reshape(-1)
cls_score = pred_dicts["heatmap"].permute(0, 2, 1).reshape(-1, self.num_classes)
one_hot_targets = torch.zeros(*list(labels.shape), self.num_classes+1, dtype=cls_score.dtype, device=labels.device)
one_hot_targets.scatter_(-1, labels.unsqueeze(dim=-1).long(), 1.0)
one_hot_targets = one_hot_targets[..., :-1]
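# background proposals carry the label num_classes, so dropping the last column leaves
# them with an all-zero one-hot target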
loss_cls = self.loss_cls(
cls_score, one_hot_targets, label_weights
).sum() / max(num_pos, 1)
preds = torch.cat([pred_dicts[head_name] for head_name in self.model_cfg.SEPARATE_HEAD_CFG.HEAD_ORDER], dim=1).permute(0, 2, 1)
code_weights = self.model_cfg.LOSS_CONFIG.LOSS_WEIGHTS['code_weights']
reg_weights = bbox_weights * bbox_weights.new_tensor(code_weights)
loss_bbox = self.loss_bbox(preds, bbox_targets)
loss_bbox = (loss_bbox * reg_weights).sum() / max(num_pos, 1)
loss_dict["loss_cls"] = loss_cls.item() * self.loss_cls_weight
loss_dict["loss_bbox"] = loss_bbox.item() * self.loss_bbox_weight
loss_all = loss_all + loss_cls * self.loss_cls_weight + loss_bbox * self.loss_bbox_weight
loss_dict[f"matched_ious"] = loss_cls.new_tensor(matched_ious)
loss_dict['loss_trans'] = loss_all
return loss_all,loss_dict
def encode_bbox(self, bboxes):
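# 10-dim regression target: (x, y) in BEV feature-map cells, raw z, log(dx, dy, dz),
# sin/cos of heading, and (vx, vy) velocity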
code_size = 10
targets = torch.zeros([bboxes.shape[0], code_size]).to(bboxes.device)
targets[:, 0] = (bboxes[:, 0] - self.point_cloud_range[0]) / (self.feature_map_stride * self.voxel_size[0])
targets[:, 1] = (bboxes[:, 1] - self.point_cloud_range[1]) / (self.feature_map_stride * self.voxel_size[1])
targets[:, 3:6] = bboxes[:, 3:6].log()
targets[:, 2] = bboxes[:, 2]
targets[:, 6] = torch.sin(bboxes[:, 6])
targets[:, 7] = torch.cos(bboxes[:, 6])
if code_size == 10:
targets[:, 8:10] = bboxes[:, 7:]
return targets
def decode_bbox(self, heatmap, rot, dim, center, height, vel, filter=False):
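# inverse of encode_bbox: centers back to metric coordinates, dims exponentiated, heading
# recovered with atan2(sin, cos); optionally filter boxes by score and post-center range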
post_process_cfg = self.model_cfg.POST_PROCESSING
score_thresh = post_process_cfg.SCORE_THRESH
post_center_range = post_process_cfg.POST_CENTER_RANGE
post_center_range = torch.tensor(post_center_range).cuda().float()
# class label
final_preds = heatmap.max(1, keepdim=False).indices
final_scores = heatmap.max(1, keepdim=False).values
center[:, 0, :] = center[:, 0, :] * self.feature_map_stride * self.voxel_size[0] + self.point_cloud_range[0]
center[:, 1, :] = center[:, 1, :] * self.feature_map_stride * self.voxel_size[1] + self.point_cloud_range[1]
dim = dim.exp()
rots, rotc = rot[:, 0:1, :], rot[:, 1:2, :]
rot = torch.atan2(rots, rotc)
if vel is None:
final_box_preds = torch.cat([center, height, dim, rot], dim=1).permute(0, 2, 1)
else:
final_box_preds = torch.cat([center, height, dim, rot, vel], dim=1).permute(0, 2, 1)
predictions_dicts = []
for i in range(heatmap.shape[0]):
boxes3d = final_box_preds[i]
scores = final_scores[i]
labels = final_preds[i]
predictions_dict = {
'pred_boxes': boxes3d,
'pred_scores': scores,
'pred_labels': labels
}
predictions_dicts.append(predictions_dict)
if filter is False:
return predictions_dicts
thresh_mask = final_scores > score_thresh
mask = (final_box_preds[..., :3] >= post_center_range[:3]).all(2)
mask &= (final_box_preds[..., :3] <= post_center_range[3:]).all(2)
predictions_dicts = []
for i in range(heatmap.shape[0]):
cmask = mask[i, :]
cmask &= thresh_mask[i]
boxes3d = final_box_preds[i, cmask]
scores = final_scores[i, cmask]
labels = final_preds[i, cmask]
predictions_dict = {
'pred_boxes': boxes3d,
'pred_scores': scores,
'pred_labels': labels,
}
predictions_dicts.append(predictions_dict)
return predictions_dicts
def get_bboxes(self, preds_dicts):
batch_size = preds_dicts["heatmap"].shape[0]
batch_score = preds_dicts["heatmap"].sigmoid()
one_hot = F.one_hot(
self.query_labels, num_classes=self.num_classes
).permute(0, 2, 1)
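# fuse the refined class scores with the initial heatmap score of each query, restricted
# to the query's original class via the one-hot mask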
batch_score = batch_score * preds_dicts["query_heatmap_score"] * one_hot
batch_center = preds_dicts["center"]
batch_height = preds_dicts["height"]
batch_dim = preds_dicts["dim"]
batch_rot = preds_dicts["rot"]
batch_vel = None
if "vel" in preds_dicts:
batch_vel = preds_dicts["vel"]
ret_dict = self.decode_bbox(
batch_score, batch_rot, batch_dim,
batch_center, batch_height, batch_vel,
filter=True,
)
for k in range(batch_size):
ret_dict[k]['pred_labels'] = ret_dict[k]['pred_labels'].int() + 1
return ret_dict