Unverified commit 730c5e1e authored by Vasilis Vryniotis, committed by GitHub

Add SSD architecture with VGG16 backbone (#3403)

* Early skeleton of API.

* Adding MultiFeatureMap and vgg16 backbone.

* Making vgg16 backbone same as paper.

* Making code generic to support all vggs.

* Moving vgg's extra layers to a separate class + L2 scaling.

* Adding header vgg layers.

* Fix maxpool patching.

* Refactoring code to allow for support of different backbones & sizes:
- Skeleton for Default Boxes generator class
- Dynamic estimation of configuration when possible
- Addition of types

* Complete the implementation of DefaultBox generator.

* Replace randn with empty.

* Minor refactoring

* Making clamping between 0 and 1 optional.

* Change xywh to xyxy encoding.

* Adding parameters and reusing objects in constructor.

* Temporarily inherit from Retina to avoid dup code.

* Implement forward methods + temp workarounds to inherit from retina.

* Inherit more methods from retinanet.

* Fix type error.

* Add Regression loss.

* Fixing JIT issues.

* Change JIT workaround to minimize new code.

* Fixing initialization bug.

* Add classification loss.

* Update todos.

* Add weight loading support.

* Support SSD512.

* Change kernel_size to get output size 1x1

* Add xavier init and refactoring.

* Adding unit-tests and fixing JIT issues.

* Add a test for dbox generator.

* Remove unnecessary import.

* Workaround on GeneralizedRCNNTransform to support fixed size input.

* Remove unnecessary random calls from the test.

* Remove more rand calls from the test.

* change mapping and handling of empty labels

* Fix JIT warnings.

* Speed up loss.

* Convert 0-1 dboxes to original size.

* Fix warning.

* Fix tests.

* Update comments.

* Fixing minor bugs.

* Introduce a custom DBoxMatcher.

* Minor refactoring

* Move extra layer definition inside feature extractor.

* handle no bias on init.

* Remove fixed image size limitation

* Change initialization values for bias of classification head.

* Refactoring and update test file.

* Adding ResNet backbone.

* Minor refactoring.

* Remove inheritance of retina and general refactoring.

* SSD should fix the input size.

* Fixing messages and comments.

* Silently ignoring exception if test-only.

* Update comments.

* Update regression loss.

* Restore Xavier init everywhere, update the negative sampling method, change the clipping approach.

* Fixing tests.

* Refactor to move the losses from the Head to the SSD.

* Removing resnet50 ssd version.

* Adding support for best performing backbone and its config.

* Refactor and clean up the API.

* Fix lint

* Update todos and comments.

* Adding RandomHorizontalFlip and RandomIoUCrop transforms.

* Adding necessary checks to our transforms.

* Adding RandomZoomOut.

* Adding RandomPhotometricDistort.

* Moving Detection transforms to references.

* Update presets

* fix lint

* leave compose and object

* Adding scaling for completeness.

* Adding params in the repr

* Remove unnecessary import.

* minor refactoring

* Remove unnecessary call.

* Give better names to DBox* classes

* Port num_anchors estimation in generator

* Remove rescaling and fix presets

* Add the ability to pass a custom head and refactoring.

* fix lint

* Fix unit-test

* Update todos.

* Change mean values.

* Change the default parameter of SSD to train the full VGG16 and remove the exception catch used for eval-only runs.

* Adding documentation

* Adding weights and updating readmes.

* Update the model weights with a better-performing model.

* Adding doc for head.

* Restore import.
parent 7c35e133
......@@ -381,17 +381,18 @@ Object Detection, Instance Segmentation and Person Keypoint Detection
The models subpackage contains definitions for the following model
architectures for detection:
- `Faster R-CNN ResNet-50 FPN <https://arxiv.org/abs/1506.01497>`_
- `Mask R-CNN ResNet-50 FPN <https://arxiv.org/abs/1703.06870>`_
- `Faster R-CNN <https://arxiv.org/abs/1506.01497>`_
- `Mask R-CNN <https://arxiv.org/abs/1703.06870>`_
- `RetinaNet <https://arxiv.org/abs/1708.02002>`_
- `SSD <https://arxiv.org/abs/1512.02325>`_
The pre-trained models for detection, instance segmentation and
keypoint detection are initialized with the classification models
in torchvision.
The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
The models internally resize the images so that they have a minimum size
of ``800``. This option can be changed by passing the option ``min_size``
to the constructor of the models.
The models internally resize the images but the behaviour varies depending
on the model. Check the constructor of the models for more information.
For object detection and instance segmentation, the pre-trained
......@@ -425,6 +426,7 @@ Faster R-CNN ResNet-50 FPN 37.0 - -
Faster R-CNN MobileNetV3-Large FPN 32.8 - -
Faster R-CNN MobileNetV3-Large 320 FPN 22.8 - -
RetinaNet ResNet-50 FPN 36.4 - -
SSD VGG16 25.1 - -
Mask R-CNN ResNet-50 FPN 37.9 34.6 -
====================================== ======= ======== ===========
......@@ -483,6 +485,7 @@ Faster R-CNN ResNet-50 FPN 0.2288 0.0590
Faster R-CNN MobileNetV3-Large FPN 0.1020 0.0415 1.0
Faster R-CNN MobileNetV3-Large 320 FPN 0.0978 0.0376 0.6
RetinaNet ResNet-50 FPN 0.2514 0.0939 4.1
SSD VGG16 0.2093 0.0744 1.5
Mask R-CNN ResNet-50 FPN 0.2728 0.0903 5.4
Keypoint R-CNN ResNet-50 FPN 0.3789 0.1242 6.8
====================================== =================== ================== ===========
......@@ -502,6 +505,12 @@ RetinaNet
.. autofunction:: torchvision.models.detection.retinanet_resnet50_fpn
SSD
------------
.. autofunction:: torchvision.models.detection.ssd300_vgg16
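As a rough illustration of the new entry point documented above, a minimal inference sketch (the ``pretrained`` flag and the output keys follow the other torchvision detection models and are assumed to apply here as well):

```python
import torch
import torchvision

# Load the new SSD300 model with its VGG16 backbone.
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

# Detection models take a list of 0-1 range Tensor[C, H, W] images of arbitrary sizes.
images = [torch.rand(3, 300, 300), torch.rand(3, 480, 640)]
with torch.no_grad():
    predictions = model(images)

# One dict per image with 'boxes', 'labels' and 'scores'.
print(predictions[0]["boxes"].shape, predictions[0]["scores"].shape)
```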
Mask R-CNN
----------
......
......@@ -48,6 +48,14 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
--lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01
```
### SSD VGG16
```
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
--dataset coco --model ssd300_vgg16 --epochs 120\
--lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4\
--weight-decay 0.0005 --data-augmentation ssd
```
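For a quick single-process sanity run the same script can be launched without the distributed launcher; the learning rate below is simply the 8-GPU value scaled down by 8, which is an assumption rather than a published recipe:
```
python train.py --dataset coco --model ssd300_vgg16 --epochs 120\
    --lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.00025 --batch-size 4\
    --weight-decay 0.0005 --data-augmentation ssd
```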
### Mask R-CNN
```
......
......@@ -2,12 +2,22 @@ import transforms as T
class DetectionPresetTrain:
def __init__(self, hflip_prob=0.5):
trans = [T.ToTensor()]
if hflip_prob > 0:
trans.append(T.RandomHorizontalFlip(hflip_prob))
self.transforms = T.Compose(trans)
def __init__(self, data_augmentation, hflip_prob=0.5, mean=(123., 117., 104.)):
if data_augmentation == 'hflip':
self.transforms = T.Compose([
T.RandomHorizontalFlip(p=hflip_prob),
T.ToTensor(),
])
elif data_augmentation == 'ssd':
self.transforms = T.Compose([
T.RandomPhotometricDistort(),
T.RandomZoomOut(fill=list(mean)),
T.RandomIoUCrop(),
T.RandomHorizontalFlip(p=hflip_prob),
T.ToTensor(),
])
else:
raise ValueError(f'Unknown data augmentation policy "{data_augmentation}"')
def __call__(self, img, target):
return self.transforms(img, target)
......
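A minimal usage sketch for the updated preset (the blank image and single-box target are hypothetical sample data):

```python
import torch
from PIL import Image

import presets  # references/detection/presets.py from this PR

# 'hflip' keeps the previous behaviour; 'ssd' enables the new augmentation pipeline.
transform = presets.DetectionPresetTrain(data_augmentation="ssd")

# Hypothetical COCO-style sample: a blank PIL image with one box and one label.
img = Image.new("RGB", (640, 480))
target = {"boxes": torch.tensor([[10., 20., 200., 220.]]),
          "labels": torch.tensor([1])}

img, target = transform(img, target)
print(img.shape, target["boxes"].shape)
```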
......@@ -47,8 +47,8 @@ def get_dataset(name, image_set, transform, data_path):
return ds, num_classes
def get_transform(train):
return presets.DetectionPresetTrain() if train else presets.DetectionPresetEval()
def get_transform(train, data_augmentation):
return presets.DetectionPresetTrain(data_augmentation) if train else presets.DetectionPresetEval()
def main(args):
......@@ -60,8 +60,9 @@ def main(args):
# Data loading code
print("Loading data")
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(train=True), args.data_path)
dataset_test, _ = get_dataset(args.dataset, "val", get_transform(train=False), args.data_path)
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(True, args.data_augmentation),
args.data_path)
dataset_test, _ = get_dataset(args.dataset, "val", get_transform(False, args.data_augmentation), args.data_path)
print("Creating data loaders")
if args.distributed:
......@@ -179,6 +180,7 @@ if __name__ == "__main__":
parser.add_argument('--rpn-score-thresh', default=None, type=float, help='rpn score threshold for faster-rcnn')
parser.add_argument('--trainable-backbone-layers', default=None, type=int,
help='number of trainable layers of backbone')
parser.add_argument('--data-augmentation', default="hflip", help='data augmentation policy (default: hflip)')
parser.add_argument(
"--test-only",
dest="test_only",
......
import random
import torch
import torchvision
from torch import nn, Tensor
from torchvision.transforms import functional as F
from torchvision.transforms import transforms as T
from typing import List, Tuple, Dict, Optional
def _flip_coco_person_keypoints(kps, width):
......@@ -23,17 +27,14 @@ class Compose(object):
return image, target
class RandomHorizontalFlip(object):
def __init__(self, prob):
self.prob = prob
def __call__(self, image, target):
if random.random() < self.prob:
height, width = image.shape[-2:]
image = image.flip(-1)
bbox = target["boxes"]
bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
target["boxes"] = bbox
class RandomHorizontalFlip(T.RandomHorizontalFlip):
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if torch.rand(1) < self.p:
image = F.hflip(image)
if target is not None:
width, _ = F._get_image_size(image)
target["boxes"][:, [0, 2]] = width - target["boxes"][:, [2, 0]]
if "masks" in target:
target["masks"] = target["masks"].flip(-1)
if "keypoints" in target:
......@@ -43,7 +44,196 @@ class RandomHorizontalFlip(object):
return image, target
class ToTensor(object):
def __call__(self, image, target):
class ToTensor(nn.Module):
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
image = F.to_tensor(image)
return image, target
class RandomIoUCrop(nn.Module):
def __init__(self, min_scale: float = 0.3, max_scale: float = 1.0, min_aspect_ratio: float = 0.5,
max_aspect_ratio: float = 2.0, sampler_options: Optional[List[float]] = None, trials: int = 40):
super().__init__()
# Configuration similar to https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_coco.py#L89-L174
self.min_scale = min_scale
self.max_scale = max_scale
self.min_aspect_ratio = min_aspect_ratio
self.max_aspect_ratio = max_aspect_ratio
if sampler_options is None:
sampler_options = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
self.options = sampler_options
self.trials = trials
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if target is None:
raise ValueError("The targets can't be None for this transform.")
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
orig_w, orig_h = F._get_image_size(image)
while True:
# sample an option
idx = int(torch.randint(low=0, high=len(self.options), size=(1,)))
min_jaccard_overlap = self.options[idx]
if min_jaccard_overlap >= 1.0: # a value larger than 1 encodes the leave as-is option
return image, target
for _ in range(self.trials):
# check the aspect ratio limitations
r = self.min_scale + (self.max_scale - self.min_scale) * torch.rand(2)
new_w = int(orig_w * r[0])
new_h = int(orig_h * r[1])
aspect_ratio = new_w / new_h
if not (self.min_aspect_ratio <= aspect_ratio <= self.max_aspect_ratio):
continue
# check for 0 area crops
r = torch.rand(2)
left = int((orig_w - new_w) * r[0])
top = int((orig_h - new_h) * r[1])
right = left + new_w
bottom = top + new_h
if left == right or top == bottom:
continue
# check for any valid boxes with centers within the crop area
cx = 0.5 * (target["boxes"][:, 0] + target["boxes"][:, 2])
cy = 0.5 * (target["boxes"][:, 1] + target["boxes"][:, 3])
is_within_crop_area = (left < cx) & (cx < right) & (top < cy) & (cy < bottom)
if not is_within_crop_area.any():
continue
# check at least 1 box with jaccard limitations
boxes = target["boxes"][is_within_crop_area]
ious = torchvision.ops.boxes.box_iou(boxes, torch.tensor([[left, top, right, bottom]],
dtype=boxes.dtype, device=boxes.device))
if ious.max() < min_jaccard_overlap:
continue
# keep only valid boxes and perform cropping
target["boxes"] = boxes
target["labels"] = target["labels"][is_within_crop_area]
target["boxes"][:, 0::2] -= left
target["boxes"][:, 1::2] -= top
target["boxes"][:, 0::2].clamp_(min=0, max=new_w)
target["boxes"][:, 1::2].clamp_(min=0, max=new_h)
image = F.crop(image, top, left, new_h, new_w)
return image, target
class RandomZoomOut(nn.Module):
def __init__(self, fill: Optional[List[float]] = None, side_range: Tuple[float, float] = (1., 4.), p: float = 0.5):
super().__init__()
if fill is None:
fill = [0., 0., 0.]
self.fill = fill
self.side_range = side_range
if side_range[0] < 1. or side_range[0] > side_range[1]:
raise ValueError("Invalid canvas side range provided {}.".format(side_range))
self.p = p
@torch.jit.unused
def _get_fill_value(self, is_pil):
# type: (bool) -> int
# We fake the type to make it work on JIT
return tuple(int(x) for x in self.fill) if is_pil else 0
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
if torch.rand(1) < self.p:
return image, target
orig_w, orig_h = F._get_image_size(image)
r = self.side_range[0] + torch.rand(1) * (self.side_range[1] - self.side_range[0])
canvas_width = int(orig_w * r)
canvas_height = int(orig_h * r)
r = torch.rand(2)
left = int((canvas_width - orig_w) * r[0])
top = int((canvas_height - orig_h) * r[1])
right = canvas_width - (left + orig_w)
bottom = canvas_height - (top + orig_h)
if torch.jit.is_scripting():
fill = 0
else:
fill = self._get_fill_value(F._is_pil_image(image))
image = F.pad(image, [left, top, right, bottom], fill=fill)
if isinstance(image, torch.Tensor):
v = torch.tensor(self.fill, device=image.device, dtype=image.dtype).view(-1, 1, 1)
image[..., :top, :] = image[..., :, :left] = image[..., (top + orig_h):, :] = \
image[..., :, (left + orig_w):] = v
if target is not None:
target["boxes"][:, 0::2] += left
target["boxes"][:, 1::2] += top
return image, target
class RandomPhotometricDistort(nn.Module):
def __init__(self, contrast: Tuple[float] = (0.5, 1.5), saturation: Tuple[float] = (0.5, 1.5),
hue: Tuple[float] = (-0.05, 0.05), brightness: Tuple[float] = (0.875, 1.125), p: float = 0.5):
super().__init__()
self._brightness = T.ColorJitter(brightness=brightness)
self._contrast = T.ColorJitter(contrast=contrast)
self._hue = T.ColorJitter(hue=hue)
self._saturation = T.ColorJitter(saturation=saturation)
self.p = p
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
r = torch.rand(7)
if r[0] < self.p:
image = self._brightness(image)
contrast_before = r[1] < 0.5
if contrast_before:
if r[2] < self.p:
image = self._contrast(image)
if r[3] < self.p:
image = self._saturation(image)
if r[4] < self.p:
image = self._hue(image)
if not contrast_before:
if r[5] < self.p:
image = self._contrast(image)
if r[6] < self.p:
channels = F._get_image_num_channels(image)
permutation = torch.randperm(channels)
is_pil = F._is_pil_image(image)
if is_pil:
image = F.to_tensor(image)
image = image[..., permutation, :, :]
if is_pil:
image = F.to_pil_image(image)
return image, target
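For instance, `RandomIoUCrop` can also be exercised on its own; boxes whose centers fall outside the sampled crop are dropped and the survivors are shifted and clamped to the new canvas (the image and boxes below are made-up sample data):

```python
import torch

import transforms as T  # references/detection/transforms.py from this PR

crop = T.RandomIoUCrop()

image = torch.rand(3, 480, 640)
target = {"boxes": torch.tensor([[ 50.,  60., 200., 220.],
                                 [400., 100., 600., 300.]]),
          "labels": torch.tensor([1, 2])}

image, target = crop(image, target)
# Depending on the sampled option the input may come back unchanged, or cropped
# with only the surviving boxes (translated to the crop origin and clamped).
print(image.shape, target["boxes"].shape, target["labels"])
```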
......@@ -44,6 +44,7 @@ script_model_unwrapper = {
"maskrcnn_resnet50_fpn": lambda x: x[1],
"keypointrcnn_resnet50_fpn": lambda x: x[1],
"retinanet_resnet50_fpn": lambda x: x[1],
"ssd300_vgg16": lambda x: x[1],
}
......
from collections import OrderedDict
import torch
from common_utils import TestCase
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.anchor_utils import AnchorGenerator, DefaultBoxGenerator
from torchvision.models.detection.image_list import ImageList
......@@ -22,6 +21,12 @@ class Tester(TestCase):
return anchor_generator
def _init_test_defaultbox_generator(self):
aspect_ratios = [[2]]
dbox_generator = DefaultBoxGenerator(aspect_ratios)
return dbox_generator
def get_features(self, images):
s0, s1 = images.shape[-2:]
features = [torch.rand(2, 8, s0 // 5, s1 // 5)]
......@@ -59,3 +64,26 @@ class Tester(TestCase):
self.assertEqual(tuple(anchors[1].shape), (9, 4))
self.assertEqual(anchors[0], anchors_output)
self.assertEqual(anchors[1], anchors_output)
def test_defaultbox_generator(self):
images = torch.zeros(2, 3, 15, 15)
features = [torch.zeros(2, 8, 1, 1)]
image_shapes = [i.shape[-2:] for i in images]
images = ImageList(images, image_shapes)
model = self._init_test_defaultbox_generator()
model.eval()
dboxes = model(images, features)
dboxes_output = torch.tensor([
[6.9750, 6.9750, 8.0250, 8.0250],
[6.7315, 6.7315, 8.2685, 8.2685],
[6.7575, 7.1288, 8.2425, 7.8712],
[7.1288, 6.7575, 7.8712, 8.2425]
])
self.assertEqual(len(dboxes), 2)
self.assertEqual(tuple(dboxes[0].shape), (4, 4))
self.assertEqual(tuple(dboxes[1].shape), (4, 4))
self.assertTrue(dboxes[0].allclose(dboxes_output))
self.assertTrue(dboxes[1].allclose(dboxes_output))
......@@ -139,6 +139,15 @@ class Tester(unittest.TestCase):
self.assertEqual(loss_dict["bbox_regression"], torch.tensor(0.))
def test_forward_negative_sample_ssd(self):
model = torchvision.models.detection.ssd300_vgg16(
num_classes=2, pretrained_backbone=False)
images, targets = self._make_empty_sample()
loss_dict = model(images, targets)
self.assertEqual(loss_dict["bbox_regression"], torch.tensor(0.))
if __name__ == '__main__':
unittest.main()
......@@ -2,3 +2,4 @@ from .faster_rcnn import *
from .mask_rcnn import *
from .keypoint_rcnn import *
from .retinanet import *
from .ssd import *
import math
import torch
from collections import OrderedDict
from torch import Tensor
from typing import List, Tuple
......@@ -344,6 +345,23 @@ class Matcher(object):
matches[pred_inds_to_update] = all_matches[pred_inds_to_update]
class SSDMatcher(Matcher):
def __init__(self, threshold):
super().__init__(threshold, threshold, allow_low_quality_matches=False)
def __call__(self, match_quality_matrix):
matches = super().__call__(match_quality_matrix)
# For each gt, find the prediction with which it has the highest quality
_, highest_quality_pred_foreach_gt = match_quality_matrix.max(dim=1)
matches[highest_quality_pred_foreach_gt] = torch.arange(highest_quality_pred_foreach_gt.size(0),
dtype=torch.int64,
device=highest_quality_pred_foreach_gt.device)
return matches
def overwrite_eps(model, eps):
"""
This method overwrites the default eps values of all the
......@@ -360,3 +378,33 @@ def overwrite_eps(model, eps):
for module in model.modules():
if isinstance(module, FrozenBatchNorm2d):
module.eps = eps
def retrieve_out_channels(model, size):
"""
This method retrieves the number of output channels of a specific model.
Args:
model (nn.Module): The model for which we estimate the out_channels.
It should return a single Tensor or an OrderedDict[Tensor].
size (Tuple[int, int]): The size (wxh) of the input.
Returns:
out_channels (List[int]): A list of the output channels of the model.
"""
in_training = model.training
model.eval()
with torch.no_grad():
# Use dummy data to retrieve the feature map sizes to avoid hard-coding their values
device = next(model.parameters()).device
tmp_img = torch.zeros((1, 3, size[1], size[0]), device=device)
features = model(tmp_img)
if isinstance(features, torch.Tensor):
features = OrderedDict([('0', features)])
out_channels = [x.size(1) for x in features.values()]
if in_training:
model.train()
return out_channels
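A quick sketch of how `retrieve_out_channels` can be exercised; the toy backbone below is purely illustrative, the only requirement being that the model returns a single Tensor or an OrderedDict of feature maps:

```python
from collections import OrderedDict

import torch
from torch import nn

from torchvision.models.detection._utils import retrieve_out_channels


class ToyBackbone(nn.Module):
    """Illustrative two-level feature extractor."""

    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.c2 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        f1 = self.c1(x)
        f2 = self.c2(f1)
        return OrderedDict([("0", f1), ("1", f2)])


backbone = ToyBackbone()
# size is (width, height); a dummy zero image of that size is passed through the model.
print(retrieve_out_channels(backbone, (300, 300)))  # -> [16, 32]
```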
import math
import torch
from torch import nn, Tensor
from typing import List
from typing import List, Optional
from .image_list import ImageList
......@@ -128,3 +129,97 @@ class AnchorGenerator(nn.Module):
anchors.append(anchors_in_image)
anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
return anchors
class DefaultBoxGenerator(nn.Module):
"""
This module generates the default boxes of SSD for a set of feature maps and image sizes.
Args:
aspect_ratios (List[List[int]]): A list with all the aspect ratios used in each feature map.
min_ratio (float): The minimum scale :math:`\text{s}_{\text{min}}` of the default boxes used in the estimation
of the scales of each feature map.
max_ratio (float): The maximum scale :math:`\text{s}_{\text{max}}` of the default boxes used in the estimation
of the scales of each feature map.
steps (List[int], optional): A hyper-parameter that affects the tiling of the default boxes. If not provided,
it will be estimated from the data.
clip (bool): Whether the standardized values of default boxes should be clipped between 0 and 1. The clipping
is applied while the boxes are encoded in format ``(cx, cy, w, h)``.
"""
def __init__(self, aspect_ratios: List[List[int]], min_ratio: float = 0.15, max_ratio: float = 0.9,
steps: Optional[List[int]] = None, clip: bool = True):
super().__init__()
if steps is not None:
assert len(aspect_ratios) == len(steps)
self.aspect_ratios = aspect_ratios
self.steps = steps
self.clip = clip
num_outputs = len(aspect_ratios)
# Estimation of default boxes scales
# Inspired from https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_pascal.py#L311-L317
min_centile = int(100 * min_ratio)
max_centile = int(100 * max_ratio)
conv4_centile = min_centile // 2 # assume half of min_ratio as in paper
step = (max_centile - min_centile) // (num_outputs - 2)
centiles = [conv4_centile, min_centile]
for c in range(min_centile, max_centile + 1, step):
centiles.append(c + step)
self.scales = [c / 100 for c in centiles]
self._wh_pairs = []
for k in range(num_outputs):
# Adding the 2 default width-height pairs for aspect ratio 1 and scale s'k
s_k = self.scales[k]
s_prime_k = math.sqrt(self.scales[k] * self.scales[k + 1])
wh_pairs = [(s_k, s_k), (s_prime_k, s_prime_k)]
# Adding 2 pairs for each aspect ratio of the feature map k
for ar in self.aspect_ratios[k]:
sq_ar = math.sqrt(ar)
w = self.scales[k] * sq_ar
h = self.scales[k] / sq_ar
wh_pairs.extend([(w, h), (h, w)])
self._wh_pairs.append(wh_pairs)
def num_anchors_per_location(self):
# Estimate num of anchors based on aspect ratios: 2 default boxes + 2 * ratios of the feature map.
return [2 + 2 * len(r) for r in self.aspect_ratios]
def __repr__(self) -> str:
s = self.__class__.__name__ + '('
s += 'aspect_ratios={aspect_ratios}'
s += ', clip={clip}'
s += ', scales={scales}'
s += ', steps={steps}'
s += ')'
return s.format(**self.__dict__)
def forward(self, image_list: ImageList, feature_maps: List[Tensor]) -> List[Tensor]:
grid_sizes = [feature_map.shape[-2:] for feature_map in feature_maps]
image_size = image_list.tensors.shape[-2:]
dtype, device = feature_maps[0].dtype, feature_maps[0].device
# Default Boxes calculation based on page 6 of SSD paper
default_boxes: List[List[float]] = []
for k, f_k in enumerate(grid_sizes):
# Now add the default boxes for each width-height pair
for j in range(f_k[0]):
cy = (j + 0.5) / (float(f_k[0]) if self.steps is None else image_size[1] / self.steps[k])
for i in range(f_k[1]):
cx = (i + 0.5) / (float(f_k[1]) if self.steps is None else image_size[0] / self.steps[k])
default_boxes.extend([[cx, cy, w, h] for w, h in self._wh_pairs[k]])
dboxes = []
for _ in image_list.image_sizes:
dboxes_in_image = torch.tensor(default_boxes, dtype=dtype, device=device)
if self.clip:
dboxes_in_image.clamp_(min=0, max=1)
dboxes_in_image = torch.cat([dboxes_in_image[:, :2] - 0.5 * dboxes_in_image[:, 2:],
dboxes_in_image[:, :2] + 0.5 * dboxes_in_image[:, 2:]], -1)
dboxes_in_image[:, 0::2] *= image_size[1]
dboxes_in_image[:, 1::2] *= image_size[0]
dboxes.append(dboxes_in_image)
return dboxes
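Roughly how the generator is driven; the aspect ratios and steps below follow the SSD300 configuration from the paper, though the exact values wired into `ssd300_vgg16` are an assumption here:

```python
import torch

from torchvision.models.detection.anchor_utils import DefaultBoxGenerator
from torchvision.models.detection.image_list import ImageList

# Assumed SSD300-style configuration: six feature maps with these aspect ratios.
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
dbox_gen = DefaultBoxGenerator(aspect_ratios, steps=[8, 16, 32, 64, 100, 300])
print(dbox_gen.num_anchors_per_location())  # -> [4, 6, 6, 6, 4, 4]

images = torch.zeros(1, 3, 300, 300)
image_list = ImageList(images, [i.shape[-2:] for i in images])
# Feature map resolutions matching SSD300: 38, 19, 10, 5, 3 and 1.
feature_maps = [torch.zeros(1, 8, s, s) for s in (38, 19, 10, 5, 3, 1)]

dboxes = dbox_gen(image_list, feature_maps)
print(dboxes[0].shape)  # 8732 default boxes per image, in (x1, y1, x2, y2) format
```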
import math
import torch
from torch import nn, Tensor
from torch.nn import functional as F
import torchvision
from torch import nn, Tensor
from typing import List, Tuple, Dict, Optional
from .image_list import ImageList
......@@ -23,13 +23,20 @@ def _fake_cast_onnx(v):
return v
def _resize_image_and_masks(image, self_min_size, self_max_size, target):
# type: (Tensor, float, float, Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]
def _resize_image_and_masks(image: Tensor, self_min_size: float, self_max_size: float,
target: Optional[Dict[str, Tensor]],
fixed_size: Optional[Tuple[int, int]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if torchvision._is_tracing():
im_shape = _get_shape_onnx(image)
else:
im_shape = torch.tensor(image.shape[-2:])
size: Optional[List[int]] = None
scale_factor: Optional[float] = None
recompute_scale_factor: Optional[bool] = None
if fixed_size is not None:
size = [fixed_size[1], fixed_size[0]]
else:
min_size = torch.min(im_shape).to(dtype=torch.float32)
max_size = torch.max(im_shape).to(dtype=torch.float32)
scale = torch.min(self_min_size / min_size, self_max_size / max_size)
......@@ -38,17 +45,18 @@ def _resize_image_and_masks(image, self_min_size, self_max_size, target):
scale_factor = _fake_cast_onnx(scale)
else:
scale_factor = scale.item()
recompute_scale_factor = True
image = torch.nn.functional.interpolate(
image[None], scale_factor=scale_factor, mode='bilinear', recompute_scale_factor=True,
align_corners=False)[0]
image = torch.nn.functional.interpolate(image[None], size=size, scale_factor=scale_factor, mode='bilinear',
recompute_scale_factor=recompute_scale_factor, align_corners=False)[0]
if target is None:
return image, target
if "masks" in target:
mask = target["masks"]
mask = F.interpolate(mask[:, None].float(), scale_factor=scale_factor, recompute_scale_factor=True)[:, 0].byte()
mask = torch.nn.functional.interpolate(mask[:, None].float(), size=size, scale_factor=scale_factor,
recompute_scale_factor=recompute_scale_factor)[:, 0].byte()
target["masks"] = mask
return image, target
......@@ -65,7 +73,7 @@ class GeneralizedRCNNTransform(nn.Module):
It returns a ImageList for the inputs, and a List[Dict[Tensor]] for the targets
"""
def __init__(self, min_size, max_size, image_mean, image_std):
def __init__(self, min_size, max_size, image_mean, image_std, size_divisible=32, fixed_size=None):
super(GeneralizedRCNNTransform, self).__init__()
if not isinstance(min_size, (list, tuple)):
min_size = (min_size,)
......@@ -73,6 +81,8 @@ class GeneralizedRCNNTransform(nn.Module):
self.max_size = max_size
self.image_mean = image_mean
self.image_std = image_std
self.size_divisible = size_divisible
self.fixed_size = fixed_size
def forward(self,
images, # type: List[Tensor]
......@@ -106,7 +116,7 @@ class GeneralizedRCNNTransform(nn.Module):
targets[i] = target_index
image_sizes = [img.shape[-2:] for img in images]
images = self.batch_images(images)
images = self.batch_images(images, size_divisible=self.size_divisible)
image_sizes_list: List[Tuple[int, int]] = []
for image_size in image_sizes:
assert len(image_size) == 2
......@@ -144,7 +154,7 @@ class GeneralizedRCNNTransform(nn.Module):
else:
# FIXME assume for now that testing uses the largest scale
size = float(self.min_size[-1])
image, target = _resize_image_and_masks(image, size, float(self.max_size), target)
image, target = _resize_image_and_masks(image, size, float(self.max_size), target, self.fixed_size)
if target is None:
return image, target
......
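To illustrate the new `fixed_size` and `size_divisible` arguments (the mean/std values below are the usual ImageNet statistics and are only an assumption for this sketch):

```python
import torch

from torchvision.models.detection.transform import GeneralizedRCNNTransform

# SSD-style transform: every image is resized to exactly 300x300 and batched without padding.
transform = GeneralizedRCNNTransform(min_size=300, max_size=300,
                                     image_mean=[0.485, 0.456, 0.406],
                                     image_std=[0.229, 0.224, 0.225],
                                     size_divisible=1, fixed_size=(300, 300))

images = [torch.rand(3, 480, 640), torch.rand(3, 600, 400)]
image_list, _ = transform(images, None)
print(image_list.tensors.shape)  # -> torch.Size([2, 3, 300, 300])
```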