".github/vscode:/vscode.git/clone" did not exist on "b4172770ae2a5fb67c129a321a2f566a669f26d5"
Unverified Commit 730c5e1e authored by Vasilis Vryniotis, committed by GitHub

Add SSD architecture with VGG16 backbone (#3403)

* Early skeleton of API.

* Adding MultiFeatureMap and vgg16 backbone.

* Making vgg16 backbone same as paper.

* Making code generic to support all vggs.

* Moving vgg's extra layers to a separate class + L2 scaling.

* Adding header vgg layers.

* Fix maxpool patching.

* Refactoring code to allow for support of different backbones & sizes:
- Skeleton for Default Boxes generator class
- Dynamic estimation of configuration when possible
- Addition of types

* Complete the implementation of DefaultBox generator.

* Replace randn with empty.

* Minor refactoring

* Making clamping between 0 and 1 optional.

* Change xywh to xyxy encoding.

* Adding parameters and reusing objects in constructor.

* Temporarily inherit from Retina to avoid dup code.

* Implement forward methods + temp workarounds to inherit from retina.

* Inherit more methods from retinanet.

* Fix type error.

* Add Regression loss.

* Fixing JIT issues.

* Change JIT workaround to minimize new code.

* Fixing initialization bug.

* Add classification loss.

* Update todos.

* Add weight loading support.

* Support SSD512.

* Change kernel_size to get output size 1x1

* Add xavier init and refactoring.

* Adding unit-tests and fixing JIT issues.

* Add a test for dbox generator.

* Remove unnecessary import.

* Workaround on GeneralizedRCNNTransform to support fixed size input.

* Remove unnecessary random calls from the test.

* Remove more rand calls from the test.

* change mapping and handling of empty labels

* Fix JIT warnings.

* Speed up loss.

* Convert 0-1 dboxes to original size.

* Fix warning.

* Fix tests.

* Update comments.

* Fixing minor bugs.

* Introduce a custom DBoxMatcher.

* Minor refactoring

* Move extra layer definition inside feature extractor.

* handle no bias on init.

* Remove fixed image size limitation

* Change initialization values for bias of classification head.

* Refactoring and update test file.

* Adding ResNet backbone.

* Minor refactoring.

* Remove inheritance of retina and general refactoring.

* SSD should fix the input size.

* Fixing messages and comments.

* Silently ignoring exception if test-only.

* Update comments.

* Update regression loss.

* Restore Xavier init everywhere, update the negative sampling method, change the clipping approach.

* Fixing tests.

* Refactor to move the losses from the Head to the SSD.

* Removing resnet50 ssd version.

* Adding support for best performing backbone and its config.

* Refactor and clean up the API.

* Fix lint

* Update todos and comments.

* Adding RandomHorizontalFlip and RandomIoUCrop transforms.

* Adding necessary checks to our transforms.

* Adding RandomZoomOut.

* Adding RandomPhotometricDistort.

* Moving Detection transforms to references.

* Update presets

* fix lint

* leave compose and object

* Adding scaling for completeness.

* Adding params in the repr

* Remove unnecessary import.

* minor refactoring

* Remove unnecessary call.

* Give better names to DBox* classes

* Port num_anchors estimation in generator

* Remove rescaling and fix presets

* Add the ability to pass a custom head and refactoring.

* fix lint

* Fix unit-test

* Update todos.

* Change mean values.

* Change the default parameter of SSD to train the full VGG16 and remove the catch of exception for eval only.

* Adding documentation

* Adding weights and updating readmes.

* Update the model weights with a more performing model.

* Adding doc for head.

* Restore import.
parent 7c35e133
......@@ -381,17 +381,18 @@ Object Detection, Instance Segmentation and Person Keypoint Detection
The models subpackage contains definitions for the following model
architectures for detection:
- `Faster R-CNN ResNet-50 FPN <https://arxiv.org/abs/1506.01497>`_
- `Mask R-CNN ResNet-50 FPN <https://arxiv.org/abs/1703.06870>`_
- `Faster R-CNN <https://arxiv.org/abs/1506.01497>`_
- `Mask R-CNN <https://arxiv.org/abs/1703.06870>`_
- `RetinaNet <https://arxiv.org/abs/1708.02002>`_
- `SSD <https://arxiv.org/abs/1512.02325>`_
The pre-trained models for detection, instance segmentation and
keypoint detection are initialized with the classification models
in torchvision.
The models expect a list of ``Tensor[C, H, W]``, in the range ``0-1``.
The models internally resize the images so that they have a minimum size
of ``800``. This option can be changed by passing the option ``min_size``
to the constructor of the models.
The models internally resize the images but the behaviour varies depending
on the model. Check the constructor of the models for more information.
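For example, a minimal inference sketch with the new SSD model (assuming the released COCO weights can be downloaded):

```
import torch
import torchvision

# SSD300 with a VGG16 backbone, pre-trained on COCO train2017 (added in this PR).
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

# A list of 0-1 ranged Tensor[C, H, W]; sizes may differ, the model resizes them
# internally (SSD uses a fixed 300x300 input).
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
with torch.no_grad():
    predictions = model(images)  # List[Dict] with 'boxes', 'labels', 'scores'
```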
For object detection and instance segmentation, the pre-trained
......@@ -425,6 +426,7 @@ Faster R-CNN ResNet-50 FPN 37.0 - -
Faster R-CNN MobileNetV3-Large FPN 32.8 - -
Faster R-CNN MobileNetV3-Large 320 FPN 22.8 - -
RetinaNet ResNet-50 FPN 36.4 - -
SSD VGG16 25.1 - -
Mask R-CNN ResNet-50 FPN 37.9 34.6 -
====================================== ======= ======== ===========
......@@ -483,6 +485,7 @@ Faster R-CNN ResNet-50 FPN 0.2288 0.0590
Faster R-CNN MobileNetV3-Large FPN 0.1020 0.0415 1.0
Faster R-CNN MobileNetV3-Large 320 FPN 0.0978 0.0376 0.6
RetinaNet ResNet-50 FPN 0.2514 0.0939 4.1
SSD VGG16 0.2093 0.0744 1.5
Mask R-CNN ResNet-50 FPN 0.2728 0.0903 5.4
Keypoint R-CNN ResNet-50 FPN 0.3789 0.1242 6.8
====================================== =================== ================== ===========
......@@ -502,6 +505,12 @@ RetinaNet
.. autofunction:: torchvision.models.detection.retinanet_resnet50_fpn
SSD
------------
.. autofunction:: torchvision.models.detection.ssd300_vgg16
Mask R-CNN
----------
......
......@@ -48,6 +48,14 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
--lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01
```
### SSD VGG16
```
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
--dataset coco --model ssd300_vgg16 --epochs 120\
--lr-steps 80 110 --aspect-ratio-group-factor 3 --lr 0.002 --batch-size 4\
--weight-decay 0.0005 --data-augmentation ssd
```
### Mask R-CNN
```
......
......@@ -2,12 +2,22 @@ import transforms as T
class DetectionPresetTrain:
def __init__(self, hflip_prob=0.5):
trans = [T.ToTensor()]
if hflip_prob > 0:
trans.append(T.RandomHorizontalFlip(hflip_prob))
self.transforms = T.Compose(trans)
def __init__(self, data_augmentation, hflip_prob=0.5, mean=(123., 117., 104.)):
if data_augmentation == 'hflip':
self.transforms = T.Compose([
T.RandomHorizontalFlip(p=hflip_prob),
T.ToTensor(),
])
elif data_augmentation == 'ssd':
self.transforms = T.Compose([
T.RandomPhotometricDistort(),
T.RandomZoomOut(fill=list(mean)),
T.RandomIoUCrop(),
T.RandomHorizontalFlip(p=hflip_prob),
T.ToTensor(),
])
else:
raise ValueError(f'Unknown data augmentation policy "{data_augmentation}"')
def __call__(self, img, target):
return self.transforms(img, target)
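A minimal usage sketch of the new 'ssd' augmentation policy (hypothetical dummy data; `presets` is the references/detection module above):

```
from PIL import Image
import torch
import presets  # references/detection/presets.py

preset = presets.DetectionPresetTrain(data_augmentation='ssd')
img = Image.new('RGB', (640, 480))  # stand-in for a real training image
target = {
    'boxes': torch.tensor([[10., 20., 200., 300.]]),  # xyxy
    'labels': torch.tensor([1]),
}
img, target = preset(img, target)  # img is now a 0-1 Tensor[C, H, W], boxes are updated
```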
......
......@@ -47,8 +47,8 @@ def get_dataset(name, image_set, transform, data_path):
return ds, num_classes
def get_transform(train):
return presets.DetectionPresetTrain() if train else presets.DetectionPresetEval()
def get_transform(train, data_augmentation):
return presets.DetectionPresetTrain(data_augmentation) if train else presets.DetectionPresetEval()
def main(args):
......@@ -60,8 +60,9 @@ def main(args):
# Data loading code
print("Loading data")
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(train=True), args.data_path)
dataset_test, _ = get_dataset(args.dataset, "val", get_transform(train=False), args.data_path)
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(True, args.data_augmentation),
args.data_path)
dataset_test, _ = get_dataset(args.dataset, "val", get_transform(False, args.data_augmentation), args.data_path)
print("Creating data loaders")
if args.distributed:
......@@ -179,6 +180,7 @@ if __name__ == "__main__":
parser.add_argument('--rpn-score-thresh', default=None, type=float, help='rpn score threshold for faster-rcnn')
parser.add_argument('--trainable-backbone-layers', default=None, type=int,
help='number of trainable layers of backbone')
parser.add_argument('--data-augmentation', default="hflip", help='data augmentation policy (default: hflip)')
parser.add_argument(
"--test-only",
dest="test_only",
......
import random
import torch
import torchvision
from torch import nn, Tensor
from torchvision.transforms import functional as F
from torchvision.transforms import transforms as T
from typing import List, Tuple, Dict, Optional
def _flip_coco_person_keypoints(kps, width):
......@@ -23,27 +27,213 @@ class Compose(object):
return image, target
class RandomHorizontalFlip(object):
def __init__(self, prob):
self.prob = prob
def __call__(self, image, target):
if random.random() < self.prob:
height, width = image.shape[-2:]
image = image.flip(-1)
bbox = target["boxes"]
bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
target["boxes"] = bbox
if "masks" in target:
target["masks"] = target["masks"].flip(-1)
if "keypoints" in target:
keypoints = target["keypoints"]
keypoints = _flip_coco_person_keypoints(keypoints, width)
target["keypoints"] = keypoints
class RandomHorizontalFlip(T.RandomHorizontalFlip):
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if torch.rand(1) < self.p:
image = F.hflip(image)
if target is not None:
width, _ = F._get_image_size(image)
target["boxes"][:, [0, 2]] = width - target["boxes"][:, [2, 0]]
if "masks" in target:
target["masks"] = target["masks"].flip(-1)
if "keypoints" in target:
keypoints = target["keypoints"]
keypoints = _flip_coco_person_keypoints(keypoints, width)
target["keypoints"] = keypoints
return image, target
class ToTensor(object):
def __call__(self, image, target):
class ToTensor(nn.Module):
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
image = F.to_tensor(image)
return image, target
class RandomIoUCrop(nn.Module):
def __init__(self, min_scale: float = 0.3, max_scale: float = 1.0, min_aspect_ratio: float = 0.5,
max_aspect_ratio: float = 2.0, sampler_options: Optional[List[float]] = None, trials: int = 40):
super().__init__()
# Configuration similar to https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_coco.py#L89-L174
self.min_scale = min_scale
self.max_scale = max_scale
self.min_aspect_ratio = min_aspect_ratio
self.max_aspect_ratio = max_aspect_ratio
if sampler_options is None:
sampler_options = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
self.options = sampler_options
self.trials = trials
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if target is None:
raise ValueError("The targets can't be None for this transform.")
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
orig_w, orig_h = F._get_image_size(image)
while True:
# sample an option
idx = int(torch.randint(low=0, high=len(self.options), size=(1,)))
min_jaccard_overlap = self.options[idx]
if min_jaccard_overlap >= 1.0: # a value larger than 1 encodes the leave as-is option
return image, target
for _ in range(self.trials):
# check the aspect ratio limitations
r = self.min_scale + (self.max_scale - self.min_scale) * torch.rand(2)
new_w = int(orig_w * r[0])
new_h = int(orig_h * r[1])
aspect_ratio = new_w / new_h
if not (self.min_aspect_ratio <= aspect_ratio <= self.max_aspect_ratio):
continue
# check for 0 area crops
r = torch.rand(2)
left = int((orig_w - new_w) * r[0])
top = int((orig_h - new_h) * r[1])
right = left + new_w
bottom = top + new_h
if left == right or top == bottom:
continue
# check for any valid boxes with centers within the crop area
cx = 0.5 * (target["boxes"][:, 0] + target["boxes"][:, 2])
cy = 0.5 * (target["boxes"][:, 1] + target["boxes"][:, 3])
is_within_crop_area = (left < cx) & (cx < right) & (top < cy) & (cy < bottom)
if not is_within_crop_area.any():
continue
# check at least 1 box with jaccard limitations
boxes = target["boxes"][is_within_crop_area]
ious = torchvision.ops.boxes.box_iou(boxes, torch.tensor([[left, top, right, bottom]],
dtype=boxes.dtype, device=boxes.device))
if ious.max() < min_jaccard_overlap:
continue
# keep only valid boxes and perform cropping
target["boxes"] = boxes
target["labels"] = target["labels"][is_within_crop_area]
target["boxes"][:, 0::2] -= left
target["boxes"][:, 1::2] -= top
target["boxes"][:, 0::2].clamp_(min=0, max=new_w)
target["boxes"][:, 1::2].clamp_(min=0, max=new_h)
image = F.crop(image, top, left, new_h, new_w)
return image, target
class RandomZoomOut(nn.Module):
def __init__(self, fill: Optional[List[float]] = None, side_range: Tuple[float, float] = (1., 4.), p: float = 0.5):
super().__init__()
if fill is None:
fill = [0., 0., 0.]
self.fill = fill
self.side_range = side_range
if side_range[0] < 1. or side_range[0] > side_range[1]:
raise ValueError("Invalid canvas side range provided {}.".format(side_range))
self.p = p
@torch.jit.unused
def _get_fill_value(self, is_pil):
# type: (bool) -> int
# We fake the type to make it work on JIT
return tuple(int(x) for x in self.fill) if is_pil else 0
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
if torch.rand(1) < self.p:
return image, target
orig_w, orig_h = F._get_image_size(image)
r = self.side_range[0] + torch.rand(1) * (self.side_range[1] - self.side_range[0])
canvas_width = int(orig_w * r)
canvas_height = int(orig_h * r)
r = torch.rand(2)
left = int((canvas_width - orig_w) * r[0])
top = int((canvas_height - orig_h) * r[1])
right = canvas_width - (left + orig_w)
bottom = canvas_height - (top + orig_h)
if torch.jit.is_scripting():
fill = 0
else:
fill = self._get_fill_value(F._is_pil_image(image))
image = F.pad(image, [left, top, right, bottom], fill=fill)
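# For tensor inputs the F.pad call above used a constant 0 fill, so paint the four
# padded bands with the requested fill value instead.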
if isinstance(image, torch.Tensor):
v = torch.tensor(self.fill, device=image.device, dtype=image.dtype).view(-1, 1, 1)
image[..., :top, :] = image[..., :, :left] = image[..., (top + orig_h):, :] = \
image[..., :, (left + orig_w):] = v
if target is not None:
target["boxes"][:, 0::2] += left
target["boxes"][:, 1::2] += top
return image, target
class RandomPhotometricDistort(nn.Module):
def __init__(self, contrast: Tuple[float] = (0.5, 1.5), saturation: Tuple[float] = (0.5, 1.5),
hue: Tuple[float] = (-0.05, 0.05), brightness: Tuple[float] = (0.875, 1.125), p: float = 0.5):
super().__init__()
self._brightness = T.ColorJitter(brightness=brightness)
self._contrast = T.ColorJitter(contrast=contrast)
self._hue = T.ColorJitter(hue=hue)
self._saturation = T.ColorJitter(saturation=saturation)
self.p = p
def forward(self, image: Tensor,
target: Optional[Dict[str, Tensor]] = None) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if isinstance(image, torch.Tensor):
if image.ndimension() not in {2, 3}:
raise ValueError('image should be 2/3 dimensional. Got {} dimensions.'.format(image.ndimension()))
elif image.ndimension() == 2:
image = image.unsqueeze(0)
r = torch.rand(7)
if r[0] < self.p:
image = self._brightness(image)
contrast_before = r[1] < 0.5
if contrast_before:
if r[2] < self.p:
image = self._contrast(image)
if r[3] < self.p:
image = self._saturation(image)
if r[4] < self.p:
image = self._hue(image)
if not contrast_before:
if r[5] < self.p:
image = self._contrast(image)
if r[6] < self.p:
channels = F._get_image_num_channels(image)
permutation = torch.randperm(channels)
is_pil = F._is_pil_image(image)
if is_pil:
image = F.to_tensor(image)
image = image[..., permutation, :, :]
if is_pil:
image = F.to_pil_image(image)
return image, target
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
......@@ -44,6 +44,7 @@ script_model_unwrapper = {
"maskrcnn_resnet50_fpn": lambda x: x[1],
"keypointrcnn_resnet50_fpn": lambda x: x[1],
"retinanet_resnet50_fpn": lambda x: x[1],
"ssd300_vgg16": lambda x: x[1],
}
......
from collections import OrderedDict
import torch
from common_utils import TestCase
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.anchor_utils import AnchorGenerator, DefaultBoxGenerator
from torchvision.models.detection.image_list import ImageList
......@@ -22,6 +21,12 @@ class Tester(TestCase):
return anchor_generator
def _init_test_defaultbox_generator(self):
aspect_ratios = [[2]]
dbox_generator = DefaultBoxGenerator(aspect_ratios)
return dbox_generator
def get_features(self, images):
s0, s1 = images.shape[-2:]
features = [torch.rand(2, 8, s0 // 5, s1 // 5)]
......@@ -59,3 +64,26 @@ class Tester(TestCase):
self.assertEqual(tuple(anchors[1].shape), (9, 4))
self.assertEqual(anchors[0], anchors_output)
self.assertEqual(anchors[1], anchors_output)
def test_defaultbox_generator(self):
images = torch.zeros(2, 3, 15, 15)
features = [torch.zeros(2, 8, 1, 1)]
image_shapes = [i.shape[-2:] for i in images]
images = ImageList(images, image_shapes)
model = self._init_test_defaultbox_generator()
model.eval()
dboxes = model(images, features)
dboxes_output = torch.tensor([
[6.9750, 6.9750, 8.0250, 8.0250],
[6.7315, 6.7315, 8.2685, 8.2685],
[6.7575, 7.1288, 8.2425, 7.8712],
[7.1288, 6.7575, 7.8712, 8.2425]
])
self.assertEqual(len(dboxes), 2)
self.assertEqual(tuple(dboxes[0].shape), (4, 4))
self.assertEqual(tuple(dboxes[1].shape), (4, 4))
self.assertTrue(dboxes[0].allclose(dboxes_output))
self.assertTrue(dboxes[1].allclose(dboxes_output))
......@@ -139,6 +139,15 @@ class Tester(unittest.TestCase):
self.assertEqual(loss_dict["bbox_regression"], torch.tensor(0.))
def test_forward_negative_sample_ssd(self):
model = torchvision.models.detection.ssd300_vgg16(
num_classes=2, pretrained_backbone=False)
images, targets = self._make_empty_sample()
loss_dict = model(images, targets)
self.assertEqual(loss_dict["bbox_regression"], torch.tensor(0.))
if __name__ == '__main__':
unittest.main()
......@@ -2,3 +2,4 @@ from .faster_rcnn import *
from .mask_rcnn import *
from .keypoint_rcnn import *
from .retinanet import *
from .ssd import *
import math
import torch
from collections import OrderedDict
from torch import Tensor
from typing import List, Tuple
......@@ -344,6 +345,23 @@ class Matcher(object):
matches[pred_inds_to_update] = all_matches[pred_inds_to_update]
class SSDMatcher(Matcher):
def __init__(self, threshold):
super().__init__(threshold, threshold, allow_low_quality_matches=False)
def __call__(self, match_quality_matrix):
matches = super().__call__(match_quality_matrix)
# For each gt, find the prediction with which it has the highest quality
_, highest_quality_pred_foreach_gt = match_quality_matrix.max(dim=1)
matches[highest_quality_pred_foreach_gt] = torch.arange(highest_quality_pred_foreach_gt.size(0),
dtype=torch.int64,
device=highest_quality_pred_foreach_gt.device)
return matches
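A small worked example (hypothetical numbers) of what the forced matching above buys us: every ground-truth box keeps its best anchor even when that anchor's IoU is below the threshold.

```
import torch
from torchvision.models.detection import _utils as det_utils

matcher = det_utils.SSDMatcher(0.5)
# rows = 2 ground-truth boxes, columns = 3 anchors (IoU values)
match_quality_matrix = torch.tensor([[0.30, 0.10, 0.05],
                                     [0.20, 0.60, 0.55]])
print(matcher(match_quality_matrix))  # tensor([0, 1, 1])
# anchor 0 has max IoU 0.30 < 0.5, yet it is assigned to gt 0 because it is gt 0's
# best anchor; anchors 1 and 2 keep their above-threshold matches with gt 1.
```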
def overwrite_eps(model, eps):
"""
This method overwrites the default eps values of all the
......@@ -360,3 +378,33 @@ def overwrite_eps(model, eps):
for module in model.modules():
if isinstance(module, FrozenBatchNorm2d):
module.eps = eps
def retrieve_out_channels(model, size):
"""
This method retrieves the number of output channels of a specific model.
Args:
model (nn.Module): The model for which we estimate the out_channels.
It should return a single Tensor or an OrderedDict[Tensor].
size (Tuple[int, int]): The size (wxh) of the input.
Returns:
out_channels (List[int]): A list of the output channels of the model.
"""
in_training = model.training
model.eval()
with torch.no_grad():
# Use dummy data to retrieve the feature map sizes to avoid hard-coding their values
device = next(model.parameters()).device
tmp_img = torch.zeros((1, 3, size[1], size[0]), device=device)
features = model(tmp_img)
if isinstance(features, torch.Tensor):
features = OrderedDict([('0', features)])
out_channels = [x.size(1) for x in features.values()]
if in_training:
model.train()
return out_channels
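A quick sketch of how this helper behaves (hypothetical toy backbone):

```
import torch
from torch import nn
from torchvision.models.detection import _utils as det_utils

backbone = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU())
# Probes the model with a zero image of the given (w, h) size and reads the channel dims.
print(det_utils.retrieve_out_channels(backbone, (300, 300)))  # [16]
```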
import math
import torch
from torch import nn, Tensor
from typing import List
from typing import List, Optional
from .image_list import ImageList
......@@ -128,3 +129,97 @@ class AnchorGenerator(nn.Module):
anchors.append(anchors_in_image)
anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
return anchors
class DefaultBoxGenerator(nn.Module):
"""
This module generates the default boxes of SSD for a set of feature maps and image sizes.
Args:
aspect_ratios (List[List[int]]): A list with all the aspect ratios used in each feature map.
min_ratio (float): The minimum scale :math:`\text{s}_{\text{min}}` of the default boxes used in the estimation
of the scales of each feature map.
max_ratio (float): The maximum scale :math:`\text{s}_{\text{max}}` of the default boxes used in the estimation
of the scales of each feature map.
steps (List[int], optional): A hyper-parameter that affects the tiling of the default boxes. If not provided,
it will be estimated from the data.
clip (bool): Whether the standardized values of default boxes should be clipped between 0 and 1. The clipping
is applied while the boxes are encoded in format ``(cx, cy, w, h)``.
"""
def __init__(self, aspect_ratios: List[List[int]], min_ratio: float = 0.15, max_ratio: float = 0.9,
steps: Optional[List[int]] = None, clip: bool = True):
super().__init__()
if steps is not None:
assert len(aspect_ratios) == len(steps)
self.aspect_ratios = aspect_ratios
self.steps = steps
self.clip = clip
num_outputs = len(aspect_ratios)
# Estimation of default boxes scales
# Inspired from https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_pascal.py#L311-L317
min_centile = int(100 * min_ratio)
max_centile = int(100 * max_ratio)
conv4_centile = min_centile // 2 # assume half of min_ratio as in paper
step = (max_centile - min_centile) // (num_outputs - 2)
centiles = [conv4_centile, min_centile]
for c in range(min_centile, max_centile + 1, step):
centiles.append(c + step)
self.scales = [c / 100 for c in centiles]
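# e.g. with the defaults (min_ratio=0.15, max_ratio=0.9) and 6 feature maps this gives
# scales of [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]; the extra last value is only
# used to compute s'_k of the final feature map below.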
self._wh_pairs = []
for k in range(num_outputs):
# Adding the 2 default width-height pairs for aspect ratio 1 and scale s'k
s_k = self.scales[k]
s_prime_k = math.sqrt(self.scales[k] * self.scales[k + 1])
wh_pairs = [(s_k, s_k), (s_prime_k, s_prime_k)]
# Adding 2 pairs for each aspect ratio of the feature map k
for ar in self.aspect_ratios[k]:
sq_ar = math.sqrt(ar)
w = self.scales[k] * sq_ar
h = self.scales[k] / sq_ar
wh_pairs.extend([(w, h), (h, w)])
self._wh_pairs.append(wh_pairs)
def num_anchors_per_location(self):
# Estimate num of anchors based on aspect ratios: 2 default boxes + 2 * ratios of feature map.
return [2 + 2 * len(r) for r in self.aspect_ratios]
def __repr__(self) -> str:
s = self.__class__.__name__ + '('
s += 'aspect_ratios={aspect_ratios}'
s += ', clip={clip}'
s += ', scales={scales}'
s += ', steps={steps}'
s += ')'
return s.format(**self.__dict__)
def forward(self, image_list: ImageList, feature_maps: List[Tensor]) -> List[Tensor]:
grid_sizes = [feature_map.shape[-2:] for feature_map in feature_maps]
image_size = image_list.tensors.shape[-2:]
dtype, device = feature_maps[0].dtype, feature_maps[0].device
# Default Boxes calculation based on page 6 of SSD paper
default_boxes: List[List[float]] = []
for k, f_k in enumerate(grid_sizes):
# Now add the default boxes for each width-height pair
for j in range(f_k[0]):
cy = (j + 0.5) / (float(f_k[0]) if self.steps is None else image_size[1] / self.steps[k])
for i in range(f_k[1]):
cx = (i + 0.5) / (float(f_k[1]) if self.steps is None else image_size[0] / self.steps[k])
default_boxes.extend([[cx, cy, w, h] for w, h in self._wh_pairs[k]])
dboxes = []
for _ in image_list.image_sizes:
dboxes_in_image = torch.tensor(default_boxes, dtype=dtype, device=device)
if self.clip:
dboxes_in_image.clamp_(min=0, max=1)
dboxes_in_image = torch.cat([dboxes_in_image[:, :2] - 0.5 * dboxes_in_image[:, 2:],
dboxes_in_image[:, :2] + 0.5 * dboxes_in_image[:, 2:]], -1)
dboxes_in_image[:, 0::2] *= image_size[1]
dboxes_in_image[:, 1::2] *= image_size[0]
dboxes.append(dboxes_in_image)
return dboxes
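A minimal sketch of the generator in isolation (dummy inputs, mirroring the new unit test):

```
import torch
from torchvision.models.detection.anchor_utils import DefaultBoxGenerator
from torchvision.models.detection.image_list import ImageList

images = torch.zeros(2, 3, 300, 300)
image_list = ImageList(images, [i.shape[-2:] for i in images])
features = [torch.zeros(2, 8, 1, 1)]  # a single 1x1 feature map

generator = DefaultBoxGenerator([[2]])
dboxes = generator(image_list, features)
print(len(dboxes), dboxes[0].shape)  # 2 torch.Size([4, 4]) -- xyxy, in pixels
```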
import torch
import torch.nn.functional as F
import warnings
from collections import OrderedDict
from torch import nn, Tensor
from typing import Any, Dict, List, Optional, Tuple
from . import _utils as det_utils
from .anchor_utils import DefaultBoxGenerator
from .backbone_utils import _validate_trainable_layers
from .transform import GeneralizedRCNNTransform
from .. import vgg
from ..utils import load_state_dict_from_url
from ...ops import boxes as box_ops
__all__ = ['SSD', 'ssd300_vgg16']
model_urls = {
'ssd300_vgg16_coco': 'https://download.pytorch.org/models/ssd300_vgg16_coco-b556d3b4.pth',
}
backbone_urls = {
# We port the features of a VGG16 backbone trained by amdegroot because unlike the one on TorchVision, it uses the
# same input standardization method as the paper. Ref: https://s3.amazonaws.com/amdegroot-models/vgg16_reducedfc.pth
'vgg16_features': 'https://download.pytorch.org/models/vgg16_features-amdegroot.pth'
}
def _xavier_init(conv: nn.Module):
for layer in conv.modules():
if isinstance(layer, nn.Conv2d):
torch.nn.init.xavier_uniform_(layer.weight)
if layer.bias is not None:
torch.nn.init.constant_(layer.bias, 0.0)
class SSDHead(nn.Module):
def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
super().__init__()
self.classification_head = SSDClassificationHead(in_channels, num_anchors, num_classes)
self.regression_head = SSDRegressionHead(in_channels, num_anchors)
def forward(self, x: List[Tensor]) -> Dict[str, Tensor]:
return {
'bbox_regression': self.regression_head(x),
'cls_logits': self.classification_head(x),
}
class SSDScoringHead(nn.Module):
def __init__(self, module_list: nn.ModuleList, num_columns: int):
super().__init__()
self.module_list = module_list
self.num_columns = num_columns
def _get_result_from_module_list(self, x: Tensor, idx: int) -> Tensor:
"""
This is equivalent to self.module_list[idx](x),
but torchscript doesn't support this yet
"""
num_blocks = len(self.module_list)
if idx < 0:
idx += num_blocks
i = 0
out = x
for module in self.module_list:
if i == idx:
out = module(x)
i += 1
return out
def forward(self, x: List[Tensor]) -> Tensor:
all_results = []
for i, features in enumerate(x):
results = self._get_result_from_module_list(features, i)
# Permute output from (N, A * K, H, W) to (N, HWA, K).
N, _, H, W = results.shape
results = results.view(N, -1, self.num_columns, H, W)
results = results.permute(0, 3, 4, 1, 2)
results = results.reshape(N, -1, self.num_columns) # Size=(N, HWA, K)
all_results.append(results)
return torch.cat(all_results, dim=1)
class SSDClassificationHead(SSDScoringHead):
def __init__(self, in_channels: List[int], num_anchors: List[int], num_classes: int):
cls_logits = nn.ModuleList()
for channels, anchors in zip(in_channels, num_anchors):
cls_logits.append(nn.Conv2d(channels, num_classes * anchors, kernel_size=3, padding=1))
_xavier_init(cls_logits)
super().__init__(cls_logits, num_classes)
class SSDRegressionHead(SSDScoringHead):
def __init__(self, in_channels: List[int], num_anchors: List[int]):
bbox_reg = nn.ModuleList()
for channels, anchors in zip(in_channels, num_anchors):
bbox_reg.append(nn.Conv2d(channels, 4 * anchors, kernel_size=3, padding=1))
_xavier_init(bbox_reg)
super().__init__(bbox_reg, 4)
class SSD(nn.Module):
"""
Implements SSD architecture from `"SSD: Single Shot MultiBox Detector" <https://arxiv.org/abs/1512.02325>`_.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
image, and should be in 0-1 range. Different images can have different sizes but they will be resized
to a fixed size before being passed to the backbone.
The behavior of the model changes depending if it is in training or evaluation mode.
During training, the model expects both the input tensors and the targets (a list of dictionaries),
containing:
- boxes (``FloatTensor[N, 4]``): the ground-truth boxes in ``[x1, y1, x2, y2]`` format, with
``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
- labels (Int64Tensor[N]): the class label for each ground-truth box
The model returns a Dict[Tensor] during training, containing the classification and regression
losses.
During inference, the model requires only the input tensors, and returns the post-processed
predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as
follows:
- boxes (``FloatTensor[N, 4]``): the predicted boxes in ``[x1, y1, x2, y2]`` format, with
``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
- labels (Int64Tensor[N]): the predicted labels for each image
- scores (Tensor[N]): the scores for each prediction
Args:
backbone (nn.Module): the network used to compute the features for the model.
It should contain an out_channels attribute with the list of the output channels of
each feature map. The backbone should return a single Tensor or an OrderedDict[Tensor].
anchor_generator (DefaultBoxGenerator): module that generates the default boxes for a
set of feature maps.
size (Tuple[int, int]): the width and height to which images will be rescaled before feeding them
to the backbone.
num_classes (int): number of output classes of the model (excluding the background).
image_mean (Tuple[float, float, float]): mean values used for input normalization.
They are generally the mean values of the dataset on which the backbone has been trained.
image_std (Tuple[float, float, float]): std values used for input normalization.
They are generally the std values of the dataset on which the backbone has been trained.
head (nn.Module, optional): Module run on top of the backbone features. Defaults to a module containing
a classification and regression module.
score_thresh (float): Score threshold used for postprocessing the detections.
nms_thresh (float): NMS threshold used for postprocessing the detections.
detections_per_img (int): Number of best detections to keep after NMS.
iou_thresh (float): minimum IoU between the anchor and the GT box so that they can be
considered as positive during training.
topk_candidates (int): Number of best detections to keep before NMS.
positive_fraction (float): a number between 0 and 1 which indicates the proportion of positive
proposals used during the training of the classification head. It is used to estimate the negative to
positive ratio.
"""
__annotations__ = {
'box_coder': det_utils.BoxCoder,
'proposal_matcher': det_utils.Matcher,
}
def __init__(self, backbone: nn.Module, anchor_generator: DefaultBoxGenerator,
size: Tuple[int, int], num_classes: int,
image_mean: Optional[List[float]] = None, image_std: Optional[List[float]] = None,
head: Optional[nn.Module] = None,
score_thresh: float = 0.01,
nms_thresh: float = 0.45,
detections_per_img: int = 200,
iou_thresh: float = 0.5,
topk_candidates: int = 400,
positive_fraction: float = 0.25):
super().__init__()
self.backbone = backbone
self.anchor_generator = anchor_generator
self.box_coder = det_utils.BoxCoder(weights=(10., 10., 5., 5.))
if head is None:
if hasattr(backbone, 'out_channels'):
out_channels = backbone.out_channels
else:
out_channels = det_utils.retrieve_out_channels(backbone, size)
assert len(out_channels) == len(anchor_generator.aspect_ratios)
num_anchors = self.anchor_generator.num_anchors_per_location()
head = SSDHead(out_channels, num_anchors, num_classes)
self.head = head
self.proposal_matcher = det_utils.SSDMatcher(iou_thresh)
if image_mean is None:
image_mean = [0.485, 0.456, 0.406]
if image_std is None:
image_std = [0.229, 0.224, 0.225]
self.transform = GeneralizedRCNNTransform(min(size), max(size), image_mean, image_std,
size_divisible=1, fixed_size=size)
self.score_thresh = score_thresh
self.nms_thresh = nms_thresh
self.detections_per_img = detections_per_img
self.topk_candidates = topk_candidates
self.neg_to_pos_ratio = (1.0 - positive_fraction) / positive_fraction
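# with the default positive_fraction=0.25 this yields the 3:1 negative-to-positive ratio used in the SSD paper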
# used only on torchscript mode
self._has_warned = False
@torch.jit.unused
def eager_outputs(self, losses: Dict[str, Tensor],
detections: List[Dict[str, Tensor]]) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]:
if self.training:
return losses
return detections
def compute_loss(self, targets: List[Dict[str, Tensor]], head_outputs: Dict[str, Tensor], anchors: List[Tensor],
matched_idxs: List[Tensor]) -> Dict[str, Tensor]:
bbox_regression = head_outputs['bbox_regression']
cls_logits = head_outputs['cls_logits']
# Match original targets with default boxes
num_foreground = 0
bbox_loss = []
cls_targets = []
for (targets_per_image, bbox_regression_per_image, cls_logits_per_image, anchors_per_image,
matched_idxs_per_image) in zip(targets, bbox_regression, cls_logits, anchors, matched_idxs):
# produce the matching between boxes and targets
foreground_idxs_per_image = torch.where(matched_idxs_per_image >= 0)[0]
foreground_matched_idxs_per_image = matched_idxs_per_image[foreground_idxs_per_image]
num_foreground += foreground_matched_idxs_per_image.numel()
# Calculate regression loss
matched_gt_boxes_per_image = targets_per_image['boxes'][foreground_matched_idxs_per_image]
bbox_regression_per_image = bbox_regression_per_image[foreground_idxs_per_image, :]
anchors_per_image = anchors_per_image[foreground_idxs_per_image, :]
target_regression = self.box_coder.encode_single(matched_gt_boxes_per_image, anchors_per_image)
bbox_loss.append(torch.nn.functional.smooth_l1_loss(
bbox_regression_per_image,
target_regression,
reduction='sum'
))
# Estimate ground truth for class targets
gt_classes_target = torch.zeros((cls_logits_per_image.size(0), ), dtype=targets_per_image['labels'].dtype,
device=targets_per_image['labels'].device)
gt_classes_target[foreground_idxs_per_image] = \
targets_per_image['labels'][foreground_matched_idxs_per_image]
cls_targets.append(gt_classes_target)
bbox_loss = torch.stack(bbox_loss)
cls_targets = torch.stack(cls_targets)
# Calculate classification loss
num_classes = cls_logits.size(-1)
cls_loss = F.cross_entropy(
cls_logits.view(-1, num_classes),
cls_targets.view(-1),
reduction='none'
).view(cls_targets.size())
# Hard Negative Sampling
foreground_idxs = cls_targets > 0
num_negative = self.neg_to_pos_ratio * foreground_idxs.sum(1, keepdim=True)
# num_negative[num_negative < self.neg_to_pos_ratio] = self.neg_to_pos_ratio
negative_loss = cls_loss.clone()
negative_loss[foreground_idxs] = -float('inf') # use -inf to detect positive values that crept into the sample
values, idx = negative_loss.sort(1, descending=True)
# background_idxs = torch.logical_and(idx.sort(1)[1] < num_negative, torch.isfinite(values))
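# idx.sort(1)[1] recovers each anchor's rank in the descending-loss order, so ranks
# below num_negative select the hardest negatives of every image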
background_idxs = idx.sort(1)[1] < num_negative
N = max(1, num_foreground)
return {
'bbox_regression': bbox_loss.sum() / N,
'classification': (cls_loss[foreground_idxs].sum() + cls_loss[background_idxs].sum()) / N,
}
def forward(self, images: List[Tensor],
targets: Optional[List[Dict[str, Tensor]]] = None) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]:
if self.training and targets is None:
raise ValueError("In training mode, targets should be passed")
if self.training:
assert targets is not None
for target in targets:
boxes = target["boxes"]
if isinstance(boxes, torch.Tensor):
if len(boxes.shape) != 2 or boxes.shape[-1] != 4:
raise ValueError("Expected target boxes to be a tensor"
"of shape [N, 4], got {:}.".format(
boxes.shape))
else:
raise ValueError("Expected target boxes to be of type "
"Tensor, got {:}.".format(type(boxes)))
# get the original image sizes
original_image_sizes: List[Tuple[int, int]] = []
for img in images:
val = img.shape[-2:]
assert len(val) == 2
original_image_sizes.append((val[0], val[1]))
# transform the input
images, targets = self.transform(images, targets)
# Check for degenerate boxes
if targets is not None:
for target_idx, target in enumerate(targets):
boxes = target["boxes"]
degenerate_boxes = boxes[:, 2:] <= boxes[:, :2]
if degenerate_boxes.any():
bb_idx = torch.where(degenerate_boxes.any(dim=1))[0][0]
degen_bb: List[float] = boxes[bb_idx].tolist()
raise ValueError("All bounding boxes should have positive height and width."
" Found invalid box {} for target at index {}."
.format(degen_bb, target_idx))
# get the features from the backbone
features = self.backbone(images.tensors)
if isinstance(features, torch.Tensor):
features = OrderedDict([('0', features)])
features = list(features.values())
# compute the ssd heads outputs using the features
head_outputs = self.head(features)
# create the set of anchors
anchors = self.anchor_generator(images, features)
losses = {}
detections: List[Dict[str, Tensor]] = []
if self.training:
assert targets is not None
matched_idxs = []
for anchors_per_image, targets_per_image in zip(anchors, targets):
if targets_per_image['boxes'].numel() == 0:
matched_idxs.append(torch.full((anchors_per_image.size(0),), -1, dtype=torch.int64,
device=anchors_per_image.device))
continue
match_quality_matrix = box_ops.box_iou(targets_per_image['boxes'], anchors_per_image)
matched_idxs.append(self.proposal_matcher(match_quality_matrix))
losses = self.compute_loss(targets, head_outputs, anchors, matched_idxs)
else:
detections = self.postprocess_detections(head_outputs, anchors, images.image_sizes)
detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
if torch.jit.is_scripting():
if not self._has_warned:
warnings.warn("SSD always returns a (Losses, Detections) tuple in scripting")
self._has_warned = True
return losses, detections
return self.eager_outputs(losses, detections)
def postprocess_detections(self, head_outputs: Dict[str, Tensor], image_anchors: List[Tensor],
image_shapes: List[Tuple[int, int]]) -> List[Dict[str, Tensor]]:
bbox_regression = head_outputs['bbox_regression']
pred_scores = F.softmax(head_outputs['cls_logits'], dim=-1)
num_classes = pred_scores.size(-1)
device = pred_scores.device
detections: List[Dict[str, Tensor]] = []
for boxes, scores, anchors, image_shape in zip(bbox_regression, pred_scores, image_anchors, image_shapes):
boxes = self.box_coder.decode_single(boxes, anchors)
boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
image_boxes = []
image_scores = []
image_labels = []
for label in range(1, num_classes):
score = scores[:, label]
keep_idxs = score > self.score_thresh
score = score[keep_idxs]
box = boxes[keep_idxs]
# keep only topk scoring predictions
num_topk = min(self.topk_candidates, score.size(0))
score, idxs = score.topk(num_topk)
box = box[idxs]
image_boxes.append(box)
image_scores.append(score)
image_labels.append(torch.full_like(score, fill_value=label, dtype=torch.int64, device=device))
image_boxes = torch.cat(image_boxes, dim=0)
image_scores = torch.cat(image_scores, dim=0)
image_labels = torch.cat(image_labels, dim=0)
# non-maximum suppression
keep = box_ops.batched_nms(image_boxes, image_scores, image_labels, self.nms_thresh)
keep = keep[:self.detections_per_img]
detections.append({
'boxes': image_boxes[keep],
'scores': image_scores[keep],
'labels': image_labels[keep],
})
return detections
class SSDFeatureExtractorVGG(nn.Module):
def __init__(self, backbone: nn.Module, highres: bool, rescaling: bool):
super().__init__()
_, _, maxpool3_pos, maxpool4_pos, _ = (i for i, layer in enumerate(backbone) if isinstance(layer, nn.MaxPool2d))
# Patch ceil_mode for maxpool3 to get the same WxH output sizes as the paper
backbone[maxpool3_pos].ceil_mode = True
# parameters used for L2 normalization + rescaling
self.scale_weight = nn.Parameter(torch.ones(512) * 20)
# Multiple Feature maps - page 4, Fig 2 of SSD paper
self.features = nn.Sequential(
*backbone[:maxpool4_pos] # until conv4_3
)
# SSD300 case - page 4, Fig 2 of SSD paper
extra = nn.ModuleList([
nn.Sequential(
nn.Conv2d(1024, 256, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, 512, kernel_size=3, padding=1, stride=2), # conv8_2
nn.ReLU(inplace=True),
),
nn.Sequential(
nn.Conv2d(512, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=3, padding=1, stride=2), # conv9_2
nn.ReLU(inplace=True),
),
nn.Sequential(
nn.Conv2d(256, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=3), # conv10_2
nn.ReLU(inplace=True),
),
nn.Sequential(
nn.Conv2d(256, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=3), # conv11_2
nn.ReLU(inplace=True),
)
])
if highres:
# Additional layers for the SSD512 case. See page 11, footnote 5.
extra.append(nn.Sequential(
nn.Conv2d(256, 128, kernel_size=1),
nn.ReLU(inplace=True),
nn.Conv2d(128, 256, kernel_size=4), # conv12_2
nn.ReLU(inplace=True),
))
_xavier_init(extra)
fc = nn.Sequential(
nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=False), # add modified maxpool5
nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, padding=6, dilation=6), # FC6 with atrous
nn.ReLU(inplace=True),
nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=1), # FC7
nn.ReLU(inplace=True)
)
_xavier_init(fc)
extra.insert(0, nn.Sequential(
*backbone[maxpool4_pos:-1], # until conv5_3, skip maxpool5
fc,
))
self.extra = extra
self.rescaling = rescaling
def forward(self, x: Tensor) -> Dict[str, Tensor]:
# Undo the 0-1 scaling of ToTensor. Necessary for some backbones.
if self.rescaling:
x *= 255
# L2 normalization + rescaling of the 1st block's feature map
x = self.features(x)
rescaled = self.scale_weight.view(1, -1, 1, 1) * F.normalize(x)
output = [rescaled]
# Calculate the feature maps for the remaining blocks
for block in self.extra:
x = block(x)
output.append(x)
return OrderedDict([(str(i), v) for i, v in enumerate(output)])
def _vgg_extractor(backbone_name: str, highres: bool, progress: bool, pretrained: bool, trainable_layers: int,
rescaling: bool):
if backbone_name in backbone_urls:
# Use custom backbones more appropriate for SSD
arch = backbone_name.split('_')[0]
backbone = vgg.__dict__[arch](pretrained=False, progress=progress).features
if pretrained:
state_dict = load_state_dict_from_url(backbone_urls[backbone_name], progress=progress)
backbone.load_state_dict(state_dict)
else:
# Use standard backbones from TorchVision
backbone = vgg.__dict__[backbone_name](pretrained=pretrained, progress=progress).features
# Gather the indices of maxpools. These are the locations of output blocks.
stage_indices = [i for i, b in enumerate(backbone) if isinstance(b, nn.MaxPool2d)]
num_stages = len(stage_indices)
# find the index of the layer from which we won't freeze
assert 0 <= trainable_layers <= num_stages
freeze_before = num_stages if trainable_layers == 0 else stage_indices[num_stages - trainable_layers]
for b in backbone[:freeze_before]:
for parameter in b.parameters():
parameter.requires_grad_(False)
return SSDFeatureExtractorVGG(backbone, highres, rescaling)
def ssd300_vgg16(pretrained: bool = False, progress: bool = True, num_classes: int = 91,
pretrained_backbone: bool = True, trainable_backbone_layers: Optional[int] = None, **kwargs: Any):
"""
Constructs an SSD model with a VGG16 backbone. See `SSD` for more details.
Example:
>>> model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Args:
pretrained (bool): If True, returns a model pre-trained on COCO train2017
progress (bool): If True, displays a progress bar of the download to stderr
num_classes (int): number of output classes of the model (including the background)
pretrained_backbone (bool): If True, returns a model with backbone pre-trained on Imagenet
trainable_backbone_layers (int): number of trainable (not frozen) backbone layers starting from the final block.
Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
"""
trainable_backbone_layers = _validate_trainable_layers(
pretrained or pretrained_backbone, trainable_backbone_layers, 5, 5)
if pretrained:
# no need to download the backbone if pretrained is set
pretrained_backbone = False
backbone = _vgg_extractor("vgg16_features", False, progress, pretrained_backbone, trainable_backbone_layers, True)
anchor_generator = DefaultBoxGenerator([[2], [2, 3], [2, 3], [2, 3], [2], [2]], steps=[8, 16, 32, 64, 100, 300])
model = SSD(backbone, anchor_generator, (300, 300), num_classes,
image_mean=[0.48235, 0.45882, 0.40784], image_std=[1., 1., 1.], **kwargs)
if pretrained:
weights_name = 'ssd300_vgg16_coco'
if model_urls.get(weights_name, None) is None:
raise ValueError("No checkpoint is available for model {}".format(weights_name))
state_dict = load_state_dict_from_url(model_urls[weights_name], progress=progress)
model.load_state_dict(state_dict)
return model
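For completeness, a minimal training-mode sketch of the contract described in the SSD docstring above (random stand-in data, no pre-trained weights downloaded):

```
import torch
from torchvision.models.detection import ssd300_vgg16

model = ssd300_vgg16(num_classes=2, pretrained_backbone=False)
model.train()

images = [torch.rand(3, 300, 300)]
targets = [{
    'boxes': torch.tensor([[30., 40., 120., 200.]]),  # xyxy, 0 <= x1 < x2 <= W
    'labels': torch.tensor([1]),
}]
loss_dict = model(images, targets)  # {'bbox_regression': ..., 'classification': ...}
```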
import math
import torch
from torch import nn, Tensor
from torch.nn import functional as F
import torchvision
from torch import nn, Tensor
from typing import List, Tuple, Dict, Optional
from .image_list import ImageList
......@@ -23,32 +23,40 @@ def _fake_cast_onnx(v):
return v
def _resize_image_and_masks(image, self_min_size, self_max_size, target):
# type: (Tensor, float, float, Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]
def _resize_image_and_masks(image: Tensor, self_min_size: float, self_max_size: float,
target: Optional[Dict[str, Tensor]],
fixed_size: Optional[Tuple[int, int]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]:
if torchvision._is_tracing():
im_shape = _get_shape_onnx(image)
else:
im_shape = torch.tensor(image.shape[-2:])
min_size = torch.min(im_shape).to(dtype=torch.float32)
max_size = torch.max(im_shape).to(dtype=torch.float32)
scale = torch.min(self_min_size / min_size, self_max_size / max_size)
if torchvision._is_tracing():
scale_factor = _fake_cast_onnx(scale)
size: Optional[List[int]] = None
scale_factor: Optional[float] = None
recompute_scale_factor: Optional[bool] = None
if fixed_size is not None:
size = [fixed_size[1], fixed_size[0]]
else:
scale_factor = scale.item()
min_size = torch.min(im_shape).to(dtype=torch.float32)
max_size = torch.max(im_shape).to(dtype=torch.float32)
scale = torch.min(self_min_size / min_size, self_max_size / max_size)
if torchvision._is_tracing():
scale_factor = _fake_cast_onnx(scale)
else:
scale_factor = scale.item()
recompute_scale_factor = True
image = torch.nn.functional.interpolate(
image[None], scale_factor=scale_factor, mode='bilinear', recompute_scale_factor=True,
align_corners=False)[0]
image = torch.nn.functional.interpolate(image[None], size=size, scale_factor=scale_factor, mode='bilinear',
recompute_scale_factor=recompute_scale_factor, align_corners=False)[0]
if target is None:
return image, target
if "masks" in target:
mask = target["masks"]
mask = F.interpolate(mask[:, None].float(), scale_factor=scale_factor, recompute_scale_factor=True)[:, 0].byte()
mask = torch.nn.functional.interpolate(mask[:, None].float(), size=size, scale_factor=scale_factor,
recompute_scale_factor=recompute_scale_factor)[:, 0].byte()
target["masks"] = mask
return image, target
......@@ -65,7 +73,7 @@ class GeneralizedRCNNTransform(nn.Module):
It returns a ImageList for the inputs, and a List[Dict[Tensor]] for the targets
"""
def __init__(self, min_size, max_size, image_mean, image_std):
def __init__(self, min_size, max_size, image_mean, image_std, size_divisible=32, fixed_size=None):
super(GeneralizedRCNNTransform, self).__init__()
if not isinstance(min_size, (list, tuple)):
min_size = (min_size,)
......@@ -73,6 +81,8 @@ class GeneralizedRCNNTransform(nn.Module):
self.max_size = max_size
self.image_mean = image_mean
self.image_std = image_std
self.size_divisible = size_divisible
self.fixed_size = fixed_size
def forward(self,
images, # type: List[Tensor]
......@@ -106,7 +116,7 @@ class GeneralizedRCNNTransform(nn.Module):
targets[i] = target_index
image_sizes = [img.shape[-2:] for img in images]
images = self.batch_images(images)
images = self.batch_images(images, size_divisible=self.size_divisible)
image_sizes_list: List[Tuple[int, int]] = []
for image_size in image_sizes:
assert len(image_size) == 2
......@@ -144,7 +154,7 @@ class GeneralizedRCNNTransform(nn.Module):
else:
# FIXME assume for now that testing uses the largest scale
size = float(self.min_size[-1])
image, target = _resize_image_and_masks(image, size, float(self.max_size), target)
image, target = _resize_image_and_masks(image, size, float(self.max_size), target, self.fixed_size)
if target is None:
return image, target
......