Unverified Commit b4997382 authored by NielsRogge, committed by GitHub

Fix MaskformerFeatureExtractor (#20100)



* Fix bug

* Add another fix

* Add print statement

* Apply fix

* Fix feature extractor

* Fix feature extractor

* Add print statements

* Add print statements

* Remove print statements

* Add instance segmentation integration test

* Add integration test for semantic segmentation

* Add draft for panoptic segmentation integration test

* Fix integration test for panoptic segmentation

* Remove slow annotator
Co-authored-by: default avatarNiels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
parent 6e3b0144
@@ -225,7 +225,8 @@ class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
    ImageNet std.
ignore_index (`int`, *optional*):
    Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels
    denoted with 0 (background) will be replaced with `ignore_index`. The ignore index of the loss function of
    the model should then correspond to this ignore index.
reduce_labels (`bool`, *optional*, defaults to `False`):
    Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0
    is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k).
@@ -327,12 +328,24 @@ class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
padded up to the largest image in a batch, and a pixel mask is created that indicates which pixels are
real/which are padding.

Segmentation maps can be instance, semantic or panoptic segmentation maps. In case of instance and panoptic
segmentation, one needs to provide `instance_id_to_semantic_id`, which is a mapping from instance/segment ids
to semantic category ids.

MaskFormer addresses all 3 forms of segmentation (instance, semantic and panoptic) in the same way, namely by
converting the segmentation maps to a set of binary masks with corresponding classes.

In case of instance segmentation, the segmentation maps contain the instance ids, and
`instance_id_to_semantic_id` maps instance ids to their corresponding semantic category.

In case of semantic segmentation, the segmentation maps contain the semantic category ids. For example,
assuming `segmentation_maps = [[2,6,7,9]]`, the output will contain `mask_labels =
[[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]]` (four binary masks) and `class_labels = [2,6,7,9]`, the labels for
each mask.

In case of panoptic segmentation, the segmentation maps contain the segment ids, and
`instance_id_to_semantic_id` maps segment ids to their corresponding semantic category.

<Tip warning={true}>

NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass
@@ -347,9 +360,9 @@ class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
    number of channels, H and W are image height and width.
segmentation_maps (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`, *optional*):
    The corresponding segmentation maps with the pixel-wise instance id, semantic id or segment id
    annotations. Assumed to be semantic segmentation maps if no `instance_id_to_semantic_id` map is
    provided.
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
    Whether or not to pad images up to the largest image in a batch and create a pixel mask.
@@ -360,10 +373,11 @@ class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
    - 0 for pixels that are padding (i.e. **masked**).
instance_id_to_semantic_id (`List[Dict[int, int]]` or `Dict[int, int]`, *optional*):
    A mapping between instance/segment ids and semantic category ids. If passed, `segmentation_maps` is
    treated as an instance or panoptic segmentation map where each pixel represents an instance or segment
    id. Can be provided as a single dictionary with a global / dataset-level mapping or as a list of
    dictionaries (one per image), to map instance ids in each image separately. Note that the mapping is
    expected to use the ids as they appear before any label reduction (i.e. before `reduce_labels` is applied).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
    If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
@@ -478,57 +492,81 @@ class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
    segmentation_map: "np.ndarray",
    instance_id_to_semantic_id: Optional[Dict[int, int]] = None,
):
    # Reduce labels, if requested
    if self.reduce_labels:
        if self.ignore_index is None:
            raise ValueError("`ignore_index` must be set when `reduce_labels` is `True`.")
        segmentation_map[segmentation_map == 0] = self.ignore_index
        segmentation_map -= 1
        segmentation_map[segmentation_map == self.ignore_index - 1] = self.ignore_index

    # Get unique ids (instance, class ids or segment ids based on input)
    all_labels = np.unique(segmentation_map)

    # Remove ignored label
    if self.ignore_index is not None:
        all_labels = all_labels[all_labels != self.ignore_index]

    # Generate a binary mask for each object instance
    binary_masks = [(segmentation_map == i) for i in all_labels]
    binary_masks = np.stack(binary_masks, axis=0)  # (num_labels, height, width)

    # Convert instance/segment ids to class ids
    if instance_id_to_semantic_id is not None:
        labels = np.zeros(all_labels.shape[0])
        for label in all_labels:
            class_id = instance_id_to_semantic_id[label + 1 if self.reduce_labels else label]
            labels[all_labels == label] = class_id - 1 if self.reduce_labels else class_id
    else:
        labels = all_labels

    return binary_masks.astype(np.float32), labels.astype(np.int64)
def encode_inputs(
    self,
    pixel_values_list: Union[List["np.ndarray"], List["torch.Tensor"]],
    segmentation_maps: ImageInput = None,
    pad_and_return_pixel_mask: bool = True,
    instance_id_to_semantic_id: Optional[Union[List[Dict[int, int]], Dict[int, int]]] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
):
    """
Encode a list of pixel values and an optional list of corresponding segmentation maps.

This method is useful if you have resized and normalized your images and segmentation maps yourself, using a
library like [torchvision](https://pytorch.org/vision/stable/transforms.html) or
[albumentations](https://albumentations.ai/).

Images are padded up to the largest image in a batch, and a corresponding `pixel_mask` is created.

Segmentation maps can be instance, semantic or panoptic segmentation maps. In case of instance and panoptic
segmentation, one needs to provide `instance_id_to_semantic_id`, which is a mapping from instance/segment ids
to semantic category ids.

MaskFormer addresses all 3 forms of segmentation (instance, semantic and panoptic) in the same way, namely by
converting the segmentation maps to a set of binary masks with corresponding classes.

In case of instance segmentation, the segmentation maps contain the instance ids, and
`instance_id_to_semantic_id` maps instance ids to their corresponding semantic category.

In case of semantic segmentation, the segmentation maps contain the semantic category ids. For example,
assuming `segmentation_maps = [[2,6,7,9]]`, the output will contain `mask_labels =
[[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]]` (four binary masks) and `class_labels = [2,6,7,9]`, the labels for
each mask.

In case of panoptic segmentation, the segmentation maps contain the segment ids, and
`instance_id_to_semantic_id` maps segment ids to their corresponding semantic category.
Args:
pixel_values_list (`List[np.ndarray]` or `List[torch.Tensor]`):
    List of images (pixel values) to be padded. Each image should be a tensor of shape `(channels, height,
    width)`.
segmentation_maps (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`, *optional*):
    The corresponding segmentation maps with the pixel-wise instance id, semantic id or segment id
    annotations. Assumed to be semantic segmentation maps if no `instance_id_to_semantic_id` map is
    provided.
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
    Whether or not to pad images up to the largest image in a batch and create a pixel mask.

@@ -539,10 +577,11 @@ class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
    - 0 for pixels that are padding (i.e. **masked**).
instance_id_to_semantic_id (`List[Dict[int, int]]` or `Dict[int, int]`, *optional*):
    A mapping between instance/segment ids and semantic category ids. If passed, `segmentation_maps` is
    treated as an instance or panoptic segmentation map where each pixel represents an instance or segment
    id. Can be provided as a single dictionary with a global / dataset-level mapping or as a list of
    dictionaries (one per image), to map instance ids in each image separately. Note that the mapping is
    expected to use the ids as they appear before any label reduction (i.e. before `reduce_labels` is applied).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
    If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
...
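To make the conversion described in these docstrings concrete, here is a minimal NumPy sketch (purely illustrative, with made-up values) of how a semantic segmentation map with four category ids becomes four binary masks plus their class labels:

import numpy as np

# Semantic segmentation map from the docstring example: a single row with four class ids
segmentation_map = np.array([[2, 6, 7, 9]])

class_labels = np.unique(segmentation_map)  # array([2, 6, 7, 9])
mask_labels = np.stack([segmentation_map == i for i in class_labels]).astype(np.float32)

print(mask_labels.shape)  # (4, 1, 4): four binary masks, one per class id
print(class_labels)       # [2 6 7 9]

For instance or panoptic maps, each id would additionally be looked up in `instance_id_to_semantic_id` (keyed on the ids as they appear before any label reduction) to obtain the class labels.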
@@ -17,7 +17,9 @@
import unittest

import numpy as np
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_torch_available, is_vision_available
@@ -345,6 +347,173 @@ class MaskFormerFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest
common(is_instance_map=False, segmentation_type="pil")
common(is_instance_map=True, segmentation_type="pil")
def test_integration_instance_segmentation(self):
# load 2 images and corresponding annotations from the hub
repo_id = "nielsr/image-segmentation-toy-data"
image1 = Image.open(
hf_hub_download(repo_id=repo_id, filename="instance_segmentation_image_1.png", repo_type="dataset")
)
image2 = Image.open(
hf_hub_download(repo_id=repo_id, filename="instance_segmentation_image_2.png", repo_type="dataset")
)
annotation1 = Image.open(
hf_hub_download(repo_id=repo_id, filename="instance_segmentation_annotation_1.png", repo_type="dataset")
)
annotation2 = Image.open(
hf_hub_download(repo_id=repo_id, filename="instance_segmentation_annotation_2.png", repo_type="dataset")
)
# get instance segmentations and instance-to-segmentation mappings
def get_instance_segmentation_and_mapping(annotation):
instance_seg = np.array(annotation)[:, :, 1]
class_id_map = np.array(annotation)[:, :, 0]
class_labels = np.unique(class_id_map)
# create mapping between instance IDs and semantic category IDs
inst2class = {}
for label in class_labels:
instance_ids = np.unique(instance_seg[class_id_map == label])
inst2class.update({i: label for i in instance_ids})
return instance_seg, inst2class
instance_seg1, inst2class1 = get_instance_segmentation_and_mapping(annotation1)
instance_seg2, inst2class2 = get_instance_segmentation_and_mapping(annotation2)
# create a feature extractor
feature_extractor = MaskFormerFeatureExtractor(reduce_labels=True, ignore_index=255, size=(512, 512))
# prepare the images and annotations
inputs = feature_extractor(
[image1, image2],
[instance_seg1, instance_seg2],
instance_id_to_semantic_id=[inst2class1, inst2class2],
return_tensors="pt",
)
# verify the pixel values and pixel mask
self.assertEqual(inputs["pixel_values"].shape, (2, 3, 512, 512))
self.assertEqual(inputs["pixel_mask"].shape, (2, 512, 512))
# verify the class labels
self.assertEqual(len(inputs["class_labels"]), 2)
self.assertTrue(torch.allclose(inputs["class_labels"][0], torch.tensor([30, 55])))
self.assertTrue(torch.allclose(inputs["class_labels"][1], torch.tensor([4, 4, 23, 55])))
# verify the mask labels
self.assertEqual(len(inputs["mask_labels"]), 2)
self.assertEqual(inputs["mask_labels"][0].shape, (2, 512, 512))
self.assertEqual(inputs["mask_labels"][1].shape, (4, 512, 512))
self.assertEqual(inputs["mask_labels"][0].sum().item(), 41527.0)
self.assertEqual(inputs["mask_labels"][1].sum().item(), 26259.0)
def test_integration_semantic_segmentation(self):
# load 2 images and corresponding semantic annotations from the hub
repo_id = "nielsr/image-segmentation-toy-data"
image1 = Image.open(
hf_hub_download(repo_id=repo_id, filename="semantic_segmentation_image_1.png", repo_type="dataset")
)
image2 = Image.open(
hf_hub_download(repo_id=repo_id, filename="semantic_segmentation_image_2.png", repo_type="dataset")
)
annotation1 = Image.open(
hf_hub_download(repo_id=repo_id, filename="semantic_segmentation_annotation_1.png", repo_type="dataset")
)
annotation2 = Image.open(
hf_hub_download(repo_id=repo_id, filename="semantic_segmentation_annotation_2.png", repo_type="dataset")
)
# create a feature extractor
feature_extractor = MaskFormerFeatureExtractor(reduce_labels=True, ignore_index=255, size=(512, 512))
# prepare the images and annotations
inputs = feature_extractor(
[image1, image2],
[annotation1, annotation2],
return_tensors="pt",
)
# verify the pixel values and pixel mask
self.assertEqual(inputs["pixel_values"].shape, (2, 3, 512, 512))
self.assertEqual(inputs["pixel_mask"].shape, (2, 512, 512))
# verify the class labels
self.assertEqual(len(inputs["class_labels"]), 2)
self.assertTrue(torch.allclose(inputs["class_labels"][0], torch.tensor([2, 4, 60])))
self.assertTrue(torch.allclose(inputs["class_labels"][1], torch.tensor([0, 3, 7, 8, 15, 28, 30, 143])))
# verify the mask labels
self.assertEqual(len(inputs["mask_labels"]), 2)
self.assertEqual(inputs["mask_labels"][0].shape, (3, 512, 512))
self.assertEqual(inputs["mask_labels"][1].shape, (8, 512, 512))
self.assertEqual(inputs["mask_labels"][0].sum().item(), 170200.0)
self.assertEqual(inputs["mask_labels"][1].sum().item(), 257036.0)
def test_integration_panoptic_segmentation(self):
# load 2 images and corresponding panoptic annotations from the hub
dataset = load_dataset("nielsr/ade20k-panoptic-demo")
image1 = dataset["train"][0]["image"]
image2 = dataset["train"][1]["image"]
segments_info1 = dataset["train"][0]["segments_info"]
segments_info2 = dataset["train"][1]["segments_info"]
annotation1 = dataset["train"][0]["label"]
annotation2 = dataset["train"][1]["label"]
def rgb_to_id(color):
if isinstance(color, np.ndarray) and len(color.shape) == 3:
if color.dtype == np.uint8:
color = color.astype(np.int32)
return color[:, :, 0] + 256 * color[:, :, 1] + 256 * 256 * color[:, :, 2]
return int(color[0] + 256 * color[1] + 256 * 256 * color[2])
def create_panoptic_map(annotation, segments_info):
annotation = np.array(annotation)
# convert RGB to segment IDs per pixel
# 0 is the "ignore" label, for which we don't need to make binary masks
panoptic_map = rgb_to_id(annotation)
# create mapping between segment IDs and semantic classes
inst2class = {segment["id"]: segment["category_id"] for segment in segments_info}
return panoptic_map, inst2class
panoptic_map1, inst2class1 = create_panoptic_map(annotation1, segments_info1)
panoptic_map2, inst2class2 = create_panoptic_map(annotation2, segments_info2)
# create a feature extractor
feature_extractor = MaskFormerFeatureExtractor(ignore_index=0, do_resize=False)
# prepare the images and annotations
pixel_values_list = [np.moveaxis(np.array(image1), -1, 0), np.moveaxis(np.array(image2), -1, 0)]
inputs = feature_extractor.encode_inputs(
pixel_values_list,
[panoptic_map1, panoptic_map2],
instance_id_to_semantic_id=[inst2class1, inst2class2],
return_tensors="pt",
)
# verify the pixel values and pixel mask
self.assertEqual(inputs["pixel_values"].shape, (2, 3, 512, 711))
self.assertEqual(inputs["pixel_mask"].shape, (2, 512, 711))
# verify the class labels
self.assertEqual(len(inputs["class_labels"]), 2)
# fmt: off
expected_class_labels = torch.tensor([4, 17, 32, 42, 42, 42, 42, 42, 42, 42, 32, 12, 12, 12, 12, 12, 42, 42, 12, 12, 12, 42, 12, 12, 12, 12, 12, 3, 12, 12, 12, 12, 42, 42, 42, 12, 42, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 5, 12, 12, 12, 12, 12, 12, 12, 0, 43, 43, 43, 96, 43, 104, 43, 31, 125, 31, 125, 138, 87, 125, 149, 138, 125, 87, 87]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][0], expected_class_labels))
# fmt: off
expected_class_labels = torch.tensor([19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 67, 82, 19, 19, 17, 19, 19, 19, 19, 19, 19, 19, 19, 19, 12, 12, 42, 12, 12, 12, 12, 3, 14, 12, 12, 12, 12, 12, 12, 12, 12, 14, 5, 12, 12, 0, 115, 43, 43, 115, 43, 43, 43, 8, 8, 8, 138, 138, 125, 143]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][1], expected_class_labels))
# verify the mask labels
self.assertEqual(len(inputs["mask_labels"]), 2)
self.assertEqual(inputs["mask_labels"][0].shape, (79, 512, 711))
self.assertEqual(inputs["mask_labels"][1].shape, (61, 512, 711))
self.assertEqual(inputs["mask_labels"][0].sum().item(), 315193.0)
self.assertEqual(inputs["mask_labels"][1].sum().item(), 350747.0)
def test_binary_mask_to_rle(self):
fake_binary_mask = np.zeros((20, 50))
fake_binary_mask[0, 20:] = 1
...
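As a rough end-to-end usage sketch mirroring the semantic segmentation test above (the image and annotation below are random placeholders rather than the dataset files used in the test, and the printed shapes assume the same (512, 512) target size):

import numpy as np
from PIL import Image
from transformers import MaskFormerFeatureExtractor

# Placeholder image and semantic annotation (random data, for illustration only)
image = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
semantic_map = Image.fromarray(np.random.randint(0, 151, (480, 640), dtype=np.uint8))

feature_extractor = MaskFormerFeatureExtractor(reduce_labels=True, ignore_index=255, size=(512, 512))
inputs = feature_extractor([image], [semantic_map], return_tensors="pt")

print(inputs["pixel_values"].shape)    # torch.Size([1, 3, 512, 512])
print(inputs["pixel_mask"].shape)      # torch.Size([1, 512, 512])
print(inputs["class_labels"][0])       # semantic ids present in the map, after label reduction
print(inputs["mask_labels"][0].shape)  # (number of classes present, 512, 512)

With `reduce_labels=True`, background pixels (value 0) are mapped to `ignore_index` and the remaining label values are shifted down by one before the binary masks and class labels are built, as shown in the `convert_segmentation_map_to_binary_masks` diff above.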