Unverified Commit 0359e2e1 authored by Maria Khalusova, committed by GitHub

Updates to computer vision section of the Preprocess doc (#21181)



* Extended the CV preprocessing section with more details and refactored the example

* added padding to the CV section, though it is a special case

* Added a tip about post processing methods

* make style

* link update

* Apply suggestions from review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* review feedback
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 5761ceb3
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
@@ -17,8 +17,8 @@ specific language governing permissions and limitations under the License.
Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:
* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Image inputs, use an [ImageProcessor](./main_classes/image) to convert images into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.
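Each of these preprocessing classes can be loaded from a model checkpoint with its corresponding `Auto*` class. A minimal sketch (the checkpoints below are only illustrative examples):

```py
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoImageProcessor, AutoProcessor

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # text
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")  # audio
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")  # images
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")  # multimodal
```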
<Tip>
@@ -320,7 +320,21 @@ The sample lengths are now the same and match the specified maximum length. You
## Computer vision
For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model. The image processor is designed to preprocess images, and convert them into tensors.
For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model.
Image preprocessing consists of several steps that convert images into the input expected by the model. These steps
include but are not limited to resizing, normalizing, color channel correction, and converting images to tensors.
<Tip>
Image preprocessing often follows some form of image augmentation. Both image preprocessing and image augmentation
transform image data, but they serve different purposes:
* Image augmentation alters images in a way that can help prevent overfitting and increase the robustness of the model. You can get creative in how you augment your data - adjust brightness and colors, crop, rotate, resize, zoom, etc. However, be mindful not to change the meaning of the images with your augmentations.
* Image preprocessing guarantees that the images match the model’s expected input format. When fine-tuning a computer vision model, images must be preprocessed exactly as when the model was initially trained.
You can use any library you like for image augmentation. For image preprocessing, use the `ImageProcessor` associated with the model.
</Tip>
Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:
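For instance, a small slice of the training split is enough to experiment with (the `train[:100]` slice here is just an example):

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101", split="train[:100]")
```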
@@ -354,30 +368,46 @@ Load the image processor with [`AutoImageProcessor.from_pretrained`]:
```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
```
For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).
First, let's add some image augmentation. You can use any library you prefer, but in this tutorial, we'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).
1. Normalize the image with the image processor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
1. Here we use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain together a couple of
transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html).
Note that for resizing, we can get the image size requirements from the `image_processor`. For some models, an exact height and
width are expected, for others only the `shortest_edge` is defined.
```py
>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose
>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> size = (
... image_processor.size["shortest_edge"]
... if "shortest_edge" in image_processor.size
... else (image_processor.size["height"], image_processor.size["width"])
... )
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])
```
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the image processor. Create a function that generates `pixel_values` from the transforms:
2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)
as its input. `ImageProcessor` can take care of normalizing the images, and generating appropriate tensors.
Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`:
```py
>>> def transforms(examples):
...     examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
...     images = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"]
...     return examples
```
<Tip>
In the example above we set `do_resize=False` because we have already resized the images in the image augmentation transformation,
and leveraged the `size` attribute from the appropriate `image_processor`. If you do not resize images during image augmentation,
leave this parameter out. By default, `ImageProcessor` will handle the resizing.
If you wish to normalize images as a part of the augmentation transformation, use the `image_processor.image_mean`
and `image_processor.image_std` values.
</Tip>
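A minimal sketch of that variant, reusing the `size` computed earlier and normalizing inside the torchvision pipeline (if you go this route, also skip normalization in the image processor, e.g. by passing `do_normalize=False`):

```py
>>> from torchvision.transforms import ColorJitter, Compose, Normalize, RandomResizedCrop, ToTensor

>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
>>> # ToTensor scales pixel values to [0, 1], then Normalize applies the model's statistics;
>>> # when normalizing here, pass do_normalize=False (and do_rescale=False) to the image processor
>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
```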
3. Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly:
```py
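>>> dataset.set_transform(transforms)
```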
@@ -404,6 +434,32 @@ Here is what the image looks like after the transforms are applied. The image ha
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png"/>
</div>
<Tip>
For tasks like object detection, semantic segmentation, instance segmentation, and panoptic segmentation, `ImageProcessor`
offers post-processing methods. These methods convert the model's raw outputs into meaningful predictions such as bounding boxes
or segmentation maps.
</Tip>
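For example, with an object detection model such as DETR, its image processor can turn raw outputs into scored bounding boxes roughly like this (a sketch; `outputs` is assumed to come from a forward pass of the detection model, and `image` is assumed to be the original PIL image):

```py
>>> import torch
>>> from transformers import AutoImageProcessor

>>> detr_image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
>>> # (height, width) of the original image so the boxes are rescaled to its coordinates
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = detr_image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)
>>> results[0]["scores"], results[0]["labels"], results[0]["boxes"]
```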
### Pad
In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training
time. This may cause images to be different sizes in a batch. You can use [`DetrImageProcessor.pad_and_create_pixel_mask`]
from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images together.
```py
>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad_and_create_pixel_mask(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch
```
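The custom `collate_fn` can then be plugged into a PyTorch `DataLoader` so each batch is padded and masked on the fly (a sketch; `dataset` is assumed to already yield `pixel_values` and `labels` per example):

```py
>>> from torch.utils.data import DataLoader

>>> dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
```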
## Multimodal
For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as a tokenizer and a feature extractor.
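For example, an automatic speech recognition model such as Wav2Vec2 needs both a feature extractor (for the audio) and a tokenizer (for the transcriptions); a single processor, loaded here from an example checkpoint, wraps both:

```py
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```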