In the resize() function in image_transforms.py, line 267 (#20728):
`image = to_channel_dimension_format(image, ChannelDimension.LAST)`
is redundant, because the same conversion is applied again inside to_pil_image().
In rare cases, this redundant call makes training fail.
The problem can be reproduced with the following code snippet:
```
import torch
from transformers.models.clip import CLIPFeatureExtractor

vision_processor = CLIPFeatureExtractor.from_pretrained('openai/clip-vit-large-patch14')
images = [
    torch.rand(size=(3, 2, 10), dtype=torch.float),
    torch.rand(size=(3, 10, 1), dtype=torch.float),
    torch.rand(size=(3, 1, 10), dtype=torch.float),
]
for image in images:
    processed_image = vision_processor(images=image, return_tensors="pt")['pixel_values']
    print(processed_image.shape)
    assert processed_image.shape == torch.Size([1, 3, 224, 224])
```
The last image has a height of 1 pixel.
The second call to to_channel_dimension_format() transposes the image again, and the
height dimension is then wrongly treated as the channels dimension.
Because of this, the subsequent normalize() step raises an exception.
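
To illustrate the mechanism, here is a minimal sketch (not the actual transformers implementation) of a channel-dimension inference heuristic of the kind used by to_channel_dimension_format(), showing why applying the conversion twice misplaces a height-1 axis. The helper names infer_channels_first and to_channels_last are hypothetical:

```
import numpy as np

def infer_channels_first(image: np.ndarray) -> bool:
    # Heuristic: an axis of size 1 or 3 is assumed to be the channels axis.
    # A height of 1 is therefore indistinguishable from a channels axis.
    return image.shape[0] in (1, 3)

def to_channels_last(image: np.ndarray) -> np.ndarray:
    # Transpose only if the image is inferred to be channels-first.
    if infer_channels_first(image):
        return image.transpose(1, 2, 0)
    return image

image = np.random.rand(3, 1, 10)   # channels-first, height 1
once = to_channels_last(image)     # (1, 10, 3) -- correct conversion
twice = to_channels_last(once)     # (10, 3, 1) -- height 1 misread as channels
print(once.shape, twice.shape)
```

After the second (redundant) conversion, the array no longer has three channels in the last axis, which is what trips up the downstream normalization.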