Unverified commit 4eb918e6 authored by amyeroberts, committed by GitHub

AutoImageProcessor (#20111)

* AutoImageProcessor skeleton

* Update references

* Add mapping in init

* Add model image processors to __init__ for importing

* Add AutoImageProcessor tests

* Fix up

* Image Processor documentation

* Remove pdb

* Update docs/source/en/model_doc/mobilevit.mdx

* Update docs

* Don't add whitespace on json files

* Remove fixtures

* Move checking model config down

* Fix up

* Add check for image processor

* Remove FeatureExtractorMixin in docstrings

* Rename model_tmpfile to config_tmpfile

* Don't make None if not in image processor map
parent c08a1e26
@@ -38,12 +38,12 @@ Tips:
This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one.
- ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes
additional embedding layers for the language modality.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vilt_architecture.jpg"
alt="drawing" width="600"/>
<small> ViLT architecture. Taken from the <a href="https://arxiv.org/abs/2102.03334">original paper</a>. </small>
@@ -63,6 +63,11 @@ Tips:
[[autodoc]] ViltFeatureExtractor
- __call__
## ViltImageProcessor
[[autodoc]] ViltImageProcessor
- preprocess
## ViltProcessor
[[autodoc]] ViltProcessor
...
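The tips above describe ViLT's resizing rule (shorter edge scaled to 384, longer edge capped at 640, aspect ratio preserved) and the `pixel_mask` that marks real versus padded pixels when batching. A minimal plain-Python sketch of that rule follows; the helper names are hypothetical and this is an illustration, not the actual `ViltImageProcessor` implementation:

```python
def vilt_target_size(height, width, shorter=384, longer=640):
    """Scale the shorter edge to `shorter`; if the longer edge would then
    exceed `longer`, shrink further so it fits, preserving aspect ratio."""
    scale = shorter / min(height, width)
    if max(height, width) * scale > longer:
        scale = longer / max(height, width)
    return round(height * scale), round(width * scale)


def batch_pixel_masks(sizes):
    """Given per-image (height, width) pairs, pad to the batch maximum and
    build a pixel_mask per image: 1 for real pixels, 0 for padding."""
    max_h = max(h for h, _ in sizes)
    max_w = max(w for _, w in sizes)
    return [
        [[1 if r < h and c < w else 0 for c in range(max_w)] for r in range(max_h)]
        for h, w in sizes
    ]
```

For example, a 480x640 image maps to 384x512, while a 400x1000 image is capped by the longer-edge limit and maps to 256x640.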
@@ -57,7 +57,7 @@ Tips:
improvement of 2% to training from scratch, but still 4% behind supervised pre-training.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg"
alt="drawing" width="600"/>
<small> ViT architecture. Taken from the <a href="https://arxiv.org/abs/2010.11929">original paper.</a> </small>
@@ -96,6 +96,12 @@ go to him!
[[autodoc]] ViTFeatureExtractor
- __call__
## ViTImageProcessor
[[autodoc]] ViTImageProcessor
- preprocess
## ViTModel
[[autodoc]] ViTModel
...
@@ -17,7 +17,8 @@ specific language governing permissions and limitations under the License.
Before you can train a model on a dataset, it needs to be preprocessed into the expected model input format. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors. 🤗 Transformers provides a set of preprocessing classes to help prepare your data for the model. In this tutorial, you'll learn that for:
* Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
* Image inputs, use an [ImageProcessor](./main_classes/image) to convert images into tensors.
* Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.
<Tip>
...
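The core of what an image processor does to pixel values is a rescale-then-normalize step. A hedged sketch of that arithmetic for a single value (the constants are common ImageNet-style defaults used for illustration; each model's processor ships its own):

```python
def normalize_pixel(value, mean, std, rescale_factor=1 / 255):
    """Typical image-processor pipeline for one pixel value: rescale raw
    0-255 intensities into [0, 1], then standardize with mean/std."""
    return (value * rescale_factor - mean) / std


# Illustrative per-channel constants (exact values vary per model checkpoint).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
```

A raw intensity of 255 with mean 0.5 and std 0.5 normalizes to 1.0, and 0 normalizes to -1.0, giving the roughly zero-centered inputs vision models expect.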
@@ -125,11 +125,13 @@ _import_structure = {
"ALL_PRETRAINED_CONFIG_ARCHIVE_MAP",
"CONFIG_MAPPING",
"FEATURE_EXTRACTOR_MAPPING",
"IMAGE_PROCESSOR_MAPPING",
"MODEL_NAMES_MAPPING",
"PROCESSOR_MAPPING",
"TOKENIZER_MAPPING",
"AutoConfig",
"AutoFeatureExtractor",
"AutoImageProcessor",
"AutoProcessor",
"AutoTokenizer",
],
@@ -251,6 +253,7 @@ _import_structure = {
"LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP",
"LayoutLMv2Config",
"LayoutLMv2FeatureExtractor",
"LayoutLMv2ImageProcessor",
"LayoutLMv2Processor",
"LayoutLMv2Tokenizer",
],
@@ -258,6 +261,7 @@ _import_structure = {
"LAYOUTLMV3_PRETRAINED_CONFIG_ARCHIVE_MAP",
"LayoutLMv3Config",
"LayoutLMv3FeatureExtractor",
"LayoutLMv3ImageProcessor",
"LayoutLMv3Processor",
"LayoutLMv3Tokenizer",
],
@@ -375,7 +379,13 @@ _import_structure = {
],
"models.van": ["VAN_PRETRAINED_CONFIG_ARCHIVE_MAP", "VanConfig"],
"models.videomae": ["VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "VideoMAEConfig"],
"models.vilt": [
"VILT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"ViltConfig",
"ViltFeatureExtractor",
"ViltImageProcessor",
"ViltProcessor",
],
"models.vision_encoder_decoder": ["VisionEncoderDecoderConfig"],
"models.vision_text_dual_encoder": ["VisionTextDualEncoderConfig", "VisionTextDualEncoderProcessor"],
"models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
@@ -689,35 +699,34 @@ except OptionalDependencyNotAvailable:
name for name in dir(dummy_vision_objects) if not name.startswith("_")
]
else:
_import_structure["image_processing_utils"] = ["ImageProcessingMixin"]
_import_structure["image_transforms"] = ["rescale", "resize", "to_pil_image"]
_import_structure["image_utils"] = ["ImageFeatureExtractionMixin"]
_import_structure["models.beit"].extend(["BeitFeatureExtractor", "BeitImageProcessor"])
_import_structure["models.clip"].extend(["CLIPFeatureExtractor", "CLIPImageProcessor"])
_import_structure["models.convnext"].extend(["ConvNextFeatureExtractor", "ConvNextImageProcessor"])
_import_structure["models.deformable_detr"].append("DeformableDetrFeatureExtractor")
_import_structure["models.deit"].extend(["DeiTFeatureExtractor", "DeiTImageProcessor"])
_import_structure["models.detr"].append("DetrFeatureExtractor")
_import_structure["models.conditional_detr"].append("ConditionalDetrFeatureExtractor")
_import_structure["models.donut"].append("DonutFeatureExtractor")
_import_structure["models.dpt"].extend(["DPTFeatureExtractor", "DPTImageProcessor"])
_import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaProcessor", "FlavaImageProcessor"])
_import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
_import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
_import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"])
_import_structure["models.maskformer"].append("MaskFormerFeatureExtractor")
_import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
_import_structure["models.owlvit"].append("OwlViTFeatureExtractor")
_import_structure["models.perceiver"].extend(["PerceiverFeatureExtractor", "PerceiverImageProcessor"])
_import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
_import_structure["models.segformer"].extend(["SegformerFeatureExtractor", "SegformerImageProcessor"])
_import_structure["models.videomae"].extend(["VideoMAEFeatureExtractor", "VideoMAEImageProcessor"])
_import_structure["models.vilt"].extend(["ViltFeatureExtractor", "ViltImageProcessor", "ViltProcessor"])
_import_structure["models.vit"].extend(["ViTFeatureExtractor", "ViTImageProcessor"])
_import_structure["models.yolos"].extend(["YolosFeatureExtractor"])
# Timm-backed objects
try:
@@ -3220,11 +3229,13 @@ if TYPE_CHECKING:
ALL_PRETRAINED_CONFIG_ARCHIVE_MAP,
CONFIG_MAPPING,
FEATURE_EXTRACTOR_MAPPING,
IMAGE_PROCESSOR_MAPPING,
MODEL_NAMES_MAPPING,
PROCESSOR_MAPPING,
TOKENIZER_MAPPING,
AutoConfig,
AutoFeatureExtractor,
AutoImageProcessor,
AutoProcessor,
AutoTokenizer,
)
@@ -3337,6 +3348,7 @@ if TYPE_CHECKING:
LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP,
LayoutLMv2Config,
LayoutLMv2FeatureExtractor,
LayoutLMv2ImageProcessor,
LayoutLMv2Processor,
LayoutLMv2Tokenizer,
)
@@ -3344,6 +3356,7 @@ if TYPE_CHECKING:
LAYOUTLMV3_PRETRAINED_CONFIG_ARCHIVE_MAP,
LayoutLMv3Config,
LayoutLMv3FeatureExtractor,
LayoutLMv3ImageProcessor,
LayoutLMv3Processor,
LayoutLMv3Tokenizer,
)
@@ -3441,7 +3454,13 @@ if TYPE_CHECKING:
from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig
from .models.van import VAN_PRETRAINED_CONFIG_ARCHIVE_MAP, VanConfig
from .models.videomae import VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP, VideoMAEConfig
from .models.vilt import (
VILT_PRETRAINED_CONFIG_ARCHIVE_MAP,
ViltConfig,
ViltFeatureExtractor,
ViltImageProcessor,
ViltProcessor,
)
from .models.vision_encoder_decoder import VisionEncoderDecoderConfig
from .models.vision_text_dual_encoder import VisionTextDualEncoderConfig, VisionTextDualEncoderProcessor
from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
@@ -3716,33 +3735,33 @@ if TYPE_CHECKING:
except OptionalDependencyNotAvailable:
from .utils.dummy_vision_objects import *
else:
from .image_processing_utils import ImageProcessingMixin
from .image_transforms import rescale, resize, to_pil_image
from .image_utils import ImageFeatureExtractionMixin
from .models.beit import BeitFeatureExtractor, BeitImageProcessor
from .models.clip import CLIPFeatureExtractor, CLIPImageProcessor
from .models.conditional_detr import ConditionalDetrFeatureExtractor
from .models.convnext import ConvNextFeatureExtractor, ConvNextImageProcessor
from .models.deformable_detr import DeformableDetrFeatureExtractor
from .models.deit import DeiTFeatureExtractor, DeiTImageProcessor
from .models.detr import DetrFeatureExtractor
from .models.donut import DonutFeatureExtractor
from .models.dpt import DPTFeatureExtractor, DPTImageProcessor
from .models.flava import FlavaFeatureExtractor, FlavaImageProcessor, FlavaProcessor
from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor
from .models.layoutlmv3 import LayoutLMv3FeatureExtractor, LayoutLMv3ImageProcessor
from .models.levit import LevitFeatureExtractor, LevitImageProcessor
from .models.maskformer import MaskFormerFeatureExtractor
from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
from .models.owlvit import OwlViTFeatureExtractor
from .models.perceiver import PerceiverFeatureExtractor, PerceiverImageProcessor
from .models.poolformer import PoolFormerFeatureExtractor, PoolFormerImageProcessor
from .models.segformer import SegformerFeatureExtractor, SegformerImageProcessor
from .models.videomae import VideoMAEFeatureExtractor, VideoMAEImageProcessor
from .models.vilt import ViltFeatureExtractor, ViltImageProcessor, ViltProcessor
from .models.vit import ViTFeatureExtractor, ViTImageProcessor
from .models.yolos import YolosFeatureExtractor
# Modeling
...
@@ -389,6 +389,7 @@ SPECIAL_PATTERNS = {
"_CHECKPOINT_FOR_DOC =": "checkpoint",
"_CONFIG_FOR_DOC =": "config_class",
"_TOKENIZER_FOR_DOC =": "tokenizer_class",
"_IMAGE_PROCESSOR_FOR_DOC =": "image_processor_class",
"_FEAT_EXTRACTOR_FOR_DOC =": "feature_extractor_class",
"_PROCESSOR_FOR_DOC =": "processor_class",
}
...
@@ -24,10 +24,12 @@ import huggingface_hub
from .. import (
FEATURE_EXTRACTOR_MAPPING,
IMAGE_PROCESSOR_MAPPING,
PROCESSOR_MAPPING,
TOKENIZER_MAPPING,
AutoConfig,
AutoFeatureExtractor,
AutoImageProcessor,
AutoProcessor,
AutoTokenizer,
is_datasets_available,
@@ -202,6 +204,8 @@ class PTtoTFCommand(BaseTransformersCLICommand):
processor = AutoProcessor.from_pretrained(self._local_dir)
if model_config_class in TOKENIZER_MAPPING and processor.tokenizer.pad_token is None:
processor.tokenizer.pad_token = processor.tokenizer.eos_token
elif model_config_class in IMAGE_PROCESSOR_MAPPING:
processor = AutoImageProcessor.from_pretrained(self._local_dir)
elif model_config_class in FEATURE_EXTRACTOR_MAPPING:
processor = AutoFeatureExtractor.from_pretrained(self._local_dir)
elif model_config_class in TOKENIZER_MAPPING:
...
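The new `elif` branch above slots `AutoImageProcessor` into the command's preprocessor-selection order: a full `Processor` wins, then an image processor, then a feature extractor, then a tokenizer. Reduced to a standalone sketch with mock mappings (function and argument names are hypothetical):

```python
def pick_preprocessor(config_class, processor_map, image_map, feat_map, tok_map):
    """Mirror the selection order used in the pt-to-tf command: each mapping
    is checked in priority order for the model's config class."""
    if config_class in processor_map:
        return "AutoProcessor"
    if config_class in image_map:
        return "AutoImageProcessor"
    if config_class in feat_map:
        return "AutoFeatureExtractor"
    if config_class in tok_map:
        return "AutoTokenizer"
    return None
```

Because the image-processor check comes before the feature-extractor one, vision models registered in both mappings now load via `AutoImageProcessor` rather than the legacy feature extractor.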
@@ -31,6 +31,7 @@ _import_structure = {
"auto_factory": ["get_values"],
"configuration_auto": ["ALL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CONFIG_MAPPING", "MODEL_NAMES_MAPPING", "AutoConfig"],
"feature_extraction_auto": ["FEATURE_EXTRACTOR_MAPPING", "AutoFeatureExtractor"],
"image_processing_auto": ["IMAGE_PROCESSOR_MAPPING", "AutoImageProcessor"],
"processing_auto": ["PROCESSOR_MAPPING", "AutoProcessor"],
"tokenization_auto": ["TOKENIZER_MAPPING", "AutoTokenizer"],
}
@@ -184,6 +185,7 @@ if TYPE_CHECKING:
from .auto_factory import get_values
from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, MODEL_NAMES_MAPPING, AutoConfig
from .feature_extraction_auto import FEATURE_EXTRACTOR_MAPPING, AutoFeatureExtractor
from .image_processing_auto import IMAGE_PROCESSOR_MAPPING, AutoImageProcessor
from .processing_auto import PROCESSOR_MAPPING, AutoProcessor
from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
...
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" AutoImageProcessor class."""
import importlib
import json
import os
from collections import OrderedDict
from typing import Dict, Optional, Union
# Build the list of all image processors
from ...configuration_utils import PretrainedConfig
from ...dynamic_module_utils import get_class_from_dynamic_module
from ...image_processing_utils import ImageProcessingMixin
from ...utils import CONFIG_NAME, IMAGE_PROCESSOR_NAME, get_file_from_repo, logging
from .auto_factory import _LazyAutoMapping
from .configuration_auto import (
CONFIG_MAPPING_NAMES,
AutoConfig,
model_type_to_module_name,
replace_list_option_in_docstrings,
)
logger = logging.get_logger(__name__)
IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
[
("beit", "BeitImageProcessor"),
("clip", "CLIPImageProcessor"),
("convnext", "ConvNextImageProcessor"),
("cvt", "ConvNextImageProcessor"),
("data2vec-vision", "BeitImageProcessor"),
("deit", "DeiTImageProcessor"),
("dpt", "DPTImageProcessor"),
("flava", "FlavaImageProcessor"),
("glpn", "GLPNImageProcessor"),
("groupvit", "CLIPImageProcessor"),
("imagegpt", "ImageGPTImageProcessor"),
("layoutlmv2", "LayoutLMv2ImageProcessor"),
("layoutlmv3", "LayoutLMv3ImageProcessor"),
("levit", "LevitImageProcessor"),
("mobilevit", "MobileViTImageProcessor"),
("perceiver", "PerceiverImageProcessor"),
("poolformer", "PoolFormerImageProcessor"),
("regnet", "ConvNextImageProcessor"),
("resnet", "ConvNextImageProcessor"),
("segformer", "SegformerImageProcessor"),
("swin", "ViTImageProcessor"),
("swinv2", "ViTImageProcessor"),
("van", "ConvNextImageProcessor"),
("videomae", "VideoMAEImageProcessor"),
("vilt", "ViltImageProcessor"),
("vit", "ViTImageProcessor"),
("vit_mae", "ViTImageProcessor"),
("vit_msn", "ViTImageProcessor"),
("xclip", "CLIPImageProcessor"),
]
)
IMAGE_PROCESSOR_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, IMAGE_PROCESSOR_MAPPING_NAMES)
def image_processor_class_from_name(class_name: str):
for module_name, extractors in IMAGE_PROCESSOR_MAPPING_NAMES.items():
if class_name in extractors:
module_name = model_type_to_module_name(module_name)
module = importlib.import_module(f".{module_name}", "transformers.models")
try:
return getattr(module, class_name)
except AttributeError:
continue
for _, extractor in IMAGE_PROCESSOR_MAPPING._extra_content.items():
if getattr(extractor, "__name__", None) == class_name:
return extractor
    # We did not find the class, but maybe it's because a dep is missing. In that case, the class will be in the main
# init and we return the proper dummy to get an appropriate error message.
main_module = importlib.import_module("transformers")
if hasattr(main_module, class_name):
return getattr(main_module, class_name)
return None
def get_image_processor_config(
pretrained_model_name_or_path: Union[str, os.PathLike],
cache_dir: Optional[Union[str, os.PathLike]] = None,
force_download: bool = False,
resume_download: bool = False,
proxies: Optional[Dict[str, str]] = None,
use_auth_token: Optional[Union[bool, str]] = None,
revision: Optional[str] = None,
local_files_only: bool = False,
**kwargs,
):
"""
    Loads the image processor configuration from a pretrained model's image processor configuration file.
Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:
- a string, the *model id* of a pretrained model configuration hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced
under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a configuration file saved using the
[`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`.
cache_dir (`str` or `os.PathLike`, *optional*):
Path to a directory in which a downloaded pretrained model configuration should be cached if the standard
cache should not be used.
force_download (`bool`, *optional*, defaults to `False`):
Whether or not to force to (re-)download the configuration files and override the cached versions if they
exist.
resume_download (`bool`, *optional*, defaults to `False`):
Whether or not to delete incompletely received file. Attempts to resume the download if such a file exists.
proxies (`Dict[str, str]`, *optional*):
A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
use_auth_token (`str` or *bool*, *optional*):
The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
when running `huggingface-cli login` (stored in `~/.huggingface`).
revision (`str`, *optional*, defaults to `"main"`):
The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
identifier allowed by git.
local_files_only (`bool`, *optional*, defaults to `False`):
If `True`, will only try to load the image processor configuration from local files.
<Tip>
Passing `use_auth_token=True` is required when you want to use a private model.
</Tip>
Returns:
`Dict`: The configuration of the image processor.
Examples:
```python
# Download configuration from huggingface.co and cache.
image_processor_config = get_image_processor_config("bert-base-uncased")
    # This model does not have an image processor config so the result will be an empty dict.
image_processor_config = get_image_processor_config("xlm-roberta-base")
# Save a pretrained image processor locally and you can reload its config
    from transformers import AutoImageProcessor
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_processor.save_pretrained("image-processor-test")
image_processor_config = get_image_processor_config("image-processor-test")
```"""
resolved_config_file = get_file_from_repo(
pretrained_model_name_or_path,
IMAGE_PROCESSOR_NAME,
cache_dir=cache_dir,
force_download=force_download,
resume_download=resume_download,
proxies=proxies,
use_auth_token=use_auth_token,
revision=revision,
local_files_only=local_files_only,
)
if resolved_config_file is None:
logger.info(
"Could not locate the image processor configuration file, will try to use the model config instead."
)
return {}
with open(resolved_config_file, encoding="utf-8") as reader:
return json.load(reader)
class AutoImageProcessor:
r"""
This is a generic image processor class that will be instantiated as one of the image processor classes of the
library when created with the [`AutoImageProcessor.from_pretrained`] class method.
This class cannot be instantiated directly using `__init__()` (throws an error).
"""
def __init__(self):
raise EnvironmentError(
"AutoImageProcessor is designed to be instantiated "
"using the `AutoImageProcessor.from_pretrained(pretrained_model_name_or_path)` method."
)
@classmethod
@replace_list_option_in_docstrings(IMAGE_PROCESSOR_MAPPING_NAMES)
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
r"""
Instantiate one of the image processor classes of the library from a pretrained model vocabulary.
The image processor class to instantiate is selected based on the `model_type` property of the config object
(either passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's
missing, by falling back to using pattern matching on `pretrained_model_name_or_path`:
List options
Params:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:
                - a string, the *model id* of a pretrained image processor hosted inside a model repo on
                  huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
                  namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
                - a path to a *directory* containing an image processor file saved using the
[`~image_processing_utils.ImageProcessingMixin.save_pretrained`] method, e.g.,
`./my_model_directory/`.
- a path or url to a saved image processor JSON *file*, e.g.,
`./my_model_directory/preprocessor_config.json`.
cache_dir (`str` or `os.PathLike`, *optional*):
Path to a directory in which a downloaded pretrained model image processor should be cached if the
standard cache should not be used.
force_download (`bool`, *optional*, defaults to `False`):
Whether or not to force to (re-)download the image processor files and override the cached versions if
they exist.
resume_download (`bool`, *optional*, defaults to `False`):
Whether or not to delete incompletely received file. Attempts to resume the download if such a file
exists.
proxies (`Dict[str, str]`, *optional*):
A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128',
'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request.
use_auth_token (`str` or *bool*, *optional*):
The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated
when running `huggingface-cli login` (stored in `~/.huggingface`).
revision (`str`, *optional*, defaults to `"main"`):
The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
identifier allowed by git.
return_unused_kwargs (`bool`, *optional*, defaults to `False`):
If `False`, then this function returns just the final image processor object. If `True`, then this
functions returns a `Tuple(image_processor, unused_kwargs)` where *unused_kwargs* is a dictionary
consisting of the key/value pairs whose keys are not image processor attributes: i.e., the part of
`kwargs` which has not been used to update `image_processor` and is otherwise ignored.
trust_remote_code (`bool`, *optional*, defaults to `False`):
Whether or not to allow for custom models defined on the Hub in their own modeling files. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.
kwargs (`Dict[str, Any]`, *optional*):
The values in kwargs of any keys which are image processor attributes will be used to override the
loaded values. Behavior concerning key/value pairs whose keys are *not* image processor attributes is
controlled by the `return_unused_kwargs` keyword parameter.
<Tip>
Passing `use_auth_token=True` is required when you want to use a private model.
</Tip>
Examples:
```python
>>> from transformers import AutoImageProcessor
>>> # Download image processor from huggingface.co and cache.
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> # If image processor files are in a directory (e.g. image processor was saved using *save_pretrained('./test/saved_model/')*)
>>> image_processor = AutoImageProcessor.from_pretrained("./test/saved_model/")
```"""
        config = kwargs.pop("config", None)
        trust_remote_code = kwargs.pop("trust_remote_code", False)
        kwargs["_from_auto"] = True

        config_dict, _ = ImageProcessingMixin.get_image_processor_dict(pretrained_model_name_or_path, **kwargs)
        image_processor_class = config_dict.get("image_processor_type", None)
        image_processor_auto_map = None
        if "AutoImageProcessor" in config_dict.get("auto_map", {}):
            image_processor_auto_map = config_dict["auto_map"]["AutoImageProcessor"]

        # If we still don't have the image processor class, check if we're loading from a previous feature extractor
        # config and if so, infer the image processor class from there.
        if image_processor_class is None and image_processor_auto_map is None:
            feature_extractor_class = config_dict.pop("feature_extractor_type", None)
            if feature_extractor_class is not None:
                logger.warning(
                    "Could not find image processor class in the image processor config or the model config. Loading"
                    " based on pattern matching with the model's feature extractor configuration."
                )
                image_processor_class = feature_extractor_class.replace("FeatureExtractor", "ImageProcessor")
            if "AutoFeatureExtractor" in config_dict.get("auto_map", {}):
                feature_extractor_auto_map = config_dict["auto_map"]["AutoFeatureExtractor"]
                image_processor_auto_map = feature_extractor_auto_map.replace("FeatureExtractor", "ImageProcessor")
                logger.warning(
                    "Could not find image processor auto map in the image processor config or the model config."
                    " Loading based on pattern matching with the model's feature extractor configuration."
                )
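The legacy fallback above is a plain string substitution on the class name stored in older configs. A self-contained sketch of that mapping (the function name here is illustrative, not the library's):

```python
def infer_image_processor_class_name(feature_extractor_class: str) -> str:
    # Older preprocessor configs store e.g. "feature_extractor_type": "ViTFeatureExtractor";
    # the corresponding image processor class is derived by name substitution.
    return feature_extractor_class.replace("FeatureExtractor", "ImageProcessor")


print(infer_image_processor_class_name("ViTFeatureExtractor"))  # ViTImageProcessor
print(infer_image_processor_class_name("CLIPFeatureExtractor"))  # CLIPImageProcessor
```

This works because every migrated class follows the `<Model>FeatureExtractor` / `<Model>ImageProcessor` naming convention, so no explicit lookup table is needed for the legacy path.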
        # If we don't find the image processor class in the image processor config, let's try the model config.
        if image_processor_class is None and image_processor_auto_map is None:
            if not isinstance(config, PretrainedConfig):
                config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
            # It could be in `config.image_processor_type`
            image_processor_class = getattr(config, "image_processor_type", None)
            if hasattr(config, "auto_map") and "AutoImageProcessor" in config.auto_map:
                image_processor_auto_map = config.auto_map["AutoImageProcessor"]

        if image_processor_class is not None:
            # If we have custom code for an image processor, we get the proper class.
            if image_processor_auto_map is not None:
                if not trust_remote_code:
                    raise ValueError(
                        f"Loading {pretrained_model_name_or_path} requires you to execute the image processor file "
                        "in that repo on your local machine. Make sure you have read the code there to avoid "
                        "malicious use, then set the option `trust_remote_code=True` to remove this error."
                    )
                if kwargs.get("revision", None) is None:
                    logger.warning(
                        "Explicitly passing a `revision` is encouraged when loading an image processor with custom "
                        "code to ensure no malicious code has been contributed in a newer revision."
                    )
                module_file, class_name = image_processor_auto_map.split(".")
                image_processor_class = get_class_from_dynamic_module(
                    pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs
                )
            else:
                image_processor_class = image_processor_class_from_name(image_processor_class)
            return image_processor_class.from_dict(config_dict, **kwargs)
        # Last try: we use the IMAGE_PROCESSOR_MAPPING.
        elif type(config) in IMAGE_PROCESSOR_MAPPING:
            image_processor_class = IMAGE_PROCESSOR_MAPPING[type(config)]
            return image_processor_class.from_dict(config_dict, **kwargs)

        raise ValueError(
            f"Unrecognized image processor in {pretrained_model_name_or_path}. Should have an "
            f"`image_processor_type` key in its {IMAGE_PROCESSOR_NAME} of {CONFIG_NAME}, or one of the following "
            f"`model_type` keys in its {CONFIG_NAME}: {', '.join(c for c in IMAGE_PROCESSOR_MAPPING_NAMES.keys())}"
        )
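Taken together, `from_pretrained` resolves the class from four sources in order: the image processor config, a legacy feature extractor entry, the model config, and finally the `model_type` mapping. A condensed, dependency-free sketch of that precedence (plain dicts stand in for the real config objects and the lazy `IMAGE_PROCESSOR_MAPPING`; the function name is illustrative):

```python
def resolve_image_processor_class(image_processor_config, model_config, mapping):
    # 1. Explicit class name in the image processor config.
    cls = image_processor_config.get("image_processor_type")
    # 2. Legacy feature extractor config: derive the name by substitution.
    if cls is None:
        fe_cls = image_processor_config.get("feature_extractor_type")
        if fe_cls is not None:
            cls = fe_cls.replace("FeatureExtractor", "ImageProcessor")
    # 3. Model config.
    if cls is None:
        cls = model_config.get("image_processor_type")
    # 4. Fall back to the model_type mapping.
    if cls is None:
        cls = mapping.get(model_config.get("model_type"))
    if cls is None:
        raise ValueError("Unrecognized image processor")
    return cls


mapping = {"vit": "ViTImageProcessor"}
assert resolve_image_processor_class({}, {"model_type": "vit"}, mapping) == "ViTImageProcessor"
assert (
    resolve_image_processor_class({"feature_extractor_type": "CLIPFeatureExtractor"}, {}, mapping)
    == "CLIPImageProcessor"
)
```

The ordering matters: an explicit `image_processor_type` always wins, so repos that have migrated to the new config format are never affected by the legacy fallbacks.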
    @staticmethod
    def register(config_class, image_processor_class):
        """
        Register a new image processor for this class.

        Args:
            config_class ([`PretrainedConfig`]):
                The configuration corresponding to the model to register.
            image_processor_class ([`ImageProcessingMixin`]): The image processor to register.
        """
        IMAGE_PROCESSOR_MAPPING.register(config_class, image_processor_class)
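`register` lets user code extend the config-to-processor mapping at runtime, which is how custom model types plug into the auto class. A minimal dictionary-backed sketch of the same pattern (the real library uses a lazy `_LazyAutoMapping`; the class names below are hypothetical placeholders):

```python
class MiniAutoRegistry:
    """Dictionary-backed stand-in for IMAGE_PROCESSOR_MAPPING's register/lookup."""

    def __init__(self):
        self._mapping = {}

    def register(self, config_class, processor_class):
        # Key by the config *class*, mirroring IMAGE_PROCESSOR_MAPPING.register.
        self._mapping[config_class] = processor_class

    def for_config(self, config):
        # Dispatch on the type of a config *instance*, as from_pretrained does.
        return self._mapping[type(config)]


class MyConfig:  # hypothetical custom PretrainedConfig subclass
    pass


class MyImageProcessor:  # hypothetical custom ImageProcessingMixin subclass
    pass


registry = MiniAutoRegistry()
registry.register(MyConfig, MyImageProcessor)
assert registry.for_config(MyConfig()) is MyImageProcessor
```

Keying on the config class rather than on a string means the dispatch in `from_pretrained` (`type(config) in IMAGE_PROCESSOR_MAPPING`) works without any name parsing.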
...@@ -22,6 +22,7 @@ from collections import OrderedDict
from ...configuration_utils import PretrainedConfig
from ...dynamic_module_utils import get_class_from_dynamic_module
from ...feature_extraction_utils import FeatureExtractionMixin
from ...image_processing_utils import ImageProcessingMixin
from ...tokenization_utils import TOKENIZER_CONFIG_FILE
from ...utils import FEATURE_EXTRACTOR_NAME, get_file_from_repo, logging
from .auto_factory import _LazyAutoMapping
...@@ -32,6 +33,7 @@ from .configuration_auto import (
    replace_list_option_in_docstrings,
)
from .feature_extraction_auto import AutoFeatureExtractor
from .image_processing_auto import AutoImageProcessor
from .tokenization_auto import AutoTokenizer
...@@ -189,11 +191,18 @@ class AutoProcessor:
        get_file_from_repo_kwargs = {
            key: kwargs[key] for key in inspect.signature(get_file_from_repo).parameters.keys() if key in kwargs
        }
        # Let's start by checking whether the processor class is saved in an image processor
        preprocessor_config_file = get_file_from_repo(
            pretrained_model_name_or_path, FEATURE_EXTRACTOR_NAME, **get_file_from_repo_kwargs
        )
        if preprocessor_config_file is not None:
            config_dict, _ = ImageProcessingMixin.get_image_processor_dict(pretrained_model_name_or_path, **kwargs)
            processor_class = config_dict.get("processor_class", None)
            if "AutoProcessor" in config_dict.get("auto_map", {}):
                processor_auto_map = config_dict["auto_map"]["AutoProcessor"]

        # If not found, let's check whether the processor class is saved in a feature extractor config
        if preprocessor_config_file is not None and processor_class is None:
            config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
            processor_class = config_dict.get("processor_class", None)
            if "AutoProcessor" in config_dict.get("auto_map", {}):
...@@ -261,6 +270,13 @@ class AutoProcessor:
                    pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
                )
            except Exception:
                try:
                    return AutoImageProcessor.from_pretrained(
                        pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
                    )
                except Exception:
                    pass

                try:
                    return AutoFeatureExtractor.from_pretrained(
                        pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs
...
...@@ -36,6 +36,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_beit"] = ["BeitFeatureExtractor"]
    _import_structure["image_processing_beit"] = ["BeitImageProcessor"]

try:
    if not is_torch_available():
...@@ -76,6 +77,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_beit import BeitFeatureExtractor
        from .image_processing_beit import BeitImageProcessor

    try:
        if not is_torch_available():
...
...@@ -55,6 +55,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_clip"] = ["CLIPFeatureExtractor"]
    _import_structure["image_processing_clip"] = ["CLIPImageProcessor"]

try:
    if not is_torch_available():
...@@ -126,6 +127,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_clip import CLIPFeatureExtractor
        from .image_processing_clip import CLIPImageProcessor

    try:
        if not is_torch_available():
...
...@@ -38,6 +38,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_convnext"] = ["ConvNextFeatureExtractor"]
    _import_structure["image_processing_convnext"] = ["ConvNextImageProcessor"]

try:
    if not is_torch_available():
...@@ -74,6 +75,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_convnext import ConvNextFeatureExtractor
        from .image_processing_convnext import ConvNextImageProcessor

    try:
        if not is_torch_available():
...
...@@ -35,6 +35,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_deit"] = ["DeiTFeatureExtractor"]
    _import_structure["image_processing_deit"] = ["DeiTImageProcessor"]

try:
    if not is_torch_available():
...@@ -77,6 +78,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_deit import DeiTFeatureExtractor
        from .image_processing_deit import DeiTImageProcessor

    try:
        if not is_torch_available():
...
...@@ -30,6 +30,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_dpt"] = ["DPTFeatureExtractor"]
    _import_structure["image_processing_dpt"] = ["DPTImageProcessor"]

try:
    if not is_torch_available():
...@@ -56,6 +57,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_dpt import DPTFeatureExtractor
        from .image_processing_dpt import DPTImageProcessor

    try:
        if not is_torch_available():
...
...@@ -38,6 +38,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_flava"] = ["FlavaFeatureExtractor"]
    _import_structure["image_processing_flava"] = ["FlavaImageProcessor"]
    _import_structure["processing_flava"] = ["FlavaProcessor"]

try:
...@@ -74,6 +75,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_flava import FlavaFeatureExtractor
        from .image_processing_flava import FlavaImageProcessor
        from .processing_flava import FlavaProcessor

    try:
...
...@@ -30,6 +30,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_glpn"] = ["GLPNFeatureExtractor"]
    _import_structure["image_processing_glpn"] = ["GLPNImageProcessor"]

try:
    if not is_torch_available():
...@@ -56,6 +57,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_glpn import GLPNFeatureExtractor
        from .image_processing_glpn import GLPNImageProcessor

    try:
        if not is_torch_available():
...
...@@ -32,6 +32,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_imagegpt"] = ["ImageGPTFeatureExtractor"]
    _import_structure["image_processing_imagegpt"] = ["ImageGPTImageProcessor"]

try:
    if not is_torch_available():
...@@ -59,6 +60,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_imagegpt import ImageGPTFeatureExtractor
        from .image_processing_imagegpt import ImageGPTImageProcessor

    try:
        if not is_torch_available():
...
...@@ -48,6 +48,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_layoutlmv2"] = ["LayoutLMv2FeatureExtractor"]
    _import_structure["image_processing_layoutlmv2"] = ["LayoutLMv2ImageProcessor"]

try:
    if not is_torch_available():
...@@ -84,7 +85,7 @@ if TYPE_CHECKING:
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .feature_extraction_layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor

    try:
        if not is_torch_available():
...
...@@ -83,6 +83,7 @@ except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_layoutlmv3"] = ["LayoutLMv3FeatureExtractor"]
    _import_structure["image_processing_layoutlmv3"] = ["LayoutLMv3ImageProcessor"]

if TYPE_CHECKING:
...@@ -139,6 +140,7 @@ if TYPE_CHECKING:
        pass
    else:
        from .feature_extraction_layoutlmv3 import LayoutLMv3FeatureExtractor
        from .image_processing_layoutlmv3 import LayoutLMv3ImageProcessor

else:
    import sys
...