Unverified commit ae454f41, authored by amyeroberts and committed by GitHub

Update old existing feature extractor references (#24552)

* Update old existing feature extractor references

* Typo

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

* Address comments from review - update 'feature extractor'
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
parent 10c2ac7b
@@ -354,12 +354,12 @@ Als Nächstes sehen Sie sich das Bild mit dem Merkmal 🤗 Datensätze [Bild] (h
 ### Feature extractor
-Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
+Load the feature extractor with [`AutoImageProcessor.from_pretrained`]:
 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor
->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
 ```
 ### Data augmentation
@@ -371,9 +371,9 @@ Bei Bildverarbeitungsaufgaben ist es üblich, den Bildern als Teil der Vorverarb
 ```py
 >>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
+>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
 >>> _transforms = Compose(
-...     [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
+...     [RandomResizedCrop(image_processor.size["height"]), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
 ... )
 ```
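The composed `_transforms` above is typically applied lazily with 🤗 Datasets. A minimal sketch, assuming a `dataset` object with an `image` column (both names are illustrative, not taken from the diff):

```py
>>> def transforms(examples):
...     # Apply the composed torchvision transforms and expose the result as `pixel_values`
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples

>>> dataset = dataset.with_transform(transforms)  # transforms run on the fly when items are accessed
```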
...
@@ -263,7 +263,7 @@ To use, create an image processor associated with the model you're using. For ex
 ViTImageProcessor {
   "do_normalize": true,
   "do_resize": true,
-  "feature_extractor_type": "ViTImageProcessor",
+  "image_processor_type": "ViTImageProcessor",
   "image_mean": [
     0.5,
     0.5,
@@ -295,7 +295,7 @@ Modify any of the [`ViTImageProcessor`] parameters to create your custom image p
 ViTImageProcessor {
   "do_normalize": false,
   "do_resize": true,
-  "feature_extractor_type": "ViTImageProcessor",
+  "image_processor_type": "ViTImageProcessor",
   "image_mean": [
     0.3,
     0.3,
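For reference, a configuration like the second output above can be produced by overriding defaults at construction time. A minimal sketch; the save path is hypothetical:

```py
>>> from transformers import ViTImageProcessor

>>> image_processor = ViTImageProcessor(do_normalize=False, image_mean=[0.3, 0.3, 0.3])
>>> image_processor.save_pretrained("./custom_image_processor")  # hypothetical local path
```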
...
@@ -50,10 +50,10 @@ product between the projected image and text features is then used as a similar
 To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
 which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
 also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
-The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model.
+The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.
 The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps
-[`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both
+[`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance to both
 encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
 [`CLIPProcessor`] and [`CLIPModel`].
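The example referenced in that last sentence sits outside this hunk; a sketch along the same lines, using the public `openai/clip-vit-base-patch32` checkpoint and a COCO sample image:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor tokenizes the text and resizes/normalizes the image in one call
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity as probabilities
```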
...
@@ -46,9 +46,9 @@ Tips:
 Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
 [`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
-The [`DonutFeatureExtractor`] class is responsible for preprocessing the input image and
+The [`DonutImageProcessor`] class is responsible for preprocessing the input image and
 [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The
-[`DonutProcessor`] wraps [`DonutFeatureExtractor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
+[`DonutProcessor`] wraps [`DonutImageProcessor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
 into a single instance to both extract the input features and decode the predicted token ids.
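For reference, a hedged sketch of that generate-and-decode flow; the checkpoint (a document-classification Donut variant), the task prompt, and the image path are assumptions for illustration:

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")

image = Image.open("document.png").convert("RGB")  # hypothetical local document image
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt becomes the first decoder input; generation then proceeds autoregressively
task_prompt = "<s_rvlcdip>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))  # parses the generated tokens into a JSON-like dict
```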
 - Step-by-step Document Image Classification
...
@@ -150,23 +150,23 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## Usage: LayoutLMv2Processor
 The easiest way to prepare data for the model is to use [`LayoutLMv2Processor`], which internally
-combines a feature extractor ([`LayoutLMv2FeatureExtractor`]) and a tokenizer
+combines an image processor ([`LayoutLMv2ImageProcessor`]) and a tokenizer
-([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The feature extractor
+([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The image processor
 handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
 for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
 modality.
 ```python
-from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
+from transformers import LayoutLMv2ImageProcessor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
-feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr is set to True by default
+image_processor = LayoutLMv2ImageProcessor()  # apply_ocr is set to True by default
 tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
-processor = LayoutLMv2Processor(feature_extractor, tokenizer)
+processor = LayoutLMv2Processor(image_processor, tokenizer)
 ```
 In short, one can provide a document image (and possibly additional data) to [`LayoutLMv2Processor`],
 and it will create the inputs expected by the model. Internally, the processor first uses
-[`LayoutLMv2FeatureExtractor`] to apply OCR on the image to get a list of words and normalized
+[`LayoutLMv2ImageProcessor`] to apply OCR on the image to get a list of words and normalized
 bounding boxes, as well as to resize the image to a given size in order to get the `image` input. The words and
 normalized bounding boxes are then provided to [`LayoutLMv2Tokenizer`] or
 [`LayoutLMv2TokenizerFast`], which converts them to token-level `input_ids`,
@@ -176,7 +176,7 @@ which are turned into token-level `labels`.
 [`LayoutLMv2Processor`] uses [PyTesseract](https://pypi.org/project/pytesseract/), a Python
 wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
 choice, and provide the words and normalized boxes yourself. This requires initializing
-[`LayoutLMv2FeatureExtractor`] with `apply_ocr` set to `False`.
+[`LayoutLMv2ImageProcessor`] with `apply_ocr` set to `False`.
 In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
 use cases work for both batched and non-batched inputs (we illustrate them for non-batched inputs).
@@ -184,7 +184,7 @@ use cases work for both batched and non-batched inputs (we illustrate them for n
 **Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr =
 True**
-This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
+This is the simplest case, in which the processor (actually the image processor) will perform OCR on the image to get
 the words and normalized bounding boxes.
 ```python
@@ -205,7 +205,7 @@ print(encoding.keys())
 **Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
-In case one wants to do OCR themselves, one can initialize the feature extractor with `apply_ocr` set to
+In case one wants to do OCR themselves, one can initialize the image processor with `apply_ocr` set to
 `False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
 the processor.
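A sketch of that second use case, with hypothetical words and (0-1000 normalized) boxes standing in for your own OCR output:

```python
from PIL import Image
from transformers import LayoutLMv2ImageProcessor, LayoutLMv2Processor, LayoutLMv2TokenizerFast

image_processor = LayoutLMv2ImageProcessor(apply_ocr=False)  # OCR is done outside the processor
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(image_processor, tokenizer)

image = Image.open("document.png").convert("RGB")  # hypothetical document image
words = ["hello", "world"]  # output of your own OCR engine
boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]  # boxes normalized to a 0-1000 scale

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
# encoding typically contains input_ids, attention_mask, token_type_ids, bbox and image
```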
...
@@ -31,7 +31,7 @@ Tips:
 - In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
   - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
   - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
-Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3FeatureExtractor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
+Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
 - Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
 - Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
 - Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
...
@@ -52,7 +52,7 @@ tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
 ```
 Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies
-[`LayoutLMv2FeatureExtractor`] and
+[`LayoutLMv2ImageProcessor`] and
 [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
 data for the model.
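For reference, the processor can also be loaded in one call; a minimal sketch using the checkpoint named in the hunk above:

```python
from transformers import LayoutXLMProcessor

# Loads the image processor and tokenizer belonging to this checkpoint together
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
```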
...
@@ -28,7 +28,7 @@ The abstract from the paper is the following:
 OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
-[`OwlViTFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].
+[`OwlViTImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].
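The example referenced here is collapsed in the diff; a sketch in the same spirit, using the public `google/owlvit-base-patch32` checkpoint (the post-processing call follows the pattern used elsewhere in the docs):

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]  # one list of text queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Convert raw outputs to boxes/scores/labels in the original image coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
```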
 ```python
...
@@ -39,7 +39,7 @@ Tips:
 - The quickest way to get started with ViLT is by checking the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ViLT)
 (which showcase both inference and fine-tuning on custom data).
 - ViLT is a model that takes both `pixel_values` and `input_ids` as input. One can use [`ViltProcessor`] to prepare data for the model.
-This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one.
+This processor wraps an image processor (for the image modality) and a tokenizer (for the language modality) into one.
 - ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
 under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
 which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
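For reference, a sketch of the processor in action on a visual question answering checkpoint; the checkpoint name is the public `dandelin/vilt-b32-finetuned-vqa` and the image path is illustrative:

```python
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("photo.jpg").convert("RGB")  # hypothetical local image
question = "How many cats are there?"

# The processor returns pixel_values, pixel_mask and input_ids in a single encoding
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[idx])
```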
...
@@ -462,9 +462,9 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco
 >>> class CocoDetection(torchvision.datasets.CocoDetection):
-...     def __init__(self, img_folder, feature_extractor, ann_file):
+...     def __init__(self, img_folder, image_processor, ann_file):
 ...         super().__init__(img_folder, ann_file)
-...         self.feature_extractor = feature_extractor
+...         self.image_processor = image_processor
 ...     def __getitem__(self, idx):
 ...         # read in PIL image and target in COCO format
@@ -474,7 +474,7 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco
 ...         # resizing + normalization of both image and target)
 ...         image_id = self.ids[idx]
 ...         target = {"image_id": image_id, "annotations": target}
-...         encoding = self.feature_extractor(images=img, annotations=target, return_tensors="pt")
+...         encoding = self.image_processor(images=img, annotations=target, return_tensors="pt")
 ...         pixel_values = encoding["pixel_values"].squeeze()  # remove batch dimension
 ...         target = encoding["labels"][0]  # remove batch dimension
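For context, this class is typically instantiated with a DETR-style image processor and a local COCO layout; the checkpoint and paths below are placeholders, not values from the diff:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
>>> train_dataset = CocoDetection("path/to/train_images", image_processor, "path/to/annotations.json")

>>> pixel_values, target = train_dataset[0]
>>> pixel_values.shape  # (num_channels, height, width) after resizing and normalization
```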
@@ -591,4 +591,3 @@ Let's plot the result:
 <div class="flex justify-center">
     <img src="https://i.imgur.com/4QZnf9A.png" alt="Object detection result on a new image"/>
 </div>
@@ -73,12 +73,12 @@ Cada clase de alimento - o label - corresponde a un número; `79` indica una cos
 ## Preprocess
-Load the ViT feature extractor to process the image into a tensor:
+Load the ViT image processor to process the image into a tensor:
 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor
->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
 ```
 Apply several image transformations to the dataset to make the model more robust against overfitting. Here you will use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:
@@ -86,8 +86,8 @@ Aplica varias transformaciones de imagen al dataset para hacer el modelo más ro
 ```py
 >>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor
->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
+>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
->>> _transforms = Compose([RandomResizedCrop(feature_extractor.size), ToTensor(), normalize])
+>>> _transforms = Compose([RandomResizedCrop(image_processor.size["height"]), ToTensor(), normalize])
 ```
 Create a preprocessing function that applies the transformations and returns the `pixel_values` - the inputs to the model - of the image:
@@ -160,7 +160,7 @@ Al llegar a este punto, solo quedan tres pasos:
 ...     data_collator=data_collator,
 ...     train_dataset=food["train"],
 ...     eval_dataset=food["test"],
-...     tokenizer=feature_extractor,
+...     tokenizer=image_processor,
 ... )
 >>> trainer.train()
...
@@ -454,9 +454,9 @@ COCO 데이터 세트를 빌드하는 API는 데이터를 특정 형식으로
 >>> class CocoDetection(torchvision.datasets.CocoDetection):
-...     def __init__(self, img_folder, feature_extractor, ann_file):
+...     def __init__(self, img_folder, image_processor, ann_file):
 ...         super().__init__(img_folder, ann_file)
-...         self.feature_extractor = feature_extractor
+...         self.image_processor = image_processor
 ...     def __getitem__(self, idx):
 ...         # read in PIL image and target in COCO format
@@ -466,7 +466,7 @@ COCO 데이터 세트를 빌드하는 API는 데이터를 특정 형식으로
 ...         # resizing + normalization of both image and target)
 ...         image_id = self.ids[idx]
 ...         target = {"image_id": image_id, "annotations": target}
-...         encoding = self.feature_extractor(images=img, annotations=target, return_tensors="pt")
+...         encoding = self.image_processor(images=img, annotations=target, return_tensors="pt")
 ...         pixel_values = encoding["pixel_values"].squeeze()  # remove batch dimension
 ...         target = encoding["labels"][0]  # remove batch dimension
@@ -586,4 +586,3 @@ Detected Mask with confidence 0.584 at location [2449.06, 823.19, 3256.43, 1413.
 <div class="flex justify-center">
     <img src="https://i.imgur.com/4QZnf9A.png" alt="Object detection result on a new image"/>
 </div>
@@ -354,12 +354,12 @@ def convert_align_checkpoint(checkpoint_path, pytorch_dump_folder_path, save_mod
     # Create folder to save model
     if not os.path.isdir(pytorch_dump_folder_path):
         os.mkdir(pytorch_dump_folder_path)
-    # Save converted model and feature extractor
+    # Save converted model and image processor
     hf_model.save_pretrained(pytorch_dump_folder_path)
     processor.save_pretrained(pytorch_dump_folder_path)
     if push_to_hub:
-        # Push model and feature extractor to hub
+        # Push model and image processor to hub
         print("Pushing converted ALIGN to the hub...")
         processor.push_to_hub("align-base")
         hf_model.push_to_hub("align-base")
@@ -381,7 +381,7 @@ if __name__ == "__main__":
         help="Path to the output PyTorch model directory.",
     )
     parser.add_argument("--save_model", action="store_true", help="Save model to local")
-    parser.add_argument("--push_to_hub", action="store_true", help="Push model and feature extractor to the hub")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model and image processor to the hub")
     args = parser.parse_args()
     convert_align_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.save_model, args.push_to_hub)
@@ -27,10 +27,10 @@ from PIL import Image
 from transformers import (
     BeitConfig,
-    BeitFeatureExtractor,
     BeitForImageClassification,
     BeitForMaskedImageModeling,
     BeitForSemanticSegmentation,
+    BeitImageProcessor,
 )
 from transformers.image_utils import PILImageResampling
 from transformers.utils import logging
@@ -266,16 +266,16 @@ def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     # Check outputs on an image
     if is_semantic:
-        feature_extractor = BeitFeatureExtractor(size=config.image_size, do_center_crop=False)
+        image_processor = BeitImageProcessor(size=config.image_size, do_center_crop=False)
         ds = load_dataset("hf-internal-testing/fixtures_ade20k", split="test")
         image = Image.open(ds[0]["file"])
     else:
-        feature_extractor = BeitFeatureExtractor(
+        image_processor = BeitImageProcessor(
             size=config.image_size, resample=PILImageResampling.BILINEAR, do_center_crop=False
         )
         image = prepare_img()
-    encoding = feature_extractor(images=image, return_tensors="pt")
+    encoding = image_processor(images=image, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
     outputs = model(pixel_values)
@@ -353,8 +353,8 @@ def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 if __name__ == "__main__":
...
@@ -468,7 +468,7 @@ class ChineseCLIPOnnxConfig(OnnxConfig):
             processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework
         )
         image_input_dict = super().generate_dummy_inputs(
-            processor.feature_extractor, batch_size=batch_size, framework=framework
+            processor.image_processor, batch_size=batch_size, framework=framework
         )
         return {**text_input_dict, **image_input_dict}
...
@@ -449,7 +449,7 @@ class CLIPOnnxConfig(OnnxConfig):
             processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework
         )
         image_input_dict = super().generate_dummy_inputs(
-            processor.feature_extractor, batch_size=batch_size, framework=framework
+            processor.image_processor, batch_size=batch_size, framework=framework
        )
         return {**text_input_dict, **image_input_dict}
...
@@ -28,7 +28,7 @@ from transformers import (
     CLIPSegTextConfig,
     CLIPSegVisionConfig,
     CLIPTokenizer,
-    ViTFeatureExtractor,
+    ViTImageProcessor,
 )
@@ -185,9 +185,9 @@ def convert_clipseg_checkpoint(model_name, checkpoint_path, pytorch_dump_folder_
     if unexpected_keys != ["decoder.reduce.weight", "decoder.reduce.bias"]:
         raise ValueError(f"Unexpected keys: {unexpected_keys}")
-    feature_extractor = ViTFeatureExtractor(size=352)
+    image_processor = ViTImageProcessor(size=352)
     tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
-    processor = CLIPSegProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
+    processor = CLIPSegProcessor(image_processor=image_processor, tokenizer=tokenizer)
     image = prepare_img()
     text = ["a glass", "something to fill", "wood", "a jar"]
...
@@ -27,9 +27,9 @@ from PIL import Image
 from transformers import (
     ConditionalDetrConfig,
-    ConditionalDetrFeatureExtractor,
     ConditionalDetrForObjectDetection,
     ConditionalDetrForSegmentation,
+    ConditionalDetrImageProcessor,
 )
 from transformers.utils import logging
@@ -244,13 +244,13 @@ def convert_conditional_detr_checkpoint(model_name, pytorch_dump_folder_path):
     config.id2label = id2label
     config.label2id = {v: k for k, v in id2label.items()}
-    # load feature extractor
+    # load image processor
     format = "coco_panoptic" if is_panoptic else "coco_detection"
-    feature_extractor = ConditionalDetrFeatureExtractor(format=format)
+    image_processor = ConditionalDetrImageProcessor(format=format)
     # prepare image
     img = prepare_img()
-    encoding = feature_extractor(images=img, return_tensors="pt")
+    encoding = image_processor(images=img, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
     logger.info(f"Converting model {model_name}...")
@@ -302,11 +302,11 @@ def convert_conditional_detr_checkpoint(model_name, pytorch_dump_folder_path):
     if is_panoptic:
         assert torch.allclose(outputs.pred_masks, original_outputs["pred_masks"], atol=1e-4)
-    # Save model and feature extractor
+    # Save model and image processor
-    logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
+    logger.info(f"Saving PyTorch model and image processor to {pytorch_dump_folder_path}...")
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     model.save_pretrained(pytorch_dump_folder_path)
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 if __name__ == "__main__":
...
@@ -26,7 +26,7 @@ import torch
 from huggingface_hub import hf_hub_download
 from PIL import Image
-from transformers import ConvNextConfig, ConvNextFeatureExtractor, ConvNextForImageClassification
+from transformers import ConvNextConfig, ConvNextForImageClassification, ConvNextImageProcessor
 from transformers.utils import logging
@@ -144,10 +144,10 @@ def convert_convnext_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     model.load_state_dict(state_dict)
     model.eval()
-    # Check outputs on an image, prepared by ConvNextFeatureExtractor
+    # Check outputs on an image, prepared by ConvNextImageProcessor
     size = 224 if "224" in checkpoint_url else 384
-    feature_extractor = ConvNextFeatureExtractor(size=size)
+    image_processor = ConvNextImageProcessor(size=size)
-    pixel_values = feature_extractor(images=prepare_img(), return_tensors="pt").pixel_values
+    pixel_values = image_processor(images=prepare_img(), return_tensors="pt").pixel_values
     logits = model(pixel_values).logits
@@ -191,8 +191,8 @@ def convert_convnext_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
     print("Pushing model to the hub...")
     model_name = "convnext"
...
@@ -24,7 +24,7 @@ from collections import OrderedDict
 import torch
 from huggingface_hub import cached_download, hf_hub_url
-from transformers import AutoFeatureExtractor, CvtConfig, CvtForImageClassification
+from transformers import AutoImageProcessor, CvtConfig, CvtForImageClassification
 def embeddings(idx):
@@ -307,8 +307,8 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
         config.embed_dim = [192, 768, 1024]
     model = CvtForImageClassification(config)
-    feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/convnext-base-224-22k-1k")
+    image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-base-224-22k-1k")
-    feature_extractor.size["shortest_edge"] = image_size
+    image_processor.size["shortest_edge"] = image_size
     original_weights = torch.load(cvt_file_name, map_location=torch.device("cpu"))
     huggingface_weights = OrderedDict()
@@ -329,7 +329,7 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
     model.load_state_dict(huggingface_weights)
     model.save_pretrained(pytorch_dump_folder)
-    feature_extractor.save_pretrained(pytorch_dump_folder)
+    image_processor.save_pretrained(pytorch_dump_folder)
 # Download the weights from zoo: https://1drv.ms/u/s!AhIXJn_J-blW9RzF3rMW7SsLHa8h?e=blQ0Al
...