Unverified commit ae454f41, authored by amyeroberts and committed by GitHub

Update old existing feature extractor references (#24552)

* Update old existing feature extractor references

* Typo

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

* Address comments from review - update 'feature extractor'
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
parent 10c2ac7b
@@ -354,12 +354,12 @@ Als Nächstes sehen Sie sich das Bild mit dem Merkmal 🤗 Datensätze [Bild] (h
 ### Feature extractor
-Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
+Load the feature extractor with [`AutoImageProcessor.from_pretrained`]:
 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor
->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
 ```
 ### Data augmentation
@@ -371,9 +371,9 @@ Bei Bildverarbeitungsaufgaben ist es üblich, den Bildern als Teil der Vorverarb
 ```py
 >>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
+>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
 >>> _transforms = Compose(
-...     [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
+...     [RandomResizedCrop(image_processor.size["height"]), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
 ... )
 ```
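The composed `_transforms` above is typically applied lazily with 🤗 Datasets. A minimal sketch, assuming a `dataset` object with an `image` column (both names are illustrative, not taken from the diff):

```py
>>> def transforms(examples):
...     # Apply the composed torchvision transforms and expose the result as `pixel_values`
...     examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
...     del examples["image"]
...     return examples

>>> dataset = dataset.with_transform(transforms)  # transforms run on the fly when items are accessed
```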
...
@@ -263,7 +263,7 @@ To use, create an image processor associated with the model you're using. For ex
 ViTImageProcessor {
   "do_normalize": true,
   "do_resize": true,
-  "feature_extractor_type": "ViTImageProcessor",
+  "image_processor_type": "ViTImageProcessor",
   "image_mean": [
     0.5,
     0.5,
@@ -295,7 +295,7 @@ Modify any of the [`ViTImageProcessor`] parameters to create your custom image p
 ViTImageProcessor {
   "do_normalize": false,
   "do_resize": true,
-  "feature_extractor_type": "ViTImageProcessor",
+  "image_processor_type": "ViTImageProcessor",
   "image_mean": [
     0.3,
     0.3,
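For reference, a configuration like the second output above can be produced by overriding defaults at construction time. A minimal sketch; the save path is hypothetical:

```py
>>> from transformers import ViTImageProcessor

>>> image_processor = ViTImageProcessor(do_normalize=False, image_mean=[0.3, 0.3, 0.3])
>>> image_processor.save_pretrained("./custom_image_processor")  # hypothetical local path
```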
...
@@ -50,10 +50,10 @@ product between the projected image and text features is then used as a similar
 To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
 which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
 also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
-The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model.
+The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.
 The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps
-[`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both
+[`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance to both
 encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
 [`CLIPProcessor`] and [`CLIPModel`].
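The example referenced in that last sentence sits outside this hunk; a sketch along the same lines, using the public `openai/clip-vit-base-patch32` checkpoint and a COCO sample image:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor tokenizes the text and resizes/normalizes the image in one call
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity as probabilities
```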
...
@@ -46,9 +46,9 @@ Tips:
 Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
 [`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
-The [`DonutFeatureExtractor`] class is responsible for preprocessing the input image and
+The [`DonutImageProcessor`] class is responsible for preprocessing the input image and
 [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The
-[`DonutProcessor`] wraps [`DonutFeatureExtractor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
+[`DonutProcessor`] wraps [`DonutImageProcessor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
 into a single instance to both extract the input features and decode the predicted token ids.
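For reference, a hedged sketch of that generate-and-decode flow; the checkpoint (a document-classification Donut variant), the task prompt, and the image path are assumptions for illustration:

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")

image = Image.open("document.png").convert("RGB")  # hypothetical local document image
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt becomes the first decoder input; generation then proceeds autoregressively
task_prompt = "<s_rvlcdip>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))  # parses the generated tokens into a JSON-like dict
```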
 - Step-by-step Document Image Classification
...
@@ -150,23 +150,23 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## Usage: LayoutLMv2Processor
 The easiest way to prepare data for the model is to use [`LayoutLMv2Processor`], which internally
-combines a feature extractor ([`LayoutLMv2FeatureExtractor`]) and a tokenizer
+combines an image processor ([`LayoutLMv2ImageProcessor`]) and a tokenizer
-([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The feature extractor
+([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The image processor
 handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
 for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
 modality.
 ```python
-from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
+from transformers import LayoutLMv2ImageProcessor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
-feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr is set to True by default
+image_processor = LayoutLMv2ImageProcessor()  # apply_ocr is set to True by default
 tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
-processor = LayoutLMv2Processor(feature_extractor, tokenizer)
+processor = LayoutLMv2Processor(image_processor, tokenizer)
 ```
 In short, one can provide a document image (and possibly additional data) to [`LayoutLMv2Processor`],
 and it will create the inputs expected by the model. Internally, the processor first uses
-[`LayoutLMv2FeatureExtractor`] to apply OCR on the image to get a list of words and normalized
+[`LayoutLMv2ImageProcessor`] to apply OCR on the image to get a list of words and normalized
 bounding boxes, as well as to resize the image to a given size in order to get the `image` input. The words and
 normalized bounding boxes are then provided to [`LayoutLMv2Tokenizer`] or
 [`LayoutLMv2TokenizerFast`], which converts them to token-level `input_ids`,
@@ -176,7 +176,7 @@ which are turned into token-level `labels`.
 [`LayoutLMv2Processor`] uses [PyTesseract](https://pypi.org/project/pytesseract/), a Python
 wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
 choice, and provide the words and normalized boxes yourself. This requires initializing
-[`LayoutLMv2FeatureExtractor`] with `apply_ocr` set to `False`.
+[`LayoutLMv2ImageProcessor`] with `apply_ocr` set to `False`.
 In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
 use cases work for both batched and non-batched inputs (we illustrate them for non-batched inputs).
@@ -184,7 +184,7 @@ use cases work for both batched and non-batched inputs (we illustrate them for n
 **Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr =
 True**
-This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
+This is the simplest case, in which the processor (actually the image processor) will perform OCR on the image to get
 the words and normalized bounding boxes.
 ```python
@@ -205,7 +205,7 @@ print(encoding.keys())
 **Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
-In case one wants to do OCR themselves, one can initialize the feature extractor with `apply_ocr` set to
+In case one wants to do OCR themselves, one can initialize the image processor with `apply_ocr` set to
 `False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
 the processor.
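A sketch of that second use case, with hypothetical words and (0-1000 normalized) boxes standing in for your own OCR output:

```python
from PIL import Image
from transformers import LayoutLMv2ImageProcessor, LayoutLMv2Processor, LayoutLMv2TokenizerFast

image_processor = LayoutLMv2ImageProcessor(apply_ocr=False)  # OCR is done outside the processor
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(image_processor, tokenizer)

image = Image.open("document.png").convert("RGB")  # hypothetical document image
words = ["hello", "world"]  # output of your own OCR engine
boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]  # boxes normalized to a 0-1000 scale

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
# encoding typically contains input_ids, attention_mask, token_type_ids, bbox and image
```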
...
@@ -31,7 +31,7 @@ Tips:
 - In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
   - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
   - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
-Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3FeatureExtractor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
+Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model.
 - Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
 - Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
 - Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
...
@@ -52,7 +52,7 @@ tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
 ```
 Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies
-[`LayoutLMv2FeatureExtractor`] and
+[`LayoutLMv2ImageProcessor`] and
 [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
 data for the model.
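For reference, the processor can also be loaded in one call; a minimal sketch using the checkpoint named in the hunk above:

```python
from transformers import LayoutXLMProcessor

# Loads the image processor and tokenizer belonging to this checkpoint together
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
```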
...
@@ -28,7 +28,7 @@ The abstract from the paper is the following:
 OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
-[`OwlViTFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].
+[`OwlViTImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`].
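The example referenced here is collapsed in the diff; a sketch in the same spirit, using the public `google/owlvit-base-patch32` checkpoint (the post-processing call follows the pattern used elsewhere in the docs):

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]  # one list of text queries per image

inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Convert raw outputs to boxes/scores/labels in the original image coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
```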
 ```python
...
@@ -39,7 +39,7 @@ Tips:
 - The quickest way to get started with ViLT is by checking the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ViLT)
 (which showcase both inference and fine-tuning on custom data).
 - ViLT is a model that takes both `pixel_values` and `input_ids` as input. One can use [`ViltProcessor`] to prepare data for the model.
-This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one.
+This processor wraps an image processor (for the image modality) and a tokenizer (for the language modality) into one.
 - ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
 under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
 which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
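For reference, a sketch of the processor in action on a visual question answering checkpoint; the checkpoint name is the public `dandelin/vilt-b32-finetuned-vqa` and the image path is illustrative:

```python
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("photo.jpg").convert("RGB")  # hypothetical local image
question = "How many cats are there?"

# The processor returns pixel_values, pixel_mask and input_ids in a single encoding
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[idx])
```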
...
@@ -462,9 +462,9 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco
 >>> class CocoDetection(torchvision.datasets.CocoDetection):
-...     def __init__(self, img_folder, feature_extractor, ann_file):
+...     def __init__(self, img_folder, image_processor, ann_file):
 ...         super().__init__(img_folder, ann_file)
-...         self.feature_extractor = feature_extractor
+...         self.image_processor = image_processor
 ...     def __getitem__(self, idx):
 ...         # read in PIL image and target in COCO format
@@ -474,7 +474,7 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco
 ...         # resizing + normalization of both image and target)
 ...         image_id = self.ids[idx]
 ...         target = {"image_id": image_id, "annotations": target}
-...         encoding = self.feature_extractor(images=img, annotations=target, return_tensors="pt")
+...         encoding = self.image_processor(images=img, annotations=target, return_tensors="pt")
 ...         pixel_values = encoding["pixel_values"].squeeze()  # remove batch dimension
 ...         target = encoding["labels"][0]  # remove batch dimension
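For context, this class is typically instantiated with a DETR-style image processor and a local COCO layout; the checkpoint and paths below are placeholders, not values from the diff:

```py
>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
>>> train_dataset = CocoDetection("path/to/train_images", image_processor, "path/to/annotations.json")

>>> pixel_values, target = train_dataset[0]
>>> pixel_values.shape  # (num_channels, height, width) after resizing and normalization
```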
@@ -591,4 +591,3 @@ Let's plot the result:
 <div class="flex justify-center">
     <img src="https://i.imgur.com/4QZnf9A.png" alt="Object detection result on a new image"/>
 </div>
@@ -73,12 +73,12 @@ Cada clase de alimento - o label - corresponde a un número; `79` indica una cos
 ## Preprocess
-Load the ViT feature extractor to process the image into a tensor:
+Load the ViT image processor to process the image into a tensor:
 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor
->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
 ```
 Apply several image transformations to the dataset to make the model more robust against overfitting. Here you will use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:
@@ -86,8 +86,8 @@ Aplica varias transformaciones de imagen al dataset para hacer el modelo más ro
 ```py
 >>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor
->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
+>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
->>> _transforms = Compose([RandomResizedCrop(feature_extractor.size), ToTensor(), normalize])
+>>> _transforms = Compose([RandomResizedCrop(image_processor.size["height"]), ToTensor(), normalize])
 ```
 Create a preprocessing function that applies the transformations and returns the `pixel_values` - the inputs to the model - of the image:
@@ -160,7 +160,7 @@ Al llegar a este punto, solo quedan tres pasos:
 ...     data_collator=data_collator,
 ...     train_dataset=food["train"],
 ...     eval_dataset=food["test"],
-...     tokenizer=feature_extractor,
+...     tokenizer=image_processor,
 ... )
 >>> trainer.train()
...
@@ -454,9 +454,9 @@ COCO 데이터 세트를 빌드하는 API는 데이터를 특정 형식으로
 >>> class CocoDetection(torchvision.datasets.CocoDetection):
-...     def __init__(self, img_folder, feature_extractor, ann_file):
+...     def __init__(self, img_folder, image_processor, ann_file):
 ...         super().__init__(img_folder, ann_file)
-...         self.feature_extractor = feature_extractor
+...         self.image_processor = image_processor
 ...     def __getitem__(self, idx):
 ...         # read in PIL image and target in COCO format
@@ -466,7 +466,7 @@ COCO 데이터 세트를 빌드하는 API는 데이터를 특정 형식으로
 ...         # resizing + normalization of both image and target)
 ...         image_id = self.ids[idx]
 ...         target = {"image_id": image_id, "annotations": target}
-...         encoding = self.feature_extractor(images=img, annotations=target, return_tensors="pt")
+...         encoding = self.image_processor(images=img, annotations=target, return_tensors="pt")
 ...         pixel_values = encoding["pixel_values"].squeeze()  # remove batch dimension
 ...         target = encoding["labels"][0]  # remove batch dimension
@@ -586,4 +586,3 @@ Detected Mask with confidence 0.584 at location [2449.06, 823.19, 3256.43, 1413.
 <div class="flex justify-center">
     <img src="https://i.imgur.com/4QZnf9A.png" alt="Object detection result on a new image"/>
 </div>
@@ -354,12 +354,12 @@ def convert_align_checkpoint(checkpoint_path, pytorch_dump_folder_path, save_mod
     # Create folder to save model
     if not os.path.isdir(pytorch_dump_folder_path):
         os.mkdir(pytorch_dump_folder_path)
-    # Save converted model and feature extractor
+    # Save converted model and image processor
     hf_model.save_pretrained(pytorch_dump_folder_path)
     processor.save_pretrained(pytorch_dump_folder_path)
     if push_to_hub:
-        # Push model and feature extractor to hub
+        # Push model and image processor to hub
         print("Pushing converted ALIGN to the hub...")
         processor.push_to_hub("align-base")
         hf_model.push_to_hub("align-base")
@@ -381,7 +381,7 @@ if __name__ == "__main__":
         help="Path to the output PyTorch model directory.",
     )
     parser.add_argument("--save_model", action="store_true", help="Save model to local")
-    parser.add_argument("--push_to_hub", action="store_true", help="Push model and feature extractor to the hub")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model and image processor to the hub")
     args = parser.parse_args()
     convert_align_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.save_model, args.push_to_hub)
@@ -27,10 +27,10 @@ from PIL import Image
 from transformers import (
     BeitConfig,
-    BeitFeatureExtractor,
     BeitForImageClassification,
     BeitForMaskedImageModeling,
     BeitForSemanticSegmentation,
+    BeitImageProcessor,
 )
 from transformers.image_utils import PILImageResampling
 from transformers.utils import logging
@@ -266,16 +266,16 @@ def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     # Check outputs on an image
     if is_semantic:
-        feature_extractor = BeitFeatureExtractor(size=config.image_size, do_center_crop=False)
+        image_processor = BeitImageProcessor(size=config.image_size, do_center_crop=False)
         ds = load_dataset("hf-internal-testing/fixtures_ade20k", split="test")
         image = Image.open(ds[0]["file"])
     else:
-        feature_extractor = BeitFeatureExtractor(
+        image_processor = BeitImageProcessor(
             size=config.image_size, resample=PILImageResampling.BILINEAR, do_center_crop=False
         )
         image = prepare_img()
-    encoding = feature_extractor(images=image, return_tensors="pt")
+    encoding = image_processor(images=image, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
     outputs = model(pixel_values)
@@ -353,8 +353,8 @@ def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 if __name__ == "__main__":
...
@@ -468,7 +468,7 @@ class ChineseCLIPOnnxConfig(OnnxConfig):
             processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework
         )
         image_input_dict = super().generate_dummy_inputs(
-            processor.feature_extractor, batch_size=batch_size, framework=framework
+            processor.image_processor, batch_size=batch_size, framework=framework
         )
         return {**text_input_dict, **image_input_dict}
...
@@ -449,7 +449,7 @@ class CLIPOnnxConfig(OnnxConfig):
             processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework
         )
         image_input_dict = super().generate_dummy_inputs(
-            processor.feature_extractor, batch_size=batch_size, framework=framework
+            processor.image_processor, batch_size=batch_size, framework=framework
        )
         return {**text_input_dict, **image_input_dict}
...
@@ -28,7 +28,7 @@ from transformers import (
     CLIPSegTextConfig,
     CLIPSegVisionConfig,
     CLIPTokenizer,
-    ViTFeatureExtractor,
+    ViTImageProcessor,
 )
@@ -185,9 +185,9 @@ def convert_clipseg_checkpoint(model_name, checkpoint_path, pytorch_dump_folder_
     if unexpected_keys != ["decoder.reduce.weight", "decoder.reduce.bias"]:
         raise ValueError(f"Unexpected keys: {unexpected_keys}")
-    feature_extractor = ViTFeatureExtractor(size=352)
+    image_processor = ViTImageProcessor(size=352)
     tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
-    processor = CLIPSegProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
+    processor = CLIPSegProcessor(image_processor=image_processor, tokenizer=tokenizer)
     image = prepare_img()
     text = ["a glass", "something to fill", "wood", "a jar"]
...
@@ -27,9 +27,9 @@ from PIL import Image
 from transformers import (
     ConditionalDetrConfig,
-    ConditionalDetrFeatureExtractor,
     ConditionalDetrForObjectDetection,
     ConditionalDetrForSegmentation,
+    ConditionalDetrImageProcessor,
 )
 from transformers.utils import logging
@@ -244,13 +244,13 @@ def convert_conditional_detr_checkpoint(model_name, pytorch_dump_folder_path):
     config.id2label = id2label
     config.label2id = {v: k for k, v in id2label.items()}
-    # load feature extractor
+    # load image processor
     format = "coco_panoptic" if is_panoptic else "coco_detection"
-    feature_extractor = ConditionalDetrFeatureExtractor(format=format)
+    image_processor = ConditionalDetrImageProcessor(format=format)
     # prepare image
     img = prepare_img()
-    encoding = feature_extractor(images=img, return_tensors="pt")
+    encoding = image_processor(images=img, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
     logger.info(f"Converting model {model_name}...")
@@ -302,11 +302,11 @@ def convert_conditional_detr_checkpoint(model_name, pytorch_dump_folder_path):
     if is_panoptic:
         assert torch.allclose(outputs.pred_masks, original_outputs["pred_masks"], atol=1e-4)
-    # Save model and feature extractor
+    # Save model and image processor
-    logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
+    logger.info(f"Saving PyTorch model and image processor to {pytorch_dump_folder_path}...")
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     model.save_pretrained(pytorch_dump_folder_path)
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 if __name__ == "__main__":
...
@@ -26,7 +26,7 @@ import torch
 from huggingface_hub import hf_hub_download
 from PIL import Image
-from transformers import ConvNextConfig, ConvNextFeatureExtractor, ConvNextForImageClassification
+from transformers import ConvNextConfig, ConvNextForImageClassification, ConvNextImageProcessor
 from transformers.utils import logging
@@ -144,10 +144,10 @@ def convert_convnext_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     model.load_state_dict(state_dict)
     model.eval()
-    # Check outputs on an image, prepared by ConvNextFeatureExtractor
+    # Check outputs on an image, prepared by ConvNextImageProcessor
     size = 224 if "224" in checkpoint_url else 384
-    feature_extractor = ConvNextFeatureExtractor(size=size)
+    image_processor = ConvNextImageProcessor(size=size)
-    pixel_values = feature_extractor(images=prepare_img(), return_tensors="pt").pixel_values
+    pixel_values = image_processor(images=prepare_img(), return_tensors="pt").pixel_values
     logits = model(pixel_values).logits
@@ -191,8 +191,8 @@ def convert_convnext_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
     print("Pushing model to the hub...")
     model_name = "convnext"
...
@@ -24,7 +24,7 @@ from collections import OrderedDict
 import torch
 from huggingface_hub import cached_download, hf_hub_url
-from transformers import AutoFeatureExtractor, CvtConfig, CvtForImageClassification
+from transformers import AutoImageProcessor, CvtConfig, CvtForImageClassification
 def embeddings(idx):
@@ -307,8 +307,8 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
         config.embed_dim = [192, 768, 1024]
     model = CvtForImageClassification(config)
-    feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/convnext-base-224-22k-1k")
+    image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-base-224-22k-1k")
-    feature_extractor.size["shortest_edge"] = image_size
+    image_processor.size["shortest_edge"] = image_size
     original_weights = torch.load(cvt_file_name, map_location=torch.device("cpu"))
     huggingface_weights = OrderedDict()
@@ -329,7 +329,7 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
     model.load_state_dict(huggingface_weights)
     model.save_pretrained(pytorch_dump_folder)
-    feature_extractor.save_pretrained(pytorch_dump_folder)
+    image_processor.save_pretrained(pytorch_dump_folder)
 # Download the weights from zoo: https://1drv.ms/u/s!AhIXJn_J-blW9RzF3rMW7SsLHa8h?e=blQ0Al
...