Unverified Commit 4eb918e6 authored by amyeroberts, committed by GitHub

AutoImageProcessor (#20111)

* AutoImageProcessor skeleton

* Update references

* Add mapping in init

* Add model image processors to __init__ for importing

* Add AutoImageProcessor tests

* Fix up

* Image Processor documentation

* Remove pdb

* Update docs/source/en/model_doc/mobilevit.mdx

* Update docs

* Don't add whitespace on json files

* Remove fixtures

* Move checking model config down

* Fix up

* Add check for image processor

* Remove FeatureExtractorMixin in docstrings

* Rename model_tmpfile to config_tmpfile

* Don't make None if not in image processor map
parent c08a1e26
@@ -183,6 +183,8 @@
    title: DeepSpeed Integration
  - local: main_classes/feature_extractor
    title: Feature Extractor
  - local: main_classes/image_processor
    title: Image Processor
  title: Main Classes
- sections:
  - isExpanded: false
...
@@ -29,6 +29,6 @@ Most of those are only useful if you are studying the code of the image processo
[[autodoc]] image_transforms.to_pil_image
## ImageProcessingMixin
[[autodoc]] image_processing_utils.ImageProcessingMixin
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Image Processor
An image processor is in charge of preparing input features for vision models and post-processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch, TensorFlow, Flax and NumPy tensors. It may also include model-specific post-processing such as converting logits to segmentation masks.
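A minimal sketch of the typical workflow (the checkpoint name and image path below are only illustrative):

```py
from PIL import Image
from transformers import AutoImageProcessor

# Load the preprocessing configuration stored alongside a vision checkpoint
# ("google/vit-base-patch16-224" is used purely as an example).
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Resize, normalize and convert a single image to framework tensors
image = Image.open("cats.png")
inputs = image_processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224])

# The configuration can be saved and reloaded with the usual mixin methods
image_processor.save_pretrained("./my_image_processor")
```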
## ImageProcessingMixin
[[autodoc]] image_processing_utils.ImageProcessingMixin
- from_pretrained
- save_pretrained
## BatchFeature
[[autodoc]] BatchFeature
## BaseImageProcessor
[[autodoc]] image_processing_utils.BaseImageProcessor
@@ -66,6 +66,10 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
[[autodoc]] AutoFeatureExtractor
## AutoImageProcessor
[[autodoc]] AutoImageProcessor
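As a rough sketch (the checkpoint name is an example only), [`AutoImageProcessor`] reads the preprocessing configuration saved with a checkpoint and instantiates the matching image processor class:

```py
from transformers import AutoImageProcessor

# The checkpoint below should resolve to a ConvNextImageProcessor;
# any vision checkpoint with a preprocessing config works the same way.
image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
print(type(image_processor).__name__)
```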
## AutoProcessor
[[autodoc]] AutoProcessor
...
@@ -60,7 +60,7 @@ Tips:
position embeddings.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
alt="drawing" width="600"/>
<small> BEiT pre-training. Taken from the <a href="https://arxiv.org/abs/2106.08254">original paper.</a> </small>
@@ -84,6 +84,12 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code
- __call__
- post_process_semantic_segmentation
## BeitImageProcessor
[[autodoc]] BeitImageProcessor
- preprocess
- post_process_semantic_segmentation
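A minimal, illustrative sketch of pairing the image processor with [`BeitForSemanticSegmentation`] (the checkpoint name and image path are placeholders):

```py
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitForSemanticSegmentation

checkpoint = "microsoft/beit-base-finetuned-ade-640-640"  # example checkpoint
image_processor = BeitImageProcessor.from_pretrained(checkpoint)
model = BeitForSemanticSegmentation.from_pretrained(checkpoint)

image = Image.open("scene.jpg")
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the logits to the original image size and take the per-pixel argmax
segmentation = image_processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
```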
## BeitModel
[[autodoc]] BeitModel
...
@@ -100,6 +100,11 @@ This model was contributed by [valhalla](https://huggingface.co/valhalla). The o
[[autodoc]] CLIPTokenizerFast
## CLIPImageProcessor
[[autodoc]] CLIPImageProcessor
- preprocess
## CLIPFeatureExtractor
[[autodoc]] CLIPFeatureExtractor
...
@@ -33,7 +33,7 @@ Tips:
- See the code examples below each model regarding usage.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/convnext_architecture.jpg"
alt="drawing" width="600"/>
<small> ConvNeXT architecture. Taken from the <a href="https://arxiv.org/abs/2201.03545">original paper</a>.</small>
@@ -50,6 +50,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlo
[[autodoc]] ConvNextFeatureExtractor
## ConvNextImageProcessor
[[autodoc]] ConvNextImageProcessor
- preprocess
## ConvNextModel
[[autodoc]] ConvNextModel
@@ -71,4 +76,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlo
## TFConvNextForImageClassification
[[autodoc]] TFConvNextForImageClassification
- call
\ No newline at end of file
@@ -81,6 +81,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tenso
[[autodoc]] DeiTFeatureExtractor
- __call__
## DeiTImageProcessor
[[autodoc]] DeiTImageProcessor
- preprocess
## DeiTModel
[[autodoc]] DeiTModel
...
@@ -22,7 +22,7 @@ The abstract from the paper is the following:
*We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dpt_architecture.jpg"
alt="drawing" width="600"/>
<small> DPT architecture. Taken from the <a href="https://arxiv.org/abs/2103.13413" target="_blank">original paper</a>. </small>
@@ -40,6 +40,13 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
- post_process_semantic_segmentation
## DPTImageProcessor
[[autodoc]] DPTImageProcessor
- preprocess
- post_process_semantic_segmentation
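A rough sketch of monocular depth estimation with [`DPTForDepthEstimation`] (the `Intel/dpt-large` checkpoint and image path are only illustrative):

```py
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

checkpoint = "Intel/dpt-large"  # example checkpoint
image_processor = DPTImageProcessor.from_pretrained(checkpoint)
model = DPTForDepthEstimation.from_pretrained(checkpoint)

image = Image.open("room.jpg")
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth

# Upsample the predicted depth map back to the original image resolution
depth = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1), size=image.size[::-1], mode="bicubic"
).squeeze()
```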
## DPTModel
[[autodoc]] DPTModel
@@ -55,4 +62,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
## DPTForSemanticSegmentation
[[autodoc]] DPTForSemanticSegmentation
- forward
\ No newline at end of file
@@ -16,17 +16,17 @@ specific language governing permissions and limitations under the License.
The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
The paper aims at creating a single unified foundation model which can work across vision, language
as well as vision-and-language multimodal tasks.
The abstract from the paper is the following:
*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
impressive performance on a wide range of 35 tasks spanning these target modalities.*
@@ -61,6 +61,11 @@ This model was contributed by [aps](https://huggingface.co/aps). The original co
[[autodoc]] FlavaFeatureExtractor
## FlavaImageProcessor
[[autodoc]] FlavaImageProcessor
- preprocess
## FlavaForPreTraining
[[autodoc]] FlavaForPreTraining
...
@@ -35,7 +35,7 @@ Tips:
- One can use [`GLPNFeatureExtractor`] to prepare images for the model.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
alt="drawing" width="600"/>
<small> Summary of the approach. Taken from the <a href="https://arxiv.org/abs/2201.07436" target="_blank">original paper</a>. </small>
@@ -50,6 +50,11 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
[[autodoc]] GLPNFeatureExtractor
- __call__
## GLPNImageProcessor
[[autodoc]] GLPNImageProcessor
- preprocess
## GLPNModel
[[autodoc]] GLPNModel
@@ -58,4 +63,4 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi
## GLPNForDepthEstimation
[[autodoc]] GLPNForDepthEstimation
- forward
\ No newline at end of file
@@ -29,7 +29,7 @@ competitive with self-supervised benchmarks on ImageNet when substituting pixels
top-1 accuracy on a linear probe of our features.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/imagegpt_architecture.png"
alt="drawing" width="600"/>
<small> Summary of the approach. Taken from the [original paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). </small>
@@ -81,6 +81,11 @@ Tips:
- __call__
## ImageGPTImageProcessor
[[autodoc]] ImageGPTImageProcessor
- preprocess
## ImageGPTModel
[[autodoc]] ImageGPTModel
@@ -97,4 +102,4 @@ Tips:
[[autodoc]] ImageGPTForImageClassification
- forward
\ No newline at end of file
@@ -45,7 +45,7 @@ RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained Layo
this https URL.*
LayoutLMv2 depends on `detectron2`, `torchvision` and `tesseract`. Run the
following to install them:
```
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
python -m pip install torchvision tesseract
@@ -275,6 +275,11 @@ print(encoding.keys())
[[autodoc]] LayoutLMv2FeatureExtractor
- __call__
## LayoutLMv2ImageProcessor
[[autodoc]] LayoutLMv2ImageProcessor
- preprocess
## LayoutLMv2Tokenizer
[[autodoc]] LayoutLMv2Tokenizer
...
@@ -73,6 +73,11 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
[[autodoc]] LayoutLMv3FeatureExtractor
- __call__
## LayoutLMv3ImageProcessor
[[autodoc]] LayoutLMv3ImageProcessor
- preprocess
## LayoutLMv3Tokenizer
[[autodoc]] LayoutLMv3Tokenizer
...
@@ -19,18 +19,18 @@ The LeViT model was proposed in [LeViT: Introducing Convolutions to Vision Trans
The abstract from the paper is the following:
*We design a family of image classification architectures that optimize the trade-off between accuracy
and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures,
which are competitive on highly parallel processing hardware. We revisit principles from the extensive
literature on convolutional neural networks to apply them to transformers, in particular activation maps
with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information
in vision transformers. As a result, we propose LeVIT: a hybrid neural network for fast inference image classification.
We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of
application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable
to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect
to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.*
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/levit_architecture.png"
alt="drawing" width="600"/>
<small> LeViT Architecture. Taken from the <a href="https://arxiv.org/abs/2104.01136">original paper</a>.</small>
@@ -38,25 +38,25 @@ Tips:
- Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet-like model). The distillation head is learned through backpropagation under supervision of a ResNet-like model. They also draw inspiration from convolutional neural networks to use activation maps with decreasing resolutions to increase the efficiency.
- There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top
of the final hidden state and not using the distillation head, or (2) by placing both a prediction head and distillation
head on top of the final hidden state. In that case, the prediction head is trained using regular cross-entropy between
the prediction of the head and the ground-truth label, while the distillation prediction head is trained using hard distillation
(cross-entropy between the prediction of the distillation head and the label predicted by the teacher). At inference time,
one takes the average prediction between both heads as final prediction. (2) is also called "fine-tuning with distillation",
because one relies on a teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds
to [`LevitForImageClassification`] and (2) corresponds to [`LevitForImageClassificationWithTeacher`].
- All released checkpoints were pre-trained and fine-tuned on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k)
(also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes) only. No external data was used. This is in
contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
pre-training.
- The authors of LeViT released 5 trained LeViT models, which you can directly plug into [`LevitModel`] or [`LevitForImageClassification`].
Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
(while only using ImageNet-1k for pre-training). The 5 variants available are (all trained on images of size 224x224):
*facebook/levit-128S*, *facebook/levit-128*, *facebook/levit-192*, *facebook/levit-256* and
*facebook/levit-384*. Note that one should use [`LevitFeatureExtractor`] in order to
prepare images for the model.
- [`LevitForImageClassificationWithTeacher`] currently supports only inference and not training or fine-tuning (a short inference sketch follows below).
- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
(you can just replace [`ViTFeatureExtractor`] by [`LevitFeatureExtractor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
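A minimal inference sketch with [`LevitForImageClassificationWithTeacher`], assuming the *facebook/levit-128S* checkpoint mentioned above (the image path is a placeholder):

```py
import torch
from PIL import Image
from transformers import LevitImageProcessor, LevitForImageClassificationWithTeacher

checkpoint = "facebook/levit-128S"
image_processor = LevitImageProcessor.from_pretrained(checkpoint)
model = LevitForImageClassificationWithTeacher.from_pretrained(checkpoint)

image = Image.open("dog.jpg")
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    # `logits` averages the classification and distillation heads
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```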
@@ -71,6 +71,12 @@ This model was contributed by [anugunj](https://huggingface.co/anugunj). The ori
[[autodoc]] LevitFeatureExtractor
- __call__
## LevitImageProcessor
[[autodoc]] LevitImageProcessor
- preprocess
## LevitModel
[[autodoc]] LevitModel
...
@@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License.
## Overview
The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers.
The abstract from the paper is the following:
@@ -25,10 +25,10 @@ Tips:
- MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
- One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB); a short preprocessing sketch follows below.
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
- As the name suggests, MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) to generate a
TensorFlow Lite model:
```py
@@ -52,7 +52,7 @@ with open(tflite_filename, "wb") as f:
```
The resulting model will be just **about an MB**, making it a good fit for mobile applications where resources and network
bandwidth can be constrained.
This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets).
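A rough preprocessing sketch (the `apple/mobilevit-small` checkpoint and image path are assumptions used only for illustration):

```py
from PIL import Image
from transformers import MobileViTImageProcessor, MobileViTForImageClassification

checkpoint = "apple/mobilevit-small"  # example checkpoint
image_processor = MobileViTImageProcessor.from_pretrained(checkpoint)
model = MobileViTForImageClassification.from_pretrained(checkpoint)

# The image processor takes care of resizing, cropping and the RGB -> BGR
# channel flip expected by the pretrained checkpoints.
image = Image.open("plant.jpg")
inputs = image_processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```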
@@ -68,6 +68,12 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
- __call__
- post_process_semantic_segmentation
## MobileViTImageProcessor
[[autodoc]] MobileViTImageProcessor
- preprocess
- post_process_semantic_segmentation
## MobileViTModel
[[autodoc]] MobileViTModel
@@ -86,14 +92,14 @@ This model was contributed by [matthijs](https://huggingface.co/Matthijs). The T
## TFMobileViTModel
[[autodoc]] TFMobileViTModel
- call
## TFMobileViTForImageClassification
[[autodoc]] TFMobileViTForImageClassification
- call
## TFMobileViTForSemanticSegmentation
[[autodoc]] TFMobileViTForSemanticSegmentation
- call
@@ -70,7 +70,7 @@ vocabulary size of the model, i.e. creating logits of shape `(batch_size, 2048,
size of 262 byte IDs).
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perceiver_architecture.jpg"
alt="drawing" width="600"/>
<small> Perceiver IO architecture. Taken from the <a href="https://arxiv.org/abs/2105.15203">original paper</a> </small>
@@ -83,8 +83,8 @@ Tips:
notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver).
- Refer to the [blog post](https://huggingface.co/blog/perceiver) if you want to fully understand how the model works and
is implemented in the library. Note that the models available in the library only showcase some examples of what you can do
with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection,
audio classification, video classification, etc.
**Note**:
@@ -114,6 +114,11 @@ audio classification, video classification, etc.
[[autodoc]] PerceiverFeatureExtractor
- __call__
## PerceiverImageProcessor
[[autodoc]] PerceiverImageProcessor
- preprocess
## PerceiverTextPreprocessor
[[autodoc]] models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
...
@@ -50,12 +50,17 @@ This model was contributed by [heytanay](https://huggingface.co/heytanay). The o
[[autodoc]] PoolFormerFeatureExtractor
- __call__
## PoolFormerImageProcessor
[[autodoc]] PoolFormerImageProcessor
- preprocess
## PoolFormerModel
[[autodoc]] PoolFormerModel
- forward
## PoolFormerForImageClassification
[[autodoc]] PoolFormerForImageClassification
- forward
\ No newline at end of file
@@ -36,7 +36,7 @@ The figure below illustrates the architecture of SegFormer. Taken from the [orig
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/segformer_architecture.png"/>
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version
of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/NVlabs/SegFormer).
Tips:
@@ -55,7 +55,7 @@ Tips:
- TensorFlow users should refer to [this repository](https://github.com/deep-diver/segformer-tf-transformers) that shows off-the-shelf inference and fine-tuning.
- One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
to try out a SegFormer model on custom images.
- SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
- One can use [`SegformerFeatureExtractor`] to prepare images and corresponding segmentation maps
for the model. Note that this feature extractor is fairly basic and does not include all data augmentations used in
the original paper. The original preprocessing pipelines (for the ADE20k dataset for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py). The most
@@ -95,6 +95,12 @@ SegFormer's results on the segmentation datasets like ADE20k, refer to the [pape
- __call__
- post_process_semantic_segmentation
## SegformerImageProcessor
[[autodoc]] SegformerImageProcessor
- preprocess
- post_process_semantic_segmentation
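An illustrative sketch of preparing an image together with its segmentation map (the checkpoint name and file paths are placeholders; `segmentation.png` is assumed to be a single-channel label map):

```py
from PIL import Image
from transformers import SegformerImageProcessor

image_processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

image = Image.open("scene.jpg")
segmentation_map = Image.open("segmentation.png")

# Returns pixel_values for the model and labels built from the segmentation map
encoding = image_processor(images=image, segmentation_maps=segmentation_map, return_tensors="pt")
print(encoding["pixel_values"].shape, encoding["labels"].shape)
```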
## SegformerModel
[[autodoc]] SegformerModel
@@ -123,14 +129,14 @@ SegFormer's results on the segmentation datasets like ADE20k, refer to the [pape
## TFSegformerModel
[[autodoc]] TFSegformerModel
- call
## TFSegformerForImageClassification
[[autodoc]] TFSegformerForImageClassification
- call
## TFSegformerForSemanticSegmentation
[[autodoc]] TFSegformerForSemanticSegmentation
- call
@@ -27,7 +27,7 @@ Tips:
- [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"
alt="drawing" width="600"/>
<small> VideoMAE pre-training. Taken from the <a href="https://arxiv.org/abs/2203.12602">original paper</a>. </small>
@@ -44,6 +44,11 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
[[autodoc]] VideoMAEFeatureExtractor
- __call__
## VideoMAEImageProcessor
[[autodoc]] VideoMAEImageProcessor
- preprocess
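A minimal sketch using dummy frames (the checkpoint name is illustrative; a real pipeline would decode the frames from a video file):

```py
import numpy as np
from transformers import VideoMAEImageProcessor

image_processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")

# 16 dummy RGB frames of shape (height, width, channels) standing in for a decoded video
video = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(16)]
inputs = image_processor(video, return_tensors="pt")
print(inputs["pixel_values"].shape)  # (batch, num_frames, channels, height, width)
```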
## VideoMAEModel
[[autodoc]] VideoMAEModel
@@ -57,4 +62,4 @@ The original code can be found [here](https://github.com/MCG-NJU/VideoMAE).
## VideoMAEForVideoClassification
[[autodoc]] transformers.VideoMAEForVideoClassification
- forward
\ No newline at end of file