Unverified Commit 83e5a106 authored by NielsRogge, committed by GitHub

Add BEiT (#12994)



* First pass

* Make conversion script work

* Improve conversion script

* Fix bug, conversion script working

* Improve conversion script, implement BEiTFeatureExtractor

* Make conversion script work based on URL

* Improve conversion script

* Add tests, add documentation

* Fix bug in conversion script

* Fix another bug

* Add support for converting masked image modeling model

* Add support for converting masked image modeling

* Fix bug

* Add print statement for debugging

* Fix another bug

* Make conversion script finally work for masked image modeling models

* Move id2label for datasets to JSON files on the hub

* Make sure id's are read in as integers

* Add integration tests

* Make style & quality

* Fix test, add BEiT to README

* Apply suggestions from @sgugger's review

* Apply suggestions from code review

* Make quality

* Replace nielsr by microsoft in tests, add docs

* Rename BEiT to Beit

* Minor fix

* Fix docs of BeitForMaskedImageModeling
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 0dd1152c
...@@ -211,6 +211,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[ALBERT](https://huggingface.co/transformers/model_doc/albert.html)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
1. **[BART](https://huggingface.co/transformers/model_doc/bart.html)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/transformers/model_doc/barthez.html)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BEiT](https://huggingface.co/transformers/master/model_doc/beit.html)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei.
1. **[BERT](https://huggingface.co/transformers/model_doc/bert.html)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
1. **[BERT For Sequence Generation](https://huggingface.co/transformers/model_doc/bertgeneration.html)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
1. **[BigBird-RoBERTa](https://huggingface.co/transformers/model_doc/bigbird.html)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
...
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
BEiT
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The BEiT model was proposed in `BEiT: BERT Pre-Training of Image Transformers <https://arxiv.org/abs/2106.08254>`__ by
Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
of an image (as done in the `original ViT paper <https://arxiv.org/abs/2010.11929>`__), BEiT models are pre-trained to
predict visual tokens from the codebook of OpenAI's `DALL-E model <https://arxiv.org/abs/2102.12092>`__ given masked
patches.
The abstract from the paper is the following:
*We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image
patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training
objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
Experimental results on image classification and semantic segmentation show that our model achieves competitive results
with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*
Tips:
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
outperform both the original model (ViT) and Data-efficient Image Transformers (DeiT) when fine-tuned on
ImageNet-1K and CIFAR-100.
- As the BEiT models expect each image to be of the same size (resolution), one can use
:class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
each checkpoint. For example, :obj:`microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the `hub
<https://huggingface.co/models?search=microsoft/beit>`__.
- The available checkpoints are either (1) pre-trained on `ImageNet-22k <http://www.image-net.org/>`__ (a collection of
14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on `ImageNet-1k
<http://www.image-net.org/challenges/LSVRC/2012/>`__ (also referred to as ILSVRC 2012, a collection of 1.3 million
images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position
bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
pre-train a model from scratch, one needs to set either the :obj:`use_relative_position_bias` or the
:obj:`use_shared_relative_position_bias` attribute of :class:`~transformers.BeitConfig` to :obj:`True` in order to add
position embeddings (see the sketch below).
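Following the tips above, a from-scratch pre-training setup can be sketched as follows (a minimal sketch only: the mask pattern below is arbitrary, and 196 = (224/16)² is the number of patches at the default resolution)::

    >>> import torch
    >>> from transformers import BeitConfig, BeitForMaskedImageModeling

    >>> # mask token enabled and one relative position bias shared across all
    >>> # self-attention layers, as during the original pre-training
    >>> config = BeitConfig(use_mask_token=True, use_shared_relative_position_bias=True)
    >>> model = BeitForMaskedImageModeling(config)

    >>> pixel_values = torch.randn(1, 3, 224, 224)
    >>> bool_masked_pos = torch.zeros(1, 196, dtype=torch.bool)
    >>> bool_masked_pos[:, :75] = True  # arbitrary choice of patches to mask
    >>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
    >>> outputs.logits.shape  # one prediction over the 8192 visual tokens per patch
    torch.Size([1, 196, 8192])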
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/microsoft/unilm/tree/master/beit>`__.
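For image classification, inference with one of the fine-tuned checkpoints mentioned above can be sketched as follows (a sketch; the image is the COCO image also used by the conversion script, but any RGB image works)::

    >>> from PIL import Image
    >>> import requests
    >>> from transformers import BeitFeatureExtractor, BeitForImageClassification

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
    >>> model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

    >>> inputs = feature_extractor(images=image, return_tensors="pt")
    >>> logits = model(**inputs).logits
    >>> print(model.config.id2label[logits.argmax(-1).item()])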
BeitConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BeitConfig
:members:
BeitFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BeitFeatureExtractor
:members: __call__
BeitModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BeitModel
:members: forward
BeitForMaskedImageModeling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BeitForMaskedImageModeling
:members: forward
BeitForImageClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BeitForImageClassification
:members: forward
...@@ -147,6 +147,7 @@ _import_structure = {
],
"models.bart": ["BartConfig", "BartTokenizer"],
"models.barthez": [],
"models.beit": ["BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BeitConfig"],
"models.bert": [ "models.bert": [
"BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BERT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"BasicTokenizer", "BasicTokenizer",
...@@ -412,6 +413,7 @@ else: ...@@ -412,6 +413,7 @@ else:
# Vision-specific objects # Vision-specific objects
if is_vision_available(): if is_vision_available():
_import_structure["image_utils"] = ["ImageFeatureExtractionMixin"] _import_structure["image_utils"] = ["ImageFeatureExtractionMixin"]
_import_structure["models.beit"].append("BeitFeatureExtractor")
_import_structure["models.clip"].append("CLIPFeatureExtractor") _import_structure["models.clip"].append("CLIPFeatureExtractor")
_import_structure["models.clip"].append("CLIPProcessor") _import_structure["models.clip"].append("CLIPProcessor")
_import_structure["models.deit"].append("DeiTFeatureExtractor") _import_structure["models.deit"].append("DeiTFeatureExtractor")
...@@ -510,7 +512,6 @@ if is_torch_available(): ...@@ -510,7 +512,6 @@ if is_torch_available():
"load_tf_weights_in_albert", "load_tf_weights_in_albert",
] ]
) )
_import_structure["models.auto"].extend( _import_structure["models.auto"].extend(
[ [
"MODEL_FOR_CAUSAL_LM_MAPPING", "MODEL_FOR_CAUSAL_LM_MAPPING",
...@@ -542,7 +543,6 @@ if is_torch_available(): ...@@ -542,7 +543,6 @@ if is_torch_available():
"AutoModelWithLMHead", "AutoModelWithLMHead",
] ]
) )
_import_structure["models.bart"].extend( _import_structure["models.bart"].extend(
[ [
"BART_PRETRAINED_MODEL_ARCHIVE_LIST", "BART_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -555,6 +555,15 @@ if is_torch_available(): ...@@ -555,6 +555,15 @@ if is_torch_available():
"PretrainedBartModel", "PretrainedBartModel",
] ]
) )
_import_structure["models.beit"].extend(
[
"BEIT_PRETRAINED_MODEL_ARCHIVE_LIST",
"BeitForImageClassification",
"BeitForMaskedImageModeling",
"BeitModel",
"BeitPreTrainedModel",
]
)
_import_structure["models.bert"].extend( _import_structure["models.bert"].extend(
[ [
"BERT_PRETRAINED_MODEL_ARCHIVE_LIST", "BERT_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -1814,6 +1823,7 @@ if TYPE_CHECKING: ...@@ -1814,6 +1823,7 @@ if TYPE_CHECKING:
AutoTokenizer, AutoTokenizer,
) )
from .models.bart import BartConfig, BartTokenizer from .models.bart import BartConfig, BartTokenizer
from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
from .models.bert import (
BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
BasicTokenizer,
...@@ -2049,6 +2059,7 @@ if TYPE_CHECKING:
if is_vision_available():
from .image_utils import ImageFeatureExtractionMixin
from .models.beit import BeitFeatureExtractor
from .models.clip import CLIPFeatureExtractor, CLIPProcessor
from .models.deit import DeiTFeatureExtractor
from .models.detr import DetrFeatureExtractor
...@@ -2171,6 +2182,13 @@ if TYPE_CHECKING:
BartPretrainedModel,
PretrainedBartModel,
)
from .models.beit import (
BEIT_PRETRAINED_MODEL_ARCHIVE_LIST,
BeitForImageClassification,
BeitForMaskedImageModeling,
BeitModel,
BeitPreTrainedModel,
)
from .models.bert import (
BERT_PRETRAINED_MODEL_ARCHIVE_LIST,
BertForMaskedLM,
...
...@@ -21,6 +21,8 @@ from .file_utils import _is_torch, is_torch_available
IMAGENET_DEFAULT_MEAN = [0.485, 0.456, 0.406]
IMAGENET_DEFAULT_STD = [0.229, 0.224, 0.225]
IMAGENET_STANDARD_MEAN = [0.5, 0.5, 0.5]
IMAGENET_STANDARD_STD = [0.5, 0.5, 0.5]
def is_torch_tensor(obj):
...
...@@ -21,6 +21,7 @@ from . import (
auto,
bart,
barthez,
beit,
bert,
bert_generation,
bert_japanese,
...
...@@ -20,6 +20,7 @@ from collections import OrderedDict
from ...configuration_utils import PretrainedConfig
from ..albert.configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from ..bart.configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
from ..beit.configuration_beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
from ..bert.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from ..bert_generation.configuration_bert_generation import BertGenerationConfig
from ..big_bird.configuration_big_bird import BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP, BigBirdConfig
...@@ -97,6 +98,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
(key, value)
for pretrained_map in [
# Add archive maps here
BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP,
REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP,
...@@ -158,6 +160,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
CONFIG_MAPPING = OrderedDict(
[
# Add configs here
("beit", BeitConfig),
("rembert", RemBertConfig), ("rembert", RemBertConfig),
("visual_bert", VisualBertConfig), ("visual_bert", VisualBertConfig),
("canine", CanineConfig), ("canine", CanineConfig),
...@@ -225,6 +228,7 @@ CONFIG_MAPPING = OrderedDict( ...@@ -225,6 +228,7 @@ CONFIG_MAPPING = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict( MODEL_NAMES_MAPPING = OrderedDict(
[ [
# Add full (and cased) model names here # Add full (and cased) model names here
("beit", "BeiT"),
("rembert", "RemBERT"), ("rembert", "RemBERT"),
("visual_bert", "VisualBert"), ("visual_bert", "VisualBert"),
("canine", "Canine"), ("canine", "Canine"),
......
...@@ -17,9 +17,9 @@ ...@@ -17,9 +17,9 @@
import os import os
from collections import OrderedDict from collections import OrderedDict
from transformers import DeiTFeatureExtractor, Speech2TextFeatureExtractor, ViTFeatureExtractor from transformers import BeitFeatureExtractor, DeiTFeatureExtractor, Speech2TextFeatureExtractor, ViTFeatureExtractor
from ... import DeiTConfig, PretrainedConfig, Speech2TextConfig, ViTConfig, Wav2Vec2Config from ... import BeitConfig, DeiTConfig, PretrainedConfig, Speech2TextConfig, ViTConfig, Wav2Vec2Config
from ...feature_extraction_utils import FeatureExtractionMixin from ...feature_extraction_utils import FeatureExtractionMixin
# Build the list of all feature extractors # Build the list of all feature extractors
...@@ -30,6 +30,7 @@ from .configuration_auto import AutoConfig, replace_list_option_in_docstrings ...@@ -30,6 +30,7 @@ from .configuration_auto import AutoConfig, replace_list_option_in_docstrings
FEATURE_EXTRACTOR_MAPPING = OrderedDict( FEATURE_EXTRACTOR_MAPPING = OrderedDict(
[ [
(BeitConfig, BeitFeatureExtractor),
(DeiTConfig, DeiTFeatureExtractor),
(Speech2TextConfig, Speech2TextFeatureExtractor),
(ViTConfig, ViTFeatureExtractor),
...
...@@ -37,6 +37,7 @@ from ..bart.modeling_bart import (
BartForSequenceClassification,
BartModel,
)
from ..beit.modeling_beit import BeitForImageClassification, BeitModel
from ..bert.modeling_bert import (
BertForMaskedLM,
BertForMultipleChoice,
...@@ -321,6 +322,7 @@ from .auto_factory import _BaseAutoModelClass, auto_class_update
from .configuration_auto import (
AlbertConfig,
BartConfig,
BeitConfig,
BertConfig,
BertGenerationConfig,
BigBirdConfig,
...@@ -388,6 +390,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING = OrderedDict(
[
# Base model mapping
(BeitConfig, BeitModel),
(RemBertConfig, RemBertModel),
(VisualBertConfig, VisualBertModel),
(CanineConfig, CanineModel),
...@@ -579,6 +582,7 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING = OrderedDict(
# Model for Image Classification mapping
(ViTConfig, ViTForImageClassification),
(DeiTConfig, (DeiTForImageClassification, DeiTForImageClassificationWithTeacher)),
(BeitConfig, BeitForImageClassification),
]
)
...
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_beit": ["BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BeitConfig"],
}
if is_vision_available():
_import_structure["feature_extraction_beit"] = ["BeitFeatureExtractor"]
if is_torch_available():
_import_structure["modeling_beit"] = [
"BEIT_PRETRAINED_MODEL_ARCHIVE_LIST",
"BeitForImageClassification",
"BeitForMaskedImageModeling",
"BeitModel",
"BeitPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
if is_vision_available():
from .feature_extraction_beit import BeitFeatureExtractor
if is_torch_available():
from .modeling_beit import (
BEIT_PRETRAINED_MODEL_ARCHIVE_LIST,
BeitForImageClassification,
BeitForMaskedImageModeling,
BeitModel,
BeitPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
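The lazy structure above keeps `import transformers` cheap: the torch- and vision-dependent submodules are only imported when one of their attributes is first accessed. A quick sketch of the effect:

# resolved through _LazyModule; modeling_beit is only actually imported at this point
from transformers.models.beit import BeitConfig, BeitModel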
# coding=utf-8
# Copyright Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BEiT model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/beit-base-patch16-224-in22k": "https://huggingface.co/microsoft/beit-base-patch16-224-in22k/resolve/main/config.json",
# See all BEiT models at https://huggingface.co/models?filter=beit
}
class BeitConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BeitModel`. It is used to
instantiate a BEiT model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the BEiT
`microsoft/beit-base-patch16-224-in22k <https://huggingface.co/microsoft/beit-base-patch16-224-in22k>`__
architecture.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 8192):
Vocabulary size of the BEiT model. Defines the number of different image tokens that can be used during
pre-training.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, use gradient checkpointing to save memory at the expense of slower backward pass.
image_size (:obj:`int`, `optional`, defaults to :obj:`224`):
The size (resolution) of each image.
patch_size (:obj:`int`, `optional`, defaults to :obj:`16`):
The size (resolution) of each patch.
num_channels (:obj:`int`, `optional`, defaults to :obj:`3`):
The number of input channels.
use_mask_token (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use a mask token for masked image modeling.
use_absolute_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use BERT-style absolute position embeddings.
use_relative_position_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use T5-style relative position embeddings in the self-attention layers.
use_shared_relative_position_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use the same relative position embeddings across all self-attention layers of the Transformer.
layer_scale_init_value (:obj:`float`, `optional`, defaults to 0.1):
Scale to use in the self-attention layers. 0.1 for base, 1e-5 for large. Set 0 to disable layer scale.
drop_path_rate (:obj:`float`, `optional`, defaults to 0.1):
Stochastic depth rate per sample (when applied in the main path of residual layers).
use_mean_pooling (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to mean pool the final hidden states of the patches instead of using the final hidden state of the
CLS token, before applying the classification head.
Example::
>>> from transformers import BeitModel, BeitConfig
>>> # Initializing a BEiT beit-base-patch16-224-in22k style configuration
>>> configuration = BeitConfig()
>>> # Initializing a model from the beit-base-patch16-224-in22k style configuration
>>> model = BeitModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "beit"
def __init__(
self,
vocab_size=8192,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.0,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
layer_norm_eps=1e-12,
is_encoder_decoder=False,
image_size=224,
patch_size=16,
num_channels=3,
use_mask_token=False,
use_absolute_position_embeddings=False,
use_relative_position_bias=False,
use_shared_relative_position_bias=False,
layer_scale_init_value=0.1,
drop_path_rate=0.1,
use_mean_pooling=True,
**kwargs
):
super().__init__(**kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.image_size = image_size
self.patch_size = patch_size
self.num_channels = num_channels
self.use_mask_token = use_mask_token
self.use_absolute_position_embeddings = use_absolute_position_embeddings
self.use_relative_position_bias = use_relative_position_bias
self.use_shared_relative_position_bias = use_shared_relative_position_bias
self.layer_scale_init_value = layer_scale_init_value
self.drop_path_rate = drop_path_rate
self.use_mean_pooling = use_mean_pooling
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert BEiT checkpoints from the unilm repository."""
import argparse
import json
from pathlib import Path
import torch
from PIL import Image
import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import BeitConfig, BeitFeatureExtractor, BeitForImageClassification, BeitForMaskedImageModeling
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
# here we list all keys to be renamed (original name on the left, our name on the right)
def create_rename_keys(config, has_lm_head=False):
rename_keys = []
for i in range(config.num_hidden_layers):
# encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
rename_keys.append((f"blocks.{i}.norm1.weight", f"beit.encoder.layer.{i}.layernorm_before.weight"))
rename_keys.append((f"blocks.{i}.norm1.bias", f"beit.encoder.layer.{i}.layernorm_before.bias"))
rename_keys.append((f"blocks.{i}.attn.proj.weight", f"beit.encoder.layer.{i}.attention.output.dense.weight"))
rename_keys.append((f"blocks.{i}.attn.proj.bias", f"beit.encoder.layer.{i}.attention.output.dense.bias"))
rename_keys.append((f"blocks.{i}.norm2.weight", f"beit.encoder.layer.{i}.layernorm_after.weight"))
rename_keys.append((f"blocks.{i}.norm2.bias", f"beit.encoder.layer.{i}.layernorm_after.bias"))
rename_keys.append((f"blocks.{i}.mlp.fc1.weight", f"beit.encoder.layer.{i}.intermediate.dense.weight"))
rename_keys.append((f"blocks.{i}.mlp.fc1.bias", f"beit.encoder.layer.{i}.intermediate.dense.bias"))
rename_keys.append((f"blocks.{i}.mlp.fc2.weight", f"beit.encoder.layer.{i}.output.dense.weight"))
rename_keys.append((f"blocks.{i}.mlp.fc2.bias", f"beit.encoder.layer.{i}.output.dense.bias"))
# projection layer + position embeddings
rename_keys.extend(
[
("cls_token", "beit.embeddings.cls_token"),
("patch_embed.proj.weight", "beit.embeddings.patch_embeddings.projection.weight"),
("patch_embed.proj.bias", "beit.embeddings.patch_embeddings.projection.bias"),
]
)
if has_lm_head:
# mask token + shared relative position bias + layernorm
rename_keys.extend(
[
("mask_token", "beit.embeddings.mask_token"),
(
"rel_pos_bias.relative_position_bias_table",
"beit.encoder.relative_position_bias.relative_position_bias_table",
),
(
"rel_pos_bias.relative_position_index",
"beit.encoder.relative_position_bias.relative_position_index",
),
("norm.weight", "layernorm.weight"),
("norm.bias", "layernorm.bias"),
]
)
else:
# layernorm + classification head
rename_keys.extend(
[
("fc_norm.weight", "beit.pooler.layernorm.weight"),
("fc_norm.bias", "beit.pooler.layernorm.bias"),
("head.weight", "classifier.weight"),
("head.bias", "classifier.bias"),
]
)
return rename_keys
# we split up the matrix of each encoder layer into queries, keys and values
def read_in_q_k_v(state_dict, config, has_lm_head=False):
for i in range(config.num_hidden_layers):
prefix = "beit."
# queries, keys and values
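# note: the original checkpoints store no key bias (the key projection is
# bias-free), so only q_bias and v_bias are read from the state dict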
in_proj_weight = state_dict.pop(f"blocks.{i}.attn.qkv.weight")
q_bias = state_dict.pop(f"blocks.{i}.attn.q_bias")
v_bias = state_dict.pop(f"blocks.{i}.attn.v_bias")
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[
: config.hidden_size, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.bias"] = q_bias
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
config.hidden_size : config.hidden_size * 2, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[
-config.hidden_size :, :
]
state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.bias"] = v_bias
# gamma_1 and gamma_2
# we call them lambda because otherwise they are renamed when using .from_pretrained
gamma_1 = state_dict.pop(f"blocks.{i}.gamma_1")
gamma_2 = state_dict.pop(f"blocks.{i}.gamma_2")
state_dict[f"{prefix}encoder.layer.{i}.lambda_1"] = gamma_1
state_dict[f"{prefix}encoder.layer.{i}.lambda_2"] = gamma_2
# relative_position bias table + index
if not has_lm_head:
# each layer has its own relative position bias
table = state_dict.pop(f"blocks.{i}.attn.relative_position_bias_table")
index = state_dict.pop(f"blocks.{i}.attn.relative_position_index")
state_dict[
f"{prefix}encoder.layer.{i}.attention.attention.relative_position_bias.relative_position_bias_table"
] = table
state_dict[
f"{prefix}encoder.layer.{i}.attention.attention.relative_position_bias.relative_position_index"
] = index
def rename_key(dct, old, new):
val = dct.pop(old)
dct[new] = val
# We will verify our results on an image of cute cats
def prepare_img():
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
im = Image.open(requests.get(url, stream=True).raw)
return im
@torch.no_grad()
def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path):
"""
Copy/paste/tweak model's weights to our BEiT structure.
"""
# define default BEiT configuration
config = BeitConfig()
has_lm_head = False
repo_id = "datasets/huggingface/label-files"
# set config parameters based on URL
if checkpoint_url[-9:-4] == "pt22k":
# masked image modeling
config.use_shared_relative_position_bias = True
config.use_mask_token = True
has_lm_head = True
elif checkpoint_url[-9:-4] == "ft22k":
# intermediate fine-tuning on ImageNet-22k
config.use_relative_position_bias = True
config.num_labels = 21841
filename = "imagenet-22k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
# this dataset contains 21843 labels but the model only has 21841
# we delete the classes as mentioned in https://github.com/google-research/big_transfer/issues/18
del id2label[9205]
del id2label[15027]
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
elif checkpoint_url[-8:-4] == "to1k":
# fine-tuning on ImageNet-1k
config.use_relative_position_bias = True
config.num_labels = 1000
filename = "imagenet-1k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
if "384" in checkpoint_url:
config.image_size = 384
if "512" in checkpoint_url:
config.image_size = 512
else:
raise ValueError("Checkpoint not supported, URL should either end with 'pt22k', 'ft22k' or 'to1k'")
# size of the architecture
if "base" in checkpoint_url:
pass
elif "large" in checkpoint_url:
config.hidden_size = 1024
config.intermediate_size = 4096
config.num_hidden_layers = 24
config.num_attention_heads = 16
else:
raise ValueError("Should either find 'base' or 'large' in checkpoint URL")
# load state_dict of original model, remove and rename some keys
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu", check_hash=True)["model"]
rename_keys = create_rename_keys(config, has_lm_head=has_lm_head)
for src, dest in rename_keys:
rename_key(state_dict, src, dest)
read_in_q_k_v(state_dict, config, has_lm_head=has_lm_head)
# load HuggingFace model
if checkpoint_url[-9:-4] == "pt22k":
model = BeitForMaskedImageModeling(config)
else:
model = BeitForImageClassification(config)
model.eval()
model.load_state_dict(state_dict)
# Check outputs on an image
feature_extractor = BeitFeatureExtractor(size=config.image_size, resample=Image.BILINEAR, do_center_crop=False)
encoding = feature_extractor(images=prepare_img(), return_tensors="pt")
pixel_values = encoding["pixel_values"]
outputs = model(pixel_values)
logits = outputs.logits
# verify logits
expected_shape = torch.Size([1, 1000])
if checkpoint_url[:-4].endswith("beit_base_patch16_224_pt22k"):
expected_shape = torch.Size([1, 196, 8192])
elif checkpoint_url[:-4].endswith("beit_large_patch16_224_pt22k"):
expected_shape = torch.Size([1, 196, 8192])
elif checkpoint_url[:-4].endswith("beit_base_patch16_224_pt22k_ft22k"):
expected_shape = torch.Size([1, 21841])
expected_logits = torch.tensor([2.2288, 2.4671, 0.7395])
expected_class_idx = 2397
elif checkpoint_url[:-4].endswith("beit_large_patch16_224_pt22k_ft22k"):
expected_shape = torch.Size([1, 21841])
expected_logits = torch.tensor([1.6881, -0.2787, 0.5901])
expected_class_idx = 2396
elif checkpoint_url[:-4].endswith("beit_base_patch16_224_pt22k_ft1k"):
expected_logits = torch.tensor([0.1241, 0.0798, -0.6569])
expected_class_idx = 285
elif checkpoint_url[:-4].endswith("beit_base_patch16_224_pt22k_ft22kto1k"):
expected_logits = torch.tensor([-1.2385, -1.0987, -1.0108])
expected_class_idx = 281
elif checkpoint_url[:-4].endswith("beit_base_patch16_384_pt22k_ft22kto1k"):
expected_logits = torch.tensor([-1.5303, -0.9484, -0.3147])
expected_class_idx = 761
elif checkpoint_url[:-4].endswith("beit_large_patch16_224_pt22k_ft1k"):
expected_logits = torch.tensor([0.4610, -0.0928, 0.2086])
expected_class_idx = 761
elif checkpoint_url[:-4].endswith("beit_large_patch16_224_pt22k_ft22kto1k"):
expected_logits = torch.tensor([-0.4804, 0.6257, -0.1837])
expected_class_idx = 761
elif checkpoint_url[:-4].endswith("beit_large_patch16_384_pt22k_ft22kto1k"):
expected_logits = torch.tensor([[-0.5122, 0.5117, -0.2113]])
expected_class_idx = 761
elif checkpoint_url[:-4].endswith("beit_large_patch16_512_pt22k_ft22kto1k"):
expected_logits = torch.tensor([-0.3062, 0.7261, 0.4852])
expected_class_idx = 761
else:
raise ValueError("Can't verify logits as model is not supported")
assert logits.shape == expected_shape, "Shape of logits not as expected"
print("Shape of logits:", logits.shape)
if not has_lm_head:
print("Predicted class idx:", logits.argmax(-1).item())
assert torch.allclose(logits[0, :3], expected_logits, atol=1e-3), "First elements of logits not as expected"
assert logits.argmax(-1).item() == expected_class_idx, "Predicted class index not as expected"
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
print(f"Saving model to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
print(f"Saving feature extractor to {pytorch_dump_folder_path}")
feature_extractor.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--checkpoint_url",
default="https://unilm.blob.core.windows.net/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth",
type=str,
help="URL to the original PyTorch checkpoint (.pth file).",
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
)
args = parser.parse_args()
convert_beit_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path)
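The script can then be invoked as follows (a sketch: the dump folder name is arbitrary, the checkpoint URL is the default defined above, and the script file is assumed to be saved as convert_beit_unilm_to_pytorch.py):

python convert_beit_unilm_to_pytorch.py \
    --checkpoint_url https://unilm.blob.core.windows.net/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth \
    --pytorch_dump_folder_path ./beit_converted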
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for BEiT."""
from typing import List, Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType
from ...image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD, ImageFeatureExtractionMixin, is_torch_tensor
from ...utils import logging
logger = logging.get_logger(__name__)
class BeitFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a BEiT feature extractor.
This feature extractor inherits from :class:`~transformers.FeatureExtractionMixin` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
do_resize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to resize the input to a certain :obj:`size`.
size (:obj:`int` or :obj:`Tuple(int)`, `optional`, defaults to 256):
Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an
integer is provided, then the input will be resized to (size, size). Only has an effect if :obj:`do_resize`
is set to :obj:`True`.
resample (:obj:`int`, `optional`, defaults to :obj:`PIL.Image.BICUBIC`):
An optional resampling filter. This can be one of :obj:`PIL.Image.NEAREST`, :obj:`PIL.Image.BOX`,
:obj:`PIL.Image.BILINEAR`, :obj:`PIL.Image.HAMMING`, :obj:`PIL.Image.BICUBIC` or :obj:`PIL.Image.LANCZOS`.
Only has an effect if :obj:`do_resize` is set to :obj:`True`.
do_center_crop (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to crop the input at the center. If the input size is smaller than :obj:`crop_size` along any edge,
the image is padded with 0's and then center cropped.
crop_size (:obj:`int`, `optional`, defaults to 224):
Desired output size when applying center-cropping. Only has an effect if :obj:`do_center_crop` is set to
:obj:`True`.
do_normalize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to normalize the input with :obj:`image_mean` and :obj:`image_std`.
image_mean (:obj:`List[float]`, `optional`, defaults to :obj:`[0.5, 0.5, 0.5]`):
The sequence of means for each channel, to be used when normalizing images.
image_std (:obj:`List[float]`, `optional`, defaults to :obj:`[0.5, 0.5, 0.5]`):
The sequence of standard deviations for each channel, to be used when normalizing images.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize=True,
size=256,
resample=Image.BICUBIC,
do_center_crop=True,
crop_size=224,
do_normalize=True,
image_mean=None,
image_std=None,
**kwargs
):
super().__init__(**kwargs)
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_center_crop = do_center_crop
self.crop_size = crop_size
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
def __call__(
self,
images: Union[
Image.Image, np.ndarray, "torch.Tensor", List[Image.Image], List[np.ndarray], List["torch.Tensor"] # noqa
],
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
.. warning::
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass
PIL images.
Args:
images (:obj:`PIL.Image.Image`, :obj:`np.ndarray`, :obj:`torch.Tensor`, :obj:`List[PIL.Image.Image]`, :obj:`List[np.ndarray]`, :obj:`List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`):
If set, will return tensors of a particular framework. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return NumPy :obj:`np.ndarray` objects.
* :obj:`'jax'`: Return JAX :obj:`jnp.ndarray` objects.
Returns:
:class:`~transformers.BatchFeature`: A :class:`~transformers.BatchFeature` with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example),"
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (resizing + center cropping + normalization)
if self.do_resize and self.size is not None and self.resample is not None:
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
if self.do_center_crop and self.crop_size is not None:
images = [self.center_crop(image, self.crop_size) for image in images]
if self.do_normalize:
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
# return as BatchFeature
data = {"pixel_values": images}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs
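A minimal sketch of using the feature extractor on its own (the image file name is hypothetical; any RGB PIL image works):

# resize to 256x256, center crop to 224x224, then normalize with mean/std 0.5
from PIL import Image
from transformers import BeitFeatureExtractor

image = Image.open("cats.jpg")  # hypothetical local file
feature_extractor = BeitFeatureExtractor(size=256, crop_size=224)
encoding = feature_extractor(images=image, return_tensors="pt")
print(encoding["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])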
...@@ -16,6 +16,7 @@
import argparse
import json
from pathlib import Path
import torch
...@@ -23,9 +24,9 @@ from PIL import Image
import requests
import timm
from huggingface_hub import cached_download, hf_hub_url
from transformers import DeiTConfig, DeiTFeatureExtractor, DeiTForImageClassificationWithTeacher
from transformers.utils import logging
from transformers.utils.imagenet_classes import id2label
logging.set_verbosity_info()
...@@ -139,6 +140,10 @@ def convert_deit_checkpoint(deit_name, pytorch_dump_folder_path):
base_model = False
# dataset (fine-tuned on ImageNet 2012), patch_size and image_size
config.num_labels = 1000
repo_id = "datasets/huggingface/label-files"
filename = "imagenet-1k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
config.patch_size = int(deit_name[-6:-4])
...
...@@ -16,6 +16,7 @@
import argparse
import json
from collections import OrderedDict
from pathlib import Path
...@@ -23,9 +24,9 @@ import torch
from PIL import Image
import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import DetrConfig, DetrFeatureExtractor, DetrForObjectDetection, DetrForSegmentation
from transformers.utils import logging
from transformers.utils.coco_classes import id2label
logging.set_verbosity_info()
...@@ -193,6 +194,10 @@ def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
config.num_labels = 250
else:
config.num_labels = 91
repo_id = "datasets/huggingface/label-files"
filename = "coco-detection-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
...
...@@ -16,6 +16,7 @@
import argparse
import json
from pathlib import Path
import torch
...@@ -23,9 +24,9 @@ from PIL import Image
import requests
import timm
from huggingface_hub import cached_download, hf_hub_url
from transformers import DeiTFeatureExtractor, ViTConfig, ViTFeatureExtractor, ViTForImageClassification, ViTModel
from transformers.utils import logging
from transformers.utils.imagenet_classes import id2label
logging.set_verbosity_info()
...@@ -146,6 +147,10 @@ def convert_vit_checkpoint(vit_name, pytorch_dump_folder_path):
config.image_size = int(vit_name[-9:-6])
else:
config.num_labels = 1000
repo_id = "datasets/huggingface/label-files"
filename = "imagenet-1k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
config.patch_size = int(vit_name[-6:-4])
...
...@@ -21,7 +21,7 @@ from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType
from ...image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD, ImageFeatureExtractionMixin, is_torch_tensor
from ...utils import logging
...@@ -71,8 +71,8 @@ class ViTFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
self.size = size
self.resample = resample
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
def __call__(
self,
...
# COCO object detection id's to class names
id2label = {
0: "N/A",
1: "person",
2: "bicycle",
3: "car",
4: "motorcycle",
5: "airplane",
6: "bus",
7: "train",
8: "truck",
9: "boat",
10: "traffic light",
11: "fire hydrant",
12: "N/A",
13: "stop sign",
14: "parking meter",
15: "bench",
16: "bird",
17: "cat",
18: "dog",
19: "horse",
20: "sheep",
21: "cow",
22: "elephant",
23: "bear",
24: "zebra",
25: "giraffe",
26: "N/A",
27: "backpack",
28: "umbrella",
29: "N/A",
30: "N/A",
31: "handbag",
32: "tie",
33: "suitcase",
34: "frisbee",
35: "skis",
36: "snowboard",
37: "sports ball",
38: "kite",
39: "baseball bat",
40: "baseball glove",
41: "skateboard",
42: "surfboard",
43: "tennis racket",
44: "bottle",
45: "N/A",
46: "wine glass",
47: "cup",
48: "fork",
49: "knife",
50: "spoon",
51: "bowl",
52: "banana",
53: "apple",
54: "sandwich",
55: "orange",
56: "broccoli",
57: "carrot",
58: "hot dog",
59: "pizza",
60: "donut",
61: "cake",
62: "chair",
63: "couch",
64: "potted plant",
65: "bed",
66: "N/A",
67: "dining table",
68: "N/A",
69: "N/A",
70: "toilet",
71: "N/A",
72: "tv",
73: "laptop",
74: "mouse",
75: "remote",
76: "keyboard",
77: "cell phone",
78: "microwave",
79: "oven",
80: "toaster",
81: "sink",
82: "refrigerator",
83: "N/A",
84: "book",
85: "clock",
86: "vase",
87: "scissors",
88: "teddy bear",
89: "hair drier",
90: "toothbrush",
}
...@@ -588,6 +588,41 @@ class PretrainedBartModel:
requires_backends(cls, ["torch"])
BEIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
class BeitForImageClassification:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class BeitForMaskedImageModeling:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class BeitModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
class BeitPreTrainedModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
BERT_PRETRAINED_MODEL_ARCHIVE_LIST = None
...