Migrate doc files to Markdown. (#24376)

* Rename index.mdx to index.md * With saved modifs * Address review comment * Treat all files * .mdx -> .md * Remove special char * Update utils/tests_fetcher.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> --------- Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

Migrate doc files to Markdown. (#24376)
* Rename index.mdx to index.md * With saved modifs * Address review comment * Treat all files * .mdx -> .md * Remove special char * Update utils/tests_fetcher.py Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr> --------- Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
eb849f66 · Sylvain Gugger · GitHub · b0513b01 · eb849f66 · b0513b01
Unverified Commit eb849f66 authored Jun 20, 2023 by Sylvain Gugger Committed by GitHub Jun 20, 2023
20 changed files
--- a/docs/source/en/model_doc/blip.md
+++ b/docs/source/en/model_doc/blip.md
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# BLIP
+## Overview
+The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
+BLIP is a model that is able to perform various multi-modal tasks including
+- Visual Question Answering 
+- Image-Text retrieval (Image-text matching)
+- Image Captioning
+The abstract from the paper is the following:
+*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. 
+However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.*
+![BLIP.gif](https://s3.amazonaws.com/moonup/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)
+This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
+The original code can be found [here](https://github.com/salesforce/BLIP).
+## Resources
+- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset
+## BlipConfig
+[[autodoc]] BlipConfig
+    - from_text_vision_configs
+## BlipTextConfig
+[[autodoc]] BlipTextConfig
+## BlipVisionConfig
+[[autodoc]] BlipVisionConfig
+## BlipProcessor
+[[autodoc]] BlipProcessor
+## BlipImageProcessor
+[[autodoc]] BlipImageProcessor
+    - preprocess
+## BlipModel
+[[autodoc]] BlipModel
+    - forward
+    - get_text_features
+    - get_image_features
+## BlipTextModel
+[[autodoc]] BlipTextModel
+    - forward
+## BlipVisionModel
+[[autodoc]] BlipVisionModel
+    - forward
+## BlipForConditionalGeneration
+[[autodoc]] BlipForConditionalGeneration
+    - forward
+## BlipForImageTextRetrieval
+[[autodoc]] BlipForImageTextRetrieval
+    - forward
+## BlipForQuestionAnswering
+[[autodoc]] BlipForQuestionAnswering
+    - forward
+## TFBlipModel
+[[autodoc]] TFBlipModel
+    - call
+    - get_text_features
+    - get_image_features
+## TFBlipTextModel
+[[autodoc]] TFBlipTextModel
+    - call
+## TFBlipVisionModel
+[[autodoc]] TFBlipVisionModel
+    - call
+## TFBlipForConditionalGeneration
+[[autodoc]] TFBlipForConditionalGeneration
+    - call
+## TFBlipForImageTextRetrieval
+[[autodoc]] TFBlipForImageTextRetrieval
+    - call
+## TFBlipForQuestionAnswering
+[[autodoc]] TFBlipForQuestionAnswering
+    - call
\ No newline at end of file
--- a/docs/source/en/model_doc/blip.mdx
+++ b/docs/source/en/model_doc/blip.mdx
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# BLIP
-## Overview
-The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
-BLIP is a model that is able to perform various multi-modal tasks including
- Visual Question Answering 
- Image-Text retrieval (Image-text matching)
- Image Captioning
-The abstract from the paper is the following:
-*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. 
-However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.*
-![BLIP.gif](https://s3.amazonaws.com/moonup/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif)
-This model was contributed by [ybelkada](https://huggingface.co/ybelkada).
-The original code can be found [here](https://github.com/salesforce/BLIP).
-## Resources
- [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset
-## BlipConfig
-[[autodoc]] BlipConfig
-    - from_text_vision_configs
-## BlipTextConfig
-[[autodoc]] BlipTextConfig
-## BlipVisionConfig
-[[autodoc]] BlipVisionConfig
-## BlipProcessor
-[[autodoc]] BlipProcessor
-## BlipImageProcessor
-[[autodoc]] BlipImageProcessor
-    - preprocess
-## BlipModel
-[[autodoc]] BlipModel
-    - forward
-    - get_text_features
-    - get_image_features
-## BlipTextModel
-[[autodoc]] BlipTextModel
-    - forward
-## BlipVisionModel
-[[autodoc]] BlipVisionModel
-    - forward
-## BlipForConditionalGeneration
-[[autodoc]] BlipForConditionalGeneration
-    - forward
-## BlipForImageTextRetrieval
-[[autodoc]] BlipForImageTextRetrieval
-    - forward
-## BlipForQuestionAnswering
-[[autodoc]] BlipForQuestionAnswering
-    - forward
-## TFBlipModel
-[[autodoc]] TFBlipModel
-    - call
-    - get_text_features
-    - get_image_features
-## TFBlipTextModel
-[[autodoc]] TFBlipTextModel
-    - call
-## TFBlipVisionModel
-[[autodoc]] TFBlipVisionModel
-    - call
-## TFBlipForConditionalGeneration
-[[autodoc]] TFBlipForConditionalGeneration
-    - call
-## TFBlipForImageTextRetrieval
-[[autodoc]] TFBlipForImageTextRetrieval
-    - call
-## TFBlipForQuestionAnswering
-[[autodoc]] TFBlipForQuestionAnswering
-    - call
\ No newline at end of file
--- a/docs/source/en/model_doc/bloom.md
+++ b/docs/source/en/model_doc/bloom.md
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# BLOOM
+## Overview
+The BLOOM model has been proposed with its various versions through the [BigScience Workshop](https://bigscience.huggingface.co/). BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact.
+The architecture of BLOOM is essentially similar to GPT3 (auto-regressive model for next token prediction), but has been trained on 46 different languages and 13 programming languages.
+Several smaller versions of the models have been trained on the same dataset. BLOOM is available in the following versions:
+- [bloom-560m](https://huggingface.co/bigscience/bloom-560m)
+- [bloom-1b1](https://huggingface.co/bigscience/bloom-1b1)
+- [bloom-1b7](https://huggingface.co/bigscience/bloom-1b7)
+- [bloom-3b](https://huggingface.co/bigscience/bloom-3b)
+- [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1)
+- [bloom](https://huggingface.co/bigscience/bloom) (176B parameters)
+## Resources
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLOOM. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+<PipelineTag pipeline="text-generation"/>
+- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
+See also:
+- [Causal language modeling task guide](../tasks/language_modeling)
+- [Text classification task guide](../tasks/sequence_classification)
+- [Token classification task guide](../tasks/token_classification)
+- [Question answering task guide](../tasks/question_answering)
+⚡️ Inference
+- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
+- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).
+⚙️ Training
+- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).
+## BloomConfig
+[[autodoc]] BloomConfig
+    - all
+## BloomModel
+[[autodoc]] BloomModel
+    - forward
+## BloomTokenizerFast
+[[autodoc]] BloomTokenizerFast
+    - all
+## BloomForCausalLM
+[[autodoc]] BloomForCausalLM
+    - forward
+## BloomForSequenceClassification
+[[autodoc]] BloomForSequenceClassification
+    - forward
+## BloomForTokenClassification
+[[autodoc]] BloomForTokenClassification
+    - forward
+## BloomForQuestionAnswering
+[[autodoc]] BloomForQuestionAnswering
+    - forward
--- a/docs/source/en/model_doc/bloom.mdx
+++ b/docs/source/en/model_doc/bloom.mdx
-<!--Copyright 2022 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# BLOOM
-## Overview
-The BLOOM model has been proposed with its various versions through the [BigScience Workshop](https://bigscience.huggingface.co/). BigScience is inspired by other open science initiatives where researchers have pooled their time and resources to collectively achieve a higher impact.
-The architecture of BLOOM is essentially similar to GPT3 (auto-regressive model for next token prediction), but has been trained on 46 different languages and 13 programming languages.
-Several smaller versions of the models have been trained on the same dataset. BLOOM is available in the following versions:
- [bloom-560m](https://huggingface.co/bigscience/bloom-560m)
- [bloom-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [bloom-1b7](https://huggingface.co/bigscience/bloom-1b7)
- [bloom-3b](https://huggingface.co/bigscience/bloom-3b)
- [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1)
- [bloom](https://huggingface.co/bigscience/bloom) (176B parameters)
-## Resources
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLOOM. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-<PipelineTag pipeline="text-generation"/>
- [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
-See also:
- [Causal language modeling task guide](../tasks/language_modeling)
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
-⚡️ Inference
- A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization).
- A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts).
-⚙️ Training
- A blog on [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed).
-## BloomConfig
-[[autodoc]] BloomConfig
-    - all
-## BloomModel
-[[autodoc]] BloomModel
-    - forward
-## BloomTokenizerFast
-[[autodoc]] BloomTokenizerFast
-    - all
-## BloomForCausalLM
-[[autodoc]] BloomForCausalLM
-    - forward
-## BloomForSequenceClassification
-[[autodoc]] BloomForSequenceClassification
-    - forward
-## BloomForTokenClassification
-[[autodoc]] BloomForTokenClassification
-    - forward
-## BloomForQuestionAnswering
-[[autodoc]] BloomForQuestionAnswering
-    - forward
--- a/docs/source/en/model_doc/bort.md
+++ b/docs/source/en/model_doc/bort.md
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# BORT
+## Overview
+The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
+Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
+authors refer to as "Bort".
+The abstract from the paper is the following:
+*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
+applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
+"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
+original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
+is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
+(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
+hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
+architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
+absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*
+Tips:
+- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
+  model's API as well as usage examples.
+- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples.
+- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology) ,
+  that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
+  algorithm to make BORT fine-tuning work.
+This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
--- a/docs/source/en/model_doc/bort.mdx
+++ b/docs/source/en/model_doc/bort.mdx
-<!--Copyright 2020 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# BORT
-## Overview
-The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
-Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
-authors refer to as "Bort".
-The abstract from the paper is the following:
-*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
-applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
-"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
-original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
-is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
-(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
-hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
-architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
-absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*
-Tips:
- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
-  model's API as well as usage examples.
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples.
- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology) ,
-  that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
-  algorithm to make BORT fine-tuning work.
-This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
--- a/docs/source/en/model_doc/bridgetower.md
+++ b/docs/source/en/model_doc/bridgetower.md
+<!--Copyright 2023 The Intel Labs Team Authors, The Microsoft Research Team Authors and HuggingFace Inc. team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# BridgeTower
+## Overview
+The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. The goal of this model is to build a
+bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder thus achieving remarkable performance on various downstream tasks with almost negligible additional performance and computational costs.
+This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference. 
+The abstract from the paper is the following:
+*Vision-Language (VL) models with the TWO-TOWER architecture have dominated visual-language representation learning in recent years.
+Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder.
+Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BRIDGETOWER, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the crossmodal encoder.
+This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various downstream vision-language tasks.
+In particular, on the VQAv2 test-std set, BRIDGETOWER achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs.
+Notably, when further scaling the model, BRIDGETOWER achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.*
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/bridgetower_architecture%20.jpg"
+alt="drawing" width="600"/>
+<small> BridgeTower architecture. Taken from the <a href="https://arxiv.org/abs/2206.08657">original paper.</a> </small>
+## Usage
+BridgeTower consists of a visual encoder, a textual encoder and cross-modal encoder with multiple lightweight bridge layers.
+The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder.
+In principle, one can apply any visual, textual or cross-modal encoder in the proposed architecture.
+The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImageProcessor`] into a single instance to both
+encode the text and prepare the images respectively.
+The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`].
+```python
+>>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
+>>> import requests
+>>> from PIL import Image
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
+>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
+>>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
+>>> # forward pass
+>>> scores = dict()
+>>> for text in texts:
+...     # prepare inputs
+...     encoding = processor(image, text, return_tensors="pt")
+...     outputs = model(**encoding)
+...     scores[text] = outputs
+```
+The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].
+```python
+>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
+>>> import requests
+>>> from PIL import Image
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
+>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
+>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
+>>> # forward pass
+>>> scores = dict()
+>>> for text in texts:
+...     # prepare inputs
+...     encoding = processor(image, text, return_tensors="pt")
+...     outputs = model(**encoding)
+...     scores[text] = outputs.logits[0, 1].item()
+```
+The following example shows how to run masked language modeling using [`BridgeTowerProcessor`] and [`BridgeTowerForMaskedLM`].
+```python
+>>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
+>>> from PIL import Image
+>>> import requests
+>>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+>>> text = "a <mask> looking out of the window"
+>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
+>>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
+>>> # prepare inputs
+>>> encoding = processor(image, text, return_tensors="pt")
+>>> # forward pass
+>>> outputs = model(**encoding)
+>>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
+>>> print(results)
+.a cat looking out of the window.
+```
+This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower).
+Tips:
+- This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings.
+- Checkpoints for pre-trained [bridgeTower-base](https://huggingface.co/BridgeTower/bridgetower-base) and [bridgetower masked language modeling and image text matching](https://huggingface.co/BridgeTower/bridgetower-base-itm-mlm) are released.
+- Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower's performance on Image Retrieval and other down stream tasks.
+- The PyTorch version of this model is only available in torch 1.10 and higher.
+## BridgeTowerConfig
+[[autodoc]] BridgeTowerConfig
+## BridgeTowerTextConfig
+[[autodoc]] BridgeTowerTextConfig
+## BridgeTowerVisionConfig
+[[autodoc]] BridgeTowerVisionConfig
+## BridgeTowerImageProcessor
+[[autodoc]] BridgeTowerImageProcessor
+    - preprocess
+## BridgeTowerProcessor
+[[autodoc]] BridgeTowerProcessor
+    - __call__
+## BridgeTowerModel
+[[autodoc]] BridgeTowerModel
+    - forward
+## BridgeTowerForContrastiveLearning
+[[autodoc]] BridgeTowerForContrastiveLearning
+    - forward
+## BridgeTowerForMaskedLM
+[[autodoc]] BridgeTowerForMaskedLM
+    - forward
+## BridgeTowerForImageAndTextRetrieval
+[[autodoc]] BridgeTowerForImageAndTextRetrieval
+    - forward
--- a/docs/source/en/model_doc/bridgetower.mdx
+++ b/docs/source/en/model_doc/bridgetower.mdx
-<!--Copyright 2023 The Intel Labs Team Authors, The Microsoft Research Team Authors and HuggingFace Inc. team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# BridgeTower
-## Overview
-The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. The goal of this model is to build a
-bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder thus achieving remarkable performance on various downstream tasks with almost negligible additional performance and computational costs.
-This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference. 
-The abstract from the paper is the following:
-*Vision-Language (VL) models with the TWO-TOWER architecture have dominated visual-language representation learning in recent years.
-Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder.
-Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BRIDGETOWER, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the crossmodal encoder.
-This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various downstream vision-language tasks.
-In particular, on the VQAv2 test-std set, BRIDGETOWER achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs.
-Notably, when further scaling the model, BRIDGETOWER achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.*
-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/bridgetower_architecture%20.jpg"
-alt="drawing" width="600"/>
-<small> BridgeTower architecture. Taken from the <a href="https://arxiv.org/abs/2206.08657">original paper.</a> </small>
-## Usage
-BridgeTower consists of a visual encoder, a textual encoder and cross-modal encoder with multiple lightweight bridge layers.
-The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder.
-In principle, one can apply any visual, textual or cross-modal encoder in the proposed architecture.
-The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImageProcessor`] into a single instance to both
-encode the text and prepare the images respectively.
-The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`].
-```python
->>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
->>> import requests
->>> from PIL import Image
->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
->>> image = Image.open(requests.get(url, stream=True).raw)
->>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
->>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
->>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
->>> # forward pass
->>> scores = dict()
->>> for text in texts:
-...     # prepare inputs
-...     encoding = processor(image, text, return_tensors="pt")
-...     outputs = model(**encoding)
-...     scores[text] = outputs
-```
-The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].
-```python
->>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
->>> import requests
->>> from PIL import Image
->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
->>> image = Image.open(requests.get(url, stream=True).raw)
->>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
->>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
->>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
->>> # forward pass
->>> scores = dict()
->>> for text in texts:
-...     # prepare inputs
-...     encoding = processor(image, text, return_tensors="pt")
-...     outputs = model(**encoding)
-...     scores[text] = outputs.logits[0, 1].item()
-```
-The following example shows how to run masked language modeling using [`BridgeTowerProcessor`] and [`BridgeTowerForMaskedLM`].
-```python
->>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
->>> from PIL import Image
->>> import requests
->>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
->>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
->>> text = "a <mask> looking out of the window"
->>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
->>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
->>> # prepare inputs
->>> encoding = processor(image, text, return_tensors="pt")
->>> # forward pass
->>> outputs = model(**encoding)
->>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
->>> print(results)
-.a cat looking out of the window.
-```
-This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower).
-Tips:
- This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings.
- Checkpoints for pre-trained [bridgeTower-base](https://huggingface.co/BridgeTower/bridgetower-base) and [bridgetower masked language modeling and image text matching](https://huggingface.co/BridgeTower/bridgetower-base-itm-mlm) are released.
- Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower's performance on Image Retrieval and other down stream tasks.
- The PyTorch version of this model is only available in torch 1.10 and higher.
-## BridgeTowerConfig
-[[autodoc]] BridgeTowerConfig
-## BridgeTowerTextConfig
-[[autodoc]] BridgeTowerTextConfig
-## BridgeTowerVisionConfig
-[[autodoc]] BridgeTowerVisionConfig
-## BridgeTowerImageProcessor
-[[autodoc]] BridgeTowerImageProcessor
-    - preprocess
-## BridgeTowerProcessor
-[[autodoc]] BridgeTowerProcessor
-    - __call__
-## BridgeTowerModel
-[[autodoc]] BridgeTowerModel
-    - forward
-## BridgeTowerForContrastiveLearning
-[[autodoc]] BridgeTowerForContrastiveLearning
-    - forward
-## BridgeTowerForMaskedLM
-[[autodoc]] BridgeTowerForMaskedLM
-    - forward
-## BridgeTowerForImageAndTextRetrieval
-[[autodoc]] BridgeTowerForImageAndTextRetrieval
-    - forward
--- a/docs/source/en/model_doc/byt5.md
+++ b/docs/source/en/model_doc/byt5.md
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# ByT5
+## Overview
+The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
+Kale, Adam Roberts, Colin Raffel.
+The abstract from the paper is the following:
+*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
+Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
+the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
+can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
+removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
+sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
+operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
+minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
+training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
+counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
+tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
+pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
+experiments.*
+This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
+found [here](https://github.com/google-research/byt5).
+ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They
+only differ in how inputs should be prepared for the model, see the code examples below.
+Since ByT5 was pre-trained unsupervisedly, there's no real advantage to using a task prefix during single-task
+fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
+### Example
+ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
+```python
+>>> from transformers import T5ForConditionalGeneration
+>>> import torch
+>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
+>>> num_special_tokens = 3
+>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
+>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.
+>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
+>>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens
+>>> loss = model(input_ids, labels=labels).loss
+>>> loss.item()
+2.66
+```
+For batched inference and training it is however recommended to make use of the tokenizer:
+```python
+>>> from transformers import T5ForConditionalGeneration, AutoTokenizer
+>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
+>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
+>>> model_inputs = tokenizer(
+...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
+... )
+>>> labels_dict = tokenizer(
+...     ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
+... )
+>>> labels = labels_dict.input_ids
+>>> loss = model(**model_inputs, labels=labels).loss
+>>> loss.item()
+17.9
+```
+Similar to [T5](t5), ByT5 was trained on the span-mask denoising task. However, 
+since the model works directly on characters, the pretraining task is a bit 
+different. Let's corrupt some characters of the 
+input sentence `"The dog chases a ball in the park."` and ask ByT5 to predict them 
+for us.
+```python
+>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+>>> import torch
+>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
+>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")
+>>> input_ids_prompt = "The dog chases a ball in the park."
+>>> input_ids = tokenizer(input_ids_prompt).input_ids
+>>> # Note that we cannot add "{extra_id_...}" to the string directly
+>>> # as the Byte tokenizer would incorrectly merge the tokens
+>>> # For ByT5, we need to work directly on the character level
+>>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead
+>>> # uses final utf character ids.
+>>> # UTF-8 is represented by 8 bits and ByT5 has 3 special tokens.
+>>> # => There are 2**8+2 = 259 input ids and mask tokens count down from index 258.
+>>> # => mask to "The dog [258]a ball [257]park."
+>>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
+>>> input_ids
+tensor([[ 87, 107, 104,  35, 103, 114, 106,  35, 258,  35, 100,  35, 101, 100, 111, 111, 257,  35, 115, 100, 117, 110,  49,   1]])
+>>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`.
+>>> output_ids = model.generate(input_ids, max_length=100)[0].tolist()
+>>> output_ids
+[0, 258, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118, 257,  35, 108, 113,  35, 119, 107, 104,  35, 103, 108, 118, 102, 114, 256, 108, 113,  35, 119, 107, 104, 35, 115, 100, 117, 110,  49,  35,  87, 107, 104,  35, 103, 114, 106, 35, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118,  35, 100,  35, 101, 100, 111, 111,  35, 108, 113, 255,  35, 108, 113,  35, 119, 107, 104,  35, 115, 100, 117, 110,  49]
+>>> # ^- Note how 258 descends to 257, 256, 255
+>>> # Now we need to split on the sentinel tokens, let's write a short loop for this
+>>> output_ids_list = []
+>>> start_token = 0
+>>> sentinel_token = 258
+>>> while sentinel_token in output_ids:
+...     split_idx = output_ids.index(sentinel_token)
+...     output_ids_list.append(output_ids[start_token:split_idx])
+...     start_token = split_idx
+...     sentinel_token -= 1
+>>> output_ids_list.append(output_ids[start_token:])
+>>> output_string = tokenizer.batch_decode(output_ids_list)
+>>> output_string
+['<pad>', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.']
+```
+## ByT5Tokenizer
+[[autodoc]] ByT5Tokenizer
+See [`ByT5Tokenizer`] for all details.
--- a/docs/source/en/model_doc/byt5.mdx
+++ b/docs/source/en/model_doc/byt5.mdx
-<!--Copyright 2021 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# ByT5
-## Overview
-The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
-Kale, Adam Roberts, Colin Raffel.
-The abstract from the paper is the following:
-*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
-Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
-the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
-can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
-removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
-sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
-operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
-minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
-training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
-counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
-tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
-pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
-experiments.*
-This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
-found [here](https://github.com/google-research/byt5).
-ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They
-only differ in how inputs should be prepared for the model, see the code examples below.
-Since ByT5 was pre-trained unsupervisedly, there's no real advantage to using a task prefix during single-task
-fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
-### Example
-ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
-```python
->>> from transformers import T5ForConditionalGeneration
->>> import torch
->>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
->>> num_special_tokens = 3
->>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
->>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.
->>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
->>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens
->>> loss = model(input_ids, labels=labels).loss
->>> loss.item()
-2.66
-```
-For batched inference and training it is however recommended to make use of the tokenizer:
-```python
->>> from transformers import T5ForConditionalGeneration, AutoTokenizer
->>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
->>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
->>> model_inputs = tokenizer(
-...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
-... )
->>> labels_dict = tokenizer(
-...     ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
-... )
->>> labels = labels_dict.input_ids
->>> loss = model(**model_inputs, labels=labels).loss
->>> loss.item()
-17.9
-```
-Similar to [T5](t5), ByT5 was trained on the span-mask denoising task. However, 
-since the model works directly on characters, the pretraining task is a bit 
-different. Let's corrupt some characters of the 
-input sentence `"The dog chases a ball in the park."` and ask ByT5 to predict them 
-for us.
-```python
->>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
->>> import torch
->>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
->>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")
->>> input_ids_prompt = "The dog chases a ball in the park."
->>> input_ids = tokenizer(input_ids_prompt).input_ids
->>> # Note that we cannot add "{extra_id_...}" to the string directly
->>> # as the Byte tokenizer would incorrectly merge the tokens
->>> # For ByT5, we need to work directly on the character level
->>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead
->>> # uses final utf character ids.
->>> # UTF-8 is represented by 8 bits and ByT5 has 3 special tokens.
->>> # => There are 2**8+2 = 259 input ids and mask tokens count down from index 258.
->>> # => mask to "The dog [258]a ball [257]park."
->>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
->>> input_ids
-tensor([[ 87, 107, 104,  35, 103, 114, 106,  35, 258,  35, 100,  35, 101, 100, 111, 111, 257,  35, 115, 100, 117, 110,  49,   1]])
->>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`.
->>> output_ids = model.generate(input_ids, max_length=100)[0].tolist()
->>> output_ids
-[0, 258, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118, 257,  35, 108, 113,  35, 119, 107, 104,  35, 103, 108, 118, 102, 114, 256, 108, 113,  35, 119, 107, 104, 35, 115, 100, 117, 110,  49,  35,  87, 107, 104,  35, 103, 114, 106, 35, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118,  35, 100,  35, 101, 100, 111, 111,  35, 108, 113, 255,  35, 108, 113,  35, 119, 107, 104,  35, 115, 100, 117, 110,  49]
->>> # ^- Note how 258 descends to 257, 256, 255
->>> # Now we need to split on the sentinel tokens, let's write a short loop for this
->>> output_ids_list = []
->>> start_token = 0
->>> sentinel_token = 258
->>> while sentinel_token in output_ids:
-...     split_idx = output_ids.index(sentinel_token)
-...     output_ids_list.append(output_ids[start_token:split_idx])
-...     start_token = split_idx
-...     sentinel_token -= 1
->>> output_ids_list.append(output_ids[start_token:])
->>> output_string = tokenizer.batch_decode(output_ids_list)
->>> output_string
-['<pad>', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.']
-```
-## ByT5Tokenizer
-[[autodoc]] ByT5Tokenizer
-See [`ByT5Tokenizer`] for all details.
--- a/docs/source/en/model_doc/camembert.md
+++ b/docs/source/en/model_doc/camembert.md
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# CamemBERT
+## Overview
+The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by
+Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
+Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
+trained on 138GB of French text.
+The abstract from the paper is the following:
+*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
+models have either been trained on English data or on the concatenation of data in multiple languages. This makes
+practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
+we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
+performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
+dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
+for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
+downstream applications for French NLP.*
+Tips:
+- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
+  as well as the information relative to the inputs and outputs.
+This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/).
+## Documentation resources
+- [Text classification task guide](../tasks/sequence_classification)
+- [Token classification task guide](../tasks/token_classification)
+- [Question answering task guide](../tasks/question_answering)
+- [Causal language modeling task guide](../tasks/language_modeling)
+- [Masked language modeling task guide](../tasks/masked_language_modeling)
+- [Multiple choice task guide](../tasks/multiple_choice)
+## CamembertConfig
+[[autodoc]] CamembertConfig
+## CamembertTokenizer
+[[autodoc]] CamembertTokenizer
+    - build_inputs_with_special_tokens
+    - get_special_tokens_mask
+    - create_token_type_ids_from_sequences
+    - save_vocabulary
+## CamembertTokenizerFast
+[[autodoc]] CamembertTokenizerFast
+## CamembertModel
+[[autodoc]] CamembertModel
+## CamembertForCausalLM
+[[autodoc]] CamembertForCausalLM
+## CamembertForMaskedLM
+[[autodoc]] CamembertForMaskedLM
+## CamembertForSequenceClassification
+[[autodoc]] CamembertForSequenceClassification
+## CamembertForMultipleChoice
+[[autodoc]] CamembertForMultipleChoice
+## CamembertForTokenClassification
+[[autodoc]] CamembertForTokenClassification
+## CamembertForQuestionAnswering
+[[autodoc]] CamembertForQuestionAnswering
+## TFCamembertModel
+[[autodoc]] TFCamembertModel
+## TFCamembertForCasualLM
+[[autodoc]] TFCamembertForCausalLM
+## TFCamembertForMaskedLM
+[[autodoc]] TFCamembertForMaskedLM
+## TFCamembertForSequenceClassification
+[[autodoc]] TFCamembertForSequenceClassification
+## TFCamembertForMultipleChoice
+[[autodoc]] TFCamembertForMultipleChoice
+## TFCamembertForTokenClassification
+[[autodoc]] TFCamembertForTokenClassification
+## TFCamembertForQuestionAnswering
+[[autodoc]] TFCamembertForQuestionAnswering
--- a/docs/source/en/model_doc/camembert.mdx
+++ b/docs/source/en/model_doc/camembert.mdx
-<!--Copyright 2020 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# CamemBERT
-## Overview
-The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by
-Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
-Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
-trained on 138GB of French text.
-The abstract from the paper is the following:
-*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
-models have either been trained on English data or on the concatenation of data in multiple languages. This makes
-practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
-we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
-performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
-dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
-for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
-downstream applications for French NLP.*
-Tips:
- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
-  as well as the information relative to the inputs and outputs.
-This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/).
-## Documentation resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)
-## CamembertConfig
-[[autodoc]] CamembertConfig
-## CamembertTokenizer
-[[autodoc]] CamembertTokenizer
-    - build_inputs_with_special_tokens
-    - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
-    - save_vocabulary
-## CamembertTokenizerFast
-[[autodoc]] CamembertTokenizerFast
-## CamembertModel
-[[autodoc]] CamembertModel
-## CamembertForCausalLM
-[[autodoc]] CamembertForCausalLM
-## CamembertForMaskedLM
-[[autodoc]] CamembertForMaskedLM
-## CamembertForSequenceClassification
-[[autodoc]] CamembertForSequenceClassification
-## CamembertForMultipleChoice
-[[autodoc]] CamembertForMultipleChoice
-## CamembertForTokenClassification
-[[autodoc]] CamembertForTokenClassification
-## CamembertForQuestionAnswering
-[[autodoc]] CamembertForQuestionAnswering
-## TFCamembertModel
-[[autodoc]] TFCamembertModel
-## TFCamembertForCasualLM
-[[autodoc]] TFCamembertForCausalLM
-## TFCamembertForMaskedLM
-[[autodoc]] TFCamembertForMaskedLM
-## TFCamembertForSequenceClassification
-[[autodoc]] TFCamembertForSequenceClassification
-## TFCamembertForMultipleChoice
-[[autodoc]] TFCamembertForMultipleChoice
-## TFCamembertForTokenClassification
-[[autodoc]] TFCamembertForTokenClassification
-## TFCamembertForQuestionAnswering
-[[autodoc]] TFCamembertForQuestionAnswering
--- a/docs/source/en/model_doc/canine.md
+++ b/docs/source/en/model_doc/canine.md
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# CANINE
+## Overview
+The CANINE model was proposed in [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
+Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
+among the first papers that trains a Transformer without using an explicit tokenization step (such as Byte Pair
+Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at a Unicode character-level.
+Training at a character-level inevitably comes with a longer sequence length, which CANINE solves with an efficient
+downsampling strategy, before applying a deep Transformer encoder.
+The abstract from the paper is the following:
+*Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
+still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
+lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
+languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
+a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
+pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
+To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
+sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
+2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
+Tips:
+- CANINE uses no less than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
+  layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
+  the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
+  after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
+  downsampling can be found in the paper.
+- CANINE uses a max sequence length of 2048 characters by default. One can use [`CanineTokenizer`]
+  to prepare text for the model.
+- Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
+  (which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
+  tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
+  details for this can be found in the paper.
+-  Models:
+  - [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss,
+    12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
+  - [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer,
+    768-hidden, 12-heads, 121M parameters (size ~500 MB).
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
+### Example
+CANINE works on raw characters, so it can be used without a tokenizer:
+```python
+>>> from transformers import CanineModel
+>>> import torch
+>>> model = CanineModel.from_pretrained("google/canine-c")  # model pre-trained with autoregressive character loss
+>>> text = "hello world"
+>>> # use Python's built-in ord() function to turn each character into its unicode code point id
+>>> input_ids = torch.tensor([[ord(char) for char in text]])
+>>> outputs = model(input_ids)  # forward pass
+>>> pooled_output = outputs.pooler_output
+>>> sequence_output = outputs.last_hidden_state
+```
+For batched inference and training, it is however recommended to make use of the tokenizer (to pad/truncate all
+sequences to the same length):
+```python
+>>> from transformers import CanineTokenizer, CanineModel
+>>> model = CanineModel.from_pretrained("google/canine-c")
+>>> tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
+>>> inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
+>>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
+>>> outputs = model(**encoding)  # forward pass
+>>> pooled_output = outputs.pooler_output
+>>> sequence_output = outputs.last_hidden_state
+```
+## Documentation resources
+- [Text classification task guide](../tasks/sequence_classification)
+- [Token classification task guide](../tasks/token_classification)
+- [Question answering task guide](../tasks/question_answering)
+- [Multiple choice task guide](../tasks/multiple_choice)
+## CANINE specific outputs
+[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling
+## CanineConfig
+[[autodoc]] CanineConfig
+## CanineTokenizer
+[[autodoc]] CanineTokenizer
+    - build_inputs_with_special_tokens
+    - get_special_tokens_mask
+    - create_token_type_ids_from_sequences
+## CanineModel
+[[autodoc]] CanineModel
+    - forward
+## CanineForSequenceClassification
+[[autodoc]] CanineForSequenceClassification
+    - forward
+## CanineForMultipleChoice
+[[autodoc]] CanineForMultipleChoice
+    - forward
+## CanineForTokenClassification
+[[autodoc]] CanineForTokenClassification
+    - forward
+## CanineForQuestionAnswering
+[[autodoc]] CanineForQuestionAnswering
+    - forward
--- a/docs/source/en/model_doc/canine.mdx
+++ b/docs/source/en/model_doc/canine.mdx
-<!--Copyright 2021 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# CANINE
-## Overview
-The CANINE model was proposed in [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
-Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
-among the first papers that trains a Transformer without using an explicit tokenization step (such as Byte Pair
-Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at a Unicode character-level.
-Training at a character-level inevitably comes with a longer sequence length, which CANINE solves with an efficient
-downsampling strategy, before applying a deep Transformer encoder.
-The abstract from the paper is the following:
-*Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
-still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
-lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
-languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
-a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
-pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
-To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
-sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
-2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
-Tips:
- CANINE uses no less than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
-  layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
-  the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
-  after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
-  downsampling can be found in the paper.
- CANINE uses a max sequence length of 2048 characters by default. One can use [`CanineTokenizer`]
-  to prepare text for the model.
- Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
-  (which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
-  tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
-  details for this can be found in the paper.
-  Models:
-  - [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss,
-    12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
-  - [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer,
-    768-hidden, 12-heads, 121M parameters (size ~500 MB).
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
-### Example
-CANINE works on raw characters, so it can be used without a tokenizer:
-```python
->>> from transformers import CanineModel
->>> import torch
->>> model = CanineModel.from_pretrained("google/canine-c")  # model pre-trained with autoregressive character loss
->>> text = "hello world"
->>> # use Python's built-in ord() function to turn each character into its unicode code point id
->>> input_ids = torch.tensor([[ord(char) for char in text]])
->>> outputs = model(input_ids)  # forward pass
->>> pooled_output = outputs.pooler_output
->>> sequence_output = outputs.last_hidden_state
-```
-For batched inference and training, it is however recommended to make use of the tokenizer (to pad/truncate all
-sequences to the same length):
-```python
->>> from transformers import CanineTokenizer, CanineModel
->>> model = CanineModel.from_pretrained("google/canine-c")
->>> tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
->>> inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
->>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
->>> outputs = model(**encoding)  # forward pass
->>> pooled_output = outputs.pooler_output
->>> sequence_output = outputs.last_hidden_state
-```
-## Documentation resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Multiple choice task guide](../tasks/multiple_choice)
-## CANINE specific outputs
-[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling
-## CanineConfig
-[[autodoc]] CanineConfig
-## CanineTokenizer
-[[autodoc]] CanineTokenizer
-    - build_inputs_with_special_tokens
-    - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
-## CanineModel
-[[autodoc]] CanineModel
-    - forward
-## CanineForSequenceClassification
-[[autodoc]] CanineForSequenceClassification
-    - forward
-## CanineForMultipleChoice
-[[autodoc]] CanineForMultipleChoice
-    - forward
-## CanineForTokenClassification
-[[autodoc]] CanineForTokenClassification
-    - forward
-## CanineForQuestionAnswering
-[[autodoc]] CanineForQuestionAnswering
-    - forward
--- a/docs/source/en/model_doc/chinese_clip.md
+++ b/docs/source/en/model_doc/chinese_clip.md
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# Chinese-CLIP
+## Overview
+The Chinese-CLIP model was proposed in [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
+Chinese-CLIP is an implementation of CLIP (Radford et al., 2021) on a large-scale dataset of Chinese image-text pairs. It is capable of performing cross-modal retrieval and also playing as a vision backbone for vision tasks like zero-shot image classification, open-domain object detection, etc. The original Chinese-CLIP code is released [at this link](https://github.com/OFA-Sys/Chinese-CLIP).
+The abstract from the paper is the following:
+*The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). Our codes, pretrained models, and demos have been released.*
+## Usage
+The code snippet below shows how to compute image & text features and similarities:
+```python
+>>> from PIL import Image
+>>> import requests
+>>> from transformers import ChineseCLIPProcessor, ChineseCLIPModel
+>>> model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
+>>> processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
+>>> url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> # Squirtle, Bulbasaur, Charmander, Pikachu in English
+>>> texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
+>>> # compute image feature
+>>> inputs = processor(images=image, return_tensors="pt")
+>>> image_features = model.get_image_features(**inputs)
+>>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize
+>>> # compute text features
+>>> inputs = processor(text=texts, padding=True, return_tensors="pt")
+>>> text_features = model.get_text_features(**inputs)
+>>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize
+>>> # compute image-text similarity scores
+>>> inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
+>>> outputs = model(**inputs)
+>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
+>>> probs = logits_per_image.softmax(dim=1)  # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]
+```
+Currently, we release the following scales of pretrained Chinese-CLIP models at HF Model Hub:
+- [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16)
+- [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14)
+- [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px)
+- [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)
+The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys). 
+## ChineseCLIPConfig
+[[autodoc]] ChineseCLIPConfig
+    - from_text_vision_configs
+## ChineseCLIPTextConfig
+[[autodoc]] ChineseCLIPTextConfig
+## ChineseCLIPVisionConfig
+[[autodoc]] ChineseCLIPVisionConfig
+## ChineseCLIPImageProcessor
+[[autodoc]] ChineseCLIPImageProcessor
+    - preprocess
+## ChineseCLIPFeatureExtractor
+[[autodoc]] ChineseCLIPFeatureExtractor
+## ChineseCLIPProcessor
+[[autodoc]] ChineseCLIPProcessor
+## ChineseCLIPModel
+[[autodoc]] ChineseCLIPModel
+    - forward
+    - get_text_features
+    - get_image_features
+## ChineseCLIPTextModel
+[[autodoc]] ChineseCLIPTextModel
+    - forward
+## ChineseCLIPVisionModel
+[[autodoc]] ChineseCLIPVisionModel
+    - forward
\ No newline at end of file
--- a/docs/source/en/model_doc/chinese_clip.mdx
+++ b/docs/source/en/model_doc/chinese_clip.mdx
-<!--Copyright 2022 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# Chinese-CLIP
-## Overview
-The Chinese-CLIP model was proposed in [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
-Chinese-CLIP is an implementation of CLIP (Radford et al., 2021) on a large-scale dataset of Chinese image-text pairs. It is capable of performing cross-modal retrieval and also playing as a vision backbone for vision tasks like zero-shot image classification, open-domain object detection, etc. The original Chinese-CLIP code is released [at this link](https://github.com/OFA-Sys/Chinese-CLIP).
-The abstract from the paper is the following:
-*The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). Our codes, pretrained models, and demos have been released.*
-## Usage
-The code snippet below shows how to compute image & text features and similarities:
-```python
->>> from PIL import Image
->>> import requests
->>> from transformers import ChineseCLIPProcessor, ChineseCLIPModel
->>> model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
->>> processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16")
->>> url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
->>> image = Image.open(requests.get(url, stream=True).raw)
->>> # Squirtle, Bulbasaur, Charmander, Pikachu in English
->>> texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
->>> # compute image feature
->>> inputs = processor(images=image, return_tensors="pt")
->>> image_features = model.get_image_features(**inputs)
->>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize
->>> # compute text features
->>> inputs = processor(text=texts, padding=True, return_tensors="pt")
->>> text_features = model.get_text_features(**inputs)
->>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize
->>> # compute image-text similarity scores
->>> inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
->>> outputs = model(**inputs)
->>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
->>> probs = logits_per_image.softmax(dim=1)  # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]]
-```
-Currently, we release the following scales of pretrained Chinese-CLIP models at HF Model Hub:
- [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16)
- [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14)
- [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px)
- [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14)
-The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys). 
-## ChineseCLIPConfig
-[[autodoc]] ChineseCLIPConfig
-    - from_text_vision_configs
-## ChineseCLIPTextConfig
-[[autodoc]] ChineseCLIPTextConfig
-## ChineseCLIPVisionConfig
-[[autodoc]] ChineseCLIPVisionConfig
-## ChineseCLIPImageProcessor
-[[autodoc]] ChineseCLIPImageProcessor
-    - preprocess
-## ChineseCLIPFeatureExtractor
-[[autodoc]] ChineseCLIPFeatureExtractor
-## ChineseCLIPProcessor
-[[autodoc]] ChineseCLIPProcessor
-## ChineseCLIPModel
-[[autodoc]] ChineseCLIPModel
-    - forward
-    - get_text_features
-    - get_image_features
-## ChineseCLIPTextModel
-[[autodoc]] ChineseCLIPTextModel
-    - forward
-## ChineseCLIPVisionModel
-[[autodoc]] ChineseCLIPVisionModel
-    - forward
\ No newline at end of file
--- a/docs/source/en/model_doc/clap.md
+++ b/docs/source/en/model_doc/clap.md
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# CLAP
+## Overview
+The CLAP model was proposed in [Large Scale Contrastive Language-Audio pretraining with
+feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
+CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score.
+The abstract from the paper is the following:
+*Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zeroshot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-6*
+This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker) .
+The original code can be found [here](https://github.com/LAION-AI/Clap).
+## ClapConfig
+[[autodoc]] ClapConfig
+    - from_text_audio_configs
+## ClapTextConfig
+[[autodoc]] ClapTextConfig
+## ClapAudioConfig
+[[autodoc]] ClapAudioConfig
+## ClapFeatureExtractor
+[[autodoc]] ClapFeatureExtractor
+## ClapProcessor
+[[autodoc]] ClapProcessor
+## ClapModel
+[[autodoc]] ClapModel
+    - forward
+    - get_text_features
+    - get_audio_features
+## ClapTextModel
+[[autodoc]] ClapTextModel
+    - forward
+## ClapTextModelWithProjection
+[[autodoc]] ClapTextModelWithProjection
+    - forward
+## ClapAudioModel
+[[autodoc]] ClapAudioModel
+    - forward
+## ClapAudioModelWithProjection
+[[autodoc]] ClapAudioModelWithProjection
+    - forward
--- a/docs/source/en/model_doc/clap.mdx
+++ b/docs/source/en/model_doc/clap.mdx
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# CLAP
-## Overview
-The CLAP model was proposed in [Large Scale Contrastive Language-Audio pretraining with
-feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
-CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score.
-The abstract from the paper is the following:
-*Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zeroshot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-6*
-This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker) .
-The original code can be found [here](https://github.com/LAION-AI/Clap).
-## ClapConfig
-[[autodoc]] ClapConfig
-    - from_text_audio_configs
-## ClapTextConfig
-[[autodoc]] ClapTextConfig
-## ClapAudioConfig
-[[autodoc]] ClapAudioConfig
-## ClapFeatureExtractor
-[[autodoc]] ClapFeatureExtractor
-## ClapProcessor
-[[autodoc]] ClapProcessor
-## ClapModel
-[[autodoc]] ClapModel
-    - forward
-    - get_text_features
-    - get_audio_features
-## ClapTextModel
-[[autodoc]] ClapTextModel
-    - forward
-## ClapTextModelWithProjection
-[[autodoc]] ClapTextModelWithProjection
-    - forward
-## ClapAudioModel
-[[autodoc]] ClapAudioModel
-    - forward
-## ClapAudioModelWithProjection
-[[autodoc]] ClapAudioModelWithProjection
-    - forward
--- a/docs/source/en/model_doc/clip.md
+++ b/docs/source/en/model_doc/clip.md
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+-->
+# CLIP
+## Overview
+The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
+Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
+(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
+instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
+for the task, similarly to the zero-shot capabilities of GPT-2 and 3.
+The abstract from the paper is the following:
+*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
+restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
+any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
+much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
+with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
+million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
+learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
+the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
+such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
+model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
+for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
+without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
+model weights at this https URL.*
+## Usage
+CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
+classification. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text
+features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
+product between the projected image and text features is then used as a similar score.
+To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
+which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
+also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
+The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model.
+The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps
+[`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both
+encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
+[`CLIPProcessor`] and [`CLIPModel`].
+```python
+>>> from PIL import Image
+>>> import requests
+>>> from transformers import CLIPProcessor, CLIPModel
+>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
+>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
+>>> outputs = model(**inputs)
+>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
+>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
+```
+This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
+## Resources
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.
+- A blog post on [How to fine-tune CLIP on 10,000 image-text pairs](https://huggingface.co/blog/fine-tune-clip-rsicd).
+- CLIP is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).
+If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
+The resource should ideally demonstrate something new instead of duplicating an existing resource.
+## CLIPConfig
+[[autodoc]] CLIPConfig
+    - from_text_vision_configs
+## CLIPTextConfig
+[[autodoc]] CLIPTextConfig
+## CLIPVisionConfig
+[[autodoc]] CLIPVisionConfig
+## CLIPTokenizer
+[[autodoc]] CLIPTokenizer
+    - build_inputs_with_special_tokens
+    - get_special_tokens_mask
+    - create_token_type_ids_from_sequences
+    - save_vocabulary
+## CLIPTokenizerFast
+[[autodoc]] CLIPTokenizerFast
+## CLIPImageProcessor
+[[autodoc]] CLIPImageProcessor
+    - preprocess
+## CLIPFeatureExtractor
+[[autodoc]] CLIPFeatureExtractor
+## CLIPProcessor
+[[autodoc]] CLIPProcessor
+## CLIPModel
+[[autodoc]] CLIPModel
+    - forward
+    - get_text_features
+    - get_image_features
+## CLIPTextModel
+[[autodoc]] CLIPTextModel
+    - forward
+## CLIPTextModelWithProjection
+[[autodoc]] CLIPTextModelWithProjection
+    - forward
+## CLIPVisionModelWithProjection
+[[autodoc]] CLIPVisionModelWithProjection
+    - forward
+## CLIPVisionModel
+[[autodoc]] CLIPVisionModel
+    - forward
+## TFCLIPModel
+[[autodoc]] TFCLIPModel
+    - call
+    - get_text_features
+    - get_image_features
+## TFCLIPTextModel
+[[autodoc]] TFCLIPTextModel
+    - call
+## TFCLIPVisionModel
+[[autodoc]] TFCLIPVisionModel
+    - call
+## FlaxCLIPModel
+[[autodoc]] FlaxCLIPModel
+    - __call__
+    - get_text_features
+    - get_image_features
+## FlaxCLIPTextModel
+[[autodoc]] FlaxCLIPTextModel
+    - __call__
+## FlaxCLIPVisionModel
+[[autodoc]] FlaxCLIPVisionModel
+    - __call__
--- a/docs/source/en/model_doc/clip.mdx
+++ b/docs/source/en/model_doc/clip.mdx
-<!--Copyright 2021 The HuggingFace Team. All rights reserved.
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-# CLIP
-## Overview
-The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
-Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
-(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
-instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
-for the task, similarly to the zero-shot capabilities of GPT-2 and 3.
-The abstract from the paper is the following:
-*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
-restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
-any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
-much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
-with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
-million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
-learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
-the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
-such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
-model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
-for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
-without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
-model weights at this https URL.*
-## Usage
-CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
-classification. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text
-features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
-product between the projected image and text features is then used as a similar score.
-To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
-which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
-also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
-The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model.
-The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps
-[`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both
-encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
-[`CLIPProcessor`] and [`CLIPModel`].
-```python
->>> from PIL import Image
->>> import requests
->>> from transformers import CLIPProcessor, CLIPModel
->>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
->>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
->>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
->>> image = Image.open(requests.get(url, stream=True).raw)
->>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
->>> outputs = model(**inputs)
->>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
->>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
-```
-This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
-## Resources
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.
- A blog post on [How to fine-tune CLIP on 10,000 image-text pairs](https://huggingface.co/blog/fine-tune-clip-rsicd).
- CLIP is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text).
-If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
-The resource should ideally demonstrate something new instead of duplicating an existing resource.
-## CLIPConfig
-[[autodoc]] CLIPConfig
-    - from_text_vision_configs
-## CLIPTextConfig
-[[autodoc]] CLIPTextConfig
-## CLIPVisionConfig
-[[autodoc]] CLIPVisionConfig
-## CLIPTokenizer
-[[autodoc]] CLIPTokenizer
-    - build_inputs_with_special_tokens
-    - get_special_tokens_mask
-    - create_token_type_ids_from_sequences
-    - save_vocabulary
-## CLIPTokenizerFast
-[[autodoc]] CLIPTokenizerFast
-## CLIPImageProcessor
-[[autodoc]] CLIPImageProcessor
-    - preprocess
-## CLIPFeatureExtractor
-[[autodoc]] CLIPFeatureExtractor
-## CLIPProcessor
-[[autodoc]] CLIPProcessor
-## CLIPModel
-[[autodoc]] CLIPModel
-    - forward
-    - get_text_features
-    - get_image_features
-## CLIPTextModel
-[[autodoc]] CLIPTextModel
-    - forward
-## CLIPTextModelWithProjection
-[[autodoc]] CLIPTextModelWithProjection
-    - forward
-## CLIPVisionModelWithProjection
-[[autodoc]] CLIPVisionModelWithProjection
-    - forward
-## CLIPVisionModel
-[[autodoc]] CLIPVisionModel
-    - forward
-## TFCLIPModel
-[[autodoc]] TFCLIPModel
-    - call
-    - get_text_features
-    - get_image_features
-## TFCLIPTextModel
-[[autodoc]] TFCLIPTextModel
-    - call
-## TFCLIPVisionModel
-[[autodoc]] TFCLIPVisionModel
-    - call
-## FlaxCLIPModel
-[[autodoc]] FlaxCLIPModel
-    - __call__
-    - get_text_features
-    - get_image_features
-## FlaxCLIPTextModel
-[[autodoc]] FlaxCLIPTextModel
-    - __call__
-## FlaxCLIPVisionModel
-[[autodoc]] FlaxCLIPVisionModel
-    - __call__