Unverified Commit ec3567fe authored by Lysandre Debut, committed by GitHub

Convert model files from rst to mdx (#14865)



* First pass

* Apply suggestions from code review

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LayoutLM

<a id='Overview'></a>

## Overview
The LayoutLM model was proposed in the paper [LayoutLM: Pre-training of Text and Layout for Document Image
Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
on several downstream tasks:

- form understanding: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a collection of 199 annotated
forms comprising more than 30,000 words).
- receipt understanding: the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for
training and 347 receipts for testing).
- document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of
400,000 images belonging to one of 16 classes).
The abstract from the paper is the following:

*[...] form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) [...]*

Tips:
- In addition to *input_ids*, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which contains
the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python wrapper](https://pypi.org/project/pytesseract/) available; see the sketch after these tips for one way to use it). Each bounding box should be in (x0, y0, x1, y1) format, where
(x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
scale. To normalize, you can use the following function:
```python
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```
Here, `width` and `height` correspond to the width and height of the original document in which the token
occurs. Those can be obtained using the Python Imaging Library (PIL), for example, as follows:

```python
from PIL import Image

image = Image.open("name_of_your_document - can be a png file, pdf, etc.")

width, height = image.size
```

- For a demo which shows how to fine-tune [`LayoutLMForTokenClassification`] on the [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) (a collection of annotated forms), see [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb).
It includes an inference part, which shows how to use Google's Tesseract on a new document.
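
The words and boxes themselves can come from any OCR engine. As a rough, illustrative sketch (not from the original docs), here is one way to obtain them with the pytesseract wrapper mentioned above and normalize them with the `normalize_bbox` function defined earlier; the file name is a placeholder:

```python
import pytesseract
from PIL import Image

image = Image.open("name_of_your_document.png").convert("RGB")  # placeholder file name
width, height = image.size

# Tesseract returns, among other fields, the text and the (left, top, width, height) of each detected word
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

words, boxes = [], []
for text, left, top, w, h in zip(ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]):
    if text.strip():  # skip empty detections
        words.append(text)
        # convert to (x0, y0, x1, y1) and bring onto the 0-1000 scale with the function above
        boxes.append(normalize_bbox([left, top, left + w, top + h], width, height))
```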
This model was contributed by [liminghao1630](https://huggingface.co/liminghao1630). The original code can be found
[here](https://github.com/microsoft/unilm/tree/master/layoutlm).
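
Putting this together, a minimal, illustrative forward pass with `bbox` could look as follows (the checkpoint name is the publicly available `microsoft/layoutlm-base-uncased`, and the boxes are toy values already on the 0-1000 scale):

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMModel

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "world"]
normalized_word_boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]  # toy boxes, already normalized

# give every wordpiece the box of the word it belongs to
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
# add the boxes of the special [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
bbox = torch.tensor([token_boxes])

outputs = model(input_ids=encoding["input_ids"], attention_mask=encoding["attention_mask"], bbox=bbox)
last_hidden_state = outputs.last_hidden_state  # shape (1, sequence_length, hidden_size)
```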
## LayoutLMConfig

[[autodoc]] LayoutLMConfig

## LayoutLMTokenizer

[[autodoc]] LayoutLMTokenizer

## LayoutLMTokenizerFast

[[autodoc]] LayoutLMTokenizerFast

## LayoutLMModel

[[autodoc]] LayoutLMModel

## LayoutLMForMaskedLM

[[autodoc]] LayoutLMForMaskedLM

## LayoutLMForSequenceClassification

[[autodoc]] LayoutLMForSequenceClassification

## LayoutLMForTokenClassification

[[autodoc]] LayoutLMForTokenClassification

## TFLayoutLMModel

[[autodoc]] TFLayoutLMModel

## TFLayoutLMForMaskedLM

[[autodoc]] TFLayoutLMForMaskedLM

## TFLayoutLMForSequenceClassification

[[autodoc]] TFLayoutLMForSequenceClassification

## TFLayoutLMForTokenClassification

[[autodoc]] TFLayoutLMForTokenClassification
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LayoutLMV2

## Overview
The LayoutLMV2 model was proposed in [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves [LayoutLM](layoutlm) to obtain
state-of-the-art results across several document image understanding benchmarks:

- information extraction from scanned documents: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a
collection of 199 annotated forms comprising more than 30,000 words), the [CORD](https://github.com/clovaai/cord)
dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for training and 347 receipts for testing)
and the [Kleister-NDA](https://github.com/applicaai/kleister-nda) dataset (a collection of non-disclosure
agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203
documents for testing).
- document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of
400,000 images belonging to one of 16 classes).
- document visual question answering: the [DocVQA](https://arxiv.org/abs/2007.00398) dataset (a collection of 50,000
questions defined on 12,000+ document images).
The abstract from the paper is the following:

[...]

Tips:
- The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
- LayoutLMv2 adds both a relative 1D attention bias as well as a spatial 2D attention bias to the attention scores in
the self-attention layers. Details can be found on page 5 of the [paper](https://arxiv.org/abs/2012.14740).
- Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found [here](https://github.com/NielsRogge/Transformers-Tutorials).
- LayoutLMv2 uses Facebook AI's [Detectron2](https://github.com/facebookresearch/detectron2/) package for its visual
backbone. See [this link](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) for installation
instructions.
- In addition to `input_ids`, [`~LayoutLMv2Model.forward`] expects 2 additional inputs, namely
`image` and `bbox`. The `image` input corresponds to the original document image in which the text
tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of
document images, `image` should be a tensor of shape (batch_size, 3, 224, 224). This can be either a
`torch.Tensor` or a `Detectron2.structures.ImageList`. You don't need to normalize the channels, as this is
done by the model. Note that the visual backbone expects BGR channels instead of RGB, as all models
in Detectron2 are pre-trained using the BGR format. The `bbox` input contains the bounding boxes (i.e. 2D-positions)
of the input text tokens. This is identical to [`LayoutLMModel`]. These can be obtained using an
external OCR engine such as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python
wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1)
format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1)
represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on
a 0-1000 scale. To normalize, you can use the following function:
```python
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```
Here, `width` and `height` correspond to the width and height of the original document in which the token
occurs (before resizing the image). Those can be obtained using the Python Imaging Library (PIL), for example, as
follows:

```python
from PIL import Image

image = Image.open("name_of_your_document - can be a png file, pdf, etc.")

width, height = image.size
```
However, this model includes a brand new [`~transformers.LayoutLMv2Processor`] which can be used to directly
prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage"
section below.
- Internally, [`~transformers.LayoutLMv2Model`] will send the `image` input through its visual backbone to
obtain a lower-resolution feature map, whose shape is equal to the `image_feature_pool_shape` attribute of
[`~transformers.LayoutLMv2Config`]. This feature map is then flattened to obtain a sequence of image tokens. As
the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text
tokens, and sent through the Transformer encoder. This means that the last hidden states of the model will have a
length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states
will have a sequence length of `seq_length` + `config.image_feature_pool_shape[0]` *
`config.image_feature_pool_shape[1]` (see the sketch after these tips).
- When calling [`~transformers.LayoutLMv2Model.from_pretrained`], a warning will be printed with a long list of
parameter names that are not initialized. This is not a problem, as these parameters are batch normalization
statistics, which will get values when fine-tuning on a custom dataset.
- If you want to train the model in a distributed environment, make sure to call [`synchronize_batch_norm`] on the
model in order to properly synchronize the batch normalization layers of the visual backbone.
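
To make the shapes above concrete, here is a small, illustrative sketch; it assumes detectron2 and PyTesseract are installed, uses a placeholder file name, and relies on the [`LayoutLMv2Processor`] described in the next section:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2Model

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2Model.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("name_of_your_document.png").convert("RGB")  # placeholder file name
# pad the text tokens up to the max length of 512
encoding = processor(image, padding="max_length", max_length=512, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# 512 text tokens + 7 * 7 = 49 image tokens
print(outputs.last_hidden_state.shape)  # torch.Size([1, 561, 768])
```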
In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on
[LayoutXLM's documentation page](layoutxlm).
## Usage: LayoutLMv2Processor
The easiest way to prepare data for the model is to use [`LayoutLMv2Processor`], which internally
combines a feature extractor ([`LayoutLMv2FeatureExtractor`]) and a tokenizer
([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The feature extractor
handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
modality.
```python
from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor

feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr is set to True by default
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
```

In short, one can provide a document image (and possibly additional data) to [`LayoutLMv2Processor`],
and it will create the inputs expected by the model. Internally, the processor first uses
[`LayoutLMv2FeatureExtractor`] to apply OCR on the image to get a list of words and normalized
bounding boxes, as well as to resize the image to a given size in order to get the `image` input. The words and
normalized bounding boxes are then provided to [`LayoutLMv2Tokenizer`] or
[`LayoutLMv2TokenizerFast`], which converts them to token-level `input_ids`,
`attention_mask`, `token_type_ids`, `bbox`. Optionally, one can provide word labels to the processor,
which are turned into token-level `labels`.
[`LayoutLMv2Processor`] uses [PyTesseract](https://pypi.org/project/pytesseract/), a Python
wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
choice, and provide the words and normalized boxes yourself. This requires initializing
[`LayoutLMv2FeatureExtractor`] with `apply_ocr` set to `False`.
In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).
**Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr=True**
This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
the words and normalized bounding boxes.
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
encoding = processor(image, return_tensors="pt")  # you can also add all tokenizer parameters here such as padding, truncation
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
**Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
In case one wants to do OCR themselves, one can initialize the feature extractor with `apply_ocr` set to
`False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
the processor.
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
**Use case 3: token classification (training), apply_ocr=False**
For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word
labels in order to train a model. The processor will then convert these into token-level `labels`. By default, it
will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the
`ignore_index` of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can
initialize the tokenizer with `only_label_first_subword` set to `False`.
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
word_labels = [1, 2]
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
```
**Use case 4: visual question answering (inference), apply_ocr=True**
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the
processor will apply OCR on the image, and create [CLS] question tokens [SEP] word tokens [SEP].
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
question = "What's his name?"
encoding = processor(image, question, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
**Use case 5: visual question answering (inference), apply_ocr=False**
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to
perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
question = "What's his name?"
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
## LayoutLMv2Config

[[autodoc]] LayoutLMv2Config

## LayoutLMv2FeatureExtractor

[[autodoc]] LayoutLMv2FeatureExtractor
- __call__

## LayoutLMv2Tokenizer

[[autodoc]] LayoutLMv2Tokenizer
- __call__
- save_vocabulary

## LayoutLMv2TokenizerFast

[[autodoc]] LayoutLMv2TokenizerFast
- __call__

## LayoutLMv2Processor

[[autodoc]] LayoutLMv2Processor
- __call__

## LayoutLMv2Model

[[autodoc]] LayoutLMv2Model
- forward

## LayoutLMv2ForSequenceClassification

[[autodoc]] LayoutLMv2ForSequenceClassification

## LayoutLMv2ForTokenClassification

[[autodoc]] LayoutLMv2ForTokenClassification

## LayoutLMv2ForQuestionAnswering

[[autodoc]] LayoutLMv2ForQuestionAnswering
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LayoutXLM
## Overview
LayoutXLM was proposed in [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
Zhang, Furu Wei. It's a multilingual extension of the [LayoutLMv2 model](https://arxiv.org/abs/2012.14740) trained
on 53 languages.
The abstract from the paper is the following:
*Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document
understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In
this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to
bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also
introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in
7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled
for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
cross-lingual pre-trained models on the XFUN dataset.*
One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
```python
from transformers import LayoutLMv2Model
model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base')
```
Note that LayoutXLM has its own tokenizer, based on
[`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`]. You can initialize it as
follows:
```python
from transformers import LayoutXLMTokenizer
tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')
```
Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies
[`LayoutLMv2FeatureExtractor`] and
[`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
data for the model.
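
For illustration, a minimal sketch of building and using such a processor by hand; the checkpoint name follows the example above, the feature extractor applies OCR by default, and the file name is a placeholder:

```python
from PIL import Image
from transformers import LayoutLMv2FeatureExtractor, LayoutXLMTokenizerFast, LayoutXLMProcessor

feature_extractor = LayoutLMv2FeatureExtractor()  # applies OCR on the image by default
tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")
processor = LayoutXLMProcessor(feature_extractor, tokenizer)

image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
encoding = processor(image, return_tensors="pt")
print(encoding.keys())
```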
As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
## LayoutXLMTokenizer
[[autodoc]] LayoutXLMTokenizer
- __call__
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## LayoutXLMTokenizerFast
[[autodoc]] LayoutXLMTokenizerFast
- __call__
## LayoutXLMProcessor
[[autodoc]] LayoutXLMProcessor
- __call__
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LED
## Overview
The LED model was proposed in [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz
Beltagy, Matthew E. Peters, Arman Cohan.
The abstract from the paper is the following:
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting
long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
dataset.*
Tips:
- [`LEDForConditionalGeneration`] is an extension of
[`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with
*Longformer*'s *chunked self-attention* layer. [`LEDTokenizer`] is an alias of
[`BartTokenizer`].
- LED works very well on long-range *sequence-to-sequence* tasks where the `input_ids` largely exceed a length of
1024 tokens.
- LED pads the `input_ids` to be a multiple of `config.attention_window` if required. Therefore, a small speed-up is
gained when [`LEDTokenizer`] is used with the `pad_to_multiple_of` argument.
- LED makes use of *global attention* by means of the `global_attention_mask` (see
[`LongformerModel`]). For summarization, it is advised to put *global attention* only on the first
`<s>` token (a short sketch follows these tips). For question answering, it is advised to put *global attention* on all tokens of the question.
- To fine-tune LED on all 16384 input tokens, it is necessary to enable *gradient checkpointing* by executing
`model.gradient_checkpointing_enable()`.
- A notebook showing how to evaluate LED can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
- A notebook showing how to fine-tune LED can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
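
As a rough, illustrative sketch of the summarization advice above (the checkpoint name and generation settings are placeholders, not part of the original docs):

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_article = "Replace this with a (very) long document..."
inputs = tokenizer(long_article, max_length=16384, truncation=True, return_tensors="pt")

# put global attention only on the first (<s>) token, as advised for summarization
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=2,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```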
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
## LEDConfig
[[autodoc]] LEDConfig
## LEDTokenizer
[[autodoc]] LEDTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## LEDTokenizerFast
[[autodoc]] LEDTokenizerFast
## LED specific outputs
[[autodoc]] models.led.modeling_led.LEDEncoderBaseModelOutput
[[autodoc]] models.led.modeling_led.LEDSeq2SeqModelOutput
[[autodoc]] models.led.modeling_led.LEDSeq2SeqLMOutput
[[autodoc]] models.led.modeling_led.LEDSeq2SeqSequenceClassifierOutput
[[autodoc]] models.led.modeling_led.LEDSeq2SeqQuestionAnsweringModelOutput
[[autodoc]] models.led.modeling_tf_led.TFLEDEncoderBaseModelOutput
[[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqModelOutput
[[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
## LEDModel
[[autodoc]] LEDModel
- forward
## LEDForConditionalGeneration
[[autodoc]] LEDForConditionalGeneration
- forward
## LEDForSequenceClassification
[[autodoc]] LEDForSequenceClassification
- forward
## LEDForQuestionAnswering
[[autodoc]] LEDForQuestionAnswering
- forward
## TFLEDModel
[[autodoc]] TFLEDModel
- call
## TFLEDForConditionalGeneration
[[autodoc]] TFLEDForConditionalGeneration
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Longformer
**DISCLAIMER:** This model is still a work in progress; if you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
## Overview
The Longformer model was presented in [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
The abstract from the paper is the following:
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
WikiHop and TriviaQA.*
Tips:
- Since the Longformer is based on RoBERTa, it doesn't have `token_type_ids`. You don't need to indicate which
token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or
`</s>`).
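As a minimal sketch, assuming the `allenai/longformer-base-4096` checkpoint, a pair of segments can simply be passed to the tokenizer, which joins them with separator tokens:
```python
from transformers import LongformerTokenizer
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
# the two segments are joined with </s> separator tokens; no token_type_ids are required
inputs = tokenizer("How many cats are there?", "There are two cats.")
```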
This model was contributed by [beltagy](https://huggingface.co/beltagy). The Authors' code can be found [here](https://github.com/allenai/longformer).
## Longformer Self Attention
Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
attend "locally" to each other meaning that each token attends to its \\(\frac{1}{2} w\\) previous tokens and
\\(\frac{1}{2} w\\) succeding tokens with \\(w\\) being the window length as defined in
`config.attention_window`. Note that `config.attention_window` can be of type `List` to define a
different \\(w\\) for each layer. A selected few tokens attend "globally" to all other tokens, as it is
conventionally done for all tokens in `BertSelfAttention`.
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
that every "locally" attending token not only attends to tokens within its window \\(w\\), but also to all "globally"
attending tokens so that global attention is *symmetric*.
The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
`global_attention_mask`:
- 0: the token attends "locally",
- 1: the token attends "globally".
For more information, please also refer to the [`~LongformerModel.forward`] method.
Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
represents the memory and time bottleneck, can be reduced from \\(\mathcal{O}(n_s \times n_s)\\) to
\\(\mathcal{O}(n_s \times w)\\), with \\(n_s\\) being the sequence length and \\(w\\) being the average window
size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
"locally" attending tokens.
For more information, please refer to the official [paper](https://arxiv.org/pdf/2004.05150.pdf).
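As a minimal sketch, assuming the `allenai/longformer-base-4096` checkpoint, a `global_attention_mask` could be built as follows:
```python
import torch
from transformers import LongformerModel, LongformerTokenizer
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
inputs = tokenizer("A very long document. " * 500, return_tensors="pt")
# 0: the token attends "locally", 1: the token attends "globally"
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1  # e.g. global attention on the first (<s>) token only
outputs = model(inputs.input_ids, attention_mask=inputs.attention_mask, global_attention_mask=global_attention_mask)
```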
## Training
[`LongformerForMaskedLM`] is trained the exact same way [`RobertaForMaskedLM`] is
trained and should be used as follows:
```python
from transformers import LongformerForMaskedLM, LongformerTokenizer
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
input_ids = tokenizer.encode("This is a sentence from <mask> training data", return_tensors="pt")
mlm_labels = tokenizer.encode("This is a sentence from the training data", return_tensors="pt")
loss = model(input_ids, labels=mlm_labels).loss
```
## LongformerConfig
[[autodoc]] LongformerConfig
## LongformerTokenizer
[[autodoc]] LongformerTokenizer
## LongformerTokenizerFast
[[autodoc]] LongformerTokenizerFast
## Longformer specific outputs
[[autodoc]] models.longformer.modeling_longformer.LongformerBaseModelOutput
[[autodoc]] models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling
[[autodoc]] models.longformer.modeling_longformer.LongformerMaskedLMOutput
[[autodoc]] models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput
[[autodoc]] models.longformer.modeling_longformer.LongformerSequenceClassifierOutput
[[autodoc]] models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput
[[autodoc]] models.longformer.modeling_longformer.LongformerTokenClassifierOutput
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput
[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
## LongformerModel
[[autodoc]] LongformerModel
- forward
## LongformerForMaskedLM
[[autodoc]] LongformerForMaskedLM
- forward
## LongformerForSequenceClassification
[[autodoc]] LongformerForSequenceClassification
- forward
## LongformerForMultipleChoice
[[autodoc]] LongformerForMultipleChoice
- forward
## LongformerForTokenClassification
[[autodoc]] LongformerForTokenClassification
- forward
## LongformerForQuestionAnswering
[[autodoc]] LongformerForQuestionAnswering
- forward
## TFLongformerModel
[[autodoc]] TFLongformerModel
- call
## TFLongformerForMaskedLM
[[autodoc]] TFLongformerForMaskedLM
- call
## TFLongformerForQuestionAnswering
[[autodoc]] TFLongformerForQuestionAnswering
- call
## TFLongformerForSequenceClassification
[[autodoc]] TFLongformerForSequenceClassification
- call
## TFLongformerForTokenClassification
[[autodoc]] TFLongformerForTokenClassification
- call
## TFLongformerForMultipleChoice
[[autodoc]] TFLongformerForMultipleChoice
- call
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Longformer
-----------------------------------------------------------------------------------------------------------------------
**DISCLAIMER:** This model is still a work in progress; if you see something strange, file a `Github Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Longformer model was presented in `Longformer: The Long-Document Transformer
<https://arxiv.org/pdf/2004.05150.pdf>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
The abstract from the paper is the following:
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
WikiHop and TriviaQA.*
Tips:
- Since the Longformer is based on RoBERTa, it doesn't have :obj:`token_type_ids`. You don't need to indicate which
token belongs to which segment. Just separate your segments with the separation token :obj:`tokenizer.sep_token` (or
:obj:`</s>`).
This model was contributed by `beltagy <https://huggingface.co/beltagy>`__. The Authors' code can be found `here
<https://github.com/allenai/longformer>`__.
Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and
:math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in
:obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as it is
conventionally done for all tokens in :obj:`BertSelfAttention`.
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally"
attending tokens so that global attention is *symmetric*.
The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
:obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
:obj:`global_attention_mask`:
- 0: the token attends "locally",
- 1: the token attends "globally".
For more information, please also refer to the :meth:`~transformers.LongformerModel.forward` method.
Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to
:math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window
size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
"locally" attending tokens.
For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`__.
Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:class:`~transformers.LongformerForMaskedLM` is trained the exact same way :class:`~transformers.RobertaForMaskedLM` is
trained and should be used as follows:
.. code-block::
input_ids = tokenizer.encode('This is a sentence from <mask> training data', return_tensors='pt')
mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
loss = model(input_ids, labels=mlm_labels).loss
LongformerConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerConfig
:members:
LongformerTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerTokenizer
:members:
LongformerTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerTokenizerFast
:members:
Longformer specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling
:members:
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMaskedLMOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerSequenceClassifierOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerTokenClassifierOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput
:members:
.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
:members:
LongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerModel
:members: forward
LongformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForMaskedLM
:members: forward
LongformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForSequenceClassification
:members: forward
LongformerForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForMultipleChoice
:members: forward
LongformerForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForTokenClassification
:members: forward
LongformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.LongformerForQuestionAnswering
:members: forward
TFLongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerModel
:members: call
TFLongformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForMaskedLM
:members: call
TFLongformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForQuestionAnswering
:members: call
TFLongformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForSequenceClassification
:members: call
TFLongformerForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForTokenClassification
:members: call
TFLongformerForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFLongformerForMultipleChoice
:members: call
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LUKE
## Overview
The LUKE model was proposed in [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.
It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps
improve performance on various downstream tasks involving reasoning about entities such as named entity recognition,
extractive and cloze-style question answering, entity typing, and relation classification.
...@@ -38,13 +35,13 @@ answering).*
Tips:
- This implementation is the same as [`RobertaModel`] with the addition of entity embeddings as well
  as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
- LUKE treats entities as input tokens; therefore, it takes `entity_ids`, `entity_attention_mask`,
  `entity_token_type_ids` and `entity_position_ids` as extra input. You can obtain those using
  [`LukeTokenizer`].
- [`LukeTokenizer`] takes `entities` and `entity_spans` (character-based start and end
  positions of the entities in the input text) as extra input. `entities` typically consist of [MASK] entities or
  Wikipedia entities. A brief description of inputting these entities is as follows:
  - *Inputting [MASK] entities to compute entity representations*: The [MASK] entity is used to mask entities to be
...@@ -60,109 +57,95 @@ Tips:
- There are three head models for the former use case:
  - [`LukeForEntityClassification`], for tasks to classify a single entity in an input text such as
    entity typing, e.g. the [Open Entity dataset](https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html).
    This model places a linear head on top of the output entity representation.
  - [`LukeForEntityPairClassification`], for tasks to classify the relationship between two entities
    such as relation classification, e.g. the [TACRED dataset](https://nlp.stanford.edu/projects/tacred/). This
    model places a linear head on top of the concatenated output representation of the pair of given entities.
  - [`LukeForEntitySpanClassification`], for tasks to classify the sequence of entity spans, such as
    named entity recognition (NER). This model places a linear head on top of the output entity representations. You
    can address NER using this model by inputting all possible entity spans in the text to the model.
  [`LukeTokenizer`] has a `task` argument, which enables you to easily create an input to these
  head models by specifying `task="entity_classification"`, `task="entity_pair_classification"`, or
  `task="entity_span_classification"`. Please refer to the example code of each head model.
A demo notebook on how to fine-tune [`LukeForEntityPairClassification`] for relation
classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE).
There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found [here](https://github.com/studio-ousia/luke/tree/master/notebooks).
Example:
```python
>>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification
>>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
# Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"
>>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_last_hidden_state = outputs.last_hidden_state
>>> entity_last_hidden_state = outputs.entity_last_hidden_state
# Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
>>> entities = ["Beyoncé", "Los Angeles"]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_last_hidden_state = outputs.last_hidden_state
>>> entity_last_hidden_state = outputs.entity_last_hidden_state
# Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_class_idx = int(logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
```
This model was contributed by [ikuyamada](https://huggingface.co/ikuyamada) and [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/studio-ousia/luke).
## LukeConfig
[[autodoc]] LukeConfig
## LukeTokenizer
[[autodoc]] LukeTokenizer
- __call__
- save_vocabulary
## LukeModel
[[autodoc]] LukeModel
- forward
## LukeForMaskedLM
[[autodoc]] LukeForMaskedLM
- forward
## LukeForEntityClassification
[[autodoc]] LukeForEntityClassification
- forward
## LukeForEntityPairClassification
[[autodoc]] LukeForEntityPairClassification
- forward
## LukeForEntitySpanClassification
[[autodoc]] LukeForEntitySpanClassification
- forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# LXMERT
## Overview
The LXMERT model was proposed in [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
...@@ -52,77 +49,54 @@ Tips:
  contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
  both self attention outputs are disregarded.
This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). The original code can be found [here](https://github.com/airsplay/lxmert).
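Since the Faster R-CNN feature extractor used in the original work is not part of this library, the visual inputs have to be provided externally. A minimal sketch, with randomly generated ROI features standing in for real detector outputs (shapes taken from the default `LxmertConfig`), could look as follows:
```python
import torch
from transformers import LxmertModel, LxmertTokenizer
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
inputs = tokenizer("What color is the cat?", return_tensors="pt")
# 36 regions with 2048-dim ROI features and 4-dim normalized bounding boxes per region (random placeholders)
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)
outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
language_output = outputs.language_output
vision_output = outputs.vision_output
pooled_output = outputs.pooled_output
```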
## LxmertConfig
[[autodoc]] LxmertConfig
## LxmertTokenizer
[[autodoc]] LxmertTokenizer
## LxmertTokenizerFast
[[autodoc]] LxmertTokenizerFast
## Lxmert specific outputs
[[autodoc]] models.lxmert.modeling_lxmert.LxmertModelOutput
[[autodoc]] models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
[[autodoc]] models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
[[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
[[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
## LxmertModel
[[autodoc]] LxmertModel
- forward
## LxmertForPreTraining
[[autodoc]] LxmertForPreTraining
- forward
## LxmertForQuestionAnswering
[[autodoc]] LxmertForQuestionAnswering
- forward
## TFLxmertModel
[[autodoc]] TFLxmertModel
- call
## TFLxmertForPreTraining
[[autodoc]] TFLxmertForPreTraining
- call
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# M2M100
## Overview
The M2M100 model was proposed in [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
The abstract from the paper is the following:
*Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
single model able to translate between any pair of languages. However, much of this work is English-Centric by training
only on data which was translated from or to English. While this is supported by large sources of training data, it
does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
translating between non-English directions while performing competitively to the best single systems of WMT. We
open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
This model was contributed by [valhalla](https://huggingface.co/valhalla).
### Training and Generation
M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
multilingual, it expects the sequences in a certain format: a special language id token is used as a prefix in both the
source and target text. The text format is `[lang_code] X [eos]`, where `lang_code` is the source language id for the
source text and the target language id for the target text, and `X` is the source or target text.
The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
examples. To install `sentencepiece` run `pip install sentencepiece`.
- Supervised Training
```python
from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer
model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
src_text = "Life is like a box of chocolates."
tgt_text = "La vie est comme une boîte de chocolat."
model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
labels = tokenizer(tgt_text, return_tensors="pt").input_ids
loss = model(**model_inputs, labels=labels).loss  # forward pass
```
- Generation
M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id
being forced as the first generated token. To force the target language id as the first generated token, pass the
*forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate from Hindi to
French and from Chinese to English using the *facebook/m2m100_418M* checkpoint.
```python
>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
>>> hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
>>> chinese_text = "生活就像一盒巧克力。"
>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
>>> # translate Hindi to French
>>> tokenizer.src_lang = "hi"
>>> encoded_hi = tokenizer(hi_text, return_tensors="pt")
>>> generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
"La vie est comme une boîte de chocolat."
>>> # translate Chinese to English
>>> tokenizer.src_lang = "zh"
>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")
>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
"Life is like a box of chocolate."
```
## M2M100Config
[[autodoc]] M2M100Config
## M2M100Tokenizer
[[autodoc]] M2M100Tokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
## M2M100Model
[[autodoc]] M2M100Model
- forward
## M2M100ForConditionalGeneration
[[autodoc]] M2M100ForConditionalGeneration
- forward
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
M2M100
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The M2M100 model was proposed in `Beyond English-Centric Multilingual Machine Translation
<https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
The abstract from the paper is the following:
*Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
single model able to translate between any pair of languages. However, much of this work is English-Centric by training
only on data which was translated from or to English. While this is supported by large sources of training data, it
does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
translating between non-English directions while performing competitively to the best single systems of WMT. We
open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
Training and Generation
_______________________________________________________________________________________________________________________
M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
multilingual, it expects the sequences in a certain format: a special language id token is used as a prefix in both the
source and target text. The text format is :obj:`[lang_code] X [eos]`, where :obj:`lang_code` is the source language id
for the source text and the target language id for the target text, and :obj:`X` is the source or target text.
The :class:`~transformers.M2M100Tokenizer` depends on :obj:`sentencepiece` so be sure to install it before running the
examples. To install :obj:`sentencepiece` run ``pip install sentencepiece``.
- Supervised Training
.. code-block::
from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer
model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
src_text = "Life is like a box of chocolates."
tgt_text = "La vie est comme une boîte de chocolat."
model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
labels = tokenizer(tgt_text, return_tensors="pt").input_ids
loss = model(**model_inputs, labels=labels).loss  # forward pass
- Generation
M2M100 uses the :obj:`eos_token_id` as the :obj:`decoder_start_token_id` for generation with the target language id
being forced as the first generated token. To force the target language id as the first generated token, pass the
`forced_bos_token_id` parameter to the `generate` method. The following example shows how to translate from Hindi to
French and from Chinese to English using the `facebook/m2m100_418M` checkpoint.
.. code-block::
>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
>>> hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
>>> chinese_text = "生活就像一盒巧克力。"
>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
>>> # translate Hindi to French
>>> tokenizer.src_lang = "hi"
>>> encoded_hi = tokenizer(hi_text, return_tensors="pt")
>>> generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
"La vie est comme une boîte de chocolat."
>>> # translate Chinese to English
>>> tokenizer.src_lang = "zh"
>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")
>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
"Life is like a box of chocolate."
M2M100Config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.M2M100Config
:members:
M2M100Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.M2M100Tokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
M2M100Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.M2M100Model
:members: forward
M2M100ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.M2M100ForConditionalGeneration
:members: forward
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# MarianMT
**Bugs:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
and assign @patrickvonplaten.
Translations should be similar, but not identical to output in the test set linked to in each model card.
## Implementation Notes
- Each model is about 298 MB on disk, and there are more than 1,000 models.
- The list of supported language pairs can be found [here](https://huggingface.co/Helsinki-NLP).
- Models were originally trained by [Jörg Tiedemann](https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann) using the [Marian](https://marian-nmt.github.io/) C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
in a model card.
- The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as [`BartForConditionalGeneration`] with a few minor modifications:
- static (sinusoid) positional embeddings (`MarianConfig.static_position_embeddings=True`)
- no layernorm_embedding (`MarianConfig.normalize_embedding=False`)
- the model starts generating with `pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
  `</s>`),
- Code to bulk convert models can be found in `convert_marian_to_pytorch.py`.
- This model was contributed by [sshleifer](https://huggingface.co/sshleifer).
## Naming
- All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`
- The language codes used to name models are inconsistent. Two-digit codes can usually be found [here](https://developers.google.com/admin-sdk/directory/v1/languages); three-digit codes require googling "language
  code {code}".
- Codes formatted like `es_AR` are usually `code_{region}`. That one is Spanish from Argentina.
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
group use a combination of ISO-639-5 codes and ISO-639-2 codes.
## Examples
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests.
- [Fine-tune on GPU](https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh)
- [Fine-tune on GPU with pytorch-lightning](https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh)
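As a quick sanity check, a minimal translation sketch with a single-pair model (assuming the `Helsinki-NLP/opus-mt-en-de` checkpoint) could look like this:
```python
from transformers import MarianMTModel, MarianTokenizer
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
# single-pair models need no language code prefix
batch = tokenizer(["Marian models are small enough for quick experiments."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```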
## Multilingual Models
- All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`:
- If a model can output multiple languages, you should specify a language code by prepending the desired output
  language to the `src_text`.
- You can see a model's supported language codes in its model card, under target constituents, like in [opus-mt-en-roa](https://huggingface.co/Helsinki-NLP/opus-mt-en-roa).
- Note that if a model is only multilingual on the source side, like `Helsinki-NLP/opus-mt-roa-en`, no language
codes are required.
New multi-lingual models from the [Tatoeba-Challenge repo](https://github.com/Helsinki-NLP/Tatoeba-Challenge)
require 3 character language codes:
```python
>>> from transformers import MarianMTModel, MarianTokenizer
>>> src_text = [
... '>>fra<< this is a sentence in english that we want to translate to french',
... '>>por<< This should go to portuguese',
... '>>esp<< And this to Spanish'
>>> ]
>>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> print(tokenizer.supported_language_codes)
['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']
>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en français",
'Isto deve ir para o português.',
'Y esto al español']
```
Here is the code to see all available pretrained models on the hub:
```python
from huggingface_hub import list_models
model_list = list_models()
org = "Helsinki-NLP"
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
suffix = [x.split('/')[1] for x in model_ids]
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
```
## Old Style Multi-Lingual Models
These are the old style multi-lingual models ported from the OPUS-MT-Train repo, along with the members of each
language group:
```python
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
'Helsinki-NLP/opus-mt-ROMANCE-en',
'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
'Helsinki-NLP/opus-mt-de-ZH',
'Helsinki-NLP/opus-mt-en-CELTIC',
'Helsinki-NLP/opus-mt-en-ROMANCE',
'Helsinki-NLP/opus-mt-es-NORWAY',
'Helsinki-NLP/opus-mt-fi-NORWAY',
'Helsinki-NLP/opus-mt-fi-ZH',
'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
'Helsinki-NLP/opus-mt-sv-NORWAY',
'Helsinki-NLP/opus-mt-sv-ZH']
GROUP_MEMBERS = {
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
}
```
Example of translating English to many Romance languages, using old-style two-character language codes:
```python
>>> from transformers import MarianMTModel, MarianTokenizer
>>> src_text = [
... '>>fr<< this is a sentence in english that we want to translate to french',
... '>>pt<< This should go to portuguese',
... '>>es<< And this to Spanish'
... ]
>>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> model = MarianMTModel.from_pretrained(model_name)
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
["c'est une phrase en anglais que nous voulons traduire en français",
'Isto deve ir para o português.',
'Y esto al español']
```
## MarianConfig
[[autodoc]] MarianConfig
## MarianTokenizer
[[autodoc]] MarianTokenizer
- as_target_tokenizer
## MarianModel
[[autodoc]] MarianModel
- forward
## MarianMTModel
[[autodoc]] MarianMTModel
- forward
## MarianForCausalLM
[[autodoc]] MarianForCausalLM
- forward
## TFMarianModel
[[autodoc]] TFMarianModel
- call
## TFMarianMTModel
[[autodoc]] TFMarianMTModel
- call
## FlaxMarianModel
[[autodoc]] FlaxMarianModel
- __call__
## FlaxMarianMTModel
[[autodoc]] FlaxMarianMTModel
- __call__
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# MBart and MBart-50
**DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten.
## Overview of MBart
The MBart model was presented in [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
on the encoder, decoder, or reconstructing parts of the text.
This model was contributed by [valhalla](https://huggingface.co/valhalla). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/mbart).
### Training of MBart
MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the
model is multilingual, it expects the sequences in a different format. A special language id token is added in both the
source and target text. The source text format is `X [eos, src_lang_code]` where `X` is the source text. The
target text format is `[tgt_lang_code] X [eos]`. `bos` is never used.
The regular [`~MBartTokenizer.__call__`] will encode the source text format, and it should be wrapped
inside the context manager [`~MBartTokenizer.as_target_tokenizer`] to encode the target text format.
- Supervised training
```python
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
>>> example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
>>> inputs = tokenizer(example_english_phrase, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
... labels = tokenizer(expected_translation_romanian, return_tensors="pt")
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
>>> # forward pass
>>> model(**inputs, labels=labels["input_ids"])
```
- Generation
While generating the target text, set the `decoder_start_token_id` to the target language id. The following
example shows how to translate English to Romanian using the *facebook/mbart-large-en-ro* model.
```python
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
>>> article = "UN Chief Says There Is No Military Solution in Syria"
>>> inputs = tokenizer(article, return_tensors="pt")
>>> translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
"Şeful ONU declară că nu există o soluţie militară în Siria"
```
## Overview of MBart-50
MBart-50 was introduced in the [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extending
its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretraining it
on 50 languages.
According to the abstract:
*Multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one
direction, a pretrained model is finetuned on many directions at the same time. It demonstrates that pretrained models
can be extended to incorporate additional languages without loss of performance. Multilingual finetuning improves on
average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while
improving 9.3 BLEU on average over bilingual baselines from scratch.*
### Training of MBart-50
The text format for MBart-50 is slightly different from mBART. For MBart-50 the language id token is used as a prefix
for both source and target text, i.e. the text format is `[lang_code] X [eos]`, where `lang_code` is the source
language id for the source text and the target language id for the target text, with `X` being the source or target
text respectively.
MBart-50 has its own tokenizer [`MBart50Tokenizer`].
- Supervised training
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
model_inputs = tokenizer(src_text, return_tensors="pt")
with tokenizer.as_target_tokenizer():
labels = tokenizer(tgt_text, return_tensors="pt").input_ids
model(**model_inputs, labels=labels) # forward pass
```
- Generation
To generate using the mBART-50 multilingual translation models, `eos_token_id` is used as the
`decoder_start_token_id`, and the target language id is forced as the first generated token. To force the
target language id as the first generated token, pass the *forced_bos_token_id* parameter to the *generate* method.
The following example shows how to translate Hindi to French and Arabic to English using the
*facebook/mbart-large-50-many-to-many-mmt* checkpoint.
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."
# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."
```
## MBartConfig
[[autodoc]] MBartConfig
## MBartTokenizer
[[autodoc]] MBartTokenizer
- as_target_tokenizer
- build_inputs_with_special_tokens
## MBartTokenizerFast
[[autodoc]] MBartTokenizerFast
## MBart50Tokenizer
[[autodoc]] MBart50Tokenizer
## MBart50TokenizerFast
[[autodoc]] MBart50TokenizerFast
## MBartModel
[[autodoc]] MBartModel
## MBartForConditionalGeneration
[[autodoc]] MBartForConditionalGeneration
## MBartForQuestionAnswering
[[autodoc]] MBartForQuestionAnswering
## MBartForSequenceClassification
[[autodoc]] MBartForSequenceClassification
## MBartForCausalLM
[[autodoc]] MBartForCausalLM
- forward
## TFMBartModel
[[autodoc]] TFMBartModel
- call
## TFMBartForConditionalGeneration
[[autodoc]] TFMBartForConditionalGeneration
- call
## FlaxMBartModel
[[autodoc]] FlaxMBartModel
- __call__
- encode
- decode
## FlaxMBartForConditionalGeneration
[[autodoc]] FlaxMBartForConditionalGeneration
- __call__
- encode
- decode
## FlaxMBartForSequenceClassification
[[autodoc]] FlaxMBartForSequenceClassification
- __call__
- encode
- decode
## FlaxMBartForQuestionAnswering
[[autodoc]] FlaxMBartForQuestionAnswering
- __call__
- encode
- decode