Unverified Commit 0ccb6f5c authored by Sylvain Gugger, committed by GitHub

Clean RAG docs and template docs (#7348)

* Clean RAG docs and template docs

* Fix typo

* Better doc
parent 27174bd4
@@ -4,11 +4,14 @@ RAG
Overview
~~~~~~~~~~~~~~~~~~~~~

Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and
sequence-to-sequence models. RAG models retrieve documents, pass them to a seq2seq model, then marginalize to generate
outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing
both retrieval and generation to adapt to downstream tasks.

It is based on the paper `Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
<https://arxiv.org/abs/2005.11401>`__ by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir
Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.

The abstract from the paper is the following:
@@ -47,7 +50,7 @@ RagTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.RagTokenizer
    :members: prepare_seq2seq_batch

Rag specific outputs
@@ -38,35 +38,39 @@ RAG_CONFIG_DOC = r"""
    retrieval_vector_size (:obj:`int`, `optional`, defaults to 768):
        Dimensionality of the document embeddings indexed by :class:`~transformers.RagRetriever`.
    retrieval_batch_size (:obj:`int`, `optional`, defaults to 8):
        Retrieval batch size, defined as the number of queries issued concurrently to the faiss index encapsulated
        by :class:`~transformers.RagRetriever`.
    dataset (:obj:`str`, `optional`, defaults to :obj:`"wiki_dpr"`):
        A dataset identifier of the indexed dataset on HuggingFace AWS bucket (list all available datasets and ids
        using :obj:`datasets.list_datasets()`).
    dataset_split (:obj:`str`, `optional`, defaults to :obj:`"train"`)
        Which split of the :obj:`dataset` to load.
    index_name (:obj:`str`, `optional`, defaults to :obj:`"compressed"`)
        The index name of the index associated with the :obj:`dataset`. One can choose between :obj:`"legacy"`,
        :obj:`"exact"` and :obj:`"compressed"`.
    index_path (:obj:`str`, `optional`)
        The path to the serialized faiss index on disk.
    passages_path (:obj:`str`, `optional`):
        A path to text passages compatible with the faiss index. Required if using
        :class:`~transformers.retrieval_rag.LegacyIndex`.
    use_dummy_dataset (:obj:`bool`, `optional`, defaults to ``False``)
        Whether to load a "dummy" variant of the dataset specified by :obj:`dataset`.
    label_smoothing (:obj:`float`, `optional`, defaults to 0.0):
        Only relevant if ``return_loss`` is set to :obj:`True`. Controls the ``epsilon`` parameter value for label
        smoothing in the loss calculation. If set to 0, no label smoothing is performed.
    do_marginalize (:obj:`bool`, `optional`, defaults to :obj:`False`):
        If :obj:`True`, the logits are marginalized over all documents by making use of
        ``torch.nn.functional.log_softmax``.
    reduce_loss (:obj:`bool`, `optional`, defaults to :obj:`False`):
        Whether or not to reduce the NLL loss using the ``torch.Tensor.sum`` operation.
    do_deduplication (:obj:`bool`, `optional`, defaults to :obj:`True`):
        Whether or not to deduplicate the generations from different context documents for a given input. Has to be
        set to :obj:`False` if used while training with distributed backend.
    exclude_bos_score (:obj:`bool`, `optional`, defaults to :obj:`False`):
        Whether or not to disregard the BOS token when computing the loss.
    output_retrieved(:obj:`bool`, `optional`, defaults to :obj:`False`):
        If set to ``True``, :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`, :obj:`context_input_ids` and
        :obj:`context_attention_mask` are returned. See returned tensors for more detail.
"""
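For illustration, the retrieval-related options documented above can be adjusted on a loaded configuration. A minimal sketch, assuming the publicly released ``facebook/rag-token-nq`` checkpoint (the checkpoint name is an assumption, not part of this diff)::

    from transformers import RagConfig

    config = RagConfig.from_pretrained("facebook/rag-token-nq")  # checkpoint name is an assumption
    config.retrieval_batch_size = 8   # queries issued concurrently to the faiss index
    config.index_name = "exact"       # one of "legacy", "exact", "compressed"
    config.use_dummy_dataset = True   # load the small dummy variant of wiki_dpr
    config.do_deduplication = False   # must be False when training with a distributed backend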
@@ -45,66 +45,63 @@ class RetrievAugLMMarginOutput(ModelOutput):
            Prediction scores of the language modeling head. The score is possibly marginalized over all documents
            for each vocabulary token.
        doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
            Score between each retrieved document embedding (see :obj:`retrieved_doc_embeds`) and
            :obj:`question_encoder_last_hidden_state`.
        past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
            List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
            :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`.

            Contains precomputed hidden states (key and values in the attention blocks) of the decoder that can be
            used (see ``past_key_values`` input) to speed up sequential decoding.
        retrieved_doc_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs, hidden_size)`, `optional`, returned when `output_retrieved=True`):
            Embedded documents retrieved by the retriever. Is used with ``question_encoder_last_hidden_state`` to
            compute the ``doc_scores``.
        retrieved_doc_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, config.n_docs)`, `optional`, returned when `output_retrieved=True`):
            The indexes of the embedded documents retrieved by the retriever.
        context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
            Input ids post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
            retriever.
        context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
            Attention mask post-processed from the retrieved documents and the question encoder :obj:`input_ids` by
            the retriever.
        question_encoder_last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
            Sequence of hidden states at the output of the last layer of the question encoder pooled output of the
            model.
        question_enc_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
        question_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attention weights of the question encoder, after the attention softmax, used to compute the weighted
            average in the self-attention heads.
        generator_enc_last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
            Sequence of hidden states at the output of the last layer of the generator encoder of the model.
        generator_enc_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
        generator_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attention weights of the generator encoder, after the attention softmax, used to compute the weighted
            average in the self-attention heads.
        generator_dec_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
        generator_dec_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attention weights of the generator decoder, after the attention softmax, used to compute the weighted
            average in the self-attention heads.
    """

    loss: Optional[torch.FloatTensor] = None
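A hedged sketch of how these output fields surface in practice. The checkpoint names, the dummy index, and the ``labels`` key returned by ``prepare_seq2seq_batch`` are assumptions rather than facts established by this diff::

    from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
    retriever = RagRetriever.from_pretrained(
        "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
    )
    model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=["who holds the record in 100m freestyle"],
        tgt_texts=["michael phelps"],
        return_tensors="pt",
    )
    outputs = model(
        input_ids=batch["input_ids"],
        labels=batch["labels"],        # assumes prepare_seq2seq_batch returns a "labels" key
        output_retrieved=True,
    )
    print(outputs.loss)                # NLL loss, marginalized over the retrieved documents
    print(outputs.doc_scores.shape)    # (batch_size, n_docs)
    print(outputs.retrieved_doc_ids)   # ids of the documents returned by the retriever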
@@ -133,14 +130,14 @@ class RetrievAugLMOutput(ModelOutput):
            Prediction scores of the language modeling head. The score is possibly marginalized over all documents
            for each vocabulary token.
        doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
            Score between each retrieved document embedding (see :obj:`retrieved_doc_embeds`) and
            :obj:`question_encoder_last_hidden_state`.
        past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
            List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`, with each tensor of shape
            :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`.

            Contains precomputed hidden states (key and values in the attention blocks) of the decoder that can be
            used (see ``past_key_values`` input) to speed up sequential decoding.
        retrieved_doc_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs, hidden_size)`, `optional`, returned when `output_retrieved=True`):
            Embedded documents retrieved by the retriever. Is used with ``question_encoder_last_hidden_state`` to
            compute the ``doc_scores``.
@@ -150,48 +147,46 @@ class RetrievAugLMOutput(ModelOutput):
            Input ids post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
            retriever.
        context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
            Attention mask post-processed from the retrieved documents and the question encoder :obj:`input_ids` by
            the retriever.
        question_encoder_last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
            Sequence of hidden states at the output of the last layer of the question encoder pooled output of the
            model.
        question_enc_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden states of the question encoder at the output of each layer plus the initial embedding outputs.
        question_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attention weights of the question encoder, after the attention softmax, used to compute the weighted
            average in the self-attention heads.
        generator_enc_last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
            Sequence of hidden states at the output of the last layer of the generator encoder of the model.
        generator_enc_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden states of the generator encoder at the output of each layer plus the initial embedding outputs.
        generator_enc_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attention weights of the generator encoder, after the attention softmax, used to compute the weighted
            average in the self-attention heads.
        generator_dec_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden states of the generator decoder at the output of each layer plus the initial embedding outputs.
        generator_dec_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attention weights of the generator decoder, after the attention softmax, used to compute the weighted
            average in the self-attention heads.
    """

    logits: torch.FloatTensor = None
@@ -213,10 +208,11 @@ class RetrievAugLMOutput(ModelOutput):
class RagPreTrainedModel(PreTrainedModel):
    r"""
    RAG models were released with the paper `Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    <https://arxiv.org/abs/2005.11401>`_ by Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.

    RAG is a retrieval-augmented model and encapsulates three components: a question encoder, a dataset retriever
    and a generator. The encoder and generator are trainable while the retriever is just an indexed dataset.
    """
    config_class = RagConfig
@@ -232,40 +228,56 @@ class RagPreTrainedModel(PreTrainedModel):
        *model_args,
        **kwargs
    ) -> PreTrainedModel:
        r"""
        Instantiates a question encoder and a generator from one or two base classes of the library from pretrained
        model checkpoints.

        The model is set in evaluation mode by default using :obj:`model.eval()` (Dropout modules are deactivated).
        To train the model, you need to first set it back in training mode with :obj:`model.train()`.

        Params:
            question_encoder_pretrained_model_name_or_path (:obj:`str`, `optional`, defaults to :obj:`None`):
                Information necessary to initiate the question encoder. Can be either:

                    - A string with the `shortcut name` of a pretrained model to load from cache or download, e.g.,
                      ``bert-base-uncased``.
                    - A string with the `identifier name` of a pretrained model that was user-uploaded to our S3,
                      e.g., ``dbmdz/bert-base-german-cased``.
                    - A path to a `directory` containing model weights saved using
                      :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
                    - A path or url to a `tensorflow index checkpoint file` (e.g., ``./tf_model/model.ckpt.index``).
                      In this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be
                      provided as ``config`` argument. This loading path is slower than converting the TensorFlow
                      checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch
                      model afterwards.
            generator_pretrained_model_name_or_path (:obj:`str`, `optional`, defaults to :obj:`None`):
                Information necessary to initiate the generator. Can be either:

                    - A string with the `shortcut name` of a pretrained model to load from cache or download, e.g.,
                      ``bert-base-uncased``.
                    - A string with the `identifier name` of a pretrained model that was user-uploaded to our S3,
                      e.g., ``dbmdz/bert-base-german-cased``.
                    - A path to a `directory` containing model weights saved using
                      :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
                    - A path or url to a `tensorflow index checkpoint file` (e.g., ``./tf_model/model.ckpt.index``).
                      In this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be
                      provided as ``config`` argument. This loading path is slower than converting the TensorFlow
                      checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch
                      model afterwards.
            model_args (remaining positional arguments, `optional`):
                All remaining positional arguments will be passed to the underlying model's ``__init__`` method.
            retriever (:class:`~transformers.RagRetriever`, `optional`):
                The retriever to use.
            kwargs (remaining dictionary of keyword arguments, `optional`):
                Can be used to update the configuration object (after it is loaded) and initiate the model
                (e.g., ``output_attentions=True``).

                    - To update the question_encoder configuration, use the prefix `question_encoder_` for each
                      configuration parameter.
                    - To update the generator configuration, use the prefix `generator_` for each configuration
                      parameter.
                    - To update the parent model configuration, do not use a prefix for each configuration parameter.

                Behaves differently depending on whether a :obj:`config` is provided or automatically loaded.

        Example::
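The example block itself is not shown in this hunk. As a hedged sketch only, and assuming the classmethod is the library's ``from_pretrained_question_encoder_generator`` combined with publicly released DPR and BART checkpoints (both assumptions), usage might look like::

    from transformers import RagModel

    model = RagModel.from_pretrained_question_encoder_generator(
        "facebook/dpr-question_encoder-single-nq-base",  # question encoder checkpoint (assumption)
        "facebook/bart-large",                           # generator checkpoint (assumption)
    )
    # Configuration overrides are routed with the documented prefixes:
    # `question_encoder_` for the question encoder, `generator_` for the generator,
    # and no prefix for the parent RagConfig.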
@@ -345,23 +357,33 @@ class RagPreTrainedModel(PreTrainedModel):
RAG_START_DOCSTRING = r"""

    RAG is a seq2seq model which encapsulates two core components: a question encoder and a generator. During a
    forward pass, we encode the input with the question encoder and pass it to the retriever to extract relevant
    context documents. The documents are then prepended to the input. Such contextualized inputs are passed to the
    generator.

    The question encoder can be any `autoencoding` model, preferably :class:`~transformers.DPRQuestionEncoder`, and
    the generator can be any `seq2seq` model, preferably :class:`~transformers.BartForConditionalGeneration`.

    The model can be initialized with a :class:`~transformers.RagRetriever` for end-to-end generation or used in
    combination with the outputs of a retriever in multiple steps (see examples for more details). The model is
    compatible with any `autoencoding` model as the ``question_encoder`` and any `seq2seq` model with language model
    head as the ``generator``. It has been tested with :class:`~transformers.DPRQuestionEncoder` as the
    ``question_encoder`` and :class:`~transformers.BartForConditionalGeneration` or
    :class:`~transformers.T5ForConditionalGeneration` as the ``generator``.

    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the
    generic methods the library implements for all its models (such as downloading or saving, resizing the input
    embeddings, pruning heads etc.)

    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
    subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to
    general usage and behavior.

    Args:
        config (:class:`~transformers.RagConfig`):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
        question_encoder (:class:`transformers.PreTrainedModel`):
@@ -377,44 +399,65 @@ RAG_FORWARD_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. :class:`~transformers.RagConfig`, used to initialize
            the model, specifies which generator to use; it also specifies a compatible generator tokenizer. Use that
            tokenizer class to obtain the indices.
        attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            `What are attention masks? <../glossary.html#attention-mask>`__
        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor))`, `optional`):
            Tuple consists of (:obj:`generator_enc_last_hidden_state`, `optional`: :obj:`generator_enc_hidden_states`,
            `optional`: :obj:`generator_enc_attentions`). :obj:`generator_enc_last_hidden_state` of shape
            :obj:`(batch_size, n_docs * sequence_length, hidden_size)` is a sequence of hidden states at the output
            of the last layer of the generator's encoder.

            Used by the :class:`~transformers.RagModel` model during decoding.
        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
            Provide for generation tasks. :obj:`None` by default, construct as per instructions for the generator
            model you're using with your RAG instance.
        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
            Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
            also be used by default.
        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))`):
            Tuple consists of two elements: :obj:`encoder_outputs` of the RAG model (see :obj:`encoder_outputs`) and
            :obj:`past_key_values` of the underlying generator. Can be used to speed up decoding.
            :obj:`past_key_values` are used in the :class:`~transformers.RagTokenForGeneration` model during
            decoding.
        doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
            Score between each retrieved document embedding (see :obj:`retrieved_doc_embeds`) and
            :obj:`question_encoder_last_hidden_state`. If the model is not initialized with a ``retriever``,
            :obj:`doc_scores` has to be provided to the forward pass. :obj:`doc_scores` can be computed via
            :obj:`question_encoder_last_hidden_state` and :obj:`retrieved_doc_embeds`, see examples for more
            information.
        context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
            Input IDs post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
            retriever. If the model is not initialized with a ``retriever``, :obj:`context_input_ids` has to be
            provided to the forward pass. :obj:`context_input_ids` are returned by
            :meth:`~transformers.RagRetriever.__call__`.
        context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
            Attention mask post-processed from the retrieved documents and the question encoder :obj:`input_ids` by
            the retriever. If the model is not initialized with a ``retriever``, :obj:`context_attention_mask` has to
            be provided to the forward pass. :obj:`context_attention_mask` are returned by
            :meth:`~transformers.RagRetriever.__call__`.
        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
            decoding (see ``past_key_values``).
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
            returned tensors for more detail.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
            for more detail.
        output_retrieved(:obj:`bool`, `optional`):
            Whether or not to return the :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`,
            :obj:`context_input_ids` and :obj:`context_attention_mask`. See returned tensors for more detail.
"""
@@ -780,28 +823,31 @@ class RagSequenceForGeneration(RagPreTrainedModel):
    ):
        """
        Implements RAG sequence "thorough" decoding. Read the :meth:`~transformers.PreTrainedModel.generate`
        documentation for more information on how to set other generate input parameters.

        Args:
            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                The sequence used as a prompt for the generation. If :obj:`input_ids` is not passed, then
                :obj:`context_input_ids` has to be provided.
            context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
                Input IDs post-processed from the retrieved documents and the question encoder :obj:`input_ids` by
                the retriever.
            do_deduplication (:obj:`bool`, `optional`):
                Whether or not to deduplicate the generations from different context documents for a given input.
                Has to be set to :obj:`False` if used while training with distributed backend.
            num_return_sequences(:obj:`int`, `optional`, defaults to 1):
                The number of independently computed returned sequences for each element in the batch. Note that
                this is not the value we pass to the ``generator``'s :func:`~transformers.PreTrainedModel.generate`
                function, where we set ``num_return_sequences`` to :obj:`num_beams`.
            num_beams (:obj:`int`, `optional`, defaults to 1):
                Number of beams for beam search. 1 means no beam search.
            kwargs:
                Additional kwargs will be passed to :meth:`~transformers.PreTrainedModel.generate`.

        Return:
            :obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`:
                The generated sequences. The second dimension (sequence length) is either equal to :obj:`max_length`
                or shorter if all batches finished early due to the :obj:`eos_token_id`.
        """
@@ -1041,6 +1087,7 @@ class RagTokenForGeneration(RagPreTrainedModel):
                If :obj:`True`, the NLL loss is reduced using the ``torch.Tensor.sum`` operation.
            kwargs (:obj:`Dict[str, any]`, `optional`, defaults to ``{}``):
                Legacy dictionary, which is required so that the model can use the `generate()` function.

        Returns:

        Example::
...@@ -1156,23 +1203,35 @@ class RagTokenForGeneration(RagPreTrainedModel): ...@@ -1156,23 +1203,35 @@ class RagTokenForGeneration(RagPreTrainedModel):
Args: Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
The sequence used as a prompt for the generation. If :obj:`input_ids` is not passed, then :obj:`context_input_ids` has to be provided. The sequence used as a prompt for the generation. If :obj:`input_ids` is not passed, then
:obj:`context_input_ids` has to be provided.
context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`): context_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Input ids post-processed from the retrieved documents and the question encoder input_ids by the retriever. Input IDs post-processed from the retrieved documents and the question encoder :obj:`input_ids` by the
If the model has is not initialized with a ``retriever`` :obj:`context_input_ids` has to be provided to the forward pass. :obj:`context_input_ids` are returned by :meth:`~transformers.RagRetriever.__call__` retriever.
If the model has is not initialized with a ``retriever``, :obj:`context_input_ids` has to be provided
to the forward pass. :obj:`context_input_ids` are returned by
:meth:`~transformers.RagRetriever.__call__`.
context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`): context_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size * config.n_docs, config.max_combined_length)`, `optional`, returned when `output_retrieved=True`):
Attention mask post-processed from the retrieved documents and the question encoder input_ids by the retriever. Attention mask post-processed from the retrieved documents and the question encoder :obj:`input_ids` by
If the model has is not initialized with a ``retriever`` :obj:`context_attention_mask` has to be provided to the forward pass. :obj:`context_attention_mask` are returned by :meth:`~transformers.RagRetriever.__call__` the retriever.
If the model has is not initialized with a ``retriever``, :obj:`context_input_ids` has to be provided
to the forward pass. :obj:`context_input_ids` are returned by
:meth:`~transformers.RagRetriever.__call__`.
doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`): doc_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.n_docs)`):
Score between each retrieved document embeddigs (see :obj:`retrieved_doc_embeds`) and :obj:`question_encoder_last_hidden_state`. Score between each retrieved document embeddigs (see :obj:`retrieved_doc_embeds`) and
If the model has is not initialized with a ``retriever`` :obj:`doc_scores` has to be provided to the forward pass. :obj:`doc_scores` can be computed via :obj:`question_encoder_last_hidden_state` and :obj:`retrieved_doc_embeds`, see examples for more information. :obj:`question_encoder_last_hidden_state`.
If the model has is not initialized with a ``retriever``, :obj:`context_input_ids` has to be provided
to the forward pass. :obj:`context_input_ids` are returned by
:meth:`~transformers.RagRetriever.__call__`.
max_length (:obj:`int`, `optional`, defaults to 20): max_length (:obj:`int`, `optional`, defaults to 20):
The maximum length of the sequence to be generated. The maximum length of the sequence to be generated.
min_length (:obj:`int`, `optional`, defaults to 10): min_length (:obj:`int`, `optional`, defaults to 10):
The minimum length of the sequence to be generated. The minimum length of the sequence to be generated.
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`): early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to stop the beam search when at least ``num_beams`` sentences are finished per batch or not. Whether or not to stop the beam search when at least ``num_beams`` sentences are finished per batch or not.
use_cache: (:obj:`bool`, `optional`, defaults to :obj:`True`): use_cache: (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should use the past last key/values attentions (if applicable to the model) to Whether or not the model should use the past last key/values attentions (if applicable to the model) to
speed up decoding. speed up decoding.
...@@ -1195,14 +1254,13 @@ class RagTokenForGeneration(RagPreTrainedModel): ...@@ -1195,14 +1254,13 @@ class RagTokenForGeneration(RagPreTrainedModel):
num_beams (:obj:`int`, `optional`, defaults to 1): num_beams (:obj:`int`, `optional`, defaults to 1):
Number of beams for beam search. 1 means no beam search. Number of beams for beam search. 1 means no beam search.
num_return_sequences(:obj:`int`, `optional`, defaults to 1): num_return_sequences(:obj:`int`, `optional`, defaults to 1):
The number of independently computed returned sequences for each element in the batch. Note that this is not the value The number of independently computed returned sequences for each element in the batch. Note that this
we pass to the ``generator``'s :func:`~transformers.PreTrainedModel.generate` function, where we set ``num_return_sequences`` is not the value we pass to the ``generator``'s :func:`~transformers.PreTrainedModel.generate`
to `num_beams`. function, where we set ``num_return_sequences`` to :obj:`num_beams`.
decoder_start_token_id (:obj:`int`, `optional`): decoder_start_token_id (:obj:`int`, `optional`):
If an encoder-decoder model starts decoding with a different token than `bos`, the id of that token. If an encoder-decoder model starts decoding with a different token than `bos`, the id of that token.
Return: Return:
:obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`: :obj:`torch.LongTensor` of shape :obj:`(batch_size * num_return_sequences, sequence_length)`:
The generated sequences. The second dimension (sequence_length) is either equal to :obj:`max_length` or The generated sequences. The second dimension (sequence_length) is either equal to :obj:`max_length` or
shorter if all batches finished early due to the :obj:`eos_token_id`. shorter if all batches finished early due to the :obj:`eos_token_id`.
......
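For reference, a minimal sketch of how these generation arguments are typically used with :class:`~transformers.RagTokenForGeneration`; the checkpoint name and the dummy-index retriever options are assumptions taken from the public RAG examples, not part of this diff::

    from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    retriever = RagRetriever.from_pretrained(
        "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
    )
    model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

    inputs = tokenizer.prepare_seq2seq_batch(
        ["who holds the record in 100m freestyle"], return_tensors="pt"
    )
    generated = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=20,
        min_length=10,
        num_beams=4,
        num_return_sequences=2,  # must be <= num_beams
        early_stopping=True,
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))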
...@@ -399,12 +399,14 @@ class RagRetriever: ...@@ -399,12 +399,14 @@ class RagRetriever:
The number of docs retrieved per query. The number of docs retrieved per query.
Return: Return:
retrieved_doc_embeds (:obj:`np.ndarray` of shape :obj:`(batch_size, n_docs, dim)` :obj:`Tuple[np.ndarray, np.ndarray, List[dict]]`:
The retrieval embeddings of the retrieved docs per query. A tuple with the following objects:
doc_ids (:obj:`np.ndarray` of shape :obj:`batch_size, n_docs`)
The ids of the documents in the index - **retrieved_doc_embeds** (:obj:`np.ndarray` of shape :obj:`(batch_size, n_docs, dim)`) -- The
doc_dicts (:obj:`List[dict]`): retrieval embeddings of the retrieved docs per query.
The retrieved_doc_embeds examples per query. - **doc_ids** (:obj:`np.ndarray` of shape :obj:`(batch_size, n_docs)`) -- The ids of the documents in the
index
- **doc_dicts** (:obj:`List[dict]`) -- The :obj:`retrieved_doc_embeds` examples per query.
""" """
doc_ids, retrieved_doc_embeds = self._main_retrieve(question_hidden_states, n_docs) doc_ids, retrieved_doc_embeds = self._main_retrieve(question_hidden_states, n_docs)
......
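As an illustration of the return value documented above, a sketch of calling :meth:`~transformers.RagRetriever.retrieve` directly; the checkpoint name and dummy-index options are assumptions::

    import numpy as np
    from transformers import RagRetriever

    retriever = RagRetriever.from_pretrained(
        "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
    )

    # Hidden states normally come from the question encoder; random values suffice to show the shapes.
    question_hidden_states = np.random.randn(2, 768).astype("float32")
    retrieved_doc_embeds, doc_ids, doc_dicts = retriever.retrieve(question_hidden_states, n_docs=5)

    print(retrieved_doc_embeds.shape)  # (2, 5, 768)
    print(doc_ids.shape)               # (2, 5)
    print(len(doc_dicts))              # 2, one dict of retrieved passages per query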
...@@ -17,7 +17,8 @@ import os ...@@ -17,7 +17,8 @@ import os
from typing import List, Optional from typing import List, Optional
from .configuration_rag import RagConfig from .configuration_rag import RagConfig
from .tokenization_utils_base import BatchEncoding from .file_utils import add_start_docstrings
from .tokenization_utils_base import PREPARE_SEQ2SEQ_BATCH_DOCSTRING, BatchEncoding
from .utils import logging from .utils import logging
...@@ -60,6 +61,7 @@ class RagTokenizer: ...@@ -60,6 +61,7 @@ class RagTokenizer:
def batch_decode(self, *args, **kwargs): def batch_decode(self, *args, **kwargs):
return self.generator.batch_decode(*args, **kwargs) return self.generator.batch_decode(*args, **kwargs)
@add_start_docstrings(PREPARE_SEQ2SEQ_BATCH_DOCSTRING)
def prepare_seq2seq_batch( def prepare_seq2seq_batch(
self, self,
src_texts: List[str], src_texts: List[str],
...@@ -71,66 +73,6 @@ class RagTokenizer: ...@@ -71,66 +73,6 @@ class RagTokenizer:
truncation=True, truncation=True,
**kwargs, **kwargs,
) -> BatchEncoding: ) -> BatchEncoding:
r"""
Prepare a batch that can be passed directly to an instance of :class:`~transformers.RagModel`.
Args:
src_texts: (:obj:`List[str]`):
List of documents to summarize or source language texts.
tgt_texts: (:obj:`List[str]`, `optional`):
List of summaries or target language texts.
max_length (:obj:`int`, `optional`):
Controls the maximum length for encoder inputs (documents to summarize or source language texts).
If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
length is required by one of the truncation/padding parameters. If the model has no specific maximum
input length (like XLNet) truncation/padding to a maximum length will be deactivated.
max_target_length (:obj:`int`, `optional`):
Controls the maximum length of decoder inputs (target language texts or summaries).
If left unset or set to :obj:`None`, this will use the max_length value.
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
Activates and controls padding. Accepts the following values:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
single sequence is provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`, defaults to "pt"):
If set, will return tensors instead of list of python integers. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.TruncationStrategy`, `optional`, defaults to :obj:`True`):
Activates and controls truncation. Accepts the following values:
* :obj:`True` or :obj:`'longest_first'`: Truncate to a maximum length specified with the argument
:obj:`max_length` or to the maximum acceptable input length for the model if that argument is not
provided. This will truncate token by token, removing a token from the longest sequence in the pair
if a pair of sequences (or a batch of pairs) is provided.
* :obj:`'only_first'`: Truncate to a maximum length specified with the argument :obj:`max_length` or to
the maximum acceptable input length for the model if that argument is not provided. This will only
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
* :obj:`'only_second'`: Truncate to a maximum length specified with the argument :obj:`max_length` or
to the maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
* :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with
sequence lengths greater than the model maximum admissible input size).
**kwargs:
Additional keyword arguments passed along to :obj:`self.__call__`.
Returns:
:class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:
- **input_ids** -- List of token ids to be fed to the encoder.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model.
- **labels** -- List of token ids for tgt_texts
The full set of keys ``[input_ids, attention_mask, labels]``
will only be returned if :obj:`tgt_texts` is passed. Otherwise, ``input_ids`` and ``attention_mask`` will be the only keys.
"""
if max_length is None: if max_length is None:
max_length = self.question_encoder.model_max_length max_length = self.question_encoder.model_max_length
model_inputs: BatchEncoding = self.question_encoder( model_inputs: BatchEncoding = self.question_encoder(
......
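A short usage sketch of the method documented above; the checkpoint name is an assumption::

    from transformers import RagTokenizer

    tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=["who holds the record in 100m freestyle"],
        tgt_texts=["michael phelps"],
        max_length=128,
        max_target_length=32,
        return_tensors="pt",
    )
    # "labels" is present only because tgt_texts was passed.
    print(list(batch.keys()))  # ['input_ids', 'attention_mask', 'labels']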
...@@ -31,10 +31,10 @@ XXX_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -31,10 +31,10 @@ XXX_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class XxxConfig(PretrainedConfig): class XxxConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.XXXModel`. This is the configuration class to store the configuration of a :class:`~transformers.XxxModel` or a
It is used to instantiate a XXX model according to the specified arguments, defining the model :class:`~transformers.TFXxxModel`. It is used to instantiate a XXX model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the XXX `xxx-base-uncased <https://huggingface.co/xxx/xxx-base-uncased>`__ architecture. configuration to that of the XXX `xxx-base-uncased <https://huggingface.co/xxx/xxx-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -42,33 +42,35 @@ class XxxConfig(PretrainedConfig): ...@@ -42,33 +42,35 @@ class XxxConfig(PretrainedConfig):
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the XXX model. Defines the different tokens that Vocabulary size of the XXX model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.XXXModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.XxxModel` or
hidden_size (:obj:`int`, optional, defaults to 768): :class:`~transformers.TFXxxModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12): num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 12): num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to :obj:`"gelu"`): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.XxxModel` or
initializer_range (:obj:`float`, optional, defaults to 0.02): :class:`~transformers.TFXxxModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the :obj:`truncated_normal_initializer` for initializing all weight matrices. The standard deviation of the :obj:`truncated_normal_initializer` for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-5): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, use gradient checkpointing to save memory at the expense of a slower backward pass. If :obj:`True`, use gradient checkpointing to save memory at the expense of a slower backward pass.
kwargs: kwargs:
Additional arguments for common configurations, passed to :class:`~transformers.PretrainedConfig`. Additional arguments for common configurations, passed to :class:`~transformers.PretrainedConfig`.
......
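Since ``Xxx`` is only a template placeholder, a runnable sketch of this configuration pattern uses BERT as a concrete stand-in; the class names and values below are assumptions mirroring the arguments listed above::

    from transformers import BertConfig, BertModel

    config = BertConfig(
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
    )
    model = BertModel(config)  # randomly initialized model with this architecture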
...@@ -257,13 +257,18 @@ class TFXxxPreTrainedModel(TFPreTrainedModel): ...@@ -257,13 +257,18 @@ class TFXxxPreTrainedModel(TFPreTrainedModel):
XXX_START_DOCSTRING = r""" XXX_START_DOCSTRING = r"""
The XXX model was proposed in The XXX model was proposed in
`XXX: Pre-training of Deep Bidirectional Transformers for Language Understanding `XXX: Pre-training of Deep Bidirectional Transformers for Language Understanding
<https://arxiv.org/abs/1810.04805>`__ by.... <https://arxiv.org/abs/1810.04805>`__ by....
This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class. This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
Use it as a regular TF 2.0 Keras Model and generic methods the library implements for all its models (such as downloading or saving, resizing the input
refer to the TF 2.0 documentation for all matter related to general usage and behavior. embeddings, pruning heads etc.)
This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
usage and behavior.
.. note:: .. note::
...@@ -272,17 +277,17 @@ XXX_START_DOCSTRING = r""" ...@@ -272,17 +277,17 @@ XXX_START_DOCSTRING = r"""
- having all inputs as keyword arguments (like PyTorch models), or - having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments. - having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
all the tensors in the first argument of the model call function: :obj:`model(inputs)`. all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
If you choose this second option, there are three possibilities you can use to gather all the input Tensors If you choose this second option, there are three possibilities you can use to gather all the input Tensors
in the first positional argument : in the first positional argument :
- a single Tensor with input_ids only and nothing else: :obj:`model(input_ids)` - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(input_ids)`
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
:obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])` :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
- a dictionary with one or several input Tensors associated to the input names given in the docstring: - a dictionary with one or several input Tensors associated to the input names given in the docstring:
:obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})` :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
Parameters: Parameters:
config (:class:`~transformers.XxxConfig`): Model configuration class with all the parameters of the model. config (:class:`~transformers.XxxConfig`): Model configuration class with all the parameters of the model.
...@@ -292,27 +297,31 @@ XXX_START_DOCSTRING = r""" ...@@ -292,27 +297,31 @@ XXX_START_DOCSTRING = r"""
XXX_INPUTS_DOCSTRING = r""" XXX_INPUTS_DOCSTRING = r"""
Args: Args:
input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`): input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.XxxTokenizer`. Indices can be obtained using :class:`~transformers.BertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :func:`transformers.PreTrainedTokenizer.__call__` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :func:`transformers.PreTrainedTokenizer.encode` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`): attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`): token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`__ `What are token type IDs? <../glossary.html#token-type-ids>`__
position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`): position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``. Selected in the range ``[0, config.max_position_embeddings - 1]``.
...@@ -320,21 +329,25 @@ XXX_INPUTS_DOCSTRING = r""" ...@@ -320,21 +329,25 @@ XXX_INPUTS_DOCSTRING = r"""
head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`): - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
(if set to :obj:`False`) for evaluation.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple. training (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use the model in training mode (some modules like dropout modules have different
behaviors between training and evaluation).
""" """
...@@ -347,7 +360,7 @@ class TFXxxModel(TFXxxPreTrainedModel): ...@@ -347,7 +360,7 @@ class TFXxxModel(TFXxxPreTrainedModel):
super().__init__(config, *inputs, **kwargs) super().__init__(config, *inputs, **kwargs)
self.transformer = TFXxxMainLayer(config, name="transformer") self.transformer = TFXxxMainLayer(config, name="transformer")
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-cased", checkpoint="xxx-base-cased",
...@@ -370,7 +383,7 @@ class TFXxxForMaskedLM(TFXxxPreTrainedModel, TFMaskedLanguageModelingLoss): ...@@ -370,7 +383,7 @@ class TFXxxForMaskedLM(TFXxxPreTrainedModel, TFMaskedLanguageModelingLoss):
self.transformer = TFXxxMainLayer(config, name="transformer") self.transformer = TFXxxMainLayer(config, name="transformer")
self.mlm = TFXxxMLMHead(config, self.transformer.embeddings, name="mlm") self.mlm = TFXxxMLMHead(config, self.transformer.embeddings, name="mlm")
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-cased", checkpoint="xxx-base-cased",
...@@ -452,7 +465,7 @@ class TFXxxForSequenceClassification(TFXxxPreTrainedModel, TFSequenceClassificat ...@@ -452,7 +465,7 @@ class TFXxxForSequenceClassification(TFXxxPreTrainedModel, TFSequenceClassificat
config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier" config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
) )
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-cased", checkpoint="xxx-base-cased",
...@@ -544,7 +557,7 @@ class TFXxxForMultipleChoice(TFXxxPreTrainedModel, TFMultipleChoiceLoss): ...@@ -544,7 +557,7 @@ class TFXxxForMultipleChoice(TFXxxPreTrainedModel, TFMultipleChoiceLoss):
""" """
return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)} return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-cased", checkpoint="xxx-base-cased",
...@@ -568,8 +581,8 @@ class TFXxxForMultipleChoice(TFXxxPreTrainedModel, TFMultipleChoiceLoss): ...@@ -568,8 +581,8 @@ class TFXxxForMultipleChoice(TFXxxPreTrainedModel, TFMultipleChoiceLoss):
r""" r"""
labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (See :obj:`input_ids` above)
""" """
if isinstance(inputs, (tuple, list)): if isinstance(inputs, (tuple, list)):
...@@ -667,7 +680,7 @@ class TFXxxForTokenClassification(TFXxxPreTrainedModel, TFTokenClassificationLos ...@@ -667,7 +680,7 @@ class TFXxxForTokenClassification(TFXxxPreTrainedModel, TFTokenClassificationLos
config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier" config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
) )
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-cased", checkpoint="xxx-base-cased",
...@@ -734,8 +747,8 @@ class TFXxxForTokenClassification(TFXxxPreTrainedModel, TFTokenClassificationLos ...@@ -734,8 +747,8 @@ class TFXxxForTokenClassification(TFXxxPreTrainedModel, TFTokenClassificationLos
@add_start_docstrings( @add_start_docstrings(
"""XXX Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of """XXX Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
the hidden-states output to compute `span start logits` and `span end logits`). """, layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
XXX_START_DOCSTRING, XXX_START_DOCSTRING,
) )
class TFXxxForQuestionAnswering(TFXxxPreTrainedModel, TFQuestionAnsweringLoss): class TFXxxForQuestionAnswering(TFXxxPreTrainedModel, TFQuestionAnsweringLoss):
...@@ -748,7 +761,7 @@ class TFXxxForQuestionAnswering(TFXxxPreTrainedModel, TFQuestionAnsweringLoss): ...@@ -748,7 +761,7 @@ class TFXxxForQuestionAnswering(TFXxxPreTrainedModel, TFQuestionAnsweringLoss):
config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs" config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
) )
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-cased", checkpoint="xxx-base-cased",
...@@ -773,11 +786,11 @@ class TFXxxForQuestionAnswering(TFXxxPreTrainedModel, TFQuestionAnsweringLoss): ...@@ -773,11 +786,11 @@ class TFXxxForQuestionAnswering(TFXxxPreTrainedModel, TFQuestionAnsweringLoss):
r""" r"""
start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`): end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss. Position outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.transformer.return_dict return_dict = return_dict if return_dict is not None else self.transformer.return_dict
......
...@@ -209,11 +209,16 @@ class XxxPreTrainedModel(PreTrainedModel): ...@@ -209,11 +209,16 @@ class XxxPreTrainedModel(PreTrainedModel):
module.bias.data.zero_() module.bias.data.zero_()
XXX_START_DOCSTRING = r""" The XXX model was proposed in XXX_START_DOCSTRING = r"""
`XXX: Pre-training of Deep Bidirectional Transformers for Language Understanding
The XXX model was proposed in `XXX: Pre-training of Deep Bidirectional Transformers for Language Understanding
<https://arxiv.org/abs/1810.04805>`__ by.... <https://arxiv.org/abs/1810.04805>`__ by....
This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
methods the library implements for all its models (such as downloading or saving, resizing the input embeddings,
pruning heads etc.)
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
usage and behavior. usage and behavior.
...@@ -225,27 +230,31 @@ XXX_START_DOCSTRING = r""" The XXX model was proposed in ...@@ -225,27 +230,31 @@ XXX_START_DOCSTRING = r""" The XXX model was proposed in
XXX_INPUTS_DOCSTRING = r""" XXX_INPUTS_DOCSTRING = r"""
Inputs: Inputs:
input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`): input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using :class:`transformers.XxxTokenizer`. Indices can be obtained using :class:`~transformers.XxxTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and See :meth:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__ `What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`): attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
Mask to avoid performing attention on padding token indices. Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`): token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Segment token indices to indicate first and second portions of the inputs. Segment token indices to indicate first and second portions of the inputs.
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1`` Indices are selected in ``[0, 1]``:
corresponds to a `sentence B` token
- 0 corresponds to a `sentence A` token,
- 1 corresponds to a `sentence B` token.
`What are token type IDs? <../glossary.html#token-type-ids>`_ `What are token type IDs? <../glossary.html#token-type-ids>`_
position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`): position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
Indices of positions of each input sequence tokens in the position embeddings. Indices of positions of each input sequence tokens in the position embeddings.
Selected in the range ``[0, config.max_position_embeddings - 1]``. Selected in the range ``[0, config.max_position_embeddings - 1]``.
...@@ -253,18 +262,22 @@ XXX_INPUTS_DOCSTRING = r""" ...@@ -253,18 +262,22 @@ XXX_INPUTS_DOCSTRING = r"""
head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
Mask to nullify selected heads of the self-attention modules. Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``: Mask values selected in ``[0, 1]``:
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`): - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
than the model's internal embedding lookup matrix. vectors than the model's internal embedding lookup matrix.
output_attentions (:obj:`bool`, `optional`): output_attentions (:obj:`bool`, `optional`):
If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
tensors for more detail.
output_hidden_states (:obj:`bool`, `optional`): output_hidden_states (:obj:`bool`, `optional`):
If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail. Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
more detail.
return_dict (:obj:`bool`, `optional`): return_dict (:obj:`bool`, `optional`):
If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
plain tuple.
""" """
...@@ -296,7 +309,7 @@ class XxxModel(XxxPreTrainedModel): ...@@ -296,7 +309,7 @@ class XxxModel(XxxPreTrainedModel):
for layer, heads in heads_to_prune.items(): for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads) self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-uncased", checkpoint="xxx-base-uncased",
...@@ -378,7 +391,7 @@ class XxxForMaskedLM(XxxPreTrainedModel): ...@@ -378,7 +391,7 @@ class XxxForMaskedLM(XxxPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.lm_head return self.lm_head
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-uncased", checkpoint="xxx-base-uncased",
...@@ -455,7 +468,7 @@ class XxxForSequenceClassification(XxxPreTrainedModel): ...@@ -455,7 +468,7 @@ class XxxForSequenceClassification(XxxPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-uncased", checkpoint="xxx-base-uncased",
...@@ -538,7 +551,7 @@ class XxxForMultipleChoice(XxxPreTrainedModel): ...@@ -538,7 +551,7 @@ class XxxForMultipleChoice(XxxPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-uncased", checkpoint="xxx-base-uncased",
...@@ -561,8 +574,8 @@ class XxxForMultipleChoice(XxxPreTrainedModel): ...@@ -561,8 +574,8 @@ class XxxForMultipleChoice(XxxPreTrainedModel):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for computing the multiple choice classification loss. Labels for computing the multiple choice classification loss.
Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
of the input tensors. (see `input_ids` above) of the input tensors. (See :obj:`input_ids` above)
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
...@@ -628,7 +641,7 @@ class XxxForTokenClassification(XxxPreTrainedModel): ...@@ -628,7 +641,7 @@ class XxxForTokenClassification(XxxPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-uncased", checkpoint="xxx-base-uncased",
...@@ -713,7 +726,7 @@ class XxxForQuestionAnswering(XxxPreTrainedModel): ...@@ -713,7 +726,7 @@ class XxxForQuestionAnswering(XxxPreTrainedModel):
self.init_weights() self.init_weights()
@add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) @add_start_docstrings_to_callable(XXX_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings( @add_code_sample_docstrings(
tokenizer_class=_TOKENIZER_FOR_DOC, tokenizer_class=_TOKENIZER_FOR_DOC,
checkpoint="xxx-base-uncased", checkpoint="xxx-base-uncased",
...@@ -737,11 +750,11 @@ class XxxForQuestionAnswering(XxxPreTrainedModel): ...@@ -737,11 +750,11 @@ class XxxForQuestionAnswering(XxxPreTrainedModel):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the start of the labelled span for computing the token classification loss. Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Positions outside of the sequence are not taken into account for computing the loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
Labels for position (index) of the end of the labelled span for computing the token classification loss. Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions are clamped to the length of the sequence (:obj:`sequence_length`).
Position outside of the sequence are not taken into account for computing the loss. Position outside of the sequence are not taken into account for computing the loss.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
......
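A sketch of how the ``start_positions``/``end_positions`` labels above are used for training, with BERT standing in for the template classes; the checkpoint name and label indices below are assumptions for illustration only::

    import torch
    from transformers import BertTokenizer, BertForQuestionAnswering

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

    enc = tokenizer("Who wrote it?", "It was written by Jane.", return_tensors="pt")
    # Illustrative label positions; they are clamped to sequence_length, and positions outside
    # the sequence are not taken into account for the loss.
    start_positions = torch.tensor([10])
    end_positions = torch.tensor([10])
    outputs = model(**enc, start_positions=start_positions, end_positions=end_positions, return_dict=True)
    print(outputs.loss, outputs.start_logits.shape, outputs.end_logits.shape)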
...@@ -80,16 +80,16 @@ class XxxTokenizer(PreTrainedTokenizer): ...@@ -80,16 +80,16 @@ class XxxTokenizer(PreTrainedTokenizer):
r""" r"""
Constructs a XXX tokenizer. Based on XXX. Constructs a XXX tokenizer. Based on XXX.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
should refer to the superclass for more information regarding methods. Users should refer to this superclass for more information regarding those methods.
Args: Args:
vocab_file (:obj:`str`): vocab_file (:obj:`str`):
File containing the vocabulary. File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`): do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to lowercase the input when tokenizing. Whether or not to lowercase the input when tokenizing.
do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`): do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to do basic tokenization before WordPiece. Whether or not to do basic tokenization before WordPiece.
never_split (:obj:`Iterable`, `optional`): never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True` :obj:`do_basic_tokenize=True`
...@@ -194,19 +194,19 @@ class XxxTokenizer(PreTrainedTokenizer): ...@@ -194,19 +194,19 @@ class XxxTokenizer(PreTrainedTokenizer):
""" """
Build model inputs from a sequence or a pair of sequences for sequence classification tasks Build model inputs from a sequence or a pair of sequences for sequence classification tasks
by concatenating and adding special tokens. by concatenating and adding special tokens.
A BERT sequence has the following format: A XXX sequence has the following format:
- single sequence: ``[CLS] X [SEP]`` - single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]`` - pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args: Args:
token_ids_0 (:obj:`List[int]`): token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`): token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs. Optional second list of IDs for sequence pairs.
Returns: Returns:
:obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
""" """
if token_ids_1 is None: if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
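A sketch of the special-token formats described above, using BertTokenizer as a stand-in for the template class; the checkpoint name is an assumption::

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("how are you"))

    single = tokenizer.build_inputs_with_special_tokens(ids_a)        # [CLS] A [SEP]
    pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)   # [CLS] A [SEP] B [SEP]
    print(tokenizer.convert_ids_to_tokens(pair))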
...@@ -218,16 +218,16 @@ class XxxTokenizer(PreTrainedTokenizer): ...@@ -218,16 +218,16 @@ class XxxTokenizer(PreTrainedTokenizer):
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]: ) -> List[int]:
""" """
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method. special tokens using the tokenizer ``prepare_for_model`` method.
Args: Args:
token_ids_0 (:obj:`List[int]`): token_ids_0 (:obj:`List[int]`):
List of ids. List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`): token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs. Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Set to True if the token list is already formatted with special tokens for the model Whether or not the token list is already formatted with special tokens for the model.
Returns: Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
...@@ -249,7 +249,7 @@ class XxxTokenizer(PreTrainedTokenizer): ...@@ -249,7 +249,7 @@ class XxxTokenizer(PreTrainedTokenizer):
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]: ) -> List[int]:
""" """
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. Create a mask from the two sequences passed to be used in a sequence-pair classification task.
A BERT sequence pair mask has the following format: A XXX sequence pair mask has the following format:
:: ::
...@@ -257,11 +257,11 @@ class XxxTokenizer(PreTrainedTokenizer): ...@@ -257,11 +257,11 @@ class XxxTokenizer(PreTrainedTokenizer):
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence | | first sequence | second sequence |
if token_ids_1 is None, only returns the first portion of the mask (0's). If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
Args: Args:
token_ids_0 (:obj:`List[int]`): token_ids_0 (:obj:`List[int]`):
List of ids. List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`): token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs. Optional second list of IDs for sequence pairs.
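The token type mask illustrated above can be reproduced with the same BERT stand-in; all names below are assumptions for illustration only::

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("how are you"))

    token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
    print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1] -> 0s for "[CLS] A [SEP]", 1s for "B [SEP]"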
...@@ -277,7 +277,7 @@ class XxxTokenizer(PreTrainedTokenizer): ...@@ -277,7 +277,7 @@ class XxxTokenizer(PreTrainedTokenizer):
def save_vocabulary(self, vocab_path): def save_vocabulary(self, vocab_path):
""" """
Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory. Save the vocabulary (copy original file) and special tokens file to a directory.
Args: Args:
vocab_path (:obj:`str`): vocab_path (:obj:`str`):
......