Unverified Commit 87e6e4fe authored by Sylvain Gugger, committed by GitHub

Doc styler v2 (#14950)

* New doc styler

* Fix issue with args at the start

* Code sample fixes

* Style code examples in MDX

* Fix more patterns

* Typo

* Typo

* More patterns

* Do without black for now

* Get more info in error

* Docstring style

* Re-enable check

* Quality

* Fix add_end_docstring decorator

* Fix docstring
parent c1138273
@@ -38,27 +38,25 @@ class BeitFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
    r"""
    Constructs a BEiT feature extractor.

    This feature extractor inherits from [`~feature_extraction_utils.FeatureExtractionMixin`] which contains most of
    the main methods. Users should refer to this superclass for more information regarding those methods.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the input to a certain `size`.
        size (`int` or `Tuple(int)`, *optional*, defaults to 256):
            Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an
            integer is provided, then the input will be resized to (size, size). Only has an effect if `do_resize`
            is set to `True`.
        resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
            An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
            `PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an
            effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to crop the input at the center. If the input size is smaller than `crop_size` along any edge,
            the image is padded with 0's and then center cropped.
        crop_size (`int`, *optional*, defaults to 224):
            Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to
            `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the input with `image_mean` and `image_std`.
        image_mean (`List[int]`, defaults to `[0.5, 0.5, 0.5]`):
...
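For orientation, here is a minimal usage sketch of the feature extractor whose arguments are documented above. The argument values mirror the documented defaults; the sample image URL is illustrative and not part of this diff.

```python
from PIL import Image
import requests

from transformers import BeitFeatureExtractor

# Spell out the documented defaults explicitly.
feature_extractor = BeitFeatureExtractor(
    do_resize=True,
    size=256,
    do_center_crop=True,
    crop_size=224,
    do_normalize=True,
)

# Any PIL image works; this COCO sample URL is only an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 224, 224])
```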
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BEiT model."""

import collections.abc

@@ -56,12 +56,13 @@ class BeitModelOutputWithPooling(BaseModelOutputWithPooling):
        *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
        will be returned.
    hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
        shape `(batch_size, sequence_length, hidden_size)`.

        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
    attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
        heads.

@@ -547,15 +548,14 @@ class BeitPreTrainedModel(PreTrainedModel):
BEIT_START_DOCSTRING = r"""
    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
    as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
    behavior.

    Parameters:
        config ([`BeitConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""

BEIT_INPUTS_DOCSTRING = r"""
@@ -737,8 +737,9 @@ class BeitForMaskedImageModeling(BeitPreTrainedModel):
            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:

@@ -824,8 +825,9 @@ class BeitForImageClassification(BeitPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:

@@ -1158,8 +1160,8 @@ class BeitForSemanticSegmentation(BeitPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
            Ground truth semantic segmentation maps for computing the loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).

        Returns:
...
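As a companion to the BEiT docstrings above, a hedged sketch of the image-classification call they describe. The checkpoint name is an assumption; any BEiT image-classification checkpoint from the Hub would do.

```python
import torch
import requests
from PIL import Image

from transformers import BeitFeatureExtractor, BeitForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Illustrative checkpoint; substitute any BEiT classification checkpoint.
feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (batch_size, config.num_labels); labels, if provided, must lie
# in [0, ..., config.num_labels - 1] as the docstring above states.
predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```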
@@ -54,23 +54,24 @@ class FlaxBeitModelOutputWithPooling(FlaxBaseModelOutputWithPooling):
        *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
        will be returned.
    hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
        `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus
        the initial embedding outputs.
    attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
        the self-attention heads.
"""

BEIT_START_DOCSTRING = r"""
    This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading, saving and converting weights from PyTorch models)

    This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
    subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to
    general usage and behavior.

    Finally, this model supports inherent JAX features such as:

@@ -82,11 +83,10 @@ BEIT_START_DOCSTRING = r"""
    Parameters:
        config ([`BeitConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights.
        dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
            The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
            `jax.numpy.bfloat16` (on TPUs).

            This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
            specified all the computation will be performed with the given `dtype`.

@@ -94,8 +94,8 @@ BEIT_START_DOCSTRING = r"""
            **Note that this only specifies the dtype of the computation and does not influence the dtype of model
            parameters.**

            If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
            [`~FlaxPreTrainedModel.to_bf16`].
"""

BEIT_INPUTS_DOCSTRING = r"""
...
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BERT model configuration"""

from collections import OrderedDict
from typing import Mapping

@@ -53,20 +53,19 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BertModel`] or a [`TFBertModel`]. It is used to
    instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the BERT
    [bert-base-uncased](https://huggingface.co/bert-base-uncased) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`BertModel`] or [`TFBertModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
@@ -76,8 +75,8 @@ class BertConfig(PretrainedConfig):
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):

@@ -86,17 +85,17 @@ class BertConfig(PretrainedConfig):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`BertModel`] or [`TFBertModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
...
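To make the configuration arguments above concrete, a small sketch of building a `BertConfig` and a randomly initialized model from it. The values spelled out are the documented defaults.

```python
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    max_position_embeddings=512,
    type_vocab_size=2,
    layer_norm_eps=1e-12,
    position_embedding_type="absolute",
)

# Instantiating from a config creates random weights; use from_pretrained to load
# trained weights instead, as the docstring notes.
model = BertModel(config)
print(model.config.hidden_size)  # 768
```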
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch BERT model."""

import math

@@ -1130,7 +1130,7 @@ class BertForPreTraining(BertPreTrainedModel):
@add_start_docstrings(
    """Bert Model with a `language modeling` head on top for CLM fine-tuning.""", BERT_START_DOCSTRING
)
class BertLMHeadModel(BertPreTrainedModel):

@@ -1282,7 +1282,7 @@ class BertLMHeadModel(BertPreTrainedModel):
        return reordered_past


@add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING)
class BertForMaskedLM(BertPreTrainedModel):

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

@@ -1391,7 +1391,7 @@ class BertForMaskedLM(BertPreTrainedModel):
@add_start_docstrings(
    """Bert Model with a `next sentence prediction (classification)` head on top.""",
    BERT_START_DOCSTRING,
)
class BertForNextSentencePrediction(BertPreTrainedModel):
...
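Several of the hunks above only trim a trailing space from the strings passed to `add_start_docstrings`. As a rough illustration of why those strings matter, here is a simplified stand-in in the same spirit (not the library's exact implementation): the decorator concatenates the fragments in front of the decorated object's own docstring.

```python
def add_start_docstrings(*docstr):
    # Simplified stand-in: prepend shared docstring fragments to the decorated object.
    def decorator(obj):
        obj.__doc__ = "".join(docstr) + (obj.__doc__ or "")
        return obj
    return decorator


BERT_START_DOCSTRING = "\nShared BERT documentation goes here.\n"


@add_start_docstrings("Bert Model with a `language modeling` head on top.", BERT_START_DOCSTRING)
class DemoModel:
    """Class-specific documentation."""


# Prints the head description, then the shared block, then the class docstring.
print(DemoModel.__doc__)
```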
@@ -66,12 +66,13 @@ class FlaxBertForPreTrainingOutput(ModelOutput):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
        `(batch_size, sequence_length, hidden_size)`.

        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
    attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
        heads.

@@ -85,12 +86,12 @@ class FlaxBertForPreTrainingOutput(ModelOutput):
BERT_START_DOCSTRING = r"""
    This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading, saving and converting weights from PyTorch models)

    This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
    subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to
    general usage and behavior.

    Finally, this model supports inherent JAX features such as:
@@ -102,11 +103,10 @@ BERT_START_DOCSTRING = r"""
    Parameters:
        config ([`BertConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights.
        dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
            The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
            `jax.numpy.bfloat16` (on TPUs).

            This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
            specified all the computation will be performed with the given `dtype`.

@@ -114,11 +114,11 @@ BERT_START_DOCSTRING = r"""
            **Note that this only specifies the dtype of the computation and does not influence the dtype of model
            parameters.**

            If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
            [`~FlaxPreTrainedModel.to_bf16`].
        dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
            The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
            `jax.numpy.bfloat16` (on TPUs).

            This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
            specified all the computation will be performed with the given `dtype`.

@@ -126,8 +126,8 @@ BERT_START_DOCSTRING = r"""
            **Note that this only specifies the dtype of the computation and does not influence the dtype of model
            parameters.**

            If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
            [`~FlaxPreTrainedModel.to_bf16`].
"""

@@ -136,9 +136,8 @@ BERT_INPUTS_DOCSTRING = r"""
    input_ids (`numpy.ndarray` of shape `({0})`):
        Indices of input sequence tokens in the vocabulary.

        Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and
        [`PreTrainedTokenizer.__call__`] for details.

        [What are input IDs?](../glossary#input-ids)
    attention_mask (`numpy.ndarray` of shape `({0})`, *optional*):
@@ -149,15 +148,18 @@ BERT_INPUTS_DOCSTRING = r"""
        [What are attention masks?](../glossary#attention-mask)
    token_type_ids (`numpy.ndarray` of shape `({0})`, *optional*):
        Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
        1]`:

        - 0 corresponds to a *sentence A* token,
        - 1 corresponds to a *sentence B* token.

        [What are token type IDs?](../glossary#token-type-ids)
    position_ids (`numpy.ndarray` of shape `({0})`, *optional*):
        Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
        config.max_position_embeddings - 1]`.
    head_mask (`numpy.ndarray` of shape `({0})`, `optional):
        Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:

        - 1 indicates the head is **not masked**,
        - 0 indicates the head is **masked**.

@@ -909,7 +911,7 @@ class FlaxBertForMaskedLMModule(nn.Module):
        )


@add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING)
class FlaxBertForMaskedLM(FlaxBertPreTrainedModel):
    module_class = FlaxBertForMaskedLMModule

@@ -968,7 +970,7 @@ class FlaxBertForNextSentencePredictionModule(nn.Module):
@add_start_docstrings(
    """Bert Model with a `next sentence prediction (classification)` head on top.""",
    BERT_START_DOCSTRING,
)
class FlaxBertForNextSentencePrediction(FlaxBertPreTrainedModel):
...
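A hedged sketch of the `dtype` behaviour the Flax docstring above describes: the argument only sets the computation dtype, while parameter dtypes are changed separately via `to_fp16`/`to_bf16`. The checkpoint name and the masked sentence are illustrative.

```python
import jax.numpy as jnp
from transformers import BertTokenizerFast, FlaxBertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# dtype affects only the computation; parameters keep their own dtype unless
# converted explicitly with model.to_fp16(...) or model.to_bf16(...).
model = FlaxBertForMaskedLM.from_pretrained("bert-base-uncased", dtype=jnp.float32)

inputs = tokenizer("The capital of France is [MASK].", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```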
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" TF 2.0 BERT model."""

import math
import warnings

@@ -938,12 +938,13 @@ class TFBertForPreTrainingOutput(ModelOutput):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
        Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
        `(batch_size, sequence_length, hidden_size)`.

        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
    attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
        Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
        heads.

@@ -958,13 +959,13 @@ class TFBertForPreTrainingOutput(ModelOutput):
BERT_START_DOCSTRING = r"""
    This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
    etc.)

    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
    as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
    behavior.

    <Tip>
@@ -973,11 +974,11 @@ BERT_START_DOCSTRING = r"""
    - having all inputs as keyword arguments (like PyTorch models), or
    - having all inputs as a list, tuple or dict in the first positional arguments.

    This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all the
    tensors in the first argument of the model call function: `model(inputs)`.

    If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the
    first positional argument :

    - a single Tensor with `input_ids` only and nothing else: `model(inputs_ids)`
    - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:

@@ -990,8 +991,7 @@ BERT_START_DOCSTRING = r"""
    Args:
        config ([`BertConfig`]): Model configuration class with all the parameters of the model.
            Initializing with a config file does not load the weights associated with the model, only the
            configuration. Check out the [`~TFPreTrainedModel.from_pretrained`] method to load the model weights.
"""

BERT_INPUTS_DOCSTRING = r"""
@@ -999,9 +999,8 @@ BERT_INPUTS_DOCSTRING = r"""
    input_ids (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `({0})`):
        Indices of input sequence tokens in the vocabulary.

        Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
        [`PreTrainedTokenizer.encode`] for details.

        [What are input IDs?](../glossary#input-ids)
    attention_mask (`np.ndarray` or `tf.Tensor` of shape `({0})`, *optional*):

@@ -1012,14 +1011,16 @@ BERT_INPUTS_DOCSTRING = r"""
        [What are attention masks?](../glossary#attention-mask)
    token_type_ids (`np.ndarray` or `tf.Tensor` of shape `({0})`, *optional*):
        Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
        1]`:

        - 0 corresponds to a *sentence A* token,
        - 1 corresponds to a *sentence B* token.

        [What are token type IDs?](../glossary#token-type-ids)
    position_ids (`np.ndarray` or `tf.Tensor` of shape `({0})`, *optional*):
        Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
        config.max_position_embeddings - 1]`.

        [What are position IDs?](../glossary#position-ids)
    head_mask (`np.ndarray` or `tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):

@@ -1029,9 +1030,9 @@ BERT_INPUTS_DOCSTRING = r"""
        - 0 indicates the head is **masked**.
    inputs_embeds (`np.ndarray` or `tf.Tensor` of shape `({0}, hidden_size)`, *optional*):
        Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
        is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
        model's internal embedding lookup matrix.
    output_attentions (`bool`, *optional*):
        Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
        tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the

@@ -1041,8 +1042,8 @@ BERT_INPUTS_DOCSTRING = r"""
        more detail. This argument can be used only in eager mode, in graph mode the value in the config will be
        used instead.
    return_dict (`bool`, *optional*):
        Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple. This argument can be used
        in eager mode, in graph mode the value will always be set to True.
    training (`bool`, *optional*, defaults to `False``):
        Whether or not to use the model in training mode (some modules like dropout modules have different
        behaviors between training and evaluation).
@@ -1097,12 +1098,12 @@ class TFBertModel(TFBertPreTrainedModel):
        past_key_values (`Tuple[Tuple[tf.Tensor]]` of length `config.n_layers`)
            contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*, defaults to `True`):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`). Set to `False` during training, `True` during generation
        """
        inputs = input_processing(
            func=self.call,

@@ -1212,8 +1213,9 @@ class TFBertForPreTraining(TFBertPreTrainedModel, TFBertPreTrainingLoss):
    ) -> Union[TFBertForPreTrainingOutput, Tuple[tf.Tensor]]:
        r"""
        labels (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
        next_sentence_label (`tf.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
            (see `input_ids` docstring) Indices should be in `[0, 1]`:

@@ -1300,7 +1302,7 @@ class TFBertForPreTraining(TFBertPreTrainedModel, TFBertPreTrainingLoss):
)


@add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING)
class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
    # names with a '.' represents the authorized unexpected/missing layers when a TF model is loaded from a PT model
    _keys_to_ignore_on_load_unexpected = [

@@ -1353,8 +1355,9 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
    ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]:
        r"""
        labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
        """
        inputs = input_processing(
            func=self.call,

@@ -1483,14 +1486,15 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
        past_key_values (`Tuple[Tuple[tf.Tensor]]` of length `config.n_layers`)
            contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*, defaults to `True`):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`). Set to `False` during training, `True` during generation
        labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the cross entropy classification loss. Indices should be in `[0, ...,
            config.vocab_size - 1]`.
        """
        inputs = input_processing(
            func=self.call,
...@@ -1566,7 +1570,7 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss): ...@@ -1566,7 +1570,7 @@ class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
@add_start_docstrings( @add_start_docstrings(
"""Bert Model with a `next sentence prediction (classification)` head on top. """, """Bert Model with a `next sentence prediction (classification)` head on top.""",
BERT_START_DOCSTRING, BERT_START_DOCSTRING,
) )
class TFBertForNextSentencePrediction(TFBertPreTrainedModel, TFNextSentencePredictionLoss): class TFBertForNextSentencePrediction(TFBertPreTrainedModel, TFNextSentencePredictionLoss):
...@@ -1721,8 +1725,9 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific ...@@ -1721,8 +1725,9 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
r""" r"""
labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy). config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
""" """
inputs = input_processing( inputs = input_processing(
func=self.call, func=self.call,
...@@ -1830,8 +1835,8 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss): ...@@ -1830,8 +1835,8 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]:
r""" r"""
labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*): labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]` where `num_choices` is the size of the second dimension of the input tensors. (See Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]`
`input_ids` above) where `num_choices` is the size of the second dimension of the input tensors. (See `input_ids` above)
""" """
inputs = input_processing( inputs = input_processing(
func=self.call, func=self.call,
@@ -2096,12 +2101,12 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss):
r"""
start_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
inputs = input_processing(
func=self.call,
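A sketch of passing the span labels described above, assuming `bert-base-uncased`; the start/end values are arbitrary token indices, not gold annotations.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Who maintains Transformers?"
context = "The Transformers library is maintained by Hugging Face."
inputs = tokenizer(question, context, return_tensors="tf")

start_positions = tf.constant([10])  # illustrative token index of the answer start
end_positions = tf.constant([12])    # illustrative token index of the answer end

outputs = model(inputs, start_positions=start_positions, end_positions=end_positions)
print(outputs.loss, outputs.start_logits.shape, outputs.end_logits.shape)
```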
...
@@ -118,8 +118,8 @@ class BertTokenizer(PreTrainedTokenizer):
r"""
Construct a BERT tokenizer. Based on WordPiece.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -149,7 +149,8 @@ class BertTokenizer(PreTrainedTokenizer):
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this
[issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for `lowercase` (as in the original BERT).
@@ -318,8 +319,7 @@ class BertTokenizer(PreTrainedTokenizer):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
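A sketch of the token type ID mask returned here, assuming the `bert-base-uncased` vocabulary; the same mask comes back from `__call__` when a sentence pair is encoded directly.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ids_a = tokenizer.encode("How are you?", add_special_tokens=False)
ids_b = tokenizer.encode("Fine, thanks.", add_special_tokens=False)

# 0s cover [CLS] + sequence A + [SEP]; 1s cover sequence B + its trailing [SEP].
print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))

# Encoding the pair directly produces the same token_type_ids.
print(tokenizer("How are you?", "Fine, thanks.")["token_type_ids"])
```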
@@ -361,7 +361,8 @@ class BasicTokenizer(object):
tokenize_chinese_chars (`bool`, *optional*, defaults to `True`):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this
[issue](https://github.com/huggingface/transformers/issues/328)).
strip_accents: (`bool`, *optional*):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for `lowercase` (as in the original BERT).
...
@@ -118,8 +118,8 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
r"""
Construct a "fast" BERT tokenizer (backed by HuggingFace's *tokenizers* library). Based on WordPiece.
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -245,8 +245,7 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
...
@@ -12,19 +12,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BertGeneration model configuration"""
from ...configuration_utils import PretrainedConfig
class BertGenerationConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`BertGenerationPreTrainedModel`]. It is used to
instantiate a BertGeneration model according to the specified arguments, defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 50358):
@@ -39,8 +38,8 @@ class BertGenerationConfig(PretrainedConfig):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
@@ -53,10 +52,11 @@ class BertGenerationConfig(PretrainedConfig):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
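A sketch of instantiating the configuration with a few of the arguments documented above; the values (and the deliberately small model sizes) are illustrative only.

```python
from transformers import BertGenerationConfig, BertGenerationEncoder

config = BertGenerationConfig(
    hidden_size=128,                          # small sizes just to keep the sketch light
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
    hidden_act="gelu",                        # or "relu", "silu", "gelu_new"
    position_embedding_type="relative_key",   # instead of the default "absolute"
)
model = BertGenerationEncoder(config)          # randomly initialized from the config
print(model.config.position_embedding_type)
```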
...
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch BERT model specific for generation."""
import torch
@@ -195,19 +195,18 @@ class BertGenerationPreTrainedModel(PreTrainedModel):
BERT_GENERATION_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`BertGenerationConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
BERT_GENERATION_INPUTS_DOCSTRING = r"""
@@ -215,9 +214,8 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BertGenerationTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
[`PreTrainedTokenizer.encode`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
@@ -228,7 +226,8 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask)
position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
@@ -238,9 +237,9 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
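A sketch of producing the `input_ids` and `attention_mask` tensors described above and feeding them to the encoder; the checkpoint name is an assumption borrowed from the usual BertGeneration examples.

```python
import torch
from transformers import BertGenerationEncoder, BertGenerationTokenizer

checkpoint = "google/bert_for_seq_generation_L-24_bbc_encoder"  # assumed checkpoint
tokenizer = BertGenerationTokenizer.from_pretrained(checkpoint)
model = BertGenerationEncoder.from_pretrained(checkpoint)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")  # input_ids + attention_mask
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```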
@@ -264,14 +263,13 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
This model should be used when leveraging Bert or Roberta checkpoints for the [`EncoderDecoderModel`] class as
described in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461)
by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.
To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and
`add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
"""
def __init__(self, config):
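A sketch of the usage described above: plain BERT weights (assumed `bert-large-uncased`, as in the usual examples) are loaded into a BertGeneration encoder and decoder and paired in an [`EncoderDecoderModel`]; `is_decoder` and `add_cross_attention` follow the note above.

```python
from transformers import BertGenerationDecoder, BertGenerationEncoder, EncoderDecoderModel

# Leverage a plain BERT checkpoint for both halves of the seq2seq model.
encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
decoder = BertGenerationDecoder.from_pretrained(
    "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
)
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
```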
@@ -331,12 +329,12 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
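A sketch of the caching behaviour described above (`use_cache`/`past_key_values`), using a small randomly initialized decoder configuration so no checkpoint is needed; the token IDs are arbitrary.

```python
import torch
from transformers import BertGenerationConfig, BertGenerationDecoder

config = BertGenerationConfig(
    is_decoder=True, hidden_size=64, num_hidden_layers=2, num_attention_heads=2, intermediate_size=128
)
model = BertGenerationDecoder(config).eval()

input_ids = torch.tensor([[101, 7592, 2088]])  # arbitrary token IDs
with torch.no_grad():
    # First pass over the full prefix, requesting the key/value cache.
    outputs = model(input_ids, use_cache=True)
    past = outputs.past_key_values

    # Next step: feed only the newest token together with the cache.
    next_token = torch.tensor([[2003]])
    outputs = model(next_token, past_key_values=past, use_cache=True)

print(outputs.logits.shape)  # (1, 1, vocab_size): logits only for the new position
```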
@@ -443,7 +441,7 @@ class BertGenerationOnlyLMHead(nn.Module):
@add_start_docstrings(
"""BertGeneration Model with a `language modeling` head on top for CLM fine-tuning.""",
BERT_GENERATION_START_DOCSTRING,
)
class BertGenerationDecoder(BertGenerationPreTrainedModel):
@@ -500,12 +498,12 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel):
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
Returns:
...
@@ -42,8 +42,8 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
"""
Construct a BertGeneration tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -59,7 +59,9 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
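A sketch of forwarding SentencePiece options through `sp_model_kwargs`; the checkpoint name and the exact option values (`enable_sampling`, `nbest_size`, `alpha`) are assumptions based on the SentencePiece Python wrapper, not part of this diff.

```python
from transformers import BertGenerationTokenizer

tokenizer = BertGenerationTokenizer.from_pretrained(
    "google/bert_for_seq_generation_L-24_bbc_encoder",  # assumed checkpoint
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)

# With subword regularization enabled, repeated tokenizations of the same text may differ.
print(tokenizer.tokenize("sequence generation with sentencepiece"))
print(tokenizer.tokenize("sequence generation with sentencepiece"))
```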
...
@@ -13,7 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization classes for BERTweet"""
import html
@@ -69,8 +69,8 @@ class BertweetTokenizer(PreTrainedTokenizer):
"""
Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
@@ -94,8 +94,8 @@ class BertweetTokenizer(PreTrainedTokenizer):
<Tip>
When building a sequence using special tokens, this is not the token that is used for the end of sequence.
The token used is the `sep_token`.
</Tip>
...
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BigBird model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
@@ -30,13 +30,13 @@ BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BigBirdConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`BigBirdModel`]. It is used to instantiate a
BigBird model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the BigBird
[google/bigbird-roberta-base](https://huggingface.co/google/bigbird-roberta-base) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
@@ -52,8 +52,8 @@ class BigBirdConfig(PretrainedConfig):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
@@ -80,7 +80,8 @@ class BigBirdConfig(PretrainedConfig):
block_size (`int`, *optional*, defaults to 64):
Size of each block. Useful only when `attention_type == "block_sparse"`.
num_random_blocks (`int`, *optional*, defaults to 3):
Each query is going to attend to this many random blocks. Useful only when `attention_type ==
"block_sparse"`.
classifier_dropout (`float`, *optional*):
The dropout ratio for the classification head.
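A sketch of the sparse-attention arguments documented above; the values simply restate the documented defaults, and `attention_type` is assumed to accept `"block_sparse"` or `"original_full"`.

```python
from transformers import BigBirdConfig

config = BigBirdConfig(
    attention_type="block_sparse",  # assumed alternative: "original_full" (dense attention)
    block_size=64,                  # tokens per block
    num_random_blocks=3,            # random blocks each query block attends to
)
print(config.attention_type, config.block_size, config.num_random_blocks)
```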
@@ -92,14 +93,13 @@ class BigBirdConfig(PretrainedConfig):
>>> from transformers import BigBirdModel, BigBirdConfig
>>> # Initializing a BigBird google/bigbird-roberta-base style configuration
>>> configuration = BigBirdConfig()
>>> # Initializing a model from the google/bigbird-roberta-base style configuration
>>> model = BigBirdModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "big_bird"
...
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BigBird model."""
import math
@@ -1788,8 +1788,7 @@ BIG_BIRD_START_DOCSTRING = r"""
Parameters:
config ([`BigBirdConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
BIG_BIRD_INPUTS_DOCSTRING = r"""
@@ -1797,9 +1796,8 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
input_ids (`torch.LongTensor` of shape `({0})`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BigBirdTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
@@ -1810,14 +1808,16 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
@@ -1827,9 +1827,9 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert *input_ids* indices into associated vectors than the
model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
@@ -1856,12 +1856,13 @@ class BigBirdForPreTrainingOutput(ModelOutput):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -1889,12 +1890,13 @@ class BigBirdForQuestionAnsweringModelOutput(ModelOutput):
pooler_output (`torch.FloatTensor` of shape `(batch_size, 1)`):
Pooler output from BigBirdModel.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -1920,10 +1922,9 @@ class BigBirdModel(BigBirdPreTrainedModel):
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` argument and
`add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
"""
def __init__(self, config, add_pooling_layer=True):
@@ -2004,12 +2005,12 @@ class BigBirdModel(BigBirdPreTrainedModel):
- 0 for tokens that are **masked**.
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
@@ -2286,12 +2287,13 @@ class BigBirdForPreTraining(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the
loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
next_sentence_label (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the next sequence prediction (classification) loss. If specified, nsp loss will be
added to masked_lm loss. Input should be a sequence pair (see `input_ids` docstring). Indices should be in
`[0, 1]`:
- 0 indicates sequence B is a continuation of sequence A,
- 1 indicates sequence B is a random sequence.
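A sketch of building the two label tensors described above, assuming the `google/bigbird-roberta-base` checkpoint; which positions are ignored with `-100` is arbitrary here.

```python
import torch
from transformers import BigBirdForPreTraining, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForPreTraining.from_pretrained("google/bigbird-roberta-base")

inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")

labels = inputs["input_ids"].clone()
labels[:, :-3] = -100                     # -100 positions are ignored by the MLM loss
next_sentence_label = torch.tensor([0])   # 0: sequence B continues sequence A

outputs = model(**inputs, labels=labels, next_sentence_label=next_sentence_label)
print(outputs.loss)
```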
@@ -2354,7 +2356,7 @@ class BigBirdForPreTraining(BigBirdPreTrainedModel):
)
@add_start_docstrings("""BigBird Model with a `language modeling` head on top.""", BIG_BIRD_START_DOCSTRING)
class BigBirdForMaskedLM(BigBirdPreTrainedModel):
def __init__(self, config):
super().__init__(config)
@@ -2401,8 +2403,9 @@ class BigBirdForMaskedLM(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the
loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
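A sketch of the `-100` labelling convention above for the masked-LM head, assuming `google/bigbird-roberta-base`; the masked position is chosen arbitrarily.

```python
import torch
from transformers import BigBirdForMaskedLM, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base")

inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
input_ids = inputs["input_ids"]

masked_index = 6                               # illustrative position to hide
labels = torch.full_like(input_ids, -100)      # -100 everywhere -> ignored by the loss
labels[0, masked_index] = input_ids[0, masked_index]
input_ids[0, masked_index] = tokenizer.mask_token_id

outputs = model(**inputs, labels=labels)
print(outputs.loss)
```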
@@ -2455,7 +2458,7 @@ class BigBirdForMaskedLM(BigBirdPreTrainedModel):
@add_start_docstrings(
"""BigBird Model with a `language modeling` head on top for CLM fine-tuning.""", BIG_BIRD_START_DOCSTRING
)
class BigBirdForCausalLM(BigBirdPreTrainedModel):
@@ -2510,16 +2513,16 @@ class BigBirdForCausalLM(BigBirdPreTrainedModel):
- 0 for tokens that are **masked**.
past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
`[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are
ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
Returns:
@@ -2667,8 +2670,9 @@ class BigBirdForSequenceClassification(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), if
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -2764,7 +2768,8 @@ class BigBirdForMultipleChoice(BigBirdPreTrainedModel):
):
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
`input_ids` above)
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -2970,12 +2975,12 @@ class BigBirdForQuestionAnswering(BigBirdPreTrainedModel):
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
...
@@ -64,12 +64,13 @@ class FlaxBigBirdForPreTrainingOutput(ModelOutput):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -94,12 +95,13 @@ class FlaxBigBirdForQuestionAnsweringModelOutput(ModelOutput):
pooled_output (`jnp.ndarray` of shape `(batch_size, hidden_size)`):
pooled_output returned by FlaxBigBirdModel.
hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape
`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
@@ -114,12 +116,12 @@ class FlaxBigBirdForQuestionAnsweringModelOutput(ModelOutput):
BIG_BIRD_START_DOCSTRING = r"""
This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading, saving and converting weights from PyTorch models)
This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to
general usage and behavior.
Finally, this model supports inherent JAX features such as:
...@@ -131,11 +133,10 @@ BIG_BIRD_START_DOCSTRING = r""" ...@@ -131,11 +133,10 @@ BIG_BIRD_START_DOCSTRING = r"""
Parameters: Parameters:
config ([`BigBirdConfig`]): Model configuration class with all the parameters of the model. config ([`BigBirdConfig`]): Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only the Initializing with a config file does not load the weights associated with the model, only the
configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights.
model weights.
dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`): dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`):
The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on The data type of the computation. Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and
GPUs) and `jax.numpy.bfloat16` (on TPUs). `jax.numpy.bfloat16` (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If
specified all the computation will be performed with the given `dtype`. specified all the computation will be performed with the given `dtype`.
...@@ -143,8 +144,8 @@ BIG_BIRD_START_DOCSTRING = r""" ...@@ -143,8 +144,8 @@ BIG_BIRD_START_DOCSTRING = r"""
**Note that this only specifies the dtype of the computation and does not influence the dtype of model **Note that this only specifies the dtype of the computation and does not influence the dtype of model
parameters.** parameters.**
If you wish to change the dtype of the model parameters, see If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and
[`~FlaxPreTrainedModel.to_fp16`] and [`~FlaxPreTrainedModel.to_bf16`]. [`~FlaxPreTrainedModel.to_bf16`].
""" """
BIG_BIRD_INPUTS_DOCSTRING = r""" BIG_BIRD_INPUTS_DOCSTRING = r"""
...@@ -152,9 +153,8 @@ BIG_BIRD_INPUTS_DOCSTRING = r""" ...@@ -152,9 +153,8 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
input_ids (`numpy.ndarray` of shape `({0})`): input_ids (`numpy.ndarray` of shape `({0})`):
Indices of input sequence tokens in the vocabulary. Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BigBirdTokenizer`]. See Indices can be obtained using [`BigBirdTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for [`PreTrainedTokenizer.__call__`] for details.
details.
[What are input IDs?](../glossary#input-ids) [What are input IDs?](../glossary#input-ids)
attention_mask (`numpy.ndarray` of shape `({0})`, *optional*): attention_mask (`numpy.ndarray` of shape `({0})`, *optional*):
...@@ -165,15 +165,18 @@ BIG_BIRD_INPUTS_DOCSTRING = r""" ...@@ -165,15 +165,18 @@ BIG_BIRD_INPUTS_DOCSTRING = r"""
[What are attention masks?](../glossary#attention-mask) [What are attention masks?](../glossary#attention-mask)
token_type_ids (`numpy.ndarray` of shape `({0})`, *optional*): token_type_ids (`numpy.ndarray` of shape `({0})`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`:
- 0 corresponds to a *sentence A* token, - 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token. - 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids) [What are token type IDs?](../glossary#token-type-ids)
position_ids (`numpy.ndarray` of shape `({0})`, *optional*): position_ids (`numpy.ndarray` of shape `({0})`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.max_position_embeddings - 1]`. Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
head_mask (`numpy.ndarray` of shape `({0})`, `optional): Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`: config.max_position_embeddings - 1]`.
head_mask (`numpy.ndarray` of shape `({0})`, `optional):
Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:
- 1 indicates the head is **not masked**, - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**. - 0 indicates the head is **masked**.
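For reference (a short sketch, not part of this diff; the checkpoint name is only an assumption), these arguments are usually produced directly by the tokenizer and passed to the model:

>>> from transformers import BigBirdTokenizer, FlaxBigBirdModel

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
>>> model = FlaxBigBirdModel.from_pretrained("google/bigbird-roberta-base")

>>> # return_tensors="np" yields the numpy arrays the Flax model expects
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
>>> last_hidden_state = outputs.last_hidden_state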
...@@ -787,7 +790,8 @@ class FlaxBigBirdBlockSparseAttention(nn.Module):
        Args:
            from_blocked_mask: 3D Tensor of shape [batch_size, from_seq_length//from_block_size, from_block_size].
            to_blocked_mask: int32 Tensor of shape [batch_size, to_seq_length//to_block_size, to_block_size].
            broadcasted_rand_attn:
                [batch_size, num_attention_heads, from_seq_length//from_block_size-2, num_rand_blocks]
            num_attention_heads: int. Number of attention heads.
            num_random_blocks: int. Number of random chunks per row.
            batch_size: int. Batch size for computation.
...@@ -1713,7 +1717,7 @@ class FlaxBigBirdForMaskedLMModule(nn.Module):
)


@add_start_docstrings("""BigBird Model with a `language modeling` head on top.""", BIG_BIRD_START_DOCSTRING)
# Copied from transformers.models.bert.modeling_flax_bert.FlaxBertForMaskedLM with Bert->BigBird
class FlaxBigBirdForMaskedLM(FlaxBigBirdPreTrainedModel):
    module_class = FlaxBigBirdForMaskedLMModule
...
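A usage sketch for the class defined above (not part of this diff; the checkpoint and example sentence are assumptions):

>>> from transformers import BigBirdTokenizer, FlaxBigBirdForMaskedLM

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
>>> model = FlaxBigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-base")

>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="np")
>>> logits = model(**inputs).logits

>>> # position of the [MASK] token and the highest-scoring vocabulary id at that position
>>> mask_index = int((inputs["input_ids"][0] == tokenizer.mask_token_id).argmax())
>>> predicted_id = int(logits[0, mask_index].argmax(-1))
>>> tokenizer.decode([predicted_id])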
...@@ -48,8 +48,8 @@ class BigBirdTokenizer(PreTrainedTokenizer):
    """
    Construct a BigBird tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
...@@ -75,7 +75,9 @@ class BigBirdTokenizer(PreTrainedTokenizer):
            The token used for masking values. This is the token used when training this model with masked language
            modeling. This is the token which the model will try to predict.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
            to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
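For example (a sketch, not part of this diff; the sampling values shown are only illustrative assumptions), subword regularization can be switched on when the tokenizer is instantiated:

>>> from transformers import BigBirdTokenizer

>>> tokenizer = BigBirdTokenizer.from_pretrained(
...     "google/bigbird-roberta-base",
...     sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
... )

>>> # with sampling enabled, repeated calls may segment the same text differently
>>> tokenizer.tokenize("unbelievable")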
...@@ -259,8 +261,7 @@ class BigBirdTokenizer(PreTrainedTokenizer):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
...
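To illustrate the method excerpted above (a sketch, not part of this diff; the exact values depend on the tokenizer's special-token handling):

>>> from transformers import BigBirdTokenizer

>>> tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")

>>> ids_a = tokenizer.encode("How old are you?", add_special_tokens=False)
>>> ids_b = tokenizer.encode("I'm 6 years old", add_special_tokens=False)

>>> # typically zeros for the first segment (with its special tokens) and ones for the second
>>> token_type_ids = tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)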
...@@ -58,9 +58,10 @@ SPIECE_UNDERLINE = "▁"


class BigBirdTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" BigBird tokenizer (backed by HuggingFace's *tokenizers* library). Based on
    [Unigram](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models). This
    tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
...@@ -219,8 +220,7 @@ class BigBirdTokenizerFast(PreTrainedTokenizerFast):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
...
...@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BigBirdPegasus model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging
...@@ -30,13 +30,13 @@ BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BigBirdPegasusConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BigBirdPegasusModel`]. It is used to instantiate
    a BigBirdPegasus model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the BigBirdPegasus
    [google/bigbird-pegasus-large-arxiv](https://huggingface.co/google/bigbird-pegasus-large-arxiv) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
...@@ -58,8 +58,8 @@ class BigBirdPegasusConfig(PretrainedConfig):
        encoder_ffn_dim (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
        activation_function (`str` or `function`, *optional*, defaults to `"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
...@@ -74,23 +74,23 @@ class BigBirdPegasusConfig(PretrainedConfig):
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        attention_type (`str`, *optional*, defaults to `"block_sparse"`):
            Whether to use block sparse attention (with n complexity) as introduced in the paper or the original
            attention layer (with n^2 complexity) in the encoder. Possible values are `"original_full"` and
            `"block_sparse"`.
        use_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in the query, key and value projections.
        block_size (`int`, *optional*, defaults to 64):
            Size of each block. Useful only when `attention_type == "block_sparse"`.
        num_random_blocks (`int`, *optional*, defaults to 3):
            Each query is going to attend to this many random blocks. Useful only when `attention_type ==
            "block_sparse"`.
        scale_embeddings (`bool`, *optional*, defaults to `True`):
            Whether to rescale embeddings with (hidden_size ** 0.5).
...@@ -102,14 +102,13 @@ class BigBirdPegasusConfig(PretrainedConfig):
    >>> from transformers import BigBirdPegasusModel, BigBirdPegasusConfig

    >>> # Initializing a BigBirdPegasus bigbird-pegasus-base style configuration
    >>> configuration = BigBirdPegasusConfig()

    >>> # Initializing a model from the bigbird-pegasus-base style configuration
    >>> model = BigBirdPegasusModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    """

    model_type = "bigbird_pegasus"
    keys_to_ignore_at_inference = ["past_key_values"]
...
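Building on the example above (a sketch, not part of this diff), the sparse-attention parameters described earlier can also be set explicitly when the configuration is created:

>>> from transformers import BigBirdPegasusConfig, BigBirdPegasusForConditionalGeneration

>>> # block sparse attention in the encoder with 64-token blocks and 3 random blocks per query
>>> configuration = BigBirdPegasusConfig(attention_type="block_sparse", block_size=64, num_random_blocks=3)
>>> model = BigBirdPegasusForConditionalGeneration(configuration)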
...@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch BigBirdPegasus model."""

import copy
...@@ -1474,7 +1474,8 @@ class BigBirdPegasusDecoderLayer(nn.Module):
            hidden_states (`torch.FloatTensor`): input to the layer of shape *(seq_len, batch, embed_dim)*
            attention_mask (`torch.FloatTensor`): attention mask of size
                *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values.
            encoder_hidden_states (`torch.FloatTensor`):
                cross attention input to the layer of shape *(seq_len, batch, embed_dim)*
            encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
                *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values.
            layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
...@@ -1603,13 +1604,12 @@ class BigBirdPegasusPreTrainedModel(PreTrainedModel):

BIGBIRD_PEGASUS_START_DOCSTRING = r"""
    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its models (such as downloading or saving, resizing the input embeddings, etc.)

    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
    and behavior.

    Parameters:
        config ([`BigBirdPegasusConfig`]):
...@@ -1623,15 +1623,15 @@ BIGBIRD_PEGASUS_GENERATION_EXAMPLE = r"""
    >>> from transformers import PegasusTokenizer, BigBirdPegasusForConditionalGeneration, BigBirdPegasusConfig

    >>> model = BigBirdPegasusForConditionalGeneration.from_pretrained('google/bigbird-pegasus-large-arxiv')
    >>> tokenizer = PegasusTokenizer.from_pretrained('google/bigbird-pegasus-large-arxiv')

    >>> ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
    >>> inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=4096, return_tensors='pt', truncation=True)

    >>> # Generate Summary
    >>> summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
    >>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
"""

BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
...@@ -1640,9 +1640,8 @@ BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -1656,8 +1655,8 @@ BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
            Provide for translation and summarization training. By default, the model will create this tensor by
            shifting the `input_ids` to the right, following the paper.
        decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
            Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
            be used by default.

            If you want to change padding behavior, you should read
            [`modeling_bigbird_pegasus._prepare_decoder_inputs`] and modify to your needs. See diagram 1 in [the
...@@ -1670,33 +1669,35 @@ BIGBIRD_PEGASUS_INPUTS_DOCSTRING = r"""
            - 0 indicates the head is **masked**.

        encoder_outputs (`tuple(tuple(torch.FloatTensor))`, *optional*):
            Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`).
            `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, is a sequence of
            hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape
            `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
            blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
            than the model's internal embedding lookup matrix.
        decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded
            representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be
            input (see `past_key_values`). This is useful if you want more control over how to convert
            `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix.

            If `decoder_input_ids` and `decoder_inputs_embeds` are both unset, `decoder_inputs_embeds` takes the value
            of `inputs_embeds`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
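To make the `past_key_values` / `use_cache` mechanics above concrete (a sketch, not part of this diff; the checkpoint and variable names are assumptions), the cache from one forward pass can be fed back so only the newest decoder token is re-processed:

>>> import torch
>>> from transformers import PegasusTokenizer, BigBirdPegasusForConditionalGeneration

>>> tokenizer = PegasusTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")

>>> enc = tokenizer("My friends are cool but they eat too many carbs.", return_tensors="pt")
>>> decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

>>> out = model(**enc, decoder_input_ids=decoder_input_ids, use_cache=True)
>>> next_token = out.logits[:, -1:].argmax(-1)

>>> # pass only the new token together with the cache instead of the full decoder sequence
>>> out = model(**enc, decoder_input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)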
...@@ -1713,9 +1714,8 @@ BIGBIRD_PEGASUS_STANDALONE_INPUTS_DOCSTRING = r"""
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`ProphetNetTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -1792,9 +1792,8 @@ class BigBirdPegasusEncoder(BigBirdPegasusPreTrainedModel):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -1806,9 +1805,9 @@ class BigBirdPegasusEncoder(BigBirdPegasusPreTrainedModel):
                [What are attention masks?](../glossary#attention-mask)
            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
                than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
...@@ -2036,8 +2035,7 @@ class BigBirdPegasusEncoder(BigBirdPegasusPreTrainedModel):
class BigBirdPegasusDecoder(BigBirdPegasusPreTrainedModel):
    """
    Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`BigBirdPegasusDecoderLayer`].

    Args:
        config: BigBirdPegasusConfig
...@@ -2114,9 +2112,8 @@ class BigBirdPegasusDecoder(BigBirdPegasusPreTrainedModel):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -2151,19 +2148,20 @@ class BigBirdPegasusDecoder(BigBirdPegasusPreTrainedModel):
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of
                shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated
                vectors than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
...@@ -2504,7 +2502,8 @@ class BigBirdPegasusForConditionalGeneration(BigBirdPegasusPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:
...@@ -2647,7 +2646,8 @@ class BigBirdPegasusForSequenceClassification(BigBirdPegasusPreTrainedModel):
    ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if labels is not None:
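As an illustration of the `labels` argument described in this hunk (a sketch, not part of this diff; the checkpoint, `num_labels` and label value are assumptions):

>>> import torch
>>> from transformers import PegasusTokenizer, BigBirdPegasusForSequenceClassification

>>> tokenizer = PegasusTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> model = BigBirdPegasusForSequenceClassification.from_pretrained("google/bigbird-pegasus-large-arxiv", num_labels=2)

>>> inputs = tokenizer("My friends are cool but they eat too many carbs.", return_tensors="pt")
>>> # with num_labels > 1 a cross-entropy classification loss is computed
>>> outputs = model(**inputs, labels=torch.tensor([1]))
>>> loss, logits = outputs.loss, outputs.logits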
...@@ -2916,9 +2916,8 @@ class BigBirdPegasusForCausalLM(BigBirdPegasusPreTrainedModel):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`PegasusTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
...@@ -2947,25 +2946,24 @@ class BigBirdPegasusForCausalLM(BigBirdPegasusPreTrainedModel):
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of
                shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. The two additional
                tensors are only required when the model is used as a decoder in a Sequence to Sequence model.

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.
...
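To round off the causal-LM signature excerpted above (a sketch, not part of this diff; the checkpoint and the `add_cross_attention=False` flag are assumptions):

>>> from transformers import PegasusTokenizer, BigBirdPegasusForCausalLM

>>> tokenizer = PegasusTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
>>> model = BigBirdPegasusForCausalLM.from_pretrained("google/bigbird-pegasus-large-arxiv", add_cross_attention=False)

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

>>> # reusing the input ids as labels gives a standard next-token (causal) language modeling loss
>>> outputs = model(**inputs, labels=inputs["input_ids"], use_cache=True)
>>> loss, logits, past = outputs.loss, outputs.logits, outputs.past_key_values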