Unverified Commit 27b3031d authored by Sylvain Gugger, committed by GitHub

Mass conversion of documentation from rst to Markdown (#14866)

* Convert docstrings of all configurations and tokenizers

* Processors and fixes

* Last modeling files and fixes to models

* Pipeline modules

* Utils files

* Data submodule

* All the other files

* Style

* Missing examples

* Style again

* Fix copies

* Say bye bye to rst docstrings forever
parent 18587639
......@@ -82,11 +82,11 @@ class FlaubertTokenizer(XLMTokenizer):
- Moses preprocessing and tokenization.
- Normalizing all inputs text.
- The arguments `special_tokens` and the function `set_special_tokens` can be used to add additional symbols
(like "__classify__") to a vocabulary.
- The argument `do_lowercase` controls lower casing (automatically set for pretrained vocabularies).
This tokenizer inherits from [`XLMTokenizer`]. Please check the superclass for usage examples
and documentation regarding arguments.
"""
......
......@@ -29,63 +29,62 @@ FNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FNetConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`FNetModel`]. It is used to
instantiate an FNet model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the FNet [fnet-base](https://huggingface.co/google/fnet-base) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 32000):
Vocabulary size of the FNet model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`FNetModel`] or
[`TFFNetModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimension of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
`"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (`int`, *optional*, defaults to 4):
The vocabulary size of the `token_type_ids` passed when calling [`FNetModel`] or
[`TFFNetModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
use_tpu_fourier_optimizations (`bool`, *optional*, defaults to `False`):
Determines whether to use TPU optimized FFTs. If `True`, the model will favor axis-wise FFT
transforms. Set to `False` for GPU/CPU hardware, in which case n-dimensional FFTs are used.
tpu_short_seq_length (`int`, *optional*, defaults to 512):
The sequence length that is expected by the model when using TPUs. This will be used to initialize the DFT
matrix only when *use_tpu_fourier_optimizations* is set to `True` and the input sequence is shorter
than or equal to 4096 tokens.
Example:
```python
>>> from transformers import FNetModel, FNetConfig
>>> # Initializing a FNet fnet-base style configuration
>>> configuration = FNetConfig()
>>> # Initializing a model from the fnet-base style configuration
>>> model = FNetModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "fnet"
def __init__(
......
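As a hedged illustration of the TPU-related options documented above, a configuration could be built like this (the values are illustrative, not recommendations):

```python
from transformers import FNetConfig, FNetModel

# use_tpu_fourier_optimizations favors axis-wise FFTs (intended for TPUs);
# tpu_short_seq_length sizes the DFT matrix for the expected sequence length.
configuration = FNetConfig(
    use_tpu_fourier_optimizations=True,
    tpu_short_seq_length=512,
)
model = FNetModel(configuration)
```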
......@@ -45,53 +45,51 @@ SPIECE_UNDERLINE = "▁"
class FNetTokenizer(PreTrainedTokenizer):
"""
Construct an FNet tokenizer. Adapted from [`AlbertTokenizer`]. Based on [SentencePiece](https://github.com/google/sentencepiece). This tokenizer inherits from
[`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to this
superclass for more information regarding those methods.
Args:
vocab_file (`str`):
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (`bool`, *optional*, defaults to `False`):
Whether or not to lowercase the input when tokenizing.
remove_space (`bool`, *optional*, defaults to `True`):
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (`bool`, *optional*, defaults to `True`):
Whether or not to keep accents when tokenizing.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (`str`, *optional*, defaults to `"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (`str`, *optional*, defaults to `"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (`str`, *optional*, defaults to `"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
- `nbest_size = {0,1}`: No sampling is performed.
- `nbest_size > 1`: samples from the nbest_size results.
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from all hypotheses (lattice)
using forward-filtering-and-backward-sampling algorithm.
- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.
Attributes:
sp_model (`SentencePieceProcessor`):
The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
"""
vocab_files_names = VOCAB_FILES_NAMES
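As a hedged sketch of the `sp_model_kwargs` argument documented above, subword regularization could be enabled like this (the sampling values are arbitrary):

```python
from transformers import FNetTokenizer

# nbest_size=-1 samples from all hypotheses; alpha is the smoothing parameter.
tokenizer = FNetTokenizer.from_pretrained(
    "google/fnet-base",
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)
```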
......@@ -221,17 +219,17 @@ class FNetTokenizer(PreTrainedTokenizer):
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. An FNet sequence has the following format:
- single sequence: `[CLS] X [SEP]`
- pair of sequences: `[CLS] A [SEP] B [SEP]`
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
......@@ -244,18 +242,18 @@ class FNetTokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
......@@ -272,20 +270,22 @@ class FNetTokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. An FNet sequence
pair mask has the following format:
```python
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
"""
sep = [self.sep_token_id]
......
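A minimal standalone sketch (not the library implementation; the token ids are placeholders) of the two behaviors described in the docstrings above:

```python
from typing import List, Optional

CLS, SEP = 101, 102  # placeholder ids, not FNet's real vocabulary

def build_inputs(ids_0: List[int], ids_1: Optional[List[int]] = None) -> List[int]:
    # single sequence: [CLS] X [SEP]; pair of sequences: [CLS] A [SEP] B [SEP]
    if ids_1 is None:
        return [CLS] + ids_0 + [SEP]
    return [CLS] + ids_0 + [SEP] + ids_1 + [SEP]

def token_type_ids(ids_0: List[int], ids_1: Optional[List[int]] = None) -> List[int]:
    # 0s cover "[CLS] A [SEP]", 1s cover "B [SEP]"
    first = [0] * (len(ids_0) + 2)
    return first if ids_1 is None else first + [1] * (len(ids_1) + 1)
```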
......@@ -54,35 +54,34 @@ SPIECE_UNDERLINE = "▁"
class FNetTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" FNetTokenizer (backed by HuggingFace's `tokenizers` library). Adapted from
:class:`~transformers.AlbertTokenizerFast`. Based on `Unigram
<https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models>`__. This tokenizer
inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main methods. Users should
Construct a "fast" FNetTokenizer (backed by HuggingFace's *tokenizers* library). Adapted from
[`AlbertTokenizerFast`]. Based on [Unigram](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=unigram#models). This tokenizer
inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (`bool`, *optional*, defaults to `False`):
Whether or not to lowercase the input when tokenizing.
remove_space (`bool`, *optional*, defaults to `True`):
Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
keep_accents (`bool`, *optional*, defaults to `True`):
Whether or not to keep accents when tokenizing.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (`str`, *optional*, defaults to `"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (`str`, *optional*, defaults to `"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (`str`, *optional*, defaults to `"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
"""
......@@ -142,17 +141,17 @@ class FNetTokenizerFast(PreTrainedTokenizerFast):
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. An FNet sequence has the following format:
- single sequence: `[CLS] X [SEP]`
- pair of sequences: `[CLS] A [SEP] B [SEP]`
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
......@@ -167,21 +166,21 @@ class FNetTokenizerFast(PreTrainedTokenizerFast):
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An FNet
sequence pair mask has the following format:
```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
If `token_ids_1` is `None`, only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of ids.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
"""
sep = [self.sep_token_id]
......
......@@ -40,89 +40,89 @@ class DecoderConfig(PretrainedConfig):
class FSMTConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`FSMTModel`]. It is used to
instantiate an FSMT model according to the specified arguments, defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
langs (`List[str]`):
A list with source language and target_language (e.g., ['en', 'ru']).
src_vocab_size (`int`):
Vocabulary size of the encoder. Defines the number of different tokens that can be represented by the
`inputs_ids` passed to the forward method in the encoder.
tgt_vocab_size (`int`):
Vocabulary size of the decoder. Defines the number of different tokens that can be represented by the
`inputs_ids` passed to the forward method in the decoder.
d_model (`int`, *optional*, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (`int`, *optional*, defaults to 12):
Number of encoder layers.
decoder_layers (`int`, *optional*, defaults to 12):
Number of decoder layers.
encoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (`int`, *optional*, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (`str` or `Callable`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
max_position_embeddings (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_embedding (`bool`, *optional*, defaults to `True`):
Scale embeddings by dividing by sqrt(d_model).
bos_token_id (`int`, *optional*, defaults to 0):
Beginning of stream token id.
pad_token_id (`int`, *optional*, defaults to 1):
Padding token id.
eos_token_id (`int`, *optional*, defaults to 2):
End of stream token id.
decoder_start_token_id (`int`, *optional*):
This model starts decoding with `eos_token_id`
encoder_layerdrop (`float`, *optional*, defaults to 0.0):
Google "layerdrop arxiv", as it's not explainable in one line.
decoder_layerdrop (`float`, *optional*, defaults to 0.0):
Google "layerdrop arxiv", as it's not explainable in one line.
is_encoder_decoder (`bool`, *optional*, defaults to `True`):
Whether this is an encoder/decoder model.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie input and output embeddings.
num_beams (`int`, *optional*, defaults to 5):
Number of beams for beam search that will be used by default in the `generate` method of the model. 1
means no beam search.
length_penalty (`float`, *optional*, defaults to 1):
Exponential penalty to the length that will be used by default in the `generate` method of the model.
early_stopping (`bool`, *optional*, defaults to `False`):
Flag that will be used by default in the `generate` method of the model. Whether to stop the beam
search when at least `num_beams` sentences are finished per batch or not.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
forced_eos_token_id (`int`, *optional*, defaults to 2):
The id of the token to force as the last generated token when `max_length` is reached. Usually set to
`eos_token_id`.
Examples:
```python
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
```"""
model_type = "fsmt"
attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"}
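A hedged sketch of how the generation-related defaults documented above could be set explicitly (the vocabulary sizes are placeholders, not the real wmt19 values):

```python
from transformers import FSMTConfig

config = FSMTConfig(
    langs=["en", "ru"],
    src_vocab_size=32000,   # placeholder
    tgt_vocab_size=32000,   # placeholder
    num_beams=8,            # beam width used by generate() by default
    length_penalty=1.2,     # >1 mildly favors longer outputs
    early_stopping=True,    # stop once num_beams finished hypotheses exist per batch
)
```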
......@@ -208,10 +208,10 @@ class FSMTConfig(PretrainedConfig):
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default *to_dict()* from *PretrainedConfig*.
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
"""
output = copy.deepcopy(self.__dict__)
output["decoder"] = self.decoder.to_dict()
......
......@@ -140,39 +140,42 @@ class FSMTTokenizer(PreTrainedTokenizer):
- Moses preprocessing and tokenization.
- Normalizing all inputs text.
- The arguments `special_tokens` and the function `set_special_tokens` can be used to add additional symbols
(like "__classify__") to a vocabulary.
- The argument `langs` defines a pair of languages.
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
langs (`List[str]`):
A list of two languages to translate from and to, for instance `["en", "ru"]`.
src_vocab_file (`str`):
File containing the vocabulary for the source language.
tgt_vocab_file (`str`):
File containing the vocabulary for the target language.
merges_file (`str`):
File containing the merges.
do_lower_case (`bool`, *optional*, defaults to `False`):
Whether or not to lowercase the input when tokenizing.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
<Tip>
When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.
</Tip>
sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
"""
......@@ -398,17 +401,17 @@ class FSMTTokenizer(PreTrainedTokenizer):
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. A FAIRSEQ Transformer sequence has the following format:
- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s> B </s>`
Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""
sep = [self.sep_token_id]
......@@ -422,18 +425,18 @@ class FSMTTokenizer(PreTrainedTokenizer):
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
......@@ -452,21 +455,21 @@ class FSMTTokenizer(PreTrainedTokenizer):
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A FAIRSEQ
Transformer sequence pair mask has the following format:
```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An
......
......@@ -36,69 +36,68 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FunnelConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`FunnelModel`] or a
[`TFFunnelModel`]. It is used to instantiate a Funnel Transformer model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the Funnel Transformer [funnel-transformer/small](https://huggingface.co/funnel-transformer/small) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the Funnel transformer. Defines the number of different tokens that can be represented
by the `inputs_ids` passed when calling [`FunnelModel`] or
[`TFFunnelModel`].
block_sizes (`List[int]`, *optional*, defaults to `[4, 4, 4]`):
The sizes of the blocks used in the model.
block_repeats (`List[int]`, *optional*):
If passed along, each layer of each block is repeated the number of times indicated.
num_decoder_layers (`int`, *optional*, defaults to 2):
The number of layers in the decoder (when not using the base model).
d_model (`int`, *optional*, defaults to 768):
Dimensionality of the model's hidden states.
n_head (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
d_head (`int`, *optional*, defaults to 64):
Dimensionality of the model's heads.
d_inner (`int`, *optional*, defaults to 3072):
Inner dimension in the feed-forward blocks.
hidden_act (`str` or `callable`, *optional*, defaults to `"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability used between the two layers of the feed-forward blocks.
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (`int`, *optional*, defaults to 3):
The vocabulary size of the `token_type_ids` passed when calling [`FunnelModel`] or
[`TFFunnelModel`].
initializer_range (`float`, *optional*, defaults to 0.1):
The standard deviation of the *uniform initializer* for initializing all weight matrices in attention
layers.
initializer_std (`float`, *optional*):
The standard deviation of the *normal initializer* for initializing the embedding matrix and the weight of
linear layers. Will default to 1 for the embedding matrix and the value given by Xavier initialization for
linear layers.
layer_norm_eps (`float`, *optional*, defaults to 1e-9):
The epsilon used by the layer normalization layers.
pooling_type (`str`, *optional*, defaults to `"mean"`):
Possible values are `"mean"` or `"max"`. The way pooling is performed at the beginning of each block.
attention_type (`str`, *optional*, defaults to `"relative_shift"`):
Possible values are `"relative_shift"` or `"factorized"`. The former is faster on CPU/GPU while the
latter is faster on TPU.
separate_cls (`bool`, *optional*, defaults to `True`):
Whether or not to separate the cls token when applying pooling.
truncate_seq (`bool`, *optional*, defaults to `False`):
When using `separate_cls`, whether or not to truncate the last token when pooling, to avoid getting a
sequence length that is not a multiple of 2.
pool_q_only (`bool`, *optional*, defaults to `False`):
Whether or not to apply the pooling only to the query or to query, key and values for the attention layers.
"""
model_type = "funnel"
......
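As a hedged sketch of the pooling-related options documented above (the block sizes are illustrative, not the pretrained values):

```python
from transformers import FunnelConfig, FunnelModel

configuration = FunnelConfig(
    block_sizes=[2, 2, 2],            # three blocks of two layers each
    num_decoder_layers=2,
    pooling_type="mean",              # "mean" or "max" pooling at the start of each block
    attention_type="relative_shift",  # faster on CPU/GPU; "factorized" is faster on TPU
    pool_q_only=True,                 # pool only the query in the attention layers
)
model = FunnelModel(configuration)
```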
......@@ -1267,17 +1267,18 @@ class TFFunnelForPreTraining(TFFunnelPreTrainedModel):
r"""
Returns:
Examples:
```python
>>> from transformers import FunnelTokenizer, TFFunnelForPreTraining
>>> import tensorflow as tf
>>> tokenizer = FunnelTokenizer.from_pretrained('funnel-transformer/small')
>>> model = TFFunnelForPreTraining.from_pretrained('funnel-transformer/small')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> logits = model(inputs).logits
```"""
inputs = input_processing(
func=self.call,
config=self.config,
......
......@@ -59,10 +59,10 @@ class FunnelTokenizer(BertTokenizer):
r"""
Construct a Funnel Transformer tokenizer.
[`FunnelTokenizer`] is identical to [`BertTokenizer`] and runs end-to-end
tokenization: punctuation splitting and wordpiece.
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
"""
......@@ -113,21 +113,21 @@ class FunnelTokenizer(BertTokenizer):
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A Funnel
Transformer sequence pair mask has the following format:
```
2 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
"""
sep = [self.sep_token_id]
......
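A minimal standalone sketch (not the library code) of the Funnel-specific mask shown above, where the classifier token gets its own token type id of 2:

```python
from typing import List, Optional

def funnel_token_type_ids(ids_0: List[int], ids_1: Optional[List[int]] = None,
                          cls_token_type_id: int = 2) -> List[int]:
    # [CLS] -> 2, "A [SEP]" -> 0s, "B [SEP]" -> 1s
    first = [cls_token_type_id] + [0] * (len(ids_0) + 1)
    return first if ids_1 is None else first + [1] * (len(ids_1) + 1)
```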
......@@ -70,12 +70,12 @@ PRETRAINED_INIT_CONFIGURATION = {f"funnel-transformer/{name}": {"do_lower_case":
class FunnelTokenizerFast(BertTokenizerFast):
r"""
Construct a "fast" Funnel Transformer tokenizer (backed by HuggingFace's `tokenizers` library).
Construct a "fast" Funnel Transformer tokenizer (backed by HuggingFace's *tokenizers* library).
[`FunnelTokenizerFast`] is identical to [`BertTokenizerFast`] and runs
end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
......@@ -129,21 +129,21 @@ class FunnelTokenizerFast(BertTokenizerFast):
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A Funnel
Transformer sequence pair mask has the following format:
```
2 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
"""
sep = [self.sep_token_id]
......
......@@ -37,97 +37,98 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPT2Config(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`GPT2Model`] or a
[`TFGPT2Model`]. It is used to instantiate a GPT-2 model according to the specified arguments,
defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration
to that of the GPT-2 [small](https://huggingface.co/gpt2) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`GPT2Model`] or
[`TFGPT2Model`].
n_positions (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
n_embd (`int`, *optional*, defaults to 768):
Dimensionality of the embeddings and hidden states.
n_layer (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
n_inner (`int`, *optional*, defaults to None):
Dimensionality of the inner feed-forward layers. `None` will set it to 4 times n_embd
activation_function (`str`, *optional*, defaults to `"gelu"`):
Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (`int`, *optional*, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
The epsilon to use in the layer normalization layers.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (`string`, *optional*, defaults to `"cls_index"`):
Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`]
and [`TFGPT2DoubleHeadsModel`].
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models :class:`~transformers.GPT2DoubleHeadsModel`
and :class:`~transformers.TFGPT2DoubleHeadsModel`.
- `"last"`: Take the last token hidden state (like XLNet).
- `"first"`: Take the first token hidden state (like BERT).
- `"mean"`: Take the mean of all tokens hidden states.
- `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- `"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (`bool`, *optional*, defaults to `True`):
Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`]
and [`TFGPT2DoubleHeadsModel`].
Whether or not to add a projection after the vector extraction.
summary_activation (`str`, *optional*):
Argument used when doing sequence summary. Used in for the multiple choice head in
[`GPT2DoubleHeadsModel`].
Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (`bool`, *optional*, defaults to `True`):
Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`]
and [`TFGPT2DoubleHeadsModel`].
Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes.
summary_first_dropout (`float`, *optional*, defaults to 0.1):
Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`]
and [`TFGPT2DoubleHeadsModel`].
The dropout ratio to be used after the projection and activation.
scale_attn_weights (`bool`, *optional*, defaults to `True`):
Scale attention weights by dividing by sqrt(hidden_size).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
scale_attn_by_inverse_layer_idx (`bool`, *optional*, defaults to `False`):
Whether to additionally scale attention weights by `1 / layer_idx + 1`.
reorder_and_upcast_attn (`bool`, *optional*, defaults to `False`):
Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention
dot-product/softmax to float() when training with mixed precision.
Example:
```python
>>> from transformers import GPT2Model, GPT2Config
>>> # Initializing a GPT2 configuration
>>> configuration = GPT2Config()
>>> # Initializing a model from the configuration
>>> model = GPT2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "gpt2"
keys_to_ignore_at_inference = ["past_key_values"]
......
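A hedged sketch writing out the sequence-summary options documented above with their stated defaults, as they would be used by [`GPT2DoubleHeadsModel`]:

```python
from transformers import GPT2Config, GPT2DoubleHeadsModel

configuration = GPT2Config(
    summary_type="cls_index",     # summarize at the classification token position
    summary_use_proj=True,        # add a projection after extracting the summary vector
    summary_activation=None,      # no activation on the projection output
    summary_proj_to_labels=True,  # project to config.num_labels classes
    summary_first_dropout=0.1,    # dropout applied before the projection
)
model = GPT2DoubleHeadsModel(configuration)
```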
......@@ -642,17 +642,19 @@ PARALLELIZE_DOCSTRING = r"""
DEPARALLELIZE_DOCSTRING = r"""
Moves the model to cpu from a model parallel state.
Example:
```python
# On a 4 GPU machine with gpt2-large:
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7],
1: [8, 9, 10, 11, 12, 13, 14, 15],
2: [16, 17, 18, 19, 20, 21, 22, 23],
3: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]}
model.parallelize(device_map) # Splits the model across several devices
model.deparallelize() # Put the model back on cpu and cleans memory by calling torch.cuda.empty_cache()
```
"""
......
......@@ -108,42 +108,43 @@ class GPT2Tokenizer(PreTrainedTokenizer):
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently whether it is at the beginning of the sentence (without space) or not:
```
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]
```
You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
<Tip>
When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first
one).
</Tip>
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (`str`):
Path to the vocabulary file.
merges_file (`str`):
Path to the merges file.
errors (`str`, *optional*, defaults to `"replace"`):
Paradigm to follow when decoding bytes to UTF-8. See [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
bos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
bos_token (`str`, *optional*, defaults to `<|endoftext|>`):
The beginning of sequence token.
eos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
The end of sequence token.
add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`False`):
add_prefix_space (`bool`, *optional*, defaults to `False`):
Whether or not to add an initial space to the input. This allows the leading word to be treated just like any
other word (the GPT2 tokenizer detects the beginning of a word by the preceding space).
"""
......
......@@ -69,51 +69,52 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
class GPT2TokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" GPT-2 tokenizer (backed by HuggingFace's `tokenizers` library). Based on byte-level
Construct a "fast" GPT-2 tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
Byte-Pair-Encoding.
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
be encoded differently depending on whether it is at the beginning of the sentence (without a space) or not:
::
>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]
```
>>> from transformers import GPT2TokenizerFast
>>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
>>> tokenizer("Hello world")['input_ids']
[15496, 995]
>>> tokenizer(" Hello world")['input_ids']
[18435, 995]
```
You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer, but since
You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
the model was not pretrained this way, it might yield a decrease in performance.
.. note::
When used with ``is_split_into_words=True``, this tokenizer needs to be instantiated with
``add_prefix_space=True``.
<Tip>
When used with `is_split_into_words=True`, this tokenizer needs to be instantiated with
`add_prefix_space=True`.
</Tip>
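To make the tip above concrete, here is a minimal sketch of pre-tokenized input with `add_prefix_space=True` (same `gpt2` checkpoint as in the example above):

```python
from transformers import GPT2TokenizerFast

# Pre-split input (is_split_into_words=True) requires the tokenizer to be
# instantiated with add_prefix_space=True.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
encoding = tokenizer(["Hello", "world"], is_split_into_words=True)
print(encoding["input_ids"])  # each word gets a leading space, so roughly [18435, 995]
```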
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
vocab_file (:obj:`str`):
vocab_file (`str`):
Path to the vocabulary file.
merges_file (:obj:`str`):
merges_file (`str`):
Path to the merges file.
errors (:obj:`str`, `optional`, defaults to :obj:`"replace"`):
Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
<https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
unk_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
errors (`str`, *optional*, defaults to `"replace"`):
Paradigm to follow when decoding bytes to UTF-8. See [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
bos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
bos_token (`str`, *optional*, defaults to `<|endoftext|>`):
The beginning of sequence token.
eos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
The end of sequence token.
add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`False`):
add_prefix_space (`bool`, *optional*, defaults to `False`):
Whether or not to add an initial space to the input. This allows the leading word to be treated just like any
other word (the GPT2 tokenizer detects the beginning of a word by the preceding space).
trim_offsets (:obj:`bool`, `optional`, defaults to :obj:`True`):
trim_offsets (`bool`, *optional*, defaults to `True`):
Whether or not the post-processing step should trim offsets to avoid including whitespaces.
"""
......
......@@ -33,66 +33,65 @@ GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPTNeoConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.GPTNeoModel`. It is used to
This is the configuration class to store the configuration of a [`GPTNeoModel`]. It is used to
instantiate a GPT Neo model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the GPTNeo `gpt-neo-1.3B
<https://huggingface.co/EleutherAI/gpt-neo-1.3B>`__ architecture.
configuration with the defaults will yield a similar configuration to that of the GPTNeo [gpt-neo-1.3B](https://huggingface.co/EleutherAI/gpt-neo-1.3B) architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 50257):
vocab_size (`int`, *optional*, defaults to 50257):
Vocabulary size of the GPT Neo model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.GPTNeoModel`. Vocabulary size of the model.
Defines the different tokens that can be represented by the `inputs_ids` passed to the forward method of
:class:`~transformers.GPTNeoModel`.
attention_types (:obj:`List`, `optional`, defaults to :obj:`[[["global", "local"], 12]]`):
The type of attention for each layer in a :obj:`List` of the following format :obj:`[[["attention_type"],
num_layerss]]` e.g. for a 24 layer model :obj:`[[["global"], 24]]` or :obj:`[[["global", "local"], 12]]`
Choose the value of ``attention_type`` from :obj:`["global", "local"]`
hidden_size (:obj:`int`, `optional`, defaults to 2048):
`inputs_ids` passed when calling [`GPTNeoModel`].
attention_types (`List`, *optional*, defaults to `[[["global", "local"], 12]]`):
The type of attention for each layer in a `List` of the following format `[[["attention_type"], num_layers]]`, e.g. for a 24-layer model `[[["global"], 24]]` or `[[["global", "local"], 12]]`.
Choose the value of `attention_type` from `["global", "local"]` (a short sketch follows the configuration example below).
hidden_size (`int`, *optional*, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer.
num_layers (:obj:`int`, `optional`, defaults to 24):
num_layers (`int`, *optional*, defaults to 24):
Number of hidden layers in the Transformer encoder.
num_heads (:obj:`int`, `optional`, defaults to 16):
num_heads (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 8192):
intermediate_size (`int`, *optional*, defaults to 8192):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu_new"`):
activation_function (`str` or `function`, *optional*, defaults to `"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
embed_dropout (:obj:`float`, `optional`, defaults to 0.0):
`"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
embed_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 2048):
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.GPTNeoModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling [`GPTNeoModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
The epsilon used by the layer normalization layers.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if ``config.is_decoder=True``.
relevant if `config.is_decoder=True`.
Example::
Example:
>>> from transformers import GPTNeoModel, GPTNeoConfig
```python
>>> from transformers import GPTNeoModel, GPTNeoConfig
>>> # Initializing a GPTNeo EleutherAI/gpt-neo-1.3B style configuration
>>> configuration = GPTNeoConfig()
>>> # Initializing a GPTNeo EleutherAI/gpt-neo-1.3B style configuration
>>> configuration = GPTNeoConfig()
>>> # Initializing a model from the EleutherAI/gpt-neo-1.3B style configuration
>>> model = GPTNeoModel(configuration)
>>> # Initializing a model from the EleutherAI/gpt-neo-1.3B style configuration
>>> model = GPTNeoModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
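As a short sketch of the `attention_types` format described above (this just spells out the default alternating pattern, not an additional API):

```python
from transformers import GPTNeoConfig

# [[["global", "local"], 12]] means the pair ["global", "local"] repeated 12 times,
# i.e. 24 layers alternating between global and local attention.
config = GPTNeoConfig(num_layers=24, attention_types=[[["global", "local"], 12]])
```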
model_type = "gpt_neo"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {"num_attention_heads": "num_heads", "num_hidden_layers": "num_layers"}
......
......@@ -28,60 +28,60 @@ GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPTJConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.GPTJModel`. It is used to
This is the configuration class to store the configuration of a [`GPTJModel`]. It is used to
instantiate a GPT-J model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the GPT-J `gpt-j-6B
<https://huggingface.co/EleutherAI/gpt-j-6B>`__ architecture. Configuration objects inherit from
:class:`~transformers.PretrainedConfig` and can be used to control the model outputs. Read the documentation from
:class:`~transformers.PretrainedConfig` for more information.
configuration with the defaults will yield a similar configuration to that of the GPT-J [gpt-j-6B](https://huggingface.co/EleutherAI/gpt-j-6B) architecture. Configuration objects inherit from
[`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from
[`PretrainedConfig`] for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 50400):
vocab_size (`int`, *optional*, defaults to 50400):
Vocabulary size of the GPT-J model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.GPTJModel`.
n_positions (:obj:`int`, `optional`, defaults to 2048):
`inputs_ids` passed when calling [`GPTJModel`].
n_positions (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
n_embd (:obj:`int`, `optional`, defaults to 4096):
n_embd (`int`, *optional*, defaults to 4096):
Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, `optional`, defaults to 28):
n_layer (`int`, *optional*, defaults to 28):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, `optional`, defaults to 16):
n_head (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
rotary_dim (:obj:`int`, `optional`, defaults to 64):
rotary_dim (`int`, *optional*, defaults to 64):
Number of dimensions in the embedding that Rotary Position Embedding is applied to.
n_inner (:obj:`int`, `optional`, defaults to None):
Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times n_embd
activation_function (:obj:`str`, `optional`, defaults to :obj:`"gelu_new"`):
Activation function, to be selected in the list :obj:`["relu", "silu", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
n_inner (`int`, *optional*, defaults to None):
Dimensionality of the inner feed-forward layers. `None` will set it to 4 times `n_embd`.
activation_function (`str`, *optional*, defaults to `"gelu_new"`):
Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
embd_pdrop (`int`, *optional*, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
attn_pdrop (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
The epsilon to use in the layer normalization layers.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_attn_weights (:obj:`bool`, `optional`, defaults to :obj:`True`):
scale_attn_weights (`bool`, *optional*, defaults to `True`):
Scale attention weights by dividing by sqrt(hidden_size).
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
Example::
Example:
>>> from transformers import GPTJModel, GPTJConfig
```python
>>> from transformers import GPTJModel, GPTJConfig
>>> # Initializing a GPT-J 6B configuration
>>> configuration = GPTJConfig()
>>> # Initializing a GPT-J 6B configuration
>>> configuration = GPTJConfig()
>>> # Initializing a model from the configuration
>>> model = GPTJModel(configuration)
>>> # Initializing a model from the configuration
>>> model = GPTJModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "gptj"
attribute_map = {
"max_position_embeddings": "n_positions",
......
......@@ -424,15 +424,18 @@ PARALLELIZE_DOCSTRING = r"""
DEPARALLELIZE_DOCSTRING = r"""
Moves the model to CPU from a model parallel state.
Example::
# On a 4 GPU machine with gpt-j-6B:
model = GPTJForCausalLM.from_pretrained('EleutherAI/gpt-j-6B')
device_map = {0: [0, 1, 2, 3, 4, 5, 6],
1: [7, 8, 9, 10, 11, 12, 13],
2: [14, 15, 16, 17, 18, 19, 20],
3: [21, 22, 23, 24, 25, 26, 27]}
model.parallelize(device_map) # Splits the model across several devices
model.deparallelize() # Put the model back on cpu and cleans memory by calling torch.cuda.empty_cache()
Example:
```python
# On a 4 GPU machine with gpt-j-6B:
model = GPTJForCausalLM.from_pretrained('EleutherAI/gpt-j-6B')
device_map = {0: [0, 1, 2, 3, 4, 5, 6],
1: [7, 8, 9, 10, 11, 12, 13],
2: [14, 15, 16, 17, 18, 19, 20],
3: [21, 22, 23, 24, 25, 26, 27]}
model.parallelize(device_map) # Splits the model across several devices
model.deparallelize() # Put the model back on cpu and cleans memory by calling torch.cuda.empty_cache()
```
"""
......
......@@ -49,7 +49,7 @@ class HerbertTokenizer(XLMTokenizer):
- Such pretokenized input is BPE subtokenized
This tokenizer inherits from :class:`~transformers.XLMTokenizer` which contains most of the methods. Users should
This tokenizer inherits from [`XLMTokenizer`] which contains most of the methods. Users should
refer to the superclass for more information regarding methods.
"""
......
......@@ -39,20 +39,20 @@ PRETRAINED_INIT_CONFIGURATION = {}
class HerbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "Fast" BPE tokenizer for HerBERT (backed by HuggingFace's `tokenizers` library).
Construct a "Fast" BPE tokenizer for HerBERT (backed by HuggingFace's *tokenizers* library).
Peculiarities:
- uses BERT's pre-tokenizer: BertPreTokenizer splits tokens on spaces, and also on punctuation. Each occurrence of
a punctuation character will be treated separately.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
vocab_file (`str`):
Path to the vocabulary file.
merges_file (:obj:`str`):
merges_file (`str`):
Path to the merges file.
"""
......@@ -94,17 +94,17 @@ class HerbertTokenizerFast(PreTrainedTokenizerFast):
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. A HerBERT sequence, like a BERT sequence, has the following format:
- single sequence: ``<s> X </s>``
- pair of sequences: ``<s> A </s> B </s>``
- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s> B </s>`
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
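For instance, a small sketch of the resulting layout (the `allegro/herbert-base-cased` checkpoint name and the dummy ids are only illustrative assumptions):

```python
from transformers import HerbertTokenizerFast

tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-base-cased")
# Single sequence: <s> X </s>
single = tokenizer.build_inputs_with_special_tokens([5, 6, 7])
# Pair of sequences: <s> A </s> B </s>
pair = tokenizer.build_inputs_with_special_tokens([5, 6, 7], [8, 9])
print(single, pair)  # the dummy ids wrapped with the cls/sep token ids
```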
"""
cls = [self.cls_token_id]
......@@ -119,18 +119,18 @@ class HerbertTokenizerFast(PreTrainedTokenizerFast):
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
special tokens using the tokenizer `prepare_for_model` method.
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
......@@ -148,19 +148,19 @@ class HerbertTokenizerFast(PreTrainedTokenizerFast):
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A HerBERT
sequence pair mask, like BERT's, has the following format:
::
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
```
Args:
token_ids_0 (:obj:`List[int]`):
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
`List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given
sequence(s).
"""
sep = [self.sep_token_id]
......
......@@ -28,129 +28,126 @@ HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class HubertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.HubertModel`. It is used to
This is the configuration class to store the configuration of a [`HubertModel`]. It is used to
instantiate a Hubert model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the Hubert
`facebook/hubert-base-ls960 <https://huggingface.co/facebook/hubert-base-ls960>`__ architecture.
[facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 32):
vocab_size (`int`, *optional*, defaults to 32):
Vocabulary size of the Hubert model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.HubertModel`. Vocabulary size of the model.
Defines the different tokens that can be represented by the `inputs_ids` passed to the forward method of
:class:`~transformers.HubertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
`inputs_ids` passed when calling [`HubertModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
hidden_dropout(:obj:`float`, `optional`, defaults to 0.1):
`"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
hidden_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout(:obj:`float`, `optional`, defaults to 0.1):
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
final_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilitiy for the final projection layer of :class:`Wav2Vec2ForCTC`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
final_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for the final projection layer of [`Wav2Vec2ForCTC`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
feat_extract_norm (:obj:`str`, `optional`, defaults to :obj:`"group"`):
The norm to be applied to 1D convolutional layers in feature extractor. One of :obj:`"group"` for group
normalization of only the first 1D convolutional layer or :obj:`"layer"` for layer normalization of all 1D
feat_extract_norm (`str`, *optional*, defaults to `"group"`):
The norm to be applied to 1D convolutional layers in feature extractor. One of `"group"` for group
normalization of only the first 1D convolutional layer or `"layer"` for layer normalization of all 1D
convolutional layers.
feat_proj_dropout (:obj:`float`, `optional`, defaults to 0.0):
feat_proj_dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for output of the feature extractor.
feat_proj_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`True`):
feat_proj_layer_norm (`bool`, *optional*, defaults to `True`):
Whether to apply LayerNorm to the output of the feature extractor.
feat_extract_activation (:obj:`str, `optional`, defaults to :obj:`"gelu"`):
feat_extract_activation (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the 1D convolutional layers of the feature
extractor. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
conv_dim (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(512, 512, 512, 512, 512, 512, 512)`):
extractor. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
conv_dim (`Tuple[int]`, *optional*, defaults to `(512, 512, 512, 512, 512, 512, 512)`):
A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
feature extractor. The length of `conv_dim` defines the number of 1D convolutional layers.
conv_stride (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(5, 2, 2, 2, 2, 2, 2)`):
feature extractor. The length of *conv_dim* defines the number of 1D convolutional layers.
conv_stride (`Tuple[int]`, *optional*, defaults to `(5, 2, 2, 2, 2, 2, 2)`):
A tuple of integers defining the stride of each 1D convolutional layer in the feature extractor. The length
of `conv_stride` defines the number of convolutional layers and has to match the the length of `conv_dim`.
conv_kernel (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(10, 3, 3, 3, 3, 3, 3)`):
of *conv_stride* defines the number of convolutional layers and has to match the length of *conv_dim*.
conv_kernel (`Tuple[int]`, *optional*, defaults to `(10, 3, 3, 3, 3, 3, 3)`):
A tuple of integers defining the kernel size of each 1D convolutional layer in the feature extractor. The
length of `conv_kernel` defines the number of convolutional layers and has to match the the length of
`conv_dim`.
conv_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
length of *conv_kernel* defines the number of convolutional layers and has to match the length of
*conv_dim*.
conv_bias (`bool`, *optional*, defaults to `False`):
Whether the 1D convolutional layers have a bias.
num_conv_pos_embeddings (:obj:`int`, `optional`, defaults to 128):
num_conv_pos_embeddings (`int`, *optional*, defaults to 128):
Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
embeddings layer.
num_conv_pos_embedding_groups (:obj:`int`, `optional`, defaults to 16):
num_conv_pos_embedding_groups (`int`, *optional*, defaults to 16):
Number of groups of 1D convolutional positional embeddings layer.
do_stable_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether do apply `stable` layer norm architecture of the Transformer encoder. ``do_stable_layer_norm is
True`` corresponds to applying layer norm before the attention layer, whereas ``do_stable_layer_norm is
False`` corresponds to applying layer norm after the attention layer.
apply_spec_augment (:obj:`bool`, `optional`, defaults to :obj:`True`):
do_stable_layer_norm (`bool`, *optional*, defaults to `False`):
Whether to apply the *stable* layer norm architecture of the Transformer encoder. `do_stable_layer_norm is True` corresponds to applying layer norm before the attention layer, whereas `do_stable_layer_norm is False` corresponds to applying layer norm after the attention layer.
apply_spec_augment (`bool`, *optional*, defaults to `True`):
Whether to apply *SpecAugment* data augmentation to the outputs of the feature extractor. For reference see
`SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
<https://arxiv.org/abs/1904.08779>`__.
mask_time_prob (:obj:`float`, `optional`, defaults to 0.05):
[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779).
mask_time_prob (`float`, *optional*, defaults to 0.05):
Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking
procedure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis (a short
numeric sketch follows the configuration example below). If reasoning from the probability of each feature vector
to be chosen as the start of the vector span to be
masked, `mask_time_prob` should be ``prob_vector_start*mask_time_length``. Note that overlap may decrease
the actual percentage of masked vectors. This is only relevant if ``apply_spec_augment is True``.
mask_time_length (:obj:`int`, `optional`, defaults to 10):
masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease
the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is True`.
mask_time_length (`int`, *optional*, defaults to 10):
Length of vector span along the time axis.
mask_time_min_masks (:obj:`int`, `optional`, defaults to 2),:
The minimum number of masks of length ``mask_feature_length`` generated along the time axis, each time
step, irrespectively of ``mask_feature_prob``. Only relevant if
mask_time_min_masks (`int`, *optional*, defaults to 2):
The minimum number of masks of length `mask_feature_length` generated along the time axis, each time
step, irrespectively of `mask_feature_prob`. Only relevant if
`mask_time_prob*len(time_axis)/mask_time_length < mask_time_min_masks`
mask_feature_prob (:obj:`float`, `optional`, defaults to 0.0):
mask_feature_prob (`float`, *optional*, defaults to 0.0):
Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The
masking procedure generates `mask_feature_prob*len(feature_axis)/mask_time_length` independent masks over
the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector
span to be masked, `mask_feature_prob` should be ``prob_vector_start*mask_feature_length``. Note that
overlap may decrease the actual percentage of masked vectors. This is only relevant if ``apply_spec_augment
is True``.
mask_feature_length (:obj:`int`, `optional`, defaults to 10):
span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that
overlap may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is True`.
mask_feature_length (`int`, *optional*, defaults to 10):
Length of vector span along the feature axis.
mask_feature_min_masks (:obj:`int`, `optional`, defaults to 0),:
The minimum number of masks of length ``mask_feature_length`` generated along the feature axis, each time
step, irrespectively of ``mask_feature_prob``. Only relevant if
mask_feature_min_masks (`int`, *optional*, defaults to 0):
The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time
step, irrespectively of `mask_feature_prob`. Only relevant if
`mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`
ctc_loss_reduction (:obj:`str`, `optional`, defaults to :obj:`"sum"`):
Specifies the reduction to apply to the output of ``torch.nn.CTCLoss``. Only relevant when training an
instance of :class:`~transformers.HubertForCTC`.
ctc_zero_infinity (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to zero infinite losses and the associated gradients of ``torch.nn.CTCLoss``. Infinite losses
ctc_loss_reduction (`str`, *optional*, defaults to `"sum"`):
Specifies the reduction to apply to the output of `torch.nn.CTCLoss`. Only relevant when training an
instance of [`HubertForCTC`].
ctc_zero_infinity (`bool`, *optional*, defaults to `False`):
Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses
mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an
instance of :class:`~transformers.HubertForCTC`.
use_weighted_layer_sum (:obj:`bool`, `optional`, defaults to :obj:`False`):
instance of [`HubertForCTC`].
use_weighted_layer_sum (`bool`, *optional*, defaults to `False`):
Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an
instance of :class:`~transformers.HubertForSequenceClassification`.
classifier_proj_size (:obj:`int`, `optional`, defaults to 256):
instance of [`HubertForSequenceClassification`].
classifier_proj_size (`int`, *optional*, defaults to 256):
Dimensionality of the projection before token mean-pooling for classification.
Example::
Example:
>>> from transformers import HubertModel, HubertConfig
```python
>>> from transformers import HubertModel, HubertConfig
>>> # Initializing a Hubert facebook/hubert-base-ls960 style configuration
>>> configuration = HubertConfig()
>>> # Initializing a Hubert facebook/hubert-base-ls960 style configuration
>>> configuration = HubertConfig()
>>> # Initializing a model from the facebook/hubert-base-ls960 style configuration
>>> model = HubertModel(configuration)
>>> # Initializing a model from the facebook/hubert-base-ls960 style configuration
>>> model = HubertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
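As a rough numeric sketch of the masking arithmetic described for `mask_time_prob` above (the 1000-frame sequence length is an assumed illustration, not a model default):

```python
# With the defaults mask_time_prob=0.05 and mask_time_length=10, a 1000-frame
# feature sequence yields about 0.05 * 1000 / 10 = 5 independent mask starts,
# each spanning 10 time steps, i.e. up to ~50 masked frames (~5%) before overlap.
mask_time_prob, mask_time_length, num_frames = 0.05, 10, 1000
num_masks = int(mask_time_prob * num_frames / mask_time_length)
print(num_masks, num_masks * mask_time_length)  # 5 50
```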
model_type = "hubert"
def __init__(
......