Unverified commit 27b3031d authored by Sylvain Gugger, committed by GitHub

Mass conversion of documentation from rst to Markdown (#14866)

* Convert docstrings of all configurations and tokenizers

* Processors and fixes

* Last modeling files and fixes to models

* Pipeline modules

* Utils files

* Data submodule

* All the other files

* Style

* Missing examples

* Style again

* Fix copies

* Say bye bye to rst docstrings forever
parent 18587639
@@ -124,28 +124,28 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a DETR feature extractor.
This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.
Args:
format (`str`, *optional*, defaults to `"coco_detection"`):
Data format of the annotations. One of `"coco_detection"` or `"coco_panoptic"`.
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the input to a certain `size`.
size (`int`, *optional*, defaults to 800):
Resize the input to the given size. Only has an effect if `do_resize` is set to `True`. If size
is a sequence like `(width, height)`, output size will be matched to this. If size is an int, the smaller
edge of the image will be matched to this number, i.e., if `height > width`, then the image will be
rescaled to `(size * height / width, size)`.
max_size (`int`, *optional*, defaults to `1333`):
The largest size an image dimension can have (otherwise it's capped). Only has an effect if
`do_resize` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether or not to normalize the input with mean and standard deviation.
image_mean (`int`, *optional*, defaults to `[0.485, 0.456, 0.406]`):
The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
image_std (`int`, *optional*, defaults to `[0.229, 0.224, 0.225]`):
The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the
ImageNet std.
"""
@@ -416,39 +416,37 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
padded up to the largest image in a batch, and a pixel mask is created that indicates which pixels are
real/which are padding.
<Tip warning={true}>
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so the most efficient is to pass
PIL images.
</Tip>
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is the
number of channels, H and W are image height and width.
annotations (`Dict`, `List[Dict]`, *optional*):
The corresponding annotations in COCO format.
In case [`DetrFeatureExtractor`] was initialized with `format = "coco_detection"`, the annotations for
each image should have the following format: {'image_id': int, 'annotations': [annotation]}, with the
annotations being a list of COCO object annotations.
In case [`DetrFeatureExtractor`] was initialized with `format = "coco_panoptic"`, the annotations for
each image should have the following format: {'image_id': int, 'file_name': str, 'segments_info':
[segment_info]} with segments_info being a list of COCO panoptic annotations.
return_segmentation_masks (`bool`, *optional*, defaults to `False`):
Whether to also include instance segmentation masks as part of the labels in case `format =
"coco_detection"`.
masks_path (`pathlib.Path`, *optional*):
Path to the directory containing the PNG files that store the class-agnostic image segmentations. Only
relevant in case [`DetrFeatureExtractor`] was initialized with `format = "coco_panoptic"`.
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
If left to the default, will return a pixel mask that is:
@@ -456,17 +454,17 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
- 1 for pixels that are real (i.e. **not masked**),
- 0 for pixels that are padding (i.e. **masked**).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch
`torch.Tensor` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model.
- **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
*"pixel_mask"* is in `self.model_input_names`).
- **labels** -- Optional labels to be fed to a model (when `annotations` are provided)
"""
# Input type checking for clearer error
@@ -634,21 +632,21 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
self, pixel_values_list: List["torch.Tensor"], return_tensors: Optional[Union[str, TensorType]] = None
):
"""
Pad images up to the largest image in a batch and create a corresponding `pixel_mask`.
Args:
pixel_values_list (`List[torch.Tensor]`):
List of images (pixel values) to be padded. Each image should be a tensor of shape (C, H, W).
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch
`torch.Tensor` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **pixel_values** -- Pixel values to be fed to a model.
- **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
*"pixel_mask"* is in `self.model_input_names`).
"""
@@ -676,19 +674,19 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
# inspired by https://github.com/facebookresearch/detr/blob/master/models/detr.py#L258
def post_process(self, outputs, target_sizes):
"""
Converts the output of [`DetrForObjectDetection`] into the format expected by the COCO api.
Only supports PyTorch.
Args:
outputs ([`DetrObjectDetectionOutput`]):
Raw outputs of the model.
target_sizes (`torch.Tensor` of shape `(batch_size, 2)`, *optional*):
Tensor containing the size (h, w) of each image of the batch. For evaluation, this must be the original
image size (before any data augmentation). For visualization, this should be the image size after data
augmentation, but before padding.
Returns:
`List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an
image in the batch as predicted by the model.
"""
out_logits, out_bbox = outputs.logits, outputs.pred_boxes
@@ -714,21 +712,21 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
def post_process_segmentation(self, outputs, target_sizes, threshold=0.9, mask_threshold=0.5):
"""
Converts the output of [`DetrForSegmentation`] into image segmentation predictions. Only
supports PyTorch.
Parameters:
outputs ([`DetrSegmentationOutput`]):
Raw outputs of the model.
target_sizes (`torch.Tensor` of shape `(batch_size, 2)` or `List[Tuple]` of length `batch_size`):
Torch Tensor (or list) corresponding to the requested final size (h, w) of each prediction.
threshold (`float`, *optional*, defaults to 0.9):
Threshold to use to filter out queries.
mask_threshold (`float`, *optional*, defaults to 0.5):
Threshold to use when turning the predicted masks into binary values.
Returns:
`List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels, and masks for an
image in the batch as predicted by the model.
"""
out_logits, raw_masks = outputs.logits, outputs.pred_masks
@@ -757,26 +755,26 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
# inspired by https://github.com/facebookresearch/detr/blob/master/models/segmentation.py#L218
def post_process_instance(self, results, outputs, orig_target_sizes, max_target_sizes, threshold=0.5):
"""
Converts the output of [`DetrForSegmentation`] into actual instance segmentation
predictions. Only supports PyTorch.
Args:
results (`List[Dict]`):
Results list obtained by [`~DetrFeatureExtractor.post_process`], to which "masks"
results will be added.
outputs ([`DetrSegmentationOutput`]):
Raw outputs of the model.
orig_target_sizes (`torch.Tensor` of shape `(batch_size, 2)`):
Tensor containing the size (h, w) of each image of the batch. For evaluation, this must be the original
image size (before any data augmentation).
max_target_sizes (`torch.Tensor` of shape `(batch_size, 2)`):
Tensor containing the maximum size (h, w) of each image of the batch. For evaluation, this must be the
original image size (before any data augmentation).
threshold (`float`, *optional*, defaults to 0.5):
Threshold to use when turning the predicted masks into binary values.
Returns:
`List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels, boxes and masks
for an image in the batch as predicted by the model.
"""
@@ -801,26 +799,26 @@ class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
# inspired by https://github.com/facebookresearch/detr/blob/master/models/segmentation.py#L241
def post_process_panoptic(self, outputs, processed_sizes, target_sizes=None, is_thing_map=None, threshold=0.85):
"""
Converts the output of [`DetrForSegmentation`] into actual panoptic predictions. Only
supports PyTorch.
Parameters:
outputs ([`DetrSegmentationOutput`]):
Raw outputs of the model.
processed_sizes (`torch.Tensor` of shape `(batch_size, 2)` or `List[Tuple]` of length `batch_size`):
Torch Tensor (or list) containing the size (h, w) of each image of the batch, i.e. the size after data
augmentation but before batching.
target_sizes (`torch.Tensor` of shape `(batch_size, 2)` or `List[Tuple]` of length `batch_size`, *optional*):
Torch Tensor (or list) corresponding to the requested final size (h, w) of each prediction. If left to
None, it will default to the `processed_sizes`.
is_thing_map (`Dict[int, bool]`, *optional*):
Dictionary mapping class indices to either True or False, depending on whether or not they are a thing.
If not set, defaults to the `is_thing_map` of COCO panoptic.
threshold (`float`, *optional*, defaults to 0.85):
Threshold to use to filter out queries.
Returns:
`List[Dict]`: A list of dictionaries, each dictionary containing a PNG string and segments_info values
for an image in the batch as predicted by the model.
"""
if target_sizes is None:
@@ -1205,21 +1205,22 @@ class DetrModel(DetrPreTrainedModel):
r"""
Returns:
Examples:
```python
>>> from transformers import DetrFeatureExtractor, DetrModel
>>> from PIL import Image
>>> import requests
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = DetrFeatureExtractor.from_pretrained('facebook/detr-resnet-50')
>>> model = DetrModel.from_pretrained('facebook/detr-resnet-50')
>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -36,62 +36,62 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DistilBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`DistilBertModel`] or a
[`TFDistilBertModel`]. It is used to instantiate a DistilBERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the DistilBERT [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by
the `inputs_ids` passed when calling [`DistilBertModel`] or
[`TFDistilBertModel`].
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
sinusoidal_pos_embds (`bool`, *optional*, defaults to `False`):
Whether to use sinusoidal positional embeddings.
n_layers (`int`, *optional*, defaults to 6):
Number of hidden layers in the Transformer encoder.
n_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
dim (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
hidden_dim (`int`, *optional*, defaults to 3072):
The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
activation (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability used in the question answering model
[`DistilBertForQuestionAnswering`].
seq_classif_dropout (`float`, *optional*, defaults to 0.2):
The dropout probability used in the sequence classification and the multiple choice model
[`DistilBertForSequenceClassification`].
Examples:
```python
>>> from transformers import DistilBertModel, DistilBertConfig
>>> # Initializing a DistilBERT configuration
>>> configuration = DistilBertConfig()
>>> # Initializing a model from the configuration
>>> model = DistilBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "distilbert"
attribute_map = {
"hidden_size": "dim",
@@ -57,10 +57,10 @@ class DistilBertTokenizer(BertTokenizer):
r"""
Construct a DistilBERT tokenizer.
[`DistilBertTokenizer`] is identical to [`BertTokenizer`] and runs end-to-end
tokenization: punctuation splitting and wordpiece.
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
"""
@@ -64,12 +64,12 @@ PRETRAINED_INIT_CONFIGURATION = {
class DistilBertTokenizerFast(BertTokenizerFast):
r"""
Construct a "fast" DistilBERT tokenizer (backed by HuggingFace's `tokenizers` library).
Construct a "fast" DistilBERT tokenizer (backed by HuggingFace's *tokenizers* library).
:class:`~transformers.DistilBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
[`DistilBertTokenizerFast`] is identical to [`BertTokenizerFast`] and runs
end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
@@ -32,51 +32,49 @@ DPR_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DPRConfig(PretrainedConfig):
r"""
[`DPRConfig`] is the configuration class to store the configuration of a *DPRModel*.
This is the configuration class to store the configuration of a [`DPRContextEncoder`],
[`DPRQuestionEncoder`], or a [`DPRReader`]. It is used to instantiate the
components of the DPR model.
This class is a subclass of [`BertConfig`]. Please check the superclass for the documentation of
all kwargs.
Args:
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the DPR model. Defines the different tokens that can be represented by the *inputs_ids*
passed to the forward method of [`BertModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the *token_type_ids* passed into [`BertModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`,
`"relative_key_query"`. For positional embeddings use `"absolute"`. For more information on
`"relative_key"`, please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). For more information on `"relative_key_query"`, please refer to
*Method 4* in [Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
projection_dim (`int`, *optional*, defaults to 0):
Dimension of the projection for the context and question encoders. If it is set to zero (default), then no
projection is done.
"""
@@ -64,7 +64,7 @@ class DPRContextEncoderOutput(ModelOutput):
Class for outputs of [`DPRContextEncoder`].
Args:
pooler_output (`torch.FloatTensor` of shape `(batch_size, embeddings_size)`):
The DPR encoder outputs the *pooler_output* that corresponds to the context representation. Last layer
hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
This output is to be used to embed contexts for nearest neighbors queries with questions embeddings.
@@ -91,7 +91,7 @@ class DPRQuestionEncoderOutput(ModelOutput):
Class for outputs of [`DPRQuestionEncoder`].
Args:
pooler_output (`torch.FloatTensor` of shape `(batch_size, embeddings_size)`):
The DPR encoder outputs the *pooler_output* that corresponds to the question representation. Last layer
hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
This output is to be used to embed questions for nearest neighbors queries with context embeddings.
@@ -118,11 +118,11 @@ class DPRReaderOutput(ModelOutput):
Class for outputs of [`DPRReader`].
Args:
start_logits (`torch.FloatTensor` of shape `(n_passages, sequence_length)`):
Logits of the start index of the span for each passage.
end_logits (`torch.FloatTensor` of shape `(n_passages, sequence_length)`):
Logits of the end index of the span for each passage.
relevance_logits (`torch.FloatTensor` of shape `(n_passages, )`):
Outputs of the QA classifier of the DPRReader that correspond to the scores of each passage to answer the
question, compared to all the other passages.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
@@ -350,17 +350,17 @@ DPR_ENCODERS_INPUTS_DOCSTRING = r"""
(a) For sequence pairs (for a pair title+text for example):
```
tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
```
(b) For single sequences (for a question for example):
```
tokens: [CLS] the dog is hairy . [SEP]
token_type_ids: 0 0 0 0 0 0 0
```
DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
rather than the left.
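The segment ids above can be reproduced directly with the tokenizer; a sketch using the `facebook/dpr-question_encoder-single-nq-base` checkpoint (the expected output simply restates the table in (a)):

```python
>>> from transformers import DPRQuestionEncoderTokenizer

>>> tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
>>> # Encode the sequence pair from example (a) above.
>>> encoded = tokenizer("is this jacksonville ?", "no it is not .")
>>> encoded["token_type_ids"]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```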
@@ -463,14 +463,15 @@ class DPRContextEncoder(DPRPretrainedContextEncoder):
r"""
Return:
Examples:
```python
>>> from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
>>> tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
>>> model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
>>> input_ids = tokenizer("Hello, is my dog cute ?", return_tensors='pt')["input_ids"]
>>> embeddings = model(input_ids).pooler_output
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
@@ -542,13 +543,15 @@ class DPRQuestionEncoder(DPRPretrainedQuestionEncoder):
r"""
Return:
Examples:
```python
>>> from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
>>> tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
>>> model = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
>>> input_ids = tokenizer("Hello, is my dog cute ?", return_tensors='pt')["input_ids"]
>>> embeddings = model(input_ids).pooler_output
```
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
@@ -619,22 +622,23 @@ class DPRReader(DPRPretrainedReader):
r"""
Return:
Examples:
```python
>>> from transformers import DPRReader, DPRReaderTokenizer
>>> tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> model = DPRReader.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> encoded_inputs = tokenizer(
... questions=["What is love ?"],
... titles=["Haddaway"],
... texts=["'What Is Love' is a song recorded by the artist Haddaway"],
... return_tensors='pt'
... )
>>> outputs = model(**encoded_inputs)
>>> start_logits = outputs.start_logits
>>> end_logits = outputs.end_logits
>>> relevance_logits = outputs.relevance_logits
```
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
@@ -61,7 +61,7 @@ class TFDPRContextEncoderOutput(ModelOutput):
Class for outputs of [`TFDPRContextEncoder`].
Args:
pooler_output (`tf.Tensor` of shape `(batch_size, embeddings_size)`):
The DPR encoder outputs the *pooler_output* that corresponds to the context representation. Last layer
hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
This output is to be used to embed contexts for nearest neighbors queries with questions embeddings.
@@ -88,7 +88,7 @@ class TFDPRQuestionEncoderOutput(ModelOutput):
Class for outputs of [`TFDPRQuestionEncoder`].
Args:
pooler_output (`tf.Tensor` of shape `(batch_size, embeddings_size)`):
The DPR encoder outputs the *pooler_output* that corresponds to the question representation. Last layer
hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
This output is to be used to embed questions for nearest neighbors queries with context embeddings.
@@ -115,11 +115,11 @@ class TFDPRReaderOutput(ModelOutput):
Class for outputs of [`TFDPRReader`].
Args:
start_logits (`tf.Tensor` of shape `(n_passages, sequence_length)`):
Logits of the start index of the span for each passage.
end_logits (`tf.Tensor` of shape `(n_passages, sequence_length)`):
Logits of the end index of the span for each passage.
relevance_logits (`tf.Tensor` of shape `(n_passages, )`):
Outputs of the QA classifier of the DPRReader that correspond to the scores of each passage to answer the
question, compared to all the other passages.
hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
@@ -485,17 +485,17 @@ TF_DPR_ENCODERS_INPUTS_DOCSTRING = r"""
(a) For sequence pairs (for a pair title+text for example):
```
tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
```
(b) For single sequences (for a question for example):
```
tokens: [CLS] the dog is hairy . [SEP]
token_type_ids: 0 0 0 0 0 0 0
```
DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on the right
rather than the left.
@@ -610,13 +610,15 @@ class TFDPRContextEncoder(TFDPRPretrainedContextEncoder):
r"""
Return:
Examples:
```python
>>> from transformers import TFDPRContextEncoder, DPRContextEncoderTokenizer
>>> tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
>>> model = TFDPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base', from_pt=True)
>>> input_ids = tokenizer("Hello, is my dog cute ?", return_tensors='tf')["input_ids"]
>>> embeddings = model(input_ids).pooler_output
```
"""
inputs = input_processing(
func=self.call,
@@ -708,13 +710,15 @@ class TFDPRQuestionEncoder(TFDPRPretrainedQuestionEncoder):
r"""
Return:
Examples:
```python
>>> from transformers import TFDPRQuestionEncoder, DPRQuestionEncoderTokenizer
>>> tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
>>> model = TFDPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base', from_pt=True)
>>> input_ids = tokenizer("Hello, is my dog cute ?", return_tensors='tf')["input_ids"]
>>> embeddings = model(input_ids).pooler_output
```
"""
inputs = input_processing(
func=self.call,
@@ -804,22 +808,23 @@ class TFDPRReader(TFDPRPretrainedReader):
r"""
Return:
Examples:
```python
>>> from transformers import TFDPRReader, DPRReaderTokenizer
>>> tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> model = TFDPRReader.from_pretrained('facebook/dpr-reader-single-nq-base', from_pt=True)
>>> encoded_inputs = tokenizer(
... questions=["What is love ?"],
... titles=["Haddaway"],
... texts=["'What Is Love' is a song recorded by the artist Haddaway"],
... return_tensors='tf'
... )
>>> outputs = model(encoded_inputs)
>>> start_logits = outputs.start_logits
>>> end_logits = outputs.end_logits
>>> relevance_logits = outputs.relevance_logits
```
"""
inputs = input_processing(
func=self.call,
@@ -91,10 +91,10 @@ class DPRContextEncoderTokenizer(BertTokenizer):
r"""
Construct a DPRContextEncoder tokenizer.
[`DPRContextEncoderTokenizer`] is identical to [`BertTokenizer`] and runs
end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
"""
@@ -108,10 +108,10 @@ class DPRQuestionEncoderTokenizer(BertTokenizer):
r"""
Constructs a DPRQuestionEncoder tokenizer.
[`DPRQuestionEncoderTokenizer`] is identical to [`BertTokenizer`] and runs
end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
"""
@@ -130,70 +130,70 @@ DPRReaderOutput = collections.namedtuple("DPRReaderOutput", ["start_logits", "en
CUSTOM_DPR_READER_DOCSTRING = r"""
Return a dictionary with the token ids of the input strings and other information to give to
`.decode_best_spans`. It converts the strings of a question and different passages (title and text) in a
sequence of IDs (integers), using the tokenizer and vocabulary. The resulting `input_ids` is a matrix of size
`(n_passages, sequence_length)` with the format:
```
[CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids>
```
Args:
questions (`str` or `List[str]`):
The questions to be encoded. You can specify one question for many passages. In this case, the question
will be duplicated like `[questions] * n_passages`. Otherwise you have to specify as many questions as
in `titles` or `texts`.
titles (`str` or `List[str]`):
The passages titles to be encoded. This can be a string or a list of strings if there are several passages.
texts (`str` or `List[str]`):
The passages texts to be encoded. This can be a string or a list of strings if there are several passages.
padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
Activates and controls padding. Accepts the following values:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
Activates and controls truncation. Accepts the following values:
- `True` or `'longest_first'`: Truncate to a maximum length specified with the argument
`max_length` or to the maximum acceptable input length for the model if that argument is not
provided. This will truncate token by token, removing a token from the longest sequence in the pair if a
pair of sequences (or a batch of pairs) is provided.
- `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided. This will only truncate
the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to
the maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
- `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence
lengths greater than the model maximum admissible input size).
max_length (`int`, *optional*):
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to `None`, this will use the predefined model maximum length if a maximum
length is required by one of the truncation/padding parameters. If the model has no specific maximum
input length (like XLNet) truncation/padding to a maximum length will be deactivated.
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
return_attention_mask (`bool`, *optional*):
Whether or not to return the attention mask. If not set, will return the attention mask according to the
specific tokenizer's default, defined by the `return_outputs` attribute.
[What are attention masks?](../glossary#attention-mask)
Returns:
`Dict[str, List[List[int]]]`: A dictionary with the following keys:
- `input_ids`: List of token ids to be fed to a model.
- `attention_mask`: List of indices specifying which tokens should be attended to by the model.
"""
@@ -268,33 +268,31 @@ class CustomDPRReaderTokenizerMixin:
"""
Get the span predictions for the extractive Q&A model.
Returns: *List* of *DPRReaderOutput* sorted by descending *(relevance_score, span_score)*. Each
*DPRReaderOutput* is a *Tuple* with:
- **span_score**: `float` that corresponds to the score given by the reader for this span compared to
other spans in the same passage. It corresponds to the sum of the start and end logits of the span.
- **relevance_score**: `float` that corresponds to the score of each passage to answer the question,
compared to all the other passages. It corresponds to the output of the QA classifier of the DPRReader.
- **doc_id**: `int` the id of the passage.
- **start_index**: `int` the start index of the span (inclusive).
- **end_index**: `int` the end index of the span (inclusive).
Examples:
```python
>>> from transformers import DPRReader, DPRReaderTokenizer
>>> tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> model = DPRReader.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> encoded_inputs = tokenizer(
... questions=["What is love ?"],
... titles=["Haddaway"],
... texts=["'What Is Love' is a song recorded by the artist Haddaway"],
... return_tensors='pt'
... )
>>> outputs = model(**encoded_inputs)
>>> predicted_spans = tokenizer.decode_best_spans(encoded_inputs, outputs)
>>> print(predicted_spans[0].text) # best span
```"""
input_ids = reader_input["input_ids"]
start_logits, end_logits, relevance_logits = reader_output[:3]
n_passages = len(relevance_logits)
@@ -373,11 +371,11 @@ class DPRReaderTokenizer(CustomDPRReaderTokenizerMixin, BertTokenizer):
r"""
Construct a DPRReader tokenizer.
[`DPRReaderTokenizer`] is almost identical to [`BertTokenizer`] and runs
end-to-end tokenization: punctuation splitting and wordpiece. The difference is that it has three input strings:
question, titles and texts that are combined to be fed to the [`DPRReader`] model.
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
"""
@@ -90,12 +90,12 @@ READER_PRETRAINED_INIT_CONFIGURATION = {
class DPRContextEncoderTokenizerFast(BertTokenizerFast):
r"""
Construct a "fast" DPRContextEncoder tokenizer (backed by HuggingFace's `tokenizers` library).
Construct a "fast" DPRContextEncoder tokenizer (backed by HuggingFace's *tokenizers* library).
:class:`~transformers.DPRContextEncoderTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and
[`DPRContextEncoderTokenizerFast`] is identical to [`BertTokenizerFast`] and
runs end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
@@ -108,12 +108,12 @@ class DPRContextEncoderTokenizerFast(BertTokenizerFast):
class DPRQuestionEncoderTokenizerFast(BertTokenizerFast):
r"""
Constructs a "fast" DPRQuestionEncoder tokenizer (backed by HuggingFace's `tokenizers` library).
Constructs a "fast" DPRQuestionEncoder tokenizer (backed by HuggingFace's *tokenizers* library).
:class:`~transformers.DPRQuestionEncoderTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and
[`DPRQuestionEncoderTokenizerFast`] is identical to [`BertTokenizerFast`] and
runs end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
@@ -133,68 +133,68 @@ DPRReaderOutput = collections.namedtuple("DPRReaderOutput", ["start_logits", "en
CUSTOM_DPR_READER_DOCSTRING = r"""
Return a dictionary with the token ids of the input strings and other information to give to
:obj:`.decode_best_spans`. It converts the strings of a question and different passages (title and text) in a
sequence of IDs (integers), using the tokenizer and vocabulary. The resulting :obj:`input_ids` is a matrix of size
:obj:`(n_passages, sequence_length)` with the format:
`.decode_best_spans`. It converts the strings of a question and different passages (title and text) in a
sequence of IDs (integers), using the tokenizer and vocabulary. The resulting `input_ids` is a matrix of size
`(n_passages, sequence_length)` with the format:
[CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids>
Args:
questions (:obj:`str` or :obj:`List[str]`):
questions (`str` or `List[str]`):
The questions to be encoded. You can specify one question for many passages. In this case, the question
will be duplicated like :obj:`[questions] * n_passages`. Otherwise you have to specify as many questions as
in :obj:`titles` or :obj:`texts`.
titles (:obj:`str` or :obj:`List[str]`):
will be duplicated like `[questions] * n_passages`. Otherwise you have to specify as many questions as
in `titles` or `texts`.
titles (`str` or `List[str]`):
The passages titles to be encoded. This can be a string or a list of strings if there are several passages.
texts (:obj:`str` or :obj:`List[str]`):
texts (`str` or `List[str]`):
The passages texts to be encoded. This can be a string or a list of strings if there are several passages.
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.file_utils.PaddingStrategy`, `optional`, defaults to :obj:`False`):
padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
Activates and controls padding. Accepts the following values:
* :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
* :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided.
* :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
different lengths).
truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.TruncationStrategy`, `optional`, defaults to :obj:`False`):
truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
Activates and controls truncation. Accepts the following values:
* :obj:`True` or :obj:`'longest_first'`: Truncate to a maximum length specified with the argument
:obj:`max_length` or to the maximum acceptable input length for the model if that argument is not
- `True` or `'longest_first'`: Truncate to a maximum length specified with the argument
`max_length` or to the maximum acceptable input length for the model if that argument is not
provided. This will truncate token by token, removing a token from the longest sequence in the pair if a
pair of sequences (or a batch of pairs) is provided.
* :obj:`'only_first'`: Truncate to a maximum length specified with the argument :obj:`max_length` or to the
- `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
maximum acceptable input length for the model if that argument is not provided. This will only truncate
the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
* :obj:`'only_second'`: Truncate to a maximum length specified with the argument :obj:`max_length` or to
- `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to
the maximum acceptable input length for the model if that argument is not provided. This will only
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
* :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence
- `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence
lengths greater than the model maximum admissible input size).
max_length (:obj:`int`, `optional`):
max_length (`int`, *optional*):
Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
If left unset or set to `None`, this will use the predefined model maximum length if a maximum
length is required by one of the truncation/padding parameters. If the model has no specific maximum
input length (like XLNet) truncation/padding to a maximum length will be deactivated.
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`):
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
return_attention_mask (:obj:`bool`, `optional`):
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
return_attention_mask (`bool`, *optional*):
Whether or not to return the attention mask. If not set, will return the attention mask according to the
specific tokenizer's default, defined by the :obj:`return_outputs` attribute.
specific tokenizer's default, defined by the `return_outputs` attribute.
`What are attention masks? <../glossary.html#attention-mask>`__
[What are attention masks?](../glossary#attention-mask)
Return:
:obj:`Dict[str, List[List[int]]]`: A dictionary with the following keys:
`Dict[str, List[List[int]]]`: A dictionary with the following keys:
- ``input_ids``: List of token ids to be fed to a model.
- ``attention_mask``: List of indices specifying which tokens should be attended to by the model.
- `input_ids`: List of token ids to be fed to a model.
- `attention_mask`: List of indices specifying which tokens should be attended to by the model.
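A short sketch of the resulting shape, assuming the `facebook/dpr-reader-single-nq-base` checkpoint used in the
example further down (one question is duplicated over two passages):

```python
>>> from transformers import DPRReaderTokenizer

>>> tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> encoded_inputs = tokenizer(
...     questions=["What is love ?"],
...     titles=["Haddaway", "What Is Love"],
...     texts=["'What Is Love' is a song recorded by the artist Haddaway",
...            "'What Is Love' was released in 1993"],
...     padding='longest',
...     return_tensors='pt'
... )
>>> input_ids = encoded_inputs["input_ids"]  # matrix of size (n_passages, sequence_length), here n_passages=2
```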
"""
......@@ -269,33 +269,31 @@ class CustomDPRReaderTokenizerMixin:
"""
Get the span predictions for the extractive Q&A model.
Returns: `List` of `DPRReaderOutput` sorted by descending `(relevance_score, span_score)`. Each
`DPRReaderOutput` is a `Tuple` with:
Returns: *List* of *DPRReaderOutput* sorted by descending *(relevance_score, span_score)*. Each
*DPRReaderOutput* is a *Tuple* with:
- **span_score**: ``float`` that corresponds to the score given by the reader for this span compared to
- **span_score**: `float` that corresponds to the score given by the reader for this span compared to
other spans in the same passage. It corresponds to the sum of the start and end logits of the span.
- **relevance_score**: ``float`` that corresponds to the score of the each passage to answer the question,
- **relevance_score**: `float` that corresponds to the score of each passage to answer the question,
compared to all the other passages. It corresponds to the output of the QA classifier of the DPRReader.
- **doc_id**: ``int``` the id of the passage.
- ***start_index**: ``int`` the start index of the span (inclusive).
- **end_index**: ``int`` the end index of the span (inclusive).
Examples::
>>> from transformers import DPRReader, DPRReaderTokenizer
>>> tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> model = DPRReader.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> encoded_inputs = tokenizer(
... questions=["What is love ?"],
... titles=["Haddaway"],
... texts=["'What Is Love' is a song recorded by the artist Haddaway"],
... return_tensors='pt'
... )
>>> outputs = model(**encoded_inputs)
>>> predicted_spans = tokenizer.decode_best_spans(encoded_inputs, outputs)
>>> print(predicted_spans[0].text) # best span
"""
- **doc_id**: `int` the id of the passage.
- **start_index**: `int` the start index of the span (inclusive).
- **end_index**: `int` the end index of the span (inclusive).
Examples:
```python
>>> from transformers import DPRReader, DPRReaderTokenizer
>>> tokenizer = DPRReaderTokenizer.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> model = DPRReader.from_pretrained('facebook/dpr-reader-single-nq-base')
>>> encoded_inputs = tokenizer(
... questions=["What is love ?"],
... titles=["Haddaway"],
... texts=["'What Is Love' is a song recorded by the artist Haddaway"],
... return_tensors='pt'
... )
>>> outputs = model(**encoded_inputs)
>>> predicted_spans = tokenizer.decode_best_spans(encoded_inputs, outputs)
>>> print(predicted_spans[0].text) # best span
```"""
input_ids = reader_input["input_ids"]
start_logits, end_logits, relevance_logits = reader_output[:3]
n_passages = len(relevance_logits)
......@@ -372,13 +370,13 @@ class CustomDPRReaderTokenizerMixin:
@add_end_docstrings(CUSTOM_DPR_READER_DOCSTRING)
class DPRReaderTokenizerFast(CustomDPRReaderTokenizerMixin, BertTokenizerFast):
r"""
Constructs a "fast" DPRReader tokenizer (backed by HuggingFace's `tokenizers` library).
Constructs a "fast" DPRReader tokenizer (backed by HuggingFace's *tokenizers* library).
:class:`~transformers.DPRReaderTokenizerFast` is almost identical to :class:`~transformers.BertTokenizerFast` and
[`DPRReaderTokenizerFast`] is almost identical to [`BertTokenizerFast`] and
runs end-to-end tokenization: punctuation splitting and wordpiece. The difference is that it has three input
strings: question, titles and texts that are combined to be fed to the :class:`~transformers.DPRReader` model.
strings: question, titles and texts that are combined to be fed to the [`DPRReader`] model.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
......
......@@ -33,96 +33,94 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ElectraConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
:class:`~transformers.TFElectraModel`. It is used to instantiate a ELECTRA model according to the specified
This is the configuration class to store the configuration of an [`ElectraModel`] or a
[`TFElectraModel`]. It is used to instantiate an ELECTRA model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ELECTRA `google/electra-small-discriminator
<https://huggingface.co/google/electra-small-discriminator>`__ architecture.
configuration to that of the ELECTRA [google/electra-small-discriminator](https://huggingface.co/google/electra-small-discriminator) architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
vocab_size (`int`, *optional*, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
`inputs_ids` passed when calling [`ElectraModel`] or
[`TFElectraModel`].
embedding_size (`int`, *optional*, defaults to 128):
Dimensionality of the encoder layers and the pooler layer.
hidden_size (:obj:`int`, `optional`, defaults to 256):
hidden_size (`int`, *optional*, defaults to 256):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, `optional`, defaults to 4):
num_attention_heads (`int`, *optional*, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 1024):
intermediate_size (`int`, *optional*, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
`"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
type_vocab_size (`int`, *optional*, defaults to 2):
The vocabulary size of the `token_type_ids` passed when calling [`ElectraModel`] or
[`TFElectraModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
summary_type (`str`, *optional*, defaults to `"first"`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
- `"last"`: Take the last token hidden state (like XLNet).
- `"first"`: Take the first token hidden state (like BERT).
- `"mean"`: Take the mean of all tokens hidden states.
- `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- `"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (`bool`, *optional*, defaults to `True`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
summary_activation (`str`, *optional*):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Pass :obj:`"gelu"` for a gelu activation to the output, any other value will result in no activation.
summary_last_dropout (:obj:`float`, `optional`, defaults to 0.0):
Pass `"gelu"` for a gelu activation to the output, any other value will result in no activation.
summary_last_dropout (`float`, *optional*, defaults to 0.0):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
The dropout ratio to be used after the projection and activation.
position_embedding_type (:obj:`str`, `optional`, defaults to :obj:`"absolute"`):
Type of position embedding. Choose one of :obj:`"absolute"`, :obj:`"relative_key"`,
:obj:`"relative_key_query"`. For positional embeddings use :obj:`"absolute"`. For more information on
:obj:`"relative_key"`, please refer to `Self-Attention with Relative Position Representations (Shaw et al.)
<https://arxiv.org/abs/1803.02155>`__. For more information on :obj:`"relative_key_query"`, please refer to
`Method 4` in `Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)
<https://arxiv.org/abs/2009.13658>`__.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`,
`"relative_key_query"`. For positional embeddings use `"absolute"`. For more information on
`"relative_key"`, please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). For more information on `"relative_key_query"`, please refer to
*Method 4* in [Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if ``config.is_decoder=True``.
classifier_dropout (:obj:`float`, `optional`):
relevant if `config.is_decoder=True`.
classifier_dropout (`float`, *optional*):
The dropout ratio for the classification head.
Examples::
Examples:
>>> from transformers import ElectraModel, ElectraConfig
```python
>>> from transformers import ElectraModel, ElectraConfig
>>> # Initializing a ELECTRA electra-base-uncased style configuration
>>> configuration = ElectraConfig()
>>> # Initializing a ELECTRA electra-base-uncased style configuration
>>> configuration = ElectraConfig()
>>> # Initializing a model from the electra-base-uncased style configuration
>>> model = ElectraModel(configuration)
>>> # Initializing a model from the electra-base-uncased style configuration
>>> model = ElectraModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "electra"
def __init__(
......
......@@ -814,17 +814,19 @@ class FlaxElectraForPreTraining(FlaxElectraPreTrainedModel):
FLAX_ELECTRA_FOR_PRETRAINING_DOCSTRING = """
Returns:
Example::
Example:
>>> from transformers import ElectraTokenizer, FlaxElectraForPreTraining
```python
>>> from transformers import ElectraTokenizer, FlaxElectraForPreTraining
>>> tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
>>> model = FlaxElectraForPreTraining.from_pretrained('google/electra-small-discriminator')
>>> tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
>>> model = FlaxElectraForPreTraining.from_pretrained('google/electra-small-discriminator')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
>>> outputs = model(**inputs)
>>> prediction_logits = outputs.logits
>>> prediction_logits = outputs.logits
```
"""
overwrite_call_docstring(
......
......@@ -1082,17 +1082,18 @@ class TFElectraForPreTraining(TFElectraPreTrainedModel):
r"""
Returns:
Examples::
>>> import tensorflow as tf
>>> from transformers import ElectraTokenizer, TFElectraForPreTraining
>>> tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
>>> model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')
>>> input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
>>> outputs = model(input_ids)
>>> scores = outputs[0]
"""
Examples:
```python
>>> import tensorflow as tf
>>> from transformers import ElectraTokenizer, TFElectraForPreTraining
>>> tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
>>> model = TFElectraForPreTraining.from_pretrained('google/electra-small-discriminator')
>>> input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
>>> outputs = model(input_ids)
>>> scores = outputs[0]
```"""
inputs = input_processing(
func=self.call,
config=self.config,
......
......@@ -53,10 +53,10 @@ class ElectraTokenizer(BertTokenizer):
r"""
Construct an ELECTRA tokenizer.
:class:`~transformers.ElectraTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
[`ElectraTokenizer`] is identical to [`BertTokenizer`] and runs end-to-end
tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
Refer to superclass [`BertTokenizer`] for usage examples and documentation concerning
parameters.
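A minimal usage sketch (the checkpoint name is illustrative, and [`ElectraTokenizerFast`] below is used the
same way):

```python
>>> from transformers import ElectraTokenizer

>>> # identical in behavior to BertTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors='pt')
```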
"""
......
......@@ -60,12 +60,12 @@ PRETRAINED_INIT_CONFIGURATION = {
class ElectraTokenizerFast(BertTokenizerFast):
r"""
Construct a "fast" ELECTRA tokenizer (backed by HuggingFace's `tokenizers` library).
Construct a "fast" ELECTRA tokenizer (backed by HuggingFace's *tokenizers* library).
:class:`~transformers.ElectraTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
[`ElectraTokenizerFast`] is identical to [`BertTokenizerFast`] and runs
end-to-end tokenization: punctuation splitting and wordpiece.
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
Refer to superclass [`BertTokenizerFast`] for usage examples and documentation concerning
parameters.
"""
vocab_files_names = VOCAB_FILES_NAMES
......
......@@ -25,49 +25,50 @@ logger = logging.get_logger(__name__)
class EncoderDecoderConfig(PretrainedConfig):
r"""
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a
:class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
[`EncoderDecoderConfig`] is the configuration class to store the configuration of a
[`EncoderDecoderModel`]. It is used to instantiate an Encoder Decoder model according to the
specified arguments, defining the encoder and decoder configs.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
kwargs (`optional`):
kwargs (*optional*):
Dictionary of keyword arguments. Notably:
- **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
- **encoder** ([`PretrainedConfig`], *optional*) -- An instance of a configuration
object that defines the encoder config.
- **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
- **decoder** ([`PretrainedConfig`], *optional*) -- An instance of a configuration
object that defines the decoder config.
Examples::
Examples:
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
```python
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
>>> # Initializing a BERT bert-base-uncased style configuration
>>> config_encoder = BertConfig()
>>> config_decoder = BertConfig()
>>> # Initializing a BERT bert-base-uncased style configuration
>>> config_encoder = BertConfig()
>>> config_decoder = BertConfig()
>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
>>> # Initializing a Bert2Bert model from the bert-base-uncased style configurations
>>> model = EncoderDecoderModel(config=config)
>>> # Initializing a Bert2Bert model from the bert-base-uncased style configurations
>>> model = EncoderDecoderModel(config=config)
>>> # Accessing the model configuration
>>> config_encoder = model.config.encoder
>>> config_decoder = model.config.decoder
>>> # set decoder config to causal lm
>>> config_decoder.is_decoder = True
>>> config_decoder.add_cross_attention = True
>>> # Accessing the model configuration
>>> config_encoder = model.config.encoder
>>> config_decoder = model.config.decoder
>>> # set decoder config to causal lm
>>> config_decoder.is_decoder = True
>>> config_decoder.add_cross_attention = True
>>> # Saving the model, including its configuration
>>> model.save_pretrained('my-model')
>>> # Saving the model, including its configuration
>>> model.save_pretrained('my-model')
>>> # loading model and config from pretrained folder
>>> encoder_decoder_config = EncoderDecoderConfig.from_pretrained('my-model')
>>> model = EncoderDecoderModel.from_pretrained('my-model', config=encoder_decoder_config)
"""
>>> # loading model and config from pretrained folder
>>> encoder_decoder_config = EncoderDecoderConfig.from_pretrained('my-model')
>>> model = EncoderDecoderModel.from_pretrained('my-model', config=encoder_decoder_config)
```"""
model_type = "encoder-decoder"
is_composition = True
......@@ -92,11 +93,11 @@ class EncoderDecoderConfig(PretrainedConfig):
cls, encoder_config: PretrainedConfig, decoder_config: PretrainedConfig, **kwargs
) -> PretrainedConfig:
r"""
Instantiate a :class:`~transformers.EncoderDecoderConfig` (or a derived class) from a pre-trained encoder model
Instantiate an [`EncoderDecoderConfig`] (or a derived class) from a pre-trained encoder model
configuration and decoder model configuration.
Returns:
:class:`EncoderDecoderConfig`: An instance of a configuration object
[`EncoderDecoderConfig`]: An instance of a configuration object
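A small sketch, using [`BertConfig`] purely for illustration:

```python
>>> from transformers import BertConfig, EncoderDecoderConfig

>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(BertConfig(), BertConfig())
>>> # the decoder configuration is automatically flagged as a decoder with cross-attention
>>> config.decoder.is_decoder
True
```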
"""
logger.info("Set `config.is_decoder=True` and `config.add_cross_attention=True` for decoder_config")
decoder_config.is_decoder = True
......@@ -106,10 +107,10 @@ class EncoderDecoderConfig(PretrainedConfig):
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default `to_dict()` from `PretrainedConfig`.
Serializes this instance to a Python dictionary. Override the default *to_dict()* from *PretrainedConfig*.
Returns:
:obj:`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
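A small sketch of the nested serialization, using [`BertConfig`] purely for illustration:

```python
>>> from transformers import BertConfig, EncoderDecoderConfig

>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(BertConfig(), BertConfig())
>>> config_dict = config.to_dict()
>>> # the encoder and decoder configurations are serialized as nested dictionaries
>>> encoder_dict, decoder_dict = config_dict["encoder"], config_dict["decoder"]
```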
"""
output = copy.deepcopy(self.__dict__)
output["encoder"] = self.encoder.to_dict()
......
......@@ -444,32 +444,32 @@ class EncoderDecoderModel(PreTrainedModel):
r"""
Returns:
Examples::
Examples:
>>> from transformers import EncoderDecoderModel, BertTokenizer
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints
```python
>>> from transformers import EncoderDecoderModel, BertTokenizer
>>> import torch
>>> # training
>>> model.config.decoder_start_token_id = tokenizer.cls_token_id
>>> model.config.pad_token_id = tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints
>>> input_ids = tokenizer("This is a really long text", return_tensors="pt").input_ids
>>> labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
>>> outputs = model(input_ids=input_ids, labels=input_ids)
>>> loss, logits = outputs.loss, outputs.logits
>>> # training
>>> model.config.decoder_start_token_id = tokenizer.cls_token_id
>>> model.config.pad_token_id = tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size
>>> # save and load from pretrained
>>> model.save_pretrained("bert2bert")
>>> model = EncoderDecoderModel.from_pretrained("bert2bert")
>>> input_ids = tokenizer("This is a really long text", return_tensors="pt").input_ids
>>> labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> loss, logits = outputs.loss, outputs.logits
>>> # generation
>>> generated = model.generate(input_ids)
>>> # save and load from pretrained
>>> model.save_pretrained("bert2bert")
>>> model = EncoderDecoderModel.from_pretrained("bert2bert")
"""
>>> # generation
>>> generated = model.generate(input_ids)
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith("decoder_")}
......
......@@ -428,20 +428,20 @@ class FlaxEncoderDecoderModel(FlaxPreTrainedModel):
r"""
Returns:
Example::
>>> from transformers import FlaxEncoderDecoderModel, BertTokenizer
Example:
>>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-cased', 'gpt2')
```python
>>> from transformers import FlaxEncoderDecoderModel, BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-cased', 'gpt2')
>>> text = "My friends are cool but they eat too many carbs."
>>> input_ids = tokenizer.encode(text, return_tensors='np')
>>> encoder_outputs = model.encode(input_ids)
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
"""
>>> text = "My friends are cool but they eat too many carbs."
>>> input_ids = tokenizer.encode(text, return_tensors='np')
>>> encoder_outputs = model.encode(input_ids)
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
......@@ -505,27 +505,27 @@ class FlaxEncoderDecoderModel(FlaxPreTrainedModel):
r"""
Returns:
Example::
>>> from transformers import FlaxEncoderDecoderModel, BertTokenizer
>>> import jax.numpy as jnp
Example:
>>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-cased', 'gpt2')
```python
>>> from transformers import FlaxEncoderDecoderModel, BertTokenizer
>>> import jax.numpy as jnp
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-cased', 'gpt2')
>>> text = "My friends are cool but they eat too many carbs."
>>> input_ids = tokenizer.encode(text, max_length=1024, return_tensors='np')
>>> encoder_outputs = model.encode(input_ids)
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> decoder_start_token_id = model.config.decoder.bos_token_id
>>> decoder_input_ids = jnp.ones((input_ids.shape[0], 1), dtype="i4") * decoder_start_token_id
>>> text = "My friends are cool but they eat too many carbs."
>>> input_ids = tokenizer.encode(text, max_length=1024, return_tensors='np')
>>> encoder_outputs = model.encode(input_ids)
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> logits = outputs.logits
>>> decoder_start_token_id = model.config.decoder.bos_token_id
>>> decoder_input_ids = jnp.ones((input_ids.shape[0], 1), dtype="i4") * decoder_start_token_id
"""
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> logits = outputs.logits
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
......@@ -631,32 +631,33 @@ class FlaxEncoderDecoderModel(FlaxPreTrainedModel):
r"""
Returns:
Examples::
Examples:
>>> from transformers import FlaxEncoderDecoderModel, BertTokenizer, GPT2Tokenizer
```python
>>> from transformers import FlaxEncoderDecoderModel, BertTokenizer, GPT2Tokenizer
>>> # load a fine-tuned bert2gpt2 model
>>> model = FlaxEncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16")
>>> # load input & output tokenizer
>>> tokenizer_input = BertTokenizer.from_pretrained('bert-base-cased')
>>> tokenizer_output = GPT2Tokenizer.from_pretrained('gpt2')
>>> # load a fine-tuned bert2gpt2 model
>>> model = FlaxEncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16")
>>> # load input & output tokenizer
>>> tokenizer_input = BertTokenizer.from_pretrained('bert-base-cased')
>>> tokenizer_output = GPT2Tokenizer.from_pretrained('gpt2')
>>> article = '''Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members
... singing a racist chant. SAE's national chapter suspended the students,
... but University of Oklahoma President David Boren took it a step further,
... saying the university's affiliation with the fraternity is permanently done.'''
>>> article = '''Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members
... singing a racist chant. SAE's national chapter suspended the students,
... but University of Oklahoma President David Boren took it a step further,
... saying the university's affiliation with the fraternity is permanently done.'''
>>> input_ids = tokenizer_input(article, add_special_tokens=True, return_tensors='np').input_ids
>>> input_ids = tokenizer_input(article, add_special_tokens=True, return_tensors='np').input_ids
>>> # use GPT2's eos_token as the pad as well as eos token
>>> model.config.eos_token_id = model.config.decoder.eos_token_id
>>> model.config.pad_token_id = model.config.eos_token_id
>>> # use GPT2's eos_token as the pad as well as eos token
>>> model.config.eos_token_id = model.config.decoder.eos_token_id
>>> model.config.pad_token_id = model.config.eos_token_id
>>> sequences = model.generate(input_ids, num_beams=4, max_length=12).sequences
>>> sequences = model.generate(input_ids, num_beams=4, max_length=12).sequences
>>> summary = tokenizer_output.batch_decode(sequences, skip_special_tokens=True)[0]
>>> assert summary == "SAS Alpha Epsilon suspended Sigma Alpha Epsilon members"
"""
>>> summary = tokenizer_output.batch_decode(sequences, skip_special_tokens=True)[0]
>>> assert summary == "SAS Alpha Epsilon suspended Sigma Alpha Epsilon members"
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
......
......@@ -263,26 +263,28 @@ class TFEncoderDecoderModel(TFPreTrainedModel):
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
r"""
Initializing `TFEncoderDecoderModel` from a pytorch checkpoint is not supported currently.
Initializing *TFEncoderDecoderModel* from a PyTorch checkpoint is currently not supported.
If there are only pytorch checkpoints for a particular encoder-decoder model, a workaround is::
If there are only PyTorch checkpoints for a particular encoder-decoder model, a workaround is:
>>> # a workaround to load from pytorch checkpoint
>>> _model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
>>> _model.encoder.save_pretrained("./encoder")
>>> _model.decoder.save_pretrained("./decoder")
>>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
... "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
... )
>>> # This is only for copying some specific attributes of this particular model.
>>> model.config = _model.config
Example::
```python
>>> # a workaround to load from pytorch checkpoint
>>> _model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
>>> _model.encoder.save_pretrained("./encoder")
>>> _model.decoder.save_pretrained("./decoder")
>>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
... "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
... )
>>> # This is only for copying some specific attributes of this particular model.
>>> model.config = _model.config
```
>>> from transformers import TFEncoderDecoderModel
>>> model = TFEncoderDecoderModel.from_pretrained("ydshieh/bert2bert-cnn_dailymail-fp16")
Example:
"""
```python
>>> from transformers import TFEncoderDecoderModel
>>> model = TFEncoderDecoderModel.from_pretrained("ydshieh/bert2bert-cnn_dailymail-fp16")
```"""
from_pt = kwargs.pop("from_pt", False)
if from_pt:
......@@ -481,31 +483,31 @@ class TFEncoderDecoderModel(TFPreTrainedModel):
r"""
Returns:
Examples::
>>> from transformers import TFEncoderDecoderModel, BertTokenizer
Examples:
>>> # initialize a bert2gpt2 from a pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-cased', 'gpt2')
```python
>>> from transformers import TFEncoderDecoderModel, BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> # initialize a bert2gpt2 from a pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-cased', 'gpt2')
>>> # forward
>>> input_ids = tokenizer.encode("Hello, my dog is cute", add_special_tokens=True, return_tensors='tf') # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> # training
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids)
>>> loss, logits = outputs.loss, outputs.logits
>>> # forward
>>> input_ids = tokenizer.encode("Hello, my dog is cute", add_special_tokens=True, return_tensors='tf') # Batch size 1
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)
>>> # save and load from pretrained
>>> model.save_pretrained("bert2gpt2")
>>> model = TFEncoderDecoderModel.from_pretrained("bert2gpt2")
>>> # training
>>> outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids)
>>> loss, logits = outputs.loss, outputs.logits
>>> # generation
>>> generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.bos_token_id)
>>> # save and load from pretrained
>>> model.save_pretrained("bert2gpt2")
>>> model = TFEncoderDecoderModel.from_pretrained("bert2gpt2")
"""
>>> # generation
>>> generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.bos_token_id)
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
kwargs_encoder = {argument: value for argument, value in kwargs.items() if not argument.startswith("decoder_")}
......
......@@ -30,105 +30,105 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FlaubertConfig(XLMConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.FlaubertModel` or a
:class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
This is the configuration class to store the configuration of a [`FlaubertModel`] or a
[`TFFlaubertModel`]. It is used to instantiate a FlauBERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model
outputs. Read the documentation from [`PretrainedConfig`] for more information.
Args:
pre_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
pre_norm (`bool`, *optional*, defaults to `False`):
Whether to apply the layer normalization before or after the feed forward layer following the attention in
each layer (Vaswani et al., Tensor2Tensor for Neural Machine Translation. 2018)
layerdrop (:obj:`float`, `optional`, defaults to 0.0):
layerdrop (`float`, *optional*, defaults to 0.0):
Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand with
Structured Dropout. ICLR 2020)
vocab_size (:obj:`int`, `optional`, defaults to 30145):
vocab_size (`int`, *optional*, defaults to 30145):
Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by
the :obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
:class:`~transformers.TFFlaubertModel`.
emb_dim (:obj:`int`, `optional`, defaults to 2048):
the `inputs_ids` passed when calling [`FlaubertModel`] or
[`TFFlaubertModel`].
emb_dim (`int`, *optional*, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer.
n_layer (:obj:`int`, `optional`, defaults to 12):
n_layer (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, `optional`, defaults to 16):
n_head (`int`, *optional*, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, `optional`, defaults to 0.1):
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for the attention mechanism.
gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to use a `gelu` activation instead of `relu`.
sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
gelu_activation (`bool`, *optional*, defaults to `True`):
Whether or not to use a *gelu* activation instead of *relu*.
sinusoidal_embeddings (`bool`, *optional*, defaults to `False`):
Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
causal (`bool`, *optional*, defaults to `False`):
Whether or not the model should behave in a causal manner. Causal models use a triangular attention mask in
order to only attend to the left-side context instead of a bidirectional context.
asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
asm (`bool`, *optional*, defaults to `False`):
Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer.
n_langs (:obj:`int`, `optional`, defaults to 1):
n_langs (`int`, *optional*, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`)
Whether to use language embeddings. Some models use additional language embeddings, see `the multilingual
models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__ for
use_lang_emb (`bool`, *optional*, defaults to `True`):
Whether to use language embeddings. Some models use additional language embeddings, see [the multilingual
models page](http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings) for
information on how to use them.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
max_position_embeddings (`int`, *optional*, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
embed_init_std (`float`, *optional*, defaults to 2048^-0.5):
The standard deviation of the truncated_normal_initializer for initializing the embedding matrices.
init_std (:obj:`int`, `optional`, defaults to 50257):
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the
embedding matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
bos_index (:obj:`int`, `optional`, defaults to 0):
bos_index (`int`, *optional*, defaults to 0):
The index of the beginning of sentence token in the vocabulary.
eos_index (:obj:`int`, `optional`, defaults to 1):
eos_index (`int`, *optional*, defaults to 1):
The index of the end of sentence token in the vocabulary.
pad_index (:obj:`int`, `optional`, defaults to 2):
pad_index (`int`, *optional*, defaults to 2):
The index of the padding token in the vocabulary.
unk_index (:obj:`int`, `optional`, defaults to 3):
unk_index (`int`, *optional*, defaults to 3):
The index of the unknown token in the vocabulary.
mask_index (:obj:`int`, `optional`, defaults to 5):
mask_index (`int`, *optional*, defaults to 5):
The index of the masking token in the vocabulary.
is_encoder(:obj:`bool`, `optional`, defaults to :obj:`True`):
is_encoder (`bool`, *optional*, defaults to `True`):
Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`string`, `optional`, defaults to "first"):
summary_type (`str`, *optional*, defaults to `"first"`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
- `"last"`: Take the last token hidden state (like XLNet).
- `"first"`: Take the first token hidden state (like BERT).
- `"mean"`: Take the mean of all tokens hidden states.
- `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- `"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (`bool`, *optional*, defaults to `True`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
summary_activation (`str`, *optional*):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (`bool`, *optional*, defaults to `True`):
Used in the sequence classification and multiple choice models.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes.
summary_first_dropout (`float`, *optional*, defaults to 0.1):
Used in the sequence classification and multiple choice models.
The dropout ratio to be used after the projection and activation.
start_n_top (:obj:`int`, `optional`, defaults to 5):
start_n_top (`int`, *optional*, defaults to 5):
Used in the SQuAD evaluation script.
end_n_top (:obj:`int`, `optional`, defaults to 5):
end_n_top (`int`, *optional*, defaults to 5):
Used in the SQuAD evaluation script.
mask_token_id (:obj:`int`, `optional`, defaults to 0):
mask_token_id (`int`, *optional*, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, `optional`, defaults to 1):
lang_id (`int`, *optional*, defaults to 1):
The ID of the language used by the model. This parameter is used when generating text in a given language.
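Example (a minimal instantiation sketch, mirroring the pattern of the other configuration docstrings):

```python
>>> from transformers import FlaubertConfig, FlaubertModel

>>> # Initializing a FlauBERT configuration with default values
>>> configuration = FlaubertConfig()

>>> # Initializing a model from the configuration
>>> model = FlaubertModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```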
"""
......