Unverified commit aece7bad authored by NielsRogge, committed by GitHub

Improve Perceiver docs (#14786)



* Fix docs

* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Code quality
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
parent 50bc57ce
@@ -810,15 +810,16 @@ class PerceiverModel(PerceiverPreTrainedModel):
 >>> # EXAMPLE 2: using the Perceiver to classify images
 >>> # - we define an ImagePreprocessor, which can be used to embed images
 >>> preprocessor=PerceiverImagePreprocessor(
-config,
-prep_type="conv1x1",
-spatial_downsample=1,
-out_channels=256,
-position_encoding_type="trainable",
-concat_or_add_pos="concat",
-project_pos_dim=256,
-trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2),
-)
+...     config,
+...     prep_type="conv1x1",
+...     spatial_downsample=1,
+...     out_channels=256,
+...     position_encoding_type="trainable",
+...     concat_or_add_pos="concat",
+...     project_pos_dim=256,
+...     trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2,
+...     ),
+... )

 >>> model = PerceiverModel(
 ...     config,
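The new side of this hunk is cut off by the diff viewer. For context, a plausible continuation of this doctest, assuming the `PerceiverClassificationDecoder` class referenced in the hunks below and the config's `d_latents` attribute (a sketch, not necessarily the exact lines of the commit):

    >>> # hypothetical continuation of EXAMPLE 2
    >>> model = PerceiverModel(
    ...     config,
    ...     input_preprocessor=preprocessor,
    ...     decoder=PerceiverClassificationDecoder(
    ...         config,
    ...         num_channels=config.d_latents,
    ...         trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
    ...     ),
    ... )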
@@ -1188,10 +1189,11 @@ Example use of Perceiver for image classification, for tasks such as ImageNet.
 This model uses learned position embeddings. In other words, this model is not given any privileged information about
 the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.

-`PerceiverForImageClassificationLearned` uses
-`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv1x1") to
-preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
-decode the latent representation of `~transformers.PerceiverModel` into classification logits.
+:class:`~transformers.PerceiverForImageClassificationLearned` uses
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="conv1x1"`)
+to preprocess the input images, and
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+representation of :class:`~transformers.PerceiverModel` into classification logits.
 """,
 PERCEIVER_START_DOCSTRING,
 )
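A minimal inference sketch built on this class; the checkpoint name deepmind/vision-perceiver-learned is taken from the DeepMind releases on the Hub, and the image URL is only an illustrative example:

    import requests
    from PIL import Image
    from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

    # assumed checkpoint name; released alongside the Perceiver IO paper
    feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
    model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # the feature extractor resizes/normalizes the image to (batch, num_channels, height, width)
    inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
    logits = model(inputs=inputs).logits
    print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])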
@@ -1326,10 +1328,11 @@ Example use of Perceiver for image classification, for tasks such as ImageNet.
 This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of
 79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).

-`PerceiverForImageClassificationLearned` uses
-`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "pixels") to
-preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to
-decode the latent representation of `~transformers.PerceiverModel` into classification logits.
+:class:`~transformers.PerceiverForImageClassificationFourier` uses
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="pixels"`)
+to preprocess the input images, and
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+representation of :class:`~transformers.PerceiverModel` into classification logits.
 """,
 PERCEIVER_START_DOCSTRING,
 )
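The Fourier-features variant follows the same pipeline as the sketch above; again, the checkpoint name is an assumption based on the Hub releases:

    from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationFourier

    # assumed checkpoint name for the fixed 2D Fourier position-embedding variant
    feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-fourier")
    model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")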
@@ -1461,10 +1464,11 @@ Example use of Perceiver for image classification, for tasks such as ImageNet.
 This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy
 of 82.1 on ImageNet.

-`PerceiverForImageClassificationLearned` uses
-`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "conv") to preprocess
-the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the
-latent representation of `~transformers.PerceiverModel` into classification logits.
+:class:`~transformers.PerceiverForImageClassificationConvProcessing` uses
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="conv"`) to
+preprocess the input images, and
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder` to decode the latent
+representation of :class:`~transformers.PerceiverModel` into classification logits.
 """,
 PERCEIVER_START_DOCSTRING,
 )
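And the conv+maxpool variant, under the same assumption about the checkpoint name:

    from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationConvProcessing

    # assumed checkpoint name for the 2D conv+maxpool preprocessing variant
    feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-conv")
    model = PerceiverForImageClassificationConvProcessing.from_pretrained("deepmind/vision-perceiver-conv")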
@@ -1592,10 +1596,11 @@ class PerceiverForImageClassificationConvProcessing(PerceiverPreTrainedModel):

 @add_start_docstrings(
     """
-Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. `PerceiverForOpticalFlow` uses
-`transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with `prep_type` = "patches") to
-preprocess the input images, and `transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder` to
-decode the latent representation of `~transformers.PerceiverModel`.
+Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI.
+:class:`~transformers.PerceiverForOpticalFlow` uses
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor` (with :obj:`prep_type="patches"`)
+to preprocess the input images, and
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder` to decode the latent
+representation of :class:`~transformers.PerceiverModel`.

 As input, one concatenates 2 subsequent frames along the channel dimension and extracts a 3 x 3 patch around each pixel
 (leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position
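To make the 54-values-per-pixel arithmetic concrete, a minimal sketch of the input preparation described in this docstring, using plain PyTorch; the frame resolution is an illustrative assumption:

    import torch
    import torch.nn.functional as F

    # two consecutive RGB frames; 368x496 is only an illustrative resolution
    frame1 = torch.randn(1, 3, 368, 496)
    frame2 = torch.randn(1, 3, 368, 496)

    # concatenate along the channel dimension: (1, 6, 368, 496)
    frames = torch.cat([frame1, frame2], dim=1)

    # extract a 3x3 patch around each pixel:
    # 3 (patch h) * 3 (patch w) * 3 (RGB) * 2 (frames) = 54 values per pixel
    patches = F.unfold(frames, kernel_size=3, padding=1)  # (1, 54, 368 * 496)
    patches = patches.view(1, 54, 368, 496)               # back to spatial layout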
@@ -1717,25 +1722,26 @@ class PerceiverForOpticalFlow(PerceiverPreTrainedModel):
     """
 Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.

-`PerceiverForMultimodalAutoencoding` uses
-`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3 modalities:
-images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality
-separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same
-number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.
+:class:`~transformers.PerceiverForMultimodalAutoencoding` uses
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor` to preprocess the 3
+modalities: images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every
+modality separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to
+the same number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver
+encoder.

-`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
-representation of `~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` is used to decode the latent
+representation of :class:`~transformers.PerceiverModel`. This decoder uses each modality-specific decoder to construct
 queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video
 in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do
 cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which
-can be provided as additional input to the forward pass of `PerceiverForMultimodalAutoencoding`.
+can be provided as additional input to the forward pass of :class:`~transformers.PerceiverForMultimodalAutoencoding`.

-`transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of the
-different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
-cross-attention is performed with the latent representation of `PerceiverModel`.
+:class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder` also pads the decoder queries of
+the different modalities to the same number of channels, in order to concatenate them along the time dimension. Next,
+cross-attention is performed with the latent representation of :class:`~transformers.PerceiverModel`.

-Finally, `transformers.models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor` is used to turn this
-tensor into an actual video. It first splits up the output into the different modalities, and then applies the
+Finally, :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor` is used to turn
+this tensor into an actual video. It first splits up the output into the different modalities, and then applies the
 respective postprocessor for each modality.

 Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the
......
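Since the subsampling mechanism is the subtle part of this docstring, here is a minimal sketch of chunked decoding under stated assumptions: the input shapes follow the Kinetics-700 setup (16 frames of 224x224 video plus raw audio), while the checkpoint name and the chunk count of 128 are assumptions for illustration:

    import torch
    from transformers import PerceiverForMultimodalAutoencoding

    # assumed checkpoint name
    model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver")

    # dummy Kinetics-700-style inputs: 16 video frames, raw audio, and a masked label
    inputs = {
        "image": torch.randn(1, 16, 3, 224, 224),
        "audio": torch.randn(1, 30720, 1),
        "label": torch.zeros(1, 700),  # all zeros = masked classification label
    }

    # decode only one chunk of the output points; decoding everything at once is infeasible
    nchunks = 128
    image_chunk_size = 16 * 224 * 224 // nchunks
    audio_chunk_size = model.config.samples_per_patch * model.config.audio_samples_per_frame * 16 // nchunks
    chunk_idx = 0
    subsampling = {
        "image": torch.arange(image_chunk_size * chunk_idx, image_chunk_size * (chunk_idx + 1)),
        "audio": torch.arange(audio_chunk_size * chunk_idx, audio_chunk_size * (chunk_idx + 1)),
        "label": None,
    }
    outputs = model(inputs=inputs, subsampled_output_points=subsampling)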