Commit 0b3ddec6 authored by moto, committed by Facebook GitHub Bot

Adopt `:autosummary:` in `torchaudio.pipelines` module doc (#2689)

Summary:
* Introduce a mini-index on the `torchaudio.pipelines` page.
* Add introductions.
* Update pipeline tutorials.

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/pipelines.html

<img width="1163" alt="Screen Shot 2022-09-20 at 1 23 29 PM" src="https://user-images.githubusercontent.com/855818/191167049-98324e93-2e16-41db-8538-3b5b54eb8224.png">

<img width="1115" alt="Screen Shot 2022-09-20 at 1 23 49 PM" src="https://user-images.githubusercontent.com/855818/191167071-4770f594-2540-43a4-a01c-e983bf59220f.png">

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle

<img width="1108" alt="Screen Shot 2022-09-20 at 1 24 18 PM" src="https://user-images.githubusercontent.com/855818/191167123-51b33a5f-c30c-46bc-b002-b05d2d0d27b7.png">

Pull Request resolved: https://github.com/pytorch/audio/pull/2689

Reviewed By: carolineechen

Differential Revision: D39691253

Pulled By: mthrok

fbshipit-source-id: ddf5fdadb0b64cf2867b6271ba53e8e8c0fa7e49
parent 045cc372
..
   autogenerated from source/_templates/autosummary/bundle_class.rst

{{ name | underline }}

.. autoclass:: {{ fullname }}()

{%- if name in ["RNNTBundle.FeatureExtractor", "RNNTBundle.TokenProcessor"] %}
{%- set methods = ["__call__"] %}
{%- elif name == "Tacotron2TTSBundle.TextProcessor" %}
{%- set attributes = ["tokens"] %}
{%- set methods = ["__call__"] %}
{%- elif name == "Tacotron2TTSBundle.Vocoder" %}
{%- set attributes = ["sample_rate"] %}
{%- set methods = ["__call__"] %}
{% endif %}

..
   ATTRIBUTES

{%- for item in attributes %}
{%- if not item.startswith('_') %}

{{ item | underline("-") }}

.. container:: py attribute

   .. autoproperty:: {{ [fullname, item] | join('.') }}

{%- endif %}
{%- endfor %}

..
   METHODS

{%- for item in methods %}
{%- if item != "__init__" %}

{{ item | underline("-") }}

.. container:: py attribute

   .. automethod:: {{ [fullname, item] | join('.') }}

{%- endif %}
{%- endfor %}
..
   autogenerated from source/_templates/autosummary/bundle_data.rst

{{ name | underline }}

.. container:: py attribute

   .. autodata:: {{ fullname }}
      :no-value:
torchaudio.pipelines
====================

.. currentmodule:: torchaudio.pipelines

.. py:module:: torchaudio.pipelines

The ``torchaudio.pipelines`` module packages pre-trained models with support functions and meta-data into simple APIs tailored to perform specific tasks.

When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extraction and post processing in the same way they were done during the training. This requires carrying over information used during the training, such as the type of transforms and their parameters (for example, sampling rate and the number of FFT bins).

To tie this information to a pre-trained model and make it easily accessible, the ``torchaudio.pipelines`` module uses the concept of a ``Bundle`` class, which defines a set of APIs to instantiate pipelines and the interface of the pipelines.

The following figure illustrates this.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png

A pre-trained model and associated pipelines are expressed as an instance of ``Bundle``. Different instances of the same ``Bundle`` share the interface, but their implementations are not constrained to be of the same types. For example, :class:`SourceSeparationBundle` defines the interface for performing source separation, but its instance :data:`CONVTASNET_BASE_LIBRI2MIX` instantiates a model of :class:`~torchaudio.models.ConvTasNet`, while :data:`HDEMUCS_HIGH_MUSDB` instantiates a model of :class:`~torchaudio.models.HDemucs`. Still, because they share the same interface, the usage is the same.

.. note::

   Under the hood, the implementations of ``Bundle`` use components from other ``torchaudio`` modules, such as :mod:`torchaudio.models` and :mod:`torchaudio.transforms`, or even third-party libraries like `SentencePiece <https://github.com/google/sentencepiece>`__ and `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__. This implementation detail is abstracted away from library users.
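In practice, accessing a pipeline looks roughly like the following minimal sketch (using :data:`WAV2VEC2_ASR_BASE_960H` as an example; ``"speech.wav"`` is a hypothetical input file, and the exact attributes available depend on the bundle):

.. code-block:: python

   import torch
   import torchaudio

   # Pick a bundle; its attributes describe how the model was trained.
   bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
   print(bundle.sample_rate)   # expected input sampling rate
   print(bundle.get_labels())  # class labels of the ASR output

   # Instantiate the model with pre-trained weights (downloaded on first use).
   model = bundle.get_model()

   waveform, sample_rate = torchaudio.load("speech.wav")
   if sample_rate != bundle.sample_rate:
       waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       emission, _ = model(waveform)  # frame-wise probabilities over bundle.get_labels()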
RNN-T Streaming/Non-Streaming ASR
---------------------------------

Interface
^^^^^^^^^

``RNNTBundle`` defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   RNNTBundle
   RNNTBundle.FeatureExtractor
   RNNTBundle.TokenProcessor

.. rubric:: Tutorials using ``RNNTBundle``

.. minigallery:: torchaudio.pipelines.RNNTBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   EMFORMER_RNNT_BASE_LIBRISPEECH
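As a rough sketch, non-streaming decoding with :data:`EMFORMER_RNNT_BASE_LIBRISPEECH` follows the three steps above (``"speech.wav"`` is a hypothetical 16 kHz mono file, and the hypothesis handling is simplified; see the :class:`RNNTBundle` documentation for the exact types):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

   feature_extractor = bundle.get_feature_extractor()  # step 1: feature extraction
   decoder = bundle.get_decoder()                       # step 2: inference (beam search)
   token_processor = bundle.get_token_processor()       # step 3: de-tokenization

   waveform, sample_rate = torchaudio.load("speech.wav")
   with torch.inference_mode():
       features, length = feature_extractor(waveform.squeeze())
       hypotheses = decoder(features, length, 10)        # beam width of 10
   print(token_processor(hypotheses[0][0]))              # best hypothesis as text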
wav2vec 2.0 / HuBERT - SSL
--------------------------

Interface
^^^^^^^^^

``Wav2Vec2Bundle`` instantiates models that generate acoustic features that can be used for downstream inference and fine-tuning.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   Wav2Vec2Bundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   WAV2VEC2_BASE
   WAV2VEC2_LARGE
   WAV2VEC2_LARGE_LV60K
   WAV2VEC2_XLSR53
   HUBERT_BASE
   HUBERT_LARGE
   HUBERT_XLARGE
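A minimal sketch of extracting acoustic features with one of these bundles (:data:`HUBERT_BASE` is used only as an example, and ``"speech.wav"`` is a hypothetical input file):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.HUBERT_BASE
   model = bundle.get_model()

   waveform, sample_rate = torchaudio.load("speech.wav")
   waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       # List of Tensors, one per transformer layer, each of shape (batch, frames, feature dim)
       features, _ = model.extract_features(waveform)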
wav2vec 2.0 / HuBERT - Fine-tuned ASR
-------------------------------------

Interface
^^^^^^^^^

``Wav2Vec2ASRBundle`` instantiates models that generate probability distributions over pre-defined labels, which can be used for ASR.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   Wav2Vec2ASRBundle

.. rubric:: Tutorials using ``Wav2Vec2ASRBundle``

.. minigallery:: torchaudio.pipelines.Wav2Vec2ASRBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   WAV2VEC2_ASR_BASE_10M
   WAV2VEC2_ASR_BASE_100H
   WAV2VEC2_ASR_BASE_960H
   WAV2VEC2_ASR_LARGE_10M
   WAV2VEC2_ASR_LARGE_100H
   WAV2VEC2_ASR_LARGE_960H
   WAV2VEC2_ASR_LARGE_LV60K_10M
   WAV2VEC2_ASR_LARGE_LV60K_100H
   WAV2VEC2_ASR_LARGE_LV60K_960H
   VOXPOPULI_ASR_BASE_10K_DE
   VOXPOPULI_ASR_BASE_10K_EN
   VOXPOPULI_ASR_BASE_10K_ES
   VOXPOPULI_ASR_BASE_10K_FR
   VOXPOPULI_ASR_BASE_10K_IT
   HUBERT_ASR_LARGE
   HUBERT_ASR_XLARGE
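A rough sketch of running ASR with one of these bundles and a naive greedy decoder (assuming the blank token is at index 0 and ``|`` marks word boundaries, which holds for the English bundles above; see the speech recognition tutorials for complete decoders):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
   model = bundle.get_model()
   labels = bundle.get_labels()

   waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
   waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       emission, _ = model(waveform)

   # Greedy (argmax) decoding: collapse repeats and drop the blank token.
   indices = torch.unique_consecutive(torch.argmax(emission[0], dim=-1)).tolist()
   transcript = "".join(labels[i] for i in indices if i != 0)
   print(transcript.replace("|", " "))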
Tacotron2 Text-To-Speech
------------------------

``Tacotron2TTSBundle`` defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation, and vocoder. The spectrogram generation is based on the :class:`~torchaudio.models.Tacotron2` model.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png

``TextProcessor`` can be rule-based tokenization in the case of characters, or it can be a neural-network-based G2P model that generates a sequence of phonemes from the input text.

Similarly, ``Vocoder`` can be an algorithm without learned parameters, like `Griffin-Lim`, or a neural-network-based model, like `WaveGlow`.

Interface
^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   Tacotron2TTSBundle
   Tacotron2TTSBundle.TextProcessor
   Tacotron2TTSBundle.Vocoder

.. rubric:: Tutorials using ``Tacotron2TTSBundle``

.. minigallery:: torchaudio.pipelines.Tacotron2TTSBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   TACOTRON2_WAVERNN_PHONE_LJSPEECH
   TACOTRON2_WAVERNN_CHAR_LJSPEECH
   TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
   TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
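A minimal end-to-end sketch with one of these bundles (:data:`TACOTRON2_WAVERNN_PHONE_LJSPEECH` is used only as an example; the TTS tutorial walks through the same steps in detail):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

   processor = bundle.get_text_processor()  # tokenization
   tacotron2 = bundle.get_tacotron2()       # text -> mel spectrogram
   vocoder = bundle.get_vocoder()           # mel spectrogram -> waveform

   text = "Hello world! T T S stands for Text to Speech!"
   with torch.inference_mode():
       processed, lengths = processor(text)
       spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
       waveforms, _ = vocoder(spec, spec_lengths)

   torchaudio.save("output.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)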
Source Separation
-----------------

Interface
^^^^^^^^^

``SourceSeparationBundle`` instantiates source separation models which take single-channel audio and generate multi-channel audio.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   SourceSeparationBundle

.. rubric:: Tutorials using ``SourceSeparationBundle``

.. minigallery:: torchaudio.pipelines.SourceSeparationBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   CONVTASNET_BASE_LIBRI2MIX
   HDEMUCS_HIGH_MUSDB_PLUS
   HDEMUCS_HIGH_MUSDB
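A minimal sketch of running separation with one of these bundles (the input is assumed to be a single-channel mixture reshaped to `(batch, 1, frames)`; ``"mixture.wav"`` is a hypothetical file):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
   model = bundle.get_model()

   mixture, sample_rate = torchaudio.load("mixture.wav")
   mixture = torchaudio.functional.resample(mixture, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       sources = model(mixture.unsqueeze(0))  # assumed output shape: (batch, num_sources, frames)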
...@@ -410,3 +410,16 @@
  booktitle={Proceedings of the ISMIR 2021 Workshop on Music Source Separation},
  year={2021}
}
@article{CATTONI2021101155,
title = {MuST-C: A multilingual corpus for end-to-end speech translation},
journal = {Computer Speech & Language},
volume = {66},
pages = {101155},
year = {2021},
issn = {0885-2308},
doi = {https://doi.org/10.1016/j.csl.2020.101155},
url = {https://www.sciencedirect.com/science/article/pii/S0885230820300887},
author = {Roldano Cattoni and Mattia Antonino {Di Gangi} and Luisa Bentivogli and Matteo Negri and Marco Turchi},
keywords = {Spoken language translation, Multilingual corpus},
abstract = {End-to-end spoken language translation (SLT) has recently gained popularity thanks to the advancement of sequence to sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront with the scarcity of publicly available corpora to train data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small and of limited language coverage. We contribute to fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.}
}
.. py:module:: torchaudio.transforms

torchaudio.transforms
=====================

.. currentmodule:: torchaudio.transforms

The ``torchaudio.transforms`` module contains common audio processing and feature extraction transforms. The following diagram shows the relationship between some of the available transforms.

.. image:: https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png

...
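For instance, a mel-scale spectrogram can be computed in one step with :class:`MelSpectrogram`, or by composing :class:`Spectrogram` and :class:`MelScale`, which is the kind of relationship the diagram depicts. A minimal sketch (``"speech.wav"`` is a hypothetical input file):

.. code-block:: python

   import torchaudio

   waveform, sample_rate = torchaudio.load("speech.wav")

   # One-step transform ...
   mel_spec = torchaudio.transforms.MelSpectrogram(
       sample_rate=sample_rate, n_fft=1024, n_mels=64
   )(waveform)

   # ... equivalent to composing Spectrogram and MelScale.
   spec = torchaudio.transforms.Spectrogram(n_fft=1024, power=2.0)(waveform)
   mel_spec2 = torchaudio.transforms.MelScale(
       n_mels=64, sample_rate=sample_rate, n_stft=1024 // 2 + 1
   )(spec)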
...@@ -72,9 +72,10 @@ from torchaudio.utils import download_asset
# We use the pretrained `Wav2Vec 2.0 <https://arxiv.org/abs/2006.11477>`__
# Base model that is finetuned on 10 min of the `LibriSpeech
# dataset <http://www.openslr.org/12>`__, which can be loaded in using
# :data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M`.
# For more detail on running Wav2Vec 2.0 speech
# recognition pipelines in torchaudio, please refer to `this
# tutorial <./speech_recognition_pipeline_tutorial.html>`__.
#

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M

...@@ -177,7 +178,7 @@ print(tokens)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Pretrained files for the LibriSpeech dataset can be downloaded using
# :py:func:`~torchaudio.models.decoder.download_pretrained_files`.
#
# Note: this cell may take a couple of minutes to run, as the language
# model can be large

...@@ -202,7 +203,7 @@ print(files)
# Beam Search Decoder
# ~~~~~~~~~~~~~~~~~~~
# The decoder can be constructed using the factory function
# :py:func:`~torchaudio.models.decoder.ctc_decoder`.
# In addition to the previously mentioned components, it also takes in various beam
# search decoding parameters and token/word parameters.
#

...@@ -262,7 +263,7 @@ greedy_decoder = GreedyCTCDecoder(tokens)
#
# Now that we have the data, acoustic model, and decoder, we can perform
# inference. The output of the beam search decoder is of type
# :py:class:`~torchaudio.models.decoder.CTCHypothesis`, consisting of the
# predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score,
# and timesteps corresponding to the token IDs. Recall the transcript corresponding to the
# waveform is

...@@ -307,7 +308,8 @@ print(f"WER: {beam_search_wer}")
######################################################################
# .. note::
#
#    The :py:attr:`~torchaudio.models.decoder.CTCHypothesis.words`
#    field of the output hypotheses will be empty if no lexicon
#    is provided to the decoder. To retrieve a transcript with lexicon-free
#    decoding, you can perform the following to retrieve the token indices,
#    convert them to original tokens, then join them together.
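######################################################################
# A rough sketch of that post-processing (the variable names are illustrative:
# ``beam_search_result`` is assumed to be the decoder output from above and
# ``tokens`` the token list loaded earlier):

best_hypothesis = beam_search_result[0][0]  # top hypothesis of the first utterance
token_str = "".join(tokens[i] for i in best_hypothesis.tokens.tolist())
lexicon_free_transcript = " ".join(token_str.split("|"))  # "|" marks word boundaries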
...
...@@ -74,9 +74,9 @@ except ModuleNotFoundError:
# -------------------------
#
# Pre-trained model weights and related pipeline components are
# bundled as :py:class:`torchaudio.pipelines.RNNTBundle`.
#
# We use :py:data:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`,
# which is an Emformer RNN-T model trained on the LibriSpeech dataset.
#

...@@ -112,7 +112,7 @@ print(f"Right context: {context_length} frames ({context_length / sample_rate} s
# 4. Configure the audio stream
# -----------------------------
#
# Next, we configure the input audio stream using :py:class:`torchaudio.io.StreamReader`.
#
# For the detail of this API, please refer to the
# `Media Stream API tutorial <./streaming_api_tutorial.html>`__.

...
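######################################################################
# A rough sketch of that configuration (``segment_length`` and ``bundle`` are
# assumed to be defined earlier in this tutorial; the source can also be a URL
# or a microphone device):

from torchaudio.io import StreamReader

streamer = StreamReader(src="input.wav")  # hypothetical local file
streamer.add_basic_audio_stream(frames_per_chunk=segment_length, sample_rate=bundle.sample_rate)

for (chunk,) in streamer.stream():
    pass  # each ``chunk`` holds ``frames_per_chunk`` frames resampled to ``bundle.sample_rate``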
...@@ -26,7 +26,7 @@ pre-trained models from wav2vec 2.0
# Torchaudio provides easy access to the pre-trained weights and
# associated information, such as the expected sample rate and class
# labels. They are bundled together and available under the
# :py:mod:`torchaudio.pipelines` module.
#

...@@ -34,36 +34,26 @@ pre-trained models from wav2vec 2.0
# Preparation
# -----------
#

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

torch.random.manual_seed(0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

######################################################################
#

import IPython
import matplotlib.pyplot as plt

from torchaudio.utils import download_asset

SPEECH_FILE = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")

######################################################################

...@@ -85,11 +75,10 @@ if not os.path.exists(SPEECH_FILE):
# for other downstream tasks as well, but this tutorial does not
# cover that.
#
# We will use :py:data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H` here.
#
# There are multiple pre-trained models available in :py:mod:`torchaudio.pipelines`.
# Please check the documentation for the detail of how they are trained.
#
# The bundle object provides the interface to instantiate the model and other
# information. Sampling rate and the class labels are found as follows.

...@@ -134,7 +123,7 @@ IPython.display.Audio(SPEECH_FILE)
#
# - :py:func:`torchaudio.functional.resample` works on CUDA tensors as well.
# - When performing resampling multiple times on the same set of sample rates,
#   using :py:class:`torchaudio.transforms.Resample` might improve the performance.
#
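######################################################################
# For example (a minimal sketch; ``clips`` is a hypothetical list of waveforms
# that all share the same source rate), the transform builds its resampling
# kernel once and reuses it for every call:

resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)  # assumed rates
resampled = [resampler(clip) for clip in clips]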
waveform, sample_rate = torchaudio.load(SPEECH_FILE)

...@@ -167,7 +156,7 @@ with torch.inference_mode():
fig, ax = plt.subplots(len(features), 1, figsize=(16, 4.3 * len(features)))
for i, feats in enumerate(features):
    ax[i].imshow(feats[0].cpu(), interpolation="nearest")
    ax[i].set_title(f"Feature from transformer layer {i+1}")
    ax[i].set_xlabel("Feature dimension")
    ax[i].set_ylabel("Frame (time-axis)")

...@@ -197,7 +186,7 @@ with torch.inference_mode():
# Let’s visualize this.
#

plt.imshow(emission[0].cpu().T, interpolation="nearest")
plt.title("Classification result")
plt.xlabel("Frame (time-axis)")
plt.ylabel("Class")

...@@ -291,7 +280,7 @@ IPython.display.Audio(SPEECH_FILE)
# Conclusion
# ----------
#
# In this tutorial, we looked at how to use :py:class:`~torchaudio.pipelines.Wav2Vec2ASRBundle` to
# perform acoustic feature extraction and speech recognition. Constructing
# a model and getting the emission is as short as two lines.
#

...
...@@ -45,7 +45,7 @@ import matplotlib.pyplot as plt
#
# .. image:: https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png
#
# All the related components are bundled in :py:class:`torchaudio.pipelines.Tacotron2TTSBundle`,
# but this tutorial will also cover the process under the hood.

######################################################################

...@@ -196,10 +196,11 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
# :py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching
# models and processors together so that it is easy to create the pipeline.
#
# For the available bundles and their usage, please refer to
# :py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.
#

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

...@@ -271,8 +272,7 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)

######################################################################

...@@ -280,7 +280,9 @@ IPython.display.Audio("_assets/output_wavernn.wav")
# ~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate
# the vocoder object with the
# :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
# method and pass the spectrogram.
#

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

...@@ -300,12 +302,7 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)

######################################################################

...@@ -344,5 +341,4 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=22050)
...@@ -10,9 +10,7 @@ from torchaudio.models import conv_tasnet_base, hdemucs_high
@dataclass
class SourceSeparationBundle:
    """Dataclass that bundles components for performing source separation.

    Example
        >>> import torchaudio

...@@ -66,16 +64,16 @@ CONVTASNET_BASE_LIBRI2MIX = SourceSeparationBundle(
    _model_factory_func=partial(conv_tasnet_base, num_sources=2),
    _sample_rate=8000,
)
CONVTASNET_BASE_LIBRI2MIX.__doc__ = """Pre-trained Source Separation pipeline with *ConvTasNet*
:cite:`Luo_2019` trained on *Libri2Mix dataset* :cite:`cosentino2020librimix`.

The source separation model is constructed by :func:`~torchaudio.models.conv_tasnet_base`
and is trained using the training script ``lightning_train.py``
`here <https://github.com/pytorch/audio/tree/release/0.12/examples/source_separation/>`__
with default arguments.

Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""

HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
...@@ -83,14 +81,16 @@ HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
    _model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
    _sample_rate=44100,
)
HDEMUCS_HIGH_MUSDB_PLUS.__doc__ = """Pre-trained music source separation pipeline with
*Hybrid Demucs* :cite:`defossez2021hybrid` trained on MUSDB-HQ :cite:`MUSDB18HQ`
and additional internal training data.

The model is constructed by :func:`~torchaudio.models.hdemucs_high`.

Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.

Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""

HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
...@@ -98,11 +98,11 @@ HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
    _model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
    _sample_rate=44100,
)
HDEMUCS_HIGH_MUSDB.__doc__ = """Pre-trained music source separation pipeline with
*Hybrid Demucs* :cite:`defossez2021hybrid` trained on MUSDB-HQ :cite:`MUSDB18HQ`.

The model is constructed by :func:`~torchaudio.models.hdemucs_high`.
Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.

Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""
...@@ -213,17 +213,14 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH = _Tacotron2GriffinLimCharBundle(
    _tacotron2_path="tacotron2_english_characters_1500_epochs_ljspeech.pth",
    _tacotron2_params=utils._get_taco_params(n_symbols=38),
)
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.transforms.GriffinLim` as vocoder.

The text processor encodes the input texts character-by-character.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The default parameters were used.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

Example - "Hello world! T T S stands for Text to Speech!"

...@@ -255,8 +252,8 @@ TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH = _Tacotron2GriffinLimPhoneBundle(
    _tacotron2_path="tacotron2_english_phonemes_1500_epochs_ljspeech.pth",
    _tacotron2_params=utils._get_taco_params(n_symbols=96),
)
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.transforms.GriffinLim` as vocoder.

The text processor encodes the input texts based on phoneme.
It uses `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__ to convert
graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The text processor is set to the *"english_phonemes"*.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

Example - "Hello world! T T S stands for Text to Speech!"

...@@ -304,18 +298,14 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH = _Tacotron2WaveRNNCharBundle(
    _wavernn_path="wavernn_10k_epochs_8bits_ljspeech.pth",
    _wavernn_params=utils._get_wrnn_params(),
)
TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.

The text processor encodes the input texts character-by-character.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

...@@ -351,8 +341,8 @@ TACOTRON2_WAVERNN_PHONE_LJSPEECH = _Tacotron2WaveRNNPhoneBundle(
    _wavernn_path="wavernn_10k_epochs_8bits_ljspeech.pth",
    _wavernn_params=utils._get_wrnn_params(),
)
TACOTRON2_WAVERNN_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.

The text processor encodes the input texts based on phoneme.
It uses `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__ to convert
graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.

You can find the training script for Tacotron2 `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.

You can find the training script for WaveRNN `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

...
...@@ -11,8 +11,6 @@ class _TextProcessor(ABC):
    def tokens(self):
        """The tokens that the each value in the processed tensor represent.

        :type: List[str]
        """

...@@ -20,8 +18,6 @@ class _TextProcessor(ABC):
    def __call__(self, texts: Union[str, List[str]]) -> Tuple[Tensor, Tensor]:
        """Encode the given (batch of) texts into numerical tensors

        Args:
            text (str or list of str): The input texts.

...@@ -40,8 +36,6 @@ class _Vocoder(ABC):
    def sample_rate(self):
        """The sample rate of the resulting waveform

        :type: float
        """

...@@ -49,8 +43,6 @@ class _Vocoder(ABC):
    def __call__(self, specgrams: Tensor, lengths: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:
        """Generate waveform from the given input, such as spectrogram

        Args:
            specgrams (Tensor):
                The input spectrogram. Shape: `(batch, frequency bins, time)`.

...@@ -149,22 +141,19 @@ class Tacotron2TTSBundle(ABC):
    # The thing is, text processing and vocoder are generic and we do not know what kind of
    # new text processing and vocoder will be added in the future, so we want to make these
    # interfaces specific to this Tacotron2TTS pipeline.

    class TextProcessor(_TextProcessor):
        """Interface of the text processing part of Tacotron2TTS pipeline

        See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor` for the usage.
        """

    class Vocoder(_Vocoder):
        """Interface of the vocoder part of Tacotron2TTS pipeline

        See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder` for the usage.
        """

    @abstractmethod
    def get_text_processor(self, *, dl_kwargs=None) -> TextProcessor:
        """Create a text processor

...@@ -181,7 +170,7 @@ class Tacotron2TTSBundle(ABC):
                Passed to :func:`torch.hub.download_url_to_file`.

        Returns:
            TextProcessor:
                A callable which takes a string or a list of strings as input and
                returns Tensor of encoded texts and Tensor of valid lengths.
                The object also has ``tokens`` property, which allows to recover the

...@@ -246,7 +235,7 @@ class Tacotron2TTSBundle(ABC):
                Passed to :func:`torch.hub.load_state_dict_from_url`.

        Returns:
            Vocoder:
                A vocoder module, which takes spectrogram Tensor and an optional
                length Tensor, then returns resulting waveform Tensor and an optional
                length Tensor.

...
...@@ -13,9 +13,7 @@ __all__ = [] ...@@ -13,9 +13,7 @@ __all__ = []
@dataclass @dataclass
class Wav2Vec2Bundle: class Wav2Vec2Bundle:
"""torchaudio.pipelines.Wav2Vec2Bundle() """Data class that bundles associated information to use pretrained :py:class:`~torchaudio.models.Wav2Vec2Model`.
Data class that bundles associated information to use pretrained Wav2Vec2Model.
This class provides interfaces for instantiating the pretrained model along with This class provides interfaces for instantiating the pretrained model along with
the information necessary to retrieve pretrained weights and additional data the information necessary to retrieve pretrained weights and additional data
...@@ -79,9 +77,8 @@ class Wav2Vec2Bundle: ...@@ -79,9 +77,8 @@ class Wav2Vec2Bundle:
@dataclass @dataclass
class Wav2Vec2ASRBundle(Wav2Vec2Bundle): class Wav2Vec2ASRBundle(Wav2Vec2Bundle):
"""torchaudio.pipelines.Wav2Vec2ASRBundle() """Data class that bundles associated information to use pretrained
:py:class:`~torchaudio.models.Wav2Vec2Model`.
Data class that bundles associated information to use pretrained Wav2Vec2Model.
This class provides interfaces for instantiating the pretrained model along with This class provides interfaces for instantiating the pretrained model along with
the information necessary to retrieve pretrained weights and additional data the information necessary to retrieve pretrained weights and additional data
...@@ -196,18 +193,16 @@ WAV2VEC2_BASE = Wav2Vec2Bundle( ...@@ -196,18 +193,16 @@ WAV2VEC2_BASE = Wav2Vec2Bundle(
}, },
_sample_rate=16000, _sample_rate=16000,
) )
WAV2VEC2_BASE.__doc__ = """wav2vec 2.0 model with "Base" configuration. WAV2VEC2_BASE.__doc__ = """Wav2vec 2.0 model ("base" architecture),
pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964` (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license. redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__, [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__] `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage. Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501 """ # noqa: E501
WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle( WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
...@@ -241,9 +236,8 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle( ...@@ -241,9 +236,8 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
_labels=utils._get_en_labels(), _labels=utils._get_en_labels(),
_sample_rate=16000, _sample_rate=16000,
) )
WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module WAV2VEC2_ASR_BASE_10M.__doc__ = """Wav2vec 2.0 model ("base" architecture with an extra linear module),
pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
:cite:`librilight` ("train-10min" subset). :cite:`librilight` ("train-10min" subset).
...@@ -253,7 +247,7 @@ redistributed with the same license. ...@@ -253,7 +247,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__, [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__] `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage. Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501 """ # noqa: E501
WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
@@ -288,9 +282,8 @@ WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
    _sample_rate=16000,
)
-WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_BASE_100H.__doc__ = """Wav2vec 2.0 model ("base" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset.
@@ -299,7 +292,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
@@ -333,9 +326,8 @@ WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_BASE_960H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_BASE_960H.__doc__ = """Wav2vec 2.0 model ("base" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
@@ -344,7 +336,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_LARGE = Wav2Vec2Bundle(
@@ -377,18 +369,16 @@ WAV2VEC2_LARGE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-WAV2VEC2_LARGE.__doc__ = """Build "large" wav2vec2 model.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
-(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Not fine-tuned.
+WAV2VEC2_LARGE.__doc__ = """Wav2vec 2.0 model ("large" architecture),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
@@ -422,9 +412,8 @@ WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_10M.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_10M.__doc__ = """Wav2vec 2.0 model ("large" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
:cite:`librilight` ("train-10min" subset).
@@ -434,7 +423,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
@@ -468,9 +457,8 @@ WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_100H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_100H.__doc__ = """Wav2vec 2.0 model ("large" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from
the same dataset ("train-clean-100" subset).
@@ -480,7 +468,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
@@ -514,9 +502,8 @@ WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_960H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_960H.__doc__ = """Wav2vec 2.0 model ("large" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
@@ -525,7 +512,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
@@ -558,18 +545,16 @@ WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-WAV2VEC2_LARGE_LV60K.__doc__ = """Build "large-lv60k" wav2vec2 model.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`.
-Not fine-tuned.
+WAV2VEC2_LARGE_LV60K.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`,
+not fine-tuned.
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
@@ -603,19 +588,16 @@ WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`, and
-fine-tuned for ASR on 10 minutes of transcribed audio from
-the same dataset ("train-10min" subset).
+WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`, and
+fine-tuned for ASR on 10 minutes of transcribed audio from the same dataset ("train-10min" subset).
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
@@ -649,10 +631,8 @@ WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`, and
+WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 100 hours of transcribed audio from
*LibriSpeech* dataset :cite:`7178964` ("train-clean-100" subset).
@@ -661,7 +641,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
@@ -695,12 +675,9 @@ WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
-Pre-trained on 60,000 hours of unlabeled audio from *Libri-Light*
-:cite:`librilight` dataset, and
-fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* :cite:`librilight` dataset, and
+fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
@@ -708,7 +685,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
@@ -741,13 +718,12 @@ WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-WAV2VEC2_XLSR53.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Trained on 56,000 hours of unlabeled audio from multiple datasets (
+WAV2VEC2_XLSR53.__doc__ = """Wav2vec 2.0 model ("base" architecture),
+pre-trained on 56,000 hours of unlabeled audio from multiple datasets (
*Multilingual LibriSpeech* :cite:`Pratap_2020`,
*CommonVoice* :cite:`ardila2020common` and
-*BABEL* :cite:`Gales2014SpeechRA`).
-Not fine-tuned.
+*BABEL* :cite:`Gales2014SpeechRA`),
+not fine-tuned.
Originally published by the authors of
*Unsupervised Cross-lingual Representation Learning for Speech Recognition*
@@ -755,7 +731,7 @@ Originally published by the authors of
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_BASE = Wav2Vec2Bundle(
@@ -788,18 +764,16 @@ HUBERT_BASE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-HUBERT_BASE.__doc__ = """HuBERT model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
-(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Not fine-tuned.
+HUBERT_BASE.__doc__ = """HuBERT model ("base" architecture),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_LARGE = Wav2Vec2Bundle(
@@ -832,18 +806,16 @@ HUBERT_LARGE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-HUBERT_LARGE.__doc__ = """HuBERT model with "Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`.
-Not fine-tuned.
+HUBERT_LARGE.__doc__ = """HuBERT model ("large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`,
+not fine-tuned.
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_XLARGE = Wav2Vec2Bundle(
@@ -876,18 +848,16 @@ HUBERT_XLARGE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-HUBERT_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`.
-Not fine-tuned.
+HUBERT_XLARGE.__doc__ = """HuBERT model ("extra large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`,
+not fine-tuned.
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
@@ -921,12 +891,9 @@ HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-HUBERT_ASR_LARGE.__doc__ = """HuBERT model with "Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`, and
-fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset :cite:`7178964`
+HUBERT_ASR_LARGE.__doc__ = """HuBERT model ("large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`, and
+fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
@@ -934,7 +901,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
@@ -968,9 +935,8 @@ HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-HUBERT_ASR_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
+HUBERT_ASR_XLARGE.__doc__ = """HuBERT model ("extra large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from
*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 960 hours of transcribed audio from
*LibriSpeech* dataset :cite:`7178964`
@@ -981,7 +947,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1017,18 +983,17 @@ VOXPOPULI_ASR_BASE_10K_DE = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3, 35),
)
-VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
+VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1064,18 +1029,17 @@ VOXPOPULI_ASR_BASE_10K_EN = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3, 31),
)
-VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
+VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1111,18 +1075,17 @@ VOXPOPULI_ASR_BASE_10K_ES = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3, 35),
)
-VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
+VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
@@ -1156,18 +1119,17 @@ VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
    _labels=utils._get_fr_labels(),
    _sample_rate=16000,
)
-VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
+VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1203,16 +1165,15 @@ VOXPOPULI_ASR_BASE_10K_IT = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3),
)
-VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
+VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -151,9 +151,7 @@ class _SentencePieceTokenProcessor(_TokenProcessor):
@dataclass
class RNNTBundle:
-    """torchaudio.pipelines.RNNTBundle()
-    Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text)
+    """Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text)
    inference with an RNN-T model.
    More specifically, the class provides methods that produce the featurization pipeline,
@@ -165,7 +163,7 @@ class RNNTBundle:
    Users should not directly instantiate objects of this class; rather, users should use the
    instances (representing pre-trained models) that exist within the module,
-    e.g. :py:obj:`EMFORMER_RNNT_BASE_LIBRISPEECH`.
+    e.g. :data:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`.
    Example
        >>> import torchaudio
@@ -226,10 +224,10 @@ class RNNTBundle:
    """
    class FeatureExtractor(_FeatureExtractor):
-        pass
+        """Interface of the feature extraction part of RNN-T pipeline"""
    class TokenProcessor(_TokenProcessor):
-        pass
+        """Interface of the token processor part of RNN-T pipeline"""
    _rnnt_path: str
    _rnnt_factory_func: Callable[[], RNNT]
@@ -370,11 +368,13 @@ EMFORMER_RNNT_BASE_LIBRISPEECH = RNNTBundle(
    _segment_length=16,
    _right_context_length=4,
)
-EMFORMER_RNNT_BASE_LIBRISPEECH.__doc__ = """Pre-trained Emformer-RNNT-based ASR pipeline capable of performing both streaming and non-streaming inference.
+EMFORMER_RNNT_BASE_LIBRISPEECH.__doc__ = """ASR pipeline based on Emformer-RNNT,
+pretrained on *LibriSpeech* dataset :cite:`7178964`,
+capable of performing both streaming and non-streaming inference.
The underlying model is constructed by :py:func:`torchaudio.models.emformer_rnnt_base`
and utilizes weights trained on LibriSpeech using training script ``train.py``
`here <https://github.com/pytorch/audio/tree/main/examples/asr/emformer_rnnt>`__ with default arguments.
Please refer to :py:class:`RNNTBundle` for usage instructions.
"""