Commit 476ab9ab authored by moto, committed by Facebook GitHub Bot

Consolidate bibliography / reference (#2676)

Summary:
Preparation for the adoption of `autosummary`.

Replace `:footcite:` with `:cite:` and introduce a dedicated reference page, as `:footcite:` does not work well with `autosummary`.

Examples:

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/datasets.html#cmuarctic

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/references.html
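
For background, both roles come from the `sphinxcontrib-bibtex` extension: `:footcite:` renders footnote-style citations that need a `.. footbibliography::` directive on every page that cites something, which the stub pages generated by `autosummary` do not carry, while `:cite:` links each citation to a single `.. bibliography::` directive on one shared page. Below is a minimal sketch of that layout; the file name and path are illustrative rather than the exact ones added in this PR, and it assumes `sphinxcontrib.bibtex` is enabled in `conf.py` with `bibtex_bibfiles` pointing at the project's `.bib` file.

```rst
.. docs/source/references.rst (hypothetical path): the single, shared reference page.

References
==========

.. bibliography::

.. Any docstring or .rst page built into the docs can then cite an entry inline
.. with the :cite: role instead of :footcite:, for example:

Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`.
```

Every `:cite:` reference in the generated API pages then resolves to that one reference page, which is what the `references.html` artifact above shows.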

Pull Request resolved: https://github.com/pytorch/audio/pull/2676

Reviewed By: carolineechen

Differential Revision: D39509431

Pulled By: mthrok

fbshipit-source-id: e6003dd01ec3eff3d598054690f61de8ee31ac9a
parent 50c66721
@@ -218,7 +218,7 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline wit
The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The default parameters were used.
@@ -264,7 +264,7 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The text processor is set to the *"english_phonemes"*.
@@ -309,13 +309,13 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :
The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -360,13 +360,13 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -198,11 +198,11 @@ WAV2VEC2_BASE = Wav2Vec2Bundle(
)
WAV2VEC2_BASE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -243,12 +243,12 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
-[:footcite:`librilight`] ("train-10min" subset).
+:cite:`librilight` ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -290,11 +290,11 @@ WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -335,11 +335,11 @@ WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_BASE_960H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -379,11 +379,11 @@ WAV2VEC2_LARGE = Wav2Vec2Bundle(
)
WAV2VEC2_LARGE.__doc__ = """Build "large" wav2vec2 model.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -424,12 +424,12 @@ WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_LARGE_10M.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
-[:footcite:`librilight`] ("train-10min" subset).
+:cite:`librilight` ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -470,12 +470,12 @@ WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_LARGE_100H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from
the same dataset ("train-clean-100" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -516,11 +516,11 @@ WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_LARGE_960H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -561,10 +561,10 @@ WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
WAV2VEC2_LARGE_LV60K.__doc__ = """Build "large-lv60k" wav2vec2 model.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -606,11 +606,11 @@ WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 10 minutes of transcribed audio from
the same dataset ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -652,11 +652,11 @@ WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 100 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`] ("train-clean-100" subset).
+*LibriSpeech* dataset :cite:`7178964` ("train-clean-100" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -698,12 +698,12 @@ WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of unlabeled audio from *Libri-Light*
-[:footcite:`librilight`] dataset, and
+:cite:`librilight` dataset, and
fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -744,14 +744,14 @@ WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
WAV2VEC2_XLSR53.__doc__ = """wav2vec 2.0 model with "Base" configuration.
Trained on 56,000 hours of unlabeled audio from multiple datasets (
-*Multilingual LibriSpeech* [:footcite:`Pratap_2020`],
-*CommonVoice* [:footcite:`ardila2020common`] and
-*BABEL* [:footcite:`Gales2014SpeechRA`]).
+*Multilingual LibriSpeech* :cite:`Pratap_2020`,
+*CommonVoice* :cite:`ardila2020common` and
+*BABEL* :cite:`Gales2014SpeechRA`).
Not fine-tuned.
Originally published by the authors of
*Unsupervised Cross-lingual Representation Learning for Speech Recognition*
-[:footcite:`conneau2020unsupervised`] under MIT License and redistributed with the same license.
+:cite:`conneau2020unsupervised` under MIT License and redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -790,11 +790,11 @@ HUBERT_BASE = Wav2Vec2Bundle(
)
HUBERT_BASE.__doc__ = """HuBERT model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -835,10 +835,10 @@ HUBERT_LARGE = Wav2Vec2Bundle(
HUBERT_LARGE.__doc__ = """HuBERT model with "Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -879,10 +879,10 @@ HUBERT_XLARGE = Wav2Vec2Bundle(
HUBERT_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -924,12 +924,12 @@ HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
HUBERT_ASR_LARGE.__doc__ = """HuBERT model with "Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -971,12 +971,12 @@ HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
HUBERT_ASR_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -1019,11 +1019,11 @@ VOXPOPULI_ASR_BASE_10K_DE = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1066,11 +1066,11 @@ VOXPOPULI_ASR_BASE_10K_EN = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1113,11 +1113,11 @@ VOXPOPULI_ASR_BASE_10K_ES = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1158,11 +1158,11 @@ VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1205,11 +1205,11 @@ VOXPOPULI_ASR_BASE_10K_IT = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -444,7 +444,7 @@ class _ConvEmformerLayer(torch.nn.Module):
class ConvEmformer(_EmformerImpl):
r"""Implements the convolution-augmented streaming transformer architecture introduced in
*Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution*
-[:footcite:`9747706`].
+:cite:`9747706`.
Args:
input_dim (int): input dimension.
@@ -104,7 +104,7 @@ class MVDR(torch.nn.Module):
Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
We provide three solutions of MVDR beamforming. One is based on *reference channel selection*
-[:footcite:`souden2009optimal`] (``solution=ref_channel``).
+:cite:`souden2009optimal` (``solution=ref_channel``).
.. math::
\\textbf{w}_{\\text{MVDR}}(f) =\
@@ -126,7 +126,7 @@ class MVDR(torch.nn.Module):
:math:`.^{\\mathsf{H}}` denotes the Hermitian Conjugate operation.
We apply either *eigenvalue decomposition*
-[:footcite:`higuchi2016robust`] or the *power method* [:footcite:`mises1929praktische`] to get the
+:cite:`higuchi2016robust` or the *power method* :cite:`mises1929praktische` to get the
steering vector from the PSD matrix of speech.
After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
@@ -137,7 +137,7 @@ class MVDR(torch.nn.Module):
where :math:`\\bf{Y}` and :math:`\\hat{\\bf{S}}` are the STFT of the multi-channel noisy speech and\
the single-channel enhanced speech, respectively.
-For online streaming audio, we provide a *recursive method* [:footcite:`higuchi2017online`] to update the
+For online streaming audio, we provide a *recursive method* :cite:`higuchi2017online` to update the
PSD matrices of speech and noise, respectively.
Args:
@@ -341,7 +341,7 @@ class MVDR(torch.nn.Module):
class RTFMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise.
.. devices:: CPU CUDA
@@ -405,8 +405,8 @@ class RTFMVDR(torch.nn.Module):
class SoudenMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
-based on the method proposed by *Souden et, al.* [:footcite:`souden2009optimal`].
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
+based on the method proposed by *Souden et, al.* :cite:`souden2009optimal`.
.. devices:: CPU CUDA
@@ -214,8 +214,8 @@ class GriffinLim(torch.nn.Module):
.. properties:: Autograd TorchScript
Implementation ported from
-*librosa* [:footcite:`brian_mcfee-proc-scipy-2015`], *A fast Griffin-Lim algorithm* [:footcite:`6701851`]
-and *Signal estimation from modified short-time Fourier transform* [:footcite:`1172092`].
+*librosa* :cite:`brian_mcfee-proc-scipy-2015`, *A fast Griffin-Lim algorithm* :cite:`6701851`
+and *Signal estimation from modified short-time Fourier transform* :cite:`1172092`.
Args:
n_fft (int, optional): Size of FFT, creates ``n_fft // 2 + 1`` bins. (Default: ``400``)
@@ -1040,7 +1040,7 @@ class TimeStretch(torch.nn.Module):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
hop_length (int or None, optional): Length of hop between STFT windows. (Default: ``win_length // 2``)
@@ -1226,7 +1226,7 @@ class FrequencyMasking(_AxisMasking):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
freq_mask_param (int): maximum possible length of the mask.
@@ -1260,7 +1260,7 @@ class TimeMasking(_AxisMasking):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
time_mask_param (int): maximum possible length of the mask.
@@ -1724,7 +1724,7 @@ class PitchShift(LazyModuleMixin, torch.nn.Module):
class RNNTLoss(torch.nn.Module):
"""Compute the RNN Transducer loss from *Sequence Transduction with Recurrent Neural Networks*
-[:footcite:`graves2012sequence`].
+:cite:`graves2012sequence`.
.. devices:: CPU CUDA