Commit 476ab9ab authored by moto, committed by Facebook GitHub Bot

Consolidate bibliography / reference (#2676)

Summary:
Preparation for the adoption of `autosummary`.

Replace `:footcite:` with `:cite:` and introduce a dedicated reference page, as `:footcite:` does not work well with `autosummary`.

Example:

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/datasets.html#cmuarctic

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/references.html
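For reference, the migration assumes the standard sphinxcontrib-bibtex wiring; a minimal sketch (file names such as `refs.bib` are illustrative, not taken from this PR):

```python
# docs/source/conf.py -- minimal sketch assuming sphinxcontrib-bibtex is installed
extensions = [
    "sphinx.ext.autosummary",
    "sphinxcontrib.bibtex",
]
bibtex_bibfiles = ["refs.bib"]  # BibTeX database backing the :cite: role
```

The dedicated reference page is then a single RST file containing a `.. bibliography::` directive, which collects every `:cite:` entry in one place; `:footcite:`, by contrast, needs a `.. footbibliography::` directive on each citing page, which is presumably what interacts badly with `autosummary`-generated pages.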

Pull Request resolved: https://github.com/pytorch/audio/pull/2676

Reviewed By: carolineechen

Differential Revision: D39509431

Pulled By: mthrok

fbshipit-source-id: e6003dd01ec3eff3d598054690f61de8ee31ac9a
parent 50c66721
@@ -218,7 +218,7 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with
The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The default parameters were used.
@@ -264,7 +264,7 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The text processor is set to *"english_phonemes"*.
@@ -309,13 +309,13 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :
The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used: ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocoder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on the 8-bit waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -360,13 +360,13 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used: ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocoder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on the 8-bit waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
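These bundles are driven through the common `Tacotron2TTSBundle` interface; a hedged usage sketch (the input sentence is arbitrary):

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()  # character-based text processor
tacotron2 = bundle.get_tacotron2()       # text -> mel spectrogram
vocoder = bundle.get_vocoder()           # mel spectrogram -> waveform (WaveRNN)

tokens, lengths = processor("Hello world!")
with torch.inference_mode():
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, _ = vocoder(spec, spec_lengths)
```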
@@ -444,7 +444,7 @@ class _ConvEmformerLayer(torch.nn.Module):
class ConvEmformer(_EmformerImpl):
r"""Implements the convolution-augmented streaming transformer architecture introduced in
*Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution*
-[:footcite:`9747706`].
+:cite:`9747706`.
Args:
input_dim (int): input dimension.
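A hypothetical instantiation of the class above; the positional argument order (input_dim, num_heads, ffn_dim, num_layers, segment_length, kernel_size) and every value below are assumptions for illustration, not taken from this diff:

```python
import torch
from torchaudio.prototype.models import ConvEmformer

# Illustrative hyperparameters: 80-dim features, 4 heads, FFN dim 1024,
# 12 layers, segment length 16, conv kernel size 8, right context 4.
conv_emformer = ConvEmformer(80, 4, 1024, 12, 16, 8, right_context_length=4)

inputs = torch.rand(10, 200, 80)        # (batch, num_frames, input_dim)
lengths = torch.randint(1, 200, (10,))  # valid frames per utterance
output, output_lengths = conv_emformer(inputs, lengths)
```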
@@ -104,7 +104,7 @@ class MVDR(torch.nn.Module):
Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
We provide three solutions of MVDR beamforming. One is based on *reference channel selection*
-[:footcite:`souden2009optimal`] (``solution=ref_channel``).
+:cite:`souden2009optimal` (``solution=ref_channel``).
.. math::
\\textbf{w}_{\\text{MVDR}}(f) =\
@@ -126,7 +126,7 @@ class MVDR(torch.nn.Module):
:math:`.^{\\mathsf{H}}` denotes the Hermitian Conjugate operation.
We apply either *eigenvalue decomposition*
-[:footcite:`higuchi2016robust`] or the *power method* [:footcite:`mises1929praktische`] to get the
+:cite:`higuchi2016robust` or the *power method* :cite:`mises1929praktische` to get the
steering vector from the PSD matrix of speech.
After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
@@ -137,7 +137,7 @@ class MVDR(torch.nn.Module):
where :math:`\\bf{Y}` and :math:`\\hat{\\bf{S}}` are the STFT of the multi-channel noisy speech and\
the single-channel enhanced speech, respectively.
-For online streaming audio, we provide a *recursive method* [:footcite:`higuchi2017online`] to update the
+For online streaming audio, we provide a *recursive method* :cite:`higuchi2017online` to update the
PSD matrices of speech and noise, respectively.
Args:
@@ -341,7 +341,7 @@ class MVDR(torch.nn.Module):
class RTFMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise.
.. devices:: CPU CUDA
@@ -405,8 +405,8 @@ class RTFMVDR(torch.nn.Module):
class SoudenMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
-based on the method proposed by *Souden et, al.* [:footcite:`souden2009optimal`].
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
+based on the method proposed by *Souden et al.* :cite:`souden2009optimal`.
.. devices:: CPU CUDA
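As a rough usage sketch for the Souden variant above, with a random tensor standing in for a real time-frequency mask (mask estimation is out of scope here):

```python
import torch
import torchaudio.transforms as T

stft = T.Spectrogram(n_fft=400, power=None)  # power=None keeps the complex STFT
specgram = stft(torch.randn(4, 16000))       # (channel, freq, time), dummy 4-channel mix

mask = torch.rand(specgram.shape[-2:])       # stand-in for an estimated speech mask
psd = T.PSD()
psd_s = psd(specgram, mask)                  # speech PSD matrix
psd_n = psd(specgram, 1 - mask)              # noise PSD matrix

mvdr = T.SoudenMVDR()
enhanced = mvdr(specgram, psd_s, psd_n, reference_channel=0)  # single-channel STFT
```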
@@ -214,8 +214,8 @@ class GriffinLim(torch.nn.Module):
.. properties:: Autograd TorchScript
Implementation ported from
-*librosa* [:footcite:`brian_mcfee-proc-scipy-2015`], *A fast Griffin-Lim algorithm* [:footcite:`6701851`]
-and *Signal estimation from modified short-time Fourier transform* [:footcite:`1172092`].
+*librosa* :cite:`brian_mcfee-proc-scipy-2015`, *A fast Griffin-Lim algorithm* :cite:`6701851`
+and *Signal estimation from modified short-time Fourier transform* :cite:`1172092`.
Args:
n_fft (int, optional): Size of FFT, creates ``n_fft // 2 + 1`` bins. (Default: ``400``)
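A minimal round-trip sketch for the transform documented above; the main requirement is that the ``power`` settings of the forward and inverse transforms match (values below are illustrative):

```python
import torch
import torchaudio.transforms as T

n_fft = 400
to_spec = T.Spectrogram(n_fft=n_fft, power=2)     # magnitude-squared spectrogram
griffin_lim = T.GriffinLim(n_fft=n_fft, power=2)  # recovers phase iteratively

waveform = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz
reconstructed = griffin_lim(to_spec(waveform))
```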
@@ -1040,7 +1040,7 @@ class TimeStretch(torch.nn.Module):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
hop_length (int or None, optional): Length of hop between STFT windows. (Default: ``win_length // 2``)
@@ -1226,7 +1226,7 @@ class FrequencyMasking(_AxisMasking):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
freq_mask_param (int): maximum possible length of the mask.
@@ -1260,7 +1260,7 @@ class TimeMasking(_AxisMasking):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
time_mask_param (int): maximum possible length of the mask.
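The two masking transforms above compose directly into a SpecAugment-style pipeline; a small sketch with illustrative mask parameters (`TimeStretch` additionally requires a complex spectrogram, so it is omitted here):

```python
import torch
import torchaudio.transforms as T

spec = T.Spectrogram(n_fft=400)(torch.randn(1, 16000))  # (channel, freq, time)

# Mask up to 15 consecutive frequency bins, then up to 35 consecutive time frames.
augmented = T.FrequencyMasking(freq_mask_param=15)(spec)
augmented = T.TimeMasking(time_mask_param=35)(augmented)
```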
@@ -1724,7 +1724,7 @@ class PitchShift(LazyModuleMixin, torch.nn.Module):
class RNNTLoss(torch.nn.Module):
"""Compute the RNN Transducer loss from *Sequence Transduction with Recurrent Neural Networks*
-[:footcite:`graves2012sequence`].
+:cite:`graves2012sequence`.
.. devices:: CPU CUDA
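Finally, a hedged sketch of the loss module above; shapes follow the RNN-T convention of (batch, max frame length, max target length + 1, classes), and every size below is made up:

```python
import torch
import torchaudio.transforms as T

rnnt_loss = T.RNNTLoss(blank=0)

logits = torch.randn(2, 50, 11, 20, requires_grad=True)     # joint-network output
targets = torch.randint(1, 20, (2, 10), dtype=torch.int32)  # labels; 0 reserved for blank
logit_lengths = torch.full((2,), 50, dtype=torch.int32)
target_lengths = torch.full((2,), 10, dtype=torch.int32)

loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
loss.backward()
```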