Commit 476ab9ab authored by moto, committed by Facebook GitHub Bot

Consolidate bibliography / reference (#2676)

Summary:
Preparation for the adoption of `autosummary`.

Replace `:footcite:` with `:cite:` and introduce a dedicated reference page, as `:footcite:` does not work well with `autosummary`.

Examples:

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/datasets.html#cmuarctic

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/references.html
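For reference, the change follows the standard sphinxcontrib-bibtex pattern: docstrings cite entries with the `:cite:` role, and a single dedicated page renders the full bibliography with the `.. bibliography::` directive. The sketch below illustrates that setup; the file names (`references.rst`, `refs.bib`) and the conf.py settings mentioned in the comments are illustrative assumptions, not necessarily the exact files touched by this commit.

```rst
.. references.rst (hypothetical file name): the dedicated bibliography page.
   Assumes sphinxcontrib-bibtex is enabled in conf.py, e.g.
   extensions = [..., "sphinxcontrib.bibtex"] and bibtex_bibfiles = ["refs.bib"].

References
==========

.. bibliography::

.. Elsewhere in the docs (including autosummary-rendered docstrings), entries
   from the .bib file are cited inline with the ``:cite:`` role:

Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
```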

Pull Request resolved: https://github.com/pytorch/audio/pull/2676

Reviewed By: carolineechen

Differential Revision: D39509431

Pulled By: mthrok

fbshipit-source-id: e6003dd01ec3eff3d598054690f61de8ee31ac9a
parent 50c66721
@@ -218,7 +218,7 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline wit
 The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The default parameters were used.
@@ -264,7 +264,7 @@ graphemes to phonemes.
 The model (*en_us_cmudict_forward*) was trained on
 `CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The text processor is set to the *"english_phonemes"*.
@@ -309,13 +309,13 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :
 The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
 ``mel_fmin=40``, and ``mel_fmax=11025``.
 The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
 Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -360,13 +360,13 @@ graphemes to phonemes.
 The model (*en_us_cmudict_forward*) was trained on
 `CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
 ``mel_fmin=40``, and ``mel_fmax=11025``.
 The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
 Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
...
@@ -444,7 +444,7 @@ class _ConvEmformerLayer(torch.nn.Module):
 class ConvEmformer(_EmformerImpl):
 r"""Implements the convolution-augmented streaming transformer architecture introduced in
 *Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution*
-[:footcite:`9747706`].
+:cite:`9747706`.
 Args:
 input_dim (int): input dimension.
...
@@ -104,7 +104,7 @@ class MVDR(torch.nn.Module):
 Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
 We provide three solutions of MVDR beamforming. One is based on *reference channel selection*
-[:footcite:`souden2009optimal`] (``solution=ref_channel``).
+:cite:`souden2009optimal` (``solution=ref_channel``).
 .. math::
 \\textbf{w}_{\\text{MVDR}}(f) =\
@@ -126,7 +126,7 @@ class MVDR(torch.nn.Module):
 :math:`.^{\\mathsf{H}}` denotes the Hermitian Conjugate operation.
 We apply either *eigenvalue decomposition*
-[:footcite:`higuchi2016robust`] or the *power method* [:footcite:`mises1929praktische`] to get the
+:cite:`higuchi2016robust` or the *power method* :cite:`mises1929praktische` to get the
 steering vector from the PSD matrix of speech.
 After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
@@ -137,7 +137,7 @@ class MVDR(torch.nn.Module):
 where :math:`\\bf{Y}` and :math:`\\hat{\\bf{S}}` are the STFT of the multi-channel noisy speech and\
 the single-channel enhanced speech, respectively.
-For online streaming audio, we provide a *recursive method* [:footcite:`higuchi2017online`] to update the
+For online streaming audio, we provide a *recursive method* :cite:`higuchi2017online` to update the
 PSD matrices of speech and noise, respectively.
 Args:
@@ -341,7 +341,7 @@ class MVDR(torch.nn.Module):
 class RTFMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
 based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise.
 .. devices:: CPU CUDA
@@ -405,8 +405,8 @@ class RTFMVDR(torch.nn.Module):
 class SoudenMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
-based on the method proposed by *Souden et, al.* [:footcite:`souden2009optimal`].
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
+based on the method proposed by *Souden et, al.* :cite:`souden2009optimal`.
 .. devices:: CPU CUDA
...
@@ -214,8 +214,8 @@ class GriffinLim(torch.nn.Module):
 .. properties:: Autograd TorchScript
 Implementation ported from
-*librosa* [:footcite:`brian_mcfee-proc-scipy-2015`], *A fast Griffin-Lim algorithm* [:footcite:`6701851`]
-and *Signal estimation from modified short-time Fourier transform* [:footcite:`1172092`].
+*librosa* :cite:`brian_mcfee-proc-scipy-2015`, *A fast Griffin-Lim algorithm* :cite:`6701851`
+and *Signal estimation from modified short-time Fourier transform* :cite:`1172092`.
 Args:
 n_fft (int, optional): Size of FFT, creates ``n_fft // 2 + 1`` bins. (Default: ``400``)
@@ -1040,7 +1040,7 @@ class TimeStretch(torch.nn.Module):
 .. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
 Args:
 hop_length (int or None, optional): Length of hop between STFT windows. (Default: ``win_length // 2``)
@@ -1226,7 +1226,7 @@ class FrequencyMasking(_AxisMasking):
 .. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
 Args:
 freq_mask_param (int): maximum possible length of the mask.
@@ -1260,7 +1260,7 @@ class TimeMasking(_AxisMasking):
 .. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
 Args:
 time_mask_param (int): maximum possible length of the mask.
@@ -1724,7 +1724,7 @@ class PitchShift(LazyModuleMixin, torch.nn.Module):
 class RNNTLoss(torch.nn.Module):
 """Compute the RNN Transducer loss from *Sequence Transduction with Recurrent Neural Networks*
-[:footcite:`graves2012sequence`].
+:cite:`graves2012sequence`.
 .. devices:: CPU CUDA
...