Commit 476ab9ab authored by moto, committed by Facebook GitHub Bot

Consolidate bibliography / reference (#2676)

Summary:
Preparation for the adoption of `autosummary`.

Replace `:footcite:` with `:cite:` and introduce a dedicated reference page, as `:footcite:` does not work well with `autosummary`.

Example:

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/datasets.html#cmuarctic

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/references.html
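
For context, both `:footcite:` and `:cite:` are provided by the sphinxcontrib-bibtex extension; the difference is that `:cite:` resolves every reference against one shared bibliography page (the references.html artifact linked above) instead of emitting per-page footnotes. A minimal sketch of such a page, assuming a hypothetical `docs/source/references.rst` backed by a `refs.bib` registered through `bibtex_bibfiles` in `conf.py` (the actual file names and options in this PR may differ):

    References
    ==========

    .. bibliography::

Docstrings then cite entries with the role directly, e.g. :cite:`ljspeech17`, which renders as a numbered citation linking into that page.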

Pull Request resolved: https://github.com/pytorch/audio/pull/2676

Reviewed By: carolineechen

Differential Revision: D39509431

Pulled By: mthrok

fbshipit-source-id: e6003dd01ec3eff3d598054690f61de8ee31ac9a
parent 50c66721
@@ -20,7 +20,7 @@ _RELEASE_CONFIGS = {
 class LJSPEECH(Dataset):
-"""Create a Dataset for *LJSpeech-1.1* [:footcite:`ljspeech17`].
+"""Create a Dataset for *LJSpeech-1.1* :cite:`ljspeech17`.
 Args:
 root (str or Path): Path to the directory where the dataset is found or downloaded.
...
@@ -31,7 +31,7 @@ _VALIDATION_SET = [
 class MUSDB_HQ(Dataset):
-"""Create *MUSDB_HQ* [:footcite:`MUSDB18HQ`] Dataset
+"""Create *MUSDB_HQ* :cite:`MUSDB18HQ` Dataset
 Args:
 root (str or Path): Root directory where the dataset's top level directory is found
...
@@ -23,7 +23,7 @@ _LANGUAGES = [
 class QUESST14(Dataset):
-"""Create *QUESST14* [:footcite:`Mir2015QUESST2014EQ`] Dataset
+"""Create *QUESST14* :cite:`Mir2015QUESST2014EQ` Dataset
 Args:
 root (str or Path): Root directory where the dataset's top level directory is found
...
@@ -49,7 +49,7 @@ def load_speechcommands_item(filepath: str, path: str) -> Tuple[Tensor, int, str
 class SPEECHCOMMANDS(Dataset):
-"""Create a Dataset for *Speech Commands* [:footcite:`speechcommandsv2`].
+"""Create a Dataset for *Speech Commands* :cite:`speechcommandsv2`.
 Args:
 root (str or Path): Path to the directory where the dataset is found or downloaded.
...
@@ -42,7 +42,7 @@ _RELEASE_CONFIGS = {
 class TEDLIUM(Dataset):
 """
-Create a Dataset for *Tedlium* [:footcite:`rousseau2012tedlium`]. It supports releases 1,2 and 3.
+Create a Dataset for *Tedlium* :cite:`rousseau2012tedlium`. It supports releases 1,2 and 3.
 Args:
 root (str or Path): Path to the directory where the dataset is found or downloaded.
...
@@ -17,7 +17,7 @@ SampleType = Tuple[Tensor, int, str, str, str]
 class VCTK_092(Dataset):
-"""Create *VCTK 0.92* [:footcite:`yamagishi2019vctk`] Dataset
+"""Create *VCTK 0.92* :cite:`yamagishi2019vctk` Dataset
 Args:
 root (str): Root directory where the dataset's top level directory is found.
...
@@ -90,7 +90,7 @@ def _get_file_id(file_path: str, _ext_audio: str):
 class VoxCeleb1(Dataset):
-"""Create *VoxCeleb1* [:footcite:`nagrani2017voxceleb`] Dataset.
+"""Create *VoxCeleb1* :cite:`nagrani2017voxceleb` Dataset.
 Args:
 root (str or Path): Path to the directory where the dataset is found or downloaded.
@@ -119,7 +119,7 @@ class VoxCeleb1(Dataset):
 class VoxCeleb1Identification(VoxCeleb1):
-"""Create *VoxCeleb1* [:footcite:`nagrani2017voxceleb`] Dataset for speaker identification task.
+"""Create *VoxCeleb1* :cite:`nagrani2017voxceleb` Dataset for speaker identification task.
 Each data sample contains the waveform, sample rate, speaker id, and the file id.
 Args:
@@ -167,7 +167,7 @@ class VoxCeleb1Identification(VoxCeleb1):
 class VoxCeleb1Verification(VoxCeleb1):
-"""Create *VoxCeleb1* [:footcite:`nagrani2017voxceleb`] Dataset for speaker verification task.
+"""Create *VoxCeleb1* :cite:`nagrani2017voxceleb` Dataset for speaker verification task.
 Each data sample contains a pair of waveforms, sample rate, the label indicating if they are
 from the same speaker, and the file ids.
...
@@ -19,7 +19,7 @@ _RELEASE_CONFIGS = {
 class YESNO(Dataset):
-"""Create a Dataset for *YesNo* [:footcite:`YesNo`].
+"""Create a Dataset for *YesNo* :cite:`YesNo`.
 Args:
 root (str or Path): Path to the directory where the dataset is found or downloaded.
...
@@ -269,8 +269,8 @@ def griffinlim(
 .. properties:: Autograd TorchScript
 Implementation ported from
-*librosa* [:footcite:`brian_mcfee-proc-scipy-2015`], *A fast Griffin-Lim algorithm* [:footcite:`6701851`]
-and *Signal estimation from modified short-time Fourier transform* [:footcite:`1172092`].
+*librosa* :cite:`brian_mcfee-proc-scipy-2015`, *A fast Griffin-Lim algorithm* :cite:`6701851`
+and *Signal estimation from modified short-time Fourier transform* :cite:`1172092`.
 Args:
 specgram (Tensor): A magnitude-only STFT spectrogram of dimension `(..., freq, frames)`
@@ -1332,7 +1332,7 @@ def compute_kaldi_pitch(
 snip_edges: bool = True,
 ) -> torch.Tensor:
 """Extract pitch based on method described in *A pitch extraction algorithm tuned
-for automatic speech recognition* [:footcite:`6854049`].
+for automatic speech recognition* :cite:`6854049`.
 .. devices:: CPU
@@ -1552,7 +1552,7 @@ def resample(
 resampling_method: str = "sinc_interpolation",
 beta: Optional[float] = None,
 ) -> Tensor:
-r"""Resamples the waveform at the new frequency using bandlimited interpolation. [:footcite:`RESAMPLE`].
+r"""Resamples the waveform at the new frequency using bandlimited interpolation. :cite:`RESAMPLE`.
 .. devices:: CPU CUDA
@@ -1840,7 +1840,7 @@ def rnnt_loss(
 reduction: str = "mean",
 ):
 """Compute the RNN Transducer loss from *Sequence Transduction with Recurrent Neural Networks*
-[:footcite:`graves2012sequence`].
+:cite:`graves2012sequence`.
 .. devices:: CPU CUDA
@@ -2009,8 +2009,8 @@ def mvdr_weights_souden(
 diag_eps: float = 1e-7,
 eps: float = 1e-8,
 ) -> Tensor:
-r"""Compute the Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) beamforming weights
-by the method proposed by *Souden et, al.* [:footcite:`souden2009optimal`].
+r"""Compute the Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) beamforming weights
+by the method proposed by *Souden et, al.* :cite:`souden2009optimal`.
 .. devices:: CPU CUDA
@@ -2072,7 +2072,7 @@ def mvdr_weights_rtf(
 diag_eps: float = 1e-7,
 eps: float = 1e-8,
 ) -> Tensor:
-r"""Compute the Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) beamforming weights
+r"""Compute the Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) beamforming weights
 based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise.
 .. devices:: CPU CUDA
...
@@ -300,7 +300,7 @@ class _HDecLayer(torch.nn.Module):
 class HDemucs(torch.nn.Module):
 r"""
-Hybrid Demucs model from *Hybrid Spectrogram and Waveform Source Separation* [:footcite:`defossez2021hybrid`].
+Hybrid Demucs model from *Hybrid Spectrogram and Waveform Source Separation* :cite:`defossez2021hybrid`.
 Args:
 sources (List[str]): list of source names. List can contain the following source
...
@@ -215,7 +215,7 @@ class ConformerLayer(torch.nn.Module):
 class Conformer(torch.nn.Module):
 r"""Implements the Conformer architecture introduced in
 *Conformer: Convolution-augmented Transformer for Speech Recognition*
-[:footcite:`gulati2020conformer`].
+:cite:`gulati2020conformer`.
 Args:
 input_dim (int): input dimension.
...
@@ -162,7 +162,7 @@ class MaskGenerator(torch.nn.Module):
 class ConvTasNet(torch.nn.Module):
 """Conv-TasNet: a fully-convolutional time-domain audio separation network
 *Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation*
-[:footcite:`Luo_2019`].
+:cite:`Luo_2019`.
 Args:
 num_sources (int, optional): The number of sources to split.
@@ -304,7 +304,7 @@ class ConvTasNet(torch.nn.Module):
 def conv_tasnet_base(num_sources: int = 2) -> ConvTasNet:
 r"""Builds the non-causal version of ConvTasNet in
 *Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation*
-[:footcite:`Luo_2019`].
+:cite:`Luo_2019`.
 The parameter settings follow the ones with the highest Si-SNR metirc score in the paper,
 except the mask activation function is changed from "sigmoid" to "relu" for performance improvement.
...
@@ -197,7 +197,7 @@ class CTCDecoder:
 """
 .. devices:: CPU
-CTC beam search decoder from *Flashlight* [:footcite:`kahn2022flashlight`].
+CTC beam search decoder from *Flashlight* :cite:`kahn2022flashlight`.
 Note:
 To build the decoder, please use the factory function :py:func:`ctc_decoder`.
@@ -349,7 +349,7 @@ def ctc_decoder(
 unk_word: str = "<unk>",
 ) -> CTCDecoder:
 """
-Builds CTC beam search decoder from *Flashlight* [:footcite:`kahn2022flashlight`].
+Builds CTC beam search decoder from *Flashlight* :cite:`kahn2022flashlight`.
 Args:
 lexicon (str or None): lexicon file containing the possible words and corresponding spellings.
...
@@ -28,7 +28,7 @@ class FullyConnected(torch.nn.Module):
 class DeepSpeech(torch.nn.Module):
 """
 DeepSpeech model architecture from *Deep Speech: Scaling up end-to-end speech recognition*
-[:footcite:`hannun2014deep`].
+:cite:`hannun2014deep`.
 Args:
 n_feature: Number of input features
...
@@ -806,7 +806,7 @@ class _EmformerImpl(torch.nn.Module):
 class Emformer(_EmformerImpl):
 r"""Implements the Emformer architecture introduced in
 *Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition*
-[:footcite:`shi2021emformer`].
+:cite:`shi2021emformer`.
 Args:
 input_dim (int): input dimension.
...
@@ -872,7 +872,7 @@ class Tacotron2(nn.Module):
 The original implementation was introduced in
 *Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions*
-[:footcite:`shen2018natural`].
+:cite:`shen2018natural`.
 Args:
 mask_padding (bool, optional): Use mask padding (Default: ``False``).
...
@@ -7,7 +7,7 @@ __all__ = [
 class Wav2Letter(nn.Module):
 r"""Wav2Letter model architecture from *Wav2Letter: an End-to-End ConvNet-based Speech
-Recognition System* [:footcite:`collobert2016wav2letter`].
+Recognition System* :cite:`collobert2016wav2letter`.
 :math:`\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}`
...
@@ -10,7 +10,7 @@ from . import components
 class Wav2Vec2Model(Module):
 """torchaudio.models.Wav2Vec2Model(feature_extractor: torch.nn.Module, encoder: torch.nn.Module, aux: Optional[torch.nn.Module] = None)
-Encoder model used in *wav2vec 2.0* [:footcite:`baevski2020wav2vec`].
+Encoder model used in *wav2vec 2.0* :cite:`baevski2020wav2vec`.
 Note:
 To build the model, please use one of the factory functions.
@@ -244,7 +244,7 @@ def wav2vec2_model(
 `ConvFeatureExtractionModel <https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/models/wav2vec/wav2vec2.py#L736>`__
 in the original ``fairseq`` implementation.
 This is referred as "(convolutional) feature encoder" in the *wav2vec 2.0*
-[:footcite:`baevski2020wav2vec`] paper.
+:cite:`baevski2020wav2vec` paper.
 The "encoder" below corresponds to `TransformerEncoder <https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/models/wav2vec/wav2vec2.py#L817>`__,
 and this is referred as "Transformer" in the paper.
@@ -393,7 +393,7 @@ def wav2vec2_base(
 encoder_layer_drop: float = 0.1,
 aux_num_out: Optional[int] = None,
 ) -> Wav2Vec2Model:
-"""Build Wav2Vec2Model with "base" architecture from *wav2vec 2.0* [:footcite:`baevski2020wav2vec`]
+"""Build Wav2Vec2Model with "base" architecture from *wav2vec 2.0* :cite:`baevski2020wav2vec`
 Args:
 encoder_projection_dropout (float):
@@ -441,7 +441,7 @@ def wav2vec2_large(
 encoder_layer_drop: float = 0.1,
 aux_num_out: Optional[int] = None,
 ) -> Wav2Vec2Model:
-"""Build Wav2Vec2Model with "large" architecture from *wav2vec 2.0* [:footcite:`baevski2020wav2vec`]
+"""Build Wav2Vec2Model with "large" architecture from *wav2vec 2.0* :cite:`baevski2020wav2vec`
 Args:
 encoder_projection_dropout (float):
@@ -489,7 +489,7 @@ def wav2vec2_large_lv60k(
 encoder_layer_drop: float = 0.1,
 aux_num_out: Optional[int] = None,
 ) -> Wav2Vec2Model:
-"""Build Wav2Vec2Model with "large lv-60k" architecture from *wav2vec 2.0* [:footcite:`baevski2020wav2vec`]
+"""Build Wav2Vec2Model with "large lv-60k" architecture from *wav2vec 2.0* :cite:`baevski2020wav2vec`
 Args:
 encoder_projection_dropout (float):
@@ -537,7 +537,7 @@ def hubert_base(
 encoder_layer_drop: float = 0.05,
 aux_num_out: Optional[int] = None,
 ) -> Wav2Vec2Model:
-"""Build HuBERT model with "base" architecture from *HuBERT* [:footcite:`hsu2021hubert`]
+"""Build HuBERT model with "base" architecture from *HuBERT* :cite:`hsu2021hubert`
 Args:
 encoder_projection_dropout (float):
@@ -585,7 +585,7 @@ def hubert_large(
 encoder_layer_drop: float = 0.0,
 aux_num_out: Optional[int] = None,
 ) -> Wav2Vec2Model:
-"""Build HuBERT model with "large" architecture from *HuBERT* [:footcite:`hsu2021hubert`]
+"""Build HuBERT model with "large" architecture from *HuBERT* :cite:`hsu2021hubert`
 Args:
 encoder_projection_dropout (float):
@@ -633,7 +633,7 @@ def hubert_xlarge(
 encoder_layer_drop: float = 0.0,
 aux_num_out: Optional[int] = None,
 ) -> Wav2Vec2Model:
-"""Build HuBERT model with "extra large" architecture from *HuBERT* [:footcite:`hsu2021hubert`]
+"""Build HuBERT model with "extra large" architecture from *HuBERT* :cite:`hsu2021hubert`
 Args:
 encoder_projection_dropout (float):
@@ -714,7 +714,7 @@ def hubert_pretrain_model(
 `ConvFeatureExtractionModel <https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/models/wav2vec/wav2vec2.py#L736>`__
 in the original ``fairseq`` implementation.
 This is referred as "(convolutional) feature encoder" in the *wav2vec 2.0*
-[:footcite:`baevski2020wav2vec`] paper.
+:cite:`baevski2020wav2vec` paper.
 The "encoder" below corresponds to `TransformerEncoder <https://github.com/pytorch/fairseq/blob/dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3/fairseq/models/wav2vec/wav2vec2.py#L817>`__,
 and this is referred as "Transformer" in the paper.
@@ -975,7 +975,7 @@ def hubert_pretrain_base(
 feature_grad_mult: Optional[float] = 0.1,
 num_classes: int = 100,
 ) -> HuBERTPretrainModel:
-"""Build HuBERTPretrainModel model with "base" architecture from *HuBERT* [:footcite:`hsu2021hubert`]
+"""Build HuBERTPretrainModel model with "base" architecture from *HuBERT* :cite:`hsu2021hubert`
 Args:
 encoder_projection_dropout (float):
@@ -1050,7 +1050,7 @@ def hubert_pretrain_large(
 mask_channel_length: int = 10,
 feature_grad_mult: Optional[float] = None,
 ) -> HuBERTPretrainModel:
-"""Build HuBERTPretrainModel model for pre-training with "large" architecture from *HuBERT* [:footcite:`hsu2021hubert`]
+"""Build HuBERTPretrainModel model for pre-training with "large" architecture from *HuBERT* :cite:`hsu2021hubert`
 Args:
 encoder_projection_dropout (float):
@@ -1123,7 +1123,7 @@ def hubert_pretrain_xlarge(
 mask_channel_length: int = 10,
 feature_grad_mult: Optional[float] = None,
 ) -> HuBERTPretrainModel:
-"""Build HuBERTPretrainModel model for pre-training with "extra large" architecture from *HuBERT* [:footcite:`hsu2021hubert`]
+"""Build HuBERTPretrainModel model for pre-training with "extra large" architecture from *HuBERT* :cite:`hsu2021hubert`
 Args:
 encoder_projection_dropout (float):
...
@@ -15,7 +15,7 @@ __all__ = [
 class ResBlock(nn.Module):
-r"""ResNet block based on *Efficient Neural Audio Synthesis* [:footcite:`kalchbrenner2018efficient`].
+r"""ResNet block based on *Efficient Neural Audio Synthesis* :cite:`kalchbrenner2018efficient`.
 Args:
 n_freq: the number of bins in a spectrogram. (Default: ``128``)
@@ -200,7 +200,7 @@ class WaveRNN(nn.Module):
 r"""WaveRNN model based on the implementation from `fatchord <https://github.com/fatchord/WaveRNN>`_.
 The original implementation was introduced in *Efficient Neural Audio Synthesis*
-[:footcite:`kalchbrenner2018efficient`]. The input channels of waveform and spectrogram have to be 1.
+:cite:`kalchbrenner2018efficient`. The input channels of waveform and spectrogram have to be 1.
 The product of `upsample_scales` must equal `hop_length`.
 Args:
...
@@ -66,8 +66,8 @@ CONVTASNET_BASE_LIBRI2MIX = SourceSeparationBundle(
 _model_factory_func=partial(conv_tasnet_base, num_sources=2),
 _sample_rate=8000,
 )
-CONVTASNET_BASE_LIBRI2MIX.__doc__ = """Pre-trained Source Separation pipeline with *ConvTasNet* [:footcite:`Luo_2019`] trained on
-*Libri2Mix dataset* [:footcite:`cosentino2020librimix`].
+CONVTASNET_BASE_LIBRI2MIX.__doc__ = """Pre-trained Source Separation pipeline with *ConvTasNet* :cite:`Luo_2019` trained on
+*Libri2Mix dataset* :cite:`cosentino2020librimix`.
 The source separation model is constructed by :py:func:`torchaudio.models.conv_tasnet_base`
 and is trained using the training script ``lightning_train.py``
@@ -83,8 +83,8 @@ HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
 _model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
 _sample_rate=44100,
 )
-HDEMUCS_HIGH_MUSDB_PLUS.__doc__ = """Pre-trained *Hybrid Demucs* [:footcite:`defossez2021hybrid`] pipeline for music
-source separation trained on MUSDB-HQ [:footcite:`MUSDB18HQ`] and additional internal training data.
+HDEMUCS_HIGH_MUSDB_PLUS.__doc__ = """Pre-trained *Hybrid Demucs* :cite:`defossez2021hybrid` pipeline for music
+source separation trained on MUSDB-HQ :cite:`MUSDB18HQ` and additional internal training data.
 The model is constructed by :py:func:`torchaudio.prototype.models.hdemucs_high`.
 Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.
@@ -98,8 +98,8 @@ HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
 _model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
 _sample_rate=44100,
 )
-HDEMUCS_HIGH_MUSDB.__doc__ = """Pre-trained *Hybrid Demucs* [:footcite:`defossez2021hybrid`] pipeline for music
-source separation trained on MUSDB-HQ [:footcite:`MUSDB18HQ`].
+HDEMUCS_HIGH_MUSDB.__doc__ = """Pre-trained *Hybrid Demucs* :cite:`defossez2021hybrid` pipeline for music
+source separation trained on MUSDB-HQ :cite:`MUSDB18HQ`.
 The model is constructed by :py:func:`torchaudio.prototype.models.hdemucs_high`.
 Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.
...