Commit 476ab9ab authored by moto, committed by Facebook GitHub Bot

Consolidate bibliography / reference (#2676)

Summary:
Preparation for the adoption of `autosummary`.

Replace `:footcite:` with `:cite:` and introduce a dedicated reference page, as `:footcite:` does not work well with `autosummary`.

Examples:

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/datasets.html#cmuarctic

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/references.html
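
For background, both roles come from the `sphinxcontrib-bibtex` extension: `:footcite:` renders footnote-style citations that need a `.. footbibliography::` directive on every page that cites something, which the stub pages generated by `autosummary` do not carry, while `:cite:` links each citation to a single `.. bibliography::` directive on one shared page. Below is a minimal sketch of that layout; the file name and path are illustrative rather than the exact ones added in this PR, and it assumes `sphinxcontrib.bibtex` is enabled in `conf.py` with `bibtex_bibfiles` pointing at the project's `.bib` file.

```rst
.. docs/source/references.rst (hypothetical path): the single, shared reference page.

References
==========

.. bibliography::

.. Any docstring or .rst page built into the docs can then cite an entry inline
.. with the :cite: role instead of :footcite:, for example:

Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`.
```

Every `:cite:` reference in the generated API pages then resolves to that one reference page, which is what the `references.html` artifact above shows.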

Pull Request resolved: https://github.com/pytorch/audio/pull/2676

Reviewed By: carolineechen

Differential Revision: D39509431

Pulled By: mthrok

fbshipit-source-id: e6003dd01ec3eff3d598054690f61de8ee31ac9a
parent 50c66721
@@ -218,7 +218,7 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline wit
The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The default parameters were used.
@@ -264,7 +264,7 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The text processor is set to the *"english_phonemes"*.
@@ -309,13 +309,13 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :
The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -360,13 +360,13 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -198,11 +198,11 @@ WAV2VEC2_BASE = Wav2Vec2Bundle(
)
WAV2VEC2_BASE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -243,12 +243,12 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
-[:footcite:`librilight`] ("train-10min" subset).
+:cite:`librilight` ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -290,11 +290,11 @@ WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -335,11 +335,11 @@ WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_BASE_960H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -379,11 +379,11 @@ WAV2VEC2_LARGE = Wav2Vec2Bundle(
)
WAV2VEC2_LARGE.__doc__ = """Build "large" wav2vec2 model.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -424,12 +424,12 @@ WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_LARGE_10M.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
-[:footcite:`librilight`] ("train-10min" subset).
+:cite:`librilight` ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -470,12 +470,12 @@ WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_LARGE_100H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from
the same dataset ("train-clean-100" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -516,11 +516,11 @@ WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
)
WAV2VEC2_ASR_LARGE_960H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -561,10 +561,10 @@ WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
WAV2VEC2_LARGE_LV60K.__doc__ = """Build "large-lv60k" wav2vec2 model.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -606,11 +606,11 @@ WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 10 minutes of transcribed audio from
the same dataset ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -652,11 +652,11 @@ WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 100 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`] ("train-clean-100" subset).
+*LibriSpeech* dataset :cite:`7178964` ("train-clean-100" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -698,12 +698,12 @@ WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
Pre-trained on 60,000 hours of unlabeled audio from *Libri-Light*
-[:footcite:`librilight`] dataset, and
+:cite:`librilight` dataset, and
fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -744,14 +744,14 @@ WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
WAV2VEC2_XLSR53.__doc__ = """wav2vec 2.0 model with "Base" configuration.
Trained on 56,000 hours of unlabeled audio from multiple datasets (
-*Multilingual LibriSpeech* [:footcite:`Pratap_2020`],
-*CommonVoice* [:footcite:`ardila2020common`] and
-*BABEL* [:footcite:`Gales2014SpeechRA`]).
+*Multilingual LibriSpeech* :cite:`Pratap_2020`,
+*CommonVoice* :cite:`ardila2020common` and
+*BABEL* :cite:`Gales2014SpeechRA`).
Not fine-tuned.
Originally published by the authors of
*Unsupervised Cross-lingual Representation Learning for Speech Recognition*
-[:footcite:`conneau2020unsupervised`] under MIT License and redistributed with the same license.
+:cite:`conneau2020unsupervised` under MIT License and redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -790,11 +790,11 @@ HUBERT_BASE = Wav2Vec2Bundle(
)
HUBERT_BASE.__doc__ = """HuBERT model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -835,10 +835,10 @@ HUBERT_LARGE = Wav2Vec2Bundle(
HUBERT_LARGE.__doc__ = """HuBERT model with "Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -879,10 +879,10 @@ HUBERT_XLARGE = Wav2Vec2Bundle(
HUBERT_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -924,12 +924,12 @@ HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
HUBERT_ASR_LARGE.__doc__ = """HuBERT model with "Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -971,12 +971,12 @@ HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
HUBERT_ASR_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -1019,11 +1019,11 @@ VOXPOPULI_ASR_BASE_10K_DE = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1066,11 +1066,11 @@ VOXPOPULI_ASR_BASE_10K_EN = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1113,11 +1113,11 @@ VOXPOPULI_ASR_BASE_10K_ES = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1158,11 +1158,11 @@ VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1205,11 +1205,11 @@ VOXPOPULI_ASR_BASE_10K_IT = Wav2Vec2ASRBundle(
)
VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
("10k" subset, consisting of 23 languages).
Fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -444,7 +444,7 @@ class _ConvEmformerLayer(torch.nn.Module):
class ConvEmformer(_EmformerImpl):
r"""Implements the convolution-augmented streaming transformer architecture introduced in
*Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution*
-[:footcite:`9747706`].
+:cite:`9747706`.
Args:
input_dim (int): input dimension.
@@ -104,7 +104,7 @@ class MVDR(torch.nn.Module):
Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
We provide three solutions of MVDR beamforming. One is based on *reference channel selection*
-[:footcite:`souden2009optimal`] (``solution=ref_channel``).
+:cite:`souden2009optimal` (``solution=ref_channel``).
.. math::
\\textbf{w}_{\\text{MVDR}}(f) =\
@@ -126,7 +126,7 @@ class MVDR(torch.nn.Module):
:math:`.^{\\mathsf{H}}` denotes the Hermitian Conjugate operation.
We apply either *eigenvalue decomposition*
-[:footcite:`higuchi2016robust`] or the *power method* [:footcite:`mises1929praktische`] to get the
+:cite:`higuchi2016robust` or the *power method* :cite:`mises1929praktische` to get the
steering vector from the PSD matrix of speech.
After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
@@ -137,7 +137,7 @@ class MVDR(torch.nn.Module):
where :math:`\\bf{Y}` and :math:`\\hat{\\bf{S}}` are the STFT of the multi-channel noisy speech and\
the single-channel enhanced speech, respectively.
-For online streaming audio, we provide a *recursive method* [:footcite:`higuchi2017online`] to update the
+For online streaming audio, we provide a *recursive method* :cite:`higuchi2017online` to update the
PSD matrices of speech and noise, respectively.
Args:
@@ -341,7 +341,7 @@ class MVDR(torch.nn.Module):
class RTFMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise.
.. devices:: CPU CUDA
@@ -405,8 +405,8 @@ class RTFMVDR(torch.nn.Module):
class SoudenMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
-based on the method proposed by *Souden et, al.* [:footcite:`souden2009optimal`].
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
+based on the method proposed by *Souden et, al.* :cite:`souden2009optimal`.
.. devices:: CPU CUDA
@@ -214,8 +214,8 @@ class GriffinLim(torch.nn.Module):
.. properties:: Autograd TorchScript
Implementation ported from
-*librosa* [:footcite:`brian_mcfee-proc-scipy-2015`], *A fast Griffin-Lim algorithm* [:footcite:`6701851`]
-and *Signal estimation from modified short-time Fourier transform* [:footcite:`1172092`].
+*librosa* :cite:`brian_mcfee-proc-scipy-2015`, *A fast Griffin-Lim algorithm* :cite:`6701851`
+and *Signal estimation from modified short-time Fourier transform* :cite:`1172092`.
Args:
n_fft (int, optional): Size of FFT, creates ``n_fft // 2 + 1`` bins. (Default: ``400``)
@@ -1040,7 +1040,7 @@ class TimeStretch(torch.nn.Module):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
hop_length (int or None, optional): Length of hop between STFT windows. (Default: ``win_length // 2``)
@@ -1226,7 +1226,7 @@ class FrequencyMasking(_AxisMasking):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
freq_mask_param (int): maximum possible length of the mask.
@@ -1260,7 +1260,7 @@ class TimeMasking(_AxisMasking):
.. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
Args:
time_mask_param (int): maximum possible length of the mask.
@@ -1724,7 +1724,7 @@ class PitchShift(LazyModuleMixin, torch.nn.Module):
class RNNTLoss(torch.nn.Module):
"""Compute the RNN Transducer loss from *Sequence Transduction with Recurrent Neural Networks*
-[:footcite:`graves2012sequence`].
+:cite:`graves2012sequence`.
.. devices:: CPU CUDA