Commit 476ab9ab authored by moto and committed by Facebook GitHub Bot

Consolidate bibliography / reference (#2676)

Summary:
Preparation for the adoption of `autosummary`.

Replace `:footcite:` with `:cite:` and introduce a dedicated reference page, as `:footcite:` does not work well with `autosummary`.

Example:

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/datasets.html#cmuarctic

https://output.circle-artifacts.com/output/job/4da47ba6-d9c7-418e-b5b0-e9f8a146a6c3/artifacts/0/docs/references.html
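
For context on the mechanics: `sphinxcontrib.bibtex` provides the `:cite:` role, and a single `bibliography` directive on the new references page collects every cited entry, which keeps citations resolvable from `autosummary`-generated stub pages. Below is a minimal sketch of such a setup; the file names (`refs.bib`, `references.rst`) are illustrative and may not match the repository exactly.

```python
# conf.py -- minimal sketch of a sphinxcontrib.bibtex setup that coexists
# with autosummary; the exact settings in torchaudio's docs may differ.
extensions = [
    "sphinx.ext.autosummary",
    "sphinxcontrib.bibtex",
]
bibtex_bibfiles = ["refs.bib"]  # assumed name of the BibTeX database

# A dedicated references page (e.g. references.rst) would then contain:
#
#   References
#   ==========
#
#   .. bibliography::
#
# and docstrings point at entries with the standard role, e.g.
# :cite:`ljspeech17`, which links to the entry on that page.
```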

Pull Request resolved: https://github.com/pytorch/audio/pull/2676

Reviewed By: carolineechen

Differential Revision: D39509431

Pulled By: mthrok

fbshipit-source-id: e6003dd01ec3eff3d598054690f61de8ee31ac9a
parent 50c66721
@@ -218,7 +218,7 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline wit
 The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The default parameters were used.
@@ -264,7 +264,7 @@ graphemes to phonemes.
 The model (*en_us_cmudict_forward*) was trained on
 `CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The text processor is set to the *"english_phonemes"*.
@@ -309,13 +309,13 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :
 The text processor encodes the input texts character-by-character.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
 ``mel_fmin=40``, and ``mel_fmax=11025``.
 The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
 Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
@@ -360,13 +360,13 @@ graphemes to phonemes.
 The model (*en_us_cmudict_forward*) was trained on
 `CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
-Tacotron2 was trained on *LJSpeech* [:footcite:`ljspeech17`] for 1,500 epochs.
+Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
 The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
 ``mel_fmin=40``, and ``mel_fmax=11025``.
 The vocder is based on :py:class:`torchaudio.models.WaveRNN`.
-It was trained on 8 bits depth waveform of *LJSpeech* [:footcite:`ljspeech17`] for 10,000 epochs.
+It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
 You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
 Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
...
@@ -198,11 +198,11 @@ WAV2VEC2_BASE = Wav2Vec2Bundle(
 )
 WAV2VEC2_BASE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
 Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -243,12 +243,12 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
 )
 WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
 fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
-[:footcite:`librilight`] ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+:cite:`librilight` ("train-10min" subset).
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -290,11 +290,11 @@ WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
 WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
 fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -335,11 +335,11 @@ WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
 )
 WAV2VEC2_ASR_BASE_960H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
 fine-tuned for ASR on the same audio with the corresponding transcripts.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -379,11 +379,11 @@ WAV2VEC2_LARGE = Wav2Vec2Bundle(
 )
 WAV2VEC2_LARGE.__doc__ = """Build "large" wav2vec2 model.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
 Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -424,12 +424,12 @@ WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
 )
 WAV2VEC2_ASR_LARGE_10M.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
 fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
-[:footcite:`librilight`] ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+:cite:`librilight` ("train-10min" subset).
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -470,12 +470,12 @@ WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
 )
 WAV2VEC2_ASR_LARGE_100H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
 fine-tuned for ASR on 100 hours of transcribed audio from
 the same dataset ("train-clean-100" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -516,11 +516,11 @@ WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
 )
 WAV2VEC2_ASR_LARGE_960H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
 fine-tuned for ASR on the same audio with the corresponding transcripts.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -561,10 +561,10 @@ WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
 WAV2VEC2_LARGE_LV60K.__doc__ = """Build "large-lv60k" wav2vec2 model.
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
 Not fine-tuned.
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -606,11 +606,11 @@ WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
 WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
 fine-tuned for ASR on 10 minutes of transcribed audio from
 the same dataset ("train-10min" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -652,11 +652,11 @@ WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
 WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
 fine-tuned for ASR on 100 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`] ("train-clean-100" subset).
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+*LibriSpeech* dataset :cite:`7178964` ("train-clean-100" subset).
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -698,12 +698,12 @@ WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
 WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
 Pre-trained on 60,000 hours of unlabeled audio from *Libri-Light*
-[:footcite:`librilight`] dataset, and
+:cite:`librilight` dataset, and
 fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *wav2vec 2.0* [:footcite:`baevski2020wav2vec`] under MIT License and
+Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -744,14 +744,14 @@ WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
 WAV2VEC2_XLSR53.__doc__ = """wav2vec 2.0 model with "Base" configuration.
 Trained on 56,000 hours of unlabeled audio from multiple datasets (
-*Multilingual LibriSpeech* [:footcite:`Pratap_2020`],
-*CommonVoice* [:footcite:`ardila2020common`] and
-*BABEL* [:footcite:`Gales2014SpeechRA`]).
+*Multilingual LibriSpeech* :cite:`Pratap_2020`,
+*CommonVoice* :cite:`ardila2020common` and
+*BABEL* :cite:`Gales2014SpeechRA`).
 Not fine-tuned.
 Originally published by the authors of
 *Unsupervised Cross-lingual Representation Learning for Speech Recognition*
-[:footcite:`conneau2020unsupervised`] under MIT License and redistributed with the same license.
+:cite:`conneau2020unsupervised` under MIT License and redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
@@ -790,11 +790,11 @@ HUBERT_BASE = Wav2Vec2Bundle(
 )
 HUBERT_BASE.__doc__ = """HuBERT model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset [:footcite:`7178964`]
+Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
 Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -835,10 +835,10 @@ HUBERT_LARGE = Wav2Vec2Bundle(
 HUBERT_LARGE.__doc__ = """HuBERT model with "Large" configuration.
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
 Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -879,10 +879,10 @@ HUBERT_XLARGE = Wav2Vec2Bundle(
 HUBERT_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`].
+*Libri-Light* dataset :cite:`librilight`.
 Not fine-tuned.
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -924,12 +924,12 @@ HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
 HUBERT_ASR_LARGE.__doc__ = """HuBERT model with "Large" configuration.
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
 fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -971,12 +971,12 @@ HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
 HUBERT_ASR_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
 Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset [:footcite:`librilight`], and
+*Libri-Light* dataset :cite:`librilight`, and
 fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset [:footcite:`7178964`]
+*LibriSpeech* dataset :cite:`7178964`
 (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Originally published by the authors of *HuBERT* [:footcite:`hsu2021hubert`] under MIT License and
+Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
 redistributed with the same license.
 [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
 `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
@@ -1019,11 +1019,11 @@ VOXPOPULI_ASR_BASE_10K_DE = Wav2Vec2ASRBundle(
 )
 VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
 ("10k" subset, consisting of 23 languages).
 Fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
 redistributed with the same license.
 [`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
 `Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1066,11 +1066,11 @@ VOXPOPULI_ASR_BASE_10K_EN = Wav2Vec2ASRBundle(
 )
 VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
 ("10k" subset, consisting of 23 languages).
 Fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
 redistributed with the same license.
 [`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
 `Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1113,11 +1113,11 @@ VOXPOPULI_ASR_BASE_10K_ES = Wav2Vec2ASRBundle(
 )
 VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
 ("10k" subset, consisting of 23 languages).
 Fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
 redistributed with the same license.
 [`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
 `Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1158,11 +1158,11 @@ VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
 )
 VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
 ("10k" subset, consisting of 23 languages).
 Fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
 redistributed with the same license.
 [`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
 `Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
@@ -1205,11 +1205,11 @@ VOXPOPULI_ASR_BASE_10K_IT = Wav2Vec2ASRBundle(
 )
 VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset [:footcite:`voxpopuli`]
+Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
 ("10k" subset, consisting of 23 languages).
 Fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
-Originally published by the authors of *VoxPopuli* [:footcite:`voxpopuli`] under CC BY-NC 4.0 and
+Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
 redistributed with the same license.
 [`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
 `Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
...
@@ -444,7 +444,7 @@ class _ConvEmformerLayer(torch.nn.Module):
 class ConvEmformer(_EmformerImpl):
 r"""Implements the convolution-augmented streaming transformer architecture introduced in
 *Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution*
-[:footcite:`9747706`].
+:cite:`9747706`.
 Args:
 input_dim (int): input dimension.
...
@@ -104,7 +104,7 @@ class MVDR(torch.nn.Module):
 Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
 We provide three solutions of MVDR beamforming. One is based on *reference channel selection*
-[:footcite:`souden2009optimal`] (``solution=ref_channel``).
+:cite:`souden2009optimal` (``solution=ref_channel``).
 .. math::
 \\textbf{w}_{\\text{MVDR}}(f) =\
@@ -126,7 +126,7 @@ class MVDR(torch.nn.Module):
 :math:`.^{\\mathsf{H}}` denotes the Hermitian Conjugate operation.
 We apply either *eigenvalue decomposition*
-[:footcite:`higuchi2016robust`] or the *power method* [:footcite:`mises1929praktische`] to get the
+:cite:`higuchi2016robust` or the *power method* :cite:`mises1929praktische` to get the
 steering vector from the PSD matrix of speech.
 After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
@@ -137,7 +137,7 @@ class MVDR(torch.nn.Module):
 where :math:`\\bf{Y}` and :math:`\\hat{\\bf{S}}` are the STFT of the multi-channel noisy speech and\
 the single-channel enhanced speech, respectively.
-For online streaming audio, we provide a *recursive method* [:footcite:`higuchi2017online`] to update the
+For online streaming audio, we provide a *recursive method* :cite:`higuchi2017online` to update the
 PSD matrices of speech and noise, respectively.
 Args:
@@ -341,7 +341,7 @@ class MVDR(torch.nn.Module):
 class RTFMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
 based on the relative transfer function (RTF) and power spectral density (PSD) matrix of noise.
 .. devices:: CPU CUDA
@@ -405,8 +405,8 @@ class RTFMVDR(torch.nn.Module):
 class SoudenMVDR(torch.nn.Module):
-r"""Minimum Variance Distortionless Response (*MVDR* [:footcite:`capon1969high`]) module
-based on the method proposed by *Souden et, al.* [:footcite:`souden2009optimal`].
+r"""Minimum Variance Distortionless Response (*MVDR* :cite:`capon1969high`) module
+based on the method proposed by *Souden et, al.* :cite:`souden2009optimal`.
 .. devices:: CPU CUDA
...
@@ -214,8 +214,8 @@ class GriffinLim(torch.nn.Module):
 .. properties:: Autograd TorchScript
 Implementation ported from
-*librosa* [:footcite:`brian_mcfee-proc-scipy-2015`], *A fast Griffin-Lim algorithm* [:footcite:`6701851`]
-and *Signal estimation from modified short-time Fourier transform* [:footcite:`1172092`].
+*librosa* :cite:`brian_mcfee-proc-scipy-2015`, *A fast Griffin-Lim algorithm* :cite:`6701851`
+and *Signal estimation from modified short-time Fourier transform* :cite:`1172092`.
 Args:
 n_fft (int, optional): Size of FFT, creates ``n_fft // 2 + 1`` bins. (Default: ``400``)
@@ -1040,7 +1040,7 @@ class TimeStretch(torch.nn.Module):
 .. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
 Args:
 hop_length (int or None, optional): Length of hop between STFT windows. (Default: ``win_length // 2``)
@@ -1226,7 +1226,7 @@ class FrequencyMasking(_AxisMasking):
 .. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
 Args:
 freq_mask_param (int): maximum possible length of the mask.
@@ -1260,7 +1260,7 @@ class TimeMasking(_AxisMasking):
 .. properties:: Autograd TorchScript
-Proposed in *SpecAugment* [:footcite:`specaugment`].
+Proposed in *SpecAugment* :cite:`specaugment`.
 Args:
 time_mask_param (int): maximum possible length of the mask.
@@ -1724,7 +1724,7 @@ class PitchShift(LazyModuleMixin, torch.nn.Module):
 class RNNTLoss(torch.nn.Module):
 """Compute the RNN Transducer loss from *Sequence Transduction with Recurrent Neural Networks*
-[:footcite:`graves2012sequence`].
+:cite:`graves2012sequence`.
 .. devices:: CPU CUDA
...