Commit 0b3ddec6 authored by moto, committed by Facebook GitHub Bot

Adopt `:autosummary:` in `torchaudio.pipelines` module doc (#2689)

Summary:
* Introduce a mini-index on the `torchaudio.pipelines` page.
* Add introductions.
* Update pipeline tutorials.

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/pipelines.html

<img width="1163" alt="Screen Shot 2022-09-20 at 1 23 29 PM" src="https://user-images.githubusercontent.com/855818/191167049-98324e93-2e16-41db-8538-3b5b54eb8224.png">

<img width="1115" alt="Screen Shot 2022-09-20 at 1 23 49 PM" src="https://user-images.githubusercontent.com/855818/191167071-4770f594-2540-43a4-a01c-e983bf59220f.png">

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle

<img width="1108" alt="Screen Shot 2022-09-20 at 1 24 18 PM" src="https://user-images.githubusercontent.com/855818/191167123-51b33a5f-c30c-46bc-b002-b05d2d0d27b7.png">

Pull Request resolved: https://github.com/pytorch/audio/pull/2689

Reviewed By: carolineechen

Differential Revision: D39691253

Pulled By: mthrok

fbshipit-source-id: ddf5fdadb0b64cf2867b6271ba53e8e8c0fa7e49
parent 045cc372
..
   autogenerated from source/_templates/autosummary/bundle_class.rst

{{ name | underline }}

.. autoclass:: {{ fullname }}()

{%- if name in ["RNNTBundle.FeatureExtractor", "RNNTBundle.TokenProcessor"] %}
{%- set methods = ["__call__"] %}
{%- elif name == "Tacotron2TTSBundle.TextProcessor" %}
{%- set attributes = ["tokens"] %}
{%- set methods = ["__call__"] %}
{%- elif name == "Tacotron2TTSBundle.Vocoder" %}
{%- set attributes = ["sample_rate"] %}
{%- set methods = ["__call__"] %}
{% endif %}

..
   ATTRIBUTES

{%- for item in attributes %}
{%- if not item.startswith('_') %}

{{ item | underline("-") }}

.. container:: py attribute

   .. autoproperty:: {{ [fullname, item] | join('.') }}

{%- endif %}
{%- endfor %}

..
   METHODS

{%- for item in methods %}
{%- if item != "__init__" %}

{{ item | underline("-") }}

.. container:: py attribute

   .. automethod:: {{ [fullname, item] | join('.') }}

{%- endif %}
{%- endfor %}
..
   autogenerated from source/_templates/autosummary/bundle_data.rst

{{ name | underline }}

.. container:: py attribute

   .. autodata:: {{ fullname }}
      :no-value:
torchaudio.pipelines
====================

.. currentmodule:: torchaudio.pipelines

.. py:module:: torchaudio.pipelines

The ``torchaudio.pipelines`` module packages pre-trained models with support functions and meta-data into simple APIs tailored to perform specific tasks.

When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extraction and post processing in the same way they were done during the training. This requires carrying over information used during the training, such as the type of transforms and their parameters (for example, sampling rate and the number of FFT bins).

To tie this information to a pre-trained model and make it easily accessible, the ``torchaudio.pipelines`` module uses the concept of a ``Bundle`` class, which defines a set of APIs to instantiate pipelines and the interface of the pipelines.

The following figure illustrates this.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png

A pre-trained model and associated pipelines are expressed as an instance of ``Bundle``. Different instances of the same ``Bundle`` share the interface, but their implementations are not constrained to be of the same types. For example, :class:`SourceSeparationBundle` defines the interface for performing source separation, but its instance :data:`CONVTASNET_BASE_LIBRI2MIX` instantiates a model of :class:`~torchaudio.models.ConvTasNet`, while :data:`HDEMUCS_HIGH_MUSDB` instantiates a model of :class:`~torchaudio.models.HDemucs`. Still, because they share the same interface, the usage is the same.

.. note::

   Under the hood, the implementations of ``Bundle`` use components from other ``torchaudio`` modules, such as :mod:`torchaudio.models` and :mod:`torchaudio.transforms`, or even third-party libraries like `SentencePiece <https://github.com/google/sentencepiece>`__ and `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__. This implementation detail is abstracted away from library users.
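In practice, accessing a pipeline looks roughly like the following minimal sketch (using :data:`WAV2VEC2_ASR_BASE_960H` as an example; ``"speech.wav"`` is a hypothetical input file, and the exact attributes available depend on the bundle):

.. code-block:: python

   import torch
   import torchaudio

   # Pick a bundle; its attributes describe how the model was trained.
   bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
   print(bundle.sample_rate)   # expected input sampling rate
   print(bundle.get_labels())  # class labels of the ASR output

   # Instantiate the model with pre-trained weights (downloaded on first use).
   model = bundle.get_model()

   waveform, sample_rate = torchaudio.load("speech.wav")
   if sample_rate != bundle.sample_rate:
       waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       emission, _ = model(waveform)  # frame-wise probabilities over bundle.get_labels()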
RNN-T Streaming/Non-Streaming ASR
---------------------------------

Interface
^^^^^^^^^

``RNNTBundle`` defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   RNNTBundle
   RNNTBundle.FeatureExtractor
   RNNTBundle.TokenProcessor

.. rubric:: Tutorials using ``RNNTBundle``

.. minigallery:: torchaudio.pipelines.RNNTBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   EMFORMER_RNNT_BASE_LIBRISPEECH
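As a rough sketch, non-streaming decoding with :data:`EMFORMER_RNNT_BASE_LIBRISPEECH` follows the three steps above (``"speech.wav"`` is a hypothetical 16 kHz mono file, and the hypothesis handling is simplified; see the :class:`RNNTBundle` documentation for the exact types):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

   feature_extractor = bundle.get_feature_extractor()  # step 1: feature extraction
   decoder = bundle.get_decoder()                       # step 2: inference (beam search)
   token_processor = bundle.get_token_processor()       # step 3: de-tokenization

   waveform, sample_rate = torchaudio.load("speech.wav")
   with torch.inference_mode():
       features, length = feature_extractor(waveform.squeeze())
       hypotheses = decoder(features, length, 10)        # beam width of 10
   print(token_processor(hypotheses[0][0]))              # best hypothesis as text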
wav2vec 2.0 / HuBERT - SSL
--------------------------

Interface
^^^^^^^^^

``Wav2Vec2Bundle`` instantiates models that generate acoustic features that can be used for downstream inference and fine-tuning.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   Wav2Vec2Bundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   WAV2VEC2_BASE
   WAV2VEC2_LARGE
   WAV2VEC2_LARGE_LV60K
   WAV2VEC2_XLSR53
   HUBERT_BASE
   HUBERT_LARGE
   HUBERT_XLARGE
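A minimal sketch of extracting acoustic features with one of these bundles (:data:`HUBERT_BASE` is used only as an example, and ``"speech.wav"`` is a hypothetical input file):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.HUBERT_BASE
   model = bundle.get_model()

   waveform, sample_rate = torchaudio.load("speech.wav")
   waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       # List of Tensors, one per transformer layer, each of shape (batch, frames, feature dim)
       features, _ = model.extract_features(waveform)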
wav2vec 2.0 / HuBERT - Fine-tuned ASR
-------------------------------------

Interface
^^^^^^^^^

``Wav2Vec2ASRBundle`` instantiates models that generate probability distributions over pre-defined labels, which can be used for ASR.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   Wav2Vec2ASRBundle

.. rubric:: Tutorials using ``Wav2Vec2ASRBundle``

.. minigallery:: torchaudio.pipelines.Wav2Vec2ASRBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   WAV2VEC2_ASR_BASE_10M
   WAV2VEC2_ASR_BASE_100H
   WAV2VEC2_ASR_BASE_960H
   WAV2VEC2_ASR_LARGE_10M
   WAV2VEC2_ASR_LARGE_100H
   WAV2VEC2_ASR_LARGE_960H
   WAV2VEC2_ASR_LARGE_LV60K_10M
   WAV2VEC2_ASR_LARGE_LV60K_100H
   WAV2VEC2_ASR_LARGE_LV60K_960H
   VOXPOPULI_ASR_BASE_10K_DE
   VOXPOPULI_ASR_BASE_10K_EN
   VOXPOPULI_ASR_BASE_10K_ES
   VOXPOPULI_ASR_BASE_10K_FR
   VOXPOPULI_ASR_BASE_10K_IT
   HUBERT_ASR_LARGE
   HUBERT_ASR_XLARGE
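A rough sketch of running ASR with one of these bundles and a naive greedy decoder (assuming the blank token is at index 0 and ``|`` marks word boundaries, which holds for the English bundles above; see the speech recognition tutorials for complete decoders):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
   model = bundle.get_model()
   labels = bundle.get_labels()

   waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
   waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       emission, _ = model(waveform)

   # Greedy (argmax) decoding: collapse repeats and drop the blank token.
   indices = torch.unique_consecutive(torch.argmax(emission[0], dim=-1)).tolist()
   transcript = "".join(labels[i] for i in indices if i != 0)
   print(transcript.replace("|", " "))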
Tacotron2 Text-To-Speech
------------------------

``Tacotron2TTSBundle`` defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation, and vocoder. The spectrogram generation is based on the :class:`~torchaudio.models.Tacotron2` model.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png

``TextProcessor`` can be rule-based tokenization in the case of characters, or it can be a neural-network-based G2P model that generates a sequence of phonemes from the input text.

Similarly, ``Vocoder`` can be an algorithm without learned parameters, like `Griffin-Lim`, or a neural-network-based model, like `WaveGlow`.

Interface
^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   Tacotron2TTSBundle
   Tacotron2TTSBundle.TextProcessor
   Tacotron2TTSBundle.Vocoder

.. rubric:: Tutorials using ``Tacotron2TTSBundle``

.. minigallery:: torchaudio.pipelines.Tacotron2TTSBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   TACOTRON2_WAVERNN_PHONE_LJSPEECH
   TACOTRON2_WAVERNN_CHAR_LJSPEECH
   TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
   TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
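A minimal end-to-end sketch with one of these bundles (:data:`TACOTRON2_WAVERNN_PHONE_LJSPEECH` is used only as an example; the TTS tutorial walks through the same steps in detail):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

   processor = bundle.get_text_processor()  # tokenization
   tacotron2 = bundle.get_tacotron2()       # text -> mel spectrogram
   vocoder = bundle.get_vocoder()           # mel spectrogram -> waveform

   text = "Hello world! T T S stands for Text to Speech!"
   with torch.inference_mode():
       processed, lengths = processor(text)
       spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
       waveforms, _ = vocoder(spec, spec_lengths)

   torchaudio.save("output.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)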
Source Separation
-----------------

Interface
^^^^^^^^^

``SourceSeparationBundle`` instantiates source separation models which take single-channel audio and generate multi-channel audio.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst

   SourceSeparationBundle

.. rubric:: Tutorials using ``SourceSeparationBundle``

.. minigallery:: torchaudio.pipelines.SourceSeparationBundle

Pretrained Models
^^^^^^^^^^^^^^^^^

.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst

   CONVTASNET_BASE_LIBRI2MIX
   HDEMUCS_HIGH_MUSDB_PLUS
   HDEMUCS_HIGH_MUSDB
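A minimal sketch of running separation with one of these bundles (the input is assumed to be a single-channel mixture reshaped to `(batch, 1, frames)`; ``"mixture.wav"`` is a hypothetical file):

.. code-block:: python

   import torch
   import torchaudio

   bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
   model = bundle.get_model()

   mixture, sample_rate = torchaudio.load("mixture.wav")
   mixture = torchaudio.functional.resample(mixture, sample_rate, bundle.sample_rate)

   with torch.inference_mode():
       sources = model(mixture.unsqueeze(0))  # assumed output shape: (batch, num_sources, frames)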
...@@ -410,3 +410,16 @@
  booktitle={Proceedings of the ISMIR 2021 Workshop on Music Source Separation},
  year={2021}
}
@article{CATTONI2021101155,
title = {MuST-C: A multilingual corpus for end-to-end speech translation},
journal = {Computer Speech & Language},
volume = {66},
pages = {101155},
year = {2021},
issn = {0885-2308},
doi = {https://doi.org/10.1016/j.csl.2020.101155},
url = {https://www.sciencedirect.com/science/article/pii/S0885230820300887},
author = {Roldano Cattoni and Mattia Antonino {Di Gangi} and Luisa Bentivogli and Matteo Negri and Marco Turchi},
keywords = {Spoken language translation, Multilingual corpus},
abstract = {End-to-end spoken language translation (SLT) has recently gained popularity thanks to the advancement of sequence to sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront with the scarcity of publicly available corpora to train data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small and of limited language coverage. We contribute to fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.}
}
.. py:module:: torchaudio.transforms

torchaudio.transforms
=====================

.. currentmodule:: torchaudio.transforms

The ``torchaudio.transforms`` module contains common audio processing and feature extraction transforms. The following diagram shows the relationship between some of the available transforms.

.. image:: https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png

...
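For instance, a mel-scale spectrogram can be computed in one step with :class:`MelSpectrogram`, or by composing :class:`Spectrogram` and :class:`MelScale`, which is the kind of relationship the diagram depicts. A minimal sketch (``"speech.wav"`` is a hypothetical input file):

.. code-block:: python

   import torchaudio

   waveform, sample_rate = torchaudio.load("speech.wav")

   # One-step transform ...
   mel_spec = torchaudio.transforms.MelSpectrogram(
       sample_rate=sample_rate, n_fft=1024, n_mels=64
   )(waveform)

   # ... equivalent to composing Spectrogram and MelScale.
   spec = torchaudio.transforms.Spectrogram(n_fft=1024, power=2.0)(waveform)
   mel_spec2 = torchaudio.transforms.MelScale(
       n_mels=64, sample_rate=sample_rate, n_stft=1024 // 2 + 1
   )(spec)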
...@@ -72,9 +72,10 @@ from torchaudio.utils import download_asset
# We use the pretrained `Wav2Vec 2.0 <https://arxiv.org/abs/2006.11477>`__
# Base model that is finetuned on 10 min of the `LibriSpeech
# dataset <http://www.openslr.org/12>`__, which can be loaded in using
# :data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M`.
# For more detail on running Wav2Vec 2.0 speech
# recognition pipelines in torchaudio, please refer to `this
# tutorial <./speech_recognition_pipeline_tutorial.html>`__.
#

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M

...@@ -177,7 +178,7 @@ print(tokens)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Pretrained files for the LibriSpeech dataset can be downloaded using
# :py:func:`~torchaudio.models.decoder.download_pretrained_files`.
#
# Note: this cell may take a couple of minutes to run, as the language
# model can be large

...@@ -202,7 +203,7 @@ print(files)
# Beam Search Decoder
# ~~~~~~~~~~~~~~~~~~~
# The decoder can be constructed using the factory function
# :py:func:`~torchaudio.models.decoder.ctc_decoder`.
# In addition to the previously mentioned components, it also takes in various beam
# search decoding parameters and token/word parameters.
#

...@@ -262,7 +263,7 @@ greedy_decoder = GreedyCTCDecoder(tokens)
#
# Now that we have the data, acoustic model, and decoder, we can perform
# inference. The output of the beam search decoder is of type
# :py:class:`~torchaudio.models.decoder.CTCHypothesis`, consisting of the
# predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score,
# and timesteps corresponding to the token IDs. Recall the transcript corresponding to the
# waveform is

...@@ -307,7 +308,8 @@ print(f"WER: {beam_search_wer}")
######################################################################
# .. note::
#
#    The :py:attr:`~torchaudio.models.decoder.CTCHypothesis.words`
#    field of the output hypotheses will be empty if no lexicon
#    is provided to the decoder. To retrieve a transcript with lexicon-free
#    decoding, you can perform the following to retrieve the token indices,
#    convert them to original tokens, then join them together.
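######################################################################
# A rough sketch of that post-processing (the variable names are illustrative:
# ``beam_search_result`` is assumed to be the decoder output from above and
# ``tokens`` the token list loaded earlier):

best_hypothesis = beam_search_result[0][0]  # top hypothesis of the first utterance
token_str = "".join(tokens[i] for i in best_hypothesis.tokens.tolist())
lexicon_free_transcript = " ".join(token_str.split("|"))  # "|" marks word boundaries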
...
...@@ -74,9 +74,9 @@ except ModuleNotFoundError:
# -------------------------
#
# Pre-trained model weights and related pipeline components are
# bundled as :py:class:`torchaudio.pipelines.RNNTBundle`.
#
# We use :py:data:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`,
# which is an Emformer RNN-T model trained on the LibriSpeech dataset.
#

...@@ -112,7 +112,7 @@ print(f"Right context: {context_length} frames ({context_length / sample_rate} s
# 4. Configure the audio stream
# -----------------------------
#
# Next, we configure the input audio stream using :py:class:`torchaudio.io.StreamReader`.
#
# For the detail of this API, please refer to the
# `Media Stream API tutorial <./streaming_api_tutorial.html>`__.

...
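######################################################################
# A rough sketch of that configuration (``segment_length`` and ``bundle`` are
# assumed to be defined earlier in this tutorial; the source can also be a URL
# or a microphone device):

from torchaudio.io import StreamReader

streamer = StreamReader(src="input.wav")  # hypothetical local file
streamer.add_basic_audio_stream(frames_per_chunk=segment_length, sample_rate=bundle.sample_rate)

for (chunk,) in streamer.stream():
    pass  # each ``chunk`` holds ``frames_per_chunk`` frames resampled to ``bundle.sample_rate``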
...@@ -26,7 +26,7 @@ pre-trained models from wav2vec 2.0
# Torchaudio provides easy access to the pre-trained weights and
# associated information, such as the expected sample rate and class
# labels. They are bundled together and available under the
# :py:mod:`torchaudio.pipelines` module.
#

...@@ -34,36 +34,26 @@ pre-trained models from wav2vec 2.0
# Preparation
# -----------
#

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

torch.random.manual_seed(0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

######################################################################
#

import IPython
import matplotlib.pyplot as plt

from torchaudio.utils import download_asset

SPEECH_FILE = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")

######################################################################

...@@ -85,11 +75,10 @@ if not os.path.exists(SPEECH_FILE):
# for other downstream tasks as well, but this tutorial does not
# cover that.
#
# We will use :py:data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H` here.
#
# There are multiple pre-trained models available in :py:mod:`torchaudio.pipelines`.
# Please check the documentation for the detail of how they are trained.
#
# The bundle object provides the interface to instantiate the model and other
# information. Sampling rate and the class labels are found as follows.

...@@ -134,7 +123,7 @@ IPython.display.Audio(SPEECH_FILE)
#
# - :py:func:`torchaudio.functional.resample` works on CUDA tensors as well.
# - When performing resampling multiple times on the same set of sample rates,
#   using :py:class:`torchaudio.transforms.Resample` might improve the performance.
#
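######################################################################
# For example (a minimal sketch; ``clips`` is a hypothetical list of waveforms
# that all share the same source rate), the transform builds its resampling
# kernel once and reuses it for every call:

resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)  # assumed rates
resampled = [resampler(clip) for clip in clips]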
waveform, sample_rate = torchaudio.load(SPEECH_FILE)

...@@ -167,7 +156,7 @@ with torch.inference_mode():
fig, ax = plt.subplots(len(features), 1, figsize=(16, 4.3 * len(features)))
for i, feats in enumerate(features):
    ax[i].imshow(feats[0].cpu(), interpolation="nearest")
    ax[i].set_title(f"Feature from transformer layer {i+1}")
    ax[i].set_xlabel("Feature dimension")
    ax[i].set_ylabel("Frame (time-axis)")

...@@ -197,7 +186,7 @@ with torch.inference_mode():
# Let’s visualize this.
#

plt.imshow(emission[0].cpu().T, interpolation="nearest")
plt.title("Classification result")
plt.xlabel("Frame (time-axis)")
plt.ylabel("Class")

...@@ -291,7 +280,7 @@ IPython.display.Audio(SPEECH_FILE)
# Conclusion
# ----------
#
# In this tutorial, we looked at how to use :py:class:`~torchaudio.pipelines.Wav2Vec2ASRBundle` to
# perform acoustic feature extraction and speech recognition. Constructing
# a model and getting the emission is as short as two lines.
#

...
...@@ -45,7 +45,7 @@ import matplotlib.pyplot as plt
#
# .. image:: https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png
#
# All the related components are bundled in :py:class:`torchaudio.pipelines.Tacotron2TTSBundle`,
# but this tutorial will also cover the process under the hood.

######################################################################

...@@ -196,10 +196,11 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
# :py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching
# models and processors together so that it is easy to create the pipeline.
#
# For the available bundles and their usage, please refer to
# :py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.
#

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

...@@ -271,8 +272,7 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)

######################################################################

...@@ -280,7 +280,9 @@ IPython.display.Audio("_assets/output_wavernn.wav")
# ~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate
# the vocoder object with the
# :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
# method and pass the spectrogram.
#

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

...@@ -300,12 +302,7 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)

######################################################################

...@@ -344,5 +341,4 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

IPython.display.Audio(waveforms[0:1].cpu(), rate=22050)
...@@ -10,9 +10,7 @@ from torchaudio.models import conv_tasnet_base, hdemucs_high
@dataclass
class SourceSeparationBundle:
    """Dataclass that bundles components for performing source separation.

    Example
        >>> import torchaudio

...@@ -66,16 +64,16 @@ CONVTASNET_BASE_LIBRI2MIX = SourceSeparationBundle(
    _model_factory_func=partial(conv_tasnet_base, num_sources=2),
    _sample_rate=8000,
)
CONVTASNET_BASE_LIBRI2MIX.__doc__ = """Pre-trained Source Separation pipeline with *ConvTasNet*
:cite:`Luo_2019` trained on *Libri2Mix dataset* :cite:`cosentino2020librimix`.

The source separation model is constructed by :func:`~torchaudio.models.conv_tasnet_base`
and is trained using the training script ``lightning_train.py``
`here <https://github.com/pytorch/audio/tree/release/0.12/examples/source_separation/>`__
with default arguments.

Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""

HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
...@@ -83,14 +81,16 @@ HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
    _model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
    _sample_rate=44100,
)
HDEMUCS_HIGH_MUSDB_PLUS.__doc__ = """Pre-trained music source separation pipeline with
*Hybrid Demucs* :cite:`defossez2021hybrid` trained on MUSDB-HQ :cite:`MUSDB18HQ`
and additional internal training data.

The model is constructed by :func:`~torchaudio.models.hdemucs_high`.

Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.

Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""

HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
...@@ -98,11 +98,11 @@ HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
    _model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
    _sample_rate=44100,
)
HDEMUCS_HIGH_MUSDB.__doc__ = """Pre-trained music source separation pipeline with
*Hybrid Demucs* :cite:`defossez2021hybrid` trained on MUSDB-HQ :cite:`MUSDB18HQ`.

The model is constructed by :func:`~torchaudio.models.hdemucs_high`.
Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.

Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""
...@@ -213,17 +213,14 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH = _Tacotron2GriffinLimCharBundle(
    _tacotron2_path="tacotron2_english_characters_1500_epochs_ljspeech.pth",
    _tacotron2_params=utils._get_taco_params(n_symbols=38),
)
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.transforms.GriffinLim` as vocoder.

The text processor encodes the input texts character-by-character.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The default parameters were used.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

Example - "Hello world! T T S stands for Text to Speech!"

...@@ -255,8 +252,8 @@ TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH = _Tacotron2GriffinLimPhoneBundle(
    _tacotron2_path="tacotron2_english_phonemes_1500_epochs_ljspeech.pth",
    _tacotron2_params=utils._get_taco_params(n_symbols=96),
)
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.transforms.GriffinLim` as vocoder.

The text processor encodes the input texts based on phoneme.
It uses `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__ to convert
graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The text processor is set to the *"english_phonemes"*.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

Example - "Hello world! T T S stands for Text to Speech!"

...@@ -304,18 +298,14 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH = _Tacotron2WaveRNNCharBundle(
    _wavernn_path="wavernn_10k_epochs_8bits_ljspeech.pth",
    _wavernn_params=utils._get_wrnn_params(),
)
TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.

The text processor encodes the input texts character-by-character.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.

You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

...@@ -351,8 +341,8 @@ TACOTRON2_WAVERNN_PHONE_LJSPEECH = _Tacotron2WaveRNNPhoneBundle(
    _wavernn_path="wavernn_10k_epochs_8bits_ljspeech.pth",
    _wavernn_params=utils._get_wrnn_params(),
)
TACOTRON2_WAVERNN_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.

The text processor encodes the input texts based on phoneme.
It uses `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__ to convert
graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.

You can find the training script for Tacotron2 `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.

You can find the training script for WaveRNN `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.

Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.

...
...@@ -11,8 +11,6 @@ class _TextProcessor(ABC):
    def tokens(self):
        """The tokens that the each value in the processed tensor represent.

        :type: List[str]
        """

...@@ -20,8 +18,6 @@ class _TextProcessor(ABC):
    def __call__(self, texts: Union[str, List[str]]) -> Tuple[Tensor, Tensor]:
        """Encode the given (batch of) texts into numerical tensors

        Args:
            text (str or list of str): The input texts.

...@@ -40,8 +36,6 @@ class _Vocoder(ABC):
    def sample_rate(self):
        """The sample rate of the resulting waveform

        :type: float
        """

...@@ -49,8 +43,6 @@ class _Vocoder(ABC):
    def __call__(self, specgrams: Tensor, lengths: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:
        """Generate waveform from the given input, such as spectrogram

        Args:
            specgrams (Tensor):
                The input spectrogram. Shape: `(batch, frequency bins, time)`.

...@@ -149,22 +141,19 @@ class Tacotron2TTSBundle(ABC):
    # The thing is, text processing and vocoder are generic and we do not know what kind of
    # new text processing and vocoder will be added in the future, so we want to make these
    # interfaces specific to this Tacotron2TTS pipeline.

    class TextProcessor(_TextProcessor):
        """Interface of the text processing part of Tacotron2TTS pipeline

        See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor` for the usage.
        """

    class Vocoder(_Vocoder):
        """Interface of the vocoder part of Tacotron2TTS pipeline

        See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder` for the usage.
        """

    @abstractmethod
    def get_text_processor(self, *, dl_kwargs=None) -> TextProcessor:
        """Create a text processor

...@@ -181,7 +170,7 @@ class Tacotron2TTSBundle(ABC):
                Passed to :func:`torch.hub.download_url_to_file`.

        Returns:
            TextProcessor:
                A callable which takes a string or a list of strings as input and
                returns Tensor of encoded texts and Tensor of valid lengths.
                The object also has ``tokens`` property, which allows to recover the

...@@ -246,7 +235,7 @@ class Tacotron2TTSBundle(ABC):
                Passed to :func:`torch.hub.load_state_dict_from_url`.

        Returns:
            Vocoder:
                A vocoder module, which takes spectrogram Tensor and an optional
                length Tensor, then returns resulting waveform Tensor and an optional
                length Tensor.

...
...@@ -13,9 +13,7 @@ __all__ = [] ...@@ -13,9 +13,7 @@ __all__ = []
@dataclass @dataclass
class Wav2Vec2Bundle: class Wav2Vec2Bundle:
"""torchaudio.pipelines.Wav2Vec2Bundle() """Data class that bundles associated information to use pretrained :py:class:`~torchaudio.models.Wav2Vec2Model`.
Data class that bundles associated information to use pretrained Wav2Vec2Model.
This class provides interfaces for instantiating the pretrained model along with This class provides interfaces for instantiating the pretrained model along with
the information necessary to retrieve pretrained weights and additional data the information necessary to retrieve pretrained weights and additional data
...@@ -79,9 +77,8 @@ class Wav2Vec2Bundle: ...@@ -79,9 +77,8 @@ class Wav2Vec2Bundle:
@dataclass @dataclass
class Wav2Vec2ASRBundle(Wav2Vec2Bundle): class Wav2Vec2ASRBundle(Wav2Vec2Bundle):
"""torchaudio.pipelines.Wav2Vec2ASRBundle() """Data class that bundles associated information to use pretrained
:py:class:`~torchaudio.models.Wav2Vec2Model`.
Data class that bundles associated information to use pretrained Wav2Vec2Model.
This class provides interfaces for instantiating the pretrained model along with This class provides interfaces for instantiating the pretrained model along with
the information necessary to retrieve pretrained weights and additional data the information necessary to retrieve pretrained weights and additional data
...@@ -196,18 +193,16 @@ WAV2VEC2_BASE = Wav2Vec2Bundle( ...@@ -196,18 +193,16 @@ WAV2VEC2_BASE = Wav2Vec2Bundle(
}, },
_sample_rate=16000, _sample_rate=16000,
) )
WAV2VEC2_BASE.__doc__ = """wav2vec 2.0 model with "Base" configuration. WAV2VEC2_BASE.__doc__ = """Wav2vec 2.0 model ("base" architecture),
pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964` (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Not fine-tuned.
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license. redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__, [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__] `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage. Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501 """ # noqa: E501
WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle( WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
...@@ -241,9 +236,8 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle( ...@@ -241,9 +236,8 @@ WAV2VEC2_ASR_BASE_10M = Wav2Vec2ASRBundle(
_labels=utils._get_en_labels(), _labels=utils._get_en_labels(),
_sample_rate=16000, _sample_rate=16000,
) )
WAV2VEC2_ASR_BASE_10M.__doc__ = """Build "base" wav2vec2 model with an extra linear module WAV2VEC2_ASR_BASE_10M.__doc__ = """Wav2vec 2.0 model ("base" architecture with an extra linear module),
pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
:cite:`librilight` ("train-10min" subset). :cite:`librilight` ("train-10min" subset).
...@@ -253,7 +247,7 @@ redistributed with the same license. ...@@ -253,7 +247,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__, [`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__] `Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage. Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501 """ # noqa: E501
WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
@@ -288,9 +282,8 @@ WAV2VEC2_ASR_BASE_100H = Wav2Vec2ASRBundle(
    _sample_rate=16000,
)
-WAV2VEC2_ASR_BASE_100H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_BASE_100H.__doc__ = """Wav2vec 2.0 model ("base" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset.
@@ -299,7 +292,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
@@ -333,9 +326,8 @@ WAV2VEC2_ASR_BASE_960H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_BASE_960H.__doc__ = """Build "base" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_BASE_960H.__doc__ = """Wav2vec 2.0 model ("base" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
@@ -344,7 +336,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_LARGE = Wav2Vec2Bundle(
@@ -377,18 +369,16 @@ WAV2VEC2_LARGE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-WAV2VEC2_LARGE.__doc__ = """Build "large" wav2vec2 model.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
-(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Not fine-tuned.
+WAV2VEC2_LARGE.__doc__ = """Wav2vec 2.0 model ("large" architecture),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
@@ -422,9 +412,8 @@ WAV2VEC2_ASR_LARGE_10M = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_10M.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_10M.__doc__ = """Wav2vec 2.0 model ("large" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 10 minutes of transcribed audio from *Libri-Light* dataset
:cite:`librilight` ("train-10min" subset).
@@ -434,7 +423,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
@@ -468,9 +457,8 @@ WAV2VEC2_ASR_LARGE_100H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_100H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_100H.__doc__ = """Wav2vec 2.0 model ("large" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on 100 hours of transcribed audio from
the same dataset ("train-clean-100" subset).
@@ -480,7 +468,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
@@ -514,9 +502,8 @@ WAV2VEC2_ASR_LARGE_960H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_960H.__doc__ = """Build "large" wav2vec2 model with an extra linear module
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_960H.__doc__ = """Wav2vec 2.0 model ("large" architecture with an extra linear module),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and
fine-tuned for ASR on the same audio with the corresponding transcripts.
@@ -525,7 +512,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
@@ -558,18 +545,16 @@ WAV2VEC2_LARGE_LV60K = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-WAV2VEC2_LARGE_LV60K.__doc__ = """Build "large-lv60k" wav2vec2 model.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`.
-Not fine-tuned.
+WAV2VEC2_LARGE_LV60K.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`,
+not fine-tuned.
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
@@ -603,19 +588,16 @@ WAV2VEC2_ASR_LARGE_LV60K_10M = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`, and
-fine-tuned for ASR on 10 minutes of transcribed audio from
-the same dataset ("train-10min" subset).
+WAV2VEC2_ASR_LARGE_LV60K_10M.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`, and
+fine-tuned for ASR on 10 minutes of transcribed audio from the same dataset ("train-10min" subset).
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
@@ -649,10 +631,8 @@ WAV2VEC2_ASR_LARGE_LV60K_100H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`, and
+WAV2VEC2_ASR_LARGE_LV60K_100H.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 100 hours of transcribed audio from
*LibriSpeech* dataset :cite:`7178964` ("train-clean-100" subset).
@@ -661,7 +641,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
@@ -695,12 +675,9 @@ WAV2VEC2_ASR_LARGE_LV60K_960H = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Build "large-lv60k" wav2vec2 model with an extra linear module
-Pre-trained on 60,000 hours of unlabeled audio from *Libri-Light*
-:cite:`librilight` dataset, and
-fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset :cite:`7178964`
+WAV2VEC2_ASR_LARGE_LV60K_960H.__doc__ = """Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* :cite:`librilight` dataset, and
+fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Originally published by the authors of *wav2vec 2.0* :cite:`baevski2020wav2vec` under MIT License and
@@ -708,7 +685,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
@@ -741,13 +718,12 @@ WAV2VEC2_XLSR53 = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-WAV2VEC2_XLSR53.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Trained on 56,000 hours of unlabeled audio from multiple datasets (
+WAV2VEC2_XLSR53.__doc__ = """Wav2vec 2.0 model ("base" architecture),
+pre-trained on 56,000 hours of unlabeled audio from multiple datasets (
*Multilingual LibriSpeech* :cite:`Pratap_2020`,
*CommonVoice* :cite:`ardila2020common` and
-*BABEL* :cite:`Gales2014SpeechRA`).
-Not fine-tuned.
+*BABEL* :cite:`Gales2014SpeechRA`),
+not fine-tuned.
Originally published by the authors of
*Unsupervised Cross-lingual Representation Learning for Speech Recognition*
@@ -755,7 +731,7 @@ Originally published by the authors of
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/wav2vec#pre-trained-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_BASE = Wav2Vec2Bundle(
@@ -788,18 +764,16 @@ HUBERT_BASE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-HUBERT_BASE.__doc__ = """HuBERT model with "Base" configuration.
-Pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
-(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
-Not fine-tuned.
+HUBERT_BASE.__doc__ = """HuBERT model ("base" architecture),
+pre-trained on 960 hours of unlabeled audio from *LibriSpeech* dataset :cite:`7178964`
+(the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_LARGE = Wav2Vec2Bundle(
@@ -832,18 +806,16 @@ HUBERT_LARGE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-HUBERT_LARGE.__doc__ = """HuBERT model with "Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`.
-Not fine-tuned.
+HUBERT_LARGE.__doc__ = """HuBERT model ("large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`,
+not fine-tuned.
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_XLARGE = Wav2Vec2Bundle(
@@ -876,18 +848,16 @@ HUBERT_XLARGE = Wav2Vec2Bundle(
    },
    _sample_rate=16000,
)
-HUBERT_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`.
-Not fine-tuned.
+HUBERT_XLARGE.__doc__ = """HuBERT model ("extra large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`,
+not fine-tuned.
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2Bundle` for the usage.
""" # noqa: E501
HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
@@ -921,12 +891,9 @@ HUBERT_ASR_LARGE = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-HUBERT_ASR_LARGE.__doc__ = """HuBERT model with "Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
-*Libri-Light* dataset :cite:`librilight`, and
-fine-tuned for ASR on 960 hours of transcribed audio from
-*LibriSpeech* dataset :cite:`7178964`
+HUBERT_ASR_LARGE.__doc__ = """HuBERT model ("large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from *Libri-Light* dataset :cite:`librilight`, and
+fine-tuned for ASR on 960 hours of transcribed audio from *LibriSpeech* dataset :cite:`7178964`
(the combination of "train-clean-100", "train-clean-360", and "train-other-500").
Originally published by the authors of *HuBERT* :cite:`hsu2021hubert` under MIT License and
@@ -934,7 +901,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
@@ -968,9 +935,8 @@ HUBERT_ASR_XLARGE = Wav2Vec2ASRBundle(
    _labels=utils._get_en_labels(),
    _sample_rate=16000,
)
-HUBERT_ASR_XLARGE.__doc__ = """HuBERT model with "Extra Large" configuration.
-Pre-trained on 60,000 hours of unlabeled audio from
+HUBERT_ASR_XLARGE.__doc__ = """HuBERT model ("extra large" architecture),
+pre-trained on 60,000 hours of unlabeled audio from
*Libri-Light* dataset :cite:`librilight`, and
fine-tuned for ASR on 960 hours of transcribed audio from
*LibriSpeech* dataset :cite:`7178964`
@@ -981,7 +947,7 @@ redistributed with the same license.
[`License <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/LICENSE>`__,
`Source <https://github.com/pytorch/fairseq/blob/ce6c9eeae163ac04b79539c78e74f292f29eaa18/examples/hubert#pre-trained-and-fine-tuned-asr-models>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1017,18 +983,17 @@ VOXPOPULI_ASR_BASE_10K_DE = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3, 35),
)
-VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
+VOXPOPULI_ASR_BASE_10K_DE.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1064,18 +1029,17 @@ VOXPOPULI_ASR_BASE_10K_EN = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3, 31),
)
-VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
+VOXPOPULI_ASR_BASE_10K_EN.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1111,18 +1075,17 @@ VOXPOPULI_ASR_BASE_10K_ES = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3, 35),
)
-VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
+VOXPOPULI_ASR_BASE_10K_ES.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
@@ -1156,18 +1119,17 @@ VOXPOPULI_ASR_BASE_10K_FR = Wav2Vec2ASRBundle(
    _labels=utils._get_fr_labels(),
    _sample_rate=16000,
)
-VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
+VOXPOPULI_ASR_BASE_10K_FR.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -1203,16 +1165,15 @@ VOXPOPULI_ASR_BASE_10K_IT = Wav2Vec2ASRBundle(
    _sample_rate=16000,
    _remove_aux_axis=(1, 2, 3),
)
-VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model with "Base" configuration.
-Pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
-("10k" subset, consisting of 23 languages).
-Fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
+VOXPOPULI_ASR_BASE_10K_IT.__doc__ = """wav2vec 2.0 model ("base" architecture),
+pre-trained on 10k hours of unlabeled audio from *VoxPopuli* dataset :cite:`voxpopuli`
+("10k" subset, consisting of 23 languages), and
+fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
Originally published by the authors of *VoxPopuli* :cite:`voxpopuli` under CC BY-NC 4.0 and
redistributed with the same license.
[`License <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#license>`__,
`Source <https://github.com/facebookresearch/voxpopuli/tree/160e4d7915bad9f99b2c35b1d3833e51fd30abf2#asr-and-lm>`__]
-Please refer to :func:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
+Please refer to :py:class:`torchaudio.pipelines.Wav2Vec2ASRBundle` for the usage.
""" # noqa: E501
@@ -151,9 +151,7 @@ class _SentencePieceTokenProcessor(_TokenProcessor):
@dataclass
class RNNTBundle:
-    """torchaudio.pipelines.RNNTBundle()
-    Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text)
+    """Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text)
    inference with an RNN-T model.
    More specifically, the class provides methods that produce the featurization pipeline,
@@ -165,7 +163,7 @@ class RNNTBundle:
    Users should not directly instantiate objects of this class; rather, users should use the
    instances (representing pre-trained models) that exist within the module,
-    e.g. :py:obj:`EMFORMER_RNNT_BASE_LIBRISPEECH`.
+    e.g. :data:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`.
    Example
        >>> import torchaudio
@@ -226,10 +224,10 @@ class RNNTBundle:
    """
    class FeatureExtractor(_FeatureExtractor):
-        pass
+        """Interface of the feature extraction part of RNN-T pipeline"""
    class TokenProcessor(_TokenProcessor):
-        pass
+        """Interface of the token processor part of RNN-T pipeline"""
    _rnnt_path: str
    _rnnt_factory_func: Callable[[], RNNT]
@@ -370,11 +368,13 @@ EMFORMER_RNNT_BASE_LIBRISPEECH = RNNTBundle(
    _segment_length=16,
    _right_context_length=4,
)
-EMFORMER_RNNT_BASE_LIBRISPEECH.__doc__ = """Pre-trained Emformer-RNNT-based ASR pipeline capable of performing both streaming and non-streaming inference.
+EMFORMER_RNNT_BASE_LIBRISPEECH.__doc__ = """ASR pipeline based on Emformer-RNNT,
+pretrained on *LibriSpeech* dataset :cite:`7178964`,
+capable of performing both streaming and non-streaming inference.
The underlying model is constructed by :py:func:`torchaudio.models.emformer_rnnt_base`
and utilizes weights trained on LibriSpeech using training script ``train.py``
`here <https://github.com/pytorch/audio/tree/main/examples/asr/emformer_rnnt>`__ with default arguments.
Please refer to :py:class:`RNNTBundle` for usage instructions.
"""