Commit 0b3ddec6 authored by moto's avatar moto Committed by Facebook GitHub Bot

Adopt `:autosummary:` in `torchaudio.pipelines` module doc (#2689)

Summary:
* Introduce a mini-index on the `torchaudio.pipelines` page.
* Add introductions.
* Update pipeline tutorials.

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/pipelines.html

<img width="1163" alt="Screen Shot 2022-09-20 at 1 23 29 PM" src="https://user-images.githubusercontent.com/855818/191167049-98324e93-2e16-41db-8538-3b5b54eb8224.png">

<img width="1115" alt="Screen Shot 2022-09-20 at 1 23 49 PM" src="https://user-images.githubusercontent.com/855818/191167071-4770f594-2540-43a4-a01c-e983bf59220f.png">

https://output.circle-artifacts.com/output/job/ccc57d95-1930-45c9-b967-c8d477d35f29/artifacts/0/docs/generated/torchaudio.pipelines.RNNTBundle.html#torchaudio.pipelines.RNNTBundle

<img width="1108" alt="Screen Shot 2022-09-20 at 1 24 18 PM" src="https://user-images.githubusercontent.com/855818/191167123-51b33a5f-c30c-46bc-b002-b05d2d0d27b7.png">

Pull Request resolved: https://github.com/pytorch/audio/pull/2689

Reviewed By: carolineechen

Differential Revision: D39691253

Pulled By: mthrok

fbshipit-source-id: ddf5fdadb0b64cf2867b6271ba53e8e8c0fa7e49
parent 045cc372
..
autogenerated from source/_templates/autosummary/bundle_class.rst
{{ name | underline }}
.. autoclass:: {{ fullname }}()
{%- if name in ["RNNTBundle.FeatureExtractor", "RNNTBundle.TokenProcessor"] %}
{%- set methods = ["__call__"] %}
{%- elif name == "Tacotron2TTSBundle.TextProcessor" %}
{%- set attributes = ["tokens"] %}
{%- set methods = ["__call__"] %}
{%- elif name == "Tacotron2TTSBundle.Vocoder" %}
{%- set attributes=["sample_rate"] %}
{%- set methods = ["__call__"] %}
{% endif %}
..
ATTRIBUTES
{%- for item in attributes %}
{%- if not item.startswith('_') %}
{{ item | underline("-") }}
.. container:: py attribute
.. autoproperty:: {{[fullname, item] | join('.')}}
{%- endif %}
{%- endfor %}
..
METHODS
{%- for item in methods %}
{%- if item != "__init__" %}
{{item | underline("-") }}
.. container:: py attribute
.. automethod:: {{[fullname, item] | join('.')}}
{%- endif %}
{%- endfor %}
..
autogenerated from source/_templates/autosummary/bundle_data.rst
{{ name | underline }}
.. container:: py attribute
.. autodata:: {{ fullname }}
:no-value:
.. py:module:: torchaudio.pipelines
torchaudio.pipelines
====================
.. currentmodule:: torchaudio.pipelines
.. py:module:: torchaudio.pipelines
The pipelines subpackage contains APIs to access models with pretrained weights, and information/helper functions associated with the pretrained weights.
RNN-T Streaming/Non-Streaming ASR
---------------------------------
RNNTBundle
~~~~~~~~~~
.. autoclass:: RNNTBundle
:members: sample_rate, n_fft, n_mels, hop_length, segment_length, right_context_length
.. automethod:: get_decoder
The ``torchaudio.pipelines`` module packages pre-trained models with support functions and meta-data into simple APIs tailored to perform specific tasks.
.. automethod:: get_feature_extractor
When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extraction and post-processing in the same way they were performed during training. This requires carrying over the information used during training, such as the types of transforms and their parameters (for example, the sampling rate and the number of FFT bins).
.. automethod:: get_streaming_feature_extractor
To make this information tied to a pre-trained model and easily accessible, the ``torchaudio.pipelines`` module uses the concept of a `Bundle` class, which defines a set of APIs to instantiate pipelines and the interfaces of the pipelines.
.. automethod:: get_token_processor
The following figure illustrates this.
RNNTBundle - FeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png
.. autoclass:: torchaudio.pipelines::RNNTBundle.FeatureExtractor
:special-members: __call__
A pre-trained model and associated pipelines are expressed as an instance of ``Bundle``. Different instances of the same ``Bundle`` share the interface, but their implementations are not constrained to be of the same types. For example, :class:`SourceSeparationBundle` defines the interface for performing source separation, but its instance :data:`CONVTASNET_BASE_LIBRI2MIX` instantiates a model of :class:`~torchaudio.models.ConvTasNet`, while :data:`HDEMUCS_HIGH_MUSDB` instantiates a model of :class:`~torchaudio.models.HDemucs`. Still, because they share the same interface, the usage is the same.
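As a minimal usage sketch of this shared interface (assuming network access so the weights can be fetched), the two instances mentioned above can be driven by identical client code:

import torchaudio

# Both instances implement the same SourceSeparationBundle interface,
# even though the underlying models differ.
for bundle in (
    torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX,  # ConvTasNet, 8 kHz
    torchaudio.pipelines.HDEMUCS_HIGH_MUSDB,  # HDemucs, 44.1 kHz
):
    model = bundle.get_model()  # fetches pre-trained weights on first use
    print(type(model).__name__, bundle.sample_rate)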
RNNTBundle - TokenProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
.. autoclass:: torchaudio.pipelines::RNNTBundle.TokenProcessor
:special-members: __call__
Under the hood, the implementations of ``Bundle`` use components from other ``torchaudio`` modules, such as :mod:`torchaudio.models` and :mod:`torchaudio.transforms`, or even third-party libraries like `SentencePiece <https://github.com/google/sentencepiece>`__ and `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__. But this implementation detail is abstracted away from library users.
EMFORMER_RNNT_BASE_LIBRISPEECH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: EMFORMER_RNNT_BASE_LIBRISPEECH
:no-value:
wav2vec 2.0 / HuBERT - Representation Learning
----------------------------------------------
.. autoclass:: Wav2Vec2Bundle
:members: sample_rate
.. automethod:: get_model
WAV2VEC2_BASE
~~~~~~~~~~~~~
.. container:: py attribute
RNN-T Streaming/Non-Streaming ASR
---------------------------------
.. autodata:: WAV2VEC2_BASE
:no-value:
Interface
^^^^^^^^^
WAV2VEC2_LARGE
~~~~~~~~~~~~~~
``RNNTBundle`` defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.
.. container:: py attribute
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png
.. autodata:: WAV2VEC2_LARGE
:no-value:
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
WAV2VEC2_LARGE_LV60K
~~~~~~~~~~~~~~~~~~~~
RNNTBundle
RNNTBundle.FeatureExtractor
RNNTBundle.TokenProcessor
.. container:: py attribute
.. rubric:: Tutorials using ``RNNTBundle``
.. autodata:: WAV2VEC2_LARGE_LV60K
:no-value:
.. minigallery:: torchaudio.pipelines.RNNTBundle
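A minimal sketch of the three steps described above, using the pre-trained instance listed in this section (the all-zero waveform is a placeholder; a real input should be mono audio at ``bundle.sample_rate``, and the beam width of 10 is illustrative):

import torch
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_feature_extractor()
decoder = bundle.get_decoder()
token_processor = bundle.get_token_processor()

waveform = torch.zeros(bundle.sample_rate)  # placeholder: one second of silence
with torch.inference_mode():
    features, length = feature_extractor(waveform)  # 1. feature extraction
    hypotheses = decoder(features, length, 10)  # 2. inference with beam search
    transcript = token_processor(hypotheses[0][0])  # 3. de-tokenization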
Pretrained Models
^^^^^^^^^^^^^^^^^
WAV2VEC2_XLSR53
~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
.. container:: py attribute
EMFORMER_RNNT_BASE_LIBRISPEECH
.. autodata:: WAV2VEC2_XLSR53
:no-value:
HUBERT_BASE
~~~~~~~~~~~
wav2vec 2.0 / HuBERT - SSL
--------------------------
.. container:: py attribute
Interface
^^^^^^^^^
.. autodata:: HUBERT_BASE
:no-value:
``Wav2Vec2Bundle`` instantiates models that generate acoustic features that can be used for downstream inference and fine-tuning.
HUBERT_LARGE
~~~~~~~~~~~~
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png
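A minimal feature-extraction sketch (the all-zero waveform is a placeholder for real 16 kHz audio):

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()

waveform = torch.zeros(1, bundle.sample_rate)  # placeholder (batch, time) input
with torch.inference_mode():
    features, lengths = model.extract_features(waveform)
# ``features`` is a list of tensors, one per transformer layer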
.. container:: py attribute
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
.. autodata:: HUBERT_LARGE
:no-value:
Wav2Vec2Bundle
HUBERT_XLARGE
~~~~~~~~~~~~~
Pretrained Models
^^^^^^^^^^^^^^^^^
.. container:: py attribute
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
.. autodata:: HUBERT_XLARGE
:no-value:
WAV2VEC2_BASE
WAV2VEC2_LARGE
WAV2VEC2_LARGE_LV60K
WAV2VEC2_XLSR53
HUBERT_BASE
HUBERT_LARGE
HUBERT_XLARGE
wav2vec 2.0 / HuBERT - Fine-tuned ASR
-------------------------------------
Wav2Vec2ASRBundle
~~~~~~~~~~~~~~~~~
.. autoclass:: Wav2Vec2ASRBundle
:members: sample_rate
.. automethod:: get_model
.. automethod:: get_labels
WAV2VEC2_ASR_BASE_10M
~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: WAV2VEC2_ASR_BASE_10M
:no-value:
WAV2VEC2_ASR_BASE_100H
~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: WAV2VEC2_ASR_BASE_100H
:no-value:
WAV2VEC2_ASR_BASE_960H
~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: WAV2VEC2_ASR_BASE_960H
:no-value:
WAV2VEC2_ASR_LARGE_10M
~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: WAV2VEC2_ASR_LARGE_10M
:no-value:
WAV2VEC2_ASR_LARGE_100H
~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: WAV2VEC2_ASR_LARGE_100H
:no-value:
WAV2VEC2_ASR_LARGE_960H
~~~~~~~~~~~~~~~~~~~~~~~
Interface
^^^^^^^^^
.. container:: py attribute
``Wav2Vec2ASRBundle`` instantiates models that generate a probability distribution over pre-defined labels, which can be used for ASR.
.. autodata:: WAV2VEC2_ASR_LARGE_960H
:no-value:
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png
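A minimal sketch with naive greedy decoding (the all-zero waveform is a placeholder, and real pipelines typically use a proper CTC decoder instead of this simplification):

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # e.g. ('-', '|', 'E', 'T', ...); '-' is blank, '|' is word boundary

waveform = torch.zeros(1, bundle.sample_rate)  # placeholder (batch, time) input
with torch.inference_mode():
    emission, _ = model(waveform)  # (batch, frame, num_labels) label scores

# greedy decoding: best label per frame, collapse repeats, drop blanks
indices = torch.unique_consecutive(emission[0].argmax(dim=-1)).tolist()
transcript = "".join(labels[i] for i in indices if labels[i] != "-").replace("|", " ")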
WAV2VEC2_ASR_LARGE_LV60K_10M
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
.. container:: py attribute
Wav2Vec2ASRBundle
.. autodata:: WAV2VEC2_ASR_LARGE_LV60K_10M
:no-value:
.. rubric:: Tutorials using ``Wav2Vec2ASRBundle``
WAV2VEC2_ASR_LARGE_LV60K_100H
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. minigallery:: torchaudio.pipelines.Wav2Vec2ASRBundle
.. container:: py attribute
Pretrained Models
^^^^^^^^^^^^^^^^^
.. autodata:: WAV2VEC2_ASR_LARGE_LV60K_100H
:no-value:
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
WAV2VEC2_ASR_LARGE_LV60K_960H
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
WAV2VEC2_ASR_BASE_10M
WAV2VEC2_ASR_BASE_100H
WAV2VEC2_ASR_BASE_960H
WAV2VEC2_ASR_LARGE_10M
WAV2VEC2_ASR_LARGE_100H
WAV2VEC2_ASR_LARGE_960H
WAV2VEC2_ASR_LARGE_LV60K_10M
WAV2VEC2_ASR_LARGE_LV60K_100H
WAV2VEC2_ASR_LARGE_LV60K_960H
VOXPOPULI_ASR_BASE_10K_DE
VOXPOPULI_ASR_BASE_10K_EN
VOXPOPULI_ASR_BASE_10K_ES
VOXPOPULI_ASR_BASE_10K_FR
VOXPOPULI_ASR_BASE_10K_IT
HUBERT_ASR_LARGE
HUBERT_ASR_XLARGE
.. container:: py attribute
.. autodata:: WAV2VEC2_ASR_LARGE_LV60K_960H
:no-value:
VOXPOPULI_ASR_BASE_10K_DE
~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: VOXPOPULI_ASR_BASE_10K_DE
:no-value:
VOXPOPULI_ASR_BASE_10K_EN
~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: VOXPOPULI_ASR_BASE_10K_EN
:no-value:
VOXPOPULI_ASR_BASE_10K_ES
~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: VOXPOPULI_ASR_BASE_10K_ES
:no-value:
VOXPOPULI_ASR_BASE_10K_FR
~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: VOXPOPULI_ASR_BASE_10K_FR
:no-value:
VOXPOPULI_ASR_BASE_10K_IT
~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: VOXPOPULI_ASR_BASE_10K_IT
:no-value:
HUBERT_ASR_LARGE
~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: HUBERT_ASR_LARGE
:no-value:
HUBERT_ASR_XLARGE
~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: HUBERT_ASR_XLARGE
:no-value:
Tacotron2 Text-To-Speech
------------------------
Tacotron2TTSBundle
~~~~~~~~~~~~~~~~~~
.. autoclass:: Tacotron2TTSBundle
.. automethod:: get_text_processor
.. automethod:: get_tacotron2
.. automethod:: get_vocoder
Tacotron2TTSBundle - TextProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: torchaudio.pipelines::Tacotron2TTSBundle.TextProcessor
:members: tokens
:special-members: __call__
``Tacotron2TTSBundle`` defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation, and vocoding. The spectrogram generation is based on the :class:`~torchaudio.models.Tacotron2` model.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png
Tacotron2TTSBundle - Vocoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``TextProcessor`` can be rule-based tokenization in the case of characters, or it can be a neural-network-based G2P model that generates a sequence of phonemes from the input text.
.. autoclass:: torchaudio.pipelines::Tacotron2TTSBundle.Vocoder
:members: sample_rate
:special-members: __call__
Similarly, ``Vocoder`` can be an algorithm without learned parameters, like `Griffin-Lim`, or a neural-network-based model like `Waveglow`.
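A minimal end-to-end sketch of the three steps (the input text is arbitrary, and ``get_text_processor`` may download the DeepPhonemizer checkpoint for phoneme-based bundles):

import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()

tokens, lengths = processor("Hello world!")  # 1. tokenization
with torch.inference_mode():
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)  # 2. spectrogram generation
    waveforms, wave_lengths = vocoder(spec, spec_lengths)  # 3. vocoding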
Interface
^^^^^^^^^
TACOTRON2_WAVERNN_PHONE_LJSPEECH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
.. container:: py attribute
Tacotron2TTSBundle
Tacotron2TTSBundle.TextProcessor
Tacotron2TTSBundle.Vocoder
.. autodata:: TACOTRON2_WAVERNN_PHONE_LJSPEECH
:no-value:
.. rubric:: Tutorials using ``Tacotron2TTSBundle``
.. minigallery:: torchaudio.pipelines.Tacotron2TTSBundle
TACOTRON2_WAVERNN_CHAR_LJSPEECH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pretrained Models
^^^^^^^^^^^^^^^^^
.. container:: py attribute
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
.. autodata:: TACOTRON2_WAVERNN_CHAR_LJSPEECH
:no-value:
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
:no-value:
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. container:: py attribute
.. autodata:: TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
:no-value:
TACOTRON2_WAVERNN_PHONE_LJSPEECH
TACOTRON2_WAVERNN_CHAR_LJSPEECH
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
Source Separation
-----------------
SourceSeparationBundle
~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: SourceSeparationBundle
:members: sample_rate
.. automethod:: get_model
Interface
^^^^^^^^^
CONVTASNET_BASE_LIBRI2MIX
~~~~~~~~~~~~~~~~~~~~~~~~~
``SourceSeparationBundle`` instantiates source separation models which take single-channel audio and generate multi-channel audio.
.. container:: py attribute
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png
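A minimal sketch (the all-zero mixture is a placeholder for real 8 kHz single-channel audio):

import torch
import torchaudio

bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
model = bundle.get_model()

mixture = torch.zeros(1, 1, 3 * bundle.sample_rate)  # placeholder (batch, channel, time)
with torch.inference_mode():
    separated = model(mixture)  # (batch, num_sources, time); two sources for this bundle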
.. autodata:: CONVTASNET_BASE_LIBRI2MIX
:no-value:
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
HDEMUCS_HIGH_MUSDB_PLUS
~~~~~~~~~~~~~~~~~~~~~~~
SourceSeparationBundle
.. container:: py attribute
.. rubric:: Tutorials using ``SourceSeparationBundle``
.. autodata:: HDEMUCS_HIGH_MUSDB_PLUS
:no-value:
.. minigallery:: torchaudio.pipelines.SourceSeparationBundle
HDEMUCS_HIGH_MUSDB
~~~~~~~~~~~~~~~~~~
Pretrained Models
^^^^^^^^^^^^^^^^^
.. container:: py attribute
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
.. autodata:: HDEMUCS_HIGH_MUSDB
:no-value:
CONVTASNET_BASE_LIBRI2MIX
HDEMUCS_HIGH_MUSDB_PLUS
HDEMUCS_HIGH_MUSDB
......@@ -410,3 +410,16 @@
booktitle={Proceedings of the ISMIR 2021 Workshop on Music Source Separation},
year={2021}
}
@article{CATTONI2021101155,
title = {MuST-C: A multilingual corpus for end-to-end speech translation},
journal = {Computer Speech & Language},
volume = {66},
pages = {101155},
year = {2021},
issn = {0885-2308},
doi = {https://doi.org/10.1016/j.csl.2020.101155},
url = {https://www.sciencedirect.com/science/article/pii/S0885230820300887},
author = {Roldano Cattoni and Mattia Antonino {Di Gangi} and Luisa Bentivogli and Matteo Negri and Marco Turchi},
keywords = {Spoken language translation, Multilingual corpus},
abstract = {End-to-end spoken language translation (SLT) has recently gained popularity thanks to the advancement of sequence to sequence learning in its two parent tasks: automatic speech recognition (ASR) and machine translation (MT). However, research in the field has to confront with the scarcity of publicly available corpora to train data-hungry neural networks. Indeed, while traditional cascade solutions can build on sizable ASR and MT training data for a variety of languages, the available SLT corpora suitable for end-to-end training are few, typically small and of limited language coverage. We contribute to fill this gap by presenting MuST-C, a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its unique features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. Besides describing the corpus creation methodology and discussing the outcomes of empirical and manual quality evaluations, we present baseline results computed with strong systems on each language direction covered by MuST-C.}
}
.. py:module:: torchaudio.transforms
torchaudio.transforms
=====================
.. currentmodule:: torchaudio.transforms
:mod:`torchaudio.transforms` module contains common audio processing and feature extraction operations. The following diagram shows the relationship between some of the available transforms.
``torchaudio.transforms`` module contains common audio processing and feature extraction operations. The following diagram shows the relationship between some of the available transforms.
.. image:: https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png
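A small sketch of how some of these transforms compose (the waveform and parameter values below are placeholders):

import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 16000)  # placeholder: one second of audio at 16 kHz

# Spectrogram followed by MelScale is what MelSpectrogram composes in one step
spec = T.Spectrogram(n_fft=1024)(waveform)
mel = T.MelScale(n_mels=64, sample_rate=16000, n_stft=1024 // 2 + 1)(spec)
mel_direct = T.MelSpectrogram(sample_rate=16000, n_fft=1024, n_mels=64)(waveform)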
......
......@@ -72,9 +72,10 @@ from torchaudio.utils import download_asset
# We use the pretrained `Wav2Vec 2.0 <https://arxiv.org/abs/2006.11477>`__
# Base model that is finetuned on 10 min of the `LibriSpeech
# dataset <http://www.openslr.org/12>`__, which can be loaded in using
# :py:func:`torchaudio.pipelines`. For more detail on running Wav2Vec 2.0 speech
# :data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M`.
# For more detail on running Wav2Vec 2.0 speech
# recognition pipelines in torchaudio, please refer to `this
# tutorial <https://pytorch.org/audio/main/tutorials/speech_recognition_pipeline_tutorial.html>`__.
# tutorial <./speech_recognition_pipeline_tutorial.html>`__.
#
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
......@@ -177,7 +178,7 @@ print(tokens)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Pretrained files for the LibriSpeech dataset can be downloaded using
# :py:func:`download_pretrained_files <torchaudio.models.decoder.download_pretrained_files>`.
# :py:func:`~torchaudio.models.decoder.download_pretrained_files`.
#
# Note: this cell may take a couple of minutes to run, as the language
# model can be large
......@@ -202,7 +203,7 @@ print(files)
# Beam Search Decoder
# ~~~~~~~~~~~~~~~~~~~
# The decoder can be constructed using the factory function
# :py:func:`ctc_decoder <torchaudio.models.decoder.ctc_decoder>`.
# :py:func:`~torchaudio.models.decoder.ctc_decoder`.
# In addition to the previously mentioned components, it also takes in various beam
# search decoding parameters and token/word parameters.
#
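# A rough sketch of that construction (the parameter values below are
# illustrative, and ``files`` is assumed to be the object returned by
# ``download_pretrained_files`` above):

from torchaudio.models.decoder import ctc_decoder

beam_search_decoder = ctc_decoder(
    lexicon=files.lexicon,  # lexicon mapping words to token sequences
    tokens=files.tokens,  # acoustic model tokens
    lm=files.lm,  # KenLM language model
    nbest=3,
    beam_size=1500,
    lm_weight=3.23,
    word_score=-0.26,
)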
......@@ -262,7 +263,7 @@ greedy_decoder = GreedyCTCDecoder(tokens)
#
# Now that we have the data, acoustic model, and decoder, we can perform
# inference. The output of the beam search decoder is of type
# :py:func:`torchaudio.models.decoder.CTCHypothesis`, consisting of the
# :py:class:`~torchaudio.models.decoder.CTCHypothesis`, consisting of the
# predicted token IDs, corresponding words (if a lexicon is provided), hypothesis score,
# and timesteps corresponding to the token IDs. Recall the transcript corresponding to the
# waveform is
......@@ -307,7 +308,8 @@ print(f"WER: {beam_search_wer}")
######################################################################
# .. note::
#
# The ``words`` field of the output hypotheses will be empty if no lexicon
# The :py:attr:`~torchaudio.models.decoder.CTCHypothesis.words`
# field of the output hypotheses will be empty if no lexicon
# is provided to the decoder. To retrieve a transcript with lexicon-free
# decoding, you can perform the following to retrieve the token indices,
# convert them to original tokens, then join them together.
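# A minimal sketch of that recovery (assuming ``beam_search_decoder`` and its
# output ``beam_search_result`` are in scope, with "|" as the word-boundary token):

tokens_str = "".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))
transcript = " ".join(tokens_str.split("|"))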
......
......@@ -74,9 +74,9 @@ except ModuleNotFoundError:
# -------------------------
#
# Pre-trained model weights and related pipeline components are
# bundled as :py:func:`torchaudio.pipelines.RNNTBundle`.
# bundled as :py:class:`torchaudio.pipelines.RNNTBundle`.
#
# We use :py:func:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`,
# We use :py:data:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`,
# which is an Emformer RNN-T model trained on the LibriSpeech dataset.
#
......@@ -112,7 +112,7 @@ print(f"Right context: {context_length} frames ({context_length / sample_rate} s
# 4. Configure the audio stream
# -----------------------------
#
# Next, we configure the input audio stream using :py:func:`~torchaudio.io.StreamReader`.
# Next, we configure the input audio stream using :py:class:`torchaudio.io.StreamReader`.
#
# For the detail of this API, please refer to the
# `Media Stream API tutorial <./streaming_api_tutorial.html>`__.
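# A hedged sketch of that configuration (the source path below is a placeholder,
# and ``bundle``, ``segment_length``, and ``hop_length`` are assumed to be in scope
# from the earlier steps); each chunk covers one model segment:

from torchaudio.io import StreamReader

streamer = StreamReader(src="input.wav")
streamer.add_basic_audio_stream(
    frames_per_chunk=segment_length * hop_length,  # samples per model segment
    sample_rate=bundle.sample_rate,
)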
......
......@@ -26,7 +26,7 @@ pre-trained models from wav2vec 2.0
# Torchaudio provides easy access to the pre-trained weights and
# associated information, such as the expected sample rate and class
# labels. They are bundled together and available under
# :py:func:`torchaudio.pipelines` module.
# :py:mod:`torchaudio.pipelines` module.
#
......@@ -34,36 +34,26 @@ pre-trained models from wav2vec 2.0
# Preparation
# -----------
#
# First we import the necessary packages, and fetch data that we work on.
#
# %matplotlib inline
import os
import IPython
import matplotlib
import matplotlib.pyplot as plt
import requests
import torch
import torchaudio
matplotlib.rcParams["figure.figsize"] = [16.0, 4.8]
print(torch.__version__)
print(torchaudio.__version__)
torch.random.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.__version__)
print(torchaudio.__version__)
print(device)
SPEECH_URL = "https://pytorch-tutorial-assets.s3.amazonaws.com/VOiCES_devkit/source-16k/train/sp0307/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" # noqa: E501
SPEECH_FILE = "_assets/speech.wav"
######################################################################
#
import IPython
import matplotlib.pyplot as plt
from torchaudio.utils import download_asset
if not os.path.exists(SPEECH_FILE):
os.makedirs("_assets", exist_ok=True)
with open(SPEECH_FILE, "wb") as file:
file.write(requests.get(SPEECH_URL).content)
SPEECH_FILE = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
######################################################################
......@@ -85,11 +75,10 @@ if not os.path.exists(SPEECH_FILE):
# for other downstream tasks as well, but this tutorial does not
# cover that.
#
# We will use :py:func:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H` here.
# We will use :py:data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H` here.
#
# There are multiple models available as
# :py:mod:`torchaudio.pipelines`. Please check the documentation for
# the detail of how they are trained.
# There are multiple pre-trained models available in :py:mod:`torchaudio.pipelines`.
# Please check the documentation for the detail of how they are trained.
#
# The bundle object provides the interface to instantiate the model and other
# information. The sampling rate and the class labels are found as follows.
......@@ -134,7 +123,7 @@ IPython.display.Audio(SPEECH_FILE)
#
# - :py:func:`torchaudio.functional.resample` works on CUDA tensors as well.
# - When performing resampling multiple times on the same set of sample rates,
# using :py:func:`torchaudio.transforms.Resample` might improve the performance.
# using :py:class:`torchaudio.transforms.Resample` might improve the performance.
#
waveform, sample_rate = torchaudio.load(SPEECH_FILE)
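# A brief sketch of the two resampling options from the note above (only needed
# when ``sample_rate`` differs from ``bundle.sample_rate``):

import torchaudio.functional as F
import torchaudio.transforms as T

resampled = F.resample(waveform, sample_rate, bundle.sample_rate)  # functional, one-off
resampler = T.Resample(sample_rate, bundle.sample_rate)  # transform object, caches the kernel
resampled = resampler(waveform)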
......@@ -167,7 +156,7 @@ with torch.inference_mode():
fig, ax = plt.subplots(len(features), 1, figsize=(16, 4.3 * len(features)))
for i, feats in enumerate(features):
ax[i].imshow(feats[0].cpu())
ax[i].imshow(feats[0].cpu(), interpolation="nearest")
ax[i].set_title(f"Feature from transformer layer {i+1}")
ax[i].set_xlabel("Feature dimension")
ax[i].set_ylabel("Frame (time-axis)")
......@@ -197,7 +186,7 @@ with torch.inference_mode():
# Let’s visualize this.
#
plt.imshow(emission[0].cpu().T)
plt.imshow(emission[0].cpu().T, interpolation="nearest")
plt.title("Classification result")
plt.xlabel("Frame (time-axis)")
plt.ylabel("Class")
......@@ -291,7 +280,7 @@ IPython.display.Audio(SPEECH_FILE)
# Conclusion
# ----------
#
# In this tutorial, we looked at how to use :py:mod:`torchaudio.pipelines` to
# In this tutorial, we looked at how to use :py:class:`~torchaudio.pipelines.Wav2Vec2ASRBundle` to
# perform acoustic feature extraction and speech recognition. Constructing
# a model and getting the emission is as short as two lines.
#
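# For reference, a minimal sketch of those two lines (reusing ``waveform`` from above):

model = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H.get_model()
emission, _ = model(waveform)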
......
......@@ -45,7 +45,7 @@ import matplotlib.pyplot as plt
#
# .. image:: https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png
#
# All the related components are bundled in :py:func:`torchaudio.pipelines.Tacotron2TTSBundle`,
# All the related components are bundled in :py:class:`torchaudio.pipelines.Tacotron2TTSBundle`,
# but this tutorial will also cover the process under the hood.
######################################################################
......@@ -196,10 +196,11 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
# :py:func:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching
# :py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching
# models and processors together so that it is easy to create the pipeline.
#
# For the available bundles, and their usage, please refer to :py:mod:`torchaudio.pipelines`.
# For the available bundles, and their usage, please refer to
# :py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.
#
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
......@@ -271,8 +272,7 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())
torchaudio.save("_assets/output_wavernn.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
IPython.display.Audio("_assets/output_wavernn.wav")
IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)
######################################################################
......@@ -280,7 +280,9 @@ IPython.display.Audio("_assets/output_wavernn.wav")
# ~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as with WaveRNN. You can instantiate
# the vocoder object with ``get_vocoder`` method and pass the spectrogram.
# the vocoder object with the
# :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
# method and pass the spectrogram.
#
bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
......@@ -300,12 +302,7 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())
torchaudio.save(
"_assets/output_griffinlim.wav",
waveforms[0:1].cpu(),
sample_rate=vocoder.sample_rate,
)
IPython.display.Audio("_assets/output_griffinlim.wav")
IPython.display.Audio(waveforms[0:1].cpu(), rate=vocoder.sample_rate)
######################################################################
......@@ -344,5 +341,4 @@ fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())
torchaudio.save("_assets/output_waveglow.wav", waveforms[0:1].cpu(), sample_rate=22050)
IPython.display.Audio("_assets/output_waveglow.wav")
IPython.display.Audio(waveforms[0:1].cpu(), rate=22050)
......@@ -10,9 +10,7 @@ from torchaudio.models import conv_tasnet_base, hdemucs_high
@dataclass
class SourceSeparationBundle:
"""torchaudio.pipelines.SourceSeparationBundle()
Dataclass that bundles components for performing source separation.
"""Dataclass that bundles components for performing source separation.
Example
>>> import torchaudio
......@@ -66,16 +64,16 @@ CONVTASNET_BASE_LIBRI2MIX = SourceSeparationBundle(
_model_factory_func=partial(conv_tasnet_base, num_sources=2),
_sample_rate=8000,
)
CONVTASNET_BASE_LIBRI2MIX.__doc__ = """Pre-trained Source Separation pipeline with *ConvTasNet* :cite:`Luo_2019` trained on
*Libri2Mix dataset* :cite:`cosentino2020librimix`.
CONVTASNET_BASE_LIBRI2MIX.__doc__ = """Pre-trained Source Separation pipeline with *ConvTasNet*
:cite:`Luo_2019` trained on *Libri2Mix dataset* :cite:`cosentino2020librimix`.
The source separation model is constructed by :py:func:`torchaudio.models.conv_tasnet_base`
and is trained using the training script ``lightning_train.py``
`here <https://github.com/pytorch/audio/tree/release/0.12/examples/source_separation/>`__
with default arguments.
The source separation model is constructed by :func:`~torchaudio.models.conv_tasnet_base`
and is trained using the training script ``lightning_train.py``
`here <https://github.com/pytorch/audio/tree/release/0.12/examples/source_separation/>`__
with default arguments.
Please refer to :py:class:`SourceSeparationBundle` for usage instructions.
"""
Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""
HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
......@@ -83,14 +81,16 @@ HDEMUCS_HIGH_MUSDB_PLUS = SourceSeparationBundle(
_model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
_sample_rate=44100,
)
HDEMUCS_HIGH_MUSDB_PLUS.__doc__ = """Pre-trained *Hybrid Demucs* :cite:`defossez2021hybrid` pipeline for music
source separation trained on MUSDB-HQ :cite:`MUSDB18HQ` and additional internal training data.
HDEMUCS_HIGH_MUSDB_PLUS.__doc__ = """Pre-trained music source separation pipeline with
*Hybrid Demucs* :cite:`defossez2021hybrid` trained on MUSDB-HQ :cite:`MUSDB18HQ`
and additional internal training data.
The model is constructed by :py:func:`torchaudio.prototype.models.hdemucs_high`.
Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.
The model is constructed by :func:`~torchaudio.models.hdemucs_high`.
Please refer to :py:class:`SourceSeparationBundle` for usage instructions.
"""
Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.
Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""
HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
......@@ -98,11 +98,11 @@ HDEMUCS_HIGH_MUSDB = SourceSeparationBundle(
_model_factory_func=partial(hdemucs_high, sources=["drums", "bass", "other", "vocals"]),
_sample_rate=44100,
)
HDEMUCS_HIGH_MUSDB.__doc__ = """Pre-trained *Hybrid Demucs* :cite:`defossez2021hybrid` pipeline for music
source separation trained on MUSDB-HQ :cite:`MUSDB18HQ`.
HDEMUCS_HIGH_MUSDB.__doc__ = """Pre-trained music source separation pipeline with
*Hybrid Demucs* :cite:`defossez2021hybrid` trained on MUSDB-HQ :cite:`MUSDB18HQ`.
The model is constructed by :py:func:`torchaudio.prototype.models.hdemucs_high`.
Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.
The model is constructed by :func:`~torchaudio.models.hdemucs_high`.
Training was performed in the original HDemucs repository `here <https://github.com/facebookresearch/demucs/>`__.
Please refer to :py:class:`SourceSeparationBundle` for usage instructions.
"""
Please refer to :class:`SourceSeparationBundle` for usage instructions.
"""
......@@ -213,17 +213,14 @@ TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH = _Tacotron2GriffinLimCharBundle(
_tacotron2_path="tacotron2_english_characters_1500_epochs_ljspeech.pth",
_tacotron2_params=utils._get_taco_params(n_symbols=38),
)
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`torchaudio.models.Tacotron2` and
:py:class:`torchaudio.transforms.GriffinLim`.
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.transforms.GriffinLim` as vocoder.
The text processor encodes the input texts character-by-character.
Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The default parameters were used.
The vocoder is based on :py:class:`torchaudio.transforms.GriffinLim`.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
Example - "Hello world! T T S stands for Text to Speech!"
......@@ -255,8 +252,8 @@ TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH = _Tacotron2GriffinLimPhoneBundle(
_tacotron2_path="tacotron2_english_phonemes_1500_epochs_ljspeech.pth",
_tacotron2_params=utils._get_taco_params(n_symbols=96),
)
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`torchaudio.models.Tacotron2` and
:py:class:`torchaudio.transforms.GriffinLim`.
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs and
:py:class:`~torchaudio.transforms.GriffinLim` as vocoder.
The text processor encodes the input texts based on phoneme.
It uses `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__ to convert
......@@ -264,12 +261,9 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The text processor is set to the *"english_phonemes"*.
The vocoder is based on :py:class:`torchaudio.transforms.GriffinLim`.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
Example - "Hello world! T T S stands for Text to Speech!"
......@@ -304,18 +298,14 @@ TACOTRON2_WAVERNN_CHAR_LJSPEECH = _Tacotron2WaveRNNCharBundle(
_wavernn_path="wavernn_10k_epochs_8bits_ljspeech.pth",
_wavernn_params=utils._get_wrnn_params(),
)
TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`torchaudio.models.Tacotron2` and
:py:class:`torchaudio.models.WaveRNN`.
TACOTRON2_WAVERNN_CHAR_LJSPEECH.__doc__ = """Character-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs and :py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
The text processor encodes the input texts character-by-character.
Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocoder is based on :py:class:`torchaudio.models.WaveRNN`.
It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
......@@ -351,8 +341,8 @@ TACOTRON2_WAVERNN_PHONE_LJSPEECH = _Tacotron2WaveRNNPhoneBundle(
_wavernn_path="wavernn_10k_epochs_8bits_ljspeech.pth",
_wavernn_params=utils._get_wrnn_params(),
)
TACOTRON2_WAVERNN_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`torchaudio.models.Tacotron2` and
:py:class:`torchaudio.models.WaveRNN`.
TACOTRON2_WAVERNN_PHONE_LJSPEECH.__doc__ = """Phoneme-based TTS pipeline with :py:class:`~torchaudio.models.Tacotron2` trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs, and
:py:class:`~torchaudio.models.WaveRNN` vocoder trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
The text processor encodes the input texts based on phoneme.
It uses `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__ to convert
......@@ -360,14 +350,11 @@ graphemes to phonemes.
The model (*en_us_cmudict_forward*) was trained on
`CMUDict <http://www.speech.cs.cmu.edu/cgi-bin/cmudict>`__.
Tacotron2 was trained on *LJSpeech* :cite:`ljspeech17` for 1,500 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
You can find the training script for Tacotron2 `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_tacotron2>`__.
The following parameters were used; ``win_length=1100``, ``hop_length=275``, ``n_fft=2048``,
``mel_fmin=40``, and ``mel_fmax=11025``.
The vocoder is based on :py:class:`torchaudio.models.WaveRNN`.
It was trained on 8 bits depth waveform of *LJSpeech* :cite:`ljspeech17` for 10,000 epochs.
You can find the training script `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
You can find the training script for WaveRNN `here <https://github.com/pytorch/audio/tree/main/examples/pipeline_wavernn>`__.
Please refer to :func:`torchaudio.pipelines.Tacotron2TTSBundle` for the usage.
......
......@@ -11,8 +11,6 @@ class _TextProcessor(ABC):
def tokens(self):
"""The tokens that the each value in the processed tensor represent.
See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor` for the usage.
:type: List[str]
"""
......@@ -20,8 +18,6 @@ class _TextProcessor(ABC):
def __call__(self, texts: Union[str, List[str]]) -> Tuple[Tensor, Tensor]:
"""Encode the given (batch of) texts into numerical tensors
See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor` for the usage.
Args:
text (str or list of str): The input texts.
......@@ -40,8 +36,6 @@ class _Vocoder(ABC):
def sample_rate(self):
"""The sample rate of the resulting waveform
See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder` for the usage.
:type: float
"""
......@@ -49,8 +43,6 @@ class _Vocoder(ABC):
def __call__(self, specgrams: Tensor, lengths: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:
"""Generate waveform from the given input, such as spectrogram
See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder` for the usage.
Args:
specgrams (Tensor):
The input spectrogram. Shape: `(batch, frequency bins, time)`.
......@@ -149,22 +141,19 @@ class Tacotron2TTSBundle(ABC):
# The thing is, text processing and vocoder are generic and we do not know what kind of
# new text processing and vocoder will be added in the future, so we want to make these
# interfaces specific to this Tacotron2TTS pipeline.
class TextProcessor(_TextProcessor):
"""Interface of the text processing part of Tacotron2TTS pipeline
See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor` for the usage.
"""
pass
class Vocoder(_Vocoder):
"""Interface of the vocoder part of Tacotron2TTS pipeline
See :func:`torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder` for the usage.
"""
pass
@abstractmethod
def get_text_processor(self, *, dl_kwargs=None) -> TextProcessor:
"""Create a text processor
......@@ -181,7 +170,7 @@ class Tacotron2TTSBundle(ABC):
Passed to :func:`torch.hub.download_url_to_file`.
Returns:
TTSTextProcessor:
TextProcessor:
A callable which takes a string or a list of strings as input and
returns Tensor of encoded texts and Tensor of valid lengths.
The object also has a ``tokens`` property, which allows recovering the
......@@ -246,7 +235,7 @@ class Tacotron2TTSBundle(ABC):
Passed to :func:`torch.hub.load_state_dict_from_url`.
Returns:
Callable[[Tensor, Optional[Tensor]], Tuple[Tensor, Optional[Tensor]]]:
Vocoder:
A vocoder module, which takes spectrogram Tensor and an optional
length Tensor, then returns resulting waveform Tensor and an optional
length Tensor.
......
......@@ -151,9 +151,7 @@ class _SentencePieceTokenProcessor(_TokenProcessor):
@dataclass
class RNNTBundle:
"""torchaudio.pipelines.RNNTBundle()
Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text)
"""Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text)
inference with an RNN-T model.
More specifically, the class provides methods that produce the featurization pipeline,
......@@ -165,7 +163,7 @@ class RNNTBundle:
Users should not directly instantiate objects of this class; rather, users should use the
instances (representing pre-trained models) that exist within the module,
e.g. :py:obj:`EMFORMER_RNNT_BASE_LIBRISPEECH`.
e.g. :data:`torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH`.
Example
>>> import torchaudio
......@@ -226,10 +224,10 @@ class RNNTBundle:
"""
class FeatureExtractor(_FeatureExtractor):
pass
"""Interface of the feature extraction part of RNN-T pipeline"""
class TokenProcessor(_TokenProcessor):
pass
"""Interface of the token processor part of RNN-T pipeline"""
_rnnt_path: str
_rnnt_factory_func: Callable[[], RNNT]
......@@ -370,11 +368,13 @@ EMFORMER_RNNT_BASE_LIBRISPEECH = RNNTBundle(
_segment_length=16,
_right_context_length=4,
)
EMFORMER_RNNT_BASE_LIBRISPEECH.__doc__ = """Pre-trained Emformer-RNNT-based ASR pipeline capable of performing both streaming and non-streaming inference.
EMFORMER_RNNT_BASE_LIBRISPEECH.__doc__ = """ASR pipeline based on Emformer-RNNT,
pretrained on *LibriSpeech* dataset :cite:`7178964`,
capable of performing both streaming and non-streaming inference.
The underlying model is constructed by :py:func:`torchaudio.models.emformer_rnnt_base`
and utilizes weights trained on LibriSpeech using training script ``train.py``
`here <https://github.com/pytorch/audio/tree/main/examples/asr/emformer_rnnt>`__ with default arguments.
The underlying model is constructed by :py:func:`torchaudio.models.emformer_rnnt_base`
and utilizes weights trained on LibriSpeech using training script ``train.py``
`here <https://github.com/pytorch/audio/tree/main/examples/asr/emformer_rnnt>`__ with default arguments.
Please refer to :py:class:`RNNTBundle` for usage instructions.
"""
Please refer to :py:class:`RNNTBundle` for usage instructions.
"""