Unverified Commit 17a70815 authored by mikeboensel, committed by GitHub

Update tacotron2_pipeline_tutorial.py (#3759)

* Update tacotron2_pipeline_tutorial.py

- Fixed typo
- Clarified what was being done in different sections
parent 1bc1479c
...@@ -23,13 +23,13 @@ Text-to-Speech with Tacotron2
#
# 2. Spectrogram generation
#
# From the encoded text, a spectrogram is generated. We use the
# ``Tacotron2`` model for this.
#
# 3. Time-domain conversion
#
# The last step is converting the spectrogram into the waveform. The
# process of generating speech from a spectrogram is called vocoding,
# and the model that performs it is called a vocoder.
# In this tutorial, three different vocoders are used,
# :py:class:`~torchaudio.models.WaveRNN`,
# :py:class:`~torchaudio.transforms.GriffinLim`, and
...@@ -90,17 +90,13 @@ import matplotlib.pyplot as plt
# works.
#
# Since the pre-trained Tacotron2 model expects a specific set of symbol
# tables, ``torchaudio`` provides the matching functionality. However,
# we will first implement the encoding manually to aid understanding.
#
# First, we define the set of symbols
# ``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map each
# character of the input text to the index of the corresponding
# symbol in the table. Symbols that are not in the table are ignored.
#
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
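# The mapping function itself is elided in this diff; a minimal sketch
# consistent with the surrounding code (assuming lower-casing and
# skipping characters that are not in the table) might look like this:


def text_to_sequence(text):
    # Lower-case the input and keep only characters present in the table.
    text = text.lower()
    return [look_up[s] for s in text if s in look_up]


# Example usage with an arbitrary input:
text = "Hello world! Text to speech!"
print(text_to_sequence(text))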
...@@ -118,8 +114,8 @@ print(text_to_sequence(text))
######################################################################
# As mentioned above, the symbol table and indices must match
# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the
# same transform along with the pretrained model. You can
# instantiate and use the transform as follows:
#
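# A minimal sketch of what the elided code presumably does, assuming the
# character-based bundle ``TACOTRON2_WAVERNN_CHAR_LJSPEECH`` (any
# Tacotron2 bundle exposes the same ``get_text_processor`` method):

import torchaudio

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)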
...@@ -133,12 +129,12 @@ print(lengths)
######################################################################
# Note: the output of our manual encoding matches the output of the
# ``torchaudio`` ``text_processor`` (meaning we correctly re-implemented
# what the library does internally). The processor takes either a text
# or a list of texts as input.
# When a list of texts is provided, the returned ``lengths`` variable
# represents the valid length of each processed token sequence in the
# output batch.
#
# The intermediate representation can be retrieved as follows:
#
print([processor.tokens[i] for i in processed[0, : lengths[0]]])
...@@ -152,7 +148,7 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
# model.
#
# The details of the G2P model are out of the scope of this tutorial; we
# will just look at what the conversion looks like.
#
# Similar to the case of character-based encoding, the encoding process is
...@@ -195,7 +191,7 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# encoded text. For the details of the model, please refer to `the
# paper <https://arxiv.org/abs/1712.05884>`__.
#
# It is easy to instantiate a Tacotron2 model with pretrained weights;
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
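# A sketch of this step, assuming the same character-based bundle as in
# the sketch above (the tutorial's exact code is elided here):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")
    processed, lengths = processed.to(device), lengths.to(device)
    # infer() returns the spectrogram, its lengths, and attention weights.
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)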
...@@ -224,7 +220,7 @@ _ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
######################################################################
# Note that the ``Tacotron2.infer`` method performs multinomial sampling;
# therefore, the process of generating the spectrogram incurs randomness.
#
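# To make a run reproducible, you can fix the random seed before calling
# ``infer`` (standard PyTorch practice, not specific to this tutorial):

torch.manual_seed(0)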
...@@ -245,7 +241,7 @@ plot()
# -------------------
#
# Once the spectrogram is generated, the last step is to recover the
# waveform from the spectrogram using a vocoder.
#
# ``torchaudio`` provides vocoders based on ``GriffinLim`` and
# ``WaveRNN``.
...@@ -253,8 +249,8 @@ plot()
######################################################################
# WaveRNN Vocoder
# ~~~~~~~~~~~~~~~
#
# Continuing from the previous section, we can instantiate the matching
# WaveRNN model from the same bundle.
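# A sketch, assuming the ``bundle``, ``spec``, and ``spec_lengths`` from
# the Tacotron2 sketch above:

vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    # The WaveRNN vocoder consumes the spectrogram batch and its lengths.
    waveforms, waveform_lengths = vocoder(spec, spec_lengths)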
...@@ -294,11 +290,11 @@ plot(waveforms, spec, vocoder.sample_rate)
######################################################################
# Griffin-Lim Vocoder
# ~~~~~~~~~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as WaveRNN. You can
# instantiate the vocoder object with the
# :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
# method and pass the spectrogram.
#
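# A sketch, assuming the character-based Griffin-Lim bundle
# ``TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH`` (the processor, Tacotron2 model,
# and vocoder must all come from the same bundle):

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")
    spec, spec_lengths, _ = tacotron2.infer(processed.to(device), lengths.to(device))
    waveforms, waveform_lengths = vocoder(spec, spec_lengths)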
...@@ -323,8 +319,8 @@ plot(waveforms, spec, vocoder.sample_rate)
######################################################################
# Waveglow Vocoder
# ~~~~~~~~~~~~~~~~
#
# Waveglow is a vocoder published by Nvidia. The pretrained weights are
# published on Torch Hub. One can instantiate the model using ``torch.hub``
......
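# A sketch based on NVIDIA's published Torch Hub entry point; the exact
# arguments used by the tutorial are elided here, so treat the details
# below (e.g. ``model_math``) as assumptions:

waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device).eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)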