Unverified Commit 17a70815 authored by mikeboensel, committed by GitHub

Update tacotron2_pipeline_tutorial.py (#3759)

* Update tacotron2_pipeline_tutorial.py

- Fixed typo
- Clarified what was being done in different sections
parent 1bc1479c
...@@ -23,13 +23,13 @@ Text-to-Speech with Tacotron2
#
# 2. Spectrogram generation
#
# From the encoded text, a spectrogram is generated. We use the
# ``Tacotron2`` model for this.
#
# 3. Time-domain conversion
#
# The last step is converting the spectrogram into the waveform. The
# process of generating speech from a spectrogram is called vocoding,
# and the model that performs it is called a vocoder.
# In this tutorial, three different vocoders are used,
# :py:class:`~torchaudio.models.WaveRNN`,
# :py:class:`~torchaudio.transforms.GriffinLim`, and
...@@ -90,17 +90,13 @@ import matplotlib.pyplot as plt
# works.
#
# Since the pre-trained Tacotron2 model expects a specific set of symbol
# tables, ``torchaudio`` provides the matching functionality. However,
# we will first implement the encoding manually to aid understanding.
#
# First, we define the set of symbols
# ``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map each
# character of the input text to the index of the corresponding
# symbol in the table. Symbols that are not in the table are ignored.
#
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
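# The mapping function itself is elided in this diff; a minimal sketch
# consistent with the surrounding code (assuming lower-casing and
# skipping characters that are not in the table) might look like this:


def text_to_sequence(text):
    # Lower-case the input and keep only characters present in the table.
    text = text.lower()
    return [look_up[s] for s in text if s in look_up]


# Example usage with an arbitrary input:
text = "Hello world! Text to speech!"
print(text_to_sequence(text))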
...@@ -118,8 +114,8 @@ print(text_to_sequence(text))
######################################################################
# As mentioned above, the symbol table and indices must match
# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the
# same transform along with the pretrained model. You can
# instantiate and use the transform as follows:
#
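# A minimal sketch of what the elided code presumably does, assuming the
# character-based bundle ``TACOTRON2_WAVERNN_CHAR_LJSPEECH`` (any
# Tacotron2 bundle exposes the same ``get_text_processor`` method):

import torchaudio

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)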
...@@ -133,12 +129,12 @@ print(lengths)
######################################################################
# Note: the output of our manual encoding matches the output of the
# ``torchaudio`` ``text_processor`` (meaning we correctly re-implemented
# what the library does internally). The processor takes either a text
# or a list of texts as input.
# When a list of texts is provided, the returned ``lengths`` variable
# represents the valid length of each processed token sequence in the
# output batch.
#
# The intermediate representation can be retrieved as follows:
#
print([processor.tokens[i] for i in processed[0, : lengths[0]]])
...@@ -152,7 +148,7 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
# model.
#
# The details of the G2P model are out of the scope of this tutorial; we
# will just look at what the conversion looks like.
#
# Similar to the case of character-based encoding, the encoding process is
...@@ -195,7 +191,7 @@ print([processor.tokens[i] for i in processed[0, : lengths[0]]])
# encoded text. For the details of the model, please refer to `the
# paper <https://arxiv.org/abs/1712.05884>`__.
#
# It is easy to instantiate a Tacotron2 model with pretrained weights;
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
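# A sketch of this step, assuming the same character-based bundle as in
# the sketch above (the tutorial's exact code is elided here):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")
    processed, lengths = processed.to(device), lengths.to(device)
    # infer() returns the spectrogram, its lengths, and attention weights.
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)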
...@@ -224,7 +220,7 @@ _ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
######################################################################
# Note that the ``Tacotron2.infer`` method performs multinomial sampling;
# therefore, the process of generating the spectrogram incurs randomness.
#
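# To make a run reproducible, you can fix the random seed before calling
# ``infer`` (standard PyTorch practice, not specific to this tutorial):

torch.manual_seed(0)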
...@@ -245,7 +241,7 @@ plot()
# -------------------
#
# Once the spectrogram is generated, the last step is to recover the
# waveform from the spectrogram using a vocoder.
#
# ``torchaudio`` provides vocoders based on ``GriffinLim`` and
# ``WaveRNN``.
...@@ -253,8 +249,8 @@ plot()
######################################################################
# WaveRNN Vocoder
# ~~~~~~~~~~~~~~~
#
# Continuing from the previous section, we can instantiate the matching
# WaveRNN model from the same bundle.
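# A sketch, assuming the ``bundle``, ``spec``, and ``spec_lengths`` from
# the Tacotron2 sketch above:

vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    # The WaveRNN vocoder consumes the spectrogram batch and its lengths.
    waveforms, waveform_lengths = vocoder(spec, spec_lengths)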
...@@ -294,11 +290,11 @@ plot(waveforms, spec, vocoder.sample_rate)
######################################################################
# Griffin-Lim Vocoder
# ~~~~~~~~~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as WaveRNN. You can
# instantiate the vocoder object with the
# :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
# method and pass the spectrogram.
#
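# A sketch, assuming the character-based Griffin-Lim bundle
# ``TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH`` (the processor, Tacotron2 model,
# and vocoder must all come from the same bundle):

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")
    spec, spec_lengths, _ = tacotron2.infer(processed.to(device), lengths.to(device))
    waveforms, waveform_lengths = vocoder(spec, spec_lengths)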
...@@ -323,8 +319,8 @@ plot(waveforms, spec, vocoder.sample_rate)
######################################################################
# Waveglow Vocoder
# ~~~~~~~~~~~~~~~~
#
# Waveglow is a vocoder published by Nvidia. The pretrained weights are
# published on Torch Hub. One can instantiate the model using ``torch.hub``
......
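# A sketch based on NVIDIA's published Torch Hub entry point; the exact
# arguments used by the tutorial are elided here, so treat the details
# below (e.g. ``model_math``) as assumptions:

waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device).eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)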