Commit b3c2cfce, authored Nov 04, 2021 by moto; committed by GitHub, Nov 04, 2021

Port TTS tutorial (#1973)
Parent: b7625f2a

Showing 4 changed files with 340 additions and 0 deletions:
- docs/source/index.rst (+1, −0)
- docs/source/pipelines.rst (+4, −0)
- examples/gallery/tts/README.rst (+2, −0)
- examples/gallery/tts/tacotron2_pipeline_tutorial.py (+333, −0)
docs/source/index.rst
@@ -47,6 +47,7 @@ The :mod:`torchaudio` package consists of I/O, popular datasets and common audio
    :caption: Tutorials

    auto_examples/wav2vec2/index
+   auto_examples/tts/index

 .. toctree::
    :maxdepth: 1
docs/source/pipelines.rst
@@ -231,6 +231,10 @@ Tacotron2TTSBundle
    .. automethod:: get_vocoder

+.. minigallery:: torchaudio.pipelines.Tacotron2TTSBundle
+   :add-heading: Examples using ``Tacotron2TTSBundle``
+   :heading-level: ~
+
 Tacotron2TTSBundle - TextProcessor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/gallery/tts/README.rst (new file, mode 100644)
Text-to-Speech
==============
examples/gallery/tts/tacotron2_pipeline_tutorial.py (new file, mode 100644)
"""
Text-to-speech with torchaudio Tacotron2
========================================
**Author** `Yao-Yuan Yang <https://github.com/yangarbiter>`__,
`Moto Hira <moto@fb.com>`__
"""
######################################################################
# Overview
# --------
#
# This tutorial shows how to build a text-to-speech pipeline, using the
# pretrained Tacotron2 in torchaudio.
#
# The text-to-speech pipeline goes as follows:
#
# 1. Text preprocessing
#
# First, the input text is encoded into a list of symbols. In this
# tutorial, we will use English characters and phonemes as the symbols.
#
# 2. Spectrogram generation
#
# From the encoded text, a spectrogram is generated. We use the
# ``Tacotron2`` model for this.
#
# 3. Time-domain conversion
#
# The last step is converting the spectrogram into the waveform. A model
# that generates speech from a spectrogram is called a vocoder.
# In this tutorial, three different vocoders are used:
# `WaveRNN <https://pytorch.org/audio/stable/models/wavernn.html>`__,
# `Griffin-Lim <https://pytorch.org/audio/stable/transforms.html#griffinlim>`__,
# and
# `Nvidia's WaveGlow <https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/>`__.
#
#
# The following figure illustrates the whole process.
#
# .. image:: https://download.pytorch.org/torchaudio/tutorial-assets/tacotron2_tts_pipeline.png
#
# All the related components are bundled in :py:class:`torchaudio.pipelines.Tacotron2TTSBundle`,
# but this tutorial will also cover the process under the hood.
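#
# A minimal end-to-end sketch, for orientation only (each step is
# expanded with working code in the sections below)::
#
#     bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
#     processor = bundle.get_text_processor()
#     tacotron2 = bundle.get_tacotron2()
#     vocoder = bundle.get_vocoder()
#     with torch.inference_mode():
#         tokens, lengths = processor("Hello world!")
#         spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
#         waveforms, wave_lengths = vocoder(spec, spec_lengths)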
######################################################################
# Preparation
# -----------
#
# First, we install the necessary dependencies. In addition to
# ``torchaudio``, ``DeepPhonemizer`` is required to perform phoneme-based
# encoding.
#
# When running this example in a notebook, install DeepPhonemizer:
# !pip3 install deep_phonemizer
import torch
import torchaudio

import matplotlib
import matplotlib.pyplot as plt
import IPython

matplotlib.rcParams['figure.figsize'] = [16.0, 4.8]

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)
######################################################################
# Text Processing
# ---------------
#
######################################################################
# Character-based encoding
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# In this section, we will go through how the character-based encoding
# works.
#
# Since the pretrained Tacotron2 model expects a specific symbol table,
# the same functionality is available in ``torchaudio``; this section
# explains the basics of the encoding.
#
# First, we define the set of symbols. For example, we can use
# ``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we map each
# character of the input text to the index of the corresponding
# symbol in the table.
#
# The following is an example of such processing. In the example, symbols
# that are not in the table are ignored.
#
symbols = '_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)


def text_to_sequence(text):
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]


text = "Hello world! Text to speech!"
print(text_to_sequence(text))
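######################################################################
# As a quick sanity check (a small sketch; ``reverse_look_up`` is a
# helper introduced here for illustration), mapping the indices back
# through the table recovers the lower-cased text, minus any symbols
# that are not in the table.
#
reverse_look_up = {i: s for s, i in look_up.items()}
print("".join(reverse_look_up[i] for i in text_to_sequence(text)))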
######################################################################
# As mentioned above, the symbol table and indices must match
# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the
# transform along with the pretrained model. You can
# instantiate and use such a transform as follows.
#
processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)
######################################################################
# The ``processor`` object takes either a text or a list of texts as input.
# When a list of texts is provided, the returned ``lengths`` variable
# represents the valid length of each processed token sequence in the
# output batch.
#
# The intermediate representation can be retrieved as follows.
#
print([processor.tokens[i] for i in processed[0, :lengths[0]]])
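######################################################################
# For example, when a list of texts is passed, each row of the output
# is padded to the length of the longest entry, and ``lengths`` reports
# the valid length of each row. (A small illustrative sketch; the
# sentences here are arbitrary.)
#
texts = ["Hello world!", "Text to speech!"]
batch_processed, batch_lengths = processor(texts)
print(batch_processed)
print(batch_lengths)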
######################################################################
# Phoneme-based encoding
# ~~~~~~~~~~~~~~~~~~~~~~
#
# Phoneme-based encoding is similar to character-based encoding, but it
# uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
# model.
#
# The details of the G2P model are out of the scope of this tutorial; we
# will just look at what the conversion looks like.
#
# Similar to the case of character-based encoding, the encoding process is
# expected to match what a pretrained Tacotron2 model is trained on.
# ``torchaudio`` has an interface to create the process.
#
# The following code illustrates how to make and use the process. Behind
# the scenes, a G2P model is created using the ``DeepPhonemizer`` package,
# and the pretrained weights published by the author of ``DeepPhonemizer``
# are fetched.
#
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)

print(processed)
print(lengths)
######################################################################
# Notice that the encoded values are different from the example of
# character-based encoding.
#
# The intermediate representation looks like the following.
#
print([processor.tokens[i] for i in processed[0, :lengths[0]]])
######################################################################
# Spectrogram Generation
# ----------------------
#
# ``Tacotron2`` is the model we use to generate a spectrogram from the
# encoded text. For the details of the model, please refer to `the
# paper <https://arxiv.org/abs/1712.05884>`__.
#
# It is easy to instantiate a Tacotron2 model with pretrained weights;
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
# :py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching
# models and processors together so that it is easy to create the pipeline.
#
# For the available bundles and their usage, please refer to :py:mod:`torchaudio.pipelines`.
#
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)

plt.imshow(spec[0].cpu().detach())
######################################################################
# Note that the ``Tacotron2.infer`` method performs multinomial sampling;
# therefore, the process of generating the spectrogram incurs randomness.
#
fig, ax = plt.subplots(3, 1, figsize=(16, 4.3 * 3))
for i in range(3):
    with torch.inference_mode():
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    print(spec[0].shape)
    ax[i].imshow(spec[0].cpu().detach())
plt.show()
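######################################################################
# If a repeatable spectrogram is needed, one option (a sketch, assuming
# the global RNG is the only source of randomness in ``infer``) is to
# re-seed the generator before each call.
#
torch.random.manual_seed(0)
with torch.inference_mode():
    spec_a, _, _ = tacotron2.infer(processed, lengths)
torch.random.manual_seed(0)
with torch.inference_mode():
    spec_b, _, _ = tacotron2.infer(processed, lengths)
print(torch.equal(spec_a, spec_b))  # expected: True under the assumption above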
######################################################################
# Waveform Generation
# -------------------
#
# Once the spectrogram is generated, the last step is to recover the
# waveform from it.
#
# ``torchaudio`` provides vocoders based on ``GriffinLim`` and
# ``WaveRNN``.
#
######################################################################
# WaveRNN
# ~~~~~~~
#
# Continuing from the previous section, we can instantiate the matching
# WaveRNN model from the same bundle.
#
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

torchaudio.save("output_wavernn.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
IPython.display.display(IPython.display.Audio("output_wavernn.wav"))
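######################################################################
# The duration of the generated audio can be derived from the vocoder
# output (a small sketch; ``lengths`` here is the valid-length tensor
# returned by the vocoder above).
#
print(f"duration: {lengths[0].item() / vocoder.sample_rate:.2f} sec")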
######################################################################
# Griffin-Lim
# ~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate
# the vocoder object with the ``get_vocoder`` method and pass the spectrogram.
#
bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

torchaudio.save("output_griffinlim.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
IPython.display.display(IPython.display.Audio("output_griffinlim.wav"))
######################################################################
# WaveGlow
# ~~~~~~~~
#
# WaveGlow is a vocoder published by Nvidia. The pretrained weights are
# published on Torch Hub. One can instantiate the model using the
# ``torch.hub`` module.
#
# Workaround to load model mapped on GPU
# https://stackoverflow.com/a/61840832
waveglow = torch.hub.load(
    'NVIDIA/DeepLearningExamples:torchhub',
    'nvidia_waveglow',
    model_math='fp32',
    pretrained=False,
)
checkpoint = torch.hub.load_state_dict_from_url(
    'https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth',
    progress=False,
    map_location=device,
)
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)

fig, [ax1, ax2] = plt.subplots(2, 1, figsize=(16, 9))
ax1.imshow(spec[0].cpu().detach())
ax2.plot(waveforms[0].cpu().detach())

torchaudio.save("output_waveglow.wav", waveforms[0:1].cpu(), sample_rate=22050)
IPython.display.display(IPython.display.Audio("output_waveglow.wav"))