OpenDAS / Torchaudio · Commit 9e7aeea1 (unverified)
Authored Aug 03, 2021 by yangarbiter, committed by GitHub on Aug 03, 2021

Add tacotron2 training script (#1642)

Parent: af652ca6
Showing 4 changed files with 799 additions and 0 deletions
examples/pipeline_tacotron2/README.md    +44  -0
examples/pipeline_tacotron2/datasets.py  +171 -0
examples/pipeline_tacotron2/train.py     +497 -0
examples/pipeline_tacotron2/utils.py     +87  -0
examples/pipeline_tacotron2/README.md

This is an example pipeline for text-to-speech using Tacotron2.
## Install required packages

Required packages:

```bash
pip install librosa tqdm inflect
```

To use tensorboard:

```bash
pip install tensorboard pillow
```
## Training Tacotron2

The training of Tacotron2 can be invoked with the following command.

```bash
python train.py \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor character \
    --logging-dir ./logs \
    --checkpoint-path ./ckpt.pth \
    --dataset-path ./
```
The training script will use all GPUs that are available; set the environment
variable `CUDA_VISIBLE_DEVICES` if you don't want all of them to be used.
The newest checkpoint will be saved to `./ckpt.pth`, and the checkpoint with the best validation
loss will be saved to `./best_ckpt.pth`.
The training log will be saved to `./logs/train.log`, and the tensorboard results will also
be in `./logs`.
If `./ckpt.pth` already exists, the script will automatically load the file and try to continue
training from that checkpoint.

This command takes around 36 hours to train on 8 NVIDIA Tesla V100 GPUs.
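For example, to restrict training to a subset of GPUs, export the variable before launching the script (the device ids below are hypothetical; adjust them to your machine):

```shell
# Make only the first two GPUs visible to the training script.
export CUDA_VISIBLE_DEVICES=0,1
echo "Using GPUs: $CUDA_VISIBLE_DEVICES"
```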
examples/pipeline_tacotron2/datasets.py (new file, mode 100644)
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
from typing import Tuple, Callable, List

import torch
from torch import Tensor
from torch.utils.data.dataset import random_split
from torchaudio.datasets import LJSPEECH
class SpectralNormalization(torch.nn.Module):
    def forward(self, input):
        return torch.log(torch.clamp(input, min=1e-5))


class InverseSpectralNormalization(torch.nn.Module):
    def forward(self, input):
        return torch.exp(input)
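The two modules above form a clamp-then-log / exp pair. A plain-float sketch (outside of torch) of the round trip they implement:

```python
import math

# Plain-float sketch of what the two modules compute:
# clamp to a small floor, take the log, and invert with exp.
def spectral_normalize(x, floor=1e-5):
    return math.log(max(x, floor))

def inverse_spectral_normalize(y):
    return math.exp(y)

# Values above the floor survive the round trip essentially unchanged.
for v in (0.5, 1.0, 2.0):
    assert abs(inverse_spectral_normalize(spectral_normalize(v)) - v) < 1e-9
```

Values at or below the floor are clipped to `1e-5` before the log, which keeps the log finite for silent spectrogram bins.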
class MapMemoryCache(torch.utils.data.Dataset):
    r"""Wrap a dataset so that, whenever a new item is returned, it is saved to memory."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = [None] * len(dataset)

    def __getitem__(self, n):
        if self._cache[n] is not None:
            return self._cache[n]

        item = self.dataset[n]
        self._cache[n] = item

        return item

    def __len__(self):
        return len(self.dataset)
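A quick usage sketch of the caching behavior, with a hypothetical stand-in dataset that counts how often the underlying data is actually fetched (the class body is repeated here, minus the torch base class, so the snippet is self-contained):

```python
class MapMemoryCache:
    """Re-sketch of the class above (torch.utils.data.Dataset base omitted)."""
    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = [None] * len(dataset)

    def __getitem__(self, n):
        if self._cache[n] is not None:
            return self._cache[n]
        item = self.dataset[n]
        self._cache[n] = item
        return item

    def __len__(self):
        return len(self.dataset)


class CountingDataset:
    """Hypothetical stand-in that records how many raw fetches occur."""
    def __init__(self, values):
        self.values = values
        self.fetches = 0

    def __getitem__(self, n):
        self.fetches += 1
        return self.values[n]

    def __len__(self):
        return len(self.values)


base = CountingDataset([10, 20, 30])
cached = MapMemoryCache(base)
assert cached[1] == 20
assert cached[1] == 20  # second access is served from the cache
assert base.fetches == 1
```

This matters here because each raw item is a decoded waveform: caching avoids re-reading and re-decoding audio on every epoch.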
class Processed(torch.utils.data.Dataset):
    def __init__(self, dataset, transforms, text_preprocessor):
        self.dataset = dataset
        self.transforms = transforms
        self.text_preprocessor = text_preprocessor

    def __getitem__(self, key):
        item = self.dataset[key]
        return self.process_datapoint(item)

    def __len__(self):
        return len(self.dataset)

    def process_datapoint(self, item):
        melspec = self.transforms(item[0])
        text_norm = torch.IntTensor(self.text_preprocessor(item[2]))
        return text_norm, torch.squeeze(melspec, 0)
def split_process_dataset(
    dataset: str,
    file_path: str,
    val_ratio: float,
    transforms: Callable,
    text_preprocessor: Callable[[str], List[int]],
) -> Tuple[torch.utils.data.Dataset, torch.utils.data.Dataset]:
    """Returns the training and validation datasets.

    Args:
        dataset (str): The dataset to use. Available options: [`'ljspeech'`]
        file_path (str): Path to the data.
        val_ratio (float): The ratio of the dataset to hold out for validation.
        transforms (callable): A function/transform that takes in a waveform and
            returns a transformed waveform (mel spectrogram in this example).
        text_preprocessor (callable): A function that takes in a string and
            returns a list of integers representing each symbol in the string.

    Returns:
        train_dataset (`torch.utils.data.Dataset`): The training set.
        val_dataset (`torch.utils.data.Dataset`): The validation set.
    """
    if dataset == 'ljspeech':
        data = LJSPEECH(root=file_path, download=False)

        val_length = int(len(data) * val_ratio)
        lengths = [len(data) - val_length, val_length]
        train_dataset, val_dataset = random_split(data, lengths)
    else:
        raise ValueError(f"Expected datasets: `ljspeech`, but found {dataset}")

    train_dataset = Processed(train_dataset, transforms, text_preprocessor)
    val_dataset = Processed(val_dataset, transforms, text_preprocessor)

    train_dataset = MapMemoryCache(train_dataset)
    val_dataset = MapMemoryCache(val_dataset)

    return train_dataset, val_dataset
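The split arithmetic above can be isolated: the validation length is the floor of `len(data) * val_ratio`, and the remainder goes to training (the helper below is illustrative, not part of the script):

```python
def split_lengths(n_items, val_ratio):
    # Mirrors the arithmetic in split_process_dataset:
    # floor(n * ratio) items for validation, the rest for training.
    val_length = int(n_items * val_ratio)
    return [n_items - val_length, val_length]

assert split_lengths(100, 0.1) == [90, 10]
# LJSpeech has 13,100 clips, so a 10% validation split holds out 1,310 of them.
assert split_lengths(13100, 0.1) == [11790, 1310]
```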
def text_mel_collate_fn(
    batch: Tuple[Tensor, Tensor],
    n_frames_per_step: int = 1,
) -> Tuple[Tensor, Tensor, Tensor, Tensor, Tensor]:
    """The collate function padding and adjusting the data based on `n_frames_per_step`.

    Modified from https://github.com/NVIDIA/DeepLearningExamples

    Args:
        batch (list of (Tensor, Tensor)): Each element is a ``(text, mel_specgram)`` pair,
            where ``text`` has shape (n_characters, ) and ``mel_specgram`` has
            shape (n_mels, n_frames).
        n_frames_per_step (int): The number of frames to advance every step.

    Returns:
        text_padded (Tensor): The input text to Tacotron2 with shape (n_batch, max of ``text_lengths``).
        text_lengths (Tensor): The length of each text with shape (n_batch, ).
        mel_specgram_padded (Tensor): The target mel spectrogram
            with shape (n_batch, n_mels, max of ``mel_specgram_lengths``).
        mel_specgram_lengths (Tensor): The length of each mel spectrogram with shape (n_batch, ).
        gate_padded (Tensor): The ground truth gate output
            with shape (n_batch, max of ``mel_specgram_lengths``).
    """
    text_lengths, ids_sorted_decreasing = torch.sort(
        torch.LongTensor([len(x[0]) for x in batch]), dim=0, descending=True)

    max_input_len = text_lengths[0]
    text_padded = torch.zeros((len(batch), max_input_len), dtype=torch.int64)

    for i in range(len(ids_sorted_decreasing)):
        text = batch[ids_sorted_decreasing[i]][0]
        text_padded[i, :text.size(0)] = text

    # Right zero-pad mel-spec
    num_mels = batch[0][1].size(0)
    max_target_len = max([x[1].size(1) for x in batch])
    if max_target_len % n_frames_per_step != 0:
        max_target_len += n_frames_per_step - max_target_len % n_frames_per_step
        assert max_target_len % n_frames_per_step == 0

    # include mel padded and gate padded
    mel_specgram_padded = torch.zeros((len(batch), num_mels, max_target_len), dtype=torch.float32)
    gate_padded = torch.zeros((len(batch), max_target_len), dtype=torch.float32)
    mel_specgram_lengths = torch.LongTensor(len(batch))
    for i in range(len(ids_sorted_decreasing)):
        mel = batch[ids_sorted_decreasing[i]][1]
        mel_specgram_padded[i, :, :mel.size(1)] = mel
        mel_specgram_lengths[i] = mel.size(1)
        gate_padded[i, mel.size(1) - 1:] = 1

    return text_padded, text_lengths, mel_specgram_padded, mel_specgram_lengths, gate_padded
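The frame-rounding step in the middle of the function can be sketched in isolation: the padded target length is rounded up to the next multiple of `n_frames_per_step` (a standalone illustration, not part of the script):

```python
def round_up_to_step(max_target_len, n_frames_per_step):
    # Same rounding as in text_mel_collate_fn: pad the longest
    # spectrogram length up to a multiple of n_frames_per_step.
    if max_target_len % n_frames_per_step != 0:
        max_target_len += n_frames_per_step - max_target_len % n_frames_per_step
    return max_target_len

assert round_up_to_step(97, 3) == 99
assert round_up_to_step(99, 3) == 99  # already a multiple: unchanged
```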
examples/pipeline_tacotron2/train.py (new file, mode 100644)

(Diff collapsed on the original page; 497 additions not shown.)
examples/pipeline_tacotron2/utils.py (new file, mode 100644)
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import logging
import os
import shutil
from typing import List, Tuple, Callable

import torch
from torch import Tensor
def save_checkpoint(state, is_best, filename):
    r"""Save the checkpoint to ``filename``. If it has the best validation
    loss so far (``is_best``), also copy it to ``best_<filename>``.
    """
    torch.save(state, filename)
    logging.info(f"Checkpoint saved to {filename}")

    if is_best:
        path, best_filename = os.path.split(filename)
        best_filename = os.path.join(path, "best_" + best_filename)
        shutil.copyfile(filename, best_filename)
        logging.info(f"Current best checkpoint saved to {best_filename}")
def pad_sequences(batch: List[Tensor]) -> Tuple[Tensor, Tensor]:
    r"""Right zero-pad all one-hot text sequences to max input length.

    Modified from https://github.com/NVIDIA/DeepLearningExamples.
    """
    input_lengths, ids_sorted_decreasing = torch.sort(
        torch.LongTensor([len(x) for x in batch]), dim=0, descending=True)
    max_input_len = input_lengths[0]

    text_padded = torch.LongTensor(len(batch), max_input_len)
    text_padded.zero_()
    for i in range(len(ids_sorted_decreasing)):
        text = batch[ids_sorted_decreasing[i]]
        text_padded[i, :text.size(0)] = text

    return text_padded, input_lengths
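The sort-then-pad behavior can be shown without tensors: sequences are reordered by decreasing length and right-padded with zeros (a pure-Python illustration, not the function above):

```python
def pad_sequences_sketch(batch):
    # Order indices by decreasing sequence length, as torch.sort(descending=True) does.
    order = sorted(range(len(batch)), key=lambda i: len(batch[i]), reverse=True)
    max_len = len(batch[order[0]])
    # Right-pad each sequence with zeros to the maximum length.
    padded = [batch[i] + [0] * (max_len - len(batch[i])) for i in order]
    lengths = [len(batch[i]) for i in order]
    return padded, lengths

padded, lengths = pad_sequences_sketch([[1], [2, 3, 4], [5, 6]])
assert lengths == [3, 2, 1]
assert padded == [[2, 3, 4], [5, 6, 0], [1, 0, 0]]
```

Sorting by decreasing length is what downstream packed-sequence utilities and the Tacotron2 encoder masking expect.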
def prepare_input_sequence(texts: List[str],
                           text_processor: Callable[[str], List[int]]) -> Tuple[Tensor, Tensor]:
    d = []
    for text in texts:
        d.append(torch.IntTensor(text_processor(text)[:]))

    text_padded, input_lengths = pad_sequences(d)
    return text_padded, input_lengths
def get_text_preprocessor(preprocessor_name: str) -> Tuple[List[str], Callable[[str], List[int]]]:
    if preprocessor_name == "character":
        from text.text_preprocessing import symbols
        from text.text_preprocessing import text_to_sequence
        text_preprocessor = text_to_sequence
    else:
        raise ValueError(
            f"The preprocessor_name ({preprocessor_name}) provided is not supported.")

    return symbols, text_preprocessor
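The `text.text_preprocessing` module is not part of this commit. A hypothetical character-level preprocessor of the kind it is assumed to provide (the symbol set below is invented for illustration and will differ from the real one):

```python
# Hypothetical character-level preprocessor; the real symbol set in
# text.text_preprocessing may differ.
symbols = list("abcdefghijklmnopqrstuvwxyz !?.,")
_symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_sequence(text):
    # Map each known character to its integer id, skipping unknown characters.
    return [_symbol_to_id[c] for c in text.lower() if c in _symbol_to_id]

assert text_to_sequence("Hi!") == [7, 8, 27]
```

Returning `symbols` alongside the function lets the caller size the Tacotron2 embedding table to the number of symbols.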