hehl2 / Torchaudio · Commit 084455a3
"vscode:/vscode.git/clone" did not exist on "1b71d0bbe4324561a2f3ee69f93e5dff41c93794"
Unverified commit 084455a3 · authored Aug 10, 2021 by yangarbiter · committed by GitHub, Aug 10, 2021
Add phoneme text preprocessing for Tacotron2 (#1668)
Parent: 8094751f

Showing 4 changed files with 243 additions and 46 deletions (+243 −46):
- examples/pipeline_tacotron2/README.md (+102 −4)
- examples/pipeline_tacotron2/text/text_preprocessing.py (+84 −5)
- examples/pipeline_tacotron2/train.py (+57 −26)
- examples/pipeline_tacotron2/utils.py (+0 −11)
examples/pipeline_tacotron2/README.md — view file @ 084455a3
@@ -5,7 +5,7 @@ This is an example pipeline for text-to-speech using Tacotron2.
 Required packages
 ```bash
-pip install librosa tqdm inflect
+pip install librosa tqdm inflect joblib
 ```
 To use tensorboard
@@ -13,7 +13,7 @@ To use tensorboard
 pip install tensorboard pillow
 ```
-## Training Tacotron2
+## Training Tacotron2 with character as input
 The training of Tacotron2 can be invoked with the following command.
@@ -26,7 +26,7 @@ python train.py \
     --batch-size 96 \
     --weight-decay 1e-6 \
     --grad-clip 1.0 \
-    --text-preprocessor character \
+    --text-preprocessor english_characters \
    --logging-dir ./logs \
     --checkpoint-path ./ckpt.pth \
     --dataset-path ./
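This hunk only renames the preprocessor flag from `character` to `english_characters`. To see what that preprocessor emits, here is a minimal sketch using the `text_to_sequence` added in this commit (run from `examples/pipeline_tacotron2`; the expected IDs are taken from the function's own docstring):

```python
# Minimal sketch: what --text-preprocessor english_characters does to a sentence.
# Assumes this is run from examples/pipeline_tacotron2 with the commit applied.
from text.text_preprocessing import text_to_sequence

print(text_to_sequence("hello world!", symbol_list="english_characters"))
# [19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2]
```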
@@ -42,4 +42,102 @@ be in `./logs`.
 If `./ckpt.pth` already exists, this script will automatically load the file and try to continue
 training from the checkpoint.
-This command takes around 36 hours to train on 8 NVIDIA Tesla V100 GPUs.
\ No newline at end of file
+This command takes around 36 hours to train on 8 NVIDIA Tesla V100 GPUs.
+
+To train the Tacotron2 model to work with the
+[pretrained wavernn](https://pytorch.org/audio/main/models.html#id10)
+with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command instead.
+
+```bash
+python train.py \
+    --learning-rate 1e-3 \
+    --epochs 1501 \
+    --anneal-steps 500 1000 1500 \
+    --anneal-factor 0.1 \
+    --sample-rate 22050 \
+    --n-fft 2048 \
+    --hop-length 275 \
+    --win-length 1100 \
+    --mel-fmin 40 \
+    --mel-fmax 11025 \
+    --batch-size 96 \
+    --weight-decay 1e-6 \
+    --grad-clip 1.0 \
+    --text-preprocessor english_characters \
+    --logging-dir ./wavernn_logs \
+    --checkpoint-path ./ckpt_wavernn.pth \
+    --dataset-path ./
+```
+## Training Tacotron2 with phoneme as input
+
+#### Dependencies
+
+This example uses [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) as
+the phonemizer (the function that turns text into phonemes).
+Please install it with the following command (the code is tested with version 0.0.15).
+
+```bash
+pip install deep-phonemizer==0.0.15
+```
+
+Then download the model weights from [their website](https://github.com/as-ideas/DeepPhonemizer).
+The checkpoint tested with this example is
+[https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt).
+#### Running the training script
+
+The training of Tacotron2 with English phonemes as input can be invoked with the following command.
+
+```bash
+python train.py \
+    --workers 12 \
+    --learning-rate 1e-3 \
+    --epochs 1501 \
+    --anneal-steps 500 1000 1500 \
+    --anneal-factor 0.1 \
+    --batch-size 96 \
+    --weight-decay 1e-6 \
+    --grad-clip 1.0 \
+    --text-preprocessor english_phonemes \
+    --phonemizer DeepPhonemizer \
+    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
+    --cmudict-root ./ \
+    --logging-dir ./english_phonemes_logs \
+    --checkpoint-path ./english_phonemes_ckpt.pth \
+    --dataset-path ./
+```
+Similar to the previous examples, this command will save the log in the directory
+`./english_phonemes_logs` and the checkpoint will be saved to `./english_phonemes_ckpt.pth`.
+
+To train the Tacotron2 model with English phonemes that works with the
+[pretrained wavernn](https://pytorch.org/audio/main/models.html#id10)
+with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command.
+
+```bash
+python train.py \
+    --workers 12 \
+    --learning-rate 1e-3 \
+    --epochs 1501 \
+    --anneal-steps 500 1000 1500 \
+    --anneal-factor 0.1 \
+    --sample-rate 22050 \
+    --n-fft 2048 \
+    --hop-length 275 \
+    --win-length 1100 \
+    --mel-fmin 40 \
+    --mel-fmax 11025 \
+    --batch-size 96 \
+    --weight-decay 1e-6 \
+    --grad-clip 1.0 \
+    --text-preprocessor english_phonemes \
+    --phonemizer DeepPhonemizer \
+    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
+    --cmudict-root ./ \
+    --logging-dir ./english_phonemes_wavernn_logs \
+    --checkpoint-path ./english_phonemes_wavernn_ckpt.pth \
+    --dataset-path ./
+```
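Before launching a phoneme run, the downloaded phonemizer checkpoint can be sanity-checked on its own. A minimal sketch, assuming `deep-phonemizer==0.0.15` and the checkpoint path used in the commands above; the call mirrors what `word_to_phonemes` in `text_preprocessing.py` does internally:

```python
# Sanity-check the DeepPhonemizer checkpoint before a full training run.
# Assumes deep-phonemizer==0.0.15 and ./en_us_cmudict_forward.pt from the link above.
from dp.phonemizer import Phonemizer

phonemizer = Phonemizer.from_checkpoint("./en_us_cmudict_forward.pt")
print(phonemizer("hello world!", lang="en_us"))
# Expected, per the comments in text_preprocessing.py: '[HH][AH][L][OW] [W][ER][L][D]!'
```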
examples/pipeline_tacotron2/text/text_preprocessing.py — view file @ 084455a3
@@ -24,10 +24,11 @@
 Modified from https://github.com/keithito/tacotron
 """

-from typing import List
+from typing import List, Union, Optional
 import re

 from unidecode import unidecode
+from torchaudio.datasets import CMUDict

 from .numbers import normalize_numbers
@@ -63,18 +64,87 @@ _special = '-'
 _letters = 'abcdefghijklmnopqrstuvwxyz'

 symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters)
-_symbol_to_id = {s: i for i, s in enumerate(symbols)}
+_phonemizer = None
+
+available_symbol_set = set(["english_characters", "english_phonemes"])
+available_phonemizers = set(["DeepPhonemizer"])
+
+
+def get_symbol_list(symbol_list: str = "english_characters",
+                    cmudict_root: Optional[str] = "./") -> List[str]:
+    if symbol_list == "english_characters":
+        return [_pad] + list(_special) + list(_punctuation) + list(_letters)
+    elif symbol_list == "english_phonemes":
+        return [_pad] + list(_special) + list(_punctuation) + CMUDict(cmudict_root).symbols
+    else:
+        raise ValueError(f"The `symbol_list` {symbol_list} is not supported. "
+                         f"Supported `symbol_list` includes {available_symbol_set}.")
+
+
+def word_to_phonemes(sent: str, phonemizer: str, checkpoint: str) -> List[str]:
+    if phonemizer == "DeepPhonemizer":
+        from dp.phonemizer import Phonemizer
+        global _phonemizer
+        _other_symbols = ''.join(list(_special) + list(_punctuation))
+        _phone_symbols_re = r'(\[[A-Z]+?\]|' + '[' + _other_symbols + '])'  # [\[([A-Z]+?)\]|[-!'(),.:;? ]]
+
+        if _phonemizer is None:
+            # using a global variable so that we don't have to reload the checkpoint
+            # every time this function is called
+            _phonemizer = Phonemizer.from_checkpoint(checkpoint)
+
+        # Example:
+        # sent = "hello world!"
+        # '[HH][AH][L][OW] [W][ER][L][D]!'
+        sent = _phonemizer(sent, lang='en_us')
+
+        # ['[HH]', '[AH]', '[L]', '[OW]', ' ', '[W]', '[ER]', '[L]', '[D]', '!']
+        ret = re.findall(_phone_symbols_re, sent)
+
+        # ['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', '!']
+        ret = [r.replace("[", "").replace("]", "") for r in ret]
+
+        return ret
+    else:
+        raise ValueError(f"The `phonemizer` {phonemizer} is not supported. "
+                         "Supported `phonemizer` includes `'DeepPhonemizer'`.")


-def text_to_sequence(sent: str) -> List[int]:
+def text_to_sequence(sent: str,
+                     symbol_list: Union[str, List[str]] = "english_characters",
+                     phonemizer: Optional[str] = "DeepPhonemizer",
+                     checkpoint: Optional[str] = "./en_us_cmudict_forward.pt",
+                     cmudict_root: Optional[str] = "./") -> List[int]:
     r'''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.

     Args:
         sent (str): The input sentence to convert to a sequence.
+        symbol_list (str or List of string, optional): When the input is a string, available options include
+            "english_characters" and "english_phonemes". When the input is a list of string, ``symbol_list`` will
+            directly be used as the symbols to encode. (Default: "english_characters")
+        phonemizer (str, optional): The phonemizer to use. Only used when ``symbol_list`` is "english_phonemes".
+            Available options include "DeepPhonemizer". (Default: "DeepPhonemizer")
+        checkpoint (str, optional): The path to the checkpoint of the phonemizer. Only used when ``symbol_list`` is
+            "english_phonemes". (Default: "./en_us_cmudict_forward.pt")
+        cmudict_root (str, optional): The path to the directory where the CMUDict dataset is found or downloaded.
+            Only used when ``symbol_list`` is "english_phonemes". (Default: "./")

     Returns:
         List of integers corresponding to the symbols in the sentence.
+
+    Examples:
+        >>> text_to_sequence("hello world!", "english_characters")
+        [19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2]
+        >>> text_to_sequence("hello world!", "english_phonemes")
+        [54, 20, 65, 69, 11, 92, 44, 65, 38, 2]
     '''
+    if symbol_list == "english_phonemes":
+        if any(param is None for param in [phonemizer, checkpoint, cmudict_root]):
+            raise ValueError(
+                "When `symbol_list` is 'english_phonemes', "
+                "all of `phonemizer`, `checkpoint`, and `cmudict_root` must be provided.")
+
     sent = unidecode(sent)  # convert to ascii
     sent = sent.lower()  # lower case
     sent = normalize_numbers(sent)  # expand numbers
@@ -82,4 +152,13 @@ def text_to_sequence(sent: str) -> List[int]:
         sent = re.sub(regex, replacement, sent)
     sent = re.sub(_whitespace_re, ' ', sent)  # collapse whitespace

+    if isinstance(symbol_list, list):
+        symbols = symbol_list
+    elif isinstance(symbol_list, str):
+        symbols = get_symbol_list(symbol_list, cmudict_root=cmudict_root)
+        if symbol_list == "english_phonemes":
+            sent = word_to_phonemes(sent, phonemizer=phonemizer, checkpoint=checkpoint)
+
+    _symbol_to_id = {s: i for i, s in enumerate(symbols)}
     return [_symbol_to_id[s] for s in sent if s in _symbol_to_id]
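Taken together, the new functions form a small API. A usage sketch: the character-mode IDs come straight from the docstring, while phoneme mode assumes the DeepPhonemizer checkpoint is present and CMUDict files are available under `cmudict_root`:

```python
# Usage sketch for the new preprocessing API added in this commit.
from text.text_preprocessing import get_symbol_list, text_to_sequence

# Character mode: the symbol set is the fixed pad/special/punctuation/letter tables.
chars = get_symbol_list("english_characters")
print(text_to_sequence("hello world!", symbol_list="english_characters"))
# [19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2]

# Phoneme mode: needs the phonemizer checkpoint and the CMUDict files.
ids = text_to_sequence(
    "hello world!",
    symbol_list="english_phonemes",
    phonemizer="DeepPhonemizer",
    checkpoint="./en_us_cmudict_forward.pt",
    cmudict_root="./",
)

# A precomputed symbol list can also be passed directly; unknown symbols are skipped.
print(text_to_sequence("abc!", symbol_list=["_", "a", "b", "c"]))  # [1, 2, 3]
```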
examples/pipeline_tacotron2/train.py — view file @ 084455a3
@@ -50,8 +50,14 @@ import matplotlib.pyplot as plt
 plt.switch_backend('agg')

 from datasets import text_mel_collate_fn, split_process_dataset, SpectralNormalization
-from utils import save_checkpoint, get_text_preprocessor
+from utils import save_checkpoint
 from loss import Tacotron2Loss
+from text.text_preprocessing import (
+    available_symbol_set,
+    available_phonemizers,
+    get_symbol_list,
+    text_to_sequence,
+)

 logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s',
@@ -76,13 +82,22 @@ def parse_args(parser):
     parser.add_argument('--anneal-factor', type=float, choices=[0.1, 0.3], default=0.1,
                         help='factor for annealing learning rate')
-    parser.add_argument('--text-preprocessor', default='character', type=str,
-                        choices=['character'],
-                        help='[string] Select text preprocessor to use.')
     parser.add_argument('--master-addr', default=None, type=str,
-                        help='The address to use for distributed training.')
+                        help='the address to use for distributed training')
     parser.add_argument('--master-port', default=None, type=str,
-                        help='The port to use for distributed training.')
+                        help='the port to use for distributed training')
+
+    preprocessor = parser.add_argument_group('text preprocessor setup')
+    preprocessor.add_argument('--text-preprocessor', default='english_characters', type=str,
+                              choices=available_symbol_set,
+                              help='select text preprocessor to use.')
+    preprocessor.add_argument('--phonemizer', type=str, choices=available_phonemizers,
+                              help='select phonemizer to use, only used when text-preprocessor is "english_phonemes"')
+    preprocessor.add_argument('--phonemizer-checkpoint', type=str,
+                              help='the path or name of the checkpoint for the phonemizer, '
+                                   'only used when text-preprocessor is "english_phonemes"')
+    preprocessor.add_argument('--cmudict-root', default="./", type=str,
+                              help='the root directory for storing cmudictionary files')

     # training
     training = parser.add_argument_group('training setup')
@@ -263,6 +278,36 @@ def log_additional_info(writer, model, loader, epoch):
     writer.add_image("trn/alignment", alignment[0], epoch, dataformats="HW")


+def get_datasets(args):
+    text_preprocessor = partial(
+        text_to_sequence,
+        symbol_list=args.text_preprocessor,
+        phonemizer=args.phonemizer,
+        checkpoint=args.phonemizer_checkpoint,
+        cmudict_root=args.cmudict_root,
+    )
+
+    transforms = torch.nn.Sequential(
+        torchaudio.transforms.MelSpectrogram(
+            sample_rate=args.sample_rate,
+            n_fft=args.n_fft,
+            win_length=args.win_length,
+            hop_length=args.hop_length,
+            f_min=args.mel_fmin,
+            f_max=args.mel_fmax,
+            n_mels=args.n_mels,
+            mel_scale='slaney',
+            normalized=False,
+            power=1,
+            norm='slaney',
+        ),
+        SpectralNormalization()
+    )
+    trainset, valset = split_process_dataset(
+        args.dataset, args.dataset_path, args.val_ratio, transforms, text_preprocessor)
+    return trainset, valset
+
+
 def train(rank, world_size, args):
     dist.init_process_group("nccl", rank=rank, world_size=world_size)
@@ -281,7 +326,7 @@ def train(rank, world_size, args):
     torch.cuda.set_device(rank)

-    symbols, text_preprocessor = get_text_preprocessor(args.text_preprocessor)
+    symbols = get_symbol_list(args.text_preprocessor)

     model = Tacotron2(
         mask_padding=args.mask_padding,
@@ -330,24 +375,7 @@ def train(rank, world_size, args):
             f"Checkpoint: loaded '{args.checkpoint_path}' at epoch {checkpoint['epoch']}")

-    transforms = torch.nn.Sequential(
-        torchaudio.transforms.MelSpectrogram(
-            sample_rate=args.sample_rate,
-            n_fft=args.n_fft,
-            win_length=args.win_length,
-            hop_length=args.hop_length,
-            f_min=args.mel_fmin,
-            f_max=args.mel_fmax,
-            n_mels=args.n_mels,
-            mel_scale='slaney',
-            normalized=False,
-            power=1,
-            norm='slaney',
-        ),
-        SpectralNormalization()
-    )
-    trainset, valset = split_process_dataset(
-        args.dataset, args.dataset_path, args.val_ratio, transforms, text_preprocessor)
+    trainset, valset = get_datasets(args)

     train_sampler = torch.utils.data.distributed.DistributedSampler(
         trainset,
@@ -365,6 +393,8 @@ def train(rank, world_size, args):
     loader_params = {
         "batch_size": args.batch_size,
         "num_workers": args.workers,
+        "prefetch_factor": 1024,
+        'persistent_workers': True,
         "shuffle": False,
         "pin_memory": True,
         "drop_last": False,
@@ -484,7 +514,8 @@ def main(args):
     if device_counts == 1:
         train(0, 1, args)
     else:
         mp.spawn(train, args=(device_counts, args, ),
                  nprocs=device_counts, join=True)

     logger.info(f"End time: {datetime.now()}")
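The `text_preprocessor` built inside `get_datasets` is just `text_to_sequence` with its configuration frozen. A standalone sketch of the same pattern, with argument values taken from the README commands above:

```python
# Standalone sketch of the preprocessor construction done in get_datasets().
from functools import partial

from text.text_preprocessing import text_to_sequence

text_preprocessor = partial(
    text_to_sequence,
    symbol_list="english_phonemes",
    phonemizer="DeepPhonemizer",
    checkpoint="./en_us_cmudict_forward.pt",
    cmudict_root="./",
)
ids = text_preprocessor("hello world!")  # List[int], ready for the dataset pipeline
```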
examples/pipeline_tacotron2/utils.py — view file @ 084455a3
@@ -74,14 +74,3 @@ def prepare_input_sequence(texts: List[str],
         text_padded, input_lengths = pad_sequences(d)
     return text_padded, input_lengths
-
-
-def get_text_preprocessor(preprocessor_name: str) -> Tuple[List[str], Callable[[str], List[int]]]:
-    if preprocessor_name == "character":
-        from text.text_preprocessing import symbols
-        from text.text_preprocessing import text_to_sequence
-        text_preprocessor = text_to_sequence
-    else:
-        raise ValueError(
-            "The preprocessor_name ({preprocessor_name}) provided is not supported.")
-    return symbols, text_preprocessor
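Callers outside this example that relied on the removed helper can rebuild the same `(symbols, preprocessor)` pair from the new API. A minimal sketch, assuming character mode; this replacement is hypothetical and not part of the commit:

```python
# Hypothetical replacement for the removed get_text_preprocessor(), built on the
# new API; not part of this commit. For phonemes, also freeze the phonemizer args.
from functools import partial

from text.text_preprocessing import get_symbol_list, text_to_sequence


def get_text_preprocessor(preprocessor_name: str = "english_characters"):
    symbols = get_symbol_list(preprocessor_name)
    text_preprocessor = partial(text_to_sequence, symbol_list=preprocessor_name)
    return symbols, text_preprocessor
```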