Commit 39ac40a9 authored by chenzk's avatar chenzk

v1.0

encoder_type: RoPE Encoder
encoder_params:
  n_feats: ${model.n_feats}
  n_channels: 192
  filter_channels: 768
  filter_channels_dp: 256
  n_heads: 2
  n_layers: 6
  kernel_size: 3
  p_dropout: 0.1
  spk_emb_dim: 64
  n_spks: 1
  prenet: true
duration_predictor_params:
  filter_channels_dp: ${model.encoder.encoder_params.filter_channels_dp}
  kernel_size: 3
  p_dropout: ${model.encoder.encoder_params.p_dropout}
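The `${...}` values in the encoder config are interpolations: `duration_predictor_params.filter_channels_dp` is not a literal but a pointer to `model.encoder.encoder_params.filter_channels_dp`. The sketch below illustrates that lookup with plain dictionaries. It is a deliberately simplified stand-in, not OmegaConf's implementation; `lookup` and `resolve` are hypothetical helpers, and the config is trimmed to the keys needed for the example.

```python
import re

# Hypothetical, simplified resolver for "${a.b.c}" references in a nested dict.
# It only handles values that consist of one whole-string reference.
_REF = re.compile(r"^\$\{([\w.]+)\}$")


def lookup(root, dotted):
    """Walk a dotted path such as 'model.n_feats' through nested dicts."""
    node = root
    for key in dotted.split("."):
        node = node[key]
    return node


def resolve(root, node=None):
    """Return a copy of `node` with every reference replaced by its target."""
    node = root if node is None else node
    out = {}
    for key, value in node.items():
        if isinstance(value, dict):
            out[key] = resolve(root, value)
            continue
        # Follow chained references until a concrete value is reached
        while isinstance(value, str) and _REF.match(value):
            value = lookup(root, _REF.match(value).group(1))
        out[key] = value
    return out


config = {
    "model": {
        "n_feats": 80,
        "encoder": {
            "encoder_params": {
                "n_feats": "${model.n_feats}",
                "filter_channels_dp": 256,
                "p_dropout": 0.1,
            },
            "duration_predictor_params": {
                "filter_channels_dp": "${model.encoder.encoder_params.filter_channels_dp}",
                "p_dropout": "${model.encoder.encoder_params.p_dropout}",
            },
        },
    }
}
resolved = resolve(config)
```

Because the duration predictor refers back to the encoder values, changing `filter_channels_dp` or `p_dropout` in one place keeps both blocks in sync.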
defaults:
- _self_
- encoder: default.yaml
- decoder: default.yaml
- cfm: default.yaml
- optimizer: adam.yaml
_target_: matcha.models.matcha_tts.MatchaTTS
n_vocab: 178
n_spks: ${data.n_spks}
spk_emb_dim: 64
n_feats: 80
data_statistics: ${data.data_statistics}
out_size: null # Must be divisible by 4
prior_loss: true
use_precomputed_durations: ${data.load_durations}
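The comment on `out_size` requires a value divisible by 4. A minimal sketch of picking a valid value by rounding a desired frame count up to the next multiple of 4 follows. `round_up_to_multiple` is a hypothetical helper written for this illustration, not a function from the repository, and the factor 4 is taken only from the comment above.

```python
def round_up_to_multiple(length: int, factor: int = 4) -> int:
    """Smallest multiple of `factor` that is greater than or equal to `length`."""
    # -(-a // b) is integer ceiling division
    return -(-length // factor) * factor


desired_frames = 170
out_size = round_up_to_multiple(desired_frames)
```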
_target_: torch.optim.Adam
_partial_: true
lr: 1e-4
weight_decay: 0.0
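With `_partial_: true`, Hydra does not build the optimizer immediately; instantiation yields a `functools.partial` with the YAML arguments bound, and the model parameters are supplied later. The sketch below shows that pattern. `FakeAdam` is a hypothetical stand-in for `torch.optim.Adam` so the example runs without PyTorch.

```python
from functools import partial


class FakeAdam:
    """Hypothetical stand-in for torch.optim.Adam (illustration only)."""

    def __init__(self, params, lr=1e-3, weight_decay=0.0):
        self.params = list(params)
        self.lr = lr
        self.weight_decay = weight_decay


# Roughly what the optimizer config amounts to: hyperparameters are bound now...
optimizer_factory = partial(FakeAdam, lr=1e-4, weight_decay=0.0)

# ...and the parameters are passed once the model exists.
optimizer = optimizer_factory(params=[0.1, 0.2, 0.3])
```

Deferring construction this way lets the config stay independent of the model object, which only exists at training time.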
# path to root directory
# this requires PROJECT_ROOT environment variable to exist
# you can replace it with "." if you want the root to be the current working directory
root_dir: ${oc.env:PROJECT_ROOT}
# path to data directory
data_dir: ${paths.root_dir}/data/
# path to logging directory
log_dir: ${paths.root_dir}/logs/
# path to output directory, created dynamically by hydra
# path generation pattern is specified in `configs/hydra/default.yaml`
# use it to store all files generated during the run, like ckpts and metrics
output_dir: ${hydra:runtime.output_dir}
# path to working directory
work_dir: ${hydra:runtime.cwd}
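The paths config derives `data_dir` and `log_dir` from a single root taken from the `PROJECT_ROOT` environment variable. A minimal sketch of the same derivation in plain Python follows; `build_paths` is a hypothetical helper and `/tmp/matcha` is an example value, not a path from the repository.

```python
import os
from pathlib import Path


def build_paths(environ=os.environ):
    """Derive project directories from PROJECT_ROOT (hypothetical helper)."""
    # A missing variable raises KeyError, matching the comment that it must exist
    root_dir = Path(environ["PROJECT_ROOT"])
    return {
        "root_dir": root_dir,
        "data_dir": root_dir / "data",
        "log_dir": root_dir / "logs",
    }


# Pass a mapping explicitly so the example does not depend on the real environment
paths = build_paths({"PROJECT_ROOT": "/tmp/matcha"})
```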
# @package _global_
# specify here default configuration
# order of defaults determines the order in which configs override each other
defaults:
- _self_
- data: ljspeech
- model: matcha
- callbacks: default
- logger: tensorboard # set logger here or use command line (e.g. `python train.py logger=tensorboard`)
- trainer: default
- paths: default
- extras: default
- hydra: default
# experiment configs allow for version control of specific hyperparameters
# e.g. best hyperparameters for given model and datamodule
- experiment: null
# config for hyperparameter optimization
- hparams_search: null
# optional local config for machine/user specific settings
# it's optional since it doesn't need to exist and is excluded from version control
- optional local: default
# debugging config (enable through command line, e.g. `python train.py debug=default`)
- debug: null
# task name, determines output directory path
task_name: "train"
run_name: ???
# tags to help you identify your experiments
# you can overwrite this in experiment configs
# overwrite from command line with `python train.py tags="[first_tag, second_tag]"`
tags: ["dev"]
# set False to skip model training
train: True
# evaluate on test set, using best model weights achieved during training
# lightning chooses best weights based on the metric specified in checkpoint callback
test: True
# simply provide checkpoint path to resume training
ckpt_path: null
# seed for random number generators in pytorch, numpy and python.random
seed: 1234
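The comments in the defaults list say that order decides which config overrides which. As a simplified picture of that rule, the sketch below merges dictionaries in order so that later entries win. It is not Hydra's composition algorithm (packages and `_self_` placement are ignored), `deep_merge` and `compose` are hypothetical helpers, and the `experiment` values are invented for illustration.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def compose(configs):
    """Apply configs in list order, so later entries override earlier ones."""
    result = {}
    for cfg in configs:
        result = deep_merge(result, cfg)
    return result


primary = {"seed": 1234, "trainer": {"devices": [0]}}
experiment = {"seed": 42, "trainer": {"max_epochs": 100}}  # invented example values
composed = compose([primary, experiment])
```

Keys that only one config defines survive the merge, while shared keys take the value of the later config.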
defaults:
- default
strategy: ddp
accelerator: gpu
devices: [0,1]
num_nodes: 1
sync_batchnorm: True
defaults:
- default
# simulate DDP on CPU, useful for debugging
accelerator: cpu
devices: 2
strategy: ddp_spawn
_target_: lightning.pytorch.trainer.Trainer
default_root_dir: ${paths.output_dir}
max_epochs: -1
accelerator: gpu
devices: [0]
# mixed precision for extra speed-up
precision: 16-mixed
# perform a validation loop every N training epochs
check_val_every_n_epoch: 1
# set True to ensure deterministic results
# makes training slower but gives more reproducibility than just setting seeds
deterministic: False
gradient_clip_val: 5.0
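`gradient_clip_val: 5.0` bounds the size of each update. Assuming norm-based clipping (Lightning's default clipping algorithm), the idea can be sketched on a flat list of gradient values. `clip_by_global_norm` is a hypothetical helper, and the real implementation operates on tensors and differs in numerical details.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so their global L2 norm does not exceed `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# Global norm is 50, so every component is scaled by the same factor
clipped = clip_by_global_norm([30.0, 40.0])
unchanged = clip_by_global_norm([3.0, 4.0])
```

Scaling all components by one factor preserves the gradient direction and changes only its length.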
import tempfile
from argparse import Namespace
from pathlib import Path
import gradio as gr
import soundfile as sf
import torch
from matcha.cli import (
MATCHA_URLS,
VOCODER_URLS,
assert_model_downloaded,
get_device,
load_matcha,
load_vocoder,
process_text,
to_waveform,
)
from matcha.utils.utils import get_user_data_dir, plot_tensor
LOCATION = Path(get_user_data_dir())
args = Namespace(
cpu=False,
model="matcha_vctk",
vocoder="hifigan_univ_v1",
spk=0,
)
CURRENTLY_LOADED_MODEL = args.model
def MATCHA_TTS_LOC(x):
return LOCATION / f"{x}.ckpt"
def VOCODER_LOC(x):
return LOCATION / f"{x}"
LOGO_URL = "https://shivammehta25.github.io/Matcha-TTS/images/logo.png"
RADIO_OPTIONS = {
"Multi Speaker (VCTK)": {
"model": "matcha_vctk",
"vocoder": "hifigan_univ_v1",
},
"Single Speaker (LJ Speech)": {
"model": "matcha_ljspeech",
"vocoder": "hifigan_T2_v1",
},
}
# Ensure all the required models are downloaded
assert_model_downloaded(MATCHA_TTS_LOC("matcha_ljspeech"), MATCHA_URLS["matcha_ljspeech"])
assert_model_downloaded(VOCODER_LOC("hifigan_T2_v1"), VOCODER_URLS["hifigan_T2_v1"])
assert_model_downloaded(MATCHA_TTS_LOC("matcha_vctk"), MATCHA_URLS["matcha_vctk"])
assert_model_downloaded(VOCODER_LOC("hifigan_univ_v1"), VOCODER_URLS["hifigan_univ_v1"])
device = get_device(args)
# Load default model
model = load_matcha(args.model, MATCHA_TTS_LOC(args.model), device)
vocoder, denoiser = load_vocoder(args.vocoder, VOCODER_LOC(args.vocoder), device)
def load_model(model_name, vocoder_name):
model = load_matcha(model_name, MATCHA_TTS_LOC(model_name), device)
vocoder, denoiser = load_vocoder(vocoder_name, VOCODER_LOC(vocoder_name), device)
return model, vocoder, denoiser
def load_model_ui(model_type, textbox):
model_name, vocoder_name = RADIO_OPTIONS[model_type]["model"], RADIO_OPTIONS[model_type]["vocoder"]
global model, vocoder, denoiser, CURRENTLY_LOADED_MODEL # pylint: disable=global-statement
if CURRENTLY_LOADED_MODEL != model_name:
model, vocoder, denoiser = load_model(model_name, vocoder_name)
CURRENTLY_LOADED_MODEL = model_name
if model_name == "matcha_ljspeech":
spk_slider = gr.update(visible=False, value=-1)
single_speaker_examples = gr.update(visible=True)
multi_speaker_examples = gr.update(visible=False)
length_scale = gr.update(value=0.95)
else:
spk_slider = gr.update(visible=True, value=0)
single_speaker_examples = gr.update(visible=False)
multi_speaker_examples = gr.update(visible=True)
length_scale = gr.update(value=0.85)
return (
textbox,
gr.update(interactive=True),
spk_slider,
single_speaker_examples,
multi_speaker_examples,
length_scale,
)
@torch.inference_mode()
def process_text_gradio(text):
output = process_text(1, text, device)
return output["x_phones"][1::2], output["x"], output["x_lengths"]
@torch.inference_mode()
def synthesise_mel(text, text_length, n_timesteps, temperature, length_scale, spk):
spk = torch.tensor([spk], device=device, dtype=torch.long) if spk >= 0 else None
output = model.synthesise(
text,
text_length,
n_timesteps=n_timesteps,
temperature=temperature,
spks=spk,
length_scale=length_scale,
)
output["waveform"] = to_waveform(output["mel"], vocoder, denoiser)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
sf.write(fp.name, output["waveform"], 22050, "PCM_24")
return fp.name, plot_tensor(output["mel"].squeeze().cpu().numpy())
def multispeaker_example_cacher(text, n_timesteps, mel_temp, length_scale, spk):
global CURRENTLY_LOADED_MODEL # pylint: disable=global-statement
if CURRENTLY_LOADED_MODEL != "matcha_vctk":
global model, vocoder, denoiser # pylint: disable=global-statement
model, vocoder, denoiser = load_model("matcha_vctk", "hifigan_univ_v1")
CURRENTLY_LOADED_MODEL = "matcha_vctk"
phones, text, text_lengths = process_text_gradio(text)
audio, mel_spectrogram = synthesise_mel(text, text_lengths, n_timesteps, mel_temp, length_scale, spk)
return phones, audio, mel_spectrogram
def ljspeech_example_cacher(text, n_timesteps, mel_temp, length_scale, spk=-1):
global CURRENTLY_LOADED_MODEL # pylint: disable=global-statement
if CURRENTLY_LOADED_MODEL != "matcha_ljspeech":
global model, vocoder, denoiser # pylint: disable=global-statement
model, vocoder, denoiser = load_model("matcha_ljspeech", "hifigan_T2_v1")
CURRENTLY_LOADED_MODEL = "matcha_ljspeech"
phones, text, text_lengths = process_text_gradio(text)
audio, mel_spectrogram = synthesise_mel(text, text_lengths, n_timesteps, mel_temp, length_scale, spk)
return phones, audio, mel_spectrogram
def main():
description = """# 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
### [Shivam Mehta](https://www.kth.se/profile/smehta), [Ruibo Tu](https://www.kth.se/profile/ruibo), [Jonas Beskow](https://www.kth.se/profile/beskow), [Éva Székely](https://www.kth.se/profile/szekely), and [Gustav Eje Henter](https://people.kth.se/~ghe/)
We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS, that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:
* Is probabilistic
* Has compact memory footprint
* Sounds highly natural
* Is very fast to synthesise from
Check out our [demo page](https://shivammehta25.github.io/Matcha-TTS). Read our [arXiv preprint for more details](https://arxiv.org/abs/2309.03199).
Code is available in our [GitHub repository](https://github.com/shivammehta25/Matcha-TTS), along with pre-trained models.
Cached examples are available at the bottom of the page.
"""
with gr.Blocks(title="🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching") as demo:
processed_text = gr.State(value=None)
processed_text_len = gr.State(value=None)
with gr.Box():
with gr.Row():
gr.Markdown(description, scale=3)
with gr.Column():
gr.Image(LOGO_URL, label="Matcha-TTS logo", height=50, width=50, scale=1, show_label=False)
html = '<br><iframe width="560" height="315" src="https://www.youtube.com/embed/xmvJkz3bqw0?si=jN7ILyDsbPwJCGoa" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>'
gr.HTML(html)
with gr.Box():
radio_options = list(RADIO_OPTIONS.keys())
model_type = gr.Radio(
radio_options, value=radio_options[0], label="Choose a Model", interactive=True, container=False
)
with gr.Row():
gr.Markdown("# Text Input")
with gr.Row():
text = gr.Textbox(value="", lines=2, label="Text to synthesise", scale=3)
spk_slider = gr.Slider(
minimum=0, maximum=107, step=1, value=args.spk, label="Speaker ID", interactive=True, scale=1
)
with gr.Row():
gr.Markdown("### Hyper parameters")
with gr.Row():
n_timesteps = gr.Slider(
label="Number of ODE steps",
minimum=1,
maximum=100,
step=1,
value=10,
interactive=True,
)
length_scale = gr.Slider(
label="Length scale (Speaking rate)",
minimum=0.5,
maximum=1.5,
step=0.05,
value=1.0,
interactive=True,
)
mel_temp = gr.Slider(
label="Sampling temperature",
minimum=0.00,
maximum=2.001,
step=0.16675,
value=0.667,
interactive=True,
)
synth_btn = gr.Button("Synthesise")
with gr.Box():
with gr.Row():
gr.Markdown("### Phonetised text")
phonetised_text = gr.Textbox(interactive=False, scale=10, label="Phonetised text")
with gr.Box():
with gr.Row():
mel_spectrogram = gr.Image(interactive=False, label="mel spectrogram")
# with gr.Row():
audio = gr.Audio(interactive=False, label="Audio")
with gr.Row(visible=False) as example_row_lj_speech:
examples = gr.Examples( # pylint: disable=unused-variable
examples=[
[
"We propose Matcha-TTS, a new approach to non-autoregressive neural TTS, that uses conditional flow matching (similar to rectified flows) to speed up O D E-based speech synthesis.",
50,
0.677,
0.95,
],
[
"The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.",
2,
0.677,
0.95,
],
[
"The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.",
4,
0.677,
0.95,
],
[
"The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.",
10,
0.677,
0.95,
],
[
"The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.",
50,
0.677,
0.95,
],
[
"The narrative of these events is based largely on the recollections of the participants.",
10,
0.677,
0.95,
],
[
"The jury did not believe him, and the verdict was for the defendants.",
10,
0.677,
0.95,
],
],
fn=ljspeech_example_cacher,
inputs=[text, n_timesteps, mel_temp, length_scale],
outputs=[phonetised_text, audio, mel_spectrogram],
cache_examples=True,
)
with gr.Row() as example_row_multispeaker:
multi_speaker_examples = gr.Examples( # pylint: disable=unused-variable
examples=[
[
"Hello everyone! I am speaker 0 and I am here to tell you that Matcha-TTS is amazing!",
10,
0.677,
0.85,
0,
],
[
"Hello everyone! I am speaker 16 and I am here to tell you that Matcha-TTS is amazing!",
10,
0.677,
0.85,
16,
],
[
"Hello everyone! I am speaker 44 and I am here to tell you that Matcha-TTS is amazing!",
50,
0.677,
0.85,
44,
],
[
"Hello everyone! I am speaker 45 and I am here to tell you that Matcha-TTS is amazing!",
50,
0.677,
0.85,
45,
],
[
"Hello everyone! I am speaker 58 and I am here to tell you that Matcha-TTS is amazing!",
4,
0.677,
0.85,
58,
],
],
fn=multispeaker_example_cacher,
inputs=[text, n_timesteps, mel_temp, length_scale, spk_slider],
outputs=[phonetised_text, audio, mel_spectrogram],
cache_examples=True,
label="Multi Speaker Examples",
)
model_type.change(lambda x: gr.update(interactive=False), inputs=[synth_btn], outputs=[synth_btn]).then(
load_model_ui,
inputs=[model_type, text],
outputs=[text, synth_btn, spk_slider, example_row_lj_speech, example_row_multispeaker, length_scale],
)
synth_btn.click(
fn=process_text_gradio,
inputs=[
text,
],
outputs=[phonetised_text, processed_text, processed_text_len],
api_name="matcha_tts",
queue=True,
).then(
fn=synthesise_mel,
inputs=[processed_text, processed_text_len, n_timesteps, mel_temp, length_scale, spk_slider],
outputs=[audio, mel_spectrogram],
)
demo.queue().launch(share=True)
if __name__ == "__main__":
main()
import argparse
import datetime as dt
import os
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
import torch
from matcha.hifigan.config import v1
from matcha.hifigan.denoiser import Denoiser
from matcha.hifigan.env import AttrDict
from matcha.hifigan.models import Generator as HiFiGAN
from matcha.models.matcha_tts import MatchaTTS
from matcha.text import sequence_to_text, text_to_sequence
from matcha.utils.utils import assert_model_downloaded, get_user_data_dir, intersperse
MATCHA_URLS = {
"matcha_ljspeech": "https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/matcha_ljspeech.ckpt",
"matcha_vctk": "https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/matcha_vctk.ckpt",
}
VOCODER_URLS = {
"hifigan_T2_v1": "https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/generator_v1", # Old url: https://drive.google.com/file/d/14NENd4equCBLyyCSke114Mv6YR_j_uFs/view?usp=drive_link
"hifigan_univ_v1": "https://github.com/shivammehta25/Matcha-TTS-checkpoints/releases/download/v1.0/g_02500000", # Old url: https://drive.google.com/file/d/1qpgI41wNXFcH-iKq1Y42JlBC9j0je8PW/view?usp=drive_link
}
MULTISPEAKER_MODEL = {
"matcha_vctk": {"vocoder": "hifigan_univ_v1", "speaking_rate": 0.85, "spk": 0, "spk_range": (0, 107)}
}
SINGLESPEAKER_MODEL = {"matcha_ljspeech": {"vocoder": "hifigan_T2_v1", "speaking_rate": 0.95, "spk": None}}
def plot_spectrogram_to_numpy(spectrogram, filename):
fig, ax = plt.subplots(figsize=(12, 3))
im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
plt.colorbar(im, ax=ax)
plt.xlabel("Frames")
plt.ylabel("Channels")
plt.title("Synthesised Mel-Spectrogram")
fig.canvas.draw()
plt.savefig(filename)
def process_text(i: int, text: str, device: torch.device):
print(f"[{i}] - Input text: {text}")
x = torch.tensor(
intersperse(text_to_sequence(text, ["english_cleaners2"])[0], 0),
dtype=torch.long,
device=device,
)[None]
x_lengths = torch.tensor([x.shape[-1]], dtype=torch.long, device=device)
x_phones = sequence_to_text(x.squeeze(0).tolist())
print(f"[{i}] - Phonetised text: {x_phones[1::2]}")
return {"x_orig": text, "x": x, "x_lengths": x_lengths, "x_phones": x_phones}
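`process_text` interleaves token ID 0 with the phoneme IDs and then prints `x_phones[1::2]` to show only the real symbols. The sketch below illustrates why the `[1::2]` slice recovers the original sequence. `intersperse_sketch` is written here as an assumption about what `intersperse` does, since its body is not part of this section, and the token IDs are arbitrary.

```python
def intersperse_sketch(sequence, item):
    """Place `item` before, between, and after the elements of `sequence`."""
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence  # real tokens land on the odd positions
    return result


token_ids = [12, 7, 31]
with_blanks = intersperse_sketch(token_ids, 0)
recovered = with_blanks[1::2]
```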
def get_texts(args):
if args.text:
texts = [args.text]
else:
with open(args.file, encoding="utf-8") as f:
texts = f.readlines()
return texts
def assert_required_models_available(args):
    save_dir = get_user_data_dir()
    # Use the custom checkpoint when one is given; otherwise fall back to the
    # pretrained model and download it if it is missing.
    if getattr(args, "checkpoint_path", None) is not None:
        model_path = args.checkpoint_path
    else:
        model_path = save_dir / f"{args.model}.ckpt"
        assert_model_downloaded(model_path, MATCHA_URLS[args.model])

    vocoder_path = save_dir / f"{args.vocoder}"
    assert_model_downloaded(vocoder_path, VOCODER_URLS[args.vocoder])
    return {"matcha": model_path, "vocoder": vocoder_path}
def load_hifigan(checkpoint_path, device):
h = AttrDict(v1)
hifigan = HiFiGAN(h).to(device)
hifigan.load_state_dict(torch.load(checkpoint_path, map_location=device)["generator"])
_ = hifigan.eval()
hifigan.remove_weight_norm()
return hifigan
def load_vocoder(vocoder_name, checkpoint_path, device):
print(f"[!] Loading {vocoder_name}!")
vocoder = None
if vocoder_name in ("hifigan_T2_v1", "hifigan_univ_v1"):
vocoder = load_hifigan(checkpoint_path, device)
else:
raise NotImplementedError(
f"Vocoder {vocoder_name} not implemented! define a load_<<vocoder_name>> method for it"
)
denoiser = Denoiser(vocoder, mode="zeros")
print(f"[+] {vocoder_name} loaded!")
return vocoder, denoiser
def load_matcha(model_name, checkpoint_path, device):
print(f"[!] Loading {model_name}!")
model = MatchaTTS.load_from_checkpoint(checkpoint_path, map_location=device)
_ = model.eval()
print(f"[+] {model_name} loaded!")
return model
def to_waveform(mel, vocoder, denoiser=None, denoiser_strength=0.00025):
audio = vocoder(mel).clamp(-1, 1)
if denoiser is not None:
audio = denoiser(audio.squeeze(), strength=denoiser_strength).cpu().squeeze()
return audio.cpu().squeeze()
def save_to_folder(filename: str, output: dict, folder: str):
    folder = Path(folder)
    folder.mkdir(exist_ok=True, parents=True)
    # Save the spectrogram plot next to the other outputs instead of the current directory
    plot_spectrogram_to_numpy(np.array(output["mel"].squeeze().float().cpu()), folder / f"{filename}.png")
    np.save(folder / f"{filename}", output["mel"].cpu().numpy())
    sf.write(folder / f"{filename}.wav", output["waveform"], 22050, "PCM_24")
    return folder.resolve() / f"{filename}.wav"
def validate_args(args):
assert (
args.text or args.file
), "Either text or file must be provided. Matcha-T(ea)TS needs some text to whisk the waveforms."
assert args.temperature >= 0, "Sampling temperature cannot be negative"
assert args.steps > 0, "Number of ODE steps must be greater than 0"
if args.checkpoint_path is None:
# When using pretrained models
if args.model in SINGLESPEAKER_MODEL:
args = validate_args_for_single_speaker_model(args)
if args.model in MULTISPEAKER_MODEL:
args = validate_args_for_multispeaker_model(args)
else:
# When using a custom model
if args.vocoder != "hifigan_univ_v1":
warn_ = "[-] Using custom model checkpoint! I would suggest passing --vocoder hifigan_univ_v1, unless the custom model is trained on LJ Speech."
warnings.warn(warn_, UserWarning)
if args.speaking_rate is None:
args.speaking_rate = 1.0
if args.batched:
assert args.batch_size > 0, "Batch size must be greater than 0"
assert args.speaking_rate > 0, "Speaking rate must be greater than 0"
return args
def validate_args_for_multispeaker_model(args):
if args.vocoder is not None:
if args.vocoder != MULTISPEAKER_MODEL[args.model]["vocoder"]:
warn_ = f"[-] Using {args.model} model! I would suggest passing --vocoder {MULTISPEAKER_MODEL[args.model]['vocoder']}"
warnings.warn(warn_, UserWarning)
else:
args.vocoder = MULTISPEAKER_MODEL[args.model]["vocoder"]
if args.speaking_rate is None:
args.speaking_rate = MULTISPEAKER_MODEL[args.model]["speaking_rate"]
spk_range = MULTISPEAKER_MODEL[args.model]["spk_range"]
if args.spk is not None:
assert (
args.spk >= spk_range[0] and args.spk <= spk_range[-1]
), f"Speaker ID must be between {spk_range} for this model."
else:
available_spk_id = MULTISPEAKER_MODEL[args.model]["spk"]
warn_ = f"[!] Speaker ID not provided! Using speaker ID {available_spk_id}"
warnings.warn(warn_, UserWarning)
args.spk = available_spk_id
return args
def validate_args_for_single_speaker_model(args):
if args.vocoder is not None:
if args.vocoder != SINGLESPEAKER_MODEL[args.model]["vocoder"]:
warn_ = f"[-] Using {args.model} model! I would suggest passing --vocoder {SINGLESPEAKER_MODEL[args.model]['vocoder']}"
warnings.warn(warn_, UserWarning)
else:
args.vocoder = SINGLESPEAKER_MODEL[args.model]["vocoder"]
if args.speaking_rate is None:
args.speaking_rate = SINGLESPEAKER_MODEL[args.model]["speaking_rate"]
if args.spk != SINGLESPEAKER_MODEL[args.model]["spk"]:
warn_ = f"[-] Ignoring speaker id {args.spk} for {args.model}"
warnings.warn(warn_, UserWarning)
args.spk = SINGLESPEAKER_MODEL[args.model]["spk"]
return args
@torch.inference_mode()
def cli():
parser = argparse.ArgumentParser(
description=" 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching"
)
parser.add_argument(
"--model",
type=str,
default="matcha_ljspeech",
help="Model to use",
choices=MATCHA_URLS.keys(),
)
parser.add_argument(
"--checkpoint_path",
type=str,
default=None,
help="Path to the custom model checkpoint",
)
parser.add_argument(
"--vocoder",
type=str,
default=None,
help="Vocoder to use (default: will use the one suggested with the pretrained model)",
choices=VOCODER_URLS.keys(),
)
parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
parser.add_argument("--file", type=str, default=None, help="Text file to synthesize")
parser.add_argument("--spk", type=int, default=None, help="Speaker ID")
parser.add_argument(
"--temperature",
type=float,
default=0.667,
help="Variance of the x0 noise (default: 0.667)",
)
parser.add_argument(
"--speaking_rate",
type=float,
default=None,
help="Change the speaking rate; a higher value means a slower speaking rate (default: the pretrained model's suggested rate, or 1.0 for a custom checkpoint)",
)
parser.add_argument("--steps", type=int, default=10, help="Number of ODE steps (default: 10)")
parser.add_argument("--cpu", action="store_true", help="Use CPU for inference (default: use GPU if available)")
parser.add_argument(
"--denoiser_strength",
type=float,
default=0.00025,
help="Strength of the vocoder bias denoiser (default: 0.00025)",
)
parser.add_argument(
"--output_folder",
type=str,
default=os.getcwd(),
help="Output folder to save results (default: current dir)",
)
parser.add_argument("--batched", action="store_true", help="Batched inference (default: False)")
parser.add_argument(
"--batch_size", type=int, default=32, help="Batch size only useful when --batched (default: 32)"
)
args = parser.parse_args()
args = validate_args(args)
device = get_device(args)
print_config(args)
paths = assert_required_models_available(args)
if args.checkpoint_path is not None:
print(f"[🍵] Loading custom model from {args.checkpoint_path}")
paths["matcha"] = args.checkpoint_path
args.model = "custom_model"
model = load_matcha(args.model, paths["matcha"], device)
vocoder, denoiser = load_vocoder(args.vocoder, paths["vocoder"], device)
texts = get_texts(args)
spk = torch.tensor([args.spk], device=device, dtype=torch.long) if args.spk is not None else None
if len(texts) == 1 or not args.batched:
unbatched_synthesis(args, device, model, vocoder, denoiser, texts, spk)
else:
batched_synthesis(args, device, model, vocoder, denoiser, texts, spk)
class BatchedSynthesisDataset(torch.utils.data.Dataset):
def __init__(self, processed_texts):
self.processed_texts = processed_texts
def __len__(self):
return len(self.processed_texts)
def __getitem__(self, idx):
return self.processed_texts[idx]
def batched_collate_fn(batch):
x = []
x_lengths = []
for b in batch:
x.append(b["x"].squeeze(0))
x_lengths.append(b["x_lengths"])
x = torch.nn.utils.rnn.pad_sequence(x, batch_first=True)
x_lengths = torch.concat(x_lengths, dim=0)
return {"x": x, "x_lengths": x_lengths}
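`batched_collate_fn` pads every token sequence in a batch to the length of the longest one and keeps the true lengths alongside. The same idea on plain lists, as a sketch only: `pad_batch` is a hypothetical helper, and padding with 0 is assumed here to mirror `pad_sequence`'s default padding value.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad each sequence to the longest length; also return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = pad_batch([[5, 6, 7], [8], [9, 10]])
```

The lengths are what allows downstream code to ignore the padded positions.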
def batched_synthesis(args, device, model, vocoder, denoiser, texts, spk):
    total_rtf = []
    total_rtf_w = []
    processed_text = [process_text(i, text, "cpu") for i, text in enumerate(texts)]
    dataloader = torch.utils.data.DataLoader(
        BatchedSynthesisDataset(processed_text),
        batch_size=args.batch_size,
        collate_fn=batched_collate_fn,
        num_workers=8,
    )
    for i, batch in enumerate(dataloader):
        i = i + 1
        start_t = dt.datetime.now()
        b = batch["x"].shape[0]
        output = model.synthesise(
            batch["x"].to(device),
            batch["x_lengths"].to(device),
            n_timesteps=args.steps,
            temperature=args.temperature,
            spks=spk.expand(b) if spk is not None else spk,
            length_scale=args.speaking_rate,
        )

        output["waveform"] = to_waveform(output["mel"], vocoder, denoiser, args.denoiser_strength)
        # RTF including the vocoder
        t = (dt.datetime.now() - start_t).total_seconds()
        rtf_w = t * 22050 / (output["waveform"].shape[-1])
        print(f"[🍵-Batch: {i}] Matcha-TTS RTF: {output['rtf']:.4f}")
        print(f"[🍵-Batch: {i}] Matcha-TTS + VOCODER RTF: {rtf_w:.4f}")
        total_rtf.append(output["rtf"])
        total_rtf_w.append(rtf_w)
        for j in range(output["mel"].shape[0]):
            # Number utterances across batches so later batches do not overwrite earlier files
            utt = (i - 1) * args.batch_size + j
            base_name = f"utterance_{utt:03d}_speaker_{args.spk:03d}" if args.spk is not None else f"utterance_{utt:03d}"
            length = output["mel_lengths"][j]
            # Trim padding: one mel frame corresponds to 256 waveform samples
            new_dict = {"mel": output["mel"][j][:, :length], "waveform": output["waveform"][j][: length * 256]}
            location = save_to_folder(base_name, new_dict, args.output_folder)
            print(f"[🍵-{utt}] Waveform saved: {location}")

    print("".join(["="] * 100))
    print(f"[🍵] Average Matcha-TTS RTF: {np.mean(total_rtf):.4f} ± {np.std(total_rtf)}")
    print(f"[🍵] Average Matcha-TTS + VOCODER RTF: {np.mean(total_rtf_w):.4f} ± {np.std(total_rtf_w)}")
    print("[🍵] Enjoy the freshly whisked 🍵 Matcha-TTS!")
def unbatched_synthesis(args, device, model, vocoder, denoiser, texts, spk):
total_rtf = []
total_rtf_w = []
for i, text in enumerate(texts):
i = i + 1
base_name = f"utterance_{i:03d}_speaker_{args.spk:03d}" if args.spk is not None else f"utterance_{i:03d}"
print("".join(["="] * 100))
text = text.strip()
text_processed = process_text(i, text, device)
print(f"[🍵] Whisking Matcha-T(ea)TS for: {i}")
start_t = dt.datetime.now()
output = model.synthesise(
text_processed["x"],
text_processed["x_lengths"],
n_timesteps=args.steps,
temperature=args.temperature,
spks=spk,
length_scale=args.speaking_rate,
)
output["waveform"] = to_waveform(output["mel"], vocoder, denoiser, args.denoiser_strength)
# RTF with HiFiGAN
t = (dt.datetime.now() - start_t).total_seconds()
rtf_w = t * 22050 / (output["waveform"].shape[-1])
print(f"[🍵-{i}] Matcha-TTS RTF: {output['rtf']:.4f}")
print(f"[🍵-{i}] Matcha-TTS + VOCODER RTF: {rtf_w:.4f}")
total_rtf.append(output["rtf"])
total_rtf_w.append(rtf_w)
location = save_to_folder(base_name, output, args.output_folder)
print(f"[+] Waveform saved: {location}")
print("".join(["="] * 100))
print(f"[🍵] Average Matcha-TTS RTF: {np.mean(total_rtf):.4f} ± {np.std(total_rtf)}")
print(f"[🍵] Average Matcha-TTS + VOCODER RTF: {np.mean(total_rtf_w):.4f} ± {np.std(total_rtf_w)}")
print("[🍵] Enjoy the freshly whisked 🍵 Matcha-TTS!")
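Both synthesis functions report a real-time factor (RTF) computed as `t * 22050 / num_samples`, that is, processing time divided by the duration of the generated audio at 22050 Hz. A small worked sketch follows; `real_time_factor` is a hypothetical helper and the timings are invented.

```python
SAMPLE_RATE = 22050  # sample rate used throughout the synthesis code


def real_time_factor(elapsed_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


# Invented example: 0.5 s of compute for 2 s of audio
rtf = real_time_factor(elapsed_seconds=0.5, num_samples=2 * SAMPLE_RATE)
```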
def print_config(args):
print("[!] Configurations: ")
print(f"\t- Model: {args.model}")
print(f"\t- Vocoder: {args.vocoder}")
print(f"\t- Temperature: {args.temperature}")
print(f"\t- Speaking rate: {args.speaking_rate}")
print(f"\t- Number of ODE steps: {args.steps}")
print(f"\t- Speaker: {args.spk}")
def get_device(args):
if torch.cuda.is_available() and not args.cpu:
print("[+] GPU Available! Using GPU")
device = torch.device("cuda")
else:
print("[-] GPU not available or forced CPU run! Using CPU")
device = torch.device("cpu")
return device
if __name__ == "__main__":
cli()
import random
from pathlib import Path
from typing import Any, Dict, Optional
import numpy as np
import torch
import torchaudio as ta
from lightning import LightningDataModule
from torch.utils.data.dataloader import DataLoader
from matcha.text import text_to_sequence
from matcha.utils.audio import mel_spectrogram
from matcha.utils.model import fix_len_compatibility, normalize
from matcha.utils.utils import intersperse
def parse_filelist(filelist_path, split_char="|"):
with open(filelist_path, encoding="utf-8") as f:
filepaths_and_text = [line.strip().split(split_char) for line in f]
return filepaths_and_text
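`parse_filelist` splits each line on `|`. Judging from `get_datapoint`, multi-speaker lists carry `path|speaker_id|text` and single-speaker lists carry `path|text`. A self-contained sketch with an invented two-line multi-speaker filelist follows; `parse_filelist_sketch` repeats the function above so the example runs on its own.

```python
import tempfile
from pathlib import Path


def parse_filelist_sketch(filelist_path, split_char="|"):
    """Same logic as parse_filelist: one list of fields per line."""
    with open(filelist_path, encoding="utf-8") as f:
        return [line.strip().split(split_char) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    filelist = Path(tmp) / "train.txt"
    # Invented entries in the multi-speaker layout: path|speaker_id|text
    filelist.write_text(
        "wavs/a.wav|3|Hello there.\nwavs/b.wav|7|Good morning.\n",
        encoding="utf-8",
    )
    rows = parse_filelist_sketch(filelist)

# Fields arrive as strings, so the speaker ID needs an explicit int() conversion
filepath, spk, text = rows[0][0], int(rows[0][1]), rows[0][2]
```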
class TextMelDataModule(LightningDataModule):
def __init__( # pylint: disable=unused-argument
self,
name,
train_filelist_path,
valid_filelist_path,
batch_size,
num_workers,
pin_memory,
cleaners,
add_blank,
n_spks,
n_fft,
n_feats,
sample_rate,
hop_length,
win_length,
f_min,
f_max,
data_statistics,
seed,
load_durations,
):
super().__init__()
# this line allows access to init params through the 'self.hparams' attribute
# also ensures init params will be stored in ckpt
self.save_hyperparameters(logger=False)
def setup(self, stage: Optional[str] = None): # pylint: disable=unused-argument
"""Load data. Set variables: `self.data_train`, `self.data_val`, `self.data_test`.
This method is called by lightning with both `trainer.fit()` and `trainer.test()`, so be
careful not to execute things like random split twice!
"""
# load and split datasets only if not loaded already
self.trainset = TextMelDataset( # pylint: disable=attribute-defined-outside-init
self.hparams.train_filelist_path,
self.hparams.n_spks,
self.hparams.cleaners,
self.hparams.add_blank,
self.hparams.n_fft,
self.hparams.n_feats,
self.hparams.sample_rate,
self.hparams.hop_length,
self.hparams.win_length,
self.hparams.f_min,
self.hparams.f_max,
self.hparams.data_statistics,
self.hparams.seed,
self.hparams.load_durations,
)
self.validset = TextMelDataset( # pylint: disable=attribute-defined-outside-init
self.hparams.valid_filelist_path,
self.hparams.n_spks,
self.hparams.cleaners,
self.hparams.add_blank,
self.hparams.n_fft,
self.hparams.n_feats,
self.hparams.sample_rate,
self.hparams.hop_length,
self.hparams.win_length,
self.hparams.f_min,
self.hparams.f_max,
self.hparams.data_statistics,
self.hparams.seed,
self.hparams.load_durations,
)
def train_dataloader(self):
return DataLoader(
dataset=self.trainset,
batch_size=self.hparams.batch_size,
num_workers=self.hparams.num_workers,
pin_memory=self.hparams.pin_memory,
shuffle=True,
collate_fn=TextMelBatchCollate(self.hparams.n_spks),
)
def val_dataloader(self):
return DataLoader(
dataset=self.validset,
batch_size=self.hparams.batch_size,
num_workers=self.hparams.num_workers,
pin_memory=self.hparams.pin_memory,
shuffle=False,
collate_fn=TextMelBatchCollate(self.hparams.n_spks),
)
def teardown(self, stage: Optional[str] = None):
"""Clean up after fit or test."""
pass # pylint: disable=unnecessary-pass
def state_dict(self):
"""Extra things to save to checkpoint."""
return {}
def load_state_dict(self, state_dict: Dict[str, Any]):
"""Things to do when loading checkpoint."""
pass # pylint: disable=unnecessary-pass
class TextMelDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        filelist_path,
        n_spks,
        cleaners,
        add_blank=True,
        n_fft=1024,
        n_mels=80,
        sample_rate=22050,
        hop_length=256,
        win_length=1024,
        f_min=0.0,
        f_max=8000,
        data_parameters=None,
        seed=None,
        load_durations=False,
    ):
        self.filepaths_and_text = parse_filelist(filelist_path)
        self.n_spks = n_spks
        self.cleaners = cleaners
        self.add_blank = add_blank
        self.n_fft = n_fft
        self.n_mels = n_mels
        self.sample_rate = sample_rate
        self.hop_length = hop_length
        self.win_length = win_length
        self.f_min = f_min
        self.f_max = f_max
        self.load_durations = load_durations

        # Fall back to identity normalization when no statistics are given.
        if data_parameters is not None:
            self.data_parameters = data_parameters
        else:
            self.data_parameters = {"mel_mean": 0, "mel_std": 1}

        # Shuffle the file list once, reproducibly for a given seed.
        random.seed(seed)
        random.shuffle(self.filepaths_and_text)

    def get_datapoint(self, filepath_and_text):
        # Multi-speaker file lists carry a speaker id as the second field.
        if self.n_spks > 1:
            filepath, spk, text = (
                filepath_and_text[0],
                int(filepath_and_text[1]),
                filepath_and_text[2],
            )
        else:
            filepath, text = filepath_and_text[0], filepath_and_text[1]
            spk = None

        text, cleaned_text = self.get_text(text, add_blank=self.add_blank)
        mel = self.get_mel(filepath)
        durations = self.get_durations(filepath, text) if self.load_durations else None

        return {"x": text, "y": mel, "spk": spk, "filepath": filepath, "x_text": cleaned_text, "durations": durations}

    def get_durations(self, filepath, text):
        filepath = Path(filepath)
        data_dir, name = filepath.parent.parent, filepath.stem
        # Durations are expected at <data_dir>/durations/<name>.npy.
        dur_loc = data_dir / "durations" / f"{name}.npy"

        try:
            durs = torch.from_numpy(np.load(dur_loc).astype(int))
        except FileNotFoundError as e:
            raise FileNotFoundError(
                f"Tried loading the durations but durations didn't exist at {dur_loc}, make sure you've generated the durations first using: python matcha/utils/get_durations_from_trained_model.py \n"
            ) from e

        assert len(durs) == len(text), f"Length of durations {len(durs)} and text {len(text)} do not match"

        return durs

    def get_mel(self, filepath):
        audio, sr = ta.load(filepath)
        assert sr == self.sample_rate, f"Expected sample rate {self.sample_rate}, got {sr}"
        mel = mel_spectrogram(
            audio,
            self.n_fft,
            self.n_mels,
            self.sample_rate,
            self.hop_length,
            self.win_length,
            self.f_min,
            self.f_max,
            center=False,
        ).squeeze()
        mel = normalize(mel, self.data_parameters["mel_mean"], self.data_parameters["mel_std"])
        return mel

    def get_text(self, text, add_blank=True):
        text_norm, cleaned_text = text_to_sequence(text, self.cleaners)
        # Honor the argument; the only caller in this class passes self.add_blank.
        if add_blank:
            text_norm = intersperse(text_norm, 0)
        text_norm = torch.IntTensor(text_norm)
        return text_norm, cleaned_text

    def __getitem__(self, index):
        datapoint = self.get_datapoint(self.filepaths_and_text[index])
        return datapoint

    def __len__(self):
        return len(self.filepaths_and_text)
class TextMelBatchCollate:
    def __init__(self, n_spks):
        self.n_spks = n_spks

    def __call__(self, batch):
        B = len(batch)
        y_max_length = max([item["y"].shape[-1] for item in batch])  # pylint: disable=consider-using-generator
        y_max_length = fix_len_compatibility(y_max_length)
        x_max_length = max([item["x"].shape[-1] for item in batch])  # pylint: disable=consider-using-generator
        n_feats = batch[0]["y"].shape[-2]

        # Zero-initialized buffers; shorter items stay zero-padded on the right.
        y = torch.zeros((B, n_feats, y_max_length), dtype=torch.float32)
        x = torch.zeros((B, x_max_length), dtype=torch.long)
        durations = torch.zeros((B, x_max_length), dtype=torch.long)

        y_lengths, x_lengths = [], []
        spks = []
        filepaths, x_texts = [], []
        for i, item in enumerate(batch):
            y_, x_ = item["y"], item["x"]
            y_lengths.append(y_.shape[-1])
            x_lengths.append(x_.shape[-1])
            y[i, :, : y_.shape[-1]] = y_
            x[i, : x_.shape[-1]] = x_
            spks.append(item["spk"])
            filepaths.append(item["filepath"])
            x_texts.append(item["x_text"])
            if item["durations"] is not None:
                durations[i, : item["durations"].shape[-1]] = item["durations"]

        y_lengths = torch.tensor(y_lengths, dtype=torch.long)
        x_lengths = torch.tensor(x_lengths, dtype=torch.long)
        spks = torch.tensor(spks, dtype=torch.long) if self.n_spks > 1 else None

        return {
            "x": x,
            "x_lengths": x_lengths,
            "y": y,
            "y_lengths": y_lengths,
            "spks": spks,
            "filepaths": filepaths,
            "x_texts": x_texts,
            # An all-zero durations tensor means no durations were loaded.
            "durations": durations if not torch.eq(durations, 0).all() else None,
        }
MIT License
Copyright (c) 2020 Jungil Kong
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.
**Abstract :**
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with comparable quality to an autoregressive counterpart.
Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.
## Pre-requisites
1. Python >= 3.6
2. Clone this repository.
3. Install the Python requirements. Please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/).
Then move all wav files to `LJSpeech-1.1/wavs`.
## Training
```
python train.py --config config_v1.json
```
To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path by adding the `--checkpoint_path` option.
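
As a rough sketch, the training invocations described above can be assembled programmatically. The helper name `build_train_command` is hypothetical and not part of this repository; only the `--config` and `--checkpoint_path` options and the `config_v1.json`, `config_v2.json` and `config_v3.json` file names are taken from this README.

```python
from typing import List, Optional


def build_train_command(version: int = 1, checkpoint_path: Optional[str] = None) -> List[str]:
    """Build the argument list for train.py (hypothetical helper, for illustration only)."""
    if version not in (1, 2, 3):
        raise ValueError("generator version must be 1, 2 or 3")
    cmd = ["python", "train.py", "--config", f"config_v{version}.json"]
    # Only add the option when a non-default checkpoint directory is wanted.
    if checkpoint_path is not None:
        cmd += ["--checkpoint_path", checkpoint_path]
    return cmd


if __name__ == "__main__":
    # The directory name below is an arbitrary example.
    print(" ".join(build_train_command(2, "cp_hifigan_v2")))
```

The resulting list can be passed to `subprocess.run` from the repository root if you prefer launching training from a script.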
Validation loss during training with V1 generator.<br>
![validation loss](./validation_loss.png)
## Pretrained Model
You can also use the pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:
| Folder Name | Generator | Dataset | Fine-Tuned |
| ------------ | --------- | --------- | ------------------------------------------------------ |
| LJ_V1 | V1 | LJSpeech | No |
| LJ_V2 | V2 | LJSpeech | No |
| LJ_V3 | V3 | LJSpeech | No |
| LJ_FT_T2_V1 | V1 | LJSpeech | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| LJ_FT_T2_V2 | V2 | LJSpeech | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| LJ_FT_T2_V3 | V3 | LJSpeech | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| VCTK_V1 | V1 | VCTK | No |
| VCTK_V2 | V2 | VCTK | No |
| VCTK_V3 | V3 | VCTK | No |
| UNIVERSAL_V1 | V1 | Universal | No |
We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
## Fine-Tuning
1. Generate mel-spectrograms in numpy format using [Tacotron2](https://github.com/NVIDIA/tacotron2) with teacher-forcing.<br/>
The file name of each generated mel-spectrogram should match that of its audio file, with the extension `.npy`.<br/>
Example:
` Audio File : LJ001-0001.wav
Mel-Spectrogram File : LJ001-0001.npy`
2. Create an `ft_dataset` folder and copy the generated mel-spectrogram files into it.<br/>
3. Run the following command.
```
python train.py --fine_tuning True --config config_v1.json
```
For other command line options, please refer to the training section.
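
Before starting a fine-tuning run, it can help to confirm that every audio file has a matching mel-spectrogram file, following the naming rule from step 1. The snippet below is a minimal sketch using only the standard library; the function name `find_missing_mels` is hypothetical, and the `LJSpeech-1.1/wavs` and `ft_dataset` paths are the ones mentioned in this README.

```python
from pathlib import Path
from typing import List


def find_missing_mels(wav_dir: str, mel_dir: str = "ft_dataset") -> List[str]:
    """Return the stems of wav files that have no matching .npy file in mel_dir."""
    # Collect the base names (without extension) of all mel-spectrogram files.
    mel_stems = {p.stem for p in Path(mel_dir).glob("*.npy")}
    # A wav file is covered only if a .npy file with the same base name exists.
    return sorted(p.stem for p in Path(wav_dir).glob("*.wav") if p.stem not in mel_stems)


if __name__ == "__main__":
    missing = find_missing_mels("LJSpeech-1.1/wavs")
    print(f"{len(missing)} wav files have no matching mel-spectrogram")
```

For example, `LJ001-0001.wav` counts as covered only if `ft_dataset/LJ001-0001.npy` exists.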
## Inference from wav file
1. Create a `test_files` directory and copy the wav files into it.
2. Run the following command.
` python inference.py --checkpoint_file [generator checkpoint file path]`
Generated wav files are saved in `generated_files` by default.<br>
You can change the path by adding the `--output_dir` option.
## Inference for end-to-end speech synthesis
1. Create a `test_mel_files` directory and copy the generated mel-spectrogram files into it.<br>
You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
[Glow-TTS](https://github.com/jaywalnut310/glow-tts) and so forth.
2. Run the following command.
` python inference_e2e.py --checkpoint_file [generator checkpoint file path]`
Generated wav files are saved in `generated_files_from_mel` by default.<br>
You can change the path by adding the `--output_dir` option.
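
Step 1 can be scripted. The following is a minimal sketch, assuming the mel-spectrograms were saved as `.npy` files (as in the fine-tuning section); the function name `stage_mel_files` and the source directory in the usage line are hypothetical.

```python
import shutil
from pathlib import Path


def stage_mel_files(src_dir: str, dst_dir: str = "test_mel_files") -> int:
    """Copy every .npy file from src_dir into dst_dir and return how many were copied."""
    dst = Path(dst_dir)
    # Create the target directory if it does not exist yet.
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for mel_path in sorted(Path(src_dir).glob("*.npy")):
        shutil.copy2(mel_path, dst / mel_path.name)
        copied += 1
    return copied


if __name__ == "__main__":
    # "my_tts_outputs" is a placeholder for wherever your acoustic model wrote its mels.
    print(stage_mel_files("my_tts_outputs"))
```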
## Acknowledgements
We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips)
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.