Commit 4130a52d authored by changhl

init model

parent eb6a18fd
# Tacotron2_pytorch
## Paper
- https://arxiv.org/pdf/1712.05884
## Open-Source Code
- https://github.com/NVIDIA/tacotron2
## Model Architecture
Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. The model consists of two main components:
- Spectrogram prediction network: an Encoder-Attention-Decoder network that predicts a sequence of mel-spectrogram frames from the input character sequence
- Vocoder: a modified WaveNet that generates a time-domain waveform from the predicted mel-spectrogram frames
<div align="center">
<img src="./images/architecture.png"/>
</div>
## Algorithm
Compared with the original Tacotron, Tacotron2 replaces the plain RNN with LSTM units. The forget, input, and output gates of the LSTM mitigate the vanishing-gradient problem, improving how well the model retains information during back-propagation and thereby raising the quality of the synthesized speech.
<div align="center">
<img src="./images/algorithm.png"/>
</div>
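The two stages above map directly onto the SpeechBrain inference API used by this repository's `inference.py`; a minimal sketch of the pipeline (the checkpoint paths are placeholders for the pretrained weights described below):
```python
import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Stage 1: spectrogram prediction network -- characters -> mel-spectrogram frames
tacotron2 = Tacotron2.from_hparams(source="/path/to/tacotron2_ljspeech",  # placeholder path
                                   run_opts={"device": "cuda"})
mel_output, mel_length, alignment = tacotron2.encode_text("hi, nice to meet you")

# Stage 2: vocoder -- mel-spectrogram frames -> time-domain waveform
hifi_gan = HIFIGAN.from_hparams(source="/path/to/hifigan_ljspeech",       # placeholder path
                                run_opts={"device": "cuda"})
waveforms = hifi_gan.decode_batch(mel_output)

# The LJSpeech models operate at a 22050 Hz sampling rate
torchaudio.save("example.wav", waveforms.squeeze(1).cpu(), 22050)
```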
## Environment Setup
### Docker (Option 1)
**Note: adjust the path arguments to your environment**
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it --network=host --ipc=host --name=your_container_name --shm-size=32G --device=/dev/kfd --device=/dev/mkfd --device=/dev/dri -v /opt/hyhal:/opt/hyhal:ro -v /path/your_code_data/:/path/your_code_data/ --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
cd /path/your_code_data/
pip3 install -r requirements.txt
```
### Dockerfile (Option 2)
```
cd ./docker
docker build --no-cache -t tacotron2 .
docker run -it -v /path/your_code_data/:/path/your_code_data/ --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
pip3 install -r requirements.txt
```
### Anaconda (Option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the Guanghe developer community: https://developer.hpccube.com/tool/
```
DTK software stack: dtk24.04.1
python:python3.10
torch:2.1.0
torchvision:0.16.0
torchaudio: 2.1.2
```
Tips: the DTK stack, python, torch, and the other DCU-related tool versions listed above must match each other exactly.
2. Install the remaining (non-DCU-specific) libraries from requirements.txt:
```
pip3 install -r requirements.txt
```
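After installation, a quick sanity check can confirm that the stack versions match the table above and that the DCU is visible to PyTorch (on DCU/ROCm builds the device is still addressed through the `torch.cuda` API, as in this repository's inference script):
```python
import torch
import torchaudio
import torchvision

# Expected per the table above: torch 2.1.0, torchvision 0.16.0, torchaudio 2.1.2
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)

# On DCU the HIP backend is exposed through the CUDA-style API
print("device available:", torch.cuda.is_available())
```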
## Dataset
- SCNet fast download link:
  - [LJSpeech dataset download](http://113.200.138.88:18080/aidatasets/lj_speech)
- Official download link:
  - [LJSpeech dataset download](https://keithito.com/LJ-Speech-Dataset/)
```LJSpeech-1.1```: a speech synthesis dataset containing audio and text; the audio is in wav format and the transcripts are stored in a csv file.
```
├── LJSpeech-1.1
│ ├──wav
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ ├──metadata.csv
│ ├──README
```
- LJSpeech
  - wav: audio data directory
    - LJ001-0001.wav: audio file
    - LJ001-0002.wav: audio file
    - ...
  - metadata.csv: transcript file (see the sketch below)
    - column 1: audio file name
    - column 2: raw transcript
    - column 3: normalized transcript
  - README: documentation
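For reference, LJSpeech's `metadata.csv` is pipe-delimited, with one line per utterance; a minimal sketch of reading the three columns described above (the dataset path is a placeholder):
```python
# Minimal sketch: iterate over the LJSpeech transcript file (path is a placeholder)
metadata_path = "/path/to/LJSpeech-1.1/metadata.csv"

with open(metadata_path, encoding="utf-8") as f:
    for line in f:
        # Each line: <audio file name>|<raw transcript>|<normalized transcript>
        file_id, text, normalized_text = line.rstrip("\n").split("|")
        print(file_id, normalized_text)
        break  # only show the first utterance
```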
## Pretrained Models
**Download the pretrained weights before running inference**
- SCNet download links:
  - [tacotron2 model weights](http://113.200.138.88:18080/aimodels/tacotron2_ljspeech)
  - [hifigan model weights](http://113.200.138.88:18080/aimodels/hifigan_ljspeech)
- Official download links:
  - [tacotron2 model weights](https://hf-mirror.com/speechbrain/tts-tacotron2-ljspeech)
  - [hifigan model weights](https://hf-mirror.com/speechbrain/tts-hifigan-ljspeech)
## Training
**Make sure the current working directory is tacotron2_pytorch and set the visible cards**
### Single card
```
export HIP_VISIBLE_DEVICES=<card_id>   # select the visible DCU card, e.g. 0
bash train_s.sh $dataset_path $save_path
```
- $dataset_path: dataset path
- $save_path: directory for saving training checkpoints
### Multi-card
```
export HIP_VISIBLE_DEVICES=<card_ids>   # select the visible DCU cards, e.g. 0,1,2,3
bash train_m.sh $dataset_path $save_path
```
- $dataset_path: dataset path
- $save_path: directory for saving training checkpoints
## Inference
```
export HIP_VISIBLE_DEVICES=<card_id>   # select the visible DCU card, e.g. 0
python3 inference.py -m modelpath_tacotron2 -v modelpath_hifigan -t "hi, nice to meet you"
```
- -m: path to the tacotron2 model weights
- -v: path to the hifigan model weights
- -t: input text
- -res: path for saving the resulting wav file
## Result
```
Input: "hi, nice to meet you"
Output: ./res/example.wav
```
## Application Scenarios
### Algorithm Category
```
Speech synthesis
```
### Key Application Industries
```
Finance, telecommunications, broadcast media
```
## Source Repository and Issue Feedback
https://developer.hpccube.com/codes/modelzoo/tacotron2_pytorch
## References
[GitHub - NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2)
[HF - speechbrain/tts-tacotron2-ljspeech](https://hf-mirror.com/speechbrain/tts-tacotron2-ljspeech)
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
RUN source /opt/dtk/env.sh
import argparse
import os

import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN


def parse_opt(known=False):
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--model-path', type=str, default="", help="the tacotron2 model path")
    parser.add_argument('-v', '--vocoder-path', type=str, default="", help="the vocoder model path")
    parser.add_argument('-t', '--text', type=str, default="Autumn, the season of change.", help="input text")
    parser.add_argument('-res', '--result_path', type=str, default="./res", help="the path to save wav file")
    opt = parser.parse_known_args()[0] if known else parser.parse_args()
    return opt


def main(opt):
    # Load the Tacotron2 acoustic model and the HiFi-GAN vocoder
    tacotron2 = Tacotron2.from_hparams(source=opt.model_path, run_opts={"device": "cuda"})
    hifi_gan = HIFIGAN.from_hparams(source=opt.vocoder_path, run_opts={"device": "cuda"})

    # Running the TTS: text -> mel spectrogram
    mel_output, mel_length, alignment = tacotron2.encode_text(opt.text)

    # Running the vocoder: mel spectrogram -> waveform
    waveforms = hifi_gan.decode_batch(mel_output)

    # Save the waveform (the LJSpeech models use a 22050 Hz sampling rate)
    os.makedirs(opt.result_path, exist_ok=True)
    torchaudio.save(os.path.join(opt.result_path, 'example.wav'), waveforms.squeeze(1).cpu(), 22050)


if __name__ == "__main__":
    main(opt=parse_opt())
# Model code
modelCode=917
# Model name
modelName=tacotron2_pytorch
# Model description
modelDescription=Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017.
# Application scenarios (separate multiple tags with commas)
appScenario=Training,Inference,Speech synthesis,Finance,Telecommunications,Broadcast media
# Framework type (separate multiple tags with commas)
frameType=PyTorch
soundfile==0.12.1
librosa==0.10.2.post1
speechbrain==1.0.0
hyperpyyaml>=0.0.1
joblib>=0.14.1
pre-commit>=2.3.0
pygtrie>=2.1,<3.0
tgt==1.5
unidecode==1.3.8
# Text-to-Speech (with LJSpeech)
This folder contains the recipes for training TTS systems (including vocoders) with the popular LJSpeech dataset.
# Dataset
The dataset can be downloaded from here:
https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
# Installing Extra Dependencies
Before proceeding, ensure you have installed the necessary additional dependencies. To do this, simply run the following command in your terminal:
```
pip install -r extra_requirements.txt
```
# Tacotron 2
The subfolder "tacotron2" contains the recipe for training the popular [tacotron2](https://arxiv.org/abs/1712.05884) TTS model.
To run this recipe, go into the "tacotron2" folder and run:
```
python train.py --device=cuda:0 --max_grad_norm=1.0 --data_folder=/your_folder/LJSpeech-1.1 hparams/train.yaml
```
The training logs are available [here](https://www.dropbox.com/sh/1npvo1g1ncafipf/AAC5DR1ErF2Q9V4bd1DHqX43a?dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech).
# FastSpeech2
The subfolder "fastspeech2" contains the recipes for training the non-autoregressive transformer based TTS model [FastSpeech2](https://arxiv.org/abs/2006.04558).
### FastSpeech2 with pre-extracted durations from a forced aligner
Training FastSpeech2 requires pre-extracted phoneme alignments (durations). The LJSpeech phoneme alignments from Montreal Forced Aligner are automatically downloaded, decompressed and stored at this location: ```/your_folder/LJSpeech-1.1/TextGrid```.
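As a rough illustration of what these alignments contain, the sketch below reads a single TextGrid with the `tgt` package from `extra_requirements.txt`; the file path and the "phones" tier name are assumptions about the MFA output layout, and the interval attributes follow the standard tgt API:
```python
import tgt  # TextGridTools, listed in extra_requirements.txt

# Hypothetical path to one of the downloaded alignment files
textgrid_path = "/your_folder/LJSpeech-1.1/TextGrid/LJ001-0001.TextGrid"

tg = tgt.io.read_textgrid(textgrid_path)
# MFA alignments usually store phonemes in a tier named "phones" (assumption)
phone_tier = tg.get_tier_by_name("phones")
for interval in phone_tier.intervals:
    # Duration of each phoneme in seconds; training converts these to frame counts
    print(interval.text, round(interval.end_time - interval.start_time, 3))
```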
To run this recipe, please first install the extra dependencies:
```
pip install -r extra_requirements.txt
```
Then go into the "fastspeech2" folder and run:
```
python train.py --data_folder=/your_folder/LJSpeech-1.1 hparams/train.yaml
```
Training takes about 3 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/scl/fo/vtgbltqdrvw9r0vs7jz67/h?rlkey=cm2mwh5rce5ad9e90qaciypox&dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-fastspeech2-ljspeech).
### FastSpeech2 with internal alignment
This recipe trains FastSpeech2 without a forced aligner, following [One TTS Alignment To Rule Them All](https://arxiv.org/pdf/2108.10447.pdf). The alignment is learnt by an internal alignment network added to FastSpeech2. This recipe aims to simplify training on custom data and to provide better alignments around punctuation.
To run this recipe, please first install the extra-requirements:
```
pip install -r extra_requirements.txt
```
Then go into the "fastspeech2" folder and run:
```
python train_internal_alignment.py hparams/train_internal_alignment.yaml --data_folder=/your_folder/LJSpeech-1.1
```
The data preparation includes a grapheme-to-phoneme process for the entire corpus which may take several hours. Training takes about 5 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/scl/fo/4ctkc6jjas3uij9dzcwta/h?rlkey=i0k086d77flcsdx40du1ppm2d&dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-fastspeech2-internal-alignment-ljspeech).
# HiFiGAN (Vocoder)
The subfolder "vocoder/hifigan/" contains the [HiFiGAN vocoder](https://arxiv.org/pdf/2010.05646.pdf).
The vocoder is a neural network that converts a spectrogram into a waveform (it can be used on top of Tacotron2/FastSpeech2).
We suggest using `tensorboard_logger` by setting `use_tensorboard: True` in the yaml file; in that case, `tensorboard` must be installed.
To run this recipe, go into the "vocoder/hifigan/" folder and run:
```
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
Training takes about 10 minutes/epoch on an NVIDIA RTX 8000.
The training logs are available [here](https://www.dropbox.com/sh/m2xrdssiroipn8g/AAD-TqPYLrSg6eNxUkcImeg4a?dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-hifigan-ljspeech).
# DiffWave (Vocoder)
The subfolder "vocoder/diffwave/" contains the [Diffwave](https://arxiv.org/pdf/2009.09761.pdf) vocoder.
DiffWave is a versatile diffusion model for audio synthesis, which produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation.
Here it serves as a vocoder that generates waveforms given spectrograms as conditions (it can be used on top of Tacotron2/FastSpeech2).
To run this recipe, go into the "vocoder/diffwave/" folder and run:
```
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
The script periodically writes synthesized audio to `<output_folder>/samples` during training.
We suggest using `tensorboard_logger` by setting `use_tensorboard: True` in the yaml file; in that case, `tensorboard` must be installed.
Training takes about 6 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/sh/tbhpn1xirtaix68/AACvYaVDiUGAKURf2o-fvgMoa?dl=0).
For inference, setting `fast_sampling: True` enables fast sampling with a user-defined variance schedule. According to the paper, high-quality audio can be generated with only 6 steps, so this is highly recommended.
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-diffwave-ljspeech).
# HiFiGAN Unit Vocoder
The subfolder "vocoder/hifigan_discrete/" contains the [HiFiGAN Unit vocoder](https://arxiv.org/abs/2406.10735). This vocoder is a neural network designed to transform discrete self-supervised representations into waveform data.
This is suitable for a wide range of generative tasks such as speech enhancement, separation, text-to-speech, voice cloning, etc. Please read [DASB - Discrete Audio and Speech Benchmark](https://arxiv.org/abs/2406.14294) for more information.
To run this recipe successfully, start by installing the necessary extra dependencies:
```bash
pip install -r extra_requirements.txt
```
Before training the vocoder, you need to choose a speech encoder to extract representations that will be used as discrete audio input. We support k-means models using features from HuBERT, WavLM, or Wav2Vec2. Below are the available self-supervised speech encoders for which we provide pre-trained k-means checkpoints:
| Encoder | HF model |
|----------|-----------------------------------------|
| HuBERT | facebook/hubert-large-ll60k |
| Wav2Vec2 | facebook/wav2vec2-large-960h-lv60-self |
| WavLM | microsoft/wavlm-large |
Checkpoints are available in the HF [SSL_Quantization](https://huggingface.co/speechbrain/SSL_Quantization) repository. Alternatively, you can train your own k-means model by following instructions in the "LJSpeech/quantization" README.
Next, configure the SSL model type, k-means model, and corresponding hub in your YAML configuration file by following these steps (a sketch of the resulting fields is shown after the list):
1. Navigate to the "vocoder/hifigan_discrete/hparams" folder and open "train.yaml" file.
2. Modify the `encoder_type` field to specify one of the SSL models: "HuBERT", "WavLM", or "Wav2Vec2".
3. Update the `encoder_hub` field with the specific name of the SSL Hub associated with your chosen model type.
If you have trained your own k-means model, follow these additional steps:
4. Update the `kmeans_folder` field with the specific name of the SSL Hub containing your trained k-means model. Please follow the same file structure as the official one in [SSL_Quantization](https://huggingface.co/speechbrain/SSL_Quantization).
5. Update the `kmeans_dataset` field with the specific name of the dataset on which the k-means model was trained.
6. Update the `num_clusters` field according to the number of clusters of your k-means model.
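Putting steps 2-6 together, the relevant fields in `hparams/train.yaml` end up looking roughly like the sketch below (the concrete values are illustrative assumptions, not verified defaults):
```yaml
# Illustrative sketch only -- replace the values with your own choices
encoder_type: HuBERT                              # one of: HuBERT, WavLM, Wav2Vec2
encoder_hub: facebook/hubert-large-ll60k          # SSL hub matching the chosen encoder
# Only required when using your own k-means model:
kmeans_folder: your-namespace/SSL_Quantization    # hypothetical hub with your k-means checkpoint
kmeans_dataset: LJSpeech                          # dataset the k-means model was trained on
num_clusters: 1000                                # number of clusters in your k-means model
```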
Finally, navigate back to the "vocoder/hifigan_discrete/" folder and run the following command:
```bash
python train.py hparams/train.yaml --data_folder=/path/to/LJspeech
```
Training typically takes around 4 minutes per epoch when using an NVIDIA A100 40G.
# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{ravanelli2024opensourceconversationalaispeechbrain,
title={Open-Source Conversational AI with SpeechBrain 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}
```
# Needed only for quantization
scikit-learn
# Needed only with use_tensorboard=True
# torchvision is needed to save spectrograms
tensorboard
tgt
torchvision
unidecode
# ############################################################################
# Model: FastSpeech2
# Tokens: Raw characters (English text)
# Training: LJSpeech
# Authors: Sathvik Udupa, Yingzhi Wang, Pradnya Kandarkar
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/fastspeech2/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 500
train_spn_predictor_epochs: 8
progress_samples: True
progress_sample_path: !ref <output_folder>/samples
progress_samples_min_run: 10
progress_samples_interval: 10
progress_batch_sample_size: 4
#################################
# Data files and pre-processing #
#################################
data_folder: #!PLACEHOLDER # e.g., /data/Database/LJSpeech-1.1
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: null
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
mel_normalized: False
min_max_energy_norm: True
min_f0: 65 #(torchaudio pyin values)
max_f0: 2093 #(torchaudio pyin values)
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.0001
weight_decay: 0.000001
max_grad_norm: 1.0
batch_size: 32 #minimum 2
num_workers_train: 16
num_workers_valid: 4
betas: [0.9, 0.98]
################################
# Model Parameters and model #
################################
# Input parameters
lexicon:
- AA
- AE
- AH
- AO
- AW
- AY
- B
- CH
- D
- DH
- EH
- ER
- EY
- F
- G
- HH
- IH
- IY
- JH
- K
- L
- M
- N
- NG
- OW
- OY
- P
- R
- S
- SH
- T
- TH
- UH
- UW
- V
- W
- Y
- Z
- ZH
- spn
n_symbols: 42 #fixed depending on symbols in the lexicon (+1 for a dummy symbol used for padding, +1 for unknown)
padding_idx: 0
# Encoder parameters
enc_num_layers: 4
enc_num_head: 2
enc_d_model: 384
enc_ffn_dim: 1024
enc_k_dim: 384
enc_v_dim: 384
enc_dropout: 0.2
# Decoder parameters
dec_num_layers: 4
dec_num_head: 2
dec_d_model: 384
dec_ffn_dim: 1024
dec_k_dim: 384
dec_v_dim: 384
dec_dropout: 0.2
# Postnet parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
postnet_dropout: 0.5
# common
normalize_before: True
ffn_type: 1dcnn #1dcnn or ffn
ffn_cnn_kernel_size_list: [9, 1]
# variance predictor
dur_pred_kernel_size: 3
pitch_pred_kernel_size: 3
energy_pred_kernel_size: 3
variance_predictor_dropout: 0.5
# silent phoneme token predictor
spn_predictor: !new:speechbrain.lobes.models.FastSpeech2.SPNPredictor
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
padding_idx: !ref <padding_idx>
#model
model: !new:speechbrain.lobes.models.FastSpeech2.FastSpeech2
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
dec_num_layers: !ref <dec_num_layers>
dec_num_head: !ref <dec_num_head>
dec_d_model: !ref <dec_d_model>
dec_ffn_dim: !ref <dec_ffn_dim>
dec_k_dim: !ref <dec_k_dim>
dec_v_dim: !ref <dec_v_dim>
dec_dropout: !ref <dec_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
n_mels: !ref <n_mel_channels>
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
postnet_dropout: !ref <postnet_dropout>
padding_idx: !ref <padding_idx>
dur_pred_kernel_size: !ref <dur_pred_kernel_size>
pitch_pred_kernel_size: !ref <pitch_pred_kernel_size>
energy_pred_kernel_size: !ref <energy_pred_kernel_size>
variance_predictor_dropout: !ref <variance_predictor_dropout>
mel_spectogram: !name:speechbrain.lobes.models.FastSpeech2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
min_max_energy_norm: !ref <min_max_energy_norm>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
criterion: !new:speechbrain.lobes.models.FastSpeech2.Loss
log_scale_durations: True
duration_loss_weight: 1.0
pitch_loss_weight: 1.0
energy_loss_weight: 1.0
ssim_loss_weight: 1.0
mel_loss_weight: 1.0
postnet_mel_loss_weight: 1.0
spn_loss_weight: 1.0
spn_loss_max_epochs: !ref <train_spn_predictor_epochs>
vocoder: "hifi-gan"
pretrained_vocoder: True
vocoder_source: speechbrain/tts-hifigan-ljspeech
vocoder_download_path: tmpdir_vocoder
modules:
spn_predictor: !ref <spn_predictor>
model: !ref <model>
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers_train>
shuffle: True
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollate
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers_valid>
shuffle: False
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollate
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
betas: !ref <betas>
noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
lr_initial: !ref <learning_rate>
n_warmup_steps: 4000
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
spn_predictor: !ref <spn_predictor>
model: !ref <model>
lr_annealing: !ref <noam_annealing>
counter: !ref <epoch_counter>
input_encoder: !new:speechbrain.dataio.encoder.TextEncoder
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
# ############################################################################
# Model: FastSpeech2 with internal alignment
# Tokens: Phonemes (ARPABET)
# Dataset: LJSpeech
# Authors: Yingzhi Wang 2023
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/fastspeech2_internal_alignment/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 500
progress_samples: True
progress_sample_path: !ref <output_folder>/samples
progress_samples_min_run: 10
progress_samples_interval: 10
progress_batch_sample_size: 4
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /data/Database/LJSpeech-1.1
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: null
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
mel_normalized: False
min_max_energy_norm: True
min_f0: 65 #(torchaudio pyin values)
max_f0: 2093 #(torchaudio pyin values)
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.0001
weight_decay: 0.000001
max_grad_norm: 1.0
batch_size: 16 #minimum 2
betas: [0.9, 0.998]
num_workers_train: 16
num_workers_valid: 4
################################
# Model Parameters and model #
################################
# Input parameters
lexicon:
- "AA"
- "AE"
- "AH"
- "AO"
- "AW"
- "AY"
- "B"
- "CH"
- "D"
- "DH"
- "EH"
- "ER"
- "EY"
- "F"
- "G"
- "HH"
- "IH"
- "IY"
- "JH"
- "K"
- "L"
- "M"
- "N"
- "NG"
- "OW"
- "OY"
- "P"
- "R"
- "S"
- "SH"
- "T"
- "TH"
- "UH"
- "UW"
- "V"
- "W"
- "Y"
- "Z"
- "ZH"
- "-"
- "!"
- "'"
- "("
- ")"
- ","
- "."
- ":"
- ";"
- "?"
- " "
n_symbols: 52 #fixed depending on symbols in the lexicon (+1 for a dummy symbol used for padding, +1 for unknown)
padding_idx: 0
hidden_channels: 512
# Encoder parameters
enc_num_layers: 4
enc_num_head: 2
enc_d_model: !ref <hidden_channels>
enc_ffn_dim: 1024
enc_k_dim: !ref <hidden_channels>
enc_v_dim: !ref <hidden_channels>
enc_dropout: 0.2
# Aligner parameters
in_query_channels: 80
in_key_channels: !ref <hidden_channels> # 512 in the paper
attn_channels: 80
temperature: 0.0005
# Decoder parameters
dec_num_layers: 4
dec_num_head: 2
dec_d_model: !ref <hidden_channels>
dec_ffn_dim: 1024
dec_k_dim: !ref <hidden_channels>
dec_v_dim: !ref <hidden_channels>
dec_dropout: 0.2
# Postnet parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
postnet_dropout: 0.2
# common
normalize_before: True
ffn_type: 1dcnn #1dcnn or ffn
ffn_cnn_kernel_size_list: [9, 1]
# variance predictor
dur_pred_kernel_size: 3
pitch_pred_kernel_size: 3
energy_pred_kernel_size: 3
variance_predictor_dropout: 0.5
#model
model: !new:speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
in_query_channels: !ref <in_query_channels>
in_key_channels: !ref <in_key_channels>
attn_channels: !ref <attn_channels>
temperature: !ref <temperature>
dec_num_layers: !ref <dec_num_layers>
dec_num_head: !ref <dec_num_head>
dec_d_model: !ref <dec_d_model>
dec_ffn_dim: !ref <dec_ffn_dim>
dec_k_dim: !ref <dec_k_dim>
dec_v_dim: !ref <dec_v_dim>
dec_dropout: !ref <dec_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
n_mels: !ref <n_mel_channels>
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
postnet_dropout: !ref <postnet_dropout>
padding_idx: !ref <padding_idx>
dur_pred_kernel_size: !ref <dur_pred_kernel_size>
pitch_pred_kernel_size: !ref <pitch_pred_kernel_size>
energy_pred_kernel_size: !ref <energy_pred_kernel_size>
variance_predictor_dropout: !ref <variance_predictor_dropout>
mel_spectogram: !name:speechbrain.lobes.models.FastSpeech2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
min_max_energy_norm: !ref <min_max_energy_norm>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
criterion: !new:speechbrain.lobes.models.FastSpeech2.LossWithAlignment
log_scale_durations: True
duration_loss_weight: 1.0
pitch_loss_weight: 1.0
energy_loss_weight: 1.0
ssim_loss_weight: 1.0
mel_loss_weight: 1.0
postnet_mel_loss_weight: 1.0
aligner_loss_weight: 1.0
binary_alignment_loss_weight: 0.2
binary_alignment_loss_warmup_epochs: 1
binary_alignment_loss_max_epochs: 80
vocoder: "hifi-gan"
pretrained_vocoder: True
vocoder_source: speechbrain/tts-hifigan-ljspeech
vocoder_download_path: tmpdir_vocoder
modules:
model: !ref <model>
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers_train>
shuffle: True
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers_valid>
shuffle: False
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
betas: !ref <betas>
noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
lr_initial: !ref <learning_rate>
n_warmup_steps: 4000
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
lr_annealing: !ref <noam_annealing>
counter: !ref <epoch_counter>
input_encoder: !new:speechbrain.dataio.encoder.TextEncoder
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
../../ljspeech_prepare.py
"""
Recipe for training the FastSpeech2 Text-To-Speech model
Instead of using pre-extracted phoneme durations from MFA,
This recipe trains an internal alignment from scratch, as introduced in:
https://arxiv.org/pdf/2108.10447.pdf (One TTS Alignment To Rule Them All)
To run this recipe, do the following:
# python train_internal_alignment.py hparams/train_internal_alignment.yaml
Authors
* Yingzhi Wang 2023
"""
import logging
import os
import sys
from pathlib import Path
import numpy as np
import torch
import torchaudio
from hyperpyyaml import load_hyperpyyaml
import speechbrain as sb
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.utils.data_utils import scalarize
os.environ["TOKENIZERS_PARALLELISM"] = "false"
logger = logging.getLogger(__name__)
class FastSpeech2Brain(sb.Brain):
def on_fit_start(self):
"""Gets called at the beginning of ``fit()``, on multiple processes
if ``distributed_count > 0`` and backend is ddp and initializes statistics
"""
self.hparams.progress_sample_logger.reset()
self.last_epoch = 0
self.last_batch = None
self.last_loss_stats = {}
return super().on_fit_start()
def compute_forward(self, batch, stage):
"""Computes the forward pass
Arguments
---------
batch: str
a single batch
stage: speechbrain.Stage
the training stage
Returns
-------
the model output
"""
inputs, _ = self.batch_to_device(batch)
return self.hparams.model(*inputs)
def on_fit_batch_end(self, batch, outputs, loss, should_step):
"""At the end of the optimizer step, apply noam annealing and logging."""
if should_step:
self.hparams.noam_annealing(self.optimizer)
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : torch.Tensor
The model generated spectrograms and other metrics from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
x, y, metadata = self.batch_to_device(batch, return_metadata=True)
self.last_batch = [x[0], y[-1], y[-2], predictions[0], *metadata]
self._remember_sample([x[0], *y, *metadata], predictions)
loss = self.hparams.criterion(
predictions, y, self.hparams.epoch_counter.current
)
self.last_loss_stats[stage] = scalarize(loss)
return loss["total_loss"]
def _remember_sample(self, batch, predictions):
"""Remembers samples of spectrograms and the batch for logging purposes
Arguments
---------
batch: tuple
a training batch
predictions: tuple
predictions (raw output of the FastSpeech2
model)
"""
(
phoneme_padded,
mel_padded,
pitch,
energy,
output_lengths,
input_lengths,
labels,
wavs,
) = batch
(
mel_post,
postnet_mel_out,
predict_durations,
predict_pitch,
average_pitch,
predict_energy,
average_energy,
predict_mel_lens,
alignment_durations,
alignment_soft,
alignment_logprob,
alignment_mas,
) = predictions
self.hparams.progress_sample_logger.remember(
target=self.process_mel(mel_padded, output_lengths),
output=self.process_mel(postnet_mel_out, output_lengths),
raw_batch=self.hparams.progress_sample_logger.get_batch_sample(
{
"tokens": phoneme_padded,
"input_lengths": input_lengths,
"mel_target": mel_padded,
"mel_out": postnet_mel_out,
"mel_lengths": predict_mel_lens,
"durations": alignment_durations,
"predict_durations": predict_durations,
"labels": labels,
"wavs": wavs,
}
),
)
def process_mel(self, mel, len, index=0):
"""Converts a mel spectrogram to one that can be saved as an image
sample = sqrt(exp(mel))
Arguments
---------
mel: torch.Tensor
the mel spectrogram (as used in the model)
len: int
length of the mel spectrogram
index: int
batch index
Returns
-------
mel: torch.Tensor
the spectrogram, for image saving purposes
"""
assert mel.dim() == 3
return torch.sqrt(torch.exp(mel[index][: len[index]]))
def on_stage_end(self, stage, stage_loss, epoch):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# At the end of validation, we can write
if stage == sb.Stage.VALID:
# Update learning rate
self.last_epoch = epoch
lr = self.hparams.noam_annealing.current_lr
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats( # 1#2#
stats_meta={"Epoch": epoch, "lr": lr},
train_stats=self.last_loss_stats[sb.Stage.TRAIN],
valid_stats=self.last_loss_stats[sb.Stage.VALID],
)
output_progress_sample = (
self.hparams.progress_samples
and epoch % self.hparams.progress_samples_interval == 0
and epoch >= self.hparams.progress_samples_min_run
)
if output_progress_sample:
logger.info("Saving predicted samples")
inference_mel, mel_lens = self.run_inference()
self.hparams.progress_sample_logger.save(epoch)
self.run_vocoder(inference_mel, mel_lens)
# Save the current checkpoint and delete previous checkpoints.
self.checkpointer.save_and_keep_only(
meta=self.last_loss_stats[stage],
min_keys=["total_loss"],
)
# We also write statistics about test data spectogram to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=self.last_loss_stats[sb.Stage.TEST],
)
def run_inference(self):
"""Produces a sample in inference mode with predicted durations."""
if self.last_batch is None:
return
tokens, *_ = self.last_batch
(
_,
postnet_mel_out,
_,
_,
_,
_,
_,
predict_mel_lens,
_,
_,
_,
_,
) = self.hparams.model(tokens)
self.hparams.progress_sample_logger.remember(
infer_output=self.process_mel(
postnet_mel_out, [len(postnet_mel_out[0])]
)
)
return postnet_mel_out, predict_mel_lens
def run_vocoder(self, inference_mel, mel_lens):
"""Uses a pretrained vocoder to generate audio from predicted mel
spectogram. By default, uses speechbrain hifigan.
Arguments
---------
inference_mel: torch.Tensor
predicted mel from fastspeech2 inference
mel_lens: torch.Tensor
predicted mel lengths from fastspeech2 inference
used to mask the noise from padding
Returns
-------
None
"""
if self.last_batch is None:
return
*_, wavs = self.last_batch
inference_mel = inference_mel[: self.hparams.progress_batch_sample_size]
mel_lens = mel_lens[0 : self.hparams.progress_batch_sample_size]
assert (
self.hparams.vocoder == "hifi-gan"
and self.hparams.pretrained_vocoder is True
), "Specified vocoder not supported yet"
logger.info(
f"Generating audio with pretrained {self.hparams.vocoder_source} vocoder"
)
hifi_gan = HIFIGAN.from_hparams(
source=self.hparams.vocoder_source,
savedir=self.hparams.vocoder_download_path,
)
waveforms = hifi_gan.decode_batch(
inference_mel.transpose(2, 1), mel_lens, self.hparams.hop_length
)
for idx, wav in enumerate(waveforms):
path = os.path.join(
self.hparams.progress_sample_path,
str(self.last_epoch),
f"pred_{Path(wavs[idx]).stem}.wav",
)
torchaudio.save(path, wav, self.hparams.sample_rate)
def batch_to_device(self, batch, return_metadata=False):
"""Transfers the batch to the target device
Arguments
---------
batch: tuple
the batch to use
return_metadata: bool
Whether to additionally return labels and wavs.
Returns
-------
x: tuple
phonemes, spectrogram, pitch, energy
y: tuple
spectrogram, pitch, energy, mel_lengths, input_lengths
metadata: tuple
labels, wavs
"""
(
phoneme_padded,
input_lengths,
mel_padded,
pitch_padded,
energy_padded,
output_lengths,
# len_x,
labels,
wavs,
) = batch
# durations = durations.to(self.device, non_blocking=True).long()
phonemes = phoneme_padded.to(self.device, non_blocking=True).long()
input_lengths = input_lengths.to(self.device, non_blocking=True).long()
spectogram = mel_padded.to(self.device, non_blocking=True).float()
pitch = pitch_padded.to(self.device, non_blocking=True).float()
energy = energy_padded.to(self.device, non_blocking=True).float()
mel_lengths = output_lengths.to(self.device, non_blocking=True).long()
x = (phonemes, spectogram, pitch, energy)
y = (spectogram, pitch, energy, mel_lengths, input_lengths)
metadata = (labels, wavs)
if return_metadata:
return x, y, metadata
return x, y
def dataio_prepare(hparams):
"Creates the datasets and their data processing pipelines."
# Load lexicon
lexicon = hparams["lexicon"]
input_encoder = hparams.get("input_encoder")
# add a dummy symbol for idx 0 - used for padding.
lexicon = ["@@"] + lexicon
input_encoder.update_from_iterable(lexicon, sequence_input=False)
input_encoder.add_unk()
# load audio, text and durations on the fly; encode audio and text.
@sb.utils.data_pipeline.takes("wav", "phonemes", "pitch")
@sb.utils.data_pipeline.provides("mel_text_pair")
def audio_pipeline(wav, phonemes, pitch):
phoneme_seq = input_encoder.encode_sequence_torch(phonemes).int()
audio, fs = torchaudio.load(wav)
audio = audio.squeeze()
mel, energy = hparams["mel_spectogram"](audio=audio)
pitch = np.load(pitch)
pitch = torch.from_numpy(pitch)
pitch = pitch[: mel.shape[-1]]
return phoneme_seq, mel, pitch, energy, len(phoneme_seq), len(mel)
# define splits and load it as sb dataset
datasets = {}
for dataset in hparams["splits"]:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=hparams[f"{dataset}_json"],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["mel_text_pair", "wav", "label", "pitch"],
)
return datasets
def main():
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file) as fin:
hparams = load_hyperpyyaml(fin, overrides)
sb.utils.distributed.ddp_init_group(run_opts)
sb.create_experiment_directory(
experiment_directory=hparams["output_folder"],
hyperparams_to_save=hparams_file,
overrides=overrides,
)
from ljspeech_prepare import prepare_ljspeech
sb.utils.distributed.run_on_main(
prepare_ljspeech,
kwargs={
"data_folder": hparams["data_folder"],
"save_folder": hparams["save_folder"],
"splits": hparams["splits"],
"split_ratio": hparams["split_ratio"],
"model_name": hparams["model"].__class__.__name__,
"seed": hparams["seed"],
"pitch_n_fft": hparams["n_fft"],
"pitch_hop_length": hparams["hop_length"],
"pitch_min_f0": hparams["min_f0"],
"pitch_max_f0": hparams["max_f0"],
"skip_prep": hparams["skip_prep"],
"use_custom_cleaner": True,
"device": "cuda",
},
)
datasets = dataio_prepare(hparams)
# Brain class initialization
fastspeech2_brain = FastSpeech2Brain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
run_opts=run_opts,
checkpointer=hparams["checkpointer"],
)
# Training
fastspeech2_brain.fit(
fastspeech2_brain.hparams.epoch_counter,
datasets["train"],
datasets["valid"],
train_loader_kwargs=hparams["train_dataloader_opts"],
valid_loader_kwargs=hparams["valid_dataloader_opts"],
)
if __name__ == "__main__":
main()
# ############################################################################
# Model: Tacotron2
# Tokens: Raw characters (English text)
# losses: Transducer
# Training: LJSpeech
# Authors: Georges Abous-Rjeili, Artem Ploujnikov, Yingzhi Wang
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref ./results/tacotron2/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 750
keep_checkpoint_interval: 50
###################################
# Progress Samples #
###################################
# Progress samples are used to monitor the progress
# of an ongoing training session by outputting samples
# of spectrograms, alignments, etc at regular intervals
# Whether to enable progress samples
progress_samples: True
# The path where the samples will be stored
progress_sample_path: !ref <output_folder>/samples
# The interval, in epochs. For instance, if it is set to 5,
# progress samples will be output every 5 epochs
progress_samples_interval: 1
# The sample size for raw batch samples saved in batch.pth
# (useful mostly for model debugging)
progress_batch_sample_size: 3
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g, /localscratch/ljspeech
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
# Use the original preprocessing from nvidia
# The cleaners to be used (applicable to nvidia only)
text_cleaners: ['english_cleaners']
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: 1024
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
mel_normalized: False
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.001
weight_decay: 0.000006
batch_size: 64 #minimum 2
num_workers: 8
mask_padding: True
guided_attention_sigma: 0.2
guided_attention_weight: 50.0
guided_attention_weight_half_life: 10.
guided_attention_hard_stop: 50
gate_loss_weight: 1.0
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
test_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
################################
# Model Parameters and model #
################################
n_symbols: 148 #fixed depending on symbols in textToSequence
symbols_embedding_dim: 512
# Encoder parameters
encoder_kernel_size: 5
encoder_n_convolutions: 3
encoder_embedding_dim: 512
# Decoder parameters
# The number of frames in the target per encoder step
n_frames_per_step: 1
decoder_rnn_dim: 1024
prenet_dim: 256
max_decoder_steps: 1000
gate_threshold: 0.5
p_attention_dropout: 0.1
p_decoder_dropout: 0.1
decoder_no_early_stopping: False
# Attention parameters
attention_rnn_dim: 1024
attention_dim: 128
# Location Layer parameters
attention_location_n_filters: 32
attention_location_kernel_size: 31
# Mel-post processing network parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
mel_spectogram: !name:speechbrain.lobes.models.Tacotron2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
#model
model: !new:speechbrain.lobes.models.Tacotron2.Tacotron2
mask_padding: !ref <mask_padding>
n_mel_channels: !ref <n_mel_channels>
# symbols
n_symbols: !ref <n_symbols>
symbols_embedding_dim: !ref <symbols_embedding_dim>
# encoder
encoder_kernel_size: !ref <encoder_kernel_size>
encoder_n_convolutions: !ref <encoder_n_convolutions>
encoder_embedding_dim: !ref <encoder_embedding_dim>
# attention
attention_rnn_dim: !ref <attention_rnn_dim>
attention_dim: !ref <attention_dim>
# attention location
attention_location_n_filters: !ref <attention_location_n_filters>
attention_location_kernel_size: !ref <attention_location_kernel_size>
# decoder
n_frames_per_step: !ref <n_frames_per_step>
decoder_rnn_dim: !ref <decoder_rnn_dim>
prenet_dim: !ref <prenet_dim>
max_decoder_steps: !ref <max_decoder_steps>
gate_threshold: !ref <gate_threshold>
p_attention_dropout: !ref <p_attention_dropout>
p_decoder_dropout: !ref <p_decoder_dropout>
# postnet
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
decoder_no_early_stopping: !ref <decoder_no_early_stopping>
guided_attention_scheduler: !new:speechbrain.nnet.schedulers.StepScheduler
initial_value: !ref <guided_attention_weight>
half_life: !ref <guided_attention_weight_half_life>
criterion: !new:speechbrain.lobes.models.Tacotron2.Loss
gate_loss_weight: !ref <gate_loss_weight>
guided_attention_weight: !ref <guided_attention_weight>
guided_attention_sigma: !ref <guided_attention_sigma>
guided_attention_scheduler: !ref <guided_attention_scheduler>
guided_attention_hard_stop: !ref <guided_attention_hard_stop>
modules:
model: !ref <model>
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#annealing_function
lr_annealing: !new:speechbrain.nnet.schedulers.IntervalScheduler
intervals:
- steps: 6000
lr: 0.0005
- steps: 8000
lr: 0.0003
- steps: 10000
lr: 0.0001
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
counter: !ref <epoch_counter>
scheduler: !ref <lr_annealing>
#infer: !name:speechbrain.lobes.models.Tacotron2.infer
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
SpeechBrain system description
==============================
Python version:
3.10.12 (main, May 26 2024, 00:14:02) [GCC 9.4.0]
==============================
Installed Python packages:
accelerate==0.31.0
addict==2.4.0
aiosignal==1.3.1
aitemplate @ http://10.6.10.68:8000/release/aitemplate/dtk24.04.1/aitemplate-0.0.1%2Bdas1.1.git5d8aa20.dtk2404.torch2.1.0-py3-none-any.whl#sha256=ad763a7cfd3935857cf10a07a2a97899fd64dda481add2f48de8b8930bd341dd
annotated-types==0.7.0
anyio==4.4.0
apex @ http://10.6.10.68:8000/release/apex/dtk24.04.1/apex-1.1.0%2Bdas1.1.gitf477a3a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=85eb662d13d6e6c3b61c2d878378c2338c4479bc03a1912c3eabddc2d9d08aa1
attrs==23.2.0
audioread==3.0.1
bitsandbytes @ http://10.6.10.68:8000/release/bitsandbyte/dtk24.04.1/bitsandbytes-0.42.0%2Bdas1.1.gitce85679.abi1.dtk2404.torch2.1.0-py3-none-any.whl#sha256=6324e330c8d12b858d39f4986c0ed0836fcb05f539cee92a7cf558e17954ae0d
certifi==2024.6.2
cffi==1.17.0
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
contourpy==1.2.1
cycler==0.12.1
decorator==5.1.1
deepspeed @ http://10.6.10.68:8000/release/deepspeed/dtk24.04.1/deepspeed-0.12.3%2Bgita724046.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=2c158ed2dab21f4f09e7fc29776cb43a1593b13cec33168ce3483f318b852fc9
distlib==0.3.8
dnspython==2.6.1
dropout-layer-norm @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/dropout_layer_norm-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ae10c7cc231a8e38492292e91e76ba710d7679762604c0a7f10964b2385cdbd7
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastpt @ http://10.6.10.68:8000/release/fastpt/dtk24.04.1/fastpt-1.0.0%2Bdas1.1.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ecf30dadcd2482adb1107991edde19b6559b8237379dbb0a3e6eb7306aad3f9a
filelock==3.15.1
fire==0.6.0
flash-attn @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/flash_attn-2.0.4%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7ca8e78ee0624b1ff0e91e9fc265e61b9510f02123a010ac71a2f8e5d08a62f7
flatbuffers==24.3.25
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.6.0
fused-dense-lib @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/fused_dense_lib-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7202dd258a86bb7a1572e3b44b90dae667b0c948bf0f420b05924a107aaaba03
h11==0.14.0
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.4
humanfriendly==10.0
HyperPyYAML==1.2.2
hypothesis==5.35.1
identify==2.6.0
idna==3.7
importlib_metadata==7.1.0
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
layer-check-pt @ http://10.6.10.68:8000/release/layercheck/dtk24.04.1/layer_check_pt-1.2.3.git59a087a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=807adae2d4d4b74898777f81e1b94f1af4d881afe6a7826c7c910b211accbea7
lazy_loader==0.4
librosa==0.10.2.post1
lightop @ http://10.6.10.68:8000/release/lightop/dtk24.04.1/lightop-0.4%2Bdas1.1git8e60f07.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=2f2c88fd3fe4be179f44c4849e9224cb5b2b259843fc5a2d088e468b7a14c1b1
llvmlite==0.43.0
lmdeploy @ http://10.6.10.68:8000/release/lmdeploy/dtk24.04.1/lmdeploy-0.2.6%2Bdas1.1.git6ba90df.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=92ecee2c8b982f86e5c3219ded24d2ede219f415bf2cd4297f989a03387a203c
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mmcv @ http://10.6.10.68:8000/release/mmcv/dtk24.04.1/mmcv-2.0.1%2Bdas1.1.gite58da25.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7a937ae22f81b44d9100907e11303c31bf9a670cb4c92e361675674a41a8a07f
mmengine==0.10.4
mmengine-lite==0.10.4
mpmath==1.3.0
msgpack==1.0.8
networkx==3.3
ninja==1.11.1.1
nodeenv==1.9.1
numba==0.60.0
numpy==1.24.3
onnxruntime @ http://10.6.10.68:8000/release/onnxruntime/dtk24.04.1/onnxruntime-1.15.0%2Bdas1.1.git739f24d.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=d0d24167188d2c85f1ed4110fc43e62ea40c74280716d9b5fe9540256f17869a
opencv-python==4.10.0.82
orjson==3.10.5
packaging==24.1
pandas==2.2.2
peft==0.9.0
pillow==10.3.0
platformdirs==4.2.2
pooch==1.8.2
pre-commit==3.8.0
prometheus_client==0.20.0
protobuf==5.27.1
psutil==5.9.8
py-cpuinfo==9.0.0
pycparser==2.22
pydantic==2.7.4
pydantic_core==2.18.4
Pygments==2.18.0
pygtrie==2.5.0
pynvml==11.5.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
ray==2.9.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rotary-emb @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/rotary_emb-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=cc15ec6ae73875515243d7f5c96ab214455a33a4a99eb7f1327f773cae1e6721
rpds-py==0.18.1
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.3
scikit-learn==1.5.1
scipy==1.13.1
sentencepiece==0.2.0
shellingham==1.5.4
shortuuid==1.0.13
six==1.16.0
sniffio==1.3.1
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.5.0
speechbrain==1.0.0
starlette==0.37.2
sympy==1.12.1
termcolor==2.4.0
tgt==1.5
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.15.0
tomli==2.0.1
torch @ http://10.6.10.68:8000/release/pytorch/dtk24.04.1/torch-2.1.0%2Bdas1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=5fd3bcef3aa197c0922727913aca53db9ce3f2fd4a9b22bba1973c3d526377f9
torchaudio @ http://10.6.10.68:8000/release/torchaudio/dtk24.04.1/torchaudio-2.1.2%2Bdas1.1.git63d9a68.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=4fcc556a7a2fffe64ddd57f22e5972b1b2b723f6fdfdaa305bd01551036df38b
torchvision @ http://10.6.10.68:8000/release/vision/dtk24.04.1/torchvision-0.16.0%2Bdas1.1.git7d45932.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=e3032e1bcc0857b54391d66744f97e5cff0dc7e7bb508196356ee927fb81ec01
tqdm==4.66.4
transformers==4.38.0
triton @ http://10.6.10.68:8000/release/triton/dtk24.04.1/triton-2.1.0%2Bdas1.1.git4bf1007a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=4c30d45dab071e65d1704a5cd189b14c4ac20bd59a7061032dfd631b1fc37645
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
ujson==5.10.0
Unidecode==1.3.8
urllib3==2.2.1
uvicorn==0.30.1
uvloop==0.19.0
virtualenv==20.26.3
vllm @ http://10.6.10.68:8000/release/vllm/dtk24.04.1/vllm-0.3.3%2Bdas1.1.gitdf6349c.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=48d265b07efa36f028eca45a3667fa10d3cf30eb1b8f019b62e3b255fb9e49c4
watchfiles==0.22.0
websockets==12.0
xentropy-cuda-lib @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/xentropy_cuda_lib-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=91b058d6a5fd2734a5085d68e08d3a1f948fe9c0119c46885d19f55293e2cce4
xformers @ http://10.6.10.68:8000/release/xformers/dtk24.04.1/xformers-0.0.25%2Bdas1.1.git8ef8bc1.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ca87fd065753c1be3b9fad552eba02d30cd3f4c673f01e81a763834eb5cbb9cc
yapf==0.40.2
zipp==3.19.2
==============================
Could not get git revision
==============================
ROCm version:
5.7.24213