"vscode:/vscode.git/clone" did not exist on "7729723b868112d72b45926daacc3f03483b1f63"
Commit 4130a52d authored by changhl's avatar changhl
Browse files

init model

parent eb6a18fd
Pipeline #1617 failed with stages
in 0 seconds
# Tacotron2_pytorch
## Paper
- https://arxiv.org/pdf/1712.05884
## Open-source code
- https://github.com/NVIDIA/tacotron2
## Model structure
Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. The model consists of two main components:
- Spectrogram prediction network: an encoder-attention-decoder network that maps the input character sequence to a sequence of mel-spectrogram frames
- Vocoder: a modified WaveNet that turns the predicted mel-spectrogram frames into a time-domain waveform (a minimal inference sketch follows the figure below)
<div align="center">
<img src="./images/architecture.png"/>
</div>
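In this repository the WaveNet vocoder is replaced by HiFi-GAN, and both stages are loaded through SpeechBrain (see `inference.py`). A minimal sketch of chaining the two stages, with hypothetical local paths for the downloaded weights:
```
import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Hypothetical local paths to the pretrained weights (see "Pretrained models" below)
tacotron2 = Tacotron2.from_hparams(source="./tacotron2_ljspeech")  # text -> mel-spectrogram frames
hifi_gan = HIFIGAN.from_hparams(source="./hifigan_ljspeech")       # mel-spectrogram frames -> waveform

mel_output, mel_length, alignment = tacotron2.encode_text("hi, nice to meet you")
waveforms = hifi_gan.decode_batch(mel_output)
torchaudio.save("example.wav", waveforms.squeeze(1).cpu(), 22050)
```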
## Algorithm
Compared with the original Tacotron, Tacotron2 replaces the plain RNN with an LSTM. The LSTM's forget, input, and output gates mitigate the vanishing-gradient problem, so the model retains information better during back-propagation, which improves the quality of the synthesized speech (a tiny PyTorch illustration follows the figure below).
<div align="center">
<img src="./images/algorithm.png"/>
</div>
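For reference only (not code from this repository), the gated recurrence described above is available directly in PyTorch; the input, forget, and output gates live inside `nn.LSTM`, and the cell state is what carries long-range information across time steps. The shapes below are hypothetical:
```
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=1024, batch_first=True)
x = torch.randn(2, 100, 512)       # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)      # the gates update the cell state c_n at every step
print(outputs.shape)               # torch.Size([2, 100, 1024])
```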
## Environment setup
### Docker (option 1)
**Note: adjust the path arguments to your environment**
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it --network=host --ipc=host --name=your_container_name --shm-size=32G --device=/dev/kfd --device=/dev/mkfd --device=/dev/dri -v /opt/hyhal:/opt/hyhal:ro -v /path/your_code_data/:/path/your_code_data/ --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
cd /path/your_code_data/
pip3 install -r requirements.txt
```
### Dockerfile (option 2)
```
cd ./docker
docker build --no-cache -t tacotron2 .
docker run -it -v /path/your_code_data/:/path/your_code_data/ --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
pip3 install -r requirements.txt
```
### Anaconda (option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the 光合 (Guanghe) developer community: https://developer.hpccube.com/tool/
```
DTK software stack: dtk24.04.1
python: python3.10
torch: 2.1.0
torchvision: 0.16.0
torchaudio: 2.1.2
```
Tips: the DTK software stack, Python, torch, and the other DCU-related packages above must be installed in exactly matching versions (a quick check snippet follows the install command below)
2. Install the remaining, non-DCU-specific packages from requirements.txt
```
pip3 install -r requirements.txt
```
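After installation, a quick way to confirm that the versions match the table above (this assumes the DTK build of PyTorch, which exposes DCU devices through the `torch.cuda`/HIP interface):
```
import torch
import torchvision
import torchaudio

print(torch.__version__)          # expected: 2.1.0
print(torchvision.__version__)    # expected: 0.16.0
print(torchaudio.__version__)     # expected: 2.1.2
print(torch.cuda.is_available())  # True if the DCU devices are visible
```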
## Dataset
- SCNet quick download link:
  - [LJSpeech dataset download](http://113.200.138.88:18080/aidatasets/lj_speech)
- Official download link:
  - [LJSpeech dataset download](https://keithito.com/LJ-Speech-Dataset/)
```LJSpeech-1.1```: a speech-synthesis dataset containing audio and text; the audio is stored as wav files and the transcripts as a csv file.
```
├── LJSpeech-1.1
│ ├──wavs
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ ├──metadata.csv
│ ├──README
```
- LJSpeech
  - wavs: audio directory
    - LJ001-0001.wav: audio file
    - LJ001-0002.wav: audio file
    - ...
  - metadata.csv: transcript file (a reading sketch follows this list)
    - column 1: audio file name
    - column 2: raw transcript
    - column 3: normalized transcript
  - README: documentation
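A short sketch for reading the dataset, assuming the standard LJSpeech layout in which `metadata.csv` is '|'-delimited; adjust `data_root` to your local copy:
```
import csv
import torchaudio

data_root = "./LJSpeech-1.1"  # hypothetical local path
with open(f"{data_root}/metadata.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for file_id, text, normalized_text in reader:
        waveform, sample_rate = torchaudio.load(f"{data_root}/wavs/{file_id}.wav")
        print(file_id, sample_rate, normalized_text)
        break  # only inspect the first entry
```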
## Pretrained models
**Download the pretrained weights before running inference**
- SCNet download links:
  - [tacotron2 model weights](http://113.200.138.88:18080/aimodels/tacotron2_ljspeech)
  - [hifigan model weights](http://113.200.138.88:18080/aimodels/hifigan_ljspeech)
- Official download links:
  - [tacotron2 model weights](https://hf-mirror.com/speechbrain/tts-tacotron2-ljspeech)
  - [hifigan model weights](https://hf-mirror.com/speechbrain/tts-hifigan-ljspeech)
## Training
**Make sure the current working directory is tacotron2_pytorch and set the visible devices**
### Single card
```
export HIP_VISIBLE_DEVICES=0   # set the visible DCU device, e.g. card 0
bash train_s.sh $dataset_path $save_path
```
- $dataset_path: dataset path
- $save_path: directory where training checkpoints are saved
### Multi-card
```
export HIP_VISIBLE_DEVICES=0,1,2,3   # set the visible DCU devices, e.g. cards 0-3
bash train_m.sh $dataset_path $save_path
```
- $dataset_path: dataset path
- $save_path: directory where training checkpoints are saved
## Inference
```
export HIP_VISIBLE_DEVICES=0   # set the visible DCU device
python3 inference.py -m modelpath_tacotron2 -v modelpath_hifigan -t "hi, nice to meet you"
```
- -m: path to the tacotron2 model weights
- -v: path to the hifigan model weights
- -t: input text
- -res: directory where the result file is saved
## Result
```
Input: "hi, nice to meet you"
Output: ./res/example.wav
```
## Application scenarios
### Algorithm category
```
Speech synthesis
```
### Key application industries
```
Finance, telecommunications, broadcast media
```
## Source repository and issue feedback
https://developer.hpccube.com/codes/modelzoo/tacotron2_pytorch
## References
[GitHub - NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2)
[HF - speechbrain/tts-tacotron2-ljspeech](https://hf-mirror.com/speechbrain/tts-tacotron2-ljspeech)
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# Load the DTK environment during the image build (bash is needed for `source`)
RUN /bin/bash -c "source /opt/dtk/env.sh"
import argparse
import os

import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN


def parse_opt(known=False):
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--model-path', type=str, default="", help="the tacotron2 model path")
    parser.add_argument('-v', '--vocoder-path', type=str, default="", help="the vocoder model path")
    parser.add_argument('-t', '--text', type=str, default="Autumn, the season of change.", help="input text")
    parser.add_argument('-res', '--result_path', type=str, default="./res", help="the path to save the wav file")
    opt = parser.parse_known_args()[0] if known else parser.parse_args()
    return opt


def main(opt):
    # Load the Tacotron2 acoustic model and the HiFi-GAN vocoder
    tacotron2 = Tacotron2.from_hparams(source=opt.model_path, run_opts={"device": "cuda"})
    hifi_gan = HIFIGAN.from_hparams(source=opt.vocoder_path, run_opts={"device": "cuda"})
    # Running the TTS (text -> mel spectrogram)
    mel_output, mel_length, alignment = tacotron2.encode_text(opt.text)
    # Running the vocoder (spectrogram -> waveform)
    waveforms = hifi_gan.decode_batch(mel_output)
    # Save the waveform
    os.makedirs(opt.result_path, exist_ok=True)
    torchaudio.save(os.path.join(opt.result_path, 'example.wav'), waveforms.squeeze(1).cpu(), 22050)


if __name__ == "__main__":
    main(opt=parse_opt())
# model code
modelCode=917
# model name
modelName=tacotron2_pytorch
# model description
modelDescription=Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017.
# application scenarios (multiple tags separated by commas)
appScenario=训练,推理,语音合成,金融,通信,广媒
# framework type (multiple tags separated by commas)
frameType=PyTorch
soundfile==0.12.1
librosa==0.10.2.post1
speechbrain==1.0.0
hyperpyyaml>=0.0.1
joblib>=0.14.1
pre-commit>=2.3.0
pygtrie>=2.1,<3.0
tgt==1.5
unidecode==1.3.8
# Text-to-Speech (with LJSpeech)
This folder contains the recipes for training TTS systems (including vocoders) with the popular LJSpeech dataset.
# Dataset
The dataset can be downloaded from here:
https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
# Installing Extra Dependencies
Before proceeding, ensure you have installed the necessary additional dependencies. To do this, simply run the following command in your terminal:
```
pip install -r extra_requirements.txt
```
# Tacotron 2
The subfolder "tacotron2" contains the recipe for training the popular [tacotron2](https://arxiv.org/abs/1712.05884) TTS model.
To run this recipe, go into the "tacotron2" folder and run:
```
python train.py --device=cuda:0 --max_grad_norm=1.0 --data_folder=/your_folder/LJSpeech-1.1 hparams/train.yaml
```
The training logs are available [here](https://www.dropbox.com/sh/1npvo1g1ncafipf/AAC5DR1ErF2Q9V4bd1DHqX43a?dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech).
# FastSpeech2
The subfolder "fastspeech2" contains the recipes for training the non-autoregressive transformer based TTS model [FastSpeech2](https://arxiv.org/abs/2006.04558).
### FastSpeech2 with pre-extracted durations from a forced aligner
Training FastSpeech2 requires pre-extracted phoneme alignments (durations). The LJSpeech phoneme alignments from Montreal Forced Aligner are automatically downloaded, decompressed and stored at this location: ```/your_folder/LJSpeech-1.1/TextGrid```.
To run this recipe, please first install the extra dependencies:
```
pip install -r extra_requirements.txt
```
Then go into the "fastspeech2" folder and run:
```
python train.py --data_folder=/your_folder/LJSpeech-1.1 hparams/train.yaml
```
Training takes about 3 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/scl/fo/vtgbltqdrvw9r0vs7jz67/h?rlkey=cm2mwh5rce5ad9e90qaciypox&dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-fastspeech2-ljspeech).
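For a quick check of the pretrained model, the sketch below pairs it with the HiFi-GAN vocoder described further down. It assumes the `FastSpeech2` inference class behaves like the Tacotron2 one, with the predicted mel spectrogram first in the returned tuple; please refer to the HuggingFace model card for the exact interface.
```python
import torchaudio
from speechbrain.inference.TTS import FastSpeech2
from speechbrain.inference.vocoders import HIFIGAN

fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech")

outputs = fastspeech2.encode_text(["Mary had a little lamb."])
mel_output = outputs[0]  # assumed: the mel spectrogram is the first returned item
waveforms = hifi_gan.decode_batch(mel_output)
torchaudio.save("fastspeech2_sample.wav", waveforms.squeeze(1).cpu(), 22050)
```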
### FastSpeech2 with internal alignment
This recipe trains FastSpeech2 without a forced aligner, following [One TTS Alignment To Rule Them All](https://arxiv.org/pdf/2108.10447.pdf). The alignment is learnt by an internal alignment network added to FastSpeech2. This recipe aims to simplify training on custom data and to provide better alignments for punctuation.
To run this recipe, please first install the extra-requirements:
```
pip install -r extra_requirements.txt
```
Then go into the "fastspeech2" folder and run:
```
python train_internal_alignment.py hparams/train_internal_alignment.yaml --data_folder=/your_folder/LJSpeech-1.1
```
The data preparation includes a grapheme-to-phoneme process for the entire corpus which may take several hours. Training takes about 5 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/scl/fo/4ctkc6jjas3uij9dzcwta/h?rlkey=i0k086d77flcsdx40du1ppm2d&dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-fastspeech2-internal-alignment-ljspeech).
# HiFiGAN (Vocoder)
The subfolder "vocoder/hifigan/" contains the [HiFiGAN vocoder](https://arxiv.org/pdf/2010.05646.pdf).
The vocoder is a neural network that converts a spectrogram into a waveform (it can be used on top of Tacotron2/FastSpeech2).
We suggest using the `tensorboard_logger` by setting `use_tensorboard: True` in the yaml file; in that case, `tensorboard` must be installed.
To run this recipe, go into the "vocoder/hifigan/" folder and run:
```
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
Training takes about 10 minutes/epoch on an NVIDIA RTX 8000.
The training logs are available [here](https://www.dropbox.com/sh/m2xrdssiroipn8g/AAD-TqPYLrSg6eNxUkcImeg4a?dl=0)
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-hifigan-ljspeech).
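Because the vocoder only needs a mel spectrogram as input, it can also be exercised on its own. Below is a minimal sketch with the pretrained checkpoint; the input is a random placeholder tensor shaped like an 80-channel mel spectrogram, so the output is noise rather than speech:
```python
import torch
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN

hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

mel = torch.rand(1, 80, 200)           # placeholder: (batch, n_mels, frames)
waveform = hifi_gan.decode_batch(mel)  # -> (batch, 1, samples) at 22050 Hz
torchaudio.save("vocoder_check.wav", waveform.squeeze(1), 22050)
```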
# DiffWave (Vocoder)
The subfolder "vocoder/diffwave/" contains the [Diffwave](https://arxiv.org/pdf/2009.09761.pdf) vocoder.
DiffWave is a versatile diffusion model for audio synthesis, which produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation.
Here it serves as a vocoder that generates waveforms given spectrograms as conditions (it can be used on top of Tacotron2/FastSpeech2).
To run this recipe, go into the "vocoder/diffwave/" folder and run:
```
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
The script outputs synthesized audio to `<output_folder>/samples` at regular training-epoch intervals.
We suggest using the `tensorboard_logger` by setting `use_tensorboard: True` in the yaml file; in that case, `tensorboard` must be installed.
Training takes about 6 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/sh/tbhpn1xirtaix68/AACvYaVDiUGAKURf2o-fvgMoa?dl=0)
For inference, setting `fast_sampling: True` enables fast sampling with a user-defined variance schedule. According to the paper, high-quality audio can be generated with only 6 steps. This is highly recommended.
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-diffwave-ljspeech).
# HiFiGAN Unit Vocoder
The subfolder "vocoder/hifigan_discrete/" contains the [HiFiGAN Unit vocoder](https://arxiv.org/abs/2406.10735). This vocoder is a neural network designed to transform discrete self-supervised representations into waveform data.
This is suitable for a wide range of generative tasks such as speech enhancement, separation, text-to-speech, voice cloning, etc. Please read [DASB - Discrete Audio and Speech Benchmark](https://arxiv.org/abs/2406.14294) for more information.
To run this recipe successfully, start by installing the necessary extra dependencies:
```bash
pip install -r extra_requirements.txt
```
Before training the vocoder, you need to choose a speech encoder to extract representations that will be used as discrete audio input. We support k-means models using features from HuBERT, WavLM, or Wav2Vec2. Below are the available self-supervised speech encoders for which we provide pre-trained k-means checkpoints:
| Encoder | HF model |
|----------|-----------------------------------------|
| HuBERT | facebook/hubert-large-ll60k |
| Wav2Vec2 | facebook/wav2vec2-large-960h-lv60-self |
| WavLM | microsoft/wavlm-large |
Checkpoints are available in the HF [SSL_Quantization](https://huggingface.co/speechbrain/SSL_Quantization) repository. Alternatively, you can train your own k-means model by following instructions in the "LJSpeech/quantization" README.
Next, configure the SSL model type, k-means model, and corresponding hub in your YAML configuration file (a small override sketch follows the steps below). Follow these steps:
1. Navigate to the "vocoder/hifigan_discrete/hparams" folder and open "train.yaml" file.
2. Modify the `encoder_type` field to specify one of the SSL models: "HuBERT", "WavLM", or "Wav2Vec2".
3. Update the `encoder_hub` field with the specific name of the SSL Hub associated with your chosen model type.
If you have trained your own k-means model, follow these additional steps:
4. Update the `kmeans_folder` field with the specific name of the SSL Hub containing your trained k-means model. Please follow the same file structure as the official one in [SSL_Quantization](https://huggingface.co/speechbrain/SSL_Quantization).
5. Update the `kmeans_dataset` field with the specific name of the dataset on which the k-means model was trained.
6. Update the `num_clusters` field according to the number of clusters of your k-means model.
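A sketch of what these settings amount to programmatically, using HyperPyYAML the same way `train.py` applies command-line overrides; the inline YAML below is only a stand-in for the relevant lines of `hparams/train.yaml`, and the values are hypothetical:
```python
from hyperpyyaml import load_hyperpyyaml

# Stand-in for the relevant fields of vocoder/hifigan_discrete/hparams/train.yaml
yaml_snippet = """
encoder_type: HuBERT
encoder_hub: facebook/hubert-large-ll60k
num_clusters: 1000
"""

# Overrides work like the --key=value arguments accepted by train.py
overrides = {"encoder_type": "WavLM", "encoder_hub": "microsoft/wavlm-large"}
hparams = load_hyperpyyaml(yaml_snippet, overrides)
print(hparams["encoder_hub"])  # microsoft/wavlm-large
```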
Finally, navigate back to the "vocoder/hifigan_discrete/" folder and run the following command:
```bash
python train.py hparams/train.yaml --data_folder=/path/to/LJspeech
```
Training typically takes around 4 minutes per epoch when using an NVIDIA A100 40G.
# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{ravanelli2024opensourceconversationalaispeechbrain,
title={Open-Source Conversational AI with SpeechBrain 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}
```
# Needed only for quantization
scikit-learn
# Needed only with use_tensorboard=True
# torchvision is needed to save spectrograms
tensorboard
tgt
torchvision
unidecode
############################################################################
# Model: FastSpeech2
# Tokens: Raw characters (English text)
# Training: LJSpeech
# Authors: Sathvik Udupa, Yingzhi Wang, Pradnya Kandarkar
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/fastspeech2/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 500
train_spn_predictor_epochs: 8
progress_samples: True
progress_sample_path: !ref <output_folder>/samples
progress_samples_min_run: 10
progress_samples_interval: 10
progress_batch_sample_size: 4
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /data/Database/LJSpeech-1.1
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: null
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
mel_normalized: False
min_max_energy_norm: True
min_f0: 65 #(torchaudio pyin values)
max_f0: 2093 #(torchaudio pyin values)
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.0001
weight_decay: 0.000001
max_grad_norm: 1.0
batch_size: 32 #minimum 2
num_workers_train: 16
num_workers_valid: 4
betas: [0.9, 0.98]
################################
# Model Parameters and model #
################################
# Input parameters
lexicon:
- AA
- AE
- AH
- AO
- AW
- AY
- B
- CH
- D
- DH
- EH
- ER
- EY
- F
- G
- HH
- IH
- IY
- JH
- K
- L
- M
- N
- NG
- OW
- OY
- P
- R
- S
- SH
- T
- TH
- UH
- UW
- V
- W
- Y
- Z
- ZH
- spn
n_symbols: 42 #fixed depending on symbols in the lexicon +1 for a dummy symbol used for padding
padding_idx: 0
# Encoder parameters
enc_num_layers: 4
enc_num_head: 2
enc_d_model: 384
enc_ffn_dim: 1024
enc_k_dim: 384
enc_v_dim: 384
enc_dropout: 0.2
# Decoder parameters
dec_num_layers: 4
dec_num_head: 2
dec_d_model: 384
dec_ffn_dim: 1024
dec_k_dim: 384
dec_v_dim: 384
dec_dropout: 0.2
# Postnet parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
postnet_dropout: 0.5
# common
normalize_before: True
ffn_type: 1dcnn #1dcnn or ffn
ffn_cnn_kernel_size_list: [9, 1]
# variance predictor
dur_pred_kernel_size: 3
pitch_pred_kernel_size: 3
energy_pred_kernel_size: 3
variance_predictor_dropout: 0.5
# silent phoneme token predictor
spn_predictor: !new:speechbrain.lobes.models.FastSpeech2.SPNPredictor
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
padding_idx: !ref <padding_idx>
#model
model: !new:speechbrain.lobes.models.FastSpeech2.FastSpeech2
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
dec_num_layers: !ref <dec_num_layers>
dec_num_head: !ref <dec_num_head>
dec_d_model: !ref <dec_d_model>
dec_ffn_dim: !ref <dec_ffn_dim>
dec_k_dim: !ref <dec_k_dim>
dec_v_dim: !ref <dec_v_dim>
dec_dropout: !ref <dec_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
n_mels: !ref <n_mel_channels>
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
postnet_dropout: !ref <postnet_dropout>
padding_idx: !ref <padding_idx>
dur_pred_kernel_size: !ref <dur_pred_kernel_size>
pitch_pred_kernel_size: !ref <pitch_pred_kernel_size>
energy_pred_kernel_size: !ref <energy_pred_kernel_size>
variance_predictor_dropout: !ref <variance_predictor_dropout>
mel_spectogram: !name:speechbrain.lobes.models.FastSpeech2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
min_max_energy_norm: !ref <min_max_energy_norm>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
criterion: !new:speechbrain.lobes.models.FastSpeech2.Loss
log_scale_durations: True
duration_loss_weight: 1.0
pitch_loss_weight: 1.0
energy_loss_weight: 1.0
ssim_loss_weight: 1.0
mel_loss_weight: 1.0
postnet_mel_loss_weight: 1.0
spn_loss_weight: 1.0
spn_loss_max_epochs: !ref <train_spn_predictor_epochs>
vocoder: "hifi-gan"
pretrained_vocoder: True
vocoder_source: speechbrain/tts-hifigan-ljspeech
vocoder_download_path: tmpdir_vocoder
modules:
spn_predictor: !ref <spn_predictor>
model: !ref <model>
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers_train>
shuffle: True
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollate
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers_valid>
shuffle: False
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollate
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
betas: !ref <betas>
noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
lr_initial: !ref <learning_rate>
n_warmup_steps: 4000
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
spn_predictor: !ref <spn_predictor>
model: !ref <model>
lr_annealing: !ref <noam_annealing>
counter: !ref <epoch_counter>
input_encoder: !new:speechbrain.dataio.encoder.TextEncoder
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
############################################################################
# Model: FastSpeech2 with internal alignment
# Tokens: Phonemes (ARPABET)
# Dataset: LJSpeech
# Authors: Yingzhi Wang 2023
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/fastspeech2_internal_alignment/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 500
progress_samples: True
progress_sample_path: !ref <output_folder>/samples
progress_samples_min_run: 10
progress_samples_interval: 10
progress_batch_sample_size: 4
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /data/Database/LJSpeech-1.1
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: null
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
mel_normalized: False
min_max_energy_norm: True
min_f0: 65 #(torchaudio pyin values)
max_f0: 2093 #(torchaudio pyin values)
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.0001
weight_decay: 0.000001
max_grad_norm: 1.0
batch_size: 16 #minimum 2
betas: [0.9, 0.998]
num_workers_train: 16
num_workers_valid: 4
################################
# Model Parameters and model #
################################
# Input parameters
lexicon:
- "AA"
- "AE"
- "AH"
- "AO"
- "AW"
- "AY"
- "B"
- "CH"
- "D"
- "DH"
- "EH"
- "ER"
- "EY"
- "F"
- "G"
- "HH"
- "IH"
- "IY"
- "JH"
- "K"
- "L"
- "M"
- "N"
- "NG"
- "OW"
- "OY"
- "P"
- "R"
- "S"
- "SH"
- "T"
- "TH"
- "UH"
- "UW"
- "V"
- "W"
- "Y"
- "Z"
- "ZH"
- "-"
- "!"
- "'"
- "("
- ")"
- ","
- "."
- ":"
- ";"
- "?"
- " "
n_symbols: 52 #fixed depending on symbols in the lexicon (+1 for a dummy symbol used for padding, +1 for unknown)
padding_idx: 0
hidden_channels: 512
# Encoder parameters
enc_num_layers: 4
enc_num_head: 2
enc_d_model: !ref <hidden_channels>
enc_ffn_dim: 1024
enc_k_dim: !ref <hidden_channels>
enc_v_dim: !ref <hidden_channels>
enc_dropout: 0.2
# Aligner parameters
in_query_channels: 80
in_key_channels: !ref <hidden_channels> # 512 in the paper
attn_channels: 80
temperature: 0.0005
# Decoder parameters
dec_num_layers: 4
dec_num_head: 2
dec_d_model: !ref <hidden_channels>
dec_ffn_dim: 1024
dec_k_dim: !ref <hidden_channels>
dec_v_dim: !ref <hidden_channels>
dec_dropout: 0.2
# Postnet parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
postnet_dropout: 0.2
# common
normalize_before: True
ffn_type: 1dcnn #1dcnn or ffn
ffn_cnn_kernel_size_list: [9, 1]
# variance predictor
dur_pred_kernel_size: 3
pitch_pred_kernel_size: 3
energy_pred_kernel_size: 3
variance_predictor_dropout: 0.5
#model
model: !new:speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
in_query_channels: !ref <in_query_channels>
in_key_channels: !ref <in_key_channels>
attn_channels: !ref <attn_channels>
temperature: !ref <temperature>
dec_num_layers: !ref <dec_num_layers>
dec_num_head: !ref <dec_num_head>
dec_d_model: !ref <dec_d_model>
dec_ffn_dim: !ref <dec_ffn_dim>
dec_k_dim: !ref <dec_k_dim>
dec_v_dim: !ref <dec_v_dim>
dec_dropout: !ref <dec_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
n_mels: !ref <n_mel_channels>
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
postnet_dropout: !ref <postnet_dropout>
padding_idx: !ref <padding_idx>
dur_pred_kernel_size: !ref <dur_pred_kernel_size>
pitch_pred_kernel_size: !ref <pitch_pred_kernel_size>
energy_pred_kernel_size: !ref <energy_pred_kernel_size>
variance_predictor_dropout: !ref <variance_predictor_dropout>
mel_spectogram: !name:speechbrain.lobes.models.FastSpeech2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
min_max_energy_norm: !ref <min_max_energy_norm>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
criterion: !new:speechbrain.lobes.models.FastSpeech2.LossWithAlignment
log_scale_durations: True
duration_loss_weight: 1.0
pitch_loss_weight: 1.0
energy_loss_weight: 1.0
ssim_loss_weight: 1.0
mel_loss_weight: 1.0
postnet_mel_loss_weight: 1.0
aligner_loss_weight: 1.0
binary_alignment_loss_weight: 0.2
binary_alignment_loss_warmup_epochs: 1
binary_alignment_loss_max_epochs: 80
vocoder: "hifi-gan"
pretrained_vocoder: True
vocoder_source: speechbrain/tts-hifigan-ljspeech
vocoder_download_path: tmpdir_vocoder
modules:
model: !ref <model>
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers_train>
shuffle: True
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers_valid>
shuffle: False
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
betas: !ref <betas>
noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
lr_initial: !ref <learning_rate>
n_warmup_steps: 4000
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
lr_annealing: !ref <noam_annealing>
counter: !ref <epoch_counter>
input_encoder: !new:speechbrain.dataio.encoder.TextEncoder
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
../../ljspeech_prepare.py
"""
Recipe for training the FastSpeech2 Text-To-Speech model, an end-to-end
neural text-to-speech (TTS) system introduced in 'FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
synthesis' paper
(https://arxiv.org/abs/2006.04558)
To run this recipe, do the following:
# python train.py hparams/train.yaml
Authors
* Sathvik Udupa 2022
* Yingzhi Wang 2022
* Pradnya Kandarkar 2023
"""
import logging
import os
import sys
from pathlib import Path
import numpy as np
import torch
import torchaudio
from hyperpyyaml import load_hyperpyyaml
import speechbrain as sb
from speechbrain.inference.text import GraphemeToPhoneme
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.utils.data_utils import scalarize
os.environ["TOKENIZERS_PARALLELISM"] = "false"
logger = logging.getLogger(__name__)
class FastSpeech2Brain(sb.Brain):
def on_fit_start(self):
"""Gets called at the beginning of ``fit()``, on multiple processes
if ``distributed_count > 0`` and backend is ddp and initializes statistics
"""
self.hparams.progress_sample_logger.reset()
self.last_epoch = 0
self.last_batch = None
self.last_loss_stats = {}
self.g2p = GraphemeToPhoneme.from_hparams("speechbrain/soundchoice-g2p")
self.spn_token_encoded = (
self.input_encoder.encode_sequence_torch(["spn"]).int().item()
)
return super().on_fit_start()
def compute_forward(self, batch, stage):
"""Computes the forward pass
Arguments
---------
batch: str
a single batch
stage: speechbrain.Stage
the training stage
Returns
-------
the model output
"""
inputs, _ = self.batch_to_device(batch)
tokens, durations, pitch, energy, no_spn_seqs, last_phonemes = inputs
# Forward pass for the silent token predictor module
if (
self.hparams.epoch_counter.current
> self.hparams.train_spn_predictor_epochs
):
self.hparams.modules["spn_predictor"].eval()
with torch.no_grad():
spn_preds = self.hparams.modules["spn_predictor"](
no_spn_seqs, last_phonemes
)
else:
spn_preds = self.hparams.modules["spn_predictor"](
no_spn_seqs, last_phonemes
)
# Forward pass for the FastSpeech2 module
(
predict_mel_post,
predict_postnet_output,
predict_durations,
predict_pitch,
predict_avg_pitch,
predict_energy,
predict_avg_energy,
predict_mel_lens,
) = self.hparams.model(tokens, durations, pitch, energy)
return (
predict_mel_post,
predict_postnet_output,
predict_durations,
predict_pitch,
predict_avg_pitch,
predict_energy,
predict_avg_energy,
predict_mel_lens,
spn_preds,
)
def on_fit_batch_end(self, batch, outputs, loss, should_step):
"""At the end of the optimizer step, apply noam annealing."""
if should_step:
self.hparams.noam_annealing(self.optimizer)
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : torch.Tensor
The model generated spectrograms and other metrics from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
x, y, metadata = self.batch_to_device(batch, return_metadata=True)
self.last_batch = [x[0], y[-2], y[-3], predictions[0], *metadata]
self._remember_sample([x[0], *y, *metadata], predictions)
loss = self.hparams.criterion(
predictions, y, self.hparams.epoch_counter.current
)
self.last_loss_stats[stage] = scalarize(loss)
return loss["total_loss"]
def _remember_sample(self, batch, predictions):
"""Remembers samples of spectrograms and the batch for logging purposes
Arguments
---------
batch: tuple
a training batch
predictions: tuple
predictions (raw output of the FastSpeech2
model)
"""
(
tokens,
spectogram,
durations,
pitch,
energy,
mel_lengths,
input_lengths,
spn_labels,
labels,
wavs,
) = batch
(
mel_post,
postnet_mel_out,
predict_durations,
predict_pitch,
predict_avg_pitch,
predict_energy,
predict_avg_energy,
predict_mel_lens,
spn_preds,
) = predictions
self.hparams.progress_sample_logger.remember(
target=self.process_mel(spectogram, mel_lengths),
output=self.process_mel(postnet_mel_out, mel_lengths),
raw_batch=self.hparams.progress_sample_logger.get_batch_sample(
{
"tokens": tokens,
"input_lengths": input_lengths,
"mel_target": spectogram,
"mel_out": postnet_mel_out,
"mel_lengths": predict_mel_lens,
"durations": durations,
"predict_durations": predict_durations,
"labels": labels,
"wavs": wavs,
}
),
)
def process_mel(self, mel, len, index=0):
"""Converts a mel spectrogram to one that can be saved as an image
sample = sqrt(exp(mel))
Arguments
---------
mel: torch.Tensor
the mel spectrogram (as used in the model)
len: int
length of the mel spectrogram
index: int
batch index
Returns
-------
mel: torch.Tensor
the spectrogram, for image saving purposes
"""
assert mel.dim() == 3
return torch.sqrt(torch.exp(mel[index][: len[index]]))
def on_stage_end(self, stage, stage_loss, epoch):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# At the end of validation, we can write
if stage == sb.Stage.VALID:
# Update learning rate
self.last_epoch = epoch
lr = self.hparams.noam_annealing.current_lr
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats( # 1#2#
stats_meta={"Epoch": epoch, "lr": lr},
train_stats=self.last_loss_stats[sb.Stage.TRAIN],
valid_stats=self.last_loss_stats[sb.Stage.VALID],
)
output_progress_sample = (
self.hparams.progress_samples
and epoch % self.hparams.progress_samples_interval == 0
and epoch >= self.hparams.progress_samples_min_run
)
if output_progress_sample:
logger.info("Saving predicted samples")
(
inference_mel,
mel_lens,
inf_mel_spn_pred,
mel_lens_spn_pred,
) = self.run_inference()
self.hparams.progress_sample_logger.save(epoch)
self.run_vocoder(
inference_mel, mel_lens, sample_type="with_spn"
)
self.run_vocoder(
inf_mel_spn_pred, mel_lens_spn_pred, sample_type="no_spn"
)
# Save the current checkpoint and delete previous checkpoints.
# UNCOMMENT THIS
self.checkpointer.save_and_keep_only(
meta=self.last_loss_stats[stage],
min_keys=["total_loss"],
)
# We also write statistics about test data spectogram to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=self.last_loss_stats[sb.Stage.TEST],
)
def run_inference(self):
"""Produces a sample in inference mode with predicted durations."""
if self.last_batch is None:
return
tokens, *_, labels, _ = self.last_batch
# Generates inference samples without using the silent phoneme predictor
(
_,
postnet_mel_out,
_,
_,
_,
_,
_,
predict_mel_lens,
) = self.hparams.model(tokens)
self.hparams.progress_sample_logger.remember(
infer_output=self.process_mel(
postnet_mel_out, [len(postnet_mel_out[0])]
)
)
# Generates inference samples using the silent phoneme predictor
# Preprocessing required at the inference time for the input text
# "label" below contains input text
# "phoneme_labels" contain the phoneme sequences corresponding to input text labels
# "last_phonemes_combined" is used to indicate whether the index position is for a last phoneme of a word
phoneme_labels = list()
last_phonemes_combined = list()
for label in labels:
phoneme_label = list()
last_phonemes = list()
words = label.split()
words = [word.strip() for word in words]
words_phonemes = self.g2p(words)
for words_phonemes_seq in words_phonemes:
for phoneme in words_phonemes_seq:
if not phoneme.isspace():
phoneme_label.append(phoneme)
last_phonemes.append(0)
last_phonemes[-1] = 1
phoneme_labels.append(phoneme_label)
last_phonemes_combined.append(last_phonemes)
# Inserts silent phonemes in the input phoneme sequence
all_tokens_with_spn = list()
max_seq_len = -1
for i in range(len(phoneme_labels)):
phoneme_label = phoneme_labels[i]
token_seq = (
self.input_encoder.encode_sequence_torch(phoneme_label)
.int()
.to(self.device)
)
last_phonemes = torch.LongTensor(last_phonemes_combined[i]).to(
self.device
)
# Runs the silent phoneme predictor
spn_preds = (
self.hparams.modules["spn_predictor"]
.infer(token_seq.unsqueeze(0), last_phonemes.unsqueeze(0))
.int()
)
spn_to_add = torch.nonzero(spn_preds).reshape(-1).tolist()
tokens_with_spn = list()
for token_idx in range(token_seq.shape[0]):
tokens_with_spn.append(token_seq[token_idx].item())
if token_idx in spn_to_add:
tokens_with_spn.append(self.spn_token_encoded)
tokens_with_spn = torch.LongTensor(tokens_with_spn).to(self.device)
all_tokens_with_spn.append(tokens_with_spn)
if max_seq_len < tokens_with_spn.shape[-1]:
max_seq_len = tokens_with_spn.shape[-1]
# "tokens_with_spn_tensor" holds the input phoneme sequence with silent phonemes
tokens_with_spn_tensor = torch.LongTensor(
tokens.shape[0], max_seq_len
).to(self.device)
tokens_with_spn_tensor.zero_()
for seq_idx, seq in enumerate(all_tokens_with_spn):
tokens_with_spn_tensor[seq_idx, : len(seq)] = seq
(
_,
postnet_mel_out_spn_pred,
_,
_,
_,
_,
_,
predict_mel_lens_spn_pred,
) = self.hparams.model(tokens_with_spn_tensor)
return (
postnet_mel_out,
predict_mel_lens,
postnet_mel_out_spn_pred,
predict_mel_lens_spn_pred,
)
def run_vocoder(self, inference_mel, mel_lens, sample_type=""):
"""Uses a pretrained vocoder to generate audio from predicted mel
spectogram. By default, uses speechbrain hifigan.
Arguments
---------
inference_mel: torch.Tensor
predicted mel from fastspeech2 inference
mel_lens: torch.Tensor
predicted mel lengths from fastspeech2 inference
used to mask the noise from padding
sample_type: str
used for logging the type of the inference sample being generated
Returns
-------
None
"""
if self.last_batch is None:
return
*_, wavs = self.last_batch
inference_mel = inference_mel[: self.hparams.progress_batch_sample_size]
mel_lens = mel_lens[0 : self.hparams.progress_batch_sample_size]
assert (
self.hparams.vocoder == "hifi-gan"
and self.hparams.pretrained_vocoder is True
), "Specified vocoder not supported yet"
logger.info(
f"Generating audio with pretrained {self.hparams.vocoder_source} vocoder"
)
hifi_gan = HIFIGAN.from_hparams(
source=self.hparams.vocoder_source,
savedir=self.hparams.vocoder_download_path,
)
waveforms = hifi_gan.decode_batch(
inference_mel.transpose(2, 1), mel_lens, self.hparams.hop_length
)
for idx, wav in enumerate(waveforms):
path = os.path.join(
self.hparams.progress_sample_path,
str(self.last_epoch),
f"pred_{sample_type}_{Path(wavs[idx]).stem}.wav",
)
torchaudio.save(path, wav, self.hparams.sample_rate)
def batch_to_device(self, batch, return_metadata=False):
"""Transfers the batch to the target device
Arguments
---------
batch: tuple
the batch to use
return_metadata: bool
indicates whether the metadata should be returned
Returns
-------
batch: tuple
the batch on the correct device
"""
(
text_padded,
durations,
input_lengths,
mel_padded,
pitch_padded,
energy_padded,
output_lengths,
len_x,
labels,
wavs,
no_spn_seq_padded,
spn_labels_padded,
last_phonemes_padded,
) = batch
durations = durations.to(self.device, non_blocking=True).long()
phonemes = text_padded.to(self.device, non_blocking=True).long()
input_lengths = input_lengths.to(self.device, non_blocking=True).long()
spectogram = mel_padded.to(self.device, non_blocking=True).float()
pitch = pitch_padded.to(self.device, non_blocking=True).float()
energy = energy_padded.to(self.device, non_blocking=True).float()
mel_lengths = output_lengths.to(self.device, non_blocking=True).long()
no_spn_seqs = no_spn_seq_padded.to(
self.device, non_blocking=True
).long()
spn_labels = spn_labels_padded.to(self.device, non_blocking=True).long()
last_phonemes = last_phonemes_padded.to(
self.device, non_blocking=True
).long()
x = (phonemes, durations, pitch, energy, no_spn_seqs, last_phonemes)
y = (
spectogram,
durations,
pitch,
energy,
mel_lengths,
input_lengths,
spn_labels,
)
metadata = (labels, wavs)
if return_metadata:
return x, y, metadata
return x, y
def dataio_prepare(hparams):
# Load lexicon
lexicon = hparams["lexicon"]
input_encoder = hparams.get("input_encoder")
# add a dummy symbol for idx 0 - used for padding.
lexicon = ["@@"] + lexicon
input_encoder.update_from_iterable(lexicon, sequence_input=False)
input_encoder.add_unk()
# load audio, text and durations on the fly; encode audio and text.
@sb.utils.data_pipeline.takes(
"wav",
"label_phoneme",
"durations",
"pitch",
"start",
"end",
"spn_labels",
"last_phoneme_flags",
)
@sb.utils.data_pipeline.provides("mel_text_pair")
def audio_pipeline(
wav,
label_phoneme,
dur,
pitch,
start,
end,
spn_labels,
last_phoneme_flags,
):
durs = np.load(dur)
durs_seq = torch.from_numpy(durs).int()
label_phoneme = label_phoneme.strip()
label_phoneme = label_phoneme.split()
text_seq = input_encoder.encode_sequence_torch(label_phoneme).int()
assert len(text_seq) == len(
durs
), f"{len(text_seq)}, {len(durs), len(label_phoneme)}, ({label_phoneme})" # ensure every token has a duration
no_spn_label, last_phonemes = list(), list()
for i in range(len(label_phoneme)):
if label_phoneme[i] != "spn":
no_spn_label.append(label_phoneme[i])
last_phonemes.append(last_phoneme_flags[i])
no_spn_seq = input_encoder.encode_sequence_torch(no_spn_label).int()
spn_labels = [
spn_labels[i]
for i in range(len(label_phoneme))
if label_phoneme[i] != "spn"
]
audio, fs = torchaudio.load(wav)
audio = audio.squeeze()
audio = audio[int(fs * start) : int(fs * end)]
mel, energy = hparams["mel_spectogram"](audio=audio)
mel = mel[:, : sum(durs)]
energy = energy[: sum(durs)]
pitch = np.load(pitch)
pitch = torch.from_numpy(pitch)
pitch = pitch[: mel.shape[-1]]
return (
text_seq,
durs_seq,
mel,
pitch,
energy,
len(text_seq),
last_phonemes,
no_spn_seq,
spn_labels,
)
# define splits and load it as sb dataset
datasets = {}
for dataset in hparams["splits"]:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=hparams[f"{dataset}_json"],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["mel_text_pair", "wav", "label", "durations", "pitch"],
)
return datasets, input_encoder
def main():
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file) as fin:
hparams = load_hyperpyyaml(fin, overrides)
sb.utils.distributed.ddp_init_group(run_opts)
sb.create_experiment_directory(
experiment_directory=hparams["output_folder"],
hyperparams_to_save=hparams_file,
overrides=overrides,
)
from ljspeech_prepare import prepare_ljspeech
sb.utils.distributed.run_on_main(
prepare_ljspeech,
kwargs={
"data_folder": hparams["data_folder"],
"save_folder": hparams["save_folder"],
"splits": hparams["splits"],
"split_ratio": hparams["split_ratio"],
"model_name": hparams["model"].__class__.__name__,
"seed": hparams["seed"],
"pitch_n_fft": hparams["n_fft"],
"pitch_hop_length": hparams["hop_length"],
"pitch_min_f0": hparams["min_f0"],
"pitch_max_f0": hparams["max_f0"],
"skip_prep": hparams["skip_prep"],
"use_custom_cleaner": True,
},
)
datasets, input_encoder = dataio_prepare(hparams)
# Brain class initialization
fastspeech2_brain = FastSpeech2Brain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
run_opts=run_opts,
checkpointer=hparams["checkpointer"],
)
fastspeech2_brain.input_encoder = input_encoder
# Training
fastspeech2_brain.fit(
fastspeech2_brain.hparams.epoch_counter,
datasets["train"],
datasets["valid"],
train_loader_kwargs=hparams["train_dataloader_opts"],
valid_loader_kwargs=hparams["valid_dataloader_opts"],
)
if __name__ == "__main__":
main()
"""
Recipe for training the FastSpeech2 Text-To-Speech model
Instead of using pre-extracted phoneme durations from MFA,
This recipe trains an internal alignment from scratch, as introduced in:
https://arxiv.org/pdf/2108.10447.pdf (One TTS Alignment To Rule Them All)
To run this recipe, do the following:
# python train_internal_alignment.py hparams/train_internal_alignment.yaml
Authors
* Yingzhi Wang 2023
"""
import logging
import os
import sys
from pathlib import Path
import numpy as np
import torch
import torchaudio
from hyperpyyaml import load_hyperpyyaml
import speechbrain as sb
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.utils.data_utils import scalarize
os.environ["TOKENIZERS_PARALLELISM"] = "false"
logger = logging.getLogger(__name__)
class FastSpeech2Brain(sb.Brain):
def on_fit_start(self):
"""Gets called at the beginning of ``fit()``, on multiple processes
if ``distributed_count > 0`` and backend is ddp and initializes statistics
"""
self.hparams.progress_sample_logger.reset()
self.last_epoch = 0
self.last_batch = None
self.last_loss_stats = {}
return super().on_fit_start()
def compute_forward(self, batch, stage):
"""Computes the forward pass
Arguments
---------
batch: str
a single batch
stage: speechbrain.Stage
the training stage
Returns
-------
the model output
"""
inputs, _ = self.batch_to_device(batch)
return self.hparams.model(*inputs)
def on_fit_batch_end(self, batch, outputs, loss, should_step):
"""At the end of the optimizer step, apply noam annealing and logging."""
if should_step:
self.hparams.noam_annealing(self.optimizer)
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : torch.Tensor
The model generated spectrograms and other metrics from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
x, y, metadata = self.batch_to_device(batch, return_metadata=True)
self.last_batch = [x[0], y[-1], y[-2], predictions[0], *metadata]
self._remember_sample([x[0], *y, *metadata], predictions)
loss = self.hparams.criterion(
predictions, y, self.hparams.epoch_counter.current
)
self.last_loss_stats[stage] = scalarize(loss)
return loss["total_loss"]
def _remember_sample(self, batch, predictions):
"""Remembers samples of spectrograms and the batch for logging purposes
Arguments
---------
batch: tuple
a training batch
predictions: tuple
predictions (raw output of the FastSpeech2
model)
"""
(
phoneme_padded,
mel_padded,
pitch,
energy,
output_lengths,
input_lengths,
labels,
wavs,
) = batch
(
mel_post,
postnet_mel_out,
predict_durations,
predict_pitch,
average_pitch,
predict_energy,
average_energy,
predict_mel_lens,
alignment_durations,
alignment_soft,
alignment_logprob,
alignment_mas,
) = predictions
self.hparams.progress_sample_logger.remember(
target=self.process_mel(mel_padded, output_lengths),
output=self.process_mel(postnet_mel_out, output_lengths),
raw_batch=self.hparams.progress_sample_logger.get_batch_sample(
{
"tokens": phoneme_padded,
"input_lengths": input_lengths,
"mel_target": mel_padded,
"mel_out": postnet_mel_out,
"mel_lengths": predict_mel_lens,
"durations": alignment_durations,
"predict_durations": predict_durations,
"labels": labels,
"wavs": wavs,
}
),
)
def process_mel(self, mel, len, index=0):
"""Converts a mel spectrogram to one that can be saved as an image
sample = sqrt(exp(mel))
Arguments
---------
mel: torch.Tensor
the mel spectrogram (as used in the model)
len: int
length of the mel spectrogram
index: int
batch index
Returns
-------
mel: torch.Tensor
the spectrogram, for image saving purposes
"""
assert mel.dim() == 3
return torch.sqrt(torch.exp(mel[index][: len[index]]))
def on_stage_end(self, stage, stage_loss, epoch):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# At the end of validation, we can write
if stage == sb.Stage.VALID:
# Update learning rate
self.last_epoch = epoch
lr = self.hparams.noam_annealing.current_lr
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats( # 1#2#
stats_meta={"Epoch": epoch, "lr": lr},
train_stats=self.last_loss_stats[sb.Stage.TRAIN],
valid_stats=self.last_loss_stats[sb.Stage.VALID],
)
output_progress_sample = (
self.hparams.progress_samples
and epoch % self.hparams.progress_samples_interval == 0
and epoch >= self.hparams.progress_samples_min_run
)
if output_progress_sample:
logger.info("Saving predicted samples")
inference_mel, mel_lens = self.run_inference()
self.hparams.progress_sample_logger.save(epoch)
self.run_vocoder(inference_mel, mel_lens)
# Save the current checkpoint and delete previous checkpoints.
# UNCOMMENT THIS
self.checkpointer.save_and_keep_only(
meta=self.last_loss_stats[stage],
min_keys=["total_loss"],
)
# We also write statistics about test data spectogram to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=self.last_loss_stats[sb.Stage.TEST],
)
def run_inference(self):
"""Produces a sample in inference mode with predicted durations."""
if self.last_batch is None:
return
tokens, *_ = self.last_batch
(
_,
postnet_mel_out,
_,
_,
_,
_,
_,
predict_mel_lens,
_,
_,
_,
_,
) = self.hparams.model(tokens)
self.hparams.progress_sample_logger.remember(
infer_output=self.process_mel(
postnet_mel_out, [len(postnet_mel_out[0])]
)
)
return postnet_mel_out, predict_mel_lens
def run_vocoder(self, inference_mel, mel_lens):
"""Uses a pretrained vocoder to generate audio from predicted mel
spectogram. By default, uses speechbrain hifigan.
Arguments
---------
inference_mel: torch.Tensor
predicted mel from fastspeech2 inference
mel_lens: torch.Tensor
predicted mel lengths from fastspeech2 inference
used to mask the noise from padding
Returns
-------
None
"""
if self.last_batch is None:
return
*_, wavs = self.last_batch
inference_mel = inference_mel[: self.hparams.progress_batch_sample_size]
mel_lens = mel_lens[0 : self.hparams.progress_batch_sample_size]
assert (
self.hparams.vocoder == "hifi-gan"
and self.hparams.pretrained_vocoder is True
), "Specified vocoder not supported yet"
logger.info(
f"Generating audio with pretrained {self.hparams.vocoder_source} vocoder"
)
hifi_gan = HIFIGAN.from_hparams(
source=self.hparams.vocoder_source,
savedir=self.hparams.vocoder_download_path,
)
waveforms = hifi_gan.decode_batch(
inference_mel.transpose(2, 1), mel_lens, self.hparams.hop_length
)
for idx, wav in enumerate(waveforms):
path = os.path.join(
self.hparams.progress_sample_path,
str(self.last_epoch),
f"pred_{Path(wavs[idx]).stem}.wav",
)
torchaudio.save(path, wav, self.hparams.sample_rate)
def batch_to_device(self, batch, return_metadata=False):
"""Transfers the batch to the target device
Arguments
---------
batch: tuple
the batch to use
return_metadata: bool
Whether to additionally return labels and wavs.
Returns
-------
x: tuple
phonemes, spectrogram, pitch, energy
y: tuple
spectrogram, pitch, energy, mel_lengths, input_lengths
metadata: tuple
labels, wavs
"""
(
phoneme_padded,
input_lengths,
mel_padded,
pitch_padded,
energy_padded,
output_lengths,
# len_x,
labels,
wavs,
) = batch
# durations = durations.to(self.device, non_blocking=True).long()
phonemes = phoneme_padded.to(self.device, non_blocking=True).long()
input_lengths = input_lengths.to(self.device, non_blocking=True).long()
spectogram = mel_padded.to(self.device, non_blocking=True).float()
pitch = pitch_padded.to(self.device, non_blocking=True).float()
energy = energy_padded.to(self.device, non_blocking=True).float()
mel_lengths = output_lengths.to(self.device, non_blocking=True).long()
x = (phonemes, spectogram, pitch, energy)
y = (spectogram, pitch, energy, mel_lengths, input_lengths)
metadata = (labels, wavs)
if return_metadata:
return x, y, metadata
return x, y
def dataio_prepare(hparams):
"Creates the datasets and their data processing pipelines."
# Load lexicon
lexicon = hparams["lexicon"]
input_encoder = hparams.get("input_encoder")
# add a dummy symbol for idx 0 - used for padding.
lexicon = ["@@"] + lexicon
input_encoder.update_from_iterable(lexicon, sequence_input=False)
input_encoder.add_unk()
# load audio, text and durations on the fly; encode audio and text.
@sb.utils.data_pipeline.takes("wav", "phonemes", "pitch")
@sb.utils.data_pipeline.provides("mel_text_pair")
def audio_pipeline(wav, phonemes, pitch):
phoneme_seq = input_encoder.encode_sequence_torch(phonemes).int()
audio, fs = torchaudio.load(wav)
audio = audio.squeeze()
mel, energy = hparams["mel_spectogram"](audio=audio)
pitch = np.load(pitch)
pitch = torch.from_numpy(pitch)
pitch = pitch[: mel.shape[-1]]
return phoneme_seq, mel, pitch, energy, len(phoneme_seq), len(mel)
# define splits and load it as sb dataset
datasets = {}
for dataset in hparams["splits"]:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=hparams[f"{dataset}_json"],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["mel_text_pair", "wav", "label", "pitch"],
)
return datasets
def main():
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file) as fin:
hparams = load_hyperpyyaml(fin, overrides)
sb.utils.distributed.ddp_init_group(run_opts)
sb.create_experiment_directory(
experiment_directory=hparams["output_folder"],
hyperparams_to_save=hparams_file,
overrides=overrides,
)
from ljspeech_prepare import prepare_ljspeech
sb.utils.distributed.run_on_main(
prepare_ljspeech,
kwargs={
"data_folder": hparams["data_folder"],
"save_folder": hparams["save_folder"],
"splits": hparams["splits"],
"split_ratio": hparams["split_ratio"],
"model_name": hparams["model"].__class__.__name__,
"seed": hparams["seed"],
"pitch_n_fft": hparams["n_fft"],
"pitch_hop_length": hparams["hop_length"],
"pitch_min_f0": hparams["min_f0"],
"pitch_max_f0": hparams["max_f0"],
"skip_prep": hparams["skip_prep"],
"use_custom_cleaner": True,
"device": "cuda",
},
)
datasets = dataio_prepare(hparams)
# Brain class initialization
fastspeech2_brain = FastSpeech2Brain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
run_opts=run_opts,
checkpointer=hparams["checkpointer"],
)
# Training
fastspeech2_brain.fit(
fastspeech2_brain.hparams.epoch_counter,
datasets["train"],
datasets["valid"],
train_loader_kwargs=hparams["train_dataloader_opts"],
valid_loader_kwargs=hparams["valid_dataloader_opts"],
)
if __name__ == "__main__":
main()
############################################################################
# Model: Tacotron2
# Tokens: Raw characters (English text)
# losses: MSE (mel) + BCE (stop gate) + guided attention
# Training: LJSpeech
# Authors: Georges Abous-Rjeili, Artem Ploujnikov, Yingzhi Wang
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref ./results/tacotron2/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 750
keep_checkpoint_interval: 50
###################################
# Progress Samples #
###################################
# Progress samples are used to monitor the progress
# of an ongoing training session by outputting samples
# of spectrograms, alignments, etc at regular intervals
# Whether to enable progress samples
progress_samples: True
# The path where the samples will be stored
progress_sample_path: !ref <output_folder>/samples
# The interval, in epochs. For instance, if it is set to 5,
# progress samples will be output every 5 epochs
progress_samples_interval: 1
# The sample size for raw batch samples saved in batch.pth
# (useful mostly for model debugging)
progress_batch_sample_size: 3
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /localscratch/ljspeech
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
# Use the original NVIDIA preprocessing
# The text cleaners to be used (applicable to the NVIDIA preprocessing only)
text_cleaners: ['english_cleaners']
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: 1024
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
mel_normalized: False
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.001
weight_decay: 0.000006
batch_size: 64 #minimum 2
num_workers: 8
mask_padding: True
guided_attention_sigma: 0.2
guided_attention_weight: 50.0
guided_attention_weight_half_life: 10.
guided_attention_hard_stop: 50
gate_loss_weight: 1.0
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
test_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
################################
# Model Parameters and model #
################################
n_symbols: 148 # fixed; depends on the symbol set used by text_to_sequence
symbols_embedding_dim: 512
# Encoder parameters
encoder_kernel_size: 5
encoder_n_convolutions: 3
encoder_embedding_dim: 512
# Decoder parameters
# The number of frames in the target per encoder step
n_frames_per_step: 1
decoder_rnn_dim: 1024
prenet_dim: 256
max_decoder_steps: 1000
gate_threshold: 0.5
p_attention_dropout: 0.1
p_decoder_dropout: 0.1
decoder_no_early_stopping: False
# Attention parameters
attention_rnn_dim: 1024
attention_dim: 128
# Location Layer parameters
attention_location_n_filters: 32
attention_location_kernel_size: 31
# Mel-post processing network parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
mel_spectogram: !name:speechbrain.lobes.models.Tacotron2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
#model
model: !new:speechbrain.lobes.models.Tacotron2.Tacotron2
mask_padding: !ref <mask_padding>
n_mel_channels: !ref <n_mel_channels>
# symbols
n_symbols: !ref <n_symbols>
symbols_embedding_dim: !ref <symbols_embedding_dim>
# encoder
encoder_kernel_size: !ref <encoder_kernel_size>
encoder_n_convolutions: !ref <encoder_n_convolutions>
encoder_embedding_dim: !ref <encoder_embedding_dim>
# attention
attention_rnn_dim: !ref <attention_rnn_dim>
attention_dim: !ref <attention_dim>
# attention location
attention_location_n_filters: !ref <attention_location_n_filters>
attention_location_kernel_size: !ref <attention_location_kernel_size>
# decoder
n_frames_per_step: !ref <n_frames_per_step>
decoder_rnn_dim: !ref <decoder_rnn_dim>
prenet_dim: !ref <prenet_dim>
max_decoder_steps: !ref <max_decoder_steps>
gate_threshold: !ref <gate_threshold>
p_attention_dropout: !ref <p_attention_dropout>
p_decoder_dropout: !ref <p_decoder_dropout>
# postnet
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
decoder_no_early_stopping: !ref <decoder_no_early_stopping>
guided_attention_scheduler: !new:speechbrain.nnet.schedulers.StepScheduler
initial_value: !ref <guided_attention_weight>
half_life: !ref <guided_attention_weight_half_life>
criterion: !new:speechbrain.lobes.models.Tacotron2.Loss
gate_loss_weight: !ref <gate_loss_weight>
guided_attention_weight: !ref <guided_attention_weight>
guided_attention_sigma: !ref <guided_attention_sigma>
guided_attention_scheduler: !ref <guided_attention_scheduler>
guided_attention_hard_stop: !ref <guided_attention_hard_stop>
modules:
model: !ref <model>
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#annealing_function
lr_annealing: !new:speechbrain.nnet.schedulers.IntervalScheduler
intervals:
- steps: 6000
lr: 0.0005
- steps: 8000
lr: 0.0003
- steps: 10000
lr: 0.0001
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
counter: !ref <epoch_counter>
scheduler: !ref <lr_annealing>
#infer: !name:speechbrain.lobes.models.Tacotron2.infer
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
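The hyperparameter file above is a HyperPyYAML document: !ref resolves references to other keys, !new: instantiates objects, and !name: builds partially applied callables at load time. A minimal sketch of loading it from code follows; the file name and the override values are assumptions.
```
# Minimal HyperPyYAML loading sketch (illustrative only).
from hyperpyyaml import load_hyperpyyaml

overrides = {"data_folder": "/path/to/LJSpeech-1.1", "batch_size": 32}
with open("hparams/train.yaml") as fin:  # assumed file name
    hparams = load_hyperpyyaml(fin, overrides)

print(hparams["epochs"])       # plain value (750 in the file above)
print(type(hparams["model"]))  # Tacotron2 module instantiated by !new:
```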
"""
LJspeech data preparation.
Download: https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
Authors
* Yingzhi WANG 2022
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
"""
import csv
import json
import logging
import os
import random
import re
import numpy as np
import tgt
import torch
import torchaudio
from tqdm import tqdm
from unidecode import unidecode
from speechbrain.dataio.dataio import load_pkl, save_pkl
from speechbrain.inference.text import GraphemeToPhoneme
from speechbrain.utils.data_utils import download_file
from speechbrain.utils.text_to_sequence import _g2p_keep_punctuations
logger = logging.getLogger(__name__)
OPT_FILE = "opt_ljspeech_prepare.pkl"
METADATA_CSV = "metadata.csv"
TRAIN_JSON = "train.json"
VALID_JSON = "valid.json"
TEST_JSON = "test.json"
WAVS = "wavs"
DURATIONS = "durations"
def prepare_ljspeech(
data_folder,
save_folder,
splits=["train", "valid"],
split_ratio=[90, 10],
model_name=None,
seed=1234,
pitch_n_fft=1024,
pitch_hop_length=256,
pitch_min_f0=65,
pitch_max_f0=400,
skip_prep=False,
use_custom_cleaner=False,
device="cpu",
):
"""
Prepares the json files for the LJSpeech dataset.
Arguments
---------
data_folder : str
Path to the folder where the original LJspeech dataset is stored
save_folder : str
The directory where to store the csv/json files
splits : list
List of dataset splits to prepare
split_ratio : list
Proportion for dataset splits
model_name : str
Model name (used to prepare additional model specific data)
seed : int
Random seed
pitch_n_fft : int
Number of fft points for pitch computation
pitch_hop_length : int
Hop length for pitch computation
pitch_min_f0 : int
Minimum f0 for pitch computation
pitch_max_f0 : int
Max f0 for pitch computation
skip_prep : bool
If True, skip preparation
use_custom_cleaner : bool
If True, uses custom cleaner defined for this recipe
device : str
Device to be used for computation (used as required)
Returns
-------
None
Example
-------
>>> from recipes.LJSpeech.TTS.ljspeech_prepare import prepare_ljspeech
>>> data_folder = 'data/LJspeech/'
>>> save_folder = 'save/'
>>> splits = ['train', 'valid']
>>> split_ratio = [90, 10]
>>> seed = 1234
>>> prepare_ljspeech(data_folder, save_folder, splits, split_ratio, seed)
"""
# Sets seeds for reproducible code
random.seed(seed)
if skip_prep:
return
# Creating configuration for easily skipping the data preparation stage
conf = {
"data_folder": data_folder,
"splits": splits,
"split_ratio": split_ratio,
"save_folder": save_folder,
"seed": seed,
}
if not os.path.exists(save_folder):
os.makedirs(save_folder)
# Setting output files
meta_csv = os.path.join(data_folder, METADATA_CSV)
wavs_folder = os.path.join(data_folder, WAVS)
save_opt = os.path.join(save_folder, OPT_FILE)
save_json_train = os.path.join(save_folder, TRAIN_JSON)
save_json_valid = os.path.join(save_folder, VALID_JSON)
save_json_test = os.path.join(save_folder, TEST_JSON)
phoneme_alignments_folder = None
duration_folder = None
pitch_folder = None
# Setting up additional folders required for FastSpeech2
if model_name is not None and "FastSpeech2" in model_name:
# This step requires phoneme alignments to be present in the data_folder
# We automatically download the alignments from https://www.dropbox.com/s/v28x5ldqqa288pu/LJSpeech.zip
# Download and unzip LJSpeech phoneme alignments from here: https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4
alignment_URL = (
"https://www.dropbox.com/s/v28x5ldqqa288pu/LJSpeech.zip?dl=1"
)
phoneme_alignments_folder = os.path.join(
data_folder, "TextGrid", "LJSpeech"
)
download_file(
alignment_URL, data_folder + "/alignments.zip", unpack=True
)
duration_folder = os.path.join(data_folder, "durations")
if not os.path.exists(duration_folder):
os.makedirs(duration_folder)
# Extract pitch for both FastSpeech2 and FastSpeech2WithAlignment models
pitch_folder = os.path.join(data_folder, "pitch")
if not os.path.exists(pitch_folder):
os.makedirs(pitch_folder)
# Check if this phase is already done (if so, skip it)
if skip(splits, save_folder, conf):
logger.info("Skipping preparation, completed in previous run.")
return
# Additional check to make sure metadata.csv and wavs folder exists
assert os.path.exists(meta_csv), "metadata.csv does not exist"
assert os.path.exists(wavs_folder), "wavs/ folder does not exist"
# Prepare data splits
msg = "Creating json file for ljspeech Dataset.."
logger.info(msg)
data_split, meta_csv = split_sets(data_folder, splits, split_ratio)
if "train" in splits:
prepare_json(
model_name,
data_split["train"],
save_json_train,
wavs_folder,
meta_csv,
phoneme_alignments_folder,
duration_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner,
device,
)
if "valid" in splits:
prepare_json(
model_name,
data_split["valid"],
save_json_valid,
wavs_folder,
meta_csv,
phoneme_alignments_folder,
duration_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner,
device,
)
if "test" in splits:
prepare_json(
model_name,
data_split["test"],
save_json_test,
wavs_folder,
meta_csv,
phoneme_alignments_folder,
duration_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner,
device,
)
save_pkl(conf, save_opt)
def skip(splits, save_folder, conf):
"""
Detects whether the LJSpeech data preparation has already been done.
If the preparation has been done, we can skip it.
Arguments
---------
splits : list
The portions of data to review.
save_folder : str
The path to the directory containing prepared files.
conf : dict
Configuration to match against saved config.
Returns
-------
bool
if True, the preparation phase can be skipped.
if False, it must be done.
"""
# Checking json files
skip = True
split_files = {
"train": TRAIN_JSON,
"valid": VALID_JSON,
"test": TEST_JSON,
}
for split in splits:
if not os.path.isfile(os.path.join(save_folder, split_files[split])):
skip = False
# Checking saved options
save_opt = os.path.join(save_folder, OPT_FILE)
if skip is True:
if os.path.isfile(save_opt):
opts_old = load_pkl(save_opt)
if opts_old == conf:
skip = True
else:
skip = False
else:
skip = False
return skip
def split_sets(data_folder, splits, split_ratio):
"""Randomly splits the wav list into training, validation, and test lists.
Note that a better approach is to make sure that all the classes have the
same proportion of samples for each session.
Arguments
---------
data_folder : str
The path to the directory containing the data.
splits : list
The list of the selected splits.
split_ratio : list
List composed of three integers that set the split ratios for train,
valid, and test sets, respectively.
For instance split_ratio=[80, 10, 10] will assign 80% of the sentences
to training, 10% for validation, and 10% for test.
Returns
-------
(data_split, meta_csv) : tuple
A dictionary containing the train, valid, and test splits, together with the parsed metadata rows.
"""
meta_csv = os.path.join(data_folder, METADATA_CSV)
csv_reader = csv.reader(
open(meta_csv), delimiter="|", quoting=csv.QUOTE_NONE
)
meta_csv = list(csv_reader)
index_for_sessions = []
session_id_start = "LJ001"
index_this_session = []
for i in range(len(meta_csv)):
session_id = meta_csv[i][0].split("-")[0]
if session_id == session_id_start:
index_this_session.append(i)
if i == len(meta_csv) - 1:
index_for_sessions.append(index_this_session)
else:
index_for_sessions.append(index_this_session)
session_id_start = session_id
index_this_session = [i]
session_len = [len(session) for session in index_for_sessions]
data_split = {}
for i, split in enumerate(splits):
data_split[split] = []
for j in range(len(index_for_sessions)):
if split == "train":
random.shuffle(index_for_sessions[j])
n_snts = int(session_len[j] * split_ratio[i] / sum(split_ratio))
data_split[split].extend(index_for_sessions[j][0:n_snts])
del index_for_sessions[j][0:n_snts]
if split == "valid":
if "test" in splits:
random.shuffle(index_for_sessions[j])
n_snts = int(
session_len[j] * split_ratio[i] / sum(split_ratio)
)
data_split[split].extend(index_for_sessions[j][0:n_snts])
del index_for_sessions[j][0:n_snts]
else:
data_split[split].extend(index_for_sessions[j])
if split == "test":
data_split[split].extend(index_for_sessions[j])
return data_split, meta_csv
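# Illustrative behaviour of split_sets (numbers assumed): with
# splits=["train", "valid"], split_ratio=[90, 10] and a session of 100
# utterances, "train" receives 90 randomly selected indices and "valid"
# receives the remaining 10, since the "valid" branch takes everything left
# over whenever no "test" split is requested.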
def prepare_json(
model_name,
seg_lst,
json_file,
wavs_folder,
csv_reader,
phoneme_alignments_folder,
durations_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner=False,
device="cpu",
):
"""
Creates json file given a list of indexes.
Arguments
---------
model_name : str
Model name (used to prepare additional model specific data)
seg_lst : list
The list of json indexes of a given data split
json_file : str
Output json path
wavs_folder : str
LJspeech wavs folder
csv_reader : _csv.reader
LJspeech metadata
phoneme_alignments_folder : path
Path where the phoneme alignments are stored
durations_folder : path
Folder where to store the duration values of each audio
pitch_folder : path
Folder where to store the pitch of each audio
pitch_n_fft : int
Number of fft points for pitch computation
pitch_hop_length : int
Hop length for pitch computation
pitch_min_f0 : int
Minimum f0 for pitch computation
pitch_max_f0 : int
Max f0 for pitch computation
use_custom_cleaner : bool
If True, uses custom cleaner defined for this recipe
device : str
Device to be used for computation (used as required)
"""
logger.info(f"preparing {json_file}.")
if model_name in ["Tacotron2", "FastSpeech2WithAlignment"]:
logger.info(
"Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
)
g2p = GraphemeToPhoneme.from_hparams(
"speechbrain/soundchoice-g2p", run_opts={"device": device}
)
if model_name is not None and "FastSpeech2" in model_name:
logger.info(
"Computing pitch as required for FastSpeech2. This may take a while."
)
json_dict = {}
for index in tqdm(seg_lst):
# Common data preparation
id = list(csv_reader)[index][0]
wav = os.path.join(wavs_folder, f"{id}.wav")
label = list(csv_reader)[index][2]
if use_custom_cleaner:
label = custom_clean(label, model_name)
json_dict[id] = {
"uttid": id,
"wav": wav,
"label": label,
"segment": True if "train" in json_file else False,
}
# FastSpeech2 specific data preparation
if model_name == "FastSpeech2":
audio, fs = torchaudio.load(wav)
# Parses phoneme alignments
textgrid_path = os.path.join(
phoneme_alignments_folder, f"{id}.TextGrid"
)
textgrid = tgt.io.read_textgrid(
textgrid_path, include_empty_intervals=True
)
last_phoneme_flags = get_last_phoneme_info(
textgrid.get_tier_by_name("words"),
textgrid.get_tier_by_name("phones"),
)
(
phonemes,
duration,
start,
end,
trimmed_last_phoneme_flags,
) = get_alignment(
textgrid.get_tier_by_name("phones"),
fs,
pitch_hop_length,
last_phoneme_flags,
)
# Gets label phonemes
label_phoneme = " ".join(phonemes)
spn_labels = [0] * len(phonemes)
for i in range(1, len(phonemes)):
if phonemes[i] == "spn":
spn_labels[i - 1] = 1
if start >= end:
print(f"Skipping {id}")
continue
# Saves durations
duration_file_path = os.path.join(durations_folder, f"{id}.npy")
np.save(duration_file_path, duration)
# Computes pitch
audio = audio[:, int(fs * start) : int(fs * end)]
pitch_file = wav.replace(".wav", ".npy").replace(
wavs_folder, pitch_folder
)
if not os.path.isfile(pitch_file):
pitch = torchaudio.functional.detect_pitch_frequency(
waveform=audio,
sample_rate=fs,
frame_time=(pitch_hop_length / fs),
win_length=3,
freq_low=pitch_min_f0,
freq_high=pitch_max_f0,
).squeeze(0)
# Concatenate last element to match duration.
pitch = torch.cat([pitch, pitch[-1].unsqueeze(0)])
# Mean and Variance Normalization
mean = 256.1732939688805
std = 328.319759158607
pitch = (pitch - mean) / std
pitch = pitch[: sum(duration)]
np.save(pitch_file, pitch)
# Updates data for the utterance
json_dict[id].update({"label_phoneme": label_phoneme})
json_dict[id].update({"spn_labels": spn_labels})
json_dict[id].update({"start": start})
json_dict[id].update({"end": end})
json_dict[id].update({"durations": duration_file_path})
json_dict[id].update({"pitch": pitch_file})
json_dict[id].update(
{"last_phoneme_flags": trimmed_last_phoneme_flags}
)
# FastSpeech2WithAlignment specific data preparation
if model_name == "FastSpeech2WithAlignment":
audio, fs = torchaudio.load(wav)
# Computes pitch
pitch_file = wav.replace(".wav", ".npy").replace(
wavs_folder, pitch_folder
)
if not os.path.isfile(pitch_file):
if torchaudio.__version__ < "2.1":
pitch = torchaudio.functional.compute_kaldi_pitch(
waveform=audio,
sample_rate=fs,
frame_length=(pitch_n_fft / fs * 1000),
frame_shift=(pitch_hop_length / fs * 1000),
min_f0=pitch_min_f0,
max_f0=pitch_max_f0,
)[0, :, 0]
else:
pitch = torchaudio.functional.detect_pitch_frequency(
waveform=audio,
sample_rate=fs,
frame_time=(pitch_hop_length / fs),
win_length=3,
freq_low=pitch_min_f0,
freq_high=pitch_max_f0,
).squeeze(0)
# Concatenate last element to match duration.
pitch = torch.cat([pitch, pitch[-1].unsqueeze(0)])
# Mean and Variance Normalization
mean = 256.1732939688805
std = 328.319759158607
pitch = (pitch - mean) / std
np.save(pitch_file, pitch)
phonemes = _g2p_keep_punctuations(g2p, label)
# Updates data for the utterance
json_dict[id].update({"phonemes": phonemes})
json_dict[id].update({"pitch": pitch_file})
# Writing the dictionary to the json file
with open(json_file, mode="w") as json_f:
json.dump(json_dict, json_f, indent=2)
logger.info(f"{json_file} successfully created!")
def get_alignment(tier, sampling_rate, hop_length, last_phoneme_flags):
"""
Returns phonemes, phoneme durations (in frames), start time (in seconds), end time (in seconds), and last-phoneme flags.
This function is adapted from https://github.com/ming024/FastSpeech2/blob/master/preprocessor/preprocessor.py
Arguments
---------
tier : tgt.core.IntervalTier
For an utterance, contains Interval objects for phonemes and their start time and end time in seconds
sampling_rate : int
Sample rate of the audio signal
hop_length : int
Hop length for duration computation
last_phoneme_flags : list
List of (phoneme, flag) tuples with flag=1 if the phoneme is the last phoneme else flag=0
Returns
-------
(phonemes, durations, start_time, end_time, trimmed_last_phoneme_flags) : tuple
The phonemes, durations, start time, end time, and trimmed last-phoneme flags for an utterance
"""
sil_phones = ["sil", "sp", "spn", ""]
phonemes = []
durations = []
start_time = 0
end_time = 0
end_idx = 0
trimmed_last_phoneme_flags = []
flag_iter = iter(last_phoneme_flags)
for t in tier._objects:
s, e, p = t.start_time, t.end_time, t.text
current_flag = next(flag_iter)
# Trims leading silences
if phonemes == []:
if p in sil_phones:
continue
else:
start_time = s
if p not in sil_phones:
# For ordinary phones
# Removes stress indicators
if p[-1].isdigit():
phonemes.append(p[:-1])
else:
phonemes.append(p)
trimmed_last_phoneme_flags.append(current_flag[1])
end_time = e
end_idx = len(phonemes)
else:
# Uses a unique token for all silent phones
phonemes.append("spn")
trimmed_last_phoneme_flags.append(current_flag[1])
durations.append(
int(
np.round(e * sampling_rate / hop_length)
- np.round(s * sampling_rate / hop_length)
)
)
# Trims trailing silences
phonemes = phonemes[:end_idx]
durations = durations[:end_idx]
return phonemes, durations, start_time, end_time, trimmed_last_phoneme_flags
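# Worked example for the duration computation above (values assumed, not taken
# from the dataset): with sampling_rate=22050 and hop_length=256, a phoneme
# spanning 0.10 s to 0.25 s contributes
# int(np.round(0.25 * 22050 / 256) - np.round(0.10 * 22050 / 256)) = 22 - 9 = 13 frames.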
def get_last_phoneme_info(words_seq, phones_seq):
"""This function takes word and phoneme tiers from a TextGrid file as input
and provides a list of tuples for the phoneme sequence indicating whether
each of the phonemes is the last phoneme of a word or not.
Each tuple of the returned list has this format: (phoneme, flag)
Arguments
---------
words_seq : tier
word tier from a TextGrid file
phones_seq : tier
phoneme tier from a TextGrid file
Returns
-------
last_phoneme_flags : list
each tuple of the returned list has this format: (phoneme, flag)
"""
# Gets all phoneme objects for the entire sequence
phoneme_objects = phones_seq._objects
phoneme_iter = iter(phoneme_objects)
# Stores flags to show if an element (phoneme) is the last phoneme of a word
last_phoneme_flags = list()
# Matches the end times of the phoneme and word objects to get the last phoneme information
for word_obj in words_seq._objects:
word_end_time = word_obj.end_time
current_phoneme = next(phoneme_iter, None)
while current_phoneme:
phoneme_end_time = current_phoneme.end_time
if phoneme_end_time == word_end_time:
last_phoneme_flags.append((current_phoneme.text, 1))
break
else:
last_phoneme_flags.append((current_phoneme.text, 0))
current_phoneme = next(phoneme_iter, None)
return last_phoneme_flags
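# Illustrative example (end times assumed): for a word tier containing "the"
# (ending at 0.30 s) and "cat" (ending at 0.62 s), and a phoneme tier
# DH, AH, K, AE, T where AH ends at 0.30 s and T ends at 0.62 s, the function
# returns [("DH", 0), ("AH", 1), ("K", 0), ("AE", 0), ("T", 1)].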
def custom_clean(text, model_name):
"""
Uses custom criteria to clean text.
Arguments
---------
text : str
Input text to be cleaned
model_name : str
Model name; determines whether punctuation-specific handling is applied
Returns
-------
text : str
Cleaned text
"""
_abbreviations = [
(re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
for x in [
("mrs", "missus"),
("mr", "mister"),
("dr", "doctor"),
("st", "saint"),
("co", "company"),
("jr", "junior"),
("maj", "major"),
("gen", "general"),
("drs", "doctors"),
("rev", "reverend"),
("lt", "lieutenant"),
("hon", "honorable"),
("sgt", "sergeant"),
("capt", "captain"),
("esq", "esquire"),
("ltd", "limited"),
("col", "colonel"),
("ft", "fort"),
]
]
text = unidecode(text.lower())
if model_name != "FastSpeech2WithAlignment":
text = re.sub("[:;]", " - ", text)
text = re.sub(r'[)(\[\]"]', " ", text)
text = text.strip().strip("-")
text = re.sub(" +", " ", text)
for regex, replacement in _abbreviations:
text = re.sub(regex, replacement, text)
return text
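For reference, a small illustrative use of the cleaner defined above; the sample sentence and the expected output are assumptions derived from the regexes, not taken from this repository.
```
# Hypothetical usage of custom_clean from ljspeech_prepare (illustrative only).
from ljspeech_prepare import custom_clean

sample = 'Mr. Smith said: "Dr. Jones arrived (at last)."'
print(custom_clean(sample, model_name="Tacotron2"))
# expected output, roughly: mister smith said - doctor jones arrived at last .
```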
SpeechBrain system description
==============================
Python version:
3.10.12 (main, May 26 2024, 00:14:02) [GCC 9.4.0]
==============================
Installed Python packages:
accelerate==0.31.0
addict==2.4.0
aiosignal==1.3.1
aitemplate @ http://10.6.10.68:8000/release/aitemplate/dtk24.04.1/aitemplate-0.0.1%2Bdas1.1.git5d8aa20.dtk2404.torch2.1.0-py3-none-any.whl#sha256=ad763a7cfd3935857cf10a07a2a97899fd64dda481add2f48de8b8930bd341dd
annotated-types==0.7.0
anyio==4.4.0
apex @ http://10.6.10.68:8000/release/apex/dtk24.04.1/apex-1.1.0%2Bdas1.1.gitf477a3a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=85eb662d13d6e6c3b61c2d878378c2338c4479bc03a1912c3eabddc2d9d08aa1
attrs==23.2.0
audioread==3.0.1
bitsandbytes @ http://10.6.10.68:8000/release/bitsandbyte/dtk24.04.1/bitsandbytes-0.42.0%2Bdas1.1.gitce85679.abi1.dtk2404.torch2.1.0-py3-none-any.whl#sha256=6324e330c8d12b858d39f4986c0ed0836fcb05f539cee92a7cf558e17954ae0d
certifi==2024.6.2
cffi==1.17.0
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
contourpy==1.2.1
cycler==0.12.1
decorator==5.1.1
deepspeed @ http://10.6.10.68:8000/release/deepspeed/dtk24.04.1/deepspeed-0.12.3%2Bgita724046.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=2c158ed2dab21f4f09e7fc29776cb43a1593b13cec33168ce3483f318b852fc9
distlib==0.3.8
dnspython==2.6.1
dropout-layer-norm @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/dropout_layer_norm-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ae10c7cc231a8e38492292e91e76ba710d7679762604c0a7f10964b2385cdbd7
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastpt @ http://10.6.10.68:8000/release/fastpt/dtk24.04.1/fastpt-1.0.0%2Bdas1.1.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ecf30dadcd2482adb1107991edde19b6559b8237379dbb0a3e6eb7306aad3f9a
filelock==3.15.1
fire==0.6.0
flash-attn @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/flash_attn-2.0.4%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7ca8e78ee0624b1ff0e91e9fc265e61b9510f02123a010ac71a2f8e5d08a62f7
flatbuffers==24.3.25
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.6.0
fused-dense-lib @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/fused_dense_lib-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7202dd258a86bb7a1572e3b44b90dae667b0c948bf0f420b05924a107aaaba03
h11==0.14.0
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.4
humanfriendly==10.0
HyperPyYAML==1.2.2
hypothesis==5.35.1
identify==2.6.0
idna==3.7
importlib_metadata==7.1.0
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
layer-check-pt @ http://10.6.10.68:8000/release/layercheck/dtk24.04.1/layer_check_pt-1.2.3.git59a087a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=807adae2d4d4b74898777f81e1b94f1af4d881afe6a7826c7c910b211accbea7
lazy_loader==0.4
librosa==0.10.2.post1
lightop @ http://10.6.10.68:8000/release/lightop/dtk24.04.1/lightop-0.4%2Bdas1.1git8e60f07.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=2f2c88fd3fe4be179f44c4849e9224cb5b2b259843fc5a2d088e468b7a14c1b1
llvmlite==0.43.0
lmdeploy @ http://10.6.10.68:8000/release/lmdeploy/dtk24.04.1/lmdeploy-0.2.6%2Bdas1.1.git6ba90df.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=92ecee2c8b982f86e5c3219ded24d2ede219f415bf2cd4297f989a03387a203c
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mmcv @ http://10.6.10.68:8000/release/mmcv/dtk24.04.1/mmcv-2.0.1%2Bdas1.1.gite58da25.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7a937ae22f81b44d9100907e11303c31bf9a670cb4c92e361675674a41a8a07f
mmengine==0.10.4
mmengine-lite==0.10.4
mpmath==1.3.0
msgpack==1.0.8
networkx==3.3
ninja==1.11.1.1
nodeenv==1.9.1
numba==0.60.0
numpy==1.24.3
onnxruntime @ http://10.6.10.68:8000/release/onnxruntime/dtk24.04.1/onnxruntime-1.15.0%2Bdas1.1.git739f24d.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=d0d24167188d2c85f1ed4110fc43e62ea40c74280716d9b5fe9540256f17869a
opencv-python==4.10.0.82
orjson==3.10.5
packaging==24.1
pandas==2.2.2
peft==0.9.0
pillow==10.3.0
platformdirs==4.2.2
pooch==1.8.2
pre-commit==3.8.0
prometheus_client==0.20.0
protobuf==5.27.1
psutil==5.9.8
py-cpuinfo==9.0.0
pycparser==2.22
pydantic==2.7.4
pydantic_core==2.18.4
Pygments==2.18.0
pygtrie==2.5.0
pynvml==11.5.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
ray==2.9.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rotary-emb @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/rotary_emb-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=cc15ec6ae73875515243d7f5c96ab214455a33a4a99eb7f1327f773cae1e6721
rpds-py==0.18.1
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.3
scikit-learn==1.5.1
scipy==1.13.1
sentencepiece==0.2.0
shellingham==1.5.4
shortuuid==1.0.13
six==1.16.0
sniffio==1.3.1
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.5.0
speechbrain==1.0.0
starlette==0.37.2
sympy==1.12.1
termcolor==2.4.0
tgt==1.5
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.15.0
tomli==2.0.1
torch @ http://10.6.10.68:8000/release/pytorch/dtk24.04.1/torch-2.1.0%2Bdas1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=5fd3bcef3aa197c0922727913aca53db9ce3f2fd4a9b22bba1973c3d526377f9
torchaudio @ http://10.6.10.68:8000/release/torchaudio/dtk24.04.1/torchaudio-2.1.2%2Bdas1.1.git63d9a68.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=4fcc556a7a2fffe64ddd57f22e5972b1b2b723f6fdfdaa305bd01551036df38b
torchvision @ http://10.6.10.68:8000/release/vision/dtk24.04.1/torchvision-0.16.0%2Bdas1.1.git7d45932.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=e3032e1bcc0857b54391d66744f97e5cff0dc7e7bb508196356ee927fb81ec01
tqdm==4.66.4
transformers==4.38.0
triton @ http://10.6.10.68:8000/release/triton/dtk24.04.1/triton-2.1.0%2Bdas1.1.git4bf1007a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=4c30d45dab071e65d1704a5cd189b14c4ac20bd59a7061032dfd631b1fc37645
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
ujson==5.10.0
Unidecode==1.3.8
urllib3==2.2.1
uvicorn==0.30.1
uvloop==0.19.0
virtualenv==20.26.3
vllm @ http://10.6.10.68:8000/release/vllm/dtk24.04.1/vllm-0.3.3%2Bdas1.1.gitdf6349c.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=48d265b07efa36f028eca45a3667fa10d3cf30eb1b8f019b62e3b255fb9e49c4
watchfiles==0.22.0
websockets==12.0
xentropy-cuda-lib @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/xentropy_cuda_lib-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=91b058d6a5fd2734a5085d68e08d3a1f948fe9c0119c46885d19f55293e2cce4
xformers @ http://10.6.10.68:8000/release/xformers/dtk24.04.1/xformers-0.0.25%2Bdas1.1.git8ef8bc1.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ca87fd065753c1be3b9fad552eba02d30cd3f4c673f01e81a763834eb5cbb9cc
yapf==0.40.2
zipp==3.19.2
==============================
Could not get git revision
==============================
ROCm version:
5.7.24213