"vscode:/vscode.git/clone" did not exist on "7729723b868112d72b45926daacc3f03483b1f63"
Commit 4130a52d authored by changhl's avatar changhl
Browse files

init model

parent eb6a18fd
Pipeline #1617 failed with stages
in 0 seconds
# Tacotron2_pytorch
## Paper
- https://arxiv.org/pdf/1712.05884
## Open-source code
- https://github.com/NVIDIA/tacotron2
## Model structure
Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. The model consists of two main components:
- Spectrogram prediction network: an encoder-attention-decoder network that maps the input character sequence to a sequence of mel-spectrogram frames
- Vocoder: a modified WaveNet that turns the predicted mel-spectrogram frames into a time-domain waveform (a minimal inference sketch follows the figure below)
<div align="center">
<img src="./images/architecture.png"/>
</div>
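In this repository the WaveNet vocoder is replaced by HiFi-GAN, and both stages are loaded through SpeechBrain (see `inference.py`). A minimal sketch of chaining the two stages, with hypothetical local paths for the downloaded weights:
```
import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Hypothetical local paths to the pretrained weights (see "Pretrained models" below)
tacotron2 = Tacotron2.from_hparams(source="./tacotron2_ljspeech")  # text -> mel-spectrogram frames
hifi_gan = HIFIGAN.from_hparams(source="./hifigan_ljspeech")       # mel-spectrogram frames -> waveform

mel_output, mel_length, alignment = tacotron2.encode_text("hi, nice to meet you")
waveforms = hifi_gan.decode_batch(mel_output)
torchaudio.save("example.wav", waveforms.squeeze(1).cpu(), 22050)
```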
## Algorithm
Compared with the original Tacotron, Tacotron2 replaces the plain RNN with an LSTM. The LSTM's forget, input, and output gates mitigate the vanishing-gradient problem, so the model retains information better during back-propagation, which improves the quality of the synthesized speech (a tiny PyTorch illustration follows the figure below).
<div align="center">
<img src="./images/algorithm.png"/>
</div>
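For reference only (not code from this repository), the gated recurrence described above is available directly in PyTorch; the input, forget, and output gates live inside `nn.LSTM`, and the cell state is what carries long-range information across time steps. The shapes below are hypothetical:
```
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=1024, batch_first=True)
x = torch.randn(2, 100, 512)       # (batch, time steps, features)
outputs, (h_n, c_n) = lstm(x)      # the gates update the cell state c_n at every step
print(outputs.shape)               # torch.Size([2, 100, 1024])
```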
## Environment setup
### Docker (option 1)
**Note: adjust the path arguments to your environment**
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
docker run -it --network=host --ipc=host --name=your_container_name --shm-size=32G --device=/dev/kfd --device=/dev/mkfd --device=/dev/dri -v /opt/hyhal:/opt/hyhal:ro -v /path/your_code_data/:/path/your_code_data/ --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10 /bin/bash
cd /path/your_code_data/
pip3 install -r requirements.txt
```
### Dockerfile (option 2)
```
cd ./docker
docker build --no-cache -t tacotron2 .
docker run -it -v /path/your_code_data/:/path/your_code_data/ --shm-size=32G --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name docker_name imageID bash
pip3 install -r requirements.txt
```
### Anaconda (option 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded and installed from the 光合 (Guanghe) developer community: https://developer.hpccube.com/tool/
```
DTK software stack: dtk24.04.1
python: python3.10
torch: 2.1.0
torchvision: 0.16.0
torchaudio: 2.1.2
```
Tips: the DTK software stack, Python, torch, and the other DCU-related packages above must be installed in exactly matching versions (a quick check snippet follows the install command below)
2. Install the remaining, non-DCU-specific packages from requirements.txt
```
pip3 install -r requirements.txt
```
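After installation, a quick way to confirm that the versions match the table above (this assumes the DTK build of PyTorch, which exposes DCU devices through the `torch.cuda`/HIP interface):
```
import torch
import torchvision
import torchaudio

print(torch.__version__)          # expected: 2.1.0
print(torchvision.__version__)    # expected: 0.16.0
print(torchaudio.__version__)     # expected: 2.1.2
print(torch.cuda.is_available())  # True if the DCU devices are visible
```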
## Dataset
- SCNet quick download link:
  - [LJSpeech dataset download](http://113.200.138.88:18080/aidatasets/lj_speech)
- Official download link:
  - [LJSpeech dataset download](https://keithito.com/LJ-Speech-Dataset/)
```LJSpeech-1.1```: a speech-synthesis dataset containing audio and text; the audio is stored as wav files and the transcripts as a csv file.
```
├── LJSpeech-1.1
│ ├──wavs
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ ├──metadata.csv
│ ├──README
```
- LJSpeech
  - wavs: audio directory
    - LJ001-0001.wav: audio file
    - LJ001-0002.wav: audio file
    - ...
  - metadata.csv: transcript file (a reading sketch follows this list)
    - column 1: audio file name
    - column 2: raw transcript
    - column 3: normalized transcript
  - README: documentation
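A short sketch for reading the dataset, assuming the standard LJSpeech layout in which `metadata.csv` is '|'-delimited; adjust `data_root` to your local copy:
```
import csv
import torchaudio

data_root = "./LJSpeech-1.1"  # hypothetical local path
with open(f"{data_root}/metadata.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for file_id, text, normalized_text in reader:
        waveform, sample_rate = torchaudio.load(f"{data_root}/wavs/{file_id}.wav")
        print(file_id, sample_rate, normalized_text)
        break  # only inspect the first entry
```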
## Pretrained models
**Download the pretrained weights before running inference**
- SCNet download links:
  - [tacotron2 model weights](http://113.200.138.88:18080/aimodels/tacotron2_ljspeech)
  - [hifigan model weights](http://113.200.138.88:18080/aimodels/hifigan_ljspeech)
- Official download links:
  - [tacotron2 model weights](https://hf-mirror.com/speechbrain/tts-tacotron2-ljspeech)
  - [hifigan model weights](https://hf-mirror.com/speechbrain/tts-hifigan-ljspeech)
## Training
**Make sure the current working directory is tacotron2_pytorch and set the visible devices**
### Single card
```
export HIP_VISIBLE_DEVICES=0   # set the visible DCU device, e.g. card 0
bash train_s.sh $dataset_path $save_path
```
- $dataset_path: dataset path
- $save_path: directory where training checkpoints are saved
### Multi-card
```
export HIP_VISIBLE_DEVICES=0,1,2,3   # set the visible DCU devices, e.g. cards 0-3
bash train_m.sh $dataset_path $save_path
```
- $dataset_path: dataset path
- $save_path: directory where training checkpoints are saved
## Inference
```
export HIP_VISIBLE_DEVICES=0   # set the visible DCU device
python3 inference.py -m modelpath_tacotron2 -v modelpath_hifigan -t "hi, nice to meet you"
```
- -m: path to the tacotron2 model weights
- -v: path to the hifigan model weights
- -t: input text
- -res: directory where the result file is saved
## Result
```
Input: "hi, nice to meet you"
Output: ./res/example.wav
```
## Application scenarios
### Algorithm category
```
Speech synthesis
```
### Key application industries
```
Finance, telecommunications, broadcast media
```
## Source repository and issue feedback
https://developer.hpccube.com/codes/modelzoo/tacotron2_pytorch
## References
[GitHub - NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2)
[HF - speechbrain/tts-tacotron2-ljspeech](https://hf-mirror.com/speechbrain/tts-tacotron2-ljspeech)
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
# Load the DTK environment during the image build (bash is needed for `source`)
RUN /bin/bash -c "source /opt/dtk/env.sh"
import argparse
import os

import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN


def parse_opt(known=False):
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--model-path', type=str, default="", help="the tacotron2 model path")
    parser.add_argument('-v', '--vocoder-path', type=str, default="", help="the vocoder model path")
    parser.add_argument('-t', '--text', type=str, default="Autumn, the season of change.", help="input text")
    parser.add_argument('-res', '--result_path', type=str, default="./res", help="the path to save the wav file")
    opt = parser.parse_known_args()[0] if known else parser.parse_args()
    return opt


def main(opt):
    # Load the Tacotron2 acoustic model and the HiFi-GAN vocoder
    tacotron2 = Tacotron2.from_hparams(source=opt.model_path, run_opts={"device": "cuda"})
    hifi_gan = HIFIGAN.from_hparams(source=opt.vocoder_path, run_opts={"device": "cuda"})
    # Running the TTS (text -> mel spectrogram)
    mel_output, mel_length, alignment = tacotron2.encode_text(opt.text)
    # Running the vocoder (spectrogram -> waveform)
    waveforms = hifi_gan.decode_batch(mel_output)
    # Save the waveform
    os.makedirs(opt.result_path, exist_ok=True)
    torchaudio.save(os.path.join(opt.result_path, 'example.wav'), waveforms.squeeze(1).cpu(), 22050)


if __name__ == "__main__":
    main(opt=parse_opt())
# model code
modelCode=917
# model name
modelName=tacotron2_pytorch
# model description
modelDescription=Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017.
# application scenarios (multiple tags separated by commas)
appScenario=训练,推理,语音合成,金融,通信,广媒
# framework type (multiple tags separated by commas)
frameType=PyTorch
soundfile==0.12.1
librosa==0.10.2.post1
speechbrain==1.0.0
hyperpyyaml>=0.0.1
joblib>=0.14.1
pre-commit>=2.3.0
pygtrie>=2.1,<3.0
tgt==1.5
unidecode==1.3.8
# Text-to-Speech (with LJSpeech)
This folder contains the recipes for training TTS systems (including vocoders) with the popular LJSpeech dataset.
# Dataset
The dataset can be downloaded from here:
https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
# Installing Extra Dependencies
Before proceeding, ensure you have installed the necessary additional dependencies. To do this, simply run the following command in your terminal:
```
pip install -r extra_requirements.txt
```
# Tacotron 2
The subfolder "tacotron2" contains the recipe for training the popular [tacotron2](https://arxiv.org/abs/1712.05884) TTS model.
To run this recipe, go into the "tacotron2" folder and run:
```
python train.py --device=cuda:0 --max_grad_norm=1.0 --data_folder=/your_folder/LJSpeech-1.1 hparams/train.yaml
```
The training logs are available [here](https://www.dropbox.com/sh/1npvo1g1ncafipf/AAC5DR1ErF2Q9V4bd1DHqX43a?dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-tacotron2-ljspeech).
# FastSpeech2
The subfolder "fastspeech2" contains the recipes for training the non-autoregressive transformer based TTS model [FastSpeech2](https://arxiv.org/abs/2006.04558).
### FastSpeech2 with pre-extracted durations from a forced aligner
Training FastSpeech2 requires pre-extracted phoneme alignments (durations). The LJSpeech phoneme alignments from Montreal Forced Aligner are automatically downloaded, decompressed and stored at this location: ```/your_folder/LJSpeech-1.1/TextGrid```.
To run this recipe, please first install the extra dependencies:
```
pip install -r extra_requirements.txt
```
Then go into the "fastspeech2" folder and run:
```
python train.py --data_folder=/your_folder/LJSpeech-1.1 hparams/train.yaml
```
Training takes about 3 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/scl/fo/vtgbltqdrvw9r0vs7jz67/h?rlkey=cm2mwh5rce5ad9e90qaciypox&dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-fastspeech2-ljspeech).
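For a quick check of the pretrained model, the sketch below pairs it with the HiFi-GAN vocoder described further down. It assumes the `FastSpeech2` inference class behaves like the Tacotron2 one, with the predicted mel spectrogram first in the returned tuple; please refer to the HuggingFace model card for the exact interface.
```python
import torchaudio
from speechbrain.inference.TTS import FastSpeech2
from speechbrain.inference.vocoders import HIFIGAN

fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech")
hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech")

outputs = fastspeech2.encode_text(["Mary had a little lamb."])
mel_output = outputs[0]  # assumed: the mel spectrogram is the first returned item
waveforms = hifi_gan.decode_batch(mel_output)
torchaudio.save("fastspeech2_sample.wav", waveforms.squeeze(1).cpu(), 22050)
```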
### FastSpeech2 with internal alignment
This recipe trains FastSpeech2 without a forced aligner, following [One TTS Alignment To Rule Them All](https://arxiv.org/pdf/2108.10447.pdf). The alignment is learnt by an internal alignment network added to FastSpeech2. This recipe aims to simplify training on custom data and to provide better alignments for punctuation.
To run this recipe, please first install the extra-requirements:
```
pip install -r extra_requirements.txt
```
Then go into the "fastspeech2" folder and run:
```
python train_internal_alignment.py hparams/train_internal_alignment.yaml --data_folder=/your_folder/LJSpeech-1.1
```
The data preparation includes a grapheme-to-phoneme process for the entire corpus which may take several hours. Training takes about 5 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/scl/fo/4ctkc6jjas3uij9dzcwta/h?rlkey=i0k086d77flcsdx40du1ppm2d&dl=0).
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-fastspeech2-internal-alignment-ljspeech).
# HiFiGAN (Vocoder)
The subfolder "vocoder/hifigan/" contains the [HiFiGAN vocoder](https://arxiv.org/pdf/2010.05646.pdf).
The vocoder is a neural network that converts a spectrogram into a waveform (it can be used on top of Tacotron2/FastSpeech2).
We suggest using the `tensorboard_logger` by setting `use_tensorboard: True` in the yaml file; in that case, `tensorboard` must be installed.
To run this recipe, go into the "vocoder/hifigan/" folder and run:
```
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
Training takes about 10 minutes/epoch on an NVIDIA RTX 8000.
The training logs are available [here](https://www.dropbox.com/sh/m2xrdssiroipn8g/AAD-TqPYLrSg6eNxUkcImeg4a?dl=0)
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-hifigan-ljspeech).
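Because the vocoder only needs a mel spectrogram as input, it can also be exercised on its own. Below is a minimal sketch with the pretrained checkpoint; the input is a random placeholder tensor shaped like an 80-channel mel spectrogram, so the output is noise rather than speech:
```python
import torch
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN

hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="tmpdir_vocoder")

mel = torch.rand(1, 80, 200)           # placeholder: (batch, n_mels, frames)
waveform = hifi_gan.decode_batch(mel)  # -> (batch, 1, samples) at 22050 Hz
torchaudio.save("vocoder_check.wav", waveform.squeeze(1), 22050)
```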
# DiffWave (Vocoder)
The subfolder "vocoder/diffwave/" contains the [Diffwave](https://arxiv.org/pdf/2009.09761.pdf) vocoder.
DiffWave is a versatile diffusion model for audio synthesis, which produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation.
Here it serves as a vocoder that generates waveforms given spectrograms as conditions (it can be used on top of Tacotron2/FastSpeech2).
To run this recipe, go into the "vocoder/diffwave/" folder and run:
```
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
The script outputs synthesized audio to `<output_folder>/samples` at regular training-epoch intervals.
We suggest using the `tensorboard_logger` by setting `use_tensorboard: True` in the yaml file; in that case, `tensorboard` must be installed.
Training takes about 6 minutes/epoch on 1 * V100 32G.
The training logs are available [here](https://www.dropbox.com/sh/tbhpn1xirtaix68/AACvYaVDiUGAKURf2o-fvgMoa?dl=0)
For inference, setting `fast_sampling: True` enables fast sampling with a user-defined variance schedule. According to the paper, high-quality audio can be generated with only 6 steps. This is highly recommended.
You can find the pre-trained model with an easy-inference function on [HuggingFace](https://huggingface.co/speechbrain/tts-diffwave-ljspeech).
# HiFiGAN Unit Vocoder
The subfolder "vocoder/hifigan_discrete/" contains the [HiFiGAN Unit vocoder](https://arxiv.org/abs/2406.10735). This vocoder is a neural network designed to transform discrete self-supervised representations into waveform data.
This is suitable for a wide range of generative tasks such as speech enhancement, separation, text-to-speech, voice cloning, etc. Please read [DASB - Discrete Audio and Speech Benchmark](https://arxiv.org/abs/2406.14294) for more information.
To run this recipe successfully, start by installing the necessary extra dependencies:
```bash
pip install -r extra_requirements.txt
```
Before training the vocoder, you need to choose a speech encoder to extract representations that will be used as discrete audio input. We support k-means models using features from HuBERT, WavLM, or Wav2Vec2. Below are the available self-supervised speech encoders for which we provide pre-trained k-means checkpoints:
| Encoder | HF model |
|----------|-----------------------------------------|
| HuBERT | facebook/hubert-large-ll60k |
| Wav2Vec2 | facebook/wav2vec2-large-960h-lv60-self |
| WavLM | microsoft/wavlm-large |
Checkpoints are available in the HF [SSL_Quantization](https://huggingface.co/speechbrain/SSL_Quantization) repository. Alternatively, you can train your own k-means model by following instructions in the "LJSpeech/quantization" README.
Next, configure the SSL model type, k-means model, and corresponding hub in your YAML configuration file (a small override sketch follows the steps below). Follow these steps:
1. Navigate to the "vocoder/hifigan_discrete/hparams" folder and open "train.yaml" file.
2. Modify the `encoder_type` field to specify one of the SSL models: "HuBERT", "WavLM", or "Wav2Vec2".
3. Update the `encoder_hub` field with the specific name of the SSL Hub associated with your chosen model type.
If you have trained your own k-means model, follow these additional steps:
4. Update the `kmeans_folder` field with the specific name of the SSL Hub containing your trained k-means model. Please follow the same file structure as the official one in [SSL_Quantization](https://huggingface.co/speechbrain/SSL_Quantization).
5. Update the `kmeans_dataset` field with the specific name of the dataset on which the k-means model was trained.
6. Update the `num_clusters` field according to the number of clusters of your k-means model.
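A sketch of what these settings amount to programmatically, using HyperPyYAML the same way `train.py` applies command-line overrides; the inline YAML below is only a stand-in for the relevant lines of `hparams/train.yaml`, and the values are hypothetical:
```python
from hyperpyyaml import load_hyperpyyaml

# Stand-in for the relevant fields of vocoder/hifigan_discrete/hparams/train.yaml
yaml_snippet = """
encoder_type: HuBERT
encoder_hub: facebook/hubert-large-ll60k
num_clusters: 1000
"""

# Overrides work like the --key=value arguments accepted by train.py
overrides = {"encoder_type": "WavLM", "encoder_hub": "microsoft/wavlm-large"}
hparams = load_hyperpyyaml(yaml_snippet, overrides)
print(hparams["encoder_hub"])  # microsoft/wavlm-large
```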
Finally, navigate back to the "vocoder/hifigan_discrete/" folder and run the following command:
```bash
python train.py hparams/train.yaml --data_folder=/path/to/LJspeech
```
Training typically takes around 4 minutes per epoch when using an NVIDIA A100 40G.
# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
# **Citing SpeechBrain**
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{ravanelli2024opensourceconversationalaispeechbrain,
title={Open-Source Conversational AI with SpeechBrain 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}
```
# Needed only for quantization
scikit-learn
# Needed only with use_tensorboard=True
# torchvision is needed to save spectrograms
tensorboard
tgt
torchvision
unidecode
############################################################################
# Model: FastSpeech2
# Tokens: Raw characters (English text)
# Training: LJSpeech
# Authors: Sathvik Udupa, Yingzhi Wang, Pradnya Kandarkar
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/fastspeech2/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 500
train_spn_predictor_epochs: 8
progress_samples: True
progress_sample_path: !ref <output_folder>/samples
progress_samples_min_run: 10
progress_samples_interval: 10
progress_batch_sample_size: 4
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /data/Database/LJSpeech-1.1
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: null
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
mel_normalized: False
min_max_energy_norm: True
min_f0: 65 #(torchaudio pyin values)
max_f0: 2093 #(torchaudio pyin values)
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.0001
weight_decay: 0.000001
max_grad_norm: 1.0
batch_size: 32 #minimum 2
num_workers_train: 16
num_workers_valid: 4
betas: [0.9, 0.98]
################################
# Model Parameters and model #
################################
# Input parameters
lexicon:
- AA
- AE
- AH
- AO
- AW
- AY
- B
- CH
- D
- DH
- EH
- ER
- EY
- F
- G
- HH
- IH
- IY
- JH
- K
- L
- M
- N
- NG
- OW
- OY
- P
- R
- S
- SH
- T
- TH
- UH
- UW
- V
- W
- Y
- Z
- ZH
- spn
n_symbols: 42 #fixed depending on symbols in the lexicon +1 for a dummy symbol used for padding
padding_idx: 0
# Encoder parameters
enc_num_layers: 4
enc_num_head: 2
enc_d_model: 384
enc_ffn_dim: 1024
enc_k_dim: 384
enc_v_dim: 384
enc_dropout: 0.2
# Decoder parameters
dec_num_layers: 4
dec_num_head: 2
dec_d_model: 384
dec_ffn_dim: 1024
dec_k_dim: 384
dec_v_dim: 384
dec_dropout: 0.2
# Postnet parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
postnet_dropout: 0.5
# common
normalize_before: True
ffn_type: 1dcnn #1dcnn or ffn
ffn_cnn_kernel_size_list: [9, 1]
# variance predictor
dur_pred_kernel_size: 3
pitch_pred_kernel_size: 3
energy_pred_kernel_size: 3
variance_predictor_dropout: 0.5
# silent phoneme token predictor
spn_predictor: !new:speechbrain.lobes.models.FastSpeech2.SPNPredictor
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
padding_idx: !ref <padding_idx>
#model
model: !new:speechbrain.lobes.models.FastSpeech2.FastSpeech2
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
dec_num_layers: !ref <dec_num_layers>
dec_num_head: !ref <dec_num_head>
dec_d_model: !ref <dec_d_model>
dec_ffn_dim: !ref <dec_ffn_dim>
dec_k_dim: !ref <dec_k_dim>
dec_v_dim: !ref <dec_v_dim>
dec_dropout: !ref <dec_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
n_mels: !ref <n_mel_channels>
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
postnet_dropout: !ref <postnet_dropout>
padding_idx: !ref <padding_idx>
dur_pred_kernel_size: !ref <dur_pred_kernel_size>
pitch_pred_kernel_size: !ref <pitch_pred_kernel_size>
energy_pred_kernel_size: !ref <energy_pred_kernel_size>
variance_predictor_dropout: !ref <variance_predictor_dropout>
mel_spectogram: !name:speechbrain.lobes.models.FastSpeech2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
min_max_energy_norm: !ref <min_max_energy_norm>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
criterion: !new:speechbrain.lobes.models.FastSpeech2.Loss
log_scale_durations: True
duration_loss_weight: 1.0
pitch_loss_weight: 1.0
energy_loss_weight: 1.0
ssim_loss_weight: 1.0
mel_loss_weight: 1.0
postnet_mel_loss_weight: 1.0
spn_loss_weight: 1.0
spn_loss_max_epochs: !ref <train_spn_predictor_epochs>
vocoder: "hifi-gan"
pretrained_vocoder: True
vocoder_source: speechbrain/tts-hifigan-ljspeech
vocoder_download_path: tmpdir_vocoder
modules:
spn_predictor: !ref <spn_predictor>
model: !ref <model>
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers_train>
shuffle: True
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollate
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers_valid>
shuffle: False
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollate
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
betas: !ref <betas>
noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
lr_initial: !ref <learning_rate>
n_warmup_steps: 4000
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
spn_predictor: !ref <spn_predictor>
model: !ref <model>
lr_annealing: !ref <noam_annealing>
counter: !ref <epoch_counter>
input_encoder: !new:speechbrain.dataio.encoder.TextEncoder
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
############################################################################
# Model: FastSpeech2 with internal alignment
# Tokens: Phonemes (ARPABET)
# Dataset: LJSpeech
# Authors: Yingzhi Wang 2023
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/fastspeech2_internal_alignment/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 500
progress_samples: True
progress_sample_path: !ref <output_folder>/samples
progress_samples_min_run: 10
progress_samples_interval: 10
progress_batch_sample_size: 4
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /data/Database/LJSpeech-1.1
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: null
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
mel_normalized: False
min_max_energy_norm: True
min_f0: 65 #(torchaudio pyin values)
max_f0: 2093 #(torchaudio pyin values)
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.0001
weight_decay: 0.000001
max_grad_norm: 1.0
batch_size: 16 #minimum 2
betas: [0.9, 0.998]
num_workers_train: 16
num_workers_valid: 4
################################
# Model Parameters and model #
################################
# Input parameters
lexicon:
- "AA"
- "AE"
- "AH"
- "AO"
- "AW"
- "AY"
- "B"
- "CH"
- "D"
- "DH"
- "EH"
- "ER"
- "EY"
- "F"
- "G"
- "HH"
- "IH"
- "IY"
- "JH"
- "K"
- "L"
- "M"
- "N"
- "NG"
- "OW"
- "OY"
- "P"
- "R"
- "S"
- "SH"
- "T"
- "TH"
- "UH"
- "UW"
- "V"
- "W"
- "Y"
- "Z"
- "ZH"
- "-"
- "!"
- "'"
- "("
- ")"
- ","
- "."
- ":"
- ";"
- "?"
- " "
n_symbols: 52 #fixed depending on symbols in the lexicon (+1 for a dummy symbol used for padding, +1 for unknown)
padding_idx: 0
hidden_channels: 512
# Encoder parameters
enc_num_layers: 4
enc_num_head: 2
enc_d_model: !ref <hidden_channels>
enc_ffn_dim: 1024
enc_k_dim: !ref <hidden_channels>
enc_v_dim: !ref <hidden_channels>
enc_dropout: 0.2
# Aligner parameters
in_query_channels: 80
in_key_channels: !ref <hidden_channels> # 512 in the paper
attn_channels: 80
temperature: 0.0005
# Decoder parameters
dec_num_layers: 4
dec_num_head: 2
dec_d_model: !ref <hidden_channels>
dec_ffn_dim: 1024
dec_k_dim: !ref <hidden_channels>
dec_v_dim: !ref <hidden_channels>
dec_dropout: 0.2
# Postnet parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
postnet_dropout: 0.2
# common
normalize_before: True
ffn_type: 1dcnn #1dcnn or ffn
ffn_cnn_kernel_size_list: [9, 1]
# variance predictor
dur_pred_kernel_size: 3
pitch_pred_kernel_size: 3
energy_pred_kernel_size: 3
variance_predictor_dropout: 0.5
#model
model: !new:speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment
enc_num_layers: !ref <enc_num_layers>
enc_num_head: !ref <enc_num_head>
enc_d_model: !ref <enc_d_model>
enc_ffn_dim: !ref <enc_ffn_dim>
enc_k_dim: !ref <enc_k_dim>
enc_v_dim: !ref <enc_v_dim>
enc_dropout: !ref <enc_dropout>
in_query_channels: !ref <in_query_channels>
in_key_channels: !ref <in_key_channels>
attn_channels: !ref <attn_channels>
temperature: !ref <temperature>
dec_num_layers: !ref <dec_num_layers>
dec_num_head: !ref <dec_num_head>
dec_d_model: !ref <dec_d_model>
dec_ffn_dim: !ref <dec_ffn_dim>
dec_k_dim: !ref <dec_k_dim>
dec_v_dim: !ref <dec_v_dim>
dec_dropout: !ref <dec_dropout>
normalize_before: !ref <normalize_before>
ffn_type: !ref <ffn_type>
ffn_cnn_kernel_size_list: !ref <ffn_cnn_kernel_size_list>
n_char: !ref <n_symbols>
n_mels: !ref <n_mel_channels>
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
postnet_dropout: !ref <postnet_dropout>
padding_idx: !ref <padding_idx>
dur_pred_kernel_size: !ref <dur_pred_kernel_size>
pitch_pred_kernel_size: !ref <pitch_pred_kernel_size>
energy_pred_kernel_size: !ref <energy_pred_kernel_size>
variance_predictor_dropout: !ref <variance_predictor_dropout>
mel_spectogram: !name:speechbrain.lobes.models.FastSpeech2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
min_max_energy_norm: !ref <min_max_energy_norm>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
criterion: !new:speechbrain.lobes.models.FastSpeech2.LossWithAlignment
log_scale_durations: True
duration_loss_weight: 1.0
pitch_loss_weight: 1.0
energy_loss_weight: 1.0
ssim_loss_weight: 1.0
mel_loss_weight: 1.0
postnet_mel_loss_weight: 1.0
aligner_loss_weight: 1.0
binary_alignment_loss_weight: 0.2
binary_alignment_loss_warmup_epochs: 1
binary_alignment_loss_max_epochs: 80
vocoder: "hifi-gan"
pretrained_vocoder: True
vocoder_source: speechbrain/tts-hifigan-ljspeech
vocoder_download_path: tmpdir_vocoder
modules:
model: !ref <model>
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False #True #False
num_workers: !ref <num_workers_train>
shuffle: True
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers_valid>
shuffle: False
collate_fn: !new:speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
betas: !ref <betas>
noam_annealing: !new:speechbrain.nnet.schedulers.NoamScheduler
lr_initial: !ref <learning_rate>
n_warmup_steps: 4000
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
lr_annealing: !ref <noam_annealing>
counter: !ref <epoch_counter>
input_encoder: !new:speechbrain.dataio.encoder.TextEncoder
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
../../ljspeech_prepare.py
"""
Recipe for training the FastSpeech2 Text-To-Speech model, an end-to-end
neural text-to-speech (TTS) system introduced in 'FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
synthesis' paper
(https://arxiv.org/abs/2006.04558)
To run this recipe, do the following:
# python train.py hparams/train.yaml
Authors
* Sathvik Udupa 2022
* Yingzhi Wang 2022
* Pradnya Kandarkar 2023
"""
import logging
import os
import sys
from pathlib import Path
import numpy as np
import torch
import torchaudio
from hyperpyyaml import load_hyperpyyaml
import speechbrain as sb
from speechbrain.inference.text import GraphemeToPhoneme
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.utils.data_utils import scalarize
os.environ["TOKENIZERS_PARALLELISM"] = "false"
logger = logging.getLogger(__name__)
class FastSpeech2Brain(sb.Brain):
def on_fit_start(self):
"""Gets called at the beginning of ``fit()``, on multiple processes
if ``distributed_count > 0`` and backend is ddp and initializes statistics
"""
self.hparams.progress_sample_logger.reset()
self.last_epoch = 0
self.last_batch = None
self.last_loss_stats = {}
self.g2p = GraphemeToPhoneme.from_hparams("speechbrain/soundchoice-g2p")
self.spn_token_encoded = (
self.input_encoder.encode_sequence_torch(["spn"]).int().item()
)
return super().on_fit_start()
def compute_forward(self, batch, stage):
"""Computes the forward pass
Arguments
---------
batch: str
a single batch
stage: speechbrain.Stage
the training stage
Returns
-------
the model output
"""
inputs, _ = self.batch_to_device(batch)
tokens, durations, pitch, energy, no_spn_seqs, last_phonemes = inputs
# Forward pass for the silent token predictor module
if (
self.hparams.epoch_counter.current
> self.hparams.train_spn_predictor_epochs
):
self.hparams.modules["spn_predictor"].eval()
with torch.no_grad():
spn_preds = self.hparams.modules["spn_predictor"](
no_spn_seqs, last_phonemes
)
else:
spn_preds = self.hparams.modules["spn_predictor"](
no_spn_seqs, last_phonemes
)
# Forward pass for the FastSpeech2 module
(
predict_mel_post,
predict_postnet_output,
predict_durations,
predict_pitch,
predict_avg_pitch,
predict_energy,
predict_avg_energy,
predict_mel_lens,
) = self.hparams.model(tokens, durations, pitch, energy)
return (
predict_mel_post,
predict_postnet_output,
predict_durations,
predict_pitch,
predict_avg_pitch,
predict_energy,
predict_avg_energy,
predict_mel_lens,
spn_preds,
)
def on_fit_batch_end(self, batch, outputs, loss, should_step):
"""At the end of the optimizer step, apply noam annealing."""
if should_step:
self.hparams.noam_annealing(self.optimizer)
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : torch.Tensor
The model generated spectrograms and other metrics from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
x, y, metadata = self.batch_to_device(batch, return_metadata=True)
self.last_batch = [x[0], y[-2], y[-3], predictions[0], *metadata]
self._remember_sample([x[0], *y, *metadata], predictions)
loss = self.hparams.criterion(
predictions, y, self.hparams.epoch_counter.current
)
self.last_loss_stats[stage] = scalarize(loss)
return loss["total_loss"]
def _remember_sample(self, batch, predictions):
"""Remembers samples of spectrograms and the batch for logging purposes
Arguments
---------
batch: tuple
a training batch
predictions: tuple
predictions (raw output of the FastSpeech2
model)
"""
(
tokens,
spectogram,
durations,
pitch,
energy,
mel_lengths,
input_lengths,
spn_labels,
labels,
wavs,
) = batch
(
mel_post,
postnet_mel_out,
predict_durations,
predict_pitch,
predict_avg_pitch,
predict_energy,
predict_avg_energy,
predict_mel_lens,
spn_preds,
) = predictions
self.hparams.progress_sample_logger.remember(
target=self.process_mel(spectogram, mel_lengths),
output=self.process_mel(postnet_mel_out, mel_lengths),
raw_batch=self.hparams.progress_sample_logger.get_batch_sample(
{
"tokens": tokens,
"input_lengths": input_lengths,
"mel_target": spectogram,
"mel_out": postnet_mel_out,
"mel_lengths": predict_mel_lens,
"durations": durations,
"predict_durations": predict_durations,
"labels": labels,
"wavs": wavs,
}
),
)
def process_mel(self, mel, len, index=0):
"""Converts a mel spectrogram to one that can be saved as an image
sample = sqrt(exp(mel))
Arguments
---------
mel: torch.Tensor
the mel spectrogram (as used in the model)
len: int
length of the mel spectrogram
index: int
batch index
Returns
-------
mel: torch.Tensor
the spectrogram, for image saving purposes
"""
assert mel.dim() == 3
return torch.sqrt(torch.exp(mel[index][: len[index]]))
def on_stage_end(self, stage, stage_loss, epoch):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# At the end of validation, we can write
if stage == sb.Stage.VALID:
# Update learning rate
self.last_epoch = epoch
lr = self.hparams.noam_annealing.current_lr
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats( # 1#2#
stats_meta={"Epoch": epoch, "lr": lr},
train_stats=self.last_loss_stats[sb.Stage.TRAIN],
valid_stats=self.last_loss_stats[sb.Stage.VALID],
)
output_progress_sample = (
self.hparams.progress_samples
and epoch % self.hparams.progress_samples_interval == 0
and epoch >= self.hparams.progress_samples_min_run
)
if output_progress_sample:
logger.info("Saving predicted samples")
(
inference_mel,
mel_lens,
inf_mel_spn_pred,
mel_lens_spn_pred,
) = self.run_inference()
self.hparams.progress_sample_logger.save(epoch)
self.run_vocoder(
inference_mel, mel_lens, sample_type="with_spn"
)
self.run_vocoder(
inf_mel_spn_pred, mel_lens_spn_pred, sample_type="no_spn"
)
# Save the current checkpoint and delete previous checkpoints.
# UNCOMMENT THIS
self.checkpointer.save_and_keep_only(
meta=self.last_loss_stats[stage],
min_keys=["total_loss"],
)
# We also write statistics about test data spectogram to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=self.last_loss_stats[sb.Stage.TEST],
)
def run_inference(self):
"""Produces a sample in inference mode with predicted durations."""
if self.last_batch is None:
return
tokens, *_, labels, _ = self.last_batch
# Generates inference samples without using the silent phoneme predictor
(
_,
postnet_mel_out,
_,
_,
_,
_,
_,
predict_mel_lens,
) = self.hparams.model(tokens)
self.hparams.progress_sample_logger.remember(
infer_output=self.process_mel(
postnet_mel_out, [len(postnet_mel_out[0])]
)
)
# Generates inference samples using the silent phoneme predictor
# Preprocessing required at the inference time for the input text
# "label" below contains input text
# "phoneme_labels" contain the phoneme sequences corresponding to input text labels
# "last_phonemes_combined" is used to indicate whether the index position is for a last phoneme of a word
phoneme_labels = list()
last_phonemes_combined = list()
for label in labels:
phoneme_label = list()
last_phonemes = list()
words = label.split()
words = [word.strip() for word in words]
words_phonemes = self.g2p(words)
for words_phonemes_seq in words_phonemes:
for phoneme in words_phonemes_seq:
if not phoneme.isspace():
phoneme_label.append(phoneme)
last_phonemes.append(0)
last_phonemes[-1] = 1
phoneme_labels.append(phoneme_label)
last_phonemes_combined.append(last_phonemes)
# Inserts silent phonemes in the input phoneme sequence
all_tokens_with_spn = list()
max_seq_len = -1
for i in range(len(phoneme_labels)):
phoneme_label = phoneme_labels[i]
token_seq = (
self.input_encoder.encode_sequence_torch(phoneme_label)
.int()
.to(self.device)
)
last_phonemes = torch.LongTensor(last_phonemes_combined[i]).to(
self.device
)
# Runs the silent phoneme predictor
spn_preds = (
self.hparams.modules["spn_predictor"]
.infer(token_seq.unsqueeze(0), last_phonemes.unsqueeze(0))
.int()
)
spn_to_add = torch.nonzero(spn_preds).reshape(-1).tolist()
tokens_with_spn = list()
for token_idx in range(token_seq.shape[0]):
tokens_with_spn.append(token_seq[token_idx].item())
if token_idx in spn_to_add:
tokens_with_spn.append(self.spn_token_encoded)
tokens_with_spn = torch.LongTensor(tokens_with_spn).to(self.device)
all_tokens_with_spn.append(tokens_with_spn)
if max_seq_len < tokens_with_spn.shape[-1]:
max_seq_len = tokens_with_spn.shape[-1]
# "tokens_with_spn_tensor" holds the input phoneme sequence with silent phonemes
tokens_with_spn_tensor = torch.LongTensor(
tokens.shape[0], max_seq_len
).to(self.device)
tokens_with_spn_tensor.zero_()
for seq_idx, seq in enumerate(all_tokens_with_spn):
tokens_with_spn_tensor[seq_idx, : len(seq)] = seq
(
_,
postnet_mel_out_spn_pred,
_,
_,
_,
_,
_,
predict_mel_lens_spn_pred,
) = self.hparams.model(tokens_with_spn_tensor)
return (
postnet_mel_out,
predict_mel_lens,
postnet_mel_out_spn_pred,
predict_mel_lens_spn_pred,
)
def run_vocoder(self, inference_mel, mel_lens, sample_type=""):
"""Uses a pretrained vocoder to generate audio from predicted mel
spectogram. By default, uses speechbrain hifigan.
Arguments
---------
inference_mel: torch.Tensor
predicted mel from fastspeech2 inference
mel_lens: torch.Tensor
predicted mel lengths from fastspeech2 inference
used to mask the noise from padding
sample_type: str
used for logging the type of the inference sample being generated
Returns
-------
None
"""
if self.last_batch is None:
return
*_, wavs = self.last_batch
inference_mel = inference_mel[: self.hparams.progress_batch_sample_size]
mel_lens = mel_lens[0 : self.hparams.progress_batch_sample_size]
assert (
self.hparams.vocoder == "hifi-gan"
and self.hparams.pretrained_vocoder is True
), "Specified vocoder not supported yet"
logger.info(
f"Generating audio with pretrained {self.hparams.vocoder_source} vocoder"
)
hifi_gan = HIFIGAN.from_hparams(
source=self.hparams.vocoder_source,
savedir=self.hparams.vocoder_download_path,
)
waveforms = hifi_gan.decode_batch(
inference_mel.transpose(2, 1), mel_lens, self.hparams.hop_length
)
for idx, wav in enumerate(waveforms):
path = os.path.join(
self.hparams.progress_sample_path,
str(self.last_epoch),
f"pred_{sample_type}_{Path(wavs[idx]).stem}.wav",
)
torchaudio.save(path, wav, self.hparams.sample_rate)
def batch_to_device(self, batch, return_metadata=False):
"""Transfers the batch to the target device
Arguments
---------
batch: tuple
the batch to use
return_metadata: bool
indicates whether the metadata should be returned
Returns
-------
batch: tuple
the batch on the correct device
"""
(
text_padded,
durations,
input_lengths,
mel_padded,
pitch_padded,
energy_padded,
output_lengths,
len_x,
labels,
wavs,
no_spn_seq_padded,
spn_labels_padded,
last_phonemes_padded,
) = batch
durations = durations.to(self.device, non_blocking=True).long()
phonemes = text_padded.to(self.device, non_blocking=True).long()
input_lengths = input_lengths.to(self.device, non_blocking=True).long()
spectogram = mel_padded.to(self.device, non_blocking=True).float()
pitch = pitch_padded.to(self.device, non_blocking=True).float()
energy = energy_padded.to(self.device, non_blocking=True).float()
mel_lengths = output_lengths.to(self.device, non_blocking=True).long()
no_spn_seqs = no_spn_seq_padded.to(
self.device, non_blocking=True
).long()
spn_labels = spn_labels_padded.to(self.device, non_blocking=True).long()
last_phonemes = last_phonemes_padded.to(
self.device, non_blocking=True
).long()
x = (phonemes, durations, pitch, energy, no_spn_seqs, last_phonemes)
y = (
spectogram,
durations,
pitch,
energy,
mel_lengths,
input_lengths,
spn_labels,
)
metadata = (labels, wavs)
if return_metadata:
return x, y, metadata
return x, y
def dataio_prepare(hparams):
# Load lexicon
lexicon = hparams["lexicon"]
input_encoder = hparams.get("input_encoder")
# add a dummy symbol for idx 0 - used for padding.
lexicon = ["@@"] + lexicon
input_encoder.update_from_iterable(lexicon, sequence_input=False)
input_encoder.add_unk()
# load audio, text and durations on the fly; encode audio and text.
@sb.utils.data_pipeline.takes(
"wav",
"label_phoneme",
"durations",
"pitch",
"start",
"end",
"spn_labels",
"last_phoneme_flags",
)
@sb.utils.data_pipeline.provides("mel_text_pair")
def audio_pipeline(
wav,
label_phoneme,
dur,
pitch,
start,
end,
spn_labels,
last_phoneme_flags,
):
durs = np.load(dur)
durs_seq = torch.from_numpy(durs).int()
label_phoneme = label_phoneme.strip()
label_phoneme = label_phoneme.split()
text_seq = input_encoder.encode_sequence_torch(label_phoneme).int()
assert len(text_seq) == len(
durs
), f"{len(text_seq)}, {len(durs), len(label_phoneme)}, ({label_phoneme})" # ensure every token has a duration
no_spn_label, last_phonemes = list(), list()
for i in range(len(label_phoneme)):
if label_phoneme[i] != "spn":
no_spn_label.append(label_phoneme[i])
last_phonemes.append(last_phoneme_flags[i])
no_spn_seq = input_encoder.encode_sequence_torch(no_spn_label).int()
spn_labels = [
spn_labels[i]
for i in range(len(label_phoneme))
if label_phoneme[i] != "spn"
]
audio, fs = torchaudio.load(wav)
audio = audio.squeeze()
audio = audio[int(fs * start) : int(fs * end)]
mel, energy = hparams["mel_spectogram"](audio=audio)
mel = mel[:, : sum(durs)]
energy = energy[: sum(durs)]
pitch = np.load(pitch)
pitch = torch.from_numpy(pitch)
pitch = pitch[: mel.shape[-1]]
return (
text_seq,
durs_seq,
mel,
pitch,
energy,
len(text_seq),
last_phonemes,
no_spn_seq,
spn_labels,
)
# define splits and load it as sb dataset
datasets = {}
for dataset in hparams["splits"]:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=hparams[f"{dataset}_json"],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["mel_text_pair", "wav", "label", "durations", "pitch"],
)
return datasets, input_encoder
def main():
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file) as fin:
hparams = load_hyperpyyaml(fin, overrides)
sb.utils.distributed.ddp_init_group(run_opts)
sb.create_experiment_directory(
experiment_directory=hparams["output_folder"],
hyperparams_to_save=hparams_file,
overrides=overrides,
)
from ljspeech_prepare import prepare_ljspeech
sb.utils.distributed.run_on_main(
prepare_ljspeech,
kwargs={
"data_folder": hparams["data_folder"],
"save_folder": hparams["save_folder"],
"splits": hparams["splits"],
"split_ratio": hparams["split_ratio"],
"model_name": hparams["model"].__class__.__name__,
"seed": hparams["seed"],
"pitch_n_fft": hparams["n_fft"],
"pitch_hop_length": hparams["hop_length"],
"pitch_min_f0": hparams["min_f0"],
"pitch_max_f0": hparams["max_f0"],
"skip_prep": hparams["skip_prep"],
"use_custom_cleaner": True,
},
)
datasets, input_encoder = dataio_prepare(hparams)
# Brain class initialization
fastspeech2_brain = FastSpeech2Brain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
run_opts=run_opts,
checkpointer=hparams["checkpointer"],
)
fastspeech2_brain.input_encoder = input_encoder
# Training
fastspeech2_brain.fit(
fastspeech2_brain.hparams.epoch_counter,
datasets["train"],
datasets["valid"],
train_loader_kwargs=hparams["train_dataloader_opts"],
valid_loader_kwargs=hparams["valid_dataloader_opts"],
)
if __name__ == "__main__":
main()
"""
Recipe for training the FastSpeech2 Text-To-Speech model
Instead of using pre-extracted phoneme durations from MFA,
This recipe trains an internal alignment from scratch, as introduced in:
https://arxiv.org/pdf/2108.10447.pdf (One TTS Alignment To Rule Them All)
To run this recipe, do the following:
# python train_internal_alignment.py hparams/train_internal_alignment.yaml
Authors
* Yingzhi Wang 2023
"""
import logging
import os
import sys
from pathlib import Path
import numpy as np
import torch
import torchaudio
from hyperpyyaml import load_hyperpyyaml
import speechbrain as sb
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.utils.data_utils import scalarize
os.environ["TOKENIZERS_PARALLELISM"] = "false"
logger = logging.getLogger(__name__)
class FastSpeech2Brain(sb.Brain):
def on_fit_start(self):
"""Gets called at the beginning of ``fit()``, on multiple processes
if ``distributed_count > 0`` and backend is ddp and initializes statistics
"""
self.hparams.progress_sample_logger.reset()
self.last_epoch = 0
self.last_batch = None
self.last_loss_stats = {}
return super().on_fit_start()
def compute_forward(self, batch, stage):
"""Computes the forward pass
Arguments
---------
batch: str
a single batch
stage: speechbrain.Stage
the training stage
Returns
-------
the model output
"""
inputs, _ = self.batch_to_device(batch)
return self.hparams.model(*inputs)
def on_fit_batch_end(self, batch, outputs, loss, should_step):
"""At the end of the optimizer step, apply noam annealing and logging."""
if should_step:
self.hparams.noam_annealing(self.optimizer)
def compute_objectives(self, predictions, batch, stage):
"""Computes the loss given the predicted and targeted outputs.
Arguments
---------
predictions : torch.Tensor
The model generated spectrograms and other metrics from `compute_forward`.
batch : PaddedBatch
This batch object contains all the relevant tensors for computation.
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.
Returns
-------
loss : torch.Tensor
A one-element tensor used for backpropagating the gradient.
"""
x, y, metadata = self.batch_to_device(batch, return_metadata=True)
self.last_batch = [x[0], y[-1], y[-2], predictions[0], *metadata]
self._remember_sample([x[0], *y, *metadata], predictions)
loss = self.hparams.criterion(
predictions, y, self.hparams.epoch_counter.current
)
self.last_loss_stats[stage] = scalarize(loss)
return loss["total_loss"]
def _remember_sample(self, batch, predictions):
"""Remembers samples of spectrograms and the batch for logging purposes
Arguments
---------
batch: tuple
a training batch
predictions: tuple
predictions (raw output of the FastSpeech2
model)
"""
(
phoneme_padded,
mel_padded,
pitch,
energy,
output_lengths,
input_lengths,
labels,
wavs,
) = batch
(
mel_post,
postnet_mel_out,
predict_durations,
predict_pitch,
average_pitch,
predict_energy,
average_energy,
predict_mel_lens,
alignment_durations,
alignment_soft,
alignment_logprob,
alignment_mas,
) = predictions
self.hparams.progress_sample_logger.remember(
target=self.process_mel(mel_padded, output_lengths),
output=self.process_mel(postnet_mel_out, output_lengths),
raw_batch=self.hparams.progress_sample_logger.get_batch_sample(
{
"tokens": phoneme_padded,
"input_lengths": input_lengths,
"mel_target": mel_padded,
"mel_out": postnet_mel_out,
"mel_lengths": predict_mel_lens,
"durations": alignment_durations,
"predict_durations": predict_durations,
"labels": labels,
"wavs": wavs,
}
),
)
def process_mel(self, mel, len, index=0):
"""Converts a mel spectrogram to one that can be saved as an image
sample = sqrt(exp(mel))
Arguments
---------
mel: torch.Tensor
the mel spectrogram (as used in the model)
len: int
length of the mel spectrogram
index: int
batch index
Returns
-------
mel: torch.Tensor
the spectrogram, for image saving purposes
"""
assert mel.dim() == 3
return torch.sqrt(torch.exp(mel[index][: len[index]]))
def on_stage_end(self, stage, stage_loss, epoch):
"""Gets called at the end of an epoch.
Arguments
---------
stage : sb.Stage
One of sb.Stage.TRAIN, sb.Stage.VALID, sb.Stage.TEST
stage_loss : float
The average loss for all of the data processed in this stage.
epoch : int
The currently-starting epoch. This is passed
`None` during the test stage.
"""
# At the end of validation, we can write
if stage == sb.Stage.VALID:
# Update learning rate
self.last_epoch = epoch
lr = self.hparams.noam_annealing.current_lr
# The train_logger writes a summary to stdout and to the logfile.
self.hparams.train_logger.log_stats( # 1#2#
stats_meta={"Epoch": epoch, "lr": lr},
train_stats=self.last_loss_stats[sb.Stage.TRAIN],
valid_stats=self.last_loss_stats[sb.Stage.VALID],
)
output_progress_sample = (
self.hparams.progress_samples
and epoch % self.hparams.progress_samples_interval == 0
and epoch >= self.hparams.progress_samples_min_run
)
if output_progress_sample:
logger.info("Saving predicted samples")
inference_mel, mel_lens = self.run_inference()
self.hparams.progress_sample_logger.save(epoch)
self.run_vocoder(inference_mel, mel_lens)
# Save the current checkpoint and delete previous checkpoints.
# UNCOMMENT THIS
self.checkpointer.save_and_keep_only(
meta=self.last_loss_stats[stage],
min_keys=["total_loss"],
)
# We also write statistics about test data spectogram to stdout and to the logfile.
if stage == sb.Stage.TEST:
self.hparams.train_logger.log_stats(
{"Epoch loaded": self.hparams.epoch_counter.current},
test_stats=self.last_loss_stats[sb.Stage.TEST],
)
def run_inference(self):
"""Produces a sample in inference mode with predicted durations."""
if self.last_batch is None:
return
tokens, *_ = self.last_batch
(
_,
postnet_mel_out,
_,
_,
_,
_,
_,
predict_mel_lens,
_,
_,
_,
_,
) = self.hparams.model(tokens)
self.hparams.progress_sample_logger.remember(
infer_output=self.process_mel(
postnet_mel_out, [len(postnet_mel_out[0])]
)
)
return postnet_mel_out, predict_mel_lens
def run_vocoder(self, inference_mel, mel_lens):
"""Uses a pretrained vocoder to generate audio from predicted mel
spectogram. By default, uses speechbrain hifigan.
Arguments
---------
inference_mel: torch.Tensor
predicted mel from fastspeech2 inference
mel_lens: torch.Tensor
predicted mel lengths from fastspeech2 inference
used to mask the noise from padding
Returns
-------
None
"""
if self.last_batch is None:
return
*_, wavs = self.last_batch
inference_mel = inference_mel[: self.hparams.progress_batch_sample_size]
mel_lens = mel_lens[0 : self.hparams.progress_batch_sample_size]
assert (
self.hparams.vocoder == "hifi-gan"
and self.hparams.pretrained_vocoder is True
), "Specified vocoder not supported yet"
logger.info(
f"Generating audio with pretrained {self.hparams.vocoder_source} vocoder"
)
hifi_gan = HIFIGAN.from_hparams(
source=self.hparams.vocoder_source,
savedir=self.hparams.vocoder_download_path,
)
waveforms = hifi_gan.decode_batch(
inference_mel.transpose(2, 1), mel_lens, self.hparams.hop_length
)
for idx, wav in enumerate(waveforms):
path = os.path.join(
self.hparams.progress_sample_path,
str(self.last_epoch),
f"pred_{Path(wavs[idx]).stem}.wav",
)
torchaudio.save(path, wav, self.hparams.sample_rate)
def batch_to_device(self, batch, return_metadata=False):
"""Transfers the batch to the target device
Arguments
---------
batch: tuple
the batch to use
return_metadata: bool
Whether to additionally return labels and wavs.
Returns
-------
x: tuple
phonemes, spectrogram, pitch, energy
y: tuple
spectrogram, pitch, energy, mel_lengths, input_lengths
metadata: tuple
labels, wavs
"""
(
phoneme_padded,
input_lengths,
mel_padded,
pitch_padded,
energy_padded,
output_lengths,
# len_x,
labels,
wavs,
) = batch
# durations = durations.to(self.device, non_blocking=True).long()
phonemes = phoneme_padded.to(self.device, non_blocking=True).long()
input_lengths = input_lengths.to(self.device, non_blocking=True).long()
spectogram = mel_padded.to(self.device, non_blocking=True).float()
pitch = pitch_padded.to(self.device, non_blocking=True).float()
energy = energy_padded.to(self.device, non_blocking=True).float()
mel_lengths = output_lengths.to(self.device, non_blocking=True).long()
x = (phonemes, spectogram, pitch, energy)
y = (spectogram, pitch, energy, mel_lengths, input_lengths)
metadata = (labels, wavs)
if return_metadata:
return x, y, metadata
return x, y
def dataio_prepare(hparams):
"Creates the datasets and their data processing pipelines."
# Load lexicon
lexicon = hparams["lexicon"]
input_encoder = hparams.get("input_encoder")
# add a dummy symbol for idx 0 - used for padding.
lexicon = ["@@"] + lexicon
input_encoder.update_from_iterable(lexicon, sequence_input=False)
input_encoder.add_unk()
# load audio, text and durations on the fly; encode audio and text.
@sb.utils.data_pipeline.takes("wav", "phonemes", "pitch")
@sb.utils.data_pipeline.provides("mel_text_pair")
def audio_pipeline(wav, phonemes, pitch):
phoneme_seq = input_encoder.encode_sequence_torch(phonemes).int()
audio, fs = torchaudio.load(wav)
audio = audio.squeeze()
mel, energy = hparams["mel_spectogram"](audio=audio)
pitch = np.load(pitch)
pitch = torch.from_numpy(pitch)
pitch = pitch[: mel.shape[-1]]
return phoneme_seq, mel, pitch, energy, len(phoneme_seq), len(mel)
# define splits and load it as sb dataset
datasets = {}
for dataset in hparams["splits"]:
datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
json_path=hparams[f"{dataset}_json"],
replacements={"data_root": hparams["data_folder"]},
dynamic_items=[audio_pipeline],
output_keys=["mel_text_pair", "wav", "label", "pitch"],
)
return datasets
def main():
hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
with open(hparams_file) as fin:
hparams = load_hyperpyyaml(fin, overrides)
sb.utils.distributed.ddp_init_group(run_opts)
sb.create_experiment_directory(
experiment_directory=hparams["output_folder"],
hyperparams_to_save=hparams_file,
overrides=overrides,
)
from ljspeech_prepare import prepare_ljspeech
sb.utils.distributed.run_on_main(
prepare_ljspeech,
kwargs={
"data_folder": hparams["data_folder"],
"save_folder": hparams["save_folder"],
"splits": hparams["splits"],
"split_ratio": hparams["split_ratio"],
"model_name": hparams["model"].__class__.__name__,
"seed": hparams["seed"],
"pitch_n_fft": hparams["n_fft"],
"pitch_hop_length": hparams["hop_length"],
"pitch_min_f0": hparams["min_f0"],
"pitch_max_f0": hparams["max_f0"],
"skip_prep": hparams["skip_prep"],
"use_custom_cleaner": True,
"device": "cuda",
},
)
datasets = dataio_prepare(hparams)
# Brain class initialization
fastspeech2_brain = FastSpeech2Brain(
modules=hparams["modules"],
opt_class=hparams["opt_class"],
hparams=hparams,
run_opts=run_opts,
checkpointer=hparams["checkpointer"],
)
# Training
fastspeech2_brain.fit(
fastspeech2_brain.hparams.epoch_counter,
datasets["train"],
datasets["valid"],
train_loader_kwargs=hparams["train_dataloader_opts"],
valid_loader_kwargs=hparams["valid_dataloader_opts"],
)
if __name__ == "__main__":
main()
############################################################################
# Model: Tacotron2
# Tokens: Raw characters (English text)
# losses: MSE (mel) + BCE (stop gate) + guided attention
# Training: LJSpeech
# Authors: Georges Abous-Rjeili, Artem Ploujnikov, Yingzhi Wang
# ############################################################################
###################################
# Experiment Parameters and setup #
###################################
seed: 1234
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref ./results/tacotron2/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
epochs: 750
keep_checkpoint_interval: 50
###################################
# Progress Samples #
###################################
# Progress samples are used to monitor the progress
# of an ongoing training session by outputting samples
# of spectrograms, alignments, etc at regular intervals
# Whether to enable progress samples
progress_samples: True
# The path where the samples will be stored
progress_sample_path: !ref <output_folder>/samples
# The interval, in epochs. For instance, if it is set to 5,
# progress samples will be output every 5 epochs
progress_samples_interval: 1
# The sample size for raw batch samples saved in batch.pth
# (useful mostly for model debugging)
progress_batch_sample_size: 3
#################################
# Data files and pre-processing #
#################################
data_folder: !PLACEHOLDER # e.g., /localscratch/ljspeech
train_json: !ref <save_folder>/train.json
valid_json: !ref <save_folder>/valid.json
test_json: !ref <save_folder>/test.json
splits: ["train", "valid"]
split_ratio: [90, 10]
skip_prep: False
# Use the original NVIDIA preprocessing
# The text cleaners to be used (applicable to the NVIDIA preprocessing only)
text_cleaners: ['english_cleaners']
################################
# Audio Parameters #
################################
sample_rate: 22050
hop_length: 256
win_length: 1024
n_mel_channels: 80
n_fft: 1024
mel_fmin: 0.0
mel_fmax: 8000.0
mel_normalized: False
power: 1
norm: "slaney"
mel_scale: "slaney"
dynamic_range_compression: True
################################
# Optimization Hyperparameters #
################################
learning_rate: 0.001
weight_decay: 0.000006
batch_size: 64 #minimum 2
num_workers: 8
mask_padding: True
guided_attention_sigma: 0.2
guided_attention_weight: 50.0
guided_attention_weight_half_life: 10.
guided_attention_hard_stop: 50
gate_loss_weight: 1.0
train_dataloader_opts:
batch_size: !ref <batch_size>
drop_last: False
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
valid_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
test_dataloader_opts:
batch_size: !ref <batch_size>
num_workers: !ref <num_workers>
collate_fn: !new:speechbrain.lobes.models.Tacotron2.TextMelCollate
################################
# Model Parameters and model #
################################
n_symbols: 148 # fixed; depends on the symbol set used by text_to_sequence
symbols_embedding_dim: 512
# Encoder parameters
encoder_kernel_size: 5
encoder_n_convolutions: 3
encoder_embedding_dim: 512
# Decoder parameters
# The number of frames in the target per encoder step
n_frames_per_step: 1
decoder_rnn_dim: 1024
prenet_dim: 256
max_decoder_steps: 1000
gate_threshold: 0.5
p_attention_dropout: 0.1
p_decoder_dropout: 0.1
decoder_no_early_stopping: False
# Attention parameters
attention_rnn_dim: 1024
attention_dim: 128
# Location Layer parameters
attention_location_n_filters: 32
attention_location_kernel_size: 31
# Mel-post processing network parameters
postnet_embedding_dim: 512
postnet_kernel_size: 5
postnet_n_convolutions: 5
mel_spectogram: !name:speechbrain.lobes.models.Tacotron2.mel_spectogram
sample_rate: !ref <sample_rate>
hop_length: !ref <hop_length>
win_length: !ref <win_length>
n_fft: !ref <n_fft>
n_mels: !ref <n_mel_channels>
f_min: !ref <mel_fmin>
f_max: !ref <mel_fmax>
power: !ref <power>
normalized: !ref <mel_normalized>
norm: !ref <norm>
mel_scale: !ref <mel_scale>
compression: !ref <dynamic_range_compression>
#model
model: !new:speechbrain.lobes.models.Tacotron2.Tacotron2
mask_padding: !ref <mask_padding>
n_mel_channels: !ref <n_mel_channels>
# symbols
n_symbols: !ref <n_symbols>
symbols_embedding_dim: !ref <symbols_embedding_dim>
# encoder
encoder_kernel_size: !ref <encoder_kernel_size>
encoder_n_convolutions: !ref <encoder_n_convolutions>
encoder_embedding_dim: !ref <encoder_embedding_dim>
# attention
attention_rnn_dim: !ref <attention_rnn_dim>
attention_dim: !ref <attention_dim>
# attention location
attention_location_n_filters: !ref <attention_location_n_filters>
attention_location_kernel_size: !ref <attention_location_kernel_size>
# decoder
n_frames_per_step: !ref <n_frames_per_step>
decoder_rnn_dim: !ref <decoder_rnn_dim>
prenet_dim: !ref <prenet_dim>
max_decoder_steps: !ref <max_decoder_steps>
gate_threshold: !ref <gate_threshold>
p_attention_dropout: !ref <p_attention_dropout>
p_decoder_dropout: !ref <p_decoder_dropout>
# postnet
postnet_embedding_dim: !ref <postnet_embedding_dim>
postnet_kernel_size: !ref <postnet_kernel_size>
postnet_n_convolutions: !ref <postnet_n_convolutions>
decoder_no_early_stopping: !ref <decoder_no_early_stopping>
guided_attention_scheduler: !new:speechbrain.nnet.schedulers.StepScheduler
initial_value: !ref <guided_attention_weight>
half_life: !ref <guided_attention_weight_half_life>
criterion: !new:speechbrain.lobes.models.Tacotron2.Loss
gate_loss_weight: !ref <gate_loss_weight>
guided_attention_weight: !ref <guided_attention_weight>
guided_attention_sigma: !ref <guided_attention_sigma>
guided_attention_scheduler: !ref <guided_attention_scheduler>
guided_attention_hard_stop: !ref <guided_attention_hard_stop>
modules:
model: !ref <model>
#optimizer
opt_class: !name:torch.optim.Adam
lr: !ref <learning_rate>
weight_decay: !ref <weight_decay>
#epoch object
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <epochs>
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>
#annealing_function
lr_annealing: !new:speechbrain.nnet.schedulers.IntervalScheduler
intervals:
- steps: 6000
lr: 0.0005
- steps: 8000
lr: 0.0003
- steps: 10000
lr: 0.0001
#checkpointer
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
model: !ref <model>
counter: !ref <epoch_counter>
scheduler: !ref <lr_annealing>
#infer: !name:speechbrain.lobes.models.Tacotron2.infer
progress_sample_logger: !new:speechbrain.utils.train_logger.ProgressSampleLogger
output_path: !ref <progress_sample_path>
batch_sample_size: !ref <progress_batch_sample_size>
formats:
raw_batch: raw
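The hyperparameter file above is a HyperPyYAML document: !ref resolves references to other keys, !new: instantiates objects, and !name: builds partially applied callables at load time. A minimal sketch of loading it from code follows; the file name and the override values are assumptions.
```
# Minimal HyperPyYAML loading sketch (illustrative only).
from hyperpyyaml import load_hyperpyyaml

overrides = {"data_folder": "/path/to/LJSpeech-1.1", "batch_size": 32}
with open("hparams/train.yaml") as fin:  # assumed file name
    hparams = load_hyperpyyaml(fin, overrides)

print(hparams["epochs"])       # plain value (750 in the file above)
print(type(hparams["model"]))  # Tacotron2 module instantiated by !new:
```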
"""
LJspeech data preparation.
Download: https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
Authors
* Yingzhi WANG 2022
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
"""
import csv
import json
import logging
import os
import random
import re
import numpy as np
import tgt
import torch
import torchaudio
from tqdm import tqdm
from unidecode import unidecode
from speechbrain.dataio.dataio import load_pkl, save_pkl
from speechbrain.inference.text import GraphemeToPhoneme
from speechbrain.utils.data_utils import download_file
from speechbrain.utils.text_to_sequence import _g2p_keep_punctuations
logger = logging.getLogger(__name__)
OPT_FILE = "opt_ljspeech_prepare.pkl"
METADATA_CSV = "metadata.csv"
TRAIN_JSON = "train.json"
VALID_JSON = "valid.json"
TEST_JSON = "test.json"
WAVS = "wavs"
DURATIONS = "durations"
def prepare_ljspeech(
data_folder,
save_folder,
splits=["train", "valid"],
split_ratio=[90, 10],
model_name=None,
seed=1234,
pitch_n_fft=1024,
pitch_hop_length=256,
pitch_min_f0=65,
pitch_max_f0=400,
skip_prep=False,
use_custom_cleaner=False,
device="cpu",
):
"""
Prepares the json files for the LJSpeech dataset.
Arguments
---------
data_folder : str
Path to the folder where the original LJspeech dataset is stored
save_folder : str
The directory where to store the csv/json files
splits : list
List of dataset splits to prepare
split_ratio : list
Proportion for dataset splits
model_name : str
Model name (used to prepare additional model specific data)
seed : int
Random seed
pitch_n_fft : int
Number of fft points for pitch computation
pitch_hop_length : int
Hop length for pitch computation
pitch_min_f0 : int
Minimum f0 for pitch computation
pitch_max_f0 : int
Max f0 for pitch computation
skip_prep : bool
If True, skip preparation
use_custom_cleaner : bool
If True, uses custom cleaner defined for this recipe
device : str
Device to be used for computation (used as required)
Returns
-------
None
Example
-------
>>> from recipes.LJSpeech.TTS.ljspeech_prepare import prepare_ljspeech
>>> data_folder = 'data/LJspeech/'
>>> save_folder = 'save/'
>>> splits = ['train', 'valid']
>>> split_ratio = [90, 10]
>>> seed = 1234
>>> prepare_ljspeech(data_folder, save_folder, splits, split_ratio, seed)
"""
# Sets seeds for reproducible code
random.seed(seed)
if skip_prep:
return
# Creating configuration for easily skipping the data preparation stage
conf = {
"data_folder": data_folder,
"splits": splits,
"split_ratio": split_ratio,
"save_folder": save_folder,
"seed": seed,
}
if not os.path.exists(save_folder):
os.makedirs(save_folder)
# Setting output files
meta_csv = os.path.join(data_folder, METADATA_CSV)
wavs_folder = os.path.join(data_folder, WAVS)
save_opt = os.path.join(save_folder, OPT_FILE)
save_json_train = os.path.join(save_folder, TRAIN_JSON)
save_json_valid = os.path.join(save_folder, VALID_JSON)
save_json_test = os.path.join(save_folder, TEST_JSON)
phoneme_alignments_folder = None
duration_folder = None
pitch_folder = None
# Setting up additional folders required for FastSpeech2
if model_name is not None and "FastSpeech2" in model_name:
# This step requires phoneme alignments to be present in the data_folder
# We automatically download the alignments from https://www.dropbox.com/s/v28x5ldqqa288pu/LJSpeech.zip
# Download and unzip LJSpeech phoneme alignments from here: https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4
alignment_URL = (
"https://www.dropbox.com/s/v28x5ldqqa288pu/LJSpeech.zip?dl=1"
)
phoneme_alignments_folder = os.path.join(
data_folder, "TextGrid", "LJSpeech"
)
download_file(
alignment_URL, data_folder + "/alignments.zip", unpack=True
)
duration_folder = os.path.join(data_folder, "durations")
if not os.path.exists(duration_folder):
os.makedirs(duration_folder)
# Extract pitch for both FastSpeech2 and FastSpeech2WithAlignment models
pitch_folder = os.path.join(data_folder, "pitch")
if not os.path.exists(pitch_folder):
os.makedirs(pitch_folder)
# Check if this phase is already done (if so, skip it)
if skip(splits, save_folder, conf):
logger.info("Skipping preparation, completed in previous run.")
return
# Additional check to make sure metadata.csv and wavs folder exists
assert os.path.exists(meta_csv), "metadata.csv does not exist"
assert os.path.exists(wavs_folder), "wavs/ folder does not exist"
# Prepare data splits
msg = "Creating json file for ljspeech Dataset.."
logger.info(msg)
data_split, meta_csv = split_sets(data_folder, splits, split_ratio)
if "train" in splits:
prepare_json(
model_name,
data_split["train"],
save_json_train,
wavs_folder,
meta_csv,
phoneme_alignments_folder,
duration_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner,
device,
)
if "valid" in splits:
prepare_json(
model_name,
data_split["valid"],
save_json_valid,
wavs_folder,
meta_csv,
phoneme_alignments_folder,
duration_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner,
device,
)
if "test" in splits:
prepare_json(
model_name,
data_split["test"],
save_json_test,
wavs_folder,
meta_csv,
phoneme_alignments_folder,
duration_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner,
device,
)
save_pkl(conf, save_opt)
def skip(splits, save_folder, conf):
"""
Detects whether the LJSpeech data preparation has already been done.
If the preparation has been done, we can skip it.
Arguments
---------
splits : list
The portions of data to review.
save_folder : str
The path to the directory containing prepared files.
conf : dict
Configuration to match against saved config.
Returns
-------
bool
if True, the preparation phase can be skipped.
if False, it must be done.
"""
# Checking json files
skip = True
split_files = {
"train": TRAIN_JSON,
"valid": VALID_JSON,
"test": TEST_JSON,
}
for split in splits:
if not os.path.isfile(os.path.join(save_folder, split_files[split])):
skip = False
# Checking saved options
save_opt = os.path.join(save_folder, OPT_FILE)
if skip is True:
if os.path.isfile(save_opt):
opts_old = load_pkl(save_opt)
if opts_old == conf:
skip = True
else:
skip = False
else:
skip = False
return skip
def split_sets(data_folder, splits, split_ratio):
"""Randomly splits the wav list into training, validation, and test lists.
Note that a better approach is to make sure that all the classes have the
same proportion of samples for each session.
Arguments
---------
data_folder : str
The path to the directory containing the data.
splits : list
The list of the selected splits.
split_ratio : list
List composed of three integers that set the split ratios for train,
valid, and test sets, respectively.
For instance split_ratio=[80, 10, 10] will assign 80% of the sentences
to training, 10% for validation, and 10% for test.
Returns
-------
(data_split, meta_csv) : tuple
A dictionary containing the train, valid, and test splits, together with the parsed metadata rows.
"""
meta_csv = os.path.join(data_folder, METADATA_CSV)
csv_reader = csv.reader(
open(meta_csv), delimiter="|", quoting=csv.QUOTE_NONE
)
meta_csv = list(csv_reader)
index_for_sessions = []
session_id_start = "LJ001"
index_this_session = []
for i in range(len(meta_csv)):
session_id = meta_csv[i][0].split("-")[0]
if session_id == session_id_start:
index_this_session.append(i)
if i == len(meta_csv) - 1:
index_for_sessions.append(index_this_session)
else:
index_for_sessions.append(index_this_session)
session_id_start = session_id
index_this_session = [i]
session_len = [len(session) for session in index_for_sessions]
data_split = {}
for i, split in enumerate(splits):
data_split[split] = []
for j in range(len(index_for_sessions)):
if split == "train":
random.shuffle(index_for_sessions[j])
n_snts = int(session_len[j] * split_ratio[i] / sum(split_ratio))
data_split[split].extend(index_for_sessions[j][0:n_snts])
del index_for_sessions[j][0:n_snts]
if split == "valid":
if "test" in splits:
random.shuffle(index_for_sessions[j])
n_snts = int(
session_len[j] * split_ratio[i] / sum(split_ratio)
)
data_split[split].extend(index_for_sessions[j][0:n_snts])
del index_for_sessions[j][0:n_snts]
else:
data_split[split].extend(index_for_sessions[j])
if split == "test":
data_split[split].extend(index_for_sessions[j])
return data_split, meta_csv
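# Illustrative behaviour of split_sets (numbers assumed): with
# splits=["train", "valid"], split_ratio=[90, 10] and a session of 100
# utterances, "train" receives 90 randomly selected indices and "valid"
# receives the remaining 10, since the "valid" branch takes everything left
# over whenever no "test" split is requested.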
def prepare_json(
model_name,
seg_lst,
json_file,
wavs_folder,
csv_reader,
phoneme_alignments_folder,
durations_folder,
pitch_folder,
pitch_n_fft,
pitch_hop_length,
pitch_min_f0,
pitch_max_f0,
use_custom_cleaner=False,
device="cpu",
):
"""
Creates json file given a list of indexes.
Arguments
---------
model_name : str
Model name (used to prepare additional model specific data)
seg_lst : list
The list of json indexes of a given data split
json_file : str
Output json path
wavs_folder : str
LJspeech wavs folder
csv_reader : _csv.reader
LJspeech metadata
phoneme_alignments_folder : path
Path where the phoneme alignments are stored
durations_folder : path
Folder where to store the duration values of each audio
pitch_folder : path
Folder where to store the pitch of each audio
pitch_n_fft : int
Number of fft points for pitch computation
pitch_hop_length : int
Hop length for pitch computation
pitch_min_f0 : int
Minimum f0 for pitch computation
pitch_max_f0 : int
Max f0 for pitch computation
use_custom_cleaner : bool
If True, uses custom cleaner defined for this recipe
device : str
Device to be used for computation (used as required)
"""
logger.info(f"preparing {json_file}.")
if model_name in ["Tacotron2", "FastSpeech2WithAlignment"]:
logger.info(
"Computing phonemes for LJSpeech labels using SpeechBrain G2P. This may take a while."
)
g2p = GraphemeToPhoneme.from_hparams(
"speechbrain/soundchoice-g2p", run_opts={"device": device}
)
if model_name is not None and "FastSpeech2" in model_name:
logger.info(
"Computing pitch as required for FastSpeech2. This may take a while."
)
json_dict = {}
for index in tqdm(seg_lst):
# Common data preparation
id = list(csv_reader)[index][0]
wav = os.path.join(wavs_folder, f"{id}.wav")
label = list(csv_reader)[index][2]
if use_custom_cleaner:
label = custom_clean(label, model_name)
json_dict[id] = {
"uttid": id,
"wav": wav,
"label": label,
"segment": True if "train" in json_file else False,
}
# FastSpeech2 specific data preparation
if model_name == "FastSpeech2":
audio, fs = torchaudio.load(wav)
# Parses phoneme alignments
textgrid_path = os.path.join(
phoneme_alignments_folder, f"{id}.TextGrid"
)
textgrid = tgt.io.read_textgrid(
textgrid_path, include_empty_intervals=True
)
last_phoneme_flags = get_last_phoneme_info(
textgrid.get_tier_by_name("words"),
textgrid.get_tier_by_name("phones"),
)
(
phonemes,
duration,
start,
end,
trimmed_last_phoneme_flags,
) = get_alignment(
textgrid.get_tier_by_name("phones"),
fs,
pitch_hop_length,
last_phoneme_flags,
)
# Gets label phonemes
label_phoneme = " ".join(phonemes)
spn_labels = [0] * len(phonemes)
for i in range(1, len(phonemes)):
if phonemes[i] == "spn":
spn_labels[i - 1] = 1
if start >= end:
print(f"Skipping {id}")
continue
# Saves durations
duration_file_path = os.path.join(durations_folder, f"{id}.npy")
np.save(duration_file_path, duration)
# Computes pitch
audio = audio[:, int(fs * start) : int(fs * end)]
pitch_file = wav.replace(".wav", ".npy").replace(
wavs_folder, pitch_folder
)
if not os.path.isfile(pitch_file):
pitch = torchaudio.functional.detect_pitch_frequency(
waveform=audio,
sample_rate=fs,
frame_time=(pitch_hop_length / fs),
win_length=3,
freq_low=pitch_min_f0,
freq_high=pitch_max_f0,
).squeeze(0)
# Concatenate last element to match duration.
pitch = torch.cat([pitch, pitch[-1].unsqueeze(0)])
# Mean and Variance Normalization
mean = 256.1732939688805
std = 328.319759158607
pitch = (pitch - mean) / std
pitch = pitch[: sum(duration)]
np.save(pitch_file, pitch)
# Updates data for the utterance
json_dict[id].update({"label_phoneme": label_phoneme})
json_dict[id].update({"spn_labels": spn_labels})
json_dict[id].update({"start": start})
json_dict[id].update({"end": end})
json_dict[id].update({"durations": duration_file_path})
json_dict[id].update({"pitch": pitch_file})
json_dict[id].update(
{"last_phoneme_flags": trimmed_last_phoneme_flags}
)
# FastSpeech2WithAlignment specific data preparation
if model_name == "FastSpeech2WithAlignment":
audio, fs = torchaudio.load(wav)
# Computes pitch
pitch_file = wav.replace(".wav", ".npy").replace(
wavs_folder, pitch_folder
)
if not os.path.isfile(pitch_file):
if torchaudio.__version__ < "2.1":
pitch = torchaudio.functional.compute_kaldi_pitch(
waveform=audio,
sample_rate=fs,
frame_length=(pitch_n_fft / fs * 1000),
frame_shift=(pitch_hop_length / fs * 1000),
min_f0=pitch_min_f0,
max_f0=pitch_max_f0,
)[0, :, 0]
else:
pitch = torchaudio.functional.detect_pitch_frequency(
waveform=audio,
sample_rate=fs,
frame_time=(pitch_hop_length / fs),
win_length=3,
freq_low=pitch_min_f0,
freq_high=pitch_max_f0,
).squeeze(0)
# Concatenate last element to match duration.
pitch = torch.cat([pitch, pitch[-1].unsqueeze(0)])
# Mean and Variance Normalization
mean = 256.1732939688805
std = 328.319759158607
pitch = (pitch - mean) / std
np.save(pitch_file, pitch)
phonemes = _g2p_keep_punctuations(g2p, label)
# Updates data for the utterance
json_dict[id].update({"phonemes": phonemes})
json_dict[id].update({"pitch": pitch_file})
# Writing the dictionary to the json file
with open(json_file, mode="w") as json_f:
json.dump(json_dict, json_f, indent=2)
logger.info(f"{json_file} successfully created!")
def get_alignment(tier, sampling_rate, hop_length, last_phoneme_flags):
"""
Returns phonemes, phoneme durations (in frames), start time (in seconds), end time (in seconds), and last-phoneme flags.
This function is adapted from https://github.com/ming024/FastSpeech2/blob/master/preprocessor/preprocessor.py
Arguments
---------
tier : tgt.core.IntervalTier
For an utterance, contains Interval objects for phonemes and their start time and end time in seconds
sampling_rate : int
Sample rate of the audio signal
hop_length : int
Hop length for duration computation
last_phoneme_flags : list
List of (phoneme, flag) tuples with flag=1 if the phoneme is the last phoneme else flag=0
Returns
-------
(phonemes, durations, start_time, end_time, trimmed_last_phoneme_flags) : tuple
The phonemes, durations, start time, end time, and trimmed last-phoneme flags for an utterance
"""
sil_phones = ["sil", "sp", "spn", ""]
phonemes = []
durations = []
start_time = 0
end_time = 0
end_idx = 0
trimmed_last_phoneme_flags = []
flag_iter = iter(last_phoneme_flags)
for t in tier._objects:
s, e, p = t.start_time, t.end_time, t.text
current_flag = next(flag_iter)
# Trims leading silences
if phonemes == []:
if p in sil_phones:
continue
else:
start_time = s
if p not in sil_phones:
# For ordinary phones
# Removes stress indicators
if p[-1].isdigit():
phonemes.append(p[:-1])
else:
phonemes.append(p)
trimmed_last_phoneme_flags.append(current_flag[1])
end_time = e
end_idx = len(phonemes)
else:
# Uses a unique token for all silent phones
phonemes.append("spn")
trimmed_last_phoneme_flags.append(current_flag[1])
durations.append(
int(
np.round(e * sampling_rate / hop_length)
- np.round(s * sampling_rate / hop_length)
)
)
# Trims trailing silences
phonemes = phonemes[:end_idx]
durations = durations[:end_idx]
return phonemes, durations, start_time, end_time, trimmed_last_phoneme_flags
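# Worked example for the duration computation above (values assumed, not taken
# from the dataset): with sampling_rate=22050 and hop_length=256, a phoneme
# spanning 0.10 s to 0.25 s contributes
# int(np.round(0.25 * 22050 / 256) - np.round(0.10 * 22050 / 256)) = 22 - 9 = 13 frames.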
def get_last_phoneme_info(words_seq, phones_seq):
"""This function takes word and phoneme tiers from a TextGrid file as input
and provides a list of tuples for the phoneme sequence indicating whether
each of the phonemes is the last phoneme of a word or not.
Each tuple of the returned list has this format: (phoneme, flag)
Arguments
---------
words_seq : tier
word tier from a TextGrid file
phones_seq : tier
phoneme tier from a TextGrid file
Returns
-------
last_phoneme_flags : list
each tuple of the returned list has this format: (phoneme, flag)
"""
# Gets all phoneme objects for the entire sequence
phoneme_objects = phones_seq._objects
phoneme_iter = iter(phoneme_objects)
# Stores flags to show if an element (phoneme) is the last phoneme of a word
last_phoneme_flags = list()
# Matches the end times of the phoneme and word objects to get the last phoneme information
for word_obj in words_seq._objects:
word_end_time = word_obj.end_time
current_phoneme = next(phoneme_iter, None)
while current_phoneme:
phoneme_end_time = current_phoneme.end_time
if phoneme_end_time == word_end_time:
last_phoneme_flags.append((current_phoneme.text, 1))
break
else:
last_phoneme_flags.append((current_phoneme.text, 0))
current_phoneme = next(phoneme_iter, None)
return last_phoneme_flags
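# Illustrative example (end times assumed): for a word tier containing "the"
# (ending at 0.30 s) and "cat" (ending at 0.62 s), and a phoneme tier
# DH, AH, K, AE, T where AH ends at 0.30 s and T ends at 0.62 s, the function
# returns [("DH", 0), ("AH", 1), ("K", 0), ("AE", 0), ("T", 1)].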
def custom_clean(text, model_name):
"""
Uses custom criteria to clean text.
Arguments
---------
text : str
Input text to be cleaned
model_name : str
Model name; determines whether punctuation-specific handling is applied
Returns
-------
text : str
Cleaned text
"""
_abbreviations = [
(re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
for x in [
("mrs", "missus"),
("mr", "mister"),
("dr", "doctor"),
("st", "saint"),
("co", "company"),
("jr", "junior"),
("maj", "major"),
("gen", "general"),
("drs", "doctors"),
("rev", "reverend"),
("lt", "lieutenant"),
("hon", "honorable"),
("sgt", "sergeant"),
("capt", "captain"),
("esq", "esquire"),
("ltd", "limited"),
("col", "colonel"),
("ft", "fort"),
]
]
text = unidecode(text.lower())
if model_name != "FastSpeech2WithAlignment":
text = re.sub("[:;]", " - ", text)
text = re.sub(r'[)(\[\]"]', " ", text)
text = text.strip().strip("-")
text = re.sub(" +", " ", text)
for regex, replacement in _abbreviations:
text = re.sub(regex, replacement, text)
return text
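For reference, a small illustrative use of the cleaner defined above; the sample sentence and the expected output are assumptions derived from the regexes, not taken from this repository.
```
# Hypothetical usage of custom_clean from ljspeech_prepare (illustrative only).
from ljspeech_prepare import custom_clean

sample = 'Mr. Smith said: "Dr. Jones arrived (at last)."'
print(custom_clean(sample, model_name="Tacotron2"))
# expected output, roughly: mister smith said - doctor jones arrived at last .
```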
SpeechBrain system description
==============================
Python version:
3.10.12 (main, May 26 2024, 00:14:02) [GCC 9.4.0]
==============================
Installed Python packages:
accelerate==0.31.0
addict==2.4.0
aiosignal==1.3.1
aitemplate @ http://10.6.10.68:8000/release/aitemplate/dtk24.04.1/aitemplate-0.0.1%2Bdas1.1.git5d8aa20.dtk2404.torch2.1.0-py3-none-any.whl#sha256=ad763a7cfd3935857cf10a07a2a97899fd64dda481add2f48de8b8930bd341dd
annotated-types==0.7.0
anyio==4.4.0
apex @ http://10.6.10.68:8000/release/apex/dtk24.04.1/apex-1.1.0%2Bdas1.1.gitf477a3a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=85eb662d13d6e6c3b61c2d878378c2338c4479bc03a1912c3eabddc2d9d08aa1
attrs==23.2.0
audioread==3.0.1
bitsandbytes @ http://10.6.10.68:8000/release/bitsandbyte/dtk24.04.1/bitsandbytes-0.42.0%2Bdas1.1.gitce85679.abi1.dtk2404.torch2.1.0-py3-none-any.whl#sha256=6324e330c8d12b858d39f4986c0ed0836fcb05f539cee92a7cf558e17954ae0d
certifi==2024.6.2
cffi==1.17.0
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
contourpy==1.2.1
cycler==0.12.1
decorator==5.1.1
deepspeed @ http://10.6.10.68:8000/release/deepspeed/dtk24.04.1/deepspeed-0.12.3%2Bgita724046.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=2c158ed2dab21f4f09e7fc29776cb43a1593b13cec33168ce3483f318b852fc9
distlib==0.3.8
dnspython==2.6.1
dropout-layer-norm @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/dropout_layer_norm-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ae10c7cc231a8e38492292e91e76ba710d7679762604c0a7f10964b2385cdbd7
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastpt @ http://10.6.10.68:8000/release/fastpt/dtk24.04.1/fastpt-1.0.0%2Bdas1.1.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ecf30dadcd2482adb1107991edde19b6559b8237379dbb0a3e6eb7306aad3f9a
filelock==3.15.1
fire==0.6.0
flash-attn @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/flash_attn-2.0.4%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7ca8e78ee0624b1ff0e91e9fc265e61b9510f02123a010ac71a2f8e5d08a62f7
flatbuffers==24.3.25
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2024.6.0
fused-dense-lib @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/fused_dense_lib-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7202dd258a86bb7a1572e3b44b90dae667b0c948bf0f420b05924a107aaaba03
h11==0.14.0
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.4
humanfriendly==10.0
HyperPyYAML==1.2.2
hypothesis==5.35.1
identify==2.6.0
idna==3.7
importlib_metadata==7.1.0
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
layer-check-pt @ http://10.6.10.68:8000/release/layercheck/dtk24.04.1/layer_check_pt-1.2.3.git59a087a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=807adae2d4d4b74898777f81e1b94f1af4d881afe6a7826c7c910b211accbea7
lazy_loader==0.4
librosa==0.10.2.post1
lightop @ http://10.6.10.68:8000/release/lightop/dtk24.04.1/lightop-0.4%2Bdas1.1git8e60f07.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=2f2c88fd3fe4be179f44c4849e9224cb5b2b259843fc5a2d088e468b7a14c1b1
llvmlite==0.43.0
lmdeploy @ http://10.6.10.68:8000/release/lmdeploy/dtk24.04.1/lmdeploy-0.2.6%2Bdas1.1.git6ba90df.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=92ecee2c8b982f86e5c3219ded24d2ede219f415bf2cd4297f989a03387a203c
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
mdurl==0.1.2
mmcv @ http://10.6.10.68:8000/release/mmcv/dtk24.04.1/mmcv-2.0.1%2Bdas1.1.gite58da25.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=7a937ae22f81b44d9100907e11303c31bf9a670cb4c92e361675674a41a8a07f
mmengine==0.10.4
mmengine-lite==0.10.4
mpmath==1.3.0
msgpack==1.0.8
networkx==3.3
ninja==1.11.1.1
nodeenv==1.9.1
numba==0.60.0
numpy==1.24.3
onnxruntime @ http://10.6.10.68:8000/release/onnxruntime/dtk24.04.1/onnxruntime-1.15.0%2Bdas1.1.git739f24d.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=d0d24167188d2c85f1ed4110fc43e62ea40c74280716d9b5fe9540256f17869a
opencv-python==4.10.0.82
orjson==3.10.5
packaging==24.1
pandas==2.2.2
peft==0.9.0
pillow==10.3.0
platformdirs==4.2.2
pooch==1.8.2
pre-commit==3.8.0
prometheus_client==0.20.0
protobuf==5.27.1
psutil==5.9.8
py-cpuinfo==9.0.0
pycparser==2.22
pydantic==2.7.4
pydantic_core==2.18.4
Pygments==2.18.0
pygtrie==2.5.0
pynvml==11.5.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
ray==2.9.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rotary-emb @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/rotary_emb-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=cc15ec6ae73875515243d7f5c96ab214455a33a4a99eb7f1327f773cae1e6721
rpds-py==0.18.1
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
safetensors==0.4.3
scikit-learn==1.5.1
scipy==1.13.1
sentencepiece==0.2.0
shellingham==1.5.4
shortuuid==1.0.13
six==1.16.0
sniffio==1.3.1
sortedcontainers==2.4.0
soundfile==0.12.1
soxr==0.5.0
speechbrain==1.0.0
starlette==0.37.2
sympy==1.12.1
termcolor==2.4.0
tgt==1.5
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.15.0
tomli==2.0.1
torch @ http://10.6.10.68:8000/release/pytorch/dtk24.04.1/torch-2.1.0%2Bdas1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=5fd3bcef3aa197c0922727913aca53db9ce3f2fd4a9b22bba1973c3d526377f9
torchaudio @ http://10.6.10.68:8000/release/torchaudio/dtk24.04.1/torchaudio-2.1.2%2Bdas1.1.git63d9a68.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=4fcc556a7a2fffe64ddd57f22e5972b1b2b723f6fdfdaa305bd01551036df38b
torchvision @ http://10.6.10.68:8000/release/vision/dtk24.04.1/torchvision-0.16.0%2Bdas1.1.git7d45932.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=e3032e1bcc0857b54391d66744f97e5cff0dc7e7bb508196356ee927fb81ec01
tqdm==4.66.4
transformers==4.38.0
triton @ http://10.6.10.68:8000/release/triton/dtk24.04.1/triton-2.1.0%2Bdas1.1.git4bf1007a.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=4c30d45dab071e65d1704a5cd189b14c4ac20bd59a7061032dfd631b1fc37645
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
ujson==5.10.0
Unidecode==1.3.8
urllib3==2.2.1
uvicorn==0.30.1
uvloop==0.19.0
virtualenv==20.26.3
vllm @ http://10.6.10.68:8000/release/vllm/dtk24.04.1/vllm-0.3.3%2Bdas1.1.gitdf6349c.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=48d265b07efa36f028eca45a3667fa10d3cf30eb1b8f019b62e3b255fb9e49c4
watchfiles==0.22.0
websockets==12.0
xentropy-cuda-lib @ http://10.6.10.68:8000/release/flash_attn/dtk24.04.1/xentropy_cuda_lib-0.1%2Bdas1.1gitc7a8c18.abi1.dtk2404.torch2.1-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=91b058d6a5fd2734a5085d68e08d3a1f948fe9c0119c46885d19f55293e2cce4
xformers @ http://10.6.10.68:8000/release/xformers/dtk24.04.1/xformers-0.0.25%2Bdas1.1.git8ef8bc1.abi1.dtk2404.torch2.1.0-cp310-cp310-manylinux_2_31_x86_64.whl#sha256=ca87fd065753c1be3b9fad552eba02d30cd3f4c673f01e81a763834eb5cbb9cc
yapf==0.40.2
zipp==3.19.2
==============================
Could not get git revision
==============================
ROCm version:
5.7.24213