README.md

This is an example pipeline for text-to-speech using Tacotron2.


## Install required packages

Required packages
```bash
pip install librosa tqdm inflect joblib
```

To use tensorboard
```bash
pip install tensorboard pillow
```

## Training Tacotron2 with character as input

The training of Tacotron2 can be invoked with the following command.

```bash
python train.py \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_characters \
    --logging-dir ./logs \
    --checkpoint-path ./ckpt.pth \
    --dataset-path ./
```

The training script will use all GPUs that is available, please set the
environment variable `CUDA_VISIBLE_DEVICES` if you don't want all GPUs to be used.
The newest checkpoint will be saved to `./ckpt.pth` and the checkpoint with the best validation
loss will be saved to `./best_ckpt.pth`.
The training log will be saved to `./logs/train.log` and the tensorboard results will also
be in `./logs`.

If `./ckpt.pth` already exist, this script will automatically load the file and try to continue
training from the checkpoint.

This command takes around 36 hours to train on 8 NVIDIA Tesla V100 GPUs.

To train the Tacotron2 model to work with the [pretrained wavernn](https://pytorch.org/audio/main/models.html#id10)
with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command instead.

```bash
python train.py
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --sample-rate 22050 \
    --n-fft 2048 \
    --hop-length 275 \
    --win-length 1100 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_characters \
    --logging-dir ./wavernn_logs \
    --checkpoint-path ./ckpt_wavernn.pth \
    --dataset-path ./
```


## Training Tacotron2 with phoneme as input

#### Dependencies

This example use the [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) as
the phonemizer (the function to turn text into phonemes),
please install it with the following command (the code is tested with version 0.0.15).

```bash
pip install deep-phonemizer==0.0.15
```

Then download the model weights from [their website](https://github.com/as-ideas/DeepPhonemizer)

The link to the checkpoint that is tested with this example is
[https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt](https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt).

#### Running training script

The training of Tacotron2 with english phonemes as input can be invoked with the following command.

```bash
python train.py \
    --workers 12 \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_phonemes \
    --phonemizer DeepPhonemizer \
    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
    --cmudict-root ./ \
    --logging-dir ./english_phonemes_logs \
    --checkpoint-path ./english_phonemes_ckpt.pth \
    --dataset-path ./
```

Similar to the previous examples, this command will save the log in the directory `./english_phonemes_logs`
and the checkpoint will be saved to `./english_phonemes_ckpt.pth`.


To train the Tacotron2 model with english phonemes that works with the
[pretrained wavernn](https://pytorch.org/audio/main/models.html#id10)
with checkpoint_name `"wavernn_10k_epochs_8bits_ljspeech"`, please run the following command.

```bash
python train.py \
    --workers 12 \
    --learning-rate 1e-3 \
    --epochs 1501 \
    --anneal-steps 500 1000 1500 \
    --anneal-factor 0.1 \
    --sample-rate 22050 \
    --n-fft 2048 \
    --hop-length 275 \
    --win-length 1100 \
    --mel-fmin 40 \
    --mel-fmax 11025 \
    --batch-size 96 \
    --weight-decay 1e-6 \
    --grad-clip 1.0 \
    --text-preprocessor english_phonemes \
    --phonemizer DeepPhonemizer \
    --phonemizer-checkpoint ./en_us_cmudict_forward.pt \
    --cmudict-root ./ \
    --logging-dir ./english_phonemes_wavernn_logs \
    --checkpoint-path ./english_phonemes_wavernn_ckpt.pth \
    --dataset-path ./
```