# Parler-TTS
Parler-TTS is a work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

[[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](TODO - linked to spaces)

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter Parler-TTS model, trained on 10.5K hours of audio data.

Contrary to standard TTS models, Parler-TTS allows you to directly describe the speaker characteristics with a simple text description, in which you can modulate gender, pitch, speaking style, accent, and more.

Reproducing the TTS model requires the following 5 steps to be completed in order:

1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model
## Step 1: Train the Accent Classifier
The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task.

Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden-layer to map the audio embeddings to class label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. The linear classifier is randomly initialised, and is thus always trained.
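Conceptually, the setup looks like the following sketch: a frozen encoder with a randomly initialised linear head over mean-pooled last-layer embeddings. This is an illustration only (the number of classes is hypothetical), not the script's actual implementation:

```py
import torch
import torch.nn as nn
from transformers import AutoModel

# Illustrative sketch only, not the script's actual code.
encoder = AutoModel.from_pretrained("facebook/mms-lid-126")
for param in encoder.parameters():
    param.requires_grad = False  # equivalent of --freeze_base_model

num_accent_classes = 16  # hypothetical label count
classifier = nn.Linear(encoder.config.hidden_size, num_accent_classes)

def classify(input_values: torch.Tensor) -> torch.Tensor:
    hidden = encoder(input_values).last_hidden_state  # (batch, time, dim)
    pooled = hidden.mean(dim=1)                       # mean-pool over time
    return classifier(pooled)                         # class logits
```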
The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).

In the following example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) model, and training the linear classifier on a combination of three open-source datasets:

1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc)

The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc) to give the final classification accuracy.
An example launch command is given below; the dataset name, config and split arguments (joined by `+`, as described above) should be appended to it:

```sh
python run_audio_classification.py \
    --model_name_or_path "facebook/mms-lid-126" \
    --output_dir "./" \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --fp16 \
    --learning_rate 1e-4 \
    --max_length_seconds 20 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --preprocessing_num_workers 16 \
    --dataloader_num_workers 4 \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --load_best_model_at_end True \
    --metric_for_best_model "accuracy" \
    --save_total_limit 3 \
    --freeze_base_model \
    --push_to_hub \
    --trust_remote_code
```
Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy, as shown in the sketch below.
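As an illustration, a minimal `preprocess_labels` could look like the sketch below. The mapping table is a hypothetical example; the script's real normalisation strategy may differ:

```py
# Hypothetical accent groupings; extend to suit your datasets.
ACCENT_MAPPING = {
    "southern irish": "irish",
    "northern irish": "irish",
}

def preprocess_labels(label: str) -> str:
    """Normalise an accent label to a canonical, grouped form."""
    label = label.strip().lower()
    return ACCENT_MAPPING.get(label, label)
```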
## Step 2: Annotate the Training Set
Annotate the training dataset with information on: SNR (signal-to-noise ratio), C50 (speech clarity), pitch, and speaking rate. A sketch of one such annotation is shown below.
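As a sketch of what one of these annotations might look like in practice, the snippet below estimates a speaking rate from an audio file and its transcript. It is a simplification (words per second rather than a phoneme-level measure), under the assumption that each example ships with a transcript:

```py
import soundfile as sf

def speaking_rate(audio_path: str, transcript: str) -> float:
    """Rough speaking-rate estimate in words per second."""
    audio, sampling_rate = sf.read(audio_path)
    duration_seconds = len(audio) / sampling_rate
    return len(transcript.split()) / duration_seconds
```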
## Step 3: Aggregate Statistics
Aggregate the statistics computed in Step 2 across the training set, then convert the continuous values to discrete labels (see the sketch below).
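For example, continuous pitch measurements can be bucketed into a handful of labels using dataset-level percentiles. The bin edges and label names below are illustrative assumptions:

```py
import numpy as np

# Hypothetical label names; the real recipe may use different
# bins per attribute (SNR, C50, pitch, speaking rate).
LABELS = ["very low", "low", "moderate", "high", "very high"]

def to_discrete_labels(values):
    """Map continuous values to labels via 20th/40th/60th/80th percentiles."""
    edges = np.percentile(values, [20, 40, 60, 80])
    return [LABELS[int(np.searchsorted(edges, v))] for v in values]
```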
## Step 4: Create Descriptions
Convert each sequence of discrete labels into a natural-language text description (using an LLM), as sketched below.
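A minimal sketch of this step, assuming an instruction-tuned model served through the `transformers` text-generation `pipeline`; the model choice and prompt wording are illustrative:

```py
from transformers import pipeline

# Model choice and prompt template are illustrative assumptions.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

labels = {
    "gender": "female",
    "pitch": "slightly low",
    "speaking_rate": "very fast",
    "reverberation": "very confined",
}
prompt = (
    "Write one short sentence describing a speaker with these "
    f"characteristics: {labels}"
)
description = generator(prompt, max_new_tokens=60)[0]["generated_text"]
print(description)
```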
## Step 5: Train the Model

Train a MusicGen-style model on the TTS task. Note that this step requires [DAC](https://github.com/descriptinc/descript-audio-codec) (the Descript Audio Codec) as the audio tokenizer.
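As a pointer, loading the pretrained codec with the [`descript-audio-codec`](https://github.com/descriptinc/descript-audio-codec) package looks roughly like the sketch below (API as documented in that repository; the sampling-rate variant is an assumption):

```py
import dac

# Download and load the pretrained 44.1 kHz DAC codec, used as the
# audio tokenizer for the MusicGen-style TTS model.
model_path = dac.utils.download(model_type="44khz")
codec = dac.DAC.load(model_path)
```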
## Installation

Parler-TTS has light-weight dependencies and can be installed in one line:

```sh
pip install parler-tts
```

## Inference

> [!TIP]
> You can directly try it out in an interactive demo [here](TODO: add link to spaces)!

Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet (the checkpoint id is a placeholder; swap in the repo id you want to use):

```py
import soundfile as sf
from transformers import AutoTokenizer

from parler_tts import ParlerTTSForConditionalGeneration

# TODO: change repo id (the id below is a placeholder)
repo_id = "parler-tts/parler_tts_300M_v0.1"

model = ParlerTTSForConditionalGeneration.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Tokenise the description (conditioning) and the transcript prompt
input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate the speech waveform and save it to disk
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

## Gradio demo

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:

```sh
pip install gradio
```

Then, run:

```sh
python helpers/gradio_demo/app.py
```

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- and the many libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).
## Citation
```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```