# Parler-TTS
Work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com).
[[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](TODO - linked to spaces)
> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter Parler-TTS model, trained on 10.5K hours of audio data.

Parler-TTS is a reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
Unlike standard TTS models, Parler-TTS allows you to directly describe the speaker characteristics with a simple text description, where you can modulate gender, pitch, speaking style, accent, etc.

## Inference

> [!TIP]
> You can directly try it out in an interactive demo [here](TODO: add link to spaces)!

Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet.

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# The description conditions the voice; the prompt is the text to be spoken.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:

```sh
pip install gradio
```

Then, run:

```sh
python helpers/gradio_demo/app.py
```

## Training

Reproducing the TTS model requires the following 5 steps to be completed in order:

1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model

### Step 1: Train the Accent Classifier

The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task. Multiple datasets can be combined by separating the dataset names, configs and splits with the `+` character in the launch command.

Following Stability AI's approach, we take audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) model and train a linear classifier on a combination of three open-source datasets:

1. The English Accented (`en_accented`) subset of [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)

When combining datasets, the following should be taken into account:

1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy.
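To make the frozen-encoder approach concrete, here is a minimal sketch of the idea (illustrative only, not the actual training script; the mean-pooling strategy and the `num_accents` count are assumptions):

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Frozen audio encoder: the base wav2vec2 encoder of the MMS-LID checkpoint.
encoder = AutoModel.from_pretrained("facebook/mms-lid-126")
encoder.requires_grad_(False)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-126")

num_accents = 16  # hypothetical number of classes after label normalisation
classifier = torch.nn.Linear(encoder.config.hidden_size, num_accents)

def embed(waveform, sampling_rate=16_000):
    """Mean-pool the frozen encoder's hidden states into one embedding per clip."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (batch, frames, hidden)
    return hidden.mean(dim=1)

# Training then reduces to fitting `classifier` on `embed(...)` outputs with cross-entropy.
```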
### Step 2: Annotate the Training Set

Annotate the training dataset with information on: SNR (signal-to-noise ratio), C50 (a measure of reverberation), pitch and speaking rate.
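As an illustration, here is a minimal sketch of one such annotation, estimating speaking rate from the transcript and the clip duration (the dataset and column names are assumptions; SNR, C50 and pitch require dedicated audio models):

```python
from datasets import load_dataset

# Example source; any audio dataset with transcripts works the same way.
ds = load_dataset("facebook/voxpopuli", "en_accented", split="test")

def annotate(example):
    # Speaking rate as words per second, from transcript length and clip duration.
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    n_words = len(example["normalized_text"].split())
    example["speaking_rate"] = n_words / max(duration, 1e-6)
    return example

ds = ds.map(annotate)
```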
### Step 3: Aggregate Statistics

Aggregate the statistics from Step 2 and convert the continuous values to discrete labels.
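For example, each continuous annotation can be bucketed into a small vocabulary of text labels. Here is a sketch using percentile-based bins (the binning strategy and label names are assumptions):

```python
import numpy as np

speaking_rates = np.array([2.1, 3.4, 4.8, 2.9, 5.6, 3.8])  # toy values from Step 2
bin_names = ["very slowly", "slowly", "moderately", "quickly", "very quickly"]

# Percentile-based edges balance the label distribution across the dataset.
edges = np.percentile(speaking_rates, np.linspace(0, 100, len(bin_names) + 1))
labels = [
    bin_names[min(np.searchsorted(edges, rate, side="right") - 1, len(bin_names) - 1)]
    for rate in speaking_rates
]
print(labels)
```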
### Step 4: Create Descriptions

Convert the sequences of discrete labels into natural-language descriptions (using an LLM).
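A hypothetical sketch of this step, using a `transformers` text-generation pipeline (the model choice and prompt wording are illustrative assumptions, not the repo's actual setup):

```python
from transformers import pipeline

# Any instruction-tuned LLM can be substituted here.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

labels = {
    "gender": "female",
    "pitch": "slightly low",
    "speaking_rate": "very fast",
    "reverberation": "very confined sounding",
    "noise": "clear audio quality",
}
prompt = (
    "Write a single natural sentence describing a speaker with these "
    f"characteristics, without listing them verbatim: {labels}"
)
description = generator(prompt, max_new_tokens=60, return_full_text=False)[0]["generated_text"]
print(description)
```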
### Step 5: Train the Model

Train a MusicGen-style model on the TTS task. This step requires the [DAC](https://github.com/descriptinc/descript-audio-codec) audio codec, which provides the discrete audio tokens the model learns to predict.
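For context, here is a minimal sketch of tokenizing audio with DAC, following the `descript-audio-codec` README (the checkpoint choice is an assumption, and the training script's exact usage may differ):

```python
import dac
import torch

# Download and load a pretrained DAC checkpoint.
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
model.eval()

# One second of (mock) mono audio at 44.1 kHz: (batch, channels, samples).
signal = torch.randn(1, 1, 44100)
x = model.preprocess(signal, 44100)
with torch.no_grad():
    z, codes, latents, _, _ = model.encode(x)  # `codes` are the discrete audio tokens
```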
## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:

- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- and the many libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).
## Citation
```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},