Commit c2f3296f authored by Yoach Lacombe

update README

parent 6732a076

**README.md (old version):**

ATTENTION: don't forget to add group_by_length in configs.

# Parler-TTS

Work-in-progress reproduction of the text-to-speech (TTS) model described in the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

Reproducing the TTS model requires the following 5 steps to be completed in order:

1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model

## Step 1: Train the Accent Classifier

The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task.

Starting from a pre-trained audio encoder, a simple linear classifier is appended to the last hidden layer to map the audio embeddings to class-label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained; the linear classifier is randomly initialised and is thus always trained. A minimal sketch of this setup follows the example command below.

The script can train on a single accent dataset or a combination of datasets, specified by separating dataset names, configs and splits with the `+` character in the launch command (see the example below).

In the following example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) model and training the linear classifier on a combination of three open-source datasets:

1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc)

The model is then evaluated on the test split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc) to give the final classification accuracy.

```bash
#!/usr/bin/env bash

python run_audio_classification.py \
--model_name_or_path "facebook/mms-lid-126" \
--train_dataset_name "vctk+facebook/voxpopuli+sanchit-gandhi/edacc" \
--train_dataset_config_name "main+en_accented+default" \
--train_split_name "train+test+validation" \
--train_label_column_name "accent+accent+accent" \
--eval_dataset_name "sanchit-gandhi/edacc" \
--eval_dataset_config_name "default" \
--eval_split_name "test" \
--eval_label_column_name "accent" \
--output_dir "./" \
--do_train \
--do_eval \
--overwrite_output_dir \
--remove_unused_columns False \
--fp16 \
--learning_rate 1e-4 \
--max_length_seconds 20 \
--attention_mask False \
--warmup_ratio 0.1 \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--preprocessing_num_workers 16 \
--dataloader_num_workers 4 \
--logging_strategy "steps" \
--logging_steps 10 \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--load_best_model_at_end True \
--metric_for_best_model "accuracy" \
--save_total_limit 3 \
--freeze_base_model \
--push_to_hub \
--trust_remote_code
```

Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy.
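
As promised above, here is a minimal sketch of the frozen-encoder-plus-linear-head setup. It is illustrative only: the pooling strategy and label count are assumptions, and the real logic lives in `run_audio_classification.py`.

```python
# Sketch of the frozen-encoder + linear-classifier setup (illustrative only).
import numpy as np
import torch
from torch import nn
from transformers import AutoFeatureExtractor, AutoModel

encoder = AutoModel.from_pretrained("facebook/mms-lid-126")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-126")

# Equivalent of --freeze_base_model: no gradients flow into the encoder.
for param in encoder.parameters():
    param.requires_grad = False

num_accent_classes = 16  # hypothetical label count after normalisation
classifier = nn.Linear(encoder.config.hidden_size, num_accent_classes)

# One second of dummy audio at 16 kHz standing in for a real example.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, frames, hidden)
logits = classifier(hidden.mean(dim=1))  # mean-pool over frames, then classify
```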

## Step 2: Annotate the Training Set

Annotate the training dataset with continuous measurements of each utterance: signal-to-noise ratio (SNR), reverberation (C50), pitch, and speaking rate. A sketch of two of these annotations follows.
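
As a rough illustration, speaking rate and pitch can be approximated as below. Treat these helpers as stand-ins rather than the project's actual code; in practice dedicated estimators are used (SNR and C50 in particular require specialised models).

```python
# Illustrative annotation helpers; not the pipeline's actual estimators.
import librosa
import numpy as np

def speaking_rate(transcript: str, duration_s: float) -> float:
    """Words per second: a simple proxy for speaking rate."""
    return len(transcript.split()) / duration_s

def mean_pitch(waveform: np.ndarray, sr: int) -> float:
    """Mean F0 over voiced frames, estimated with pYIN."""
    f0, voiced_flag, _ = librosa.pyin(
        waveform,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz
        fmax=librosa.note_to_hz("C7"),  # ~2093 Hz
        sr=sr,
    )
    return float(np.nanmean(f0[voiced_flag])) if voiced_flag.any() else float("nan")
```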

## Step 3: Aggregate Statistics

Aggregate the statistics from Step 2 across the dataset and convert the continuous values to discrete labels, as sketched below.
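
One simple way to discretise is percentile binning; the bin count and label names below are illustrative assumptions, not the project's exact choices.

```python
# Sketch of the continuous-to-discrete step via percentile binning.
import numpy as np

def to_text_bins(values: np.ndarray, labels: list[str]) -> list[str]:
    """Map continuous values to discrete text labels via percentile bins."""
    edges = np.percentile(values, np.linspace(0, 100, len(labels) + 1)[1:-1])
    return [labels[i] for i in np.digitize(values, edges)]

rates = np.array([2.1, 3.4, 5.0, 4.2, 6.3])  # speaking rates (words/s)
print(to_text_bins(rates, ["very slowly", "slowly", "moderately", "fast", "very fast"]))
```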

## Step 4: Create Descriptions

Convert each utterance's sequence of discrete labels into a natural-language description using an LLM (a prompt sketch follows).
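
The exact prompt and model are not specified here, so the following only shows the shape of the idea; the keyword names are assumptions that mirror the kind of description used in the inference snippet further down.

```python
# Hypothetical prompt construction; the project's actual prompt and choice
# of LLM may differ.
labels = {
    "gender": "female",
    "pitch": "slightly low-pitched",
    "speaking_rate": "very fast",
    "reverberation": "very confined sounding",
    "noise": "clear audio quality",
}

prompt = (
    "In one sentence, describe a voice recording with these properties: "
    + ", ".join(f"{key}: {value}" for key, value in labels.items())
)
# Feeding this prompt to an instruction-tuned LLM yields descriptions such as
# "A female speaker with a slightly low-pitched voice speaks very fast..."
```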

## Step 5: Train the Model

Train a MusicGen-style model on the TTS task. This step requires the DAC audio codec (see the `DACConfig` diff below); a toy illustration of the prediction target follows.
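
Concretely, the model learns to predict streams of DAC codebook tokens. Using the `frame_rate` and `codebook_size` defaults from the `DACConfig` diff below, and assuming 9 codebooks (an assumption; that value is not shown here), the target looks like:

```python
# Toy illustration of the MusicGen-style training target: a grid of DAC
# codebook tokens, faked here with random integers to show the shapes.
import torch

num_codebooks = 9            # assumption: not shown in the config diff below
codebook_size = 1024         # from the DACConfig defaults
frame_rate, seconds = 86, 2  # from the DACConfig defaults

codes = torch.randint(0, codebook_size, (1, num_codebooks, frame_rate * seconds))
print(codes.shape)  # torch.Size([1, 9, 172]): (batch, codebooks, frames)
```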

**README.md (updated version):**

# Parler-TTS

[[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](TODO - linked to spaces)

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter Parler-TTS model, trained on 10.5K hours of audio data.

Parler-TTS is a reproduction of the text-to-speech (TTS) model described in the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

Unlike standard TTS models, Parler-TTS lets you directly describe the speaker characteristics with a simple text description, in which you can modulate gender, pitch, speaking style, accent, and more.

## Inference

> [!TIP]
> You can directly try it out in an interactive demo [here](TODO: add link to spaces)!

Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet:

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# TODO: change repo id
model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler_tts_300M_v0.09")
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler_tts_300M_v0.09")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

## Installation steps

Parler-TTS has light-weight dependencies and can be installed in one line:

```sh
pip install parler-tts
```

## Gradio demo

You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:

```sh
pip install gradio
```

Then, run:

```sh
python helpers/gradio_demo/app.py
```

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- the many libraries we build on, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).

## Citation
```bibtex
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ylacombe/dataspeech}}
}
```

The commit also updates `DACConfig` to carry the codec's sampling rate:

```diff
@@ -12,6 +12,7 @@ class DACConfig(PretrainedConfig):
         codebook_size: int = 1024,
         latent_dim: int = 1024,
         frame_rate: int = 86,
+        sampling_rate: int = 44100,
         **kwargs,
     ):
         self.codebook_size = codebook_size
@@ -19,5 +20,6 @@ class DACConfig(PretrainedConfig):
         self.latent_dim = latent_dim
         self.num_codebooks = num_codebooks
         self.frame_rate = frame_rate
+        self.sampling_rate = sampling_rate
         super().__init__(**kwargs)
```
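
As a quick sanity check of the new field, the config can be instantiated with its defaults; the `parler_tts` import path is an assumption based on this repository's package layout.

```python
# Hypothetical usage; assumes DACConfig is exported by the parler_tts package.
from parler_tts import DACConfig

config = DACConfig()
print(config.frame_rate)     # 86
print(config.sampling_rate)  # 44100, newly available to audio I/O code
```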