Commit c2f3296f authored by Yoach Lacombe's avatar Yoach Lacombe
Browse files

update README

parent 6732a076
ATTENTION: don't forget to add group_by_length in configs.
# Parler-TTS # Parler-TTS
Work in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) [[Paper we reproduce]](https://arxiv.org/abs/2402.01912)
[[Models]](https://huggingface.co/parler-tts)
[[Training Code]](training)
[[Interactive Demo]](TODO - linked to spaces)
> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameters Parler-TTS model, trained on 10.5K hours of audio data.
Parler-TTS is a reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively. by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
Reproducing the TTS model requires the following 5 steps to be completed in order: Contrarily to standard TTS models, Parler-TTS allows you to directly describe the speaker characteristics with a simple text description where you can modulate gender, pitch, speaking style, accent, etc.
1. Train the Accent Classifier
2. Annotate the Training Set ## Inference
3. Aggregate Statistics
4. Create Descriptions > [!TIP]
5. Train the Model > You can directly try it out in an interactive demo [here](TODO: add link to spaces)!
## Step 1: Train the Accent Classifier Using Parler-TTS is as simple as "bonjour". Simply use the following inference snippet.
The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from ```py
the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent from parler_tts import ParlerTTSForConditionalGeneration
classification task. from transformers import AutoTokenizer, AutoFeatureExtractor
import soundfile as sf
Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden-layer to map the
audio embeddings to class label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. # TODO: change repo id
The linear classifier is randomly initialised, and is thus always trained.
model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler_tts_300M_v0.09")
The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler_tts_300M_v0.09")
separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).
prompt = "Hey, how are you doing today?"
In the proceeding example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."
model, and training the linear classifier on a combination of three open-source datasets:
1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) input_ids = tokenizer(description, return_tensors="pt").input_ids
2. The train split of [VCTK](https://huggingface.co/datasets/vctk) prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids
3. The dev split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc) audio_arr = generation.cpu().numpy().squeeze()
to give the final classification accuracy. sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```
```bash
#!/usr/bin/env bash
## Installation steps
python run_audio_classification.py \
--model_name_or_path "facebook/mms-lid-126" \ Parler-TTS has light-weight dependencies and can be installed in one line:
--train_dataset_name "vctk+facebook/voxpopuli+sanchit-gandhi/edacc" \ ```sh
--train_dataset_config_name "main+en_accented+default" \ pip install parler-tts
--train_split_name "train+test+validation" \
--train_label_column_name "accent+accent+accent" \
--eval_dataset_name "sanchit-gandhi/edacc" \
--eval_dataset_config_name "default" \
--eval_split_name "test" \
--eval_label_column_name "accent" \
--output_dir "./" \
--do_train \
--do_eval \
--overwrite_output_dir \
--remove_unused_columns False \
--fp16 \
--learning_rate 1e-4 \
--max_length_seconds 20 \
--attention_mask False \
--warmup_ratio 0.1 \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--preprocessing_num_workers 16 \
--dataloader_num_workers 4 \
--logging_strategy "steps" \
--logging_steps 10 \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--load_best_model_at_end True \
--metric_for_best_model "accuracy" \
--save_total_limit 3 \
--freeze_base_model \
--push_to_hub \
--trust_remote_code
``` ```
Tips: ## Gradio demo
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy.
## Step 2: Annotate the Training Set You can host your own Parler-TTS demo. First, install [`gradio`](https://www.gradio.app/) with:
Annotate the training dataset with information on: SNR, C50, pitch and speaking rate. ```sh
pip install gradio
```
## Step 3: Aggregate Statistics Then, run:
Aggregate statistics from Step 2. Convert continuous values to discrete labels. ```python
python helpers/gradio_demo/app.py
```
## Step 4: Create Descriptions
Convert sequence of discrete labels to text description (using an LLM). ## Acknowledgements
## Step 5: Train the Model This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!
Train MusicGen-style model on the TTS task. Special thanks to:
Needs DAC. - Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- and the many libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [transformers](https://huggingface.co/docs/transformers/index).
## Citation
```
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ylacombe/dataspeech}}
}
```
...@@ -12,6 +12,7 @@ class DACConfig(PretrainedConfig): ...@@ -12,6 +12,7 @@ class DACConfig(PretrainedConfig):
codebook_size: int = 1024, codebook_size: int = 1024,
latent_dim: int = 1024, latent_dim: int = 1024,
frame_rate: int = 86, frame_rate: int = 86,
sampling_rate: int = 44100,
**kwargs, **kwargs,
): ):
self.codebook_size = codebook_size self.codebook_size = codebook_size
...@@ -19,5 +20,6 @@ class DACConfig(PretrainedConfig): ...@@ -19,5 +20,6 @@ class DACConfig(PretrainedConfig):
self.latent_dim = latent_dim self.latent_dim = latent_dim
self.num_codebooks = num_codebooks self.num_codebooks = num_codebooks
self.frame_rate = frame_rate self.frame_rate = frame_rate
self.sampling_rate = sampling_rate
super().__init__(**kwargs) super().__init__(**kwargs)
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment