Work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.
Reproducing the TTS model requires the following 5 steps to be completed in order:
1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model
## Step 1: Train the Accent Classifier
The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from
the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent
classification task.
Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden layer to map the
audio embeddings to class label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained.
The linear classifier is randomly initialised, and is thus always trained.
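A minimal sketch of this setup with Transformers is shown below; the label set is illustrative and the script's own model-loading logic may differ in its details:

```python
from transformers import AutoModelForAudioClassification

# illustrative label set; in practice this comes from the (normalised) accents in the data
accent_labels = ["American", "English", "Irish", "Scottish", "Indian"]

model = AutoModelForAudioClassification.from_pretrained(
    "facebook/mms-lid-126",
    num_labels=len(accent_labels),
    label2id={label: i for i, label in enumerate(accent_labels)},
    id2label={i: label for i, label in enumerate(accent_labels)},
    ignore_mismatched_sizes=True,  # the original LID head is replaced by a freshly initialised classifier
)
model.freeze_base_model()  # as with `--freeze_base_model`: only the classification head is trained
```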
The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by
separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).
In the following example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126)
model, and training the linear classifier on a combination of three open-source datasets:
1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)
The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc).
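A `+`-separated specification roughly expands into one `load_dataset` call per dataset, after which the datasets are combined. A rough sketch of the idea with 🤗 Datasets follows; the config and split names below are placeholders, not the exact values expected by the script:

```python
from datasets import load_dataset

# placeholder configs/splits - check each dataset card for the exact names
dataset_names = "facebook/voxpopuli+vctk+edinburghcstr/edacc".split("+")
dataset_configs = "en_accented+main+default".split("+")
dataset_splits = "test+train+validation".split("+")

accent_datasets = [
    load_dataset(name, config, split=split)
    for name, config, split in zip(dataset_names, dataset_configs, dataset_splits)
]
# run_audio_classification.py then normalises the label columns and concatenates the datasets
```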
**Note on the number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` (reproduced at the end of this README) to implement any custom normalisation strategy.
## Step 2: Annotate the Training Set

Annotate the training dataset with information on: SNR (signal-to-noise ratio), C50 (speech clarity), pitch and speaking rate.
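This annotation step is the job of the [Data-Speech](https://github.com/huggingface/dataspeech) repository. As a minimal illustration of the idea (not the actual pipeline), the speaking rate can be approximated from the transcript length and the audio duration; the `audio` and `text` column names are assumptions:

```python
# `dataset`: a 🤗 Dataset with "audio" and "text" columns (assumed names)
def annotate_speaking_rate(example):
    # duration in seconds from the raw waveform
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    # crude proxy: words per second (Data-Speech uses dedicated estimators,
    # including models for SNR, C50 and pitch)
    example["speaking_rate"] = len(example["text"].split()) / duration
    return example

dataset = dataset.map(annotate_speaking_rate)
```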
## Step 3: Aggregate Statistics

Aggregate the statistics from Step 2 over the training set and convert the continuous values to discrete labels.
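For example, a continuous value such as the speaking rate from Step 2 can be binned against dataset-level percentiles; the bin edges and label names below are illustrative, not those of the paper:

```python
import numpy as np

rates = np.array(dataset["speaking_rate"])
bin_edges = np.percentile(rates, [10, 30, 70, 90])
bin_labels = ["very slowly", "slowly", "at a moderate pace", "fast", "very fast"]

def discretise_speaking_rate(example):
    example["speaking_rate_label"] = bin_labels[int(np.digitize(example["speaking_rate"], bin_edges))]
    return example

dataset = dataset.map(discretise_speaking_rate)
```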
## Step 4: Create Descriptions

Convert each sequence of discrete labels into a natural-language text description using an LLM.
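As an illustration, an instruction-tuned LLM can be prompted with the discrete labels to produce a one-sentence description. The checkpoint, prompt and column names below are assumptions, not the recipe used in the paper:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def create_description(example):
    # `gender`, `pitch_label` and `snr_label` are assumed to have been produced in Step 3
    prompt = (
        "Write one sentence describing a speaker with the following characteristics: "
        f"gender: {example['gender']}, pitch: {example['pitch_label']}, "
        f"speaking rate: {example['speaking_rate_label']}, audio quality: {example['snr_label']}."
    )
    example["description"] = generator(prompt, max_new_tokens=60, return_full_text=False)[0]["generated_text"]
    return example

dataset = dataset.map(create_description)
```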
## Step 5: Train the Model

Train the text-to-speech model on the annotated, description-augmented dataset. The Parler-TTS codebase described below provides the training and inference code for this step.

## Parler-TTS

Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.). It is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

In contrast to other TTS models, Parler-TTS is a **fully open-source** release. All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.

This repository contains the inference and training code for Parler-TTS. It is designed to accompany the [Data-Speech](https://github.com/huggingface/dataspeech) repository for dataset annotation.

> [!IMPORTANT]
> We're proud to release Parler-TTS v0.1, our first 300M-parameter model, trained on 10.5K hours of audio data.
> In the coming weeks, we'll be working on scaling up to 50K hours of data, in preparation for the v1 model.

The voice is controlled by a free-text description, for example:

> A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast.

The [training folder](/training/) contains all the information to train or fine-tune your own Parler-TTS model. It consists of:
- [1. An introduction to the Parler-TTS architecture](/training/README.md#1-architecture)
- [2. The first steps to get started](/training/README.md#2-getting-started)
- [3. A training guide](/training/README.md#3-training)

> [!IMPORTANT]
> **TL;DR:** After having followed the [installation steps](/training/README.md#requirements), you can reproduce the Parler-TTS v0.1 training recipe with the command line given in the [training guide](/training/README.md#3-training).

## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- the many libraries used, namely [🤗 datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [🤗 accelerate](https://huggingface.co/docs/accelerate/en/index), [jiwer](https://github.com/jitsi/jiwer), [wandb](https://wandb.ai/), and [🤗 transformers](https://huggingface.co/docs/transformers/index).
- Descript for the [DAC codec model](https://github.com/descriptinc/descript-audio-codec)
- Hugging Face 🤗 for providing compute resources and time to explore!

## Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

@misc{lyth2024natural,
  title = {Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author = {Dan Lyth and Simon King},
  year = {2024},
  eprint = {2402.01912},
  archivePrefix = {arXiv},
  primaryClass = {cs.SD}
}
```
## Usage Examples

The snippet below shows the default text, title and example prompt/description pairs used by the interactive demo:

```python
default_text = "Please surprise me and speak in whatever voice you enjoy."
title = "# Parler-TTS"

examples = [
    [
        "'This is the best time of my life, Bartley,' she said happily.",
        "A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.",
    ],
    [
        "Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom.",
        "A male speaker with a slightly high-pitched voice delivering his words at a slightly slow pace in a small, confined space with a touch of background noise and a quite monotone tone.",
    ],
    [
        "montrose also after having experienced still more variety of good and bad fortune threw down his arms and retired out of the kingdom",
        "A male speaker with a low-pitched voice delivering his words at a fast pace in a small, confined space with a lot of background noise and an animated tone.",
    ],
]
```
"Mid-atlantic united states english,philadelphia, pennsylvania, united states english,united states english,philadelphia style united states english":"American",
"Mid-atlantic,england english,united states english":"American",
"Midatlantic,england english":"American",
"Midwestern states (michigan),united states english":"American",
"Mild northern england english":"English",
"Minor french accent":"French",
"Mix of american and british ,native polish":"Polish",
"Mix of american and british accent":"Unknown",# Combination not clearly mapped
"Mostly american with some british and australian inflections":"Unknown",# Combination not clearly mapped
"My accent is influenced by the phones of all letters within a sentence.,southern african (south africa, zimbabwe, namibia)":"South african",
"New zealand english":"New Zealand English",
"Nigeria english":"Nigerian",# Note: added new
"Non native speaker from france":"French",
"Non-native":"Unknown",# Too vague
"Non-native,german accent":"German",
"North european english":"Unknown",# Too broad
"Norwegian":"Norwegian",# Note: added new
"Ontario,canadian english":"Canadian",# Note: added new
"Polish english":"Polish",
"Rhode island new england accent":"American",
"Singaporean english":"Singaporean",# Note: added new
"Slavic":"Eastern european",
"Slighty southern affected by decades in the midwest, 4 years in spain and germany, speak some german, spanish, polish. have lived in nine states.":"Unknown",# Complex blend
"South african":"South african",
"South atlantic (falkland islands, saint helena)":"Unknown",# Specific regions not listed
"South australia":"Australian",
"South indian":"Indian",
"Southern drawl":"American",
"Southern texas accent,united states english":"American",
"Southern united states,united states english":"American",
"Spanish bilingual":"Spanish",
"Spanish,foreign,non-native":"Spanish",
"Strong latvian accent":"Latvian",
"Swedish accent":"Swedish",# Note: added new
"Transnational englishes blend":"Unknown",# Too vague
"U.k. english":"English",
"Very slight russian accent,standard american english,boston influence":"American",
"Welsh english":"Welsh",
"West african":"Unknown",# No specific West African category
"West indian":"Unknown",# Caribbean, but no specific match
"Western europe":"Unknown",# Too broad
"With heavy cantonese accent":"Chinese",
}
defpreprocess_labels(label:str)->str:
"""Apply pre-processing formatting to the accent labels"""
if"_"inlabel:
# voxpopuli stylises the accent as a language code (e.g. en_pl for "polish") - convert to full accent
language_code=label.split("_")[-1]
label=LANGUAGES[language_code]
# VCTK labels for two words are concatenated into one (NewZeleand-> New Zealand)
# convert Whisper language code (polish) to capitalised (Polish)
label=label.capitalize()
iflabelinACCENT_MAPPING:
label=ACCENT_MAPPING[label]
returnlabel
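For example, with the excerpt above:

```python
preprocess_labels("en_pl")           # Voxpopuli code: "pl" -> "polish" -> "Polish"
preprocess_labels("POLISH ENGLISH")  # free-text label: -> "Polish english" -> "Polish" via ACCENT_MAPPING
```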
```python
@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.

    Using `HfArgumentParser` we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    """

    train_dataset_name: str = field(
        default=None,
        metadata={
            "help": "The name of the training dataset to use (via the datasets library). Load and combine "
            "multiple datasets by separating dataset ids by a '+' symbol. For example, to load and combine "
            "librispeech and common voice, set `train_dataset_name='librispeech_asr+common_voice'`."
        },
    )
    train_dataset_config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The configuration name of the training dataset to use (via the datasets library). Load and combine "
            "multiple datasets by separating dataset configs by a '+' symbol."
        },
    )
    train_split_name: str = field(
        default="train",
        metadata={
            "help": ("The name of the training data set split to use (via the datasets library). Defaults to 'train'")
        },
    )
    train_dataset_samples: str = field(
        default=None,
        metadata={
            "help": "Number of samples in the training data. Load and combine "
            "multiple datasets by separating dataset samples by a '+' symbol."
        },
    )
    eval_dataset_name: str = field(
        default=None,
        metadata={
            "help": "The name of the evaluation dataset to use (via the datasets library). Defaults to the training dataset name if unspecified."
        },
    )
    eval_dataset_config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The configuration name of the evaluation dataset to use (via the datasets library). Defaults to the training dataset config name if unspecified"
        },
    )
    eval_split_name: str = field(
        default="validation",
        metadata={
            "help": (
                "The name of the evaluation data set split to use (via the datasets"
                " library). Defaults to 'validation'"
            )
        },
    )
    audio_column_name: str = field(
        default="audio",
        metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
    )
    train_label_column_name: str = field(
        default="labels",
        metadata={
            "help": "The name of the dataset column containing the labels in the train set. Defaults to 'label'"
        },
    )
    eval_label_column_name: str = field(
        default="labels",
        metadata={"help": "The name of the dataset column containing the labels in the eval set. Defaults to 'label'"},
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    max_length_seconds: Optional[float] = field(
        default=20,
        metadata={"help": "Audio samples will be randomly cut to this length during training if the value is set."},
    )
    min_length_seconds: Optional[float] = field(
        default=5,
        metadata={"help": "Audio samples less than this value will be filtered during training if the value is set."},
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    filter_threshold: Optional[float] = field(
        default=1.0,
        metadata={"help": "Filter labels that occur less than `filter_threshold` percent in the training/eval data."},
    )


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        default="facebook/wav2vec2-base",
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"},
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from the Hub"}
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    feature_extractor_name: Optional[str] = field(
        default=None, metadata={"help": "Name or path of preprocessor config."}
    )
    freeze_feature_encoder: bool = field(
        default=False,
        metadata={
            "help": "Whether to freeze the feature encoder layers of the model. Only relevant for Wav2Vec2-style models."
        },
    )
    freeze_base_model: bool = field(
        default=True, metadata={"help": "Whether to freeze the base encoder of the model."}
    )
    attention_mask: bool = field(
        default=True, metadata={"help": "Whether to generate an attention mask in the feature extractor."}
    )
    token: str = field(
        default=None,
        metadata={
            "help": (
                "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
                "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
            )
        },
    )
    trust_remote_code: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option "
                "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
                "execute code present on the Hub on your local machine."
            )
        },
    )
    ignore_mismatched_sizes: bool = field(
        default=True,
        metadata={"help": "Will enable to load a pretrained model whose head dimensions are different."},
    )
    attention_dropout: float = field(
        default=0.0, metadata={"help": "The dropout ratio for the attention probabilities."}
    )
    activation_dropout: float = field(
        default=0.0, metadata={"help": "The dropout ratio for activations inside the fully connected layer."}
    )
    feat_proj_dropout: float = field(default=0.0, metadata={"help": "The dropout ratio for the projected features."})
    hidden_dropout: float = field(
        default=0.0,
        metadata={
            "help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler."
        },
    )
    final_dropout: float = field(
        default=0.0,
        metadata={"help": "The dropout probability for the final projection layer."},
    )
    mask_time_prob: float = field(
        default=0.05,
        metadata={
            "help": (
                "Probability of each feature vector along the time axis to be chosen as the start of the vector "
                "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature "
                "vectors will be masked along the time axis."
            )
        },
    )
    mask_time_length: int = field(
        default=10,
        metadata={"help": "Length of vector span to mask along the time axis."},
    )
    mask_feature_prob: float = field(
        default=0.0,
        metadata={
            "help": (
                "Probability of each feature vector along the feature axis to be chosen as the start of the vector span"
                " to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature"
                " bins will be masked along the time axis."
            )
        },
    )
    mask_feature_length: int = field(
        default=10,
        metadata={"help": "Length of vector span to mask along the feature axis."},
    )
```