# Stable Speech

Work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

Reproducing the TTS model requires the following 5 steps to be completed in order:

1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model

## Step 1: Train the Accent Classifier

The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent classification task.

Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the last hidden layer to map the audio embeddings to class label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. The linear classifier is randomly initialised, and is thus always trained.

The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by separating the dataset names, configs and splits with the `+` character in the launch command (see below for an example).

In the following example, we follow Stability's approach by taking audio embeddings from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) model, and training the linear classifier on a combination of three open-source datasets:

1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk)
3. The dev split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc)

The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/sanchit-gandhi/edacc) to give the final classification accuracy.

```bash
#!/usr/bin/env bash
python run_audio_classification.py \
    --model_name_or_path "facebook/mms-lid-126" \
    --train_dataset_name "vctk+facebook/voxpopuli+sanchit-gandhi/edacc" \
    --train_dataset_config_name "main+en_accented+default" \
    --train_split_name "train+test+validation" \
    --train_label_column_name "accent+accent+accent" \
    --eval_dataset_name "sanchit-gandhi/edacc" \
    --eval_dataset_config_name "default" \
    --eval_split_name "test" \
    --eval_label_column_name "accent" \
    --output_dir "./" \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --fp16 \
    --learning_rate 1e-4 \
    --max_length_seconds 20 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --preprocessing_num_workers 16 \
    --dataloader_num_workers 4 \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --load_best_model_at_end True \
    --metric_for_best_model "accuracy" \
    --save_total_limit 3 \
    --freeze_base_model \
    --push_to_hub \
    --trust_remote_code
```

Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy (a sketch of such a mapping is shown below).
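As a rough illustration of the normalisation described above, the snippet below shows the kind of mapping `preprocess_labels` could apply. The label names and groupings are hypothetical examples, not the repository's actual mapping, and should be adapted to the accent labels present in your datasets.

```python
# Illustrative accent normalisation: the groupings below are hypothetical
# examples, not the repository's actual mapping.
ACCENT_MAPPING = {
    "southern irish": "irish",
    "northern irish": "irish",
    "scottish english": "scottish",
    "indian english": "indian",
    "american english": "american",
}


def normalise_accent(label: str) -> str:
    """Lower-case the raw accent label and collapse linguistically similar variants."""
    label = label.strip().lower()
    return ACCENT_MAPPING.get(label, label)
```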
## Step 2: Annotate the Training Set

Annotate the training dataset with information on: SNR, C50, pitch and speaking rate (see the sketch below).

## Step 3: Aggregate Statistics

Aggregate the statistics from Step 2 and convert the continuous values to discrete labels (see the sketch below).

## Step 4: Create Descriptions

Convert the sequence of discrete labels to a text description using an LLM (see the sketch below).

## Step 5: Train the Model

Train a MusicGen-style model on the TTS task.
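As a minimal sketch of Step 2, the snippet below adds speaking-rate and pitch columns to a Hugging Face `datasets` dataset. The dataset name, column names and the choice of `librosa.pyin` as the pitch estimator are assumptions for illustration only; SNR and C50 would typically come from a dedicated estimator (e.g. a Brouhaha-style model), which is omitted here.

```python
import librosa
import numpy as np
from datasets import load_dataset

# Hypothetical dataset with "audio" and "text" columns; swap in the real training set.
dataset = load_dataset("vctk", split="train")


def annotate(example):
    audio = example["audio"]
    y, sr = audio["array"], audio["sampling_rate"]
    duration = len(y) / sr

    # Speaking rate: words per second, a simple proxy for phonemes per second.
    example["speaking_rate"] = len(example["text"].split()) / duration

    # Mean fundamental frequency over voiced frames (pyin is one possible estimator).
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    example["pitch"] = float(np.nanmean(f0)) if np.any(voiced_flag) else 0.0

    # SNR and C50 estimation (e.g. with a Brouhaha-style model) would be added here.
    return example


dataset = dataset.map(annotate)
```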
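For Step 3, one plausible approach (not necessarily the exact binning used in the paper) is to compute dataset-wide percentiles for each continuous attribute and map each value into a small set of named bins. The sketch below continues from the Step 2 sketch above; the bin names are illustrative.

```python
import numpy as np

SPEAKING_RATE_BINS = ["very slowly", "slowly", "moderately", "fast", "very fast"]


def discretise(values, bin_names):
    """Map continuous values to named bins using equally spaced percentile edges."""
    edges = np.percentile(values, np.linspace(0, 100, len(bin_names) + 1))
    labels = []
    for value in values:
        idx = np.searchsorted(edges[1:-1], value, side="right")
        labels.append(bin_names[idx])
    return labels


# "speaking_rate" is the continuous column produced in the Step 2 sketch.
speaking_rates = dataset["speaking_rate"]
dataset = dataset.add_column(
    "speaking_rate_label", discretise(speaking_rates, SPEAKING_RATE_BINS)
)
```

The same `discretise` helper can be reused for pitch, SNR and C50 to give one discrete label per attribute and per utterance.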
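For Step 4, one possible recipe (an assumption, not the repository's confirmed prompt) is to template the discrete labels into a short instruction and ask an instruction-tuned LLM to paraphrase it into a natural-language description, e.g. with the `transformers` text-generation pipeline. The model name and keyword set below are illustrative.

```python
from transformers import pipeline

# Any instruction-tuned chat model could be substituted here.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

# Discrete labels from Steps 1 and 3 for a single utterance (illustrative values).
keywords = {
    "accent": "Irish",
    "speaking_rate": "very fast",
    "pitch": "high-pitched",
    "noise": "slightly noisy",
    "reverberation": "close-sounding",
}

prompt = (
    "Write a one-sentence description of a speaker's voice and recording "
    "conditions using these keywords: "
    + ", ".join(f"{k}: {v}" for k, v in keywords.items())
)

description = generator(prompt, max_new_tokens=60)[0]["generated_text"]
print(description)
```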