# Stable Speech

Work-in-progress reproduction of the text-to-speech (TTS) model from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com)
by Dan Lyth and Simon King, from Stability AI and the University of Edinburgh respectively.

Reproducing the TTS model requires completing the following five steps in order:
1. Train the Accent Classifier
2. Annotate the Training Set
3. Aggregate Statistics
4. Create Descriptions
5. Train the Model

## Step 1: Train the Accent Classifier

The script [`run_audio_classification.py`](run_audio_classification.py) can be used to train an audio encoder model from 
the [Transformers library](https://github.com/huggingface/transformers) (e.g. Wav2Vec2, MMS, or Whisper) for the accent
classification task.

Starting with a pre-trained audio encoder model, a simple linear classifier is appended to the final hidden layer to map the 
audio embeddings to class-label predictions. The audio encoder can either be frozen (`--freeze_base_model`) or trained. 
The linear classifier is randomly initialised, and is thus always trained.
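
For intuition, the snippet below shows how such a model can be constructed with the Transformers auto classes. It is a minimal sketch rather than an excerpt from the script, and `num_accent_labels` is a placeholder for the size of your label set:

```python
from transformers import AutoModelForAudioClassification

num_accent_labels = 24  # placeholder: size of the normalised accent label set

model = AutoModelForAudioClassification.from_pretrained(
    "facebook/mms-lid-126",
    num_labels=num_accent_labels,
    ignore_mismatched_sizes=True,  # swap the 126-way LID head for a fresh accent head
)
model.freeze_base_model()  # freeze the audio encoder; only the linear head is trained
```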

The script can be used to train on a single accent dataset, or a combination of datasets, which should be specified by
separating dataset names, configs and splits by the `+` character in the launch command (see below for an example).

The example below follows Stability's approach: audio embeddings are taken from a frozen [MMS-LID](https://huggingface.co/facebook/mms-lid-126) 
model, and the linear classifier is trained on a combination of three open-source datasets:
1. The English Accented (`en_accented`) subset of [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
2. The train split of [VCTK](https://huggingface.co/datasets/vctk) 
3. The dev split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)

The model is subsequently evaluated on the test split of [EdAcc](https://huggingface.co/datasets/edinburghcstr/edacc)
to give the final classification accuracy.

```bash
#!/usr/bin/env bash

python run_audio_classification.py \
    --model_name_or_path "facebook/mms-lid-126" \
    --train_dataset_name "vctk+facebook/voxpopuli+edinburghcstr/edacc" \
    --train_dataset_config_name "main+en_accented+default" \
    --train_split_name "train+test+validation" \
    --train_label_column_name "accent+accent+accent" \
    --eval_dataset_name "edinburghcstr/edacc" \
    --eval_dataset_config_name "default" \
    --eval_split_name "test" \
    --eval_label_column_name "accent" \
    --output_dir "./" \
    --do_train \
    --do_eval \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --fp16 \
    --learning_rate 1e-4 \
    --max_length_seconds 20 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --preprocessing_num_workers 16 \
    --dataloader_num_workers 4 \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --load_best_model_at_end True \
    --metric_for_best_model "accuracy" \
    --save_total_limit 3 \
    --freeze_base_model \
    --push_to_hub \
    --trust_remote_code
```

Tips:
1. **Number of labels:** normalisation should be applied to the target class labels to group linguistically similar accents together (e.g. "Southern Irish" and "Irish" should both be mapped to "Irish"). This helps _balance_ the dataset by removing labels with very few examples. You can modify the function `preprocess_labels` to implement any custom normalisation strategy, as sketched below.
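
A minimal sketch of such a strategy, assuming raw labels arrive as free-text strings (the mapping below is illustrative, not the repo's actual table):

```python
# Hypothetical accent-normalisation table; extend it to cover your datasets' labels.
ACCENT_MAPPING = {
    "southern irish": "irish",
    "northern irish": "irish",
    "irish": "irish",
    "scottish english": "scottish",
    "us american": "american",
}

def preprocess_labels(label: str) -> str:
    """Map a raw accent label to its normalised group (unknown labels pass through)."""
    label = label.strip().lower()
    return ACCENT_MAPPING.get(label, label)
```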

## Step 2: Annotate the Training Set

Annotate each utterance in the training dataset with its signal-to-noise ratio (SNR), speech clarity (C50), pitch, and speaking rate.
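
The script for this step is still to come. As a rough sketch of the kind of per-utterance computation involved (pitch via librosa's pYIN tracker, speaking rate from the transcript, and a crude energy-based SNR proxy; a proper C50 estimate needs a trained room-acoustics model and is omitted here):

```python
import librosa
import numpy as np

def annotate_utterance(audio: np.ndarray, sr: int, transcript: str) -> dict:
    """Rough per-utterance annotations; real pipelines use trained estimators for SNR/C50."""
    # Pitch: median fundamental frequency over voiced frames (pYIN).
    f0, voiced, _ = librosa.pyin(
        audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    pitch_hz = float(np.nanmedian(f0[voiced])) if voiced.any() else 0.0

    # Speaking rate: words per second (phonemes per second is a finer alternative).
    words_per_second = len(transcript.split()) / (len(audio) / sr)

    # Crude SNR proxy: treat the quietest decile of frame energies as the noise floor.
    energy = librosa.feature.rms(y=audio)[0] ** 2
    snr_db = 10 * np.log10(np.percentile(energy, 90) / max(np.percentile(energy, 10), 1e-10))

    return {"pitch": pitch_hz, "speaking_rate": words_per_second, "snr": snr_db}
```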

## Step 3: Aggregate Statistics

Aggregate the annotations from Step 2 across the dataset and convert the continuous values to discrete labels (e.g. by binning).
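
As a sketch of the conversion, assuming percentile-based bins with illustrative names:

```python
import numpy as np

def to_discrete_labels(values, names=("very slow", "slow", "moderate", "fast", "very fast")):
    """Bin continuous values into named labels using dataset-wide percentiles."""
    edges = np.percentile(values, np.linspace(0, 100, len(names) + 1))
    # Interior edges map each value to a bin index in [0, len(names) - 1].
    return [names[i] for i in np.digitize(values, edges[1:-1])]

speaking_rates = [2.1, 3.4, 4.8, 3.9, 5.6, 2.9]  # words/second from Step 2
print(to_discrete_labels(speaking_rates))
```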

## Step 4: Create Descriptions

Convert each utterance's sequence of discrete labels into a natural-language description using an LLM.
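
A sketch of the prompting approach; the prompt wording and choice of LLM are assumptions, not the paper's exact setup:

```python
from transformers import pipeline

# Discrete labels for one utterance, as produced in Step 3.
labels = {"accent": "Irish", "pitch": "high", "speaking_rate": "fast",
          "snr": "clean", "reverberation": "close-sounding"}

prompt = (
    "Write a single flowing sentence describing a voice recording with these "
    "properties, without listing them verbatim: "
    + ", ".join(f"{k}: {v}" for k, v in labels.items())
)

# Any instruction-tuned LLM will do; Mistral-7B-Instruct is one public option.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
description = generator(prompt, max_new_tokens=60, return_full_text=False)[0]["generated_text"]
```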

## Step 5: Train the Model

Train a MusicGen-style model on the TTS task, conditioned on the text descriptions from Step 4.
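
Training code for this step is not yet included. For a feel of the architecture family, the public MusicGen checkpoint can be queried through Transformers; the snippet below only demonstrates the text-conditioned codec-token decoder interface, not the TTS model itself:

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# A text prompt conditions the decoder, which autoregressively generates
# EnCodec tokens that are decoded back to a waveform.
inputs = processor(text=["a female speaker with an Irish accent, speaking quickly"],
                   return_tensors="pt")
waveform = model.generate(**inputs, max_new_tokens=256)
sampling_rate = model.config.audio_encoder.sampling_rate
```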