# Text-Free Prosody-Aware Generative Spoken Language Modeling
This folder contains code and recipes to reproduce the results reported in the paper _Text-Free Prosody-Aware Generative Spoken Language Modeling_,
Eugene Kharitonov*, Ann Lee*, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu, 2021, arXiv:2109.03264 [[arxiv]](https://arxiv.org/abs/2109.03264).
`*` denotes equal contribution.
You can find demo samples [[here]](https://speechbot.github.io/pgslm/index.html).
If you find this code useful, please consider citing our work using the following BibTeX entry:
```
@misc{Kharitonov2021,
title={Text-Free Prosody-Aware Generative Spoken Language Modeling},
author={Eugene Kharitonov and Ann Lee and Adam Polyak and Yossi Adi and Jade Copet and Kushal Lakhotia and Tu-Anh Nguyen and Morgane Rivière and Abdelrahman Mohamed and Emmanuel Dupoux and Wei-Ning Hsu},
year={2021},
eprint={2109.03264},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Additional requirements
In addition to fairseq, the following packages are required; they are installable with pip:
```bash
pip install AMFM-decompy SoundFile scipy scikit-learn torchaudio npy-append-array
```
## Data preprocessing
### Prepare unit pseudo-text transcriptions of the audio
To get unit transcripts of the speech data, we rely on the preprocessing steps of the [GSLM](https://github.com/pytorch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit/) work.
First, we need to prepare manifest files for the dataset we want to preprocess:
```
mkdir manifests/
python examples/wav2vec/wav2vec_manifest.py --valid-percent=0.0 $DATA_PATH --dest=manifests/train/
```
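For reference, the generated manifest is a TSV file whose first line is the dataset root directory, followed by one `<relative wav path>	<number of samples>` row per audio file. A quick sanity check; the root path and rows shown in the comments are made-up examples, not real output:
```bash
# Illustrative only: the root directory and the two rows below are hypothetical.
head -n 3 manifests/train/train.tsv
# /path/to/LibriSpeech/train-clean-100
# 103/1240/103-1240-0000.wav	225360
# 103/1240/103-1240-0001.wav	255120
```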
Next, we need a pre-trained HuBERT-base-ls960 model [[download]](https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt) and a corresponding kmeans-100 quantizer [[download]](https://dl.fbaipublicfiles.com/textless_nlp/gslm/hubert/km100/km.bin). Having those we can quantize the dataset:
```
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
--feature_type hubert \
--kmeans_model_path km.bin \
--acoustic_model_path hubert_base_ls960.pt \
--layer 6 \
--manifest_path manifests/train/train.tsv \
--out_quantized_file_path manifests/train/units
```
Finally, run
```
python examples/textless_nlp/pgslm/scripts/join_units_manifest.py --manifest=manifests/train/train.tsv --units=manifests/train/units --output=train.txt
```
This produces the training data description `train.txt` in the format that pGSLM expects. The above steps have to be repeated for the
dev/test sets (see the sketch below). Importantly, we rely on the assumption that the directories are structured as in LibriSpeech, i.e. the file paths follow the
`<speaker_id>/<chapter_id>/<file_name>.wav` format.
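A minimal sketch of repeating these steps for the dev/test splits. It assumes the split manifests were created with `wav2vec_manifest.py` into `manifests/valid/` and `manifests/test/` (note that `wav2vec_manifest.py` names its output `train.tsv` regardless of the split):
```bash
for SPLIT in valid test; do
  # quantize the split with the same HuBERT + kmeans models as above
  python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
    --feature_type hubert \
    --kmeans_model_path km.bin \
    --acoustic_model_path hubert_base_ls960.pt \
    --layer 6 \
    --manifest_path manifests/$SPLIT/train.tsv \
    --out_quantized_file_path manifests/$SPLIT/units
  # join units and manifest into the pGSLM data format
  python examples/textless_nlp/pgslm/scripts/join_units_manifest.py \
    --manifest=manifests/$SPLIT/train.tsv \
    --units=manifests/$SPLIT/units \
    --output=$SPLIT.txt
done
```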
### Preprocess data for pGSLM
The very first step is to obtain the F0 quantization bins.
Assume the vocoder training manifest is `vocoder_train.txt` (in pGSLM data format prepared with the same process above).
We prepare the quantized F0 from the vocoder training data by running
```sh
bash examples/textless_nlp/pgslm/scripts/prepare_f0_quantization.sh \
  vocoder_train.txt <sample_rate> 32 <out_dir> <prefix>  # we use 32 bins in the paper
```
- `<sample_rate>`: sampling rate of the audio files in the manifest
- `<out_dir>`: where to write the output files
- `<prefix>`: prefix of the output files
The script will generate
- `<prefix>.f0_stat.pt`: the speaker-level F0 statistics, which can be used in vocoder training
- `<prefix>_mean_norm_log_f0_bin.th`: the quantized F0 bins, which should be used in `prepare_data.sh` below
**Note:** See "Pre-trained models" for the pre-computed speaker-level F0 statistics and quantized F0 bins. We suggest using the pre-computed statistics for the data preparation below in order to take advantage of the pre-trained vocoder for waveform generation.
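Both outputs are regular torch files, so they can be inspected directly. An optional sanity check, assuming the output prefix was set to `vocoder_train` (a hypothetical name):
```bash
# Print the speaker-level F0 statistics and the 32 quantization boundaries
# (file names assume the hypothetical prefix "vocoder_train").
python -c "import torch; print(torch.load('vocoder_train.f0_stat.pt'))"
python -c "import torch; print(torch.load('vocoder_train_mean_norm_log_f0_bin.th'))"
```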
Next, prepare the pGSLM data.
Assume train/valid/test manifests are `{train,valid,test}.txt`.
Here is an example of how to preprocess data:
```sh
bash examples/textless_nlp/pgslm/scripts/prepare_data.sh \
  train.txt valid.txt test.txt \
  <n_unit> <downsample_rate> <sample_rate> <out_dir> \
  <f0_quantization_out_dir>/<prefix>_mean_norm_log_f0_bin.th
```
- `<n_unit>`: discrete unit vocabulary size (we used a kmeans quantizer with 100 units in the example above)
- `<downsample_rate>`: downsampling rate relative to the waveform (e.g., 320 for HuBERT units)
- `<sample_rate>`: sampling rate of the audio files in the manifest
- `<out_dir>`: where to output the preprocessed files
This will create the dataset json config used for the next section at
`<out_dir>/data_config.json`.
Note that the example script uses only one thread to compute F0, which can take
_a very long time_ when preprocessing large datasets. We suggest distributing the
jobs over multiple nodes/processes with `--nshards=x` and `--rank=z` (where z is
in [1, x]) in `preprocess_f0.py`, and setting `--nshards_list=x` in
`prepare_data.py` correspondingly to collect the sharded F0 data; a minimal sketch follows.
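A hypothetical sketch of the sharded setup, where `PREPROCESS_F0` stands for the exact `preprocess_f0.py` invocation used inside `prepare_data.sh` (copy it verbatim); only the sharding flags are added here:
```bash
NSHARDS=8
# Placeholder command: replace with the real preprocess_f0.py invocation from prepare_data.sh.
PREPROCESS_F0="echo python preprocess_f0.py"
for RANK in $(seq 1 "$NSHARDS"); do
  $PREPROCESS_F0 --nshards="$NSHARDS" --rank="$RANK" &  # one process per shard
done
wait
# Then run prepare_data.py with --nshards_list=$NSHARDS to collect the sharded F0 data.
```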
Now, everything is ready for training a model.
## Training Multi-Stream Transformer Unit Language Model (MS-TLM)
Below is an example command that trains a Multi-Stream Transformer Language Model (MS-TLM) on a prepared dataset:
```bash
DATASET=data_config.json
fairseq-train $DATASET \
--task=speech_unit_modeling \
--arch="transformer_ulm_tiny" \
--criterion=speech_unit_lm_criterion \
--share-decoder-input-output-embed \
--dropout=0.1 \
--attention-dropout=0.1 \
--optimizer="adam" \
--adam-betas="(0.9, 0.98)" \
--clip-norm=1.0 \
--lr=0.0005 \
--lr-scheduler="inverse_sqrt" \
--warmup-updates=4000 \
--warmup-init-lr=1e-07 \
--tokens-per-sample=3072 \
--max-tokens=3072 \
--update-freq=4 \
--max-epoch=70 \
--num-workers=0 \
--skip-invalid-size-inputs-valid-test \
--loss-weights="1.0;0.5;0.0" \
--ignore-f0-input \
--checkpoint-activations \
--fp16 \
--max-target-positions=4096 \
--stream-shifts="1,1" \
--log-f0 --normalize-f0-mean --interpolate-f0 \
--ignore-unused-valid-subsets \
--discrete-duration --discrete-f0
```
Some of the important parameters that are specific to MS-TLM:
* `arch`: specifies the Transformer architecture used. Supported options are:
* `transformer_ulm_tiny` - a tiny model that can be used for debugging; it has 2 layers, 1 attention head, FFN and embedding dimensions of 64,
* `transformer_ulm` - a base model with 6 layers, 8 heads, embedding dimension 512, and FFN dimensionality of 2048,
* `transformer_ulm_big` - the largest model we experiment with in the paper: 12-layer/16 heads, 1024/4096 embedding and FFN dimensions;
* `loss-weights`: this parameter sets importance weights (must be non-negative) for the components of the loss that correspond to the unit, duration, and F0 streams. To turn off a component of the loss, its weight has to be set to 0. For instance, to predict only the unit stream, the parameter should be set to "1;0;0";
* `stream-shifts`: specifies relative shifts of the two prosodic streams w.r.t. the unit stream (duration and F0, respectively). No shift corresponds to "0,0";
* `ignore-duration-input`/`ignore-f0-input`: setting these flags zeroes out the corresponding input streams;
* `max-token-duration`: duration values are capped at the specified maximum;
* `discrete-duration`/`discrete-f0`: whether duration and F0 streams should be quantized;
* `log-f0`, `normalize-f0-mean`, `normalize-f0-std`, `interpolate-f0`: configure how the F0 stream is treated. `log-f0` enables modeling F0 in the log domain, `normalize-f0-mean`/`normalize-f0-std` control per-speaker normalization, and `interpolate-f0` enables F0 interpolation for unvoiced regions where F0 was set to 0,
* `mask-dur-prob`, `mask-f0-prob`, `mask-dur-seg-prob`, `mask-f0-seg-prob`, `mask-unit-seg-prob`, `mask-unit-seg-leng`: this family of parameters sets the probabilities of masking individual steps and spans on each stream, as well as the lengths of the masked spans.
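To illustrate how these options combine, here is a hypothetical variant of the command above that trains the base-size `transformer_ulm` with continuous (non-quantized) prosody streams and both prosody streams present at the input. Only flags described above are used, and the values are illustrative rather than the tuned settings from the paper:
```bash
# Continuous-prosody variant (illustrative): no --discrete-duration/--discrete-f0,
# and no --ignore-f0-input, so the model both reads and predicts continuous prosody.
fairseq-train $DATASET \
  --task=speech_unit_modeling \
  --arch="transformer_ulm" \
  --criterion=speech_unit_lm_criterion \
  --share-decoder-input-output-embed \
  --dropout=0.1 --attention-dropout=0.1 \
  --optimizer="adam" --adam-betas="(0.9, 0.98)" --clip-norm=1.0 \
  --lr=0.0005 --lr-scheduler="inverse_sqrt" \
  --warmup-updates=4000 --warmup-init-lr=1e-07 \
  --tokens-per-sample=3072 --max-tokens=3072 --update-freq=4 \
  --max-epoch=70 --num-workers=0 \
  --skip-invalid-size-inputs-valid-test \
  --loss-weights="1.0;0.5;0.5" \
  --checkpoint-activations --fp16 \
  --max-target-positions=4096 \
  --stream-shifts="1,1" \
  --log-f0 --normalize-f0-mean --normalize-f0-std --interpolate-f0 \
  --ignore-unused-valid-subsets
```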
## Pre-trained models
### MS-TLM
Below you can find checkpoints for the four best-performing models from the paper (IDs 9..12 in Table 1). These models are trained on HuBERT-100 transcripts of the LibriLight-6K dataset. They have the prosody streams shifted by 1 w.r.t. the unit stream. All models predict all three streams (units, duration, and F0), but two
of them only have the unit stream in their input.
| | Continuous prosody | Quantized prosody |
|-------------------|--------------------|-------------------|
| No prosody input | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/ulm_checkpoints/continuous_no_prosody_shift_1_1.pt) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/ulm_checkpoints/discrete_no_prosody_shift_1_1.pt) |
| Has prosody input | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/ulm_checkpoints/continuous_prosody_shift_1_1.pt) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/ulm_checkpoints/discrete_prosody_shift_1_1.pt)|
The optimal per-stream sampling temperatures/scaling parameters that we identified for these models, in the (`T-token`, `T-duration`, `T-f0`) format:
| | Continuous prosody | Quantized prosody |
|-------------------|--------------------|-------------------|
| No prosody input | 0.7, 0.125, 0.0003125| 0.7, 0.25, 0.5 |
| Has prosody input | 0.7, 0.125, 0.00125 | 0.7, 0.25, 0.7 |
### Vocoder
| Units | Prosody | F0 stats | Checkpoint | Config |
|-------------------|---------|--------------|------------|--------|
| HuBERT-base-ls960, kmeans-100 | [[Quantized 32 bins]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/mean_norm_log_f0_seg_bin.th) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/f0_stats.pt) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/naive_quant_32_norm_log_seg_hubert/checkpoint.pt) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/naive_quant_32_norm_log_seg_hubert/config.json) |
| HuBERT-base-ls960, kmeans-100 | Continuous | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/f0_stats.pt) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/mean_norm_log_f0_hubert/checkpoint.pt) | [[download]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/mean_norm_log_f0_hubert/config.json) |
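The evaluation, sampling, and vocoding commands below refer to several of these files by plain local names. A download sketch; the local names chosen for the vocoder checkpoint and config are arbitrary:
```bash
# MS-TLM checkpoint (quantized prosody, prosody input) and the F0 quantization bins
wget https://dl.fbaipublicfiles.com/textless_nlp/pgslm/ulm_checkpoints/discrete_prosody_shift_1_1.pt
wget https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/mean_norm_log_f0_seg_bin.th
# HuBERT-100 vocoder (quantized 32-bin prosody) with its config and F0 statistics
wget https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/f0_stats.pt
wget https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/naive_quant_32_norm_log_seg_hubert/checkpoint.pt -O vocoder_checkpoint.pt
wget https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/naive_quant_32_norm_log_seg_hubert/config.json -O vocoder_config.json
```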
## Evaluating a trained model
Evaluation is done with the `eval/cont_metrics.py` script. As described in the paper, several metrics are used.
**Teacher-forced metrics**
```bash
SET=valid
CHECKPOINT_PATH=discrete_prosody_shift_1_1.pt
DATA=data_config.json
python examples/textless_nlp/pgslm/eval/cont_metrics.py $DATA \
--metric=teacher_force_everything \
--path=$CHECKPOINT_PATH \
--batch-size=16 \
--fp16 \
--seed=111 \
--eval-subset=$SET \
--f0-discretization-bounds=mean_norm_log_f0_seg_bin.th --dequantize-prosody
```
(Using this command, our provided `discrete_prosody_shift_1_1.pt` checkpoint should produce `{'token_loss': 1.408..., 'duration_loss': 0.5424..., 'f0_loss': 0.0474...}` on LibriSpeech dev-clean).
The parameters `--f0-discretization-bounds=mean_norm_log_f0_seg_bin.th --dequantize-prosody` are specific to quantized-prosody models. They signal that the prosody streams must be decoded back into the continuous domain before the metrics are calculated. This is the same `*_mean_norm_log_f0_bin.th` file as we prepared before.
The `mean_norm_log_f0_seg_bin.th` file we used with the pre-trained models can be downloaded [[here]](https://dl.fbaipublicfiles.com/textless_nlp/pgslm/vocoder/blizzard2013/mean_norm_log_f0_seg_bin.th).
**Consistency (aka Correlation) metrics**
The following command estimates the correlation between mean values of the F0 stream in the prompt and in the generated continuation (the unit and duration streams are fixed).
```bash
T_F0=0.7
EXPLOSION=20
SET=test
CHECKPOINT_PATH=discrete_prosody_shift_1_1.pt
DATA=data_config.json
python examples/textless_nlp/pgslm/eval/cont_metrics.py $DATA \
--prefix-length=150 \
--metric=correlation \
--path=$CHECKPOINT_PATH \
--batch-size=16 \
--fp16 \
--seed=111 \
--teacher-force-tokens \
--teacher-force-duration \
--min-length=300 \
--batch-explosion-rate=$EXPLOSION \
--T-f0=$T_F0 \
--eval-subset=$SET \
--f0-discretization-bounds=mean_norm_log_f0_seg_bin.th \
--dequantize-prosody --n-workers=8
```
(Using this command, our provided `discrete_prosody_shift_1_1.pt` checkpoint should produce `{...'F0 corr': 0.315 ..}` on LibriSpeech test-clean).
* By using the flags `--teacher-force-tokens`, `--teacher-force-duration`, and `--teacher-force-f0`, one can calculate correlations along each stream while keeping the other two streams fixed to ground-truth values (or freeze all three streams to get ground-truth correlation values; see the example after this list);
* The parameters `T-f0`, `T-duration`, and `T-token` specify per-stream temperatures and, in the case of continuous-valued prosody, the scaling parameter of the corresponding Laplace distribution (setting a temperature to 0 enforces greedy sampling);
* `min-length` filters out sequences that are shorter than 300 duration units (i.e. 6s in the case of HuBERT units);
* `prefix-length` specifies that the first 150 duration units are used as the prompt (i.e. 3s in the case of HuBERT units).
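For example, as noted above, freezing all three streams yields the ground-truth correlation values. A sketch reusing the variables and flags from the command above, with `--teacher-force-f0` added:
```bash
# Ground-truth F0 correlation: all three streams are teacher-forced.
python examples/textless_nlp/pgslm/eval/cont_metrics.py $DATA \
  --prefix-length=150 \
  --metric=correlation \
  --path=$CHECKPOINT_PATH \
  --batch-size=16 \
  --fp16 \
  --seed=111 \
  --teacher-force-tokens \
  --teacher-force-duration \
  --teacher-force-f0 \
  --min-length=300 \
  --batch-explosion-rate=$EXPLOSION \
  --T-f0=$T_F0 \
  --eval-subset=$SET \
  --f0-discretization-bounds=mean_norm_log_f0_seg_bin.th \
  --dequantize-prosody --n-workers=8
```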
**Correctness (aka Continuation) and Expressiveness (aka Std) metrics**
By running the following command, we can get minMAE and Std for the log-F0 stream for the model with quantized prosody.
```bash
DATA=data_config.json
EXPLOSION=20
SET=test
CHECKPOINT_PATH=discrete_prosody_shift_1_1.pt
T_F0=0.7
python examples/textless_nlp/pgslm/eval/cont_metrics.py $DATA \
--prefix-length=150 \
--metric=continuation \
--path=$CHECKPOINT_PATH \
--batch-size=16 \
--fp16 \
--seed=111 \
--batch-explosion-rate=$EXPLOSION \
--teacher-force-tokens \
--teacher-force-duration \
--T-f0=$T_F0 \
--eval-subset=$SET \
--f0-discretization-bounds=mean_norm_log_f0_seg_bin.th --dequantize-prosody
```
(Using this command, our provided `discrete_prosody_shift_1_1.pt` checkpoint should produce `{...'F0 MAE': 0.0772, 'F0 Std': 0.1489...}` on LibriSpeech test-clean).
Again, the `--teacher-force-tokens`, `--teacher-force-duration`, and `--teacher-force-f0` flags control what is evaluated: with `--teacher-force-duration` and `--teacher-force-f0` on, we get Token BLEU for the token stream, and, analogously, per-stream min MAE for each prosody stream individually (see the sketch below).
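A sketch of the Token BLEU configuration, reusing the variables from the command above and the token temperature from the table in "Pre-trained models" (only the teacher-forcing flags and `--T-token` differ):
```bash
# Token BLEU: both prosody streams are teacher-forced, only units are sampled.
python examples/textless_nlp/pgslm/eval/cont_metrics.py $DATA \
  --prefix-length=150 \
  --metric=continuation \
  --path=$CHECKPOINT_PATH \
  --batch-size=16 \
  --fp16 \
  --seed=111 \
  --batch-explosion-rate=$EXPLOSION \
  --teacher-force-duration \
  --teacher-force-f0 \
  --T-token=0.7 \
  --eval-subset=$SET \
  --f0-discretization-bounds=mean_norm_log_f0_seg_bin.th --dequantize-prosody
```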
Finally, `cont_metrics.py` allows specifying the number of workers (e.g., `--n-workers=8`), which speeds up the computation by spreading multiple worker processes over the available GPUs.
**Cont Word BLEU**
We used the code and the evaluation protocol of [(Lakhotia et al., 2021)](https://arxiv.org/abs/2102.01192).
## Sampling from a trained model
To get samples (prompted or not) from a trained model, it is enough to run `sample.py`:
```bash
CHECKPOINT_PATH=checkpoints/checkpoint_best.pt
DATASET=examples/textless_nlp/pgslm/repro/dataset/data_config.json
python examples/textless_nlp/pgslm/sample/sample.py $DATASET \
--output=$SAMPLES \
--path=$CHECKPOINT_PATH \
--sampling \
--T-token=0.7 \
--T-duration=0.25 \
--T-f0=0.7 \
--max-length=500 \
--prefix-length=150 \
--subset=valid \
--seed=1 \
--match-duration \
--code-type=hubert \
--batch-explosion-rate=2
```
Some useful parameters:
* `T-token`, `T-duration`, `T-f0` specify sampling temperatures for the three streams. Setting a temperature to `0` switches sampling to greedy (argmax) decoding;
* `prefix-length`: length of the prompt, measured in timesteps (e.g. for Hubert (CPC) each timestep is 20 (10) ms);
* `subset`: which subset of the dataset to use as prompts (can be `train`, `valid`, `test`);
* `teacher-force-tokens`, `teacher-force-duration`, `teacher-force-f0`: if set, at each autoregressive step, ground-truth values replace the produced ones;
* `short-curcuit`: replace sampling by ground-truth inputs;
* `match-duration`: forces the produced sample to have the same duration (in time) as the entire sequence (beyond the prompt if there is any);
* `batch-explosion-rate`: number of samples per prompt;
* `f0-discretization-bounds`: path to a file with quantization boundaries. If it is set, F0 values are de-quantized back to the continuous domain (the model must be a quantized one);
* `max-length` sets the maximal number of segment steps to be produced.
Note that `sample.py` automatically uses all available GPUs; to avoid that, set the `CUDA_VISIBLE_DEVICES` environment variable.
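For instance, a hypothetical run that samples from the quantized-prosody checkpoint on a single GPU and de-quantizes F0 back into the continuous domain (all flags are described above; the bounds file is the one from "Pre-trained models"):
```bash
CUDA_VISIBLE_DEVICES=0 python examples/textless_nlp/pgslm/sample/sample.py $DATASET \
  --output=$SAMPLES \
  --path=discrete_prosody_shift_1_1.pt \
  --sampling \
  --T-token=0.7 --T-duration=0.25 --T-f0=0.7 \
  --max-length=500 --prefix-length=150 \
  --subset=valid --seed=1 \
  --match-duration --code-type=hubert \
  --batch-explosion-rate=2 \
  --f0-discretization-bounds=mean_norm_log_f0_seg_bin.th
```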
## Vocoding samples
To generate audio for the output of `sample.py` (`$IN_FILE`):
```bash
python examples/textless_nlp/pgslm/generate_waveform.py \
--in-file=$IN_FILE \
--vocoder=$VOCODER \
--vocoder-cfg=$VOCODER_CFG \
--results-path=$RESULTS_PATH
```
See "Pre-trained model" for `$VOCODER` and `VOCODER_CFG`.