"docs/api-reference.md" did not exist on "55ea963c9e9033d01c7c20a54c5ede5babb6878e"
Commit c394d7d1 authored by “change”'s avatar “change”
Browse files

init

parents
# wav2vec 2.0
wav2vec 2.0 learns speech representations on unlabeled data as described in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](https://arxiv.org/abs/2006.11477).
We learned speech representations in multiple languages as well in [Unsupervised Cross-lingual Representation Learning for Speech Recognition (Conneau et al., 2020)](https://arxiv.org/abs/2006.13979).
We also combined wav2vec 2.0 with self-training in [Self-training and Pre-training are Complementary for Speech Recognition (Xu et al., 2020)](https://arxiv.org/abs/2010.11430).
## Pre-trained models
Model | Finetuning split | Dataset | Download
|---|---|---|---
Wav2Vec 2.0 Base | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt)
Wav2Vec 2.0 Base | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_10m.pt)
Wav2Vec 2.0 Base | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_100h.pt)
Wav2Vec 2.0 Base | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt)
Wav2Vec 2.0 Large | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/libri960_big.pt)
Wav2Vec 2.0 Large | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_10m.pt)
Wav2Vec 2.0 Large | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_100h.pt)
Wav2Vec 2.0 Large | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_960h.pt)
Wav2Vec 2.0 Large (LV-60)* | No finetuning | [Libri-Light](https://github.com/facebookresearch/libri-light) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt)
Wav2Vec 2.0 Large (LV-60)* | 10 minutes | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_10m_new.pt)
Wav2Vec 2.0 Large (LV-60)* | 100 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_100h_new.pt)
Wav2Vec 2.0 Large (LV-60)* | 960 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt)
Wav2Vec 2.0 Large (LV-60) + Self Training * | 10 minutes | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_10m_pl.pt)
Wav2Vec 2.0 Large (LV-60) + Self Training * | 100 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_100h_pl.pt)
Wav2Vec 2.0 Large (LV-60) + Self Training * | 960 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_960h_pl.pt)
\* updated (Oct. 24, 2020)
We also release multilingual pre-trained wav2vec 2.0 (XLSR) models:
Model | Architecture | Hours | Languages | Datasets | Download
|---|---|---|---|---|---
XLSR-53 | Large | 56k | 53 | MLS, CommonVoice, BABEL | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt)
The XLSR model uses the following datasets for multilingual pretraining:
* **[MLS: Multilingual LibriSpeech](https://indico2.conference4me.psnc.pl/event/35/contributions/3585/attachments/1060/1101/Wed-2-6-10.pdf)** (8 languages, 50.7k hours): *Dutch, English, French, German, Italian, Polish, Portuguese, Spanish*
* **[CommonVoice](https://commonvoice.mozilla.org/en/languages)** (36 languages, 3.6k hours): *Arabic, Basque, Breton, Chinese (CN), Chinese (HK), Chinese (TW), Chuvash, Dhivehi, Dutch, English, Esperanto, Estonian, French, German, Hakh-Chin, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Mongolian, Persian, Portuguese, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Welsh* (see also [finetuning splits](https://dl.fbaipublicfiles.com/cpc_audio/common_voices_splits.tar.gz) from [this paper](https://arxiv.org/abs/2002.02848)).
* **[Babel](https://catalog.ldc.upenn.edu/byyear)** (17 languages, 1.7k hours): *Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Kurmanji, Lao, Pashto, Swahili, Tagalog, Tamil, Tok, Turkish, Vietnamese, Zulu*
## Training a new model with the CLI tools
Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length)
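If your source recordings are long, one possible way to split them is with an external tool such as SoX (not part of fairseq); this is just a sketch and assumes the audio is already single-channel 16 kHz (see the note further below):
```shell script
# split each recording into 30-second chunks; SoX appends a sequence number to each output file
for f in /path/to/raw/*.wav; do
  sox "$f" "/path/to/waves/$(basename "$f" .wav)_.wav" trim 0 30 : newfile : restart
done
```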
### Prepare training data manifest:
First, install the `soundfile` library:
```shell script
pip install soundfile
```
Next, run:
```shell script
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
```
`$ext` should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.
`$valid` should be set to some reasonable percentage (like 0.01) of training data to use for validation.
To use a pre-defined validation set (like dev-other from librispeech), set it to 0 and then overwrite valid.tsv with a
separately pre-processed manifest file.
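For reference, the resulting *train.tsv*/*valid.tsv* manifests contain the root directory on the first line, followed by one relative path and its length in samples per line (the file names and numbers below are illustrative only):
```
/path/to/waves
speaker1/utt1.flac	59520
speaker1/utt2.flac	112640
```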
### Train a wav2vec 2.0 base model:
This configuration was used for the base model trained on the Librispeech dataset in the wav2vec 2.0 paper.
Note that the input is expected to be single channel, sampled at 16 kHz.
```shell script
$ fairseq-hydra-train \
task.data=/path/to/data \
--config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
--config-name wav2vec2_base_librispeech
```
Note: you can simulate 64 GPUs by using k GPUs and adding the command line parameters (before `--config-dir`)
`distributed_training.distributed_world_size=k` and `+optimization.update_freq='[x]'`, where x = 64/k.
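For example, a sketch of the same base-model command on 8 GPUs (so x = 64/8 = 8):
```shell script
$ fairseq-hydra-train \
    task.data=/path/to/data \
    distributed_training.distributed_world_size=8 \
    +optimization.update_freq='[8]' \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_base_librispeech
```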
### Train a wav2vec 2.0 large model:
This configuration was used for the large model trained on the Libri-light dataset in the wav2vec 2.0 paper.
```shell script
$ fairseq-hydra-train \
task.data=/path/to/data \
--config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
--config-name wav2vec2_large_librivox
```
Note: you can simulate 128 GPUs by using k GPUs and adding the command line parameters (before `--config-dir`)
`distributed_training.distributed_world_size=k` and `+optimization.update_freq='[x]'`, where x = 128/k.
### Fine-tune a pre-trained model with CTC:
Fine-tuning a model requires parallel audio and label files, as well as a vocabulary file in fairseq format.
A letter vocabulary can be downloaded [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
An example [script](libri_labels.py) that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:
```shell script
split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
```
Fine-tuning on 100h of Librispeech with letter targets:
```shell script
$ fairseq-hydra-train \
distributed_training.distributed_port=$PORT \
task.data=/path/to/data \
model.w2v_path=/path/to/model.pt \
--config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
--config-name base_100h
```
There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
You can specify the right config via the `--config-name` parameter.
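To see which fine-tuning configurations are available, you can simply list that directory:
```shell script
$ ls /path/to/fairseq-py/examples/wav2vec/config/finetuning
```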
Note: you can simulate 24 GPUs by using k GPUs and adding the command line parameters (before `--config-dir`)
`distributed_training.distributed_world_size=k` and `+optimization.update_freq='[x]'`, where x = 24/k.
Decoding with a language model during training requires flashlight [python bindings](https://github.com/facebookresearch/flashlight/tree/master/bindings/python) (previously called [wav2letter](https://github.com/facebookresearch/wav2letter)).
If you want to use a language model, add `+criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]'` to the command line.
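For example, the 100h fine-tuning command above with LM-based word error rate computation during validation would look roughly as follows (the KenLM and lexicon paths are placeholders):
```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_port=$PORT \
    task.data=/path/to/data \
    model.w2v_path=/path/to/model.pt \
    +criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]' \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
    --config-name base_100h
```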
### Evaluating a CTC model:
Evaluating a CTC model with a language model requires the [flashlight python bindings](https://github.com/facebookresearch/flashlight/tree/master/bindings/python) (previously called [wav2letter](https://github.com/facebookresearch/wav2letter)) to be installed.
The fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the [wav2letter model repository](https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019).
Be sure to upper-case the language model vocab after downloading it.
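One simple way to upper-case it, assuming the vocabulary/lexicon is a plain text file (the file name here is only an example):
```shell script
$ tr '[:lower:]' '[:upper:]' < lexicon.lst > lexicon.upper.lst
```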
Letter dictionary for pre-trained models can be found [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
Next, run the evaluation command:
```shell script
subset=dev_other
python examples/speech_recognition/infer.py /path/to/data --task audio_pretraining \
--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results/for/sclite --w2l-decoder kenlm \
--lm-model /path/to/kenlm.bin --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
--post-process letter
```
To get raw numbers, use `--w2l-decoder viterbi` and omit the lexicon. To use the transformer language model, use `--w2l-decoder fairseqlm`.
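For example, a Viterbi-only evaluation of the same subset (no language model, so the LM-related flags are dropped) might look like this; paths are placeholders:
```shell script
subset=dev_other
python examples/speech_recognition/infer.py /path/to/data --task audio_pretraining \
--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results --w2l-decoder viterbi \
--criterion ctc --labels ltr --max-tokens 4000000 --post-process letter
```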
## Use wav2vec 2.0 with 🤗Transformers:
Wav2Vec2 is also available in the [🤗Transformers library](https://github.com/huggingface/transformers) since version 4.4.
Pretrained Models can be found on the [hub](https://huggingface.co/models?filter=wav2vec2)
and documentation can be found [here](https://huggingface.co/transformers/master/model_doc/wav2vec2.html).
Usage example:
```python
# !pip install transformers
# !pip install datasets
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# load pretrained model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
librispeech_samples_ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
# load audio
audio_input, sample_rate = sf.read(librispeech_samples_ds[0]["file"])
# pad input values and return pt tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values
# INFERENCE
# retrieve logits & take argmax
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
# transcribe
transcription = processor.decode(predicted_ids[0])
# FINE-TUNE
target_transcription = "A MAN SAID TO THE UNIVERSE I EXIST"
# encode labels
with processor.as_target_processor():
    labels = processor(target_transcription, return_tensors="pt").input_ids
# compute loss by passing labels
loss = model(input_values, labels=labels).loss
loss.backward()
```
# wav2vec
Example to train a wav2vec model as described in [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](https://arxiv.org/abs/1904.05862).
## Pre-trained models
Description | Dataset | Model
---|---|---
Wav2Vec large | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt)
#### Example usage:
```python
import torch
import fairseq
cp_path = '/path/to/wav2vec.pt'
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
model = model[0]
model.eval()
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
c = model.feature_aggregator(z)
```
## Training a new model with the CLI tools
Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length)
### Prepare training data manifest:
```
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
```
### Train a wav2vec model:
```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 --optimizer adam --lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test
```
### Run wav2vec2 pre-training on Google Cloud TPUs:
Wav2Vec2 is now supported on TPUs! Currently only pre-training is supported.
#### Using hydra on a v3-8:
```
$ OMP_NUM_THREADS=1 fairseq-hydra-train \
task.data=/manifest/path \
--config-dir /PATH/TO/FAIRSEQ/examples/wav2vec/config/pretraining \
--config-name wav2vec2_large_librivox_tpu.yaml
```
#### Using command line arguments on a v3-8:
```
$ OMP_NUM_THREADS=1 python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec2 --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 --optimizer adam --lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test \
--tpu --distributed-world-size 8 --num-batch-buckets 3 --enable-padding \
--encoder-layerdrop 0 --mask-channel-prob 0.1
```
#### Using hydra on a pod slice (v3-N with N > 8):
```
$ OMP_NUM_THREADS=1 fairseq-hydra-train \
task.data=/manifest/path \
--config-dir /PATH/TO/FAIRSEQ/examples/wav2vec/config/pretraining \
--config-name wav2vec2_large_librivox_tpu-pod.yaml # edit distributed-world-size accordingly
```
#### Using command line arguments on a pod slice (v3-N with N > 8):
```
$ python -m torch_xla.distributed.xla_dist \
--tpu ${TPUNAME} --conda-env=torch-xla-${TORCH_XLA_VERSION} --env OMP_NUM_THREADS=1 \
-- \
python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec2 --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 --optimizer adam --lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test \
--tpu --distributed-world-size ${WORLD_SIZE} --num-batch-buckets 3 --enable-padding \
--encoder-layerdrop 0 --mask-channel-prob 0.1
```
### Extract embeddings from the downstream task data:
```
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/wav2vec_featurize.py --input /path/to/task/waves --output /path/to/output \
--model /model/path/checkpoint_best.pt --split train valid test
```
# vq-wav2vec
Example to train a vq-wav2vec model as described in [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (Baevski et al., 2019)](https://arxiv.org/abs/1910.05453).
These models are also used in [Effectiveness of self-supervised pre-training for speech recognition (Baevski et al., 2019)](https://arxiv.org/abs/1911.03912).
## Pre-trained models
Description | Dataset | Model
---|---|---
vq-wav2vec Gumbel | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec.pt)
vq-wav2vec K-means | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec_kmeans.pt)
Roberta on K-means codes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/bert_kmeans.tar)
#### Example usage:
```python
import torch
import fairseq
cp = torch.load('/path/to/vq-wav2vec.pt')
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp])
model = model[0]
model.eval()
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape) # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indexes corresponding to 2 groups in the model
```
## Training a new model with the CLI tools
Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length)
### Prepare training data manifest:
```
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
```
### Train a gumbel vq-wav2vec model:
```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 \
--save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 \
--optimizer adam --lr 1e-05 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--activation gelu --offset auto --skip-connections-agg --residual-scale 0.5 \
--log-keys ["prob_perplexity","code_perplexity","temp"] --vq-type gumbel --vq-groups 2 --vq-depth 2 \
--combine-groups --vq-vars 320 --vq-temp (2,0.5,0.999995) --prediction-steps 12 --warmup-updates 1000 \
--warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 --max-sample-size 150000 \
--max-tokens 300000 --cross-sample-negatives 0 --update-freq 1 --seed 2 --skip-invalid-size-inputs-valid-test
```
For k-means training, set `--vq-type kmeans` and add the `--loss-weights [1]` argument. The pre-trained models were trained on 16 GPUs.
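Put together, a k-means training command might look like the following; this is a sketch derived from the Gumbel command above (the Gumbel-specific `--vq-temp` is dropped and the remaining hyperparameters are kept unchanged, which may not be optimal):
```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 \
--save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 \
--optimizer adam --lr 1e-05 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--activation gelu --offset auto --skip-connections-agg --residual-scale 0.5 \
--log-keys ["prob_perplexity","code_perplexity","temp"] --vq-type kmeans --loss-weights [1] --vq-groups 2 --vq-depth 2 \
--combine-groups --vq-vars 320 --prediction-steps 12 --warmup-updates 1000 \
--warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 --max-sample-size 150000 \
--max-tokens 300000 --cross-sample-negatives 0 --update-freq 1 --seed 2 --skip-invalid-size-inputs-valid-test
```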
### Tokenize audio data (e.g. for BERT training):
```
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/vq-wav2vec_featurize.py --data-dir /manifest/path --output-dir /path/to/output \
--checkpoint /model/path/checkpoint_best.pt --split train valid test --extension tsv
```
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: false
labels: ltr
dataset:
num_workers: 6
max_tokens: 3200000
skip_invalid_size_inputs_valid_test: true
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 2
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 80000
lr: [0.00003]
sentence_avg: true
update_freq: [4]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.65
mask_channel_prob: 0.5
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 0
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval: 50
save_interval_updates: 10000
keep_interval_updates: 1
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: false
labels: ltr
dataset:
num_workers: 6
max_tokens: 3200000
skip_invalid_size_inputs_valid_test: true
validate_after_updates: 10000
validate_interval: 50
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 2
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 20000
lr: [0.00005]
sentence_avg: true
update_freq: [4]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.65
mask_channel_prob: 0.5
mask_channel_length: 64
layerdrop: 0.05
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval: 1000
save_interval_updates: 50
keep_interval_updates: 1
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: false
labels: ltr
dataset:
num_workers: 6
max_tokens: 3200000
skip_invalid_size_inputs_valid_test: true
validate_after_updates: 10000
validate_interval: 1000
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 2
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 13000
lr: [0.00005]
sentence_avg: true
update_freq: [4]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.65
mask_channel_prob: 0.25
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval: 1000
save_interval_updates: 50
keep_interval_updates: 1
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: false
labels: ltr
dataset:
num_workers: 6
max_tokens: 3200000
skip_invalid_size_inputs_valid_test: true
validate_after_updates: 10000
validate_interval: 1000
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 2
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 13000
lr: [0.00005]
sentence_avg: true
update_freq: [4]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.65
mask_channel_prob: 0.25
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: false
labels: ltr
dataset:
num_workers: 6
max_tokens: 3200000
skip_invalid_size_inputs_valid_test: true
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 8
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 320000
lr: [0.0001]
sentence_avg: true
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.5
mask_channel_prob: 0.1
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 0
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: true
labels: ltr
dataset:
num_workers: 6
max_tokens: 1280000
skip_invalid_size_inputs_valid_test: true
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 4
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 80000
lr: [0.00003]
sentence_avg: true
update_freq: [5]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.5
mask_channel_prob: 0.5
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval: 50
save_interval_updates: 10000
keep_interval_updates: 1
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: true
labels: ltr
dataset:
num_workers: 6
max_tokens: 1280000
skip_invalid_size_inputs_valid_test: true
validate_after_updates: 10000
validate_interval: 50
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 4
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 20000
lr: [0.0001]
sentence_avg: true
update_freq: [5]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.75
mask_channel_prob: 0.25
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval: 1000
save_interval_updates: 50
keep_interval_updates: 1
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: true
labels: ltr
dataset:
num_workers: 6
max_tokens: 1280000
skip_invalid_size_inputs_valid_test: true
validate_after_updates: 10000
validate_interval: 1000
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 4
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 13000
lr: [0.0001]
sentence_avg: true
update_freq: [5]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.65
mask_channel_prob: 0.25
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval: 1000
save_interval_updates: 50
keep_interval_updates: 1
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: true
labels: ltr
dataset:
num_workers: 6
max_tokens: 1280000
skip_invalid_size_inputs_valid_test: true
validate_after_updates: 10000
validate_interval: 1000
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 4
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 13000
lr: [0.0003]
sentence_avg: true
update_freq: [5]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.75
mask_channel_prob: 0.25
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
no_epoch_checkpoints: true
best_checkpoint_metric: wer
task:
_name: audio_pretraining
data: ???
normalize: true
labels: ltr
dataset:
num_workers: 6
max_tokens: 1280000
skip_invalid_size_inputs_valid_test: true
valid_subset: dev_other
distributed_training:
ddp_backend: legacy_ddp
distributed_world_size: 24
criterion:
_name: ctc
zero_infinity: true
optimization:
max_update: 320000
lr: [0.00003]
sentence_avg: true
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-08
lr_scheduler:
_name: tri_stage
phase_ratio: [0.1, 0.4, 0.5]
final_lr_scale: 0.05
model:
_name: wav2vec_ctc
w2v_path: ???
apply_mask: true
mask_prob: 0.5
mask_channel_prob: 0.25
mask_channel_length: 64
layerdrop: 0.1
activation_dropout: 0.1
feature_grad_mult: 0.0
freeze_finetune_updates: 10000
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval_updates: 25000
keep_interval_updates: 1
no_epoch_checkpoints: true
task:
_name: audio_pretraining
data: ???
max_sample_size: 250000
min_sample_size: 32000
normalize: false
dataset:
num_workers: 6
max_tokens: 1400000
skip_invalid_size_inputs_valid_test: true
distributed_training:
distributed_world_size: 64
ddp_backend: legacy_ddp
criterion:
_name: wav2vec
infonce: true
log_keys: ["prob_perplexity","code_perplexity","temp"]
loss_weights: [0.1, 10]
optimization:
max_update: 400000
lr: [0.0005]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-06
weight_decay: 0.01
lr_scheduler:
_name: polynomial_decay
warmup_updates: 32000
model:
_name: wav2vec2
quantize_targets: true
final_dim: 256
encoder_layerdrop: 0.05
dropout_input: 0.1
dropout_features: 0.1
feature_grad_mult: 0.1
encoder_embed_dim: 768
# @package _group_
common:
fp16: true
log_format: json
log_interval: 200
checkpoint:
save_interval_updates: 25000
keep_interval_updates: 1
no_epoch_checkpoints: true
task:
_name: audio_pretraining
data: ???
max_sample_size: 320000
min_sample_size: 32000
normalize: true
dataset:
num_workers: 6
max_tokens: 1200000
skip_invalid_size_inputs_valid_test: true
distributed_training:
distributed_world_size: 128
ddp_backend: legacy_ddp
criterion:
_name: wav2vec
infonce: true
log_keys: ["prob_perplexity","code_perplexity","temp"]
loss_weights: [0.1, 0]
optimization:
max_update: 1000000
lr: [0.005]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-06
weight_decay: 0.01
lr_scheduler:
_name: polynomial_decay
warmup_updates: 32000
model:
_name: wav2vec2
quantize_targets: true
extractor_mode: layer_norm
layer_norm_first: true
final_dim: 768
latent_temp: [2.0,0.1,0.999995]
encoder_layerdrop: 0.00
dropout_input: 0.0
dropout_features: 0.0
dropout: 0.0
attention_dropout: 0.0
conv_bias: true
encoder_layers: 24
encoder_embed_dim: 1024
encoder_ffn_embed_dim: 4096
encoder_attention_heads: 16
feature_grad_mult: 1.0
# @package _group_
common:
tpu: true
fp16: false
log_format: json
log_interval: 10
checkpoint:
save_interval_updates: 25000
keep_interval_updates: 1
no_epoch_checkpoints: true
task:
_name: audio_pretraining
data: ???
max_sample_size: 250000
min_sample_size: 32000
normalize: true
num_batch_buckets: 3
precompute_mask_indices: true
enable_padding: true
dataset:
num_workers: 6
max_tokens: 1200000
skip_invalid_size_inputs_valid_test: true
distributed_training:
distributed_world_size: 128
ddp_backend: legacy_ddp
criterion:
_name: wav2vec
infonce: true
log_keys: ["prob_perplexity","code_perplexity","temp"]
loss_weights: [0.1, 0]
optimization:
max_update: 1000000
lr: [0.005]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-06
weight_decay: 0.01
lr_scheduler:
_name: polynomial_decay
warmup_updates: 32000
model:
_name: wav2vec2
quantize_targets: true
extractor_mode: layer_norm
layer_norm_first: true
final_dim: 768
latent_temp: [2.0,0.1,0.999995]
encoder_layerdrop: 0.00
dropout_input: 0.0
dropout_features: 0.0
dropout: 0.0
attention_dropout: 0.0
conv_bias: true
encoder_layers: 24
encoder_embed_dim: 1024
encoder_ffn_embed_dim: 4096
encoder_attention_heads: 16
feature_grad_mult: 1.0
# @package _group_
common:
tpu: true
fp16: false
log_format: json
log_interval: 10
checkpoint:
save_interval_updates: 25000
keep_interval_updates: 1
no_epoch_checkpoints: true
task:
_name: audio_pretraining
data: ???
max_sample_size: 250000
min_sample_size: 32000
normalize: true
num_batch_buckets: 3
precompute_mask_indices: true
enable_padding: true
dataset:
num_workers: 6
max_tokens: 1200000
skip_invalid_size_inputs_valid_test: true
distributed_training:
distributed_world_size: 8
ddp_backend: legacy_ddp
criterion:
_name: wav2vec
infonce: true
log_keys: ["prob_perplexity","code_perplexity","temp"]
loss_weights: [0.1, 0]
optimization:
max_update: 1000000
lr: [0.005]
optimizer:
_name: adam
adam_betas: (0.9,0.98)
adam_eps: 1e-06
weight_decay: 0.01
lr_scheduler:
_name: polynomial_decay
warmup_updates: 32000
model:
_name: wav2vec2
quantize_targets: true
extractor_mode: layer_norm
layer_norm_first: true
final_dim: 768
latent_temp: [2.0,0.1,0.999995]
encoder_layerdrop: 0.00
dropout_input: 0.0
dropout_features: 0.0
dropout: 0.0
attention_dropout: 0.0
conv_bias: true
encoder_layers: 24
encoder_embed_dim: 1024
encoder_ffn_embed_dim: 4096
encoder_attention_heads: 16
feature_grad_mult: 1.0
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Helper script to pre-compute embeddings for a flashlight (previously called wav2letter++) dataset
"""
import argparse
import os
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("tsv")
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--output-name", required=True)
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)

    transcriptions = {}

    with open(args.tsv, "r") as tsv, open(
        os.path.join(args.output_dir, args.output_name + ".ltr"), "w"
    ) as ltr_out, open(
        os.path.join(args.output_dir, args.output_name + ".wrd"), "w"
    ) as wrd_out:
        root = next(tsv).strip()
        for line in tsv:
            line = line.strip()
            dir = os.path.dirname(line)
            if dir not in transcriptions:
                parts = dir.split(os.path.sep)
                trans_path = f"{parts[-2]}-{parts[-1]}.trans.txt"
                path = os.path.join(root, dir, trans_path)
                assert os.path.exists(path)
                texts = {}
                with open(path, "r") as trans_f:
                    for tline in trans_f:
                        items = tline.strip().split()
                        texts[items[0]] = " ".join(items[1:])
                transcriptions[dir] = texts
            part = os.path.basename(line).split(".")[0]
            assert part in transcriptions[dir]
            print(transcriptions[dir][part], file=wrd_out)
            print(
                " ".join(list(transcriptions[dir][part].replace(" ", "|"))) + " |",
                file=ltr_out,
            )


if __name__ == "__main__":
    main()
#!/usr/bin/env bash
# usage: bash binarize_manifest <dest_dir> <train_split> <valid_split> <fairseq_root>
DEST_DIR=$1
TRAIN_SPLIT=$2
VALID_SPLIT=$3
FAIRSEQ_ROOT=$4
mkdir -p $DEST_DIR
# split file path and lengths into separate files
cut -f1 $TRAIN_SPLIT.tsv > $DEST_DIR/train_fnames.txt
cut -f1 $VALID_SPLIT.tsv > $DEST_DIR/valid_fnames.txt
cut -f2 $TRAIN_SPLIT.tsv > $DEST_DIR/train.lengths
cut -f2 $VALID_SPLIT.tsv > $DEST_DIR/valid.lengths
# copy root directory
head -1 $TRAIN_SPLIT.tsv > $DEST_DIR/train.root
head -1 $VALID_SPLIT.tsv > $DEST_DIR/valid.root
# remove root directory
sed -i '1d' $DEST_DIR/train_fnames.txt
sed -i '1d' $DEST_DIR/valid_fnames.txt
sed -i '1d' $DEST_DIR/train.lengths
sed -i '1d' $DEST_DIR/valid.lengths
# insert spaces between characters
sed -i -e 's/\(.\)/\1 /g' $DEST_DIR/train_fnames.txt
sed -i -e 's/\(.\)/\1 /g' $DEST_DIR/valid_fnames.txt
# run preprocessor
PYTHONPATH=$FAIRSEQ_ROOT python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $DEST_DIR/train_fnames.txt --validpref $DEST_DIR/valid_fnames.txt --workers 60 --only-source --destdir $DEST_DIR
# wav2vec Unsupervised (wav2vec-U)
Wav2vec Unsupervised (wav2vec-U) is a framework for building speech recognition systems without any labeled training data as described in [Unsupervised Speech Recognition (Baevski et al., 2021)](https://ai.facebook.com/research/publications/unsupervised-speech-recognition). The model takes as input wav2vec 2.0 or XLSR representations (see [pretrained models](https://github.com/pytorch/fairseq/blob/master/examples/wav2vec)) as well as unlabeled speech and text data.
The wav2vec-U training procedure consists of three consecutive main steps:
* Preparation of speech representations and text data
* Generative adversarial training (GAN)
* Iterative self-training + Kaldi LM-decoding
## Preparation of speech and text data
Similar to [wav2vec 2.0](https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md), data folders contain {train,valid,test}.{tsv,wrd,phn} files, where audio paths are stored in tsv files, and word, letter or phoneme transcriptions are stored in .{wrd,ltr,phn}.
In **/path/to/data/with_silence** you need a *train.tsv* file as well as (optionally) *{valid,test}.{tsv,wrd,phn}*. It is also useful to have *10h.{tsv,phn}* files there for reproducing the ablation study on layer selection. In **/path/to/data/without_silence** you have the same files, except that the *.tsv* files point to audio with silences removed using rVAD.
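Laid out on disk, that corresponds to something like the following (splits other than *train* are optional, and the *10h* files are only needed for the layer-selection ablation):
```
/path/to/data/
├── with_silence/
│   ├── train.tsv
│   ├── valid.tsv  valid.wrd  valid.phn
│   ├── test.tsv   test.wrd   test.phn
│   └── 10h.tsv    10h.phn
└── without_silence/
    └── train.tsv  valid.tsv  test.tsv   (same manifests, silence-removed audio)
```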
Pre-requisites:
* set FAIRSEQ_ROOT environmental variable to your fairseq installation
* set RVAD_ROOT environmental variable to a checkout of [rVADfast](https://github.com/zhenghuatan/rVADfast)
* set KENLM_ROOT environmental variable to the location of [KenLM](https://github.com/kpu/kenlm) binaries
* install [PyKaldi](https://github.com/pykaldi/pykaldi) and set KALDI_ROOT environmental variable to the location of your kaldi installation. To use the version bundled with PyKaldi, you can use /path/to/pykaldi/tools/kaldi
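For concreteness, the prerequisites above amount to exports along these lines (all paths are placeholders; the KenLM binary directory shown is just a common build layout):
```shell
export FAIRSEQ_ROOT=/path/to/fairseq
export RVAD_ROOT=/path/to/rVADfast
export KENLM_ROOT=/path/to/kenlm/build/bin      # directory containing the KenLM binaries
export KALDI_ROOT=/path/to/pykaldi/tools/kaldi  # or your own kaldi installation
```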
Create new audio files without silences:
```shell
# create a manifest file for the original set of audio files
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /dir/to/save/audio/files --ext wav --dest /path/to/new/train.tsv --valid-percent 0
python scripts/vads.py -r $RVAD_ROOT < /path/to/train.tsv > train.vads
python scripts/remove_silence.py --tsv /path/to/train.tsv --vads train.vads --out /dir/to/save/audio/files
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py /dir/to/save/audio/files --ext wav --dest /path/to/new/train.tsv --valid-percent 0.01
```
Next, we need to preprocess the audio data to better match phonemized text data:
```shell
zsh scripts/prepare_audio.sh /dir/with/{train,test,valid}.tsv /output/dir /path/to/wav2vec2/model.pt 512 14
```
Note that if you have splits different than train/valid/test, you will need to modify this script. The last two arguments are the PCA dimensionality and the 0-based index of the layer from which to extract representations.
Now we need to prepare text data:
```shell
zsh scripts/prepare_text.sh language /path/to/text/file /output/dir 1000 espeak /path/to/fasttext/lid/model
```
The fourth argument is the minimum number of observations of a phone required for it to be kept. If your text corpus is small, you might want to reduce this number.
The fifth argument is which phonemizer to use. Supported values are [espeak](http://espeak.sourceforge.net/), [espeak-ng](https://github.com/espeak-ng/espeak-ng), and [G2P](https://github.com/Kyubyong/g2p) (english only).
Pre-trained fasttext LID models can be downloaded [here](https://fasttext.cc/docs/en/language-identification.html).
### Prepare TIMIT data
TIMIT transcripts include silence. Therefore VAD is not used for audio preprocessing, and we do not wrap transcripts with silences or insert random silence in between words.
To prepare TIMIT data for both the matched and unmatched setups:
```shell
bash scripts/prepare_timit.sh /dir/to/timit/raw/data /output/dir /path/to/wav2vec2/model.pt
```
Note that we assume the TIMIT distribution with capitalized directories and filenames is used (e.g., `TRAIN/DR1/FCJF0/SA1.PHN`).
## Generative adversarial training (GAN)
We then use a GAN model to build a first unsupervised ASR model. The data preparation above of both speech features and text data is a necessary procedure that enables the generator to match speech to text in an unsupervised way.
Launching GAN training on top of the preprocessed features with default hyperparameters can be done with:
```
PREFIX=w2v_unsup_gan_xp
TASK_DATA=/path/to/features/precompute_unfiltered_pca512_cls128_mean_pooled
TEXT_DATA=/path/to/data/phones # path to fairseq-preprocessed GAN data (phones dir)
KENLM_PATH=/path/to/data/phones/kenlm.phn.o4.bin # KenLM 4-gram phoneme language model (LM data = GAN data here)
PYTHONPATH=$FAIRSEQ_ROOT PREFIX=$PREFIX fairseq-hydra-train \
-m --config-dir config/gan \
--config-name w2vu \
task.data=${TASK_DATA} \
task.text_data=${TEXT_DATA} \
task.kenlm_path=${KENLM_PATH} \
common.user_dir=${FAIRSEQ_ROOT}/examples/wav2vec/unsupervised \
model.code_penalty=2,4 model.gradient_penalty=1.5,2.0 \
model.smoothness_weight=0.5,0.75,1.0 'common.seed=range(0,5)'
```
Once we find the best checkpoint (chosen using an unsupervised metric that combines language model perplexity and vocabulary usage), we can use it to generate phone labels (or word labels with an appropriate kaldi WFST):
```shell
python w2vu_generate.py --config-dir config/generate --config-name viterbi \
fairseq.common.user_dir=${FAIRSEQ_ROOT}/examples/wav2vec/unsupervised \
fairseq.task.data=/path/to/dir/with/features \
fairseq.common_eval.path=/path/to/gan/checkpoint \
fairseq.dataset.gen_subset=valid results_path=/where/to/save/transcriptions
```
Decoding without an LM works best on the same adjacent-mean-pooled features that the GAN was trained on, while decoding with an LM works better on the features from before the adjacent-timestep mean-pooling step (the directory without the "_pooled" suffix).
## Iterative self-training + Kaldi LM-decoding
After the GAN training provides a first unsupervised model, we can progressively refine the quality of transcriptions using several iterations of semi-supervised learning. We perform two iterations: first, we pseudo-label the training data with the unsupervised GAN model and train an HMM on the pseudo-labels; second, we relabel the training data with the HMM and then fine-tune the original wav2vec 2.0 model on the HMM pseudo-labels with a CTC loss. Note that the HMM models output phonemes while wav2vec 2.0 outputs letters; both are decoded into words using WFST decoders.
Please see [this README](kaldi_self_train/README.md) for more instructions on how to do iterative self-training + Kaldi LM-decoding.
*** Note: these instructions are a work in progress and will be updated over the next few days