This is the official implementation of the paper "[UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597)". The implementation is mainly based on the [fairseq](https://github.com/pytorch/fairseq) codebase. We release the training recipes on the CommonVoice dataset.
## Requirements and Installation
- PyTorch >= 1.6.0
- Python >= 3.6
``` bash
cd src
pip install soundfile
pip install librosa
pip install pydub
pip install --editable ./
```
## Data Preparation
Download pretraining audio data from [here](https://commonvoice.mozilla.org/datasets). (We use the June 2020 release version in our paper).
Get the wav list and the transcription for each dataset by running the data preparation scripts (a rough sketch is shown below).
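The repository's own preparation scripts are not reproduced here; the snippet below only illustrates one way to turn a CommonVoice release into a wav list plus transcriptions. The directory layout, the `validated.tsv` columns (`path`, `sentence`), the pydub-based mp3-to-wav conversion, and the output file names are assumptions, not the official pipeline.
``` python
import csv
from pathlib import Path

from pydub import AudioSegment  # installed above; mp3 decoding also requires ffmpeg

CV_ROOT = Path("CommonVoice/en")   # one language of the downloaded release (placeholder path)
OUT_DIR = Path("manifest/en")
(OUT_DIR / "wav").mkdir(parents=True, exist_ok=True)

with open(CV_ROOT / "validated.tsv", newline="", encoding="utf-8") as tsv, \
        open(OUT_DIR / "wav_list.txt", "w", encoding="utf-8") as wav_list, \
        open(OUT_DIR / "transcriptions.txt", "w", encoding="utf-8") as trans:
    for row in csv.DictReader(tsv, delimiter="\t"):
        mp3_path = CV_ROOT / "clips" / row["path"]
        wav_path = OUT_DIR / "wav" / (mp3_path.stem + ".wav")
        # CommonVoice ships mp3 clips; convert to 16 kHz mono wav for training
        audio = AudioSegment.from_mp3(str(mp3_path)).set_frame_rate(16000).set_channels(1)
        audio.export(str(wav_path), format="wav")
        wav_list.write(f"{wav_path}\n")
        trans.write(f"{wav_path.stem}\t{row['sentence']}\n")
```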
For the finetuning data, our train/val/test splits follow [this](https://dl.fbaipublicfiles.com/cpc_audio/common_voices_splits.tar.gz).
The phoneme transcriptions are generated with [phonemizer](https://github.com/bootphon/phonemizer), which converts the texts to phonemes. We then create .id files using different vocabularies. All our pre-processed data as well as the dictionaries can be downloaded from [here].
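For reference, the phonemizer Python API can be used roughly as follows; the choice of backend, language code, and separators here is illustrative and should match the setup for your target language (the mapping of phonemes to `.id` files against a vocabulary is not shown).
``` python
from phonemizer import phonemize
from phonemizer.separator import Separator

sentences = ["hello world", "unified speech representation learning"]

# Convert text to space-separated phone sequences, with word boundaries marked by "|".
phones = phonemize(
    sentences,
    language="en-us",
    backend="espeak",
    separator=Separator(phone=" ", word="| "),
    strip=True,
)
for p in phones:
    print(p)
```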
## Pretraining
We provide training examples for the large model here.
### Stage 1. Pretraining UniSpeech with labeled data.
The following script can be used to pre-train an English model:
### Stage 3. Finetuning with low-resource labeled data.
Finally, fine-tune the model with 1 hour of labeled data.
For multilingual models, you can use either a separate vocabulary (`examples/unispeech/data/en/vocab_sep.json`) or a shared vocabulary (`examples/unispeech/data/en/vocab_share.json`).
## Evaluation
We also evaluate our models on typical speech processing benchmarks.
### Speaker Verification
Evaluate on the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) dataset:
| Model | Fix pre-train | Vox1-O | Vox1-E | Vox1-H |
| ----- | ------------- | ------ | ------ | ------ |
### Speech Recognition
Evaluate on the [LibriSpeech](https://www.openslr.org/12) dataset.

## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) project.
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Reference
If you find our work useful in your research, please cite the following paper:
``` latex
@article{Chen2021WavLM,
title = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
author = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
eprint={2110.13900},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2021}
}
```
### Contact Information
For help or issues using WavLM models, please submit a GitHub issue.
For other communications related to WavLM, please contact Yu Wu (`yuwu1@microsoft.com`).
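### Model Usage and Configuration
The snippet below sketches how a pre-trained WavLM checkpoint can be loaded and used to extract speech representations. It assumes the `WavLM` and `WavLMConfig` classes from the released WavLM code, a checkpoint that stores its configuration under a `cfg` key, and a placeholder checkpoint path; adapt these to the checkpoint you actually use.
``` python
import torch
import torch.nn.functional as F

from WavLM import WavLM, WavLMConfig

# Load a pre-trained checkpoint (the path is a placeholder).
checkpoint = torch.load("/path/to/WavLM-Large.pt", map_location="cpu")
cfg = WavLMConfig(checkpoint["cfg"])
model = WavLM(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()

# Extract the last-layer representation for a batch of 16 kHz waveforms.
wav_input_16khz = torch.randn(1, 16000)
if cfg.normalize:  # see the "normalize" option in the configuration below
    wav_input_16khz = F.layer_norm(wav_input_16khz, wav_input_16khz.shape)
with torch.no_grad():
    rep = model.extract_features(wav_input_16khz)[0]  # (batch, frames, hidden_dim)
```
The main hyper-parameters exposed by `WavLMConfig`, together with their default values, are listed below.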
``` python
class WavLMConfig:
    # Abridged view: only the main hyper-parameters and their defaults are shown;
    # the logic that overrides them from a checkpoint's "cfg" dict is omitted.
    def __init__(self, cfg=None):
        self.extractor_mode: str = "default"  # mode for the feature extractor: "default" has a single group norm with d groups in the first conv block, whereas "layer_norm" has layer norms in every block (meant to be used with normalize=True)
        self.encoder_layers: int = 12  # number of encoder layers in the transformer
        self.encoder_ffn_embed_dim: int = 3072  # encoder embedding dimension for the FFN
        self.encoder_attention_heads: int = 12  # number of encoder attention heads
        self.activation_fn: str = "gelu"  # activation function to use
        self.layer_norm_first: bool = False  # apply layer norm first in the transformer
        self.conv_feature_layers: str = "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2"  # string describing the convolutional feature extraction layers as a python list of [(dim, kernel_size, stride), ...]
        self.conv_bias: bool = False  # include bias in the conv encoder
        self.feature_grad_mult: float = 1.0  # multiply feature extractor gradients by this
        self.normalize: bool = False  # normalize input to have zero mean and unit variance during training

        # dropouts
        self.dropout: float = 0.1  # dropout probability for the transformer
        self.attention_dropout: float = 0.1  # dropout probability for attention weights
        self.activation_dropout: float = 0.0  # dropout probability after activation in the FFN
        self.encoder_layerdrop: float = 0.0  # probability of dropping a transformer layer
        self.dropout_input: float = 0.0  # dropout applied to the input (after feature extraction)
        self.dropout_features: float = 0.0  # dropout applied to the features (after feature extraction)

        # masking
        self.mask_length: int = 10  # mask length
        self.mask_prob: float = 0.65  # probability of replacing a token with a mask
        self.mask_selection: str = "static"  # how to choose the mask length
        self.mask_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
        self.no_mask_overlap: bool = False  # whether to allow masks to overlap
        self.mask_min_space: int = 1  # minimum space between spans (if no overlap is enabled)

        # channel masking
        self.mask_channel_length: int = 10  # length of the mask for features (channels)
        self.mask_channel_prob: float = 0.0  # probability of replacing a feature with 0
        self.mask_channel_selection: str = "static"  # how to choose the mask length for channel masking
        self.mask_channel_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
        self.no_mask_channel_overlap: bool = False  # whether to allow channel masks to overlap
        self.mask_channel_min_space: int = 1  # minimum space between spans (if no overlap is enabled)

        # positional embeddings
        self.conv_pos: int = 128  # number of filters for convolutional positional embeddings
        self.conv_pos_groups: int = 16  # number of groups for the convolutional positional embedding

        # relative position embedding
        self.relative_position_embedding: bool = False  # apply relative position embedding
        self.num_buckets: int = 320  # number of buckets for relative position embedding
        self.max_distance: int = 1280  # maximum distance for relative position embedding
        self.gru_rel_pos: bool = False  # apply gated relative position embedding
```
Note that, to stay consistent with `GLU_Linear`, the code assumes the input always keeps the channel (feature) dimension as the last axis of the tensor, so the dimensions have to be switched before applying a 1D convolution (and switched back afterwards).
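As a minimal illustration of that convention (the layer sizes below are arbitrary and not taken from the model):
``` python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=3, padding=1)

x = torch.randn(8, 100, 512)   # (batch, time, channels): channels last, as the linear/GLU layers expect
y = conv(x.transpose(1, 2))    # nn.Conv1d expects (batch, channels, time), so swap the last two dims first
y = y.transpose(1, 2)          # switch back to (batch, time, channels) for the following layers
```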