<!--**Pre-trained models for speech related tasks**-->
[**SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data**](https://arxiv.org/abs/2209.15329)
- June 2023: We have corrected the errors in the pre-training data for the SpeechLM-P Base models, and the results have been updated.
- April 2023: We discovered some errors in the data used for the pre-training experiments, which affect all results for the SpeechLM-P Base models. We are re-running the related experiments and will update the paper with the new results.
- (Done) Oct 2022: release the code and models
- Oct 2022: release preprint in [arXiv](https://arxiv.org/abs/2209.15329)
## Pre-Trained and Fine-tuned Models
| Model | Pre-training Dataset | Fine-tuning Dataset | Model |
For easier use of our pre-trained models, we have merged all inference-related code into [`SpeechLM.py`](SpeechLM.py) and provide cleaned checkpoints [~~`SpeechLM-P Base`~~] [`SpeechLM-H Base`] [`SpeechLM-P Large`] with non-required modules removed. You can directly use the following script to extract your speech features:
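The exact extraction snippet is not reproduced here, so below is a minimal sketch. It assumes the model class in [`SpeechLM.py`](SpeechLM.py) is named `SpeechLM`, that the cleaned checkpoints store the model config under `checkpoint['cfg']['model']` and the task config under `checkpoint['cfg']['task']`, and that `SpeechLM` exposes an `extract_features` method in the HuBERT/WavLM style; please check [`SpeechLM.py`](SpeechLM.py) for the exact interface.

```python
import torch
import torch.nn.functional as F
from SpeechLM import SpeechLMConfig, SpeechLM  # SpeechLM.py in this directory

# Placeholder path: point this to one of the cleaned checkpoints above.
checkpoint = torch.load('path/to/the/cleaned/checkpoint.pt')
cfg = SpeechLMConfig(checkpoint['cfg']['model'])
model = SpeechLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# 16 kHz mono waveform of shape (batch, samples); random input for illustration only.
wav_input_16khz = torch.randn(1, 16000)

# Large models are pre-trained with waveform normalization, Base models are not;
# the flag is assumed to live in the task config of the checkpoint.
if checkpoint['cfg']['task'].get('normalize', False):
    wav_input_16khz = F.layer_norm(wav_input_16khz, wav_input_16khz.shape)

# Extract the last-layer representation, shape (batch, frames, hidden_dim).
with torch.no_grad():
    rep = model.extract_features(wav_input_16khz)[0]
print(rep.shape)
```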
To fine-tune or pre-train more models, please follow the instructions below.
```bash
git submodule update --init SpeechLM/fairseq
cd SpeechLM/
pip install --editable fairseq/
pip install sacrebleu==1.5.1
```
## ASR on LibriSpeech
### Data preparation
Please follow the steps of wav2vec 2.0 manifest [here](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec#prepare-training-data-manifest) to prepare `train.tsv` and `train.ltr`. You should make sure the vocabulary [`dict.ltr.txt`](dataset/LibriSpeech/asr/dict.ltr.txt) is the same as that used for the pre-trained model.
Put your prepared data into `$data_dir`; we provide examples in [`dataset/LibriSpeech/asr`](dataset/LibriSpeech/asr/).
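For reference, the manifest format is simple. The sketch below builds a `train.tsv` in the wav2vec style (first line is the audio root, then one `relative_path<TAB>num_samples` entry per utterance) and notes the matching `train.ltr` format; paths are placeholders, and the linked wav2vec 2.0 scripts remain the recommended way to generate these files.

```python
import os
import soundfile as sf  # pip install soundfile

audio_root = '/path/to/LibriSpeech/train-clean-100'  # placeholder
out_dir = 'dataset/LibriSpeech/asr'

# train.tsv: first line is the audio root, then "<relative path>\t<num samples>".
with open(os.path.join(out_dir, 'train.tsv'), 'w') as tsv:
    print(audio_root, file=tsv)
    for dirpath, _, files in os.walk(audio_root):
        for name in sorted(files):
            if name.endswith('.flac'):
                path = os.path.join(dirpath, name)
                print(f'{os.path.relpath(path, audio_root)}\t{sf.info(path).frames}', file=tsv)

# train.ltr: one transcript per line (aligned with train.tsv), spelled out in letters
# with '|' marking word boundaries, e.g. "A | G O L D E N | F O R T U N E |".
# The wav2vec 2.0 libri_labels.py script produces this file from the LibriSpeech transcripts.
```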
- Decode with 4-gram language model using [flashlight](https://github.com/flashlight/flashlight/tree/main/bindings/python) and [kenlm](https://github.com/kpu/kenlm).
> Please put [4-gram.arpa](https://www.openslr.org/resources/11/4-gram.arpa.gz) and the word-to-letter lexicon [librispeech_lexicon.lst](https://drive.google.com/file/d/1q7IbNGqtwXnctjvuvpviQ4ZmepFHQmTO/view?usp=sharing) into `$data_dir`.
- Decode large models with fairseq-lm using [flashlight](https://github.com/flashlight/flashlight/tree/main/bindings/python).
> Please put [lm_librispeech_word_transformer.pt](https://dl.fbaipublicfiles.com/wav2letter/sota/2019/lm/lm_librispeech_word_transformer.pt) and its vocabulary [`dict.txt`](https://dl.fbaipublicfiles.com/wav2letter/sota/2019/lm/lm_librispeech_word_transformer.dict) into `$data_dir/fairseq_word_lm`, and the word-to-letter lexicon [librispeech_lexicon.lst](https://drive.google.com/file/d/1q7IbNGqtwXnctjvuvpviQ4ZmepFHQmTO/view?usp=sharing) into `$data_dir`. Capitalize the `dict.txt` to make it compatible with the word-to-letter lexicon.
1. Download the [Common Voice audio clips](https://commonvoice.mozilla.org/en/datasets) (version 4) for English into `$cv_root/en`.
2. Get the data manifest. The conversion script turns the mp3 files into waveforms and creates the tsv files containing the speech/translation pairs as well as the data config files; a partial sketch is given below.
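The conversion script itself is not shown here. The sketch below covers the first half of this step under stated assumptions: it calls the `ffmpeg` CLI to convert each mp3 clip to 16 kHz mono wav and writes a wav2vec-style manifest; the translation-pair tsv and the data config files are left to the repository's own scripts, and all paths are placeholders.

```python
import os
import subprocess
import soundfile as sf

cv_root = '/path/to/cv_root/en'              # $cv_root/en containing Common Voice v4
clips_dir = os.path.join(cv_root, 'clips')   # mp3 clips shipped with the dataset
wav_dir = os.path.join(cv_root, 'wav')
os.makedirs(wav_dir, exist_ok=True)

# 1) Convert every mp3 clip to 16 kHz mono wav with ffmpeg.
for name in sorted(os.listdir(clips_dir)):
    if name.endswith('.mp3'):
        wav_path = os.path.join(wav_dir, name[:-4] + '.wav')
        subprocess.run(['ffmpeg', '-y', '-i', os.path.join(clips_dir, name),
                        '-ac', '1', '-ar', '16000', wav_path], check=True)

# 2) Write a wav2vec-style manifest: root dir, then "<relative path>\t<num samples>".
with open(os.path.join(cv_root, 'train.tsv'), 'w') as tsv:
    print(wav_dir, file=tsv)
    for name in sorted(os.listdir(wav_dir)):
        if name.endswith('.wav'):
            print(f'{name}\t{sf.info(os.path.join(wav_dir, name)).frames}', file=tsv)
```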
### Phoneme-unit Tokenizer for Speech
This tokenizer is used to produce the frame-aligned phonemes for unlabeled speech; it is actually a hybrid HMM ASR model.
In the Base setting, we use the 100h LibriSpeech labeled data to train the HMM model with a Kaldi recipe, then decode the unpaired speech and obtain the aligned phonemes from the lattice.
We provide the processed phonemes of the 960h speech here: [`train_960.tsv`](https://drive.google.com/file/d/1rxlikMglL2kEsF4NfqekZRoA02klY7CE/view?usp=sharing), [`train_960.phn`](), [`dev_clean.tsv`](https://drive.google.com/file/d/1NuVwe687jLBFkDLRy1EV2A2uXyV_kBo2/view?usp=sharing), [`dev_clean.phn`](https://drive.google.com/file/d/1cq_gbS-UgCALOoaE5QmhWrhkTdXuc_Uc/view?usp=sharing). Note that the label rate is 100 (10 ms).
> The phoneme inventory is 300+ word-position-dependent phones including silence phones.
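A quick sanity check of these files is sketched below. It assumes `train_960.phn` contains one space-separated sequence of frame-level phone labels per line, parallel to the tsv manifest, and uses the stated 100 Hz label rate on 16 kHz audio (160 samples per label); the local paths are placeholders.

```python
tsv_path = 'train_960.tsv'  # downloaded manifest (placeholder path)
phn_path = 'train_960.phn'  # downloaded frame-aligned phonemes (placeholder path)

with open(tsv_path) as f:
    root = f.readline().strip()                      # first line: audio root
    entries = [line.rstrip('\n').split('\t') for line in f]

with open(phn_path) as f:
    labels = [line.split() for line in f]

assert len(entries) == len(labels), 'tsv and phn must be parallel'
for (rel_path, n_samples), phones in zip(entries, labels):
    expected = int(n_samples) // 160                 # 10 ms labels on 16 kHz audio
    # allow small rounding differences between audio length and label count
    assert abs(len(phones) - expected) <= 2, (rel_path, len(phones), expected)
print(f'checked {len(entries)} utterances')
```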
### Phoneme-unit Tokenizer for Text
This tokenizer is used to phonemize the unpaired text data to (phonemes, letters) paired data, following a `words -> phonemes -> upsampled phones` pipeline.
The following script will download the LibriSpeech LM corpus and produce the required data: `train_text.phn-ltr.phn.{idx,bin}` and `train_text.phn-ltr.ltr.{idx,bin}`.
> Before running it, make sure you have our provided [`dict.phn.txt`](dataset/LibriLM/phone_unit/bin-idx/dict.phn.txt) and [`dict.ltr.txt`](dataset/LibriLM/phone_unit/bin-idx/dict.ltr.txt) in the output dir `dataset/LibriLM/phone_unit/bin-idx/`.
> The phoneme inventory is 300+ word-position-dependent phones including silence phones.
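As a toy illustration of the `words -> phonemes -> upsampled phones` pipeline (the two-word lexicon and the fixed repetition factor below are placeholders; the actual tokenizer uses a full pronunciation lexicon and upsamples each phoneme so the sequence resembles the frame-aligned phonemes produced for speech):

```python
# Placeholder lexicon: word -> phoneme sequence. The real pipeline uses a full
# pronunciation lexicon with word-position-dependent phones.
LEXICON = {
    'hello': ['HH', 'AH', 'L', 'OW'],
    'world': ['W', 'ER', 'L', 'D'],
}

def words_to_phones(sentence):
    phones = []
    for word in sentence.lower().split():
        phones.extend(LEXICON.get(word, ['<unk>']))  # fall back for OOV words
    return phones

def upsample(phones, repeat=5):
    # Placeholder upsampling: repeat every phone a fixed number of times.
    # The actual tokenizer chooses repetitions so the output looks like 10 ms
    # frame-aligned phones rather than using a single fixed factor.
    return [p for p in phones for _ in range(repeat)]

print(upsample(words_to_phones('hello world')))
```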
### Hidden-unit Tokenizer for Speech
Please follow the HuBERT data-preparation steps [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation) to prepare 1) the wav recordings [`train.tsv`](dataset/LibriSpeech/hidden_unit/train_sample100.tsv), 2) the corresponding hidden units [`train.km`](dataset/LibriSpeech/hidden_unit/train_sample100.km), and 3) the unit vocabulary [`dict.km.txt`](dataset/LibriSpeech/hidden_unit/dict.km.txt).
### Hidden-unit Tokenizer for Text
This tokenizer is used to produce the speech-style hidden units from unpaired text.
We train a [FastSpeech](https://arxiv.org/abs/2006.04558)-like model as the tokenizer on a small amount of ASR data ([100 hrs of LibriSpeech](http://www.openslr.org/12)); instead of generating a continuous spectrogram as in the original paper, it generates discrete units.
Train:
1. Convert the ASR transcripts to phoneme sequences with duration information.
2. Extract hidden-units from speech, using the [Hidden-unit Tokenizer for Speech](#hidden-unit-tokenizer-for-speech).
3. Train the [model](speechlm/models/fasttext2unit.py) on the paired data:
We provide train/generate data examples in [`dataset/LibriSpeech/fast_phone2unit`](dataset/LibriSpeech/fast_phone2unit), and the model checkpoint [here](https://drive.google.com/file/d/1e-aYf8hPXuly8DEvNg5SISOlcUxsgED0/view?usp=sharing).
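To make the architecture concrete, here is a heavily simplified, self-contained PyTorch sketch of a FastSpeech-style phone-to-unit model (phone embedding, a small Transformer encoder, a duration predictor, a length regulator, and a per-frame unit classifier). It is an illustration of the idea only, not the implementation in [`speechlm/models/fasttext2unit.py`](speechlm/models/fasttext2unit.py); the vocabulary sizes and all hyperparameters are placeholders, and it handles batch size 1 for brevity.

```python
import torch
import torch.nn as nn

class TinyPhone2Unit(nn.Module):
    """Illustrative FastSpeech-style model: phones -> durations -> discrete units."""

    def __init__(self, n_phones, n_units, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=1024, batch_first=True),
            num_layers=2,
        )
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.unit_head = nn.Linear(d_model, n_units)  # predicts one unit per output frame

    def length_regulate(self, h, durations):
        # Repeat each phone's hidden state according to its (integer) duration.
        frames = [state.repeat(max(int(d), 1), 1) for state, d in zip(h[0], durations[0])]
        return torch.cat(frames).unsqueeze(0)

    def forward(self, phone_ids, durations=None):
        h = self.encoder(self.embed(phone_ids))           # (B, T_phone, D)
        log_dur = self.duration_predictor(h).squeeze(-1)  # (B, T_phone)
        if durations is None:                             # inference: use predicted durations
            durations = torch.clamp(torch.exp(log_dur).round(), min=1)
        frames = self.length_regulate(h, durations)       # (B, T_frame, D)
        return self.unit_head(frames), log_dur            # unit logits + log durations

# Usage sketch: 300 phones / 500 units are placeholder vocabulary sizes.
model = TinyPhone2Unit(n_phones=300, n_units=500)
phones = torch.randint(0, 300, (1, 8))
unit_logits, log_dur = model(phones)
print(unit_logits.shape)  # (1, total predicted frames, 500)
```

During training, the ground-truth durations (from the alignment in step 1) would be fed to the length regulator, with a cross-entropy loss on the units and a regression loss on the log durations; at inference the predicted durations are used instead.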
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on [FAIRSEQ](https://github.com/pytorch/fairseq).
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Reference
If you find our work useful in your research, please cite the following paper:
```bibtex
@article{zhang2022speechlm,
title={SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data},
author={Zhang, Ziqiang and Chen, Sanyuan and Zhou, Long and Wu, Yu and Ren, Shuo and Liu, Shujie and Yao, Zhuoyuan and Gong, Xun and Dai, Lirong and Li, Jinyu and Wei, Furu},
eprint={2209.15329},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2022}
}
```
### Contact Information
For help or issues using SpeechLM models, please submit a GitHub issue.
For other communications related to SpeechLM, please contact Long Zhou (`lozhou@microsoft.com`).
# from fairseq.models.transformer import TransformerConfig

logger = logging.getLogger(__name__)


class DictConfig:
    def __init__(self, cfg=None):
        if cfg is not None:
            self.update(cfg)

    def update(self, cfg: dict):
        self.__dict__.update(cfg)


class TransformerConfig:
    def __init__(self, cfg=None):
        if cfg is not None:
            self.update(cfg)

    def update(self, cfg: dict):
        if 'encoder' in cfg:
            self.encoder = DictConfig(cfg['encoder'])
            del cfg['encoder']
        if 'quant_noise' in cfg:
            self.quant_noise = DictConfig(cfg['quant_noise'])
            del cfg['quant_noise']
        if 'decoder' in cfg:
            del cfg['decoder']
        self.__dict__.update(cfg)
class SpeechLMConfig:
    def __init__(self, cfg=None):
        self.label_rate: int = 50
        self.extractor_mode: str = "default"  # mode for feature extractor. default has a single group norm with d groups in the first conv block, whereas layer_norm has layer norms in every block (meant to use with normalize=True)
        self.encoder_layers: int = 12  # num encoder layers in the transformer
        self.encoder_ffn_embed_dim: int = 3072  # encoder embedding dimension for FFN
        self.encoder_attention_heads: int = 12  # num encoder attention heads
        self.activation_fn: str = "gelu"  # activation function to use
        self.layer_type: str = "transformer"  # layer type in encoder

        # dropouts
        self.dropout: float = 0.1  # dropout probability for the transformer
        self.attention_dropout: float = 0.1  # dropout probability for attention weights
        self.activation_dropout: float = 0.0  # dropout probability after activation in FFN
        self.encoder_layerdrop: float = 0.0  # probability of dropping a transformer layer
        self.dropout_input: float = 0.0  # dropout to apply to the input (after feat extr)
        self.dropout_features: float = 0.0  # dropout to apply to the features (after feat extr)

        self.final_dim: int = 256  # project final representations and targets to this many dimensions
        self.layer_norm_first: bool = False  # apply layernorm first in the transformer
        self.conv_feature_layers: str = "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2"  # string describing convolutional feature extraction layers in form of a python list that contains [(dim, kernel_size, stride), ...]
        self.conv_bias: bool = False  # include bias in conv encoder
        self.feature_grad_mult: float = 1.0  # multiply feature extractor var grads by this

        # masking
        self.mask_length: int = 10  # mask length
        self.mask_prob: float = 0.65  # probability of replacing a token with mask
        self.mask_selection: str = "static"  # how to choose mask length
        self.mask_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
        self.no_mask_overlap: bool = False  # whether to allow masks to overlap
        self.mask_min_space: int = 1  # min space between spans (if no overlap is enabled)

        # channel masking
        self.mask_channel_length: int = 10  # length of the mask for features (channels)
        self.mask_channel_prob: float = 0.0  # probability of replacing a feature with 0
        self.mask_channel_selection: str = "static"  # how to choose mask length for channel masking
        self.mask_channel_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
        self.no_mask_channel_overlap: bool = False  # whether to allow channel masks to overlap
        self.mask_channel_min_space: int = 1  # min space between spans (if no overlap is enabled)

        # positional embeddings
        self.conv_pos: int = 128  # number of filters for convolutional positional embeddings
        self.conv_pos_groups: int = 16  # number of groups for convolutional positional embedding

        # loss computation
        self.skip_masked: bool = False  # skip computing losses over masked frames
        self.skip_nomask: bool = False  # skip computing losses over unmasked frames

        self.checkpoint_activations: bool = False  # recompute activations and save memory for extra compute

        # FP16 optimization
        self.required_seq_len_multiple: int = 2  # pad the input to encoder such that the sequence length is divisible by multiple

        # Custom
        self.use_rel_pos_enc: bool = False  # whether to use relative positional encoding
        self.scaling_for_att: float = 1.0  # scaling for attention weights to prevent overflow issue (for large model)

        # unit encoder-decoder
        self.add_unit_encoder: bool = False  # add unit encoder

        # embedding mixing
        self.mix_with_unit: bool = True  # mix with the unit embeddings
        self.use_pred_unit: bool = False  # use the embeddings of predicted units
        self.l2_embedding: bool = False  # compute l2 loss between unit embedding and unit hidden state