"flashlight python bindings are required to use this functionality. Please install from https://github.com/facebookresearch/flashlight/tree/master/bindings/python"
)
LM=object
LMState=object
classW2lDecoder(object):
def__init__(self,args,tgt_dict):
self.tgt_dict=tgt_dict
self.vocab_size=len(tgt_dict)
self.nbest=args.nbest
# criterion-specific init
self.criterion_type=CriterionType.CTC
self.blank=(
tgt_dict.index("<ctc_blank>")
if"<ctc_blank>"intgt_dict.indices
elsetgt_dict.bos()
)
if"<sep>"intgt_dict.indices:
self.silence=tgt_dict.index("<sep>")
elif"|"intgt_dict.indices:
self.silence=tgt_dict.index("|")
else:
self.silence=tgt_dict.eos()
self.asg_transitions=None
defgenerate(self,models,sample,**unused):
"""Generate a batch of inferences."""
# model.forward normally channels prev_output_tokens into the decoder
# separately, but SequenceGenerator directly calls model.encoder
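Since `criterion_type` is `CriterionType.CTC`, downstream decoding ultimately reduces per-frame emissions by collapsing repeated labels and dropping `self.blank`. A minimal self-contained sketch of that step (not part of the original file; the helper name `greedy_ctc` and the `emissions` layout are illustrative):

```python
import torch

def greedy_ctc(emissions: torch.Tensor, blank: int) -> list:
    """Greedy CTC decode for one utterance.

    emissions: (T, vocab) per-frame log-probabilities.
    """
    idxs = emissions.argmax(dim=-1)        # best label per frame
    idxs = torch.unique_consecutive(idxs)  # collapse repeated labels
    return [i.item() for i in idxs if i != blank]  # drop CTC blanks
```

For example, `greedy_ctc(torch.randn(50, 32).log_softmax(-1), blank=0)` returns a list of token indices ready for string conversion via the target dictionary.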
and [ST (CoVoST 2)](docs/covost_example.md#interactive-decoding).
- 01/08/2021: Several fixes for the S2T Transformer model, inference-time de-tokenization, scorer configuration, and data
preparation scripts. We also added pre-trained models to the examples and revised the instructions.
Breaking changes: the data preparation scripts now extract filterbank features without CMVN. CMVN is instead applied
on-the-fly (defined in the config YAML).
## What's Next
- We are migrating the old fairseq [ASR example](../speech_recognition) into this S2T framework and
merging the features from both sides.
- The following papers also base their experiments on fairseq S2T. We are adding more examples for replication.
  - [Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation (Wang et al., 2020)](https://arxiv.org/abs/2006.05474)
  - [Self-Supervised Representations Improve End-to-End Speech Translation (Wu et al., 2020)](https://arxiv.org/abs/2006.12124)
  - [Self-Training for End-to-End Speech Translation (Pino et al., 2020)](https://arxiv.org/abs/2006.02490)
  - [CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus (Wang et al., 2020)](https://arxiv.org/abs/2002.01320)
  - [Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade (Pino et al., 2019)](https://arxiv.org/abs/1909.06515)
## Citation
Please cite as:
```
@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}

@misc{salesky2021mtedx,
  title = {Multilingual TEDx Corpus for Speech Recognition and Translation},
  author = {Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
  year = {2021},
}
```
# Simultaneous Speech Translation (SimulST) on MuST-C
This tutorial covers training and evaluating a transformer *wait-k* simultaneous model on the MuST-C English-German dataset, from [SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation](https://www.aclweb.org/anthology/2020.aacl-main.58.pdf).
[MuST-C](https://www.aclweb.org/anthology/N19-1202) is a multilingual speech-to-text translation corpus with 8-language translations of English TED talks.
## Data Preparation
This section introduces the data preparation for training and evaluation.
If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation).
[Download](https://ict.fbk.eu/must-c) and unpack MuST-C data to a path
`${MUSTC_ROOT}/en-${TARGET_LANG_ID}`, then preprocess it with
```bash
# Additional Python packages for S2T data processing/model training
# (representative package set; see the fairseq S2T docs for the authoritative list)
pip install pandas torchaudio sentencepiece
```
We need a pretrained offline ASR model; assume its save directory is `${ASR_SAVE_DIR}`.
The following command (and the subsequent training commands in this tutorial) assumes training on 1 GPU (you can also train on 8 GPUs and remove the `--update-freq 8` option).
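A representative ASR training sketch, assuming the standard fairseq S2T MuST-C recipe (the architecture and hyperparameter choices here are illustrative assumptions, not the tutorial's verbatim command):

```bash
fairseq-train ${MUSTC_ROOT}/en-de \
  --config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
  --save-dir ${ASR_SAVE_DIR} --task speech_to_text \
  --criterion label_smoothed_cross_entropy \
  --arch convtransformer_espnet --optimizer adam --lr 0.0005 \
  --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --max-tokens 40000 --update-freq 8
```

Here `--update-freq 8` accumulates gradients over 8 steps, which emulates 8-GPU training on a single GPU.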
The source file `${SRC_LIST_OF_AUDIO}` is a list of paths to audio files. Assuming your audio files are stored at `/home/user/data`,
it should look like this:
```bash
/home/user/data/audio-1.wav
/home/user/data/audio-2.wav
```
Each line of the target file `${TGT_FILE}` is the translation of the corresponding audio file input:
```bash
Translation_1
Translation_2
```
The evaluation runs on the original MuST-C segmentation.
The following command will generate the wav list and text file for an evaluation set `${SPLIT}` (chosen from `dev`, `tst-COMMON` and `tst-HE`) in MuST-C to `${EVAL_DATA}`.
The `--data-bin` and `--config` should be the same as in the previous section if you prepared the data from scratch.
If you are only running evaluation, a prepared data directory can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1.0_en_de_databin.tgz). It contains:
- `spm_unigram10000_st.model`: a sentencepiece model binary.
- `spm_unigram10000_st.txt`: the dictionary file generated by the sentencepiece model.
- `gcmvn.npz`: the binary file for global cepstral mean and variance statistics.
- `config_st.yaml`: the config yaml file.
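It looks like the following sketch (field values are representative of a fairseq S2T data config, not the exact contents of the downloaded file):

```yaml
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: /absolute/path/to/spm_unigram10000_st.model
global_cmvn:
  stats_npz_path: /absolute/path/to/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: spm_unigram10000_st.txt
```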
You will need to set the absolute paths for `sentencepiece_model` and `stats_npz_path` if you use the downloaded data directory.
Notice that once `--data-bin` is set, `--config` is the base name of the config yaml, not the full path.
Set `--model-path` to the model checkpoint.
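Putting these options together, an evaluation call might look like the following sketch (the `simuleval` entry point and agent path are assumptions based on the SimulEval toolkit, and `${ST_SAVE_DIR}/${CHECKPOINT_FILENAME}` is a placeholder for your trained ST checkpoint):

```bash
simuleval \
  --agent examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py \
  --source ${SRC_LIST_OF_AUDIO} \
  --target ${TGT_FILE} \
  --data-bin ${MUSTC_ROOT}/en-de \
  --config config_st.yaml \
  --model-path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
  --output ${OUTPUT}
```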
A pretrained checkpoint can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/convtransformer_wait5_pre7), which is a wait-5 model with a pre-decision of 280 ms.
The result of this model on `tst-COMMON` is:
```bash
{
"Quality": {
"BLEU": 13.94974229366959
},
"Latency": {
"AL": 1751.8031870037803,
"AL_CA": 2338.5911762796536,
"AP": 0.7931395378788959,
"AP_CA": 0.9405103863210942,
"DAL": 1987.7811616943081,
"DAL_CA": 2425.2751560926167
}
}
```
If the `--output ${OUTPUT}` option is used, the detailed log and scores will be stored under the `${OUTPUT}` directory.
Quality is measured by detokenized BLEU, so make sure that the predicted words sent to the server are detokenized.
The latency metrics are:
* Average Proportion
* Average Lagging
* Differentiable Average Lagging
Again, they are also evaluated on detokenized text; a brief sketch of their definitions follows.
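For orientation, here is how the first two metrics are defined in the literature (AP from Cho & Esipova, 2016; AL from Ma et al., 2019), where $g(t)$ is the number of source frames read when target token $t$ is emitted, $|\mathbf{x}|$ and $|\mathbf{y}|$ are the source and target lengths, and $\gamma = |\mathbf{y}|/|\mathbf{x}|$:

$$
\mathrm{AP} = \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} g(t),
\qquad
\mathrm{AL} = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{\gamma} \right),
\quad \tau = \min\{\, t : g(t) = |\mathbf{x}| \,\}
$$

The `_CA` variants reported above are the computation-aware versions, which measure $g(t)$ in wall-clock time so that model computation also counts toward latency.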