# Non-autoregressive Neural Machine Translation (NAT)

This page mainly includes instructions for reproducing results from the paper
* [Levenshtein Transformer (Gu et al., 2019)](https://arxiv.org/abs/1905.11006).

We also provide our own implementations of several popular non-autoregressive models for reference:
* [Non-Autoregressive Neural Machine Translation (Gu et al., 2017)](https://arxiv.org/abs/1711.02281)
* [Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee et al., 2018)](https://arxiv.org/abs/1802.06901)
* [Insertion Transformer: Flexible Sequence Generation via Insertion Operations (Stern et al., 2019)](https://arxiv.org/abs/1902.03249)
* [Mask-Predict: Parallel Decoding of Conditional Masked Language Models (Ghazvininejad et al., 2019)](https://arxiv.org/abs/1904.09324v2)

## Dataset

First, follow the [instructions to download and preprocess the WMT'14 En-De dataset](../translation#prepare-wmt14en2desh).
Make sure to learn a joint vocabulary by passing the `--joined-dictionary` option to `fairseq-preprocess`.

### Knowledge Distillation
Following [Gu et al. 2019](https://arxiv.org/abs/1905.11006), [knowledge distillation](https://arxiv.org/abs/1606.07947) from an autoregressive model can effectively simplify the training data distribution, which is sometimes essential for NAT-based models to learn good translations.
The easiest way to perform distillation is to follow the [instructions for training a standard transformer model](../translation) on the same data, and then decode the training set to produce a distillation dataset for NAT (a sketch of this step is given at the end of this page).

### Download
We also provide the preprocessed [original](http://dl.fbaipublicfiles.com/nat/original_dataset.zip) and [distillation](http://dl.fbaipublicfiles.com/nat/distill_dataset.zip) datasets. Please build the binarized dataset on your own (see the example at the end of this page).

## Train a model

We can now train a non-autoregressive model using the `translation_lev` task and a new criterion `nat_loss`.
Use the `--noise` flag to specify the input noise applied to the target sentences.
By default, the task is set up for the *Levenshtein Transformer*, with `--noise='random_delete'`. Full scripts to run other models can also be found [here](./scripts.md).

The following command will train a *Levenshtein Transformer* on the binarized dataset.

```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion nat_loss \
    --arch levenshtein_transformer \
    --noise random_delete \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```

## Translate

Once a model is trained, we can generate translations using an `iterative_refinement_generator`, which, starting from the model's initial output, iteratively and greedily refines the translation until (1) the model predicts the same translation for two consecutive iterations, or (2) the generator reaches the maximum number of iterations (`--iter-decode-max-iter`). Use `--print-step` to check the actual number of iterations for each sentence.

For the *Levenshtein Transformer*, it sometimes helps to apply an `--iter-decode-eos-penalty` (typically 0–3) to penalize the model for ending generation too early and producing translations that are too short.

For example, to generate with `--iter-decode-max-iter=9`:

```bash
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-eos-penalty 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```

At the end of generation, the tokenized BLEU score for the translation is printed.

## Citation

```bibtex
@article{gu2019levenshtein,
  title={Levenshtein Transformer},
  author={Gu, Jiatao and Wang, Changhan and Zhao, Jake},
  journal={arXiv preprint arXiv:1905.11006},
  year={2019}
}
```
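
## Example: decoding the training set for distillation

A minimal sketch of the distillation step referenced in the Knowledge Distillation section: decode the training split with a trained autoregressive transformer and pair each source sentence with the beam-search hypothesis. The checkpoint path, data directory, and output file names below are assumptions, not part of the original instructions; adjust them to your setup.

```bash
# Assumes an autoregressive transformer checkpoint trained on the
# original (non-distilled) data; the paths below are hypothetical.
fairseq-generate \
    data-bin/wmt14_en_de \
    --gen-subset train \
    --path checkpoints_at/checkpoint_best.pt \
    --beam 5 \
    --batch-size 400 \
    > train.decode.out

# fairseq-generate prints one S- (source) and one H- (hypothesis) line
# per sentence, so extracting both in log order keeps the pairs aligned.
# BPE is left in place (no --remove-bpe) so the output can be binarized
# directly as the distilled training set.
grep ^S- train.decode.out | cut -f2- > train.distill.en
grep ^H- train.decode.out | cut -f3- > train.distill.de
```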
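
## Example: binarizing the data

A minimal sketch of the binarization step referenced in the Download section, assuming the unpacked archive provides BPE-tokenized `train`/`valid`/`test` files with `.en`/`.de` suffixes (the `wmt14_ende_distill` directory name is hypothetical). The important part is the `--joined-dictionary` flag, which learns the joint vocabulary the models above expect.

```bash
# A minimal sketch; paths are placeholders for wherever you unpacked
# the (original or distilled) dataset.
TEXT=wmt14_ende_distill
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --joined-dictionary \
    --destdir data-bin/wmt14_en_de_distill \
    --workers 20
```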