v1.0

Commit 799a38c5 authored Nov 10, 2023 by chenzk
20 changed files
--- a/fairseq/examples/backtranslation/sacrebleu.sh
+++ b/fairseq/examples/backtranslation/sacrebleu.sh
+#!/bin/bash
+
+if [ $# -ne 5 ]; then
+    echo "usage: $0 [dataset=wmt14/full] [langpair=en-de] [databin] [bpecode] [model]"
+    exit 1
+fi
+
+
+DATASET=$1
+LANGPAIR=$2
+DATABIN=$3
+BPECODE=$4
+MODEL=$5
+
+SRCLANG=$(echo $LANGPAIR | cut -d '-' -f 1)
+TGTLANG=$(echo $LANGPAIR | cut -d '-' -f 2)
+
+
+BPEROOT=examples/backtranslation/subword-nmt/subword_nmt
+if [ ! -e $BPEROOT ]; then
+    BPEROOT=subword-nmt/subword_nmt
+    if [ ! -e $BPEROOT ]; then
+        echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
+        git clone https://github.com/rsennrich/subword-nmt.git
+    fi
+fi
+
+
+sacrebleu -t $DATASET -l $LANGPAIR --echo src \
+| sacremoses tokenize -a -l $SRCLANG -q \
+| python $BPEROOT/apply_bpe.py -c $BPECODE \
+| fairseq-interactive $DATABIN --path $MODEL \
+    -s $SRCLANG -t $TGTLANG \
+    --beam 5 --remove-bpe --buffer-size 1024 --max-tokens 8000 \
+| grep ^H- | cut -f 3- \
+| sacremoses detokenize -l $TGTLANG -q \
+| sacrebleu -t $DATASET -l $LANGPAIR
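Note on the pipeline above: the `grep ^H- | cut -f 3-` stage keeps only hypothesis lines and drops their first two tab-separated fields. A minimal Python sketch of that filter follows. It assumes hypothesis lines have the form `H-<id>\t<score>\t<text>`; the sample lines are made up for illustration, and `extract_hypotheses` is a hypothetical helper, not part of fairseq.

```python
def extract_hypotheses(lines):
    """Mimic `grep ^H- | cut -f 3-` for well-formed, tab-separated lines."""
    hyps = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("H-"):
            continue  # drop every line that is not a hypothesis line
        fields = line.split("\t")
        # cut -f 3- keeps the third field onward, re-joined with tabs
        hyps.append("\t".join(fields[2:]))
    return hyps


# Made-up sample output, for illustration only
sample = [
    "S-0\tHello world !",
    "H-0\t-0.25\tHallo Welt !",
    "P-0\t-0.1 -0.2 -0.3 -0.4",
]
print(extract_hypotheses(sample))  # ['Hallo Welt !']
```

Only the text field survives, which is what the detokenizer and `sacrebleu` expect on standard input.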
--- a/fairseq/examples/backtranslation/tokenized_bleu.sh
+++ b/fairseq/examples/backtranslation/tokenized_bleu.sh
+#!/bin/bash
+
+if [ $# -ne 5 ]; then
+    echo "usage: $0 [dataset=wmt14/full] [langpair=en-de] [databin] [bpecode] [model]"
+    exit 1
+fi
+
+
+DATASET=$1
+LANGPAIR=$2
+DATABIN=$3
+BPECODE=$4
+MODEL=$5
+
+SRCLANG=$(echo $LANGPAIR | cut -d '-' -f 1)
+TGTLANG=$(echo $LANGPAIR | cut -d '-' -f 2)
+
+
+BPEROOT=examples/backtranslation/subword-nmt/subword_nmt
+if [ ! -e $BPEROOT ]; then
+    BPEROOT=subword-nmt/subword_nmt
+    if [ ! -e $BPEROOT ]; then
+        echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
+        git clone https://github.com/rsennrich/subword-nmt.git
+    fi
+fi
+
+
+TMP_REF=$(mktemp)
+
+sacrebleu -t $DATASET -l $LANGPAIR --echo ref -q \
+| sacremoses normalize -l $TGTLANG -q \
+| sacremoses tokenize -a -l $TGTLANG -q \
+> $TMP_REF
+
+sacrebleu -t $DATASET -l $LANGPAIR --echo src -q \
+| sacremoses normalize -l $SRCLANG -q \
+| sacremoses tokenize -a -l $SRCLANG -q \
+| python $BPEROOT/apply_bpe.py -c $BPECODE \
+| fairseq-interactive $DATABIN --path $MODEL \
+    -s $SRCLANG -t $TGTLANG \
+    --beam 5 --remove-bpe --buffer-size 1024 --max-tokens 8000 \
+| grep ^H- | cut -f 3- \
+| fairseq-score --ref $TMP_REF
+
+rm -f $TMP_REF
--- a/fairseq/examples/bart/README.glue.md
+++ b/fairseq/examples/bart/README.glue.md
+# Fine-tuning BART on GLUE tasks
+
+### 1) Download the data from the GLUE website (https://gluebenchmark.com/tasks) using the following commands:
+```bash
+wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
+python download_glue_data.py --data_dir glue_data --tasks all
+```
+
+### 2) Preprocess GLUE task data (same as RoBERTa):
+```bash
+./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
+```
+`glue_task_name` is one of the following:
+`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
+Use `ALL` to preprocess all the GLUE tasks.
+
+### 3) Fine-tuning on GLUE task:
+Example fine-tuning command for the `RTE` task:
+```bash
+TOTAL_NUM_UPDATES=2036  # 10 epochs through RTE for bsz 16
+WARMUP_UPDATES=61      # 6 percent of the number of updates
+LR=1e-05                # Peak LR for polynomial LR scheduler.
+NUM_CLASSES=2
+MAX_SENTENCES=16        # Batch size.
+BART_PATH=/path/to/bart/model.pt
+
+CUDA_VISIBLE_DEVICES=0,1 fairseq-train RTE-bin/ \
+    --restore-file $BART_PATH \
+    --batch-size $MAX_SENTENCES \
+    --max-tokens 4400 \
+    --task sentence_prediction \
+    --add-prev-output-tokens \
+    --layernorm-embedding \
+    --share-all-embeddings \
+    --share-decoder-input-output-embed \
+    --reset-optimizer --reset-dataloader --reset-meters \
+    --required-batch-size-multiple 1 \
+    --init-token 0 \
+    --arch bart_large \
+    --criterion sentence_prediction \
+    --num-classes $NUM_CLASSES \
+    --dropout 0.1 --attention-dropout 0.1 \
+    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 \
+    --clip-norm 0.0 \
+    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
+    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
+    --max-epoch 10 \
+    --find-unused-parameters \
+    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
+```
+
+For each GLUE task, you will need to use the following command-line arguments:
+
+Argument | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
+---|---|---|---|---|---|---|---|---
+`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
+`--lr` | 5e-6 | 1e-5 | 1e-5 | 1e-5 | 5e-6 | 2e-5 | 2e-5 | 2e-5
+`bsz` | 128 | 32 | 32 | 32 | 128 | 64 | 64 | 32
+`--total-num-update` | 30968 | 33112 | 113272 | 1018 | 5233 | 1148 | 1334 | 1799
+`--warmup-updates` | 1858 | 1986 | 6796 | 61 | 314 | 68 | 80 | 107
+
+For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
+
+**Note:**
+
+a) `--total-num-update` is used by the `polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--batch-size=32/64/128` depending on the task.
+
+b) The above command-line arguments and hyperparameters were tested on an Nvidia `V100` GPU with `32gb` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--batch-size`.
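Note on the two points above: the warmup values in the table are roughly 6 percent of `--total-num-update`, and trading `--batch-size` for `--update-freq` keeps the number of sentences per optimizer step constant. The sketch below illustrates both calculations. It assumes the effective batch size is the per-GPU batch size times the number of GPUs times the update frequency; the function names are hypothetical helpers, not fairseq APIs.

```python
import math


def warmup_updates(total_num_update, percent=6):
    """Warmup steps as a whole-number percentage of total updates, rounded down."""
    return total_num_update * percent // 100


def update_freq_for(target_bsz, per_gpu_bsz, num_gpus=1):
    """Gradient-accumulation steps needed to reach target_bsz sentences per update.

    Assumption: effective batch size = per_gpu_bsz * num_gpus * update_freq.
    """
    return math.ceil(target_bsz / (per_gpu_bsz * num_gpus))


print(warmup_updates(1018))         # RTE column of the table -> 61
print(update_freq_for(32, 8))       # only 8 sentences fit on one GPU -> 4
print(update_freq_for(128, 16, 2))  # batch of 128 on two GPUs, 16 each -> 4
```

Rounding down reproduces the table's `--warmup-updates` for every task except SST-2, where the table lists 314 rather than 313.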
+
+### Inference on GLUE task
+After training the model as described in the previous step, you can perform inference with the checkpoints in the `checkpoints/` directory using the following Python code snippet:
+
+```python
+from fairseq.models.bart import BARTModel
+
+bart = BARTModel.from_pretrained(
+    'checkpoints/',
+    checkpoint_file='checkpoint_best.pt',
+    data_name_or_path='RTE-bin'
+)
+
+label_fn = lambda label: bart.task.label_dictionary.string(
+    [label + bart.task.label_dictionary.nspecial]
+)   
+ncorrect, nsamples = 0, 0
+bart.cuda()
+bart.eval()
+with open('glue_data/RTE/dev.tsv') as fin:
+    fin.readline()
+    for index, line in enumerate(fin):
+        tokens = line.strip().split('\t')
+        sent1, sent2, target = tokens[1], tokens[2], tokens[3]
+        tokens = bart.encode(sent1, sent2)
+        prediction = bart.predict('sentence_classification_head', tokens).argmax().item()
+        prediction_label = label_fn(prediction)
+        ncorrect += int(prediction_label == target)
+        nsamples += 1
+print('| Accuracy: ', float(ncorrect)/float(nsamples))
+```
--- a/fairseq/examples/bart/README.md
+++ b/fairseq/examples/bart/README.md
+# BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
+
+[https://arxiv.org/abs/1910.13461](https://arxiv.org/abs/1910.13461)
+
+## Introduction
+
+BART is a sequence-to-sequence model trained with denoising as its pretraining objective. We show that this pretraining objective is more generic and that we can match [RoBERTa](../roberta) results on SQuAD and GLUE and achieve state-of-the-art results on summarization (XSum, CNN dataset), long-form generative question answering (ELI5), and dialog response generation (ConvAI2). See the associated paper for more details.
+
+## Pre-trained models
+
+Model | Description | # params | Download
+---|---|---|---
+`bart.base` | BART model with 6 encoder and decoder layers | 140M | [bart.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.base.tar.gz)
+`bart.large` | BART model with 12 encoder and decoder layers | 400M | [bart.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz)
+`bart.large.mnli` | `bart.large` finetuned on `MNLI` | 400M | [bart.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz)
+`bart.large.cnn` | `bart.large` finetuned on `CNN-DM` | 400M | [bart.large.cnn.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.cnn.tar.gz)
+`bart.large.xsum` | `bart.large` finetuned on `Xsum` | 400M | [bart.large.xsum.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.xsum.tar.gz)
+
+## Results
+
+**[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)**
+_(dev set, single model, single-task finetuning)_
+
+Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
+---|---|---|---|---|---|---|---|---
+`roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
+`bart.large` | 89.9 | 94.9 | 92.5 | 87.0 | 96.6 | 90.4 | 62.8 | 91.2
+
+**[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)**
+_(dev set, no additional data used)_
+
+Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
+---|---|---
+`roberta.large` | 88.9/94.6 | 86.5/89.4
+`bart.large` | 88.8/94.6 | 86.1/89.2
+
+**[CNN/Daily Mail](http://nlpprogress.com/english/summarization.html)**
+_(test set, no additional data used)_
+
+Model | R1 | R2 | RL
+---|---|---|---
+`BERTSUMEXTABS` | 42.13 | 19.60 | 39.18
+`bart.large` | 44.16 | 21.28 | 40.90
+
+## Example usage
+
+##### Load BART from torch.hub (PyTorch >= 1.1):
+```python
+import torch
+bart = torch.hub.load('pytorch/fairseq', 'bart.large')
+bart.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Load BART (for PyTorch 1.0 or custom models):
+```python
+# Download the bart.large model first (run these two commands in a shell, not in Python):
+#   wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
+#   tar -xzvf bart.large.tar.gz
+
+# Load the model in fairseq
+from fairseq.models.bart import BARTModel
+bart = BARTModel.from_pretrained('/path/to/bart.large', checkpoint_file='model.pt')
+bart.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Apply Byte-Pair Encoding (BPE) to input text:
+```python
+tokens = bart.encode('Hello world!')
+assert tokens.tolist() == [0, 31414, 232, 328, 2]
+bart.decode(tokens)  # 'Hello world!'
+```
+
+##### Extract features from BART:
+```python
+# Extract the last layer's features
+last_layer_features = bart.extract_features(tokens)
+assert last_layer_features.size() == torch.Size([1, 5, 1024])
+
+# Extract all layers' features from the decoder (layer 0 is the embedding layer)
+all_layers = bart.extract_features(tokens, return_all_hiddens=True)
+assert len(all_layers) == 13
+assert torch.all(all_layers[-1] == last_layer_features)
+```
+
+##### Use BART for sentence-pair classification tasks:
+```python
+# Download BART already finetuned for MNLI
+bart = torch.hub.load('pytorch/fairseq', 'bart.large.mnli')
+bart.eval()  # disable dropout for evaluation
+
+# Encode a pair of sentences and make a prediction
+tokens = bart.encode('BART is a seq2seq model.', 'BART is not sequence to sequence.')
+bart.predict('mnli', tokens).argmax()  # 0: contradiction
+
+# Encode another pair of sentences
+tokens = bart.encode('BART is denoising autoencoder.', 'BART is version of autoencoder.')
+bart.predict('mnli', tokens).argmax()  # 2: entailment
+```
+
+##### Register a new (randomly initialized) classification head:
+```python
+bart.register_classification_head('new_task', num_classes=3)
+logprobs = bart.predict('new_task', tokens)
+```
+
+##### Batched prediction:
+```python
+import torch
+from fairseq.data.data_utils import collate_tokens
+
+bart = torch.hub.load('pytorch/fairseq', 'bart.large.mnli')
+bart.eval()
+
+batch_of_pairs = [
+    ['BART is a seq2seq model.', 'BART is not sequence to sequence.'],
+    ['BART is denoising autoencoder.', 'BART is version of autoencoder.'],
+]
+
+batch = collate_tokens(
+    [bart.encode(pair[0], pair[1]) for pair in batch_of_pairs], pad_idx=1
+)
+
+logprobs = bart.predict('mnli', batch)
+print(logprobs.argmax(dim=1))
+# tensor([0, 2])
+```
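Note on the batched example above: the two encoded pairs have different lengths, so they are padded to a common length before being stacked into one batch. The following is a simplified, list-based stand-in for that padding step. It assumes right padding with `pad_idx=1`; `pad_batch` is a hypothetical name, and the real `collate_tokens` works on tensors and offers more options.

```python
def pad_batch(sequences, pad_idx=1):
    """Right-pad each token-id list with pad_idx to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_idx] * (max_len - len(seq)) for seq in sequences]


# Token ids below are arbitrary placeholders, not real BART encodings
batch = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(batch)  # [[0, 11, 12, 2], [0, 21, 2, 1]]
```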
+
+##### Using the GPU:
+```python
+bart.cuda()
+bart.predict('new_task', tokens)
+```
+
+#### Filling masks:
+
+BART can be used to fill multiple `<mask>` tokens in the input.
+```python
+bart = torch.hub.load('pytorch/fairseq', 'bart.base')
+bart.eval()
+bart.fill_mask(['The cat <mask> on the <mask>.'], topk=3, beam=10)
+# [[('The cat was on the ground.', tensor(-0.6183)), ('The cat was on the floor.', tensor(-0.6798)), ('The cat sleeps on the couch.', tensor(-0.6830))]]
+```
+
+Note that by default we enforce the output length to match the input length.
+This can be disabled by setting ``match_source_len=False``:
+```python
+bart.fill_mask(['The cat <mask> on the <mask>.'], topk=3, beam=10, match_source_len=False)
+# [[('The cat was on the ground.', tensor(-0.6185)), ('The cat was asleep on the couch.', tensor(-0.6276)), ('The cat was on the floor.', tensor(-0.6800))]]
+```
+
+Example code to fill masks for a batch of sentences using the GPU:
+```python
+bart.cuda()
+bart.fill_mask(['The cat <mask> on the <mask>.', 'The dog <mask> on the <mask>.'], topk=3, beam=10)
+# [[('The cat was on the ground.', tensor(-0.6183)), ('The cat was on the floor.', tensor(-0.6798)), ('The cat sleeps on the couch.', tensor(-0.6830))],
+#  [('The dog was on the ground.', tensor(-0.6190)), ('The dog lay on the ground.', tensor(-0.6711)), ('The dog was asleep on the couch', tensor(-0.6796))]]
+```
+
+#### Evaluating the `bart.large.mnli` model:
+
+Example Python code snippet to evaluate accuracy on the MNLI `dev_matched` set:
+```python
+label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
+ncorrect, nsamples = 0, 0
+bart.cuda()
+bart.eval()
+with open('glue_data/MNLI/dev_matched.tsv') as fin:
+    fin.readline()
+    for index, line in enumerate(fin):
+        tokens = line.strip().split('\t')
+        sent1, sent2, target = tokens[8], tokens[9], tokens[-1]
+        tokens = bart.encode(sent1, sent2)
+        prediction = bart.predict('mnli', tokens).argmax().item()
+        prediction_label = label_map[prediction]
+        ncorrect += int(prediction_label == target)
+        nsamples += 1
+print('| Accuracy: ', float(ncorrect)/float(nsamples))
+# Expected output: 0.9010
+```
+
+#### Evaluating the `bart.large.cnn` model:
+- Follow the instructions [here](https://github.com/abisee/cnn-dailymail) to download the data and process it into files such that `test.source` and `test.target` have one line for each non-tokenized sample.
+- For simpler preprocessing, you can also `wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz`, although there is no guarantee of identical scores.
+- `huggingface/transformers` has a simpler interface that supports [single-gpu](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/run_eval.py) and [multi-gpu](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/run_distributed_eval.py) beam search.
+    In `huggingface/transformers`, the BART models' paths are `facebook/bart-large-cnn` and `facebook/bart-large-xsum`.
+
+In `fairseq`, summaries can be generated using:
+
+```bash
+cp data-bin/cnn_dm/dict.source.txt  checkpoints/
+python examples/bart/summarize.py \
+  --model-dir pytorch/fairseq \
+  --model-file bart.large.cnn \
+  --src cnn_dm/test.source \
+  --out cnn_dm/test.hypo
+```
+
+To calculate ROUGE, install `files2rouge` from [here](https://github.com/pltrdy/files2rouge).
+
+```bash
+export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
+
+# Tokenize hypothesis and target files.
+cat test.hypo | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.tokenized
+cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.target
+files2rouge test.hypo.tokenized test.hypo.target
+# Expected output: (ROUGE-2 Average_F: 0.21238)
+```
+
+
+## Finetuning
+
+- [Finetuning on GLUE](README.glue.md)
+- [Finetuning on CNN-DM](README.summarization.md)
+
+## Citation
+
+```bibtex
+@article{lewis2019bart,
+    title = {BART: Denoising Sequence-to-Sequence Pre-training for Natural
+Language Generation, Translation, and Comprehension},
+    author = {Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and
+              Abdelrahman Mohamed and Omer Levy and Veselin Stoyanov
+              and Luke Zettlemoyer },
+    journal={arXiv preprint arXiv:1910.13461},
+    year = {2019},
+}
+```
--- a/fairseq/examples/bart/README.summarization.md
+++ b/fairseq/examples/bart/README.summarization.md
+# Fine-tuning BART on CNN-Dailymail summarization task
+
+### 1) Download the CNN and Daily Mail data and preprocess it into data files with non-tokenized cased samples.
+
+Follow the instructions [here](https://github.com/abisee/cnn-dailymail) to download the original CNN and Daily Mail datasets. To preprocess the data, refer to the pointers in [this issue](https://github.com/pytorch/fairseq/issues/1391) or check out the code [here](https://github.com/artmatsak/cnn-dailymail).
+
+Follow the instructions [here](https://github.com/EdinburghNLP/XSum) to download the original Extreme Summarization datasets, or check out the code [here](https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset). Please keep the raw dataset and make sure that neither tokenization nor BPE is applied to it.
+
+### 2) BPE preprocess:
+
+```bash
+wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
+wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
+wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
+
+TASK=cnn_dm
+for SPLIT in train val
+do
+  for LANG in source target
+  do
+    python -m examples.roberta.multiprocessing_bpe_encoder \
+    --encoder-json encoder.json \
+    --vocab-bpe vocab.bpe \
+    --inputs "$TASK/$SPLIT.$LANG" \
+    --outputs "$TASK/$SPLIT.bpe.$LANG" \
+    --workers 60 \
+    --keep-empty;
+  done
+done
+```
+
+### 3) Binarize dataset:
+```bash
+fairseq-preprocess \
+  --source-lang "source" \
+  --target-lang "target" \
+  --trainpref "${TASK}/train.bpe" \
+  --validpref "${TASK}/val.bpe" \
+  --destdir "${TASK}-bin/" \
+  --workers 60 \
+  --srcdict dict.txt \
+  --tgtdict dict.txt;
+```
+
+### 4) Fine-tuning on CNN-DM summarization task:
+Example fine-tuning command for CNN-DM:
+```bash
+TOTAL_NUM_UPDATES=20000  
+WARMUP_UPDATES=500      
+LR=3e-05
+MAX_TOKENS=2048
+UPDATE_FREQ=4
+BART_PATH=/path/to/bart/model.pt
+
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train cnn_dm-bin \
+    --restore-file $BART_PATH \
+    --max-tokens $MAX_TOKENS \
+    --task translation \
+    --source-lang source --target-lang target \
+    --truncate-source \
+    --layernorm-embedding \
+    --share-all-embeddings \
+    --share-decoder-input-output-embed \
+    --reset-optimizer --reset-dataloader --reset-meters \
+    --required-batch-size-multiple 1 \
+    --arch bart_large \
+    --criterion label_smoothed_cross_entropy \
+    --label-smoothing 0.1 \
+    --dropout 0.1 --attention-dropout 0.1 \
+    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
+    --clip-norm 0.1 \
+    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
+    --fp16 --update-freq $UPDATE_FREQ \
+    --skip-invalid-size-inputs-valid-test \
+    --find-unused-parameters;
+```
+The above is expected to run on `1` node with `8 32gb-V100` GPUs.
+Expected training time is about `5 hours`. Training time can be reduced with distributed training on `4` nodes and `--update-freq 1`.
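Note on the timing remark above: moving from one node with `--update-freq 4` to four nodes with `--update-freq 1` leaves the number of tokens per optimizer step unchanged, which is why only wall-clock time should differ. The sketch below works through that arithmetic. It assumes each of the four nodes also has 8 GPUs, which the text does not state, and `tokens_per_update` is a hypothetical helper.

```python
def tokens_per_update(max_tokens, update_freq, num_gpus):
    """Upper bound on the tokens contributing to one optimizer step."""
    return max_tokens * update_freq * num_gpus


single_node = tokens_per_update(max_tokens=2048, update_freq=4, num_gpus=8)
# Assumption: each of the 4 nodes has 8 GPUs, as in the single-node setup
four_nodes = tokens_per_update(max_tokens=2048, update_freq=1, num_gpus=4 * 8)
print(single_node, four_nodes)  # 65536 65536
```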
+
+Use `TOTAL_NUM_UPDATES=15000 UPDATE_FREQ=2` for the XSum task.
+
+### Inference for CNN-DM test data using the checkpoint trained above
+After training the model as described in the previous step, you can perform inference with the checkpoints in the `checkpoints/` directory using `summarize.py`, for example:
+
+```bash
+cp data-bin/cnn_dm/dict.source.txt  checkpoints/
+python examples/bart/summarize.py \
+  --model-dir checkpoints \
+  --model-file checkpoint_best.pt \
+  --src cnn_dm/test.source \
+  --out cnn_dm/test.hypo
+```
+For XSUM, which uses beam=6, lenpen=1.0, max_len_b=60, min_len=10:
+```bash
+cp data-bin/cnn_dm/dict.source.txt  checkpoints/
+python examples/bart/summarize.py \
+  --model-dir checkpoints \
+  --model-file checkpoint_best.pt \
+  --src cnn_dm/test.source \
+  --out cnn_dm/test.hypo \
+  --xsum-kwargs
+```
--- a/fairseq/examples/bart/summarize.py
+++ b/fairseq/examples/bart/summarize.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+from fairseq.models.bart import BARTModel
+import argparse
+
+XSUM_KWARGS = dict(beam=6, lenpen=1.0, max_len_b=60, min_len=10, no_repeat_ngram_size=3)
+CNN_KWARGS = dict(beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
+
+
+@torch.no_grad()
+def generate(bart, infile, outfile="bart_hypo.txt", bsz=32, n_obs=None, **eval_kwargs):
+    count = 1
+
+    # if n_obs is not None: bsz = min(bsz, n_obs)
+
+    with open(infile) as source, open(outfile, "w") as fout:
+        sline = source.readline().strip()
+        slines = [sline]
+        for sline in source:
+            if n_obs is not None and count > n_obs:
+                break
+            if count % bsz == 0:
+                hypotheses_batch = bart.sample(slines, **eval_kwargs)
+                for hypothesis in hypotheses_batch:
+                    fout.write(hypothesis + "\n")
+                    fout.flush()
+                slines = []
+
+            slines.append(sline.strip())
+            count += 1
+
+        if slines != []:
+            hypotheses_batch = bart.sample(slines, **eval_kwargs)
+            for hypothesis in hypotheses_batch:
+                fout.write(hypothesis + "\n")
+                fout.flush()
+
+
+def main():
+    """
+    Usage::
+
+         python examples/bart/summarize.py \
+            --model-dir $HOME/bart.large.cnn \
+            --model-file model.pt \
+            --src $HOME/data-bin/cnn_dm/test.source
+    """
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--model-dir",
+        required=True,
+        type=str,
+        default="bart.large.cnn/",
+        help="path containing model file and src_dict.txt",
+    )
+    parser.add_argument(
+        "--model-file",
+        default="checkpoint_best.pt",
+        help="where in model_dir are weights saved",
+    )
+    parser.add_argument(
+        "--src", default="test.source", help="text to summarize", type=str
+    )
+    parser.add_argument(
+        "--out", default="test.hypo", help="where to save summaries", type=str
+    )
+    parser.add_argument("--bsz", default=32, help="batch size for generation", type=int)
+    parser.add_argument(
+        "--n", default=None, help="how many examples to summarize", type=int
+    )
+    parser.add_argument(
+        "--xsum-kwargs",
+        action="store_true",
+        default=False,
+        help="if true use XSUM_KWARGS else CNN_KWARGS",
+    )
+    args = parser.parse_args()
+    eval_kwargs = XSUM_KWARGS if args.xsum_kwargs else CNN_KWARGS
+    if args.model_dir == "pytorch/fairseq":
+        bart = torch.hub.load("pytorch/fairseq", args.model_file)
+    else:
+        bart = BARTModel.from_pretrained(
+            args.model_dir,
+            checkpoint_file=args.model_file,
+            data_name_or_path=args.model_dir,
+        )
+    bart = bart.eval()
+    if torch.cuda.is_available():
+        bart = bart.cuda().half()
+    generate(
+        bart, args.src, bsz=args.bsz, n_obs=args.n, outfile=args.out, **eval_kwargs
+    )
+
+
+if __name__ == "__main__":
+    main()
--- a/fairseq/examples/byte_level_bpe/README.md
+++ b/fairseq/examples/byte_level_bpe/README.md
+# Neural Machine Translation with Byte-Level Subwords
+
+https://arxiv.org/abs/1909.03341
+
+We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as
+example.
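Note on the idea behind byte-level subwords: text is first viewed as a sequence of UTF-8 bytes, so a non-ASCII character becomes several symbols from a fixed set of 256. The sketch below only illustrates that byte view. It is not fairseq's `byte_encode`, which uses its own representation, and `to_byte_tokens` is a hypothetical helper.

```python
def to_byte_tokens(text):
    """Split text into its UTF-8 bytes, shown as two-digit hex strings."""
    return [f"{b:02x}" for b in text.encode("utf-8")]


word = "\u00e9t\u00e9"  # the French word for "summer": e-acute, t, e-acute
print(to_byte_tokens(word))  # ['c3', 'a9', '74', 'c3', 'a9']
print(len(word), len(to_byte_tokens(word)))  # 3 5
```

Three characters become five byte symbols; BPE merges are then learned over such byte sequences rather than over characters.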
+
+## Data
+Get data and generate fairseq binary dataset:
+```bash
+bash ./get_data.sh
+```
+
+## Model Training
+Train Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):
+```bash
+# VOCAB=bytes
+# VOCAB=chars
+VOCAB=bbpe2048
+# VOCAB=bpe2048
+# VOCAB=bbpe4096
+# VOCAB=bpe4096
+# VOCAB=bpe16384
+```
+```bash
+fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
+    --arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
+    --optimizer adam --adam-betas '(0.9, 0.98)' \
+    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
+    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
+    --log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
+    --batch-size 100 --max-update 100000 --update-freq 2
+```
+
+## Generation
+`fairseq-generate` requires a bytes (BBPE) decoder to convert the byte-level representation back to characters:
+```bash
+# BPE='--bpe bytes'
+# BPE='--bpe characters'
+BPE='--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model'
+# BPE='--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model'
+# BPE='--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model'
+# BPE='--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model'
+# BPE='--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model'
+```
+
+```bash
+fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
+    --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
+    --tokenizer moses --moses-target-lang en ${BPE}
+```
+When using `fairseq-interactive`, a bytes (BBPE) encoder/decoder is required to tokenize the input data and detokenize the model predictions:
+```bash
+fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
+    --path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
+    --moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
+```
+
+## Results
+| Vocabulary    | Model  | BLEU |
+|:-------------:|:-------------:|:-------------:|
+| Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
+| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
+| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
+| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
+| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
+| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
+| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
+| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |
+
+
+## Citation
+```bibtex
+@misc{wang2019neural,
+    title={Neural Machine Translation with Byte-Level Subwords},
+    author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
+    year={2019},
+    eprint={1909.03341},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+
+
+## Contact
+Changhan Wang ([changhan@fb.com](mailto:changhan@fb.com)),
+Kyunghyun Cho ([kyunghyuncho@fb.com](mailto:kyunghyuncho@fb.com)),
+Jiatao Gu ([jgu@fb.com](mailto:jgu@fb.com))
--- a/fairseq/examples/byte_level_bpe/get_bitext.py
+++ b/fairseq/examples/byte_level_bpe/get_bitext.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+
+import argparse
+import os
+import os.path as op
+from collections import namedtuple
+from multiprocessing import cpu_count
+from typing import List, Optional
+
+import sentencepiece as sp
+from fairseq.data.encoders.byte_bpe import ByteBPE
+from fairseq.data.encoders.byte_utils import byte_encode
+from fairseq.data.encoders.bytes import Bytes
+from fairseq.data.encoders.characters import Characters
+from fairseq.data.encoders.moses_tokenizer import MosesTokenizer
+from fairseq.data.encoders.sentencepiece_bpe import SentencepieceBPE
+
+
+SPLITS = ["train", "valid", "test"]
+
+
+def _convert_xml(in_path: str, out_path: str):
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            ss = s.strip()
+            if not ss.startswith("<seg"):
+                continue
+            ss = ss.replace("</seg>", "").split('">')
+            assert len(ss) == 2
+            f_o.write(ss[1].strip() + "\n")
+
+
+def _convert_train(in_path: str, out_path: str):
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            ss = s.strip()
+            if ss.startswith("<"):
+                continue
+            f_o.write(ss.strip() + "\n")
+
+
+def _get_bytes(in_path: str, out_path: str):
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            f_o.write(Bytes.encode(s.strip()) + "\n")
+
+
+def _get_chars(in_path: str, out_path: str):
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            f_o.write(Characters.encode(s.strip()) + "\n")
+
+
+def pretokenize(in_path: str, out_path: str, src: str, tgt: str):
+    Args = namedtuple(
+        "Args",
+        [
+            "moses_source_lang",
+            "moses_target_lang",
+            "moses_no_dash_splits",
+            "moses_no_escape",
+        ],
+    )
+    args = Args(
+        moses_source_lang=src,
+        moses_target_lang=tgt,
+        moses_no_dash_splits=False,
+        moses_no_escape=False,
+    )
+    pretokenizer = MosesTokenizer(args)
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            f_o.write(pretokenizer.encode(s.strip()) + "\n")
+
+
+def _convert_to_bchar(in_path_prefix: str, src: str, tgt: str, out_path: str):
+    with open(out_path, "w") as f_o:
+        for lang in [src, tgt]:
+            with open(f"{in_path_prefix}.{lang}") as f:
+                for s in f:
+                    f_o.write(byte_encode(s.strip()) + "\n")
+
+
+def _get_bpe(in_path: str, model_prefix: str, vocab_size: int):
+    arguments = [
+        f"--input={in_path}",
+        f"--model_prefix={model_prefix}",
+        "--model_type=bpe",
+        f"--vocab_size={vocab_size}",
+        "--character_coverage=1.0",
+        "--normalization_rule_name=identity",
+        f"--num_threads={cpu_count()}",
+    ]
+    sp.SentencePieceTrainer.Train(" ".join(arguments))
+
+
+def _apply_bbpe(model_path: str, in_path: str, out_path: str):
+    Args = namedtuple("Args", ["sentencepiece_model_path"])
+    args = Args(sentencepiece_model_path=model_path)
+    tokenizer = ByteBPE(args)
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            f_o.write(tokenizer.encode(s.strip()) + "\n")
+
+
+def _apply_bpe(model_path: str, in_path: str, out_path: str):
+    Args = namedtuple("Args", ["sentencepiece_model"])
+    args = Args(sentencepiece_model=model_path)
+    tokenizer = SentencepieceBPE(args)
+    with open(in_path) as f, open(out_path, "w") as f_o:
+        for s in f:
+            f_o.write(tokenizer.encode(s.strip()) + "\n")
+
+
+def _concat_files(in_paths: List[str], out_path: str):
+    with open(out_path, "w") as f_o:
+        for p in in_paths:
+            with open(p) as f:
+                for r in f:
+                    f_o.write(r)
+
+
+def preprocess_iwslt17(
+    root: str,
+    src: str,
+    tgt: str,
+    bpe_size: Optional[int],
+    need_chars: bool,
+    bbpe_size: Optional[int],
+    need_bytes: bool,
+):
+    # extract bitext
+    in_root = op.join(root, f"{src}-{tgt}")
+    for lang in [src, tgt]:
+        _convert_train(
+            op.join(in_root, f"train.tags.{src}-{tgt}.{lang}"),
+            op.join(root, f"train.{lang}"),
+        )
+        _convert_xml(
+            op.join(in_root, f"IWSLT17.TED.dev2010.{src}-{tgt}.{lang}.xml"),
+            op.join(root, f"valid.{lang}"),
+        )
+        _convert_xml(
+            op.join(in_root, f"IWSLT17.TED.tst2015.{src}-{tgt}.{lang}.xml"),
+            op.join(root, f"test.{lang}"),
+        )
+    # pre-tokenize
+    for lang in [src, tgt]:
+        for split in SPLITS:
+            pretokenize(
+                op.join(root, f"{split}.{lang}"),
+                op.join(root, f"{split}.moses.{lang}"),
+                src,
+                tgt,
+            )
+    # tokenize with BPE vocabulary
+    if bpe_size is not None:
+        # learn vocabulary
+        concated_train_path = op.join(root, "train.all")
+        _concat_files(
+            [op.join(root, f"train.moses.{src}"), op.join(root, f"train.moses.{tgt}")],
+            concated_train_path,
+        )
+        bpe_model_prefix = op.join(root, f"spm_bpe{bpe_size}")
+        _get_bpe(concated_train_path, bpe_model_prefix, bpe_size)
+        os.remove(concated_train_path)
+        # apply
+        for lang in [src, tgt]:
+            for split in SPLITS:
+                _apply_bpe(
+                    bpe_model_prefix + ".model",
+                    op.join(root, f"{split}.moses.{lang}"),
+                    op.join(root, f"{split}.moses.bpe{bpe_size}.{lang}"),
+                )
+    # tokenize with bytes vocabulary
+    if need_bytes:
+        for lang in [src, tgt]:
+            for split in SPLITS:
+                _get_bytes(
+                    op.join(root, f"{split}.moses.{lang}"),
+                    op.join(root, f"{split}.moses.bytes.{lang}"),
+                )
+    # tokenize with characters vocabulary
+    if need_chars:
+        for lang in [src, tgt]:
+            for split in SPLITS:
+                _get_chars(
+                    op.join(root, f"{split}.moses.{lang}"),
+                    op.join(root, f"{split}.moses.chars.{lang}"),
+                )
+    # tokenize with byte-level BPE vocabulary
+    if bbpe_size is not None:
+        # learn vocabulary
+        bchar_path = op.join(root, "train.bchar")
+        _convert_to_bchar(op.join(root, "train.moses"), src, tgt, bchar_path)
+        bbpe_model_prefix = op.join(root, f"spm_bbpe{bbpe_size}")
+        _get_bpe(bchar_path, bbpe_model_prefix, bbpe_size)
+        os.remove(bchar_path)
+        # apply
+        for lang in [src, tgt]:
+            for split in SPLITS:
+                _apply_bbpe(
+                    bbpe_model_prefix + ".model",
+                    op.join(root, f"{split}.moses.{lang}"),
+                    op.join(root, f"{split}.moses.bbpe{bbpe_size}.{lang}"),
+                )
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--root", type=str, default="data")
+    parser.add_argument(
+        "--bpe-vocab",
+        default=None,
+        type=int,
+        help="Generate tokenized bitext with BPE of size K. "
+        "Default to None (disabled).",
+    )
+    parser.add_argument(
+        "--bbpe-vocab",
+        default=None,
+        type=int,
+        help="Generate tokenized bitext with BBPE of size K. "
+        "Default to None (disabled).",
+    )
+    parser.add_argument(
+        "--byte-vocab",
+        action="store_true",
+        help="Generate tokenized bitext with bytes vocabulary",
+    )
+    parser.add_argument(
+        "--char-vocab",
+        action="store_true",
+        help="Generate tokenized bitext with chars vocabulary",
+    )
+    args = parser.parse_args()
+
+    preprocess_iwslt17(
+        args.root,
+        "fr",
+        "en",
+        args.bpe_vocab,
+        args.char_vocab,
+        args.bbpe_vocab,
+        args.byte_vocab,
+    )
+
+
+if __name__ == "__main__":
+    main()
--- a/fairseq/examples/byte_level_bpe/get_data.sh
+++ b/fairseq/examples/byte_level_bpe/get_data.sh
+#!/bin/bash
+
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+PY_BIN_ROOT=
+
+# PyPI dependency
+${PY_BIN_ROOT}pip install sentencepiece sacremoses
+
+# Get data
+if [ ! -d "data" ]; then
+  mkdir data
+fi
+
+if [ ! -f "data/fr-en.tgz" ]; then
+  wget https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz -P data
+  tar xvf data/fr-en.tgz -C data
+fi
+${PY_BIN_ROOT}python get_bitext.py --bpe-vocab 16384 --byte-vocab --char-vocab
+for VOCAB_SIZE in 2048 4096; do
+  ${PY_BIN_ROOT}python get_bitext.py --bpe-vocab ${VOCAB_SIZE} --bbpe-vocab ${VOCAB_SIZE}
+done
+rm -r data/fr-en data/fr-en.tgz
+
+# Generate binary dataset
+${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_bpe16384 --joined-dictionary \
+  --workers "$(nproc)" --trainpref data/train.moses.bpe16384 --validpref data/valid.moses.bpe16384 \
+  --testpref data/test.moses.bpe16384
+
+${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_bytes --joined-dictionary \
+  --workers "$(nproc)" --trainpref data/train.moses.bytes --validpref data/valid.moses.bytes \
+  --testpref data/test.moses.bytes
+
+${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_chars --joined-dictionary \
+  --workers "$(nproc)" --trainpref data/train.moses.chars --validpref data/valid.moses.chars \
+  --testpref data/test.moses.chars
+
+for VOCAB_SIZE in 2048 4096; do
+  for TYPE in bbpe bpe; do
+    ${PY_BIN_ROOT}fairseq-preprocess --source-lang fr --target-lang en --destdir "data/bin_${TYPE}${VOCAB_SIZE}" \
+      --joined-dictionary --workers "$(nproc)" --trainpref "data/train.moses.${TYPE}${VOCAB_SIZE}" \
+      --validpref "data/valid.moses.${TYPE}${VOCAB_SIZE}" --testpref "data/test.moses.${TYPE}${VOCAB_SIZE}"
+  done
+done
--- a/fairseq/examples/byte_level_bpe/gru_transformer.py
+++ b/fairseq/examples/byte_level_bpe/gru_transformer.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch.nn as nn
+import torch.nn.functional as F
+from fairseq.models import register_model, register_model_architecture
+from fairseq.models.transformer import TransformerEncoder, TransformerModel
+
+
+@register_model("gru_transformer")
+class GRUTransformerModel(TransformerModel):
+    @classmethod
+    def build_encoder(cls, args, src_dict, embed_tokens):
+        return GRUTransformerEncoder(args, src_dict, embed_tokens)
+
+
+class GRUTransformerEncoder(TransformerEncoder):
+    def __init__(self, args, dictionary, embed_tokens):
+        super().__init__(args, dictionary, embed_tokens)
+        self.emb_ctx = nn.GRU(
+            input_size=embed_tokens.embedding_dim,
+            hidden_size=embed_tokens.embedding_dim // 2,
+            num_layers=1,
+            bidirectional=True,
+        )
+
+    def forward_embedding(self, src_tokens):
+        # embed tokens and positions
+        x = embed = self.embed_scale * self.embed_tokens(src_tokens)
+        if self.embed_positions is not None:
+            x = embed + self.embed_positions(src_tokens)
+
+        # contextualize embeddings
+        x = x.transpose(0, 1)
+        x = self.dropout_module(x)
+        x, _ = self.emb_ctx(x)
+        x = x.transpose(0, 1)
+
+        if self.layernorm_embedding is not None:
+            x = self.layernorm_embedding(x)
+        x = self.dropout_module(x)
+        return x, embed
+
+
+@register_model_architecture("gru_transformer", "gru_transformer")
+def gru_transformer_base_architecture(args):
+    args.encoder_embed_path = getattr(args, "encoder_embed_path", None)
+    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 512)
+    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 2048)
+    args.encoder_layers = getattr(args, "encoder_layers", 6)
+    args.encoder_attention_heads = getattr(args, "encoder_attention_heads", 8)
+    args.encoder_normalize_before = getattr(args, "encoder_normalize_before", False)
+    args.encoder_learned_pos = getattr(args, "encoder_learned_pos", False)
+    args.decoder_embed_path = getattr(args, "decoder_embed_path", None)
+    args.decoder_embed_dim = getattr(args, "decoder_embed_dim", args.encoder_embed_dim)
+    args.decoder_ffn_embed_dim = getattr(
+        args, "decoder_ffn_embed_dim", args.encoder_ffn_embed_dim
+    )
+    args.decoder_layers = getattr(args, "decoder_layers", 6)
+    args.decoder_attention_heads = getattr(args, "decoder_attention_heads", 8)
+    args.decoder_normalize_before = getattr(args, "decoder_normalize_before", False)
+    args.decoder_learned_pos = getattr(args, "decoder_learned_pos", False)
+    args.attention_dropout = getattr(args, "attention_dropout", 0.0)
+    args.activation_dropout = getattr(args, "activation_dropout", 0.0)
+    args.activation_fn = getattr(args, "activation_fn", "relu")
+    args.dropout = getattr(args, "dropout", 0.1)
+    args.adaptive_softmax_cutoff = getattr(args, "adaptive_softmax_cutoff", None)
+    args.adaptive_softmax_dropout = getattr(args, "adaptive_softmax_dropout", 0)
+    args.share_decoder_input_output_embed = getattr(
+        args, "share_decoder_input_output_embed", False
+    )
+    args.share_all_embeddings = getattr(args, "share_all_embeddings", False)
+    args.no_token_positional_embeddings = getattr(
+        args, "no_token_positional_embeddings", False
+    )
+    args.adaptive_input = getattr(args, "adaptive_input", False)
+    args.no_cross_attention = getattr(args, "no_cross_attention", False)
+    args.cross_self_attention = getattr(args, "cross_self_attention", False)
+    args.layer_wise_attention = getattr(args, "layer_wise_attention", False)
+
+    args.decoder_output_dim = getattr(
+        args, "decoder_output_dim", args.decoder_embed_dim
+    )
+    args.decoder_input_dim = getattr(args, "decoder_input_dim", args.decoder_embed_dim)
+
+    args.no_scale_embedding = getattr(args, "no_scale_embedding", False)
+    args.layernorm_embedding = getattr(args, "layernorm_embedding", False)
+
+
+@register_model_architecture("gru_transformer", "gru_transformer_big")
+def gru_transformer_big(args):
+    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 1024)
+    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4096)
+    args.encoder_attention_heads = getattr(args, "encoder_attention_heads", 16)
+    args.encoder_normalize_before = getattr(args, "encoder_normalize_before", False)
+    args.decoder_embed_dim = getattr(args, "decoder_embed_dim", 1024)
+    args.decoder_ffn_embed_dim = getattr(args, "decoder_ffn_embed_dim", 4096)
+    args.decoder_attention_heads = getattr(args, "decoder_attention_heads", 16)
+    args.dropout = getattr(args, "dropout", 0.3)
+    gru_transformer_base_architecture(args)
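The two architecture functions above rely on the `getattr(args, name, default)` idiom: a value is only filled in when the caller has not already set it, so the "big" variant can set its own defaults first and then delegate to the base function for everything else. A minimal sketch of that pattern using a plain `argparse.Namespace`; the function and attribute names here are hypothetical stand-ins, not fairseq's registered architectures:

```python
from argparse import Namespace


def base_architecture(args: Namespace) -> None:
    # Fill in a default only if the attribute is not already present.
    args.embed_dim = getattr(args, "embed_dim", 512)
    args.dropout = getattr(args, "dropout", 0.1)


def big_architecture(args: Namespace) -> None:
    # Set the "big" defaults first, then let the base function
    # supply every attribute that is still missing.
    args.embed_dim = getattr(args, "embed_dim", 1024)
    args.dropout = getattr(args, "dropout", 0.3)
    base_architecture(args)


# A user-supplied value (dropout) wins over both sets of defaults.
args = Namespace(dropout=0.2)
big_architecture(args)
print(args.embed_dim, args.dropout)
```

Because the base function runs last, it never overrides values chosen by the user or by the "big" variant.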
--- a/fairseq/examples/camembert/README.md
+++ b/fairseq/examples/camembert/README.md
+# CamemBERT: a Tasty French Language Model
+
+## Introduction
+
+[CamemBERT](https://arxiv.org/abs/1911.03894) is a language model based on RoBERTa and pretrained on 138GB of French text.
+
+Also available in [github.com/huggingface/transformers](https://github.com/huggingface/transformers/).
+
+## Pre-trained models
+
+| Model                          | #params | Download                                                                                                                 | Arch. | Training data                     |
+|--------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------|-------|-----------------------------------|
+| `camembert` / `camembert-base` | 110M    | [camembert-base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base.tar.gz)                             | Base  | OSCAR (138 GB of text)            |
+| `camembert-large`              | 335M    | [camembert-large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-large.tar.gz)                           | Large | CCNet (135 GB of text)            |
+| `camembert-base-ccnet`         | 110M    | [camembert-base-ccnet.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-ccnet.tar.gz)                 | Base  | CCNet (135 GB of text)            |
+| `camembert-base-wikipedia-4gb` | 110M    | [camembert-base-wikipedia-4gb.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-wikipedia-4gb.tar.gz) | Base  | Wikipedia (4 GB of text)          |
+| `camembert-base-oscar-4gb`     | 110M    | [camembert-base-oscar-4gb.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-oscar-4gb.tar.gz)         | Base  | Subsample of OSCAR (4 GB of text) |
+| `camembert-base-ccnet-4gb`     | 110M    | [camembert-base-ccnet-4gb.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-ccnet-4gb.tar.gz)         | Base  | Subsample of CCNet (4 GB of text) |
+
+## Example usage
+
+### fairseq
+##### Load CamemBERT from torch.hub (PyTorch >= 1.1):
+```python
+import torch
+camembert = torch.hub.load('pytorch/fairseq', 'camembert')
+camembert.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Load CamemBERT (for PyTorch 1.0 or custom models):
+```python
+# Download camembert model
+wget https://dl.fbaipublicfiles.com/fairseq/models/camembert-base.tar.gz
+tar -xzvf camembert-base.tar.gz
+
+# Load the model in fairseq
+from fairseq.models.roberta import CamembertModel
+camembert = CamembertModel.from_pretrained('/path/to/camembert')
+camembert.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Filling masks:
+```python
+masked_line = 'Le camembert est <mask> :)'
+camembert.fill_mask(masked_line, topk=3)
+# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
+#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
+#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
+```
+
+##### Extract features from CamemBERT:
+```python
+# Extract the last layer's features
+line = "J'aime le camembert !"
+tokens = camembert.encode(line)
+last_layer_features = camembert.extract_features(tokens)
+assert last_layer_features.size() == torch.Size([1, 10, 768])
+
+# Extract all layers' features (layer 0 is the embedding layer)
+all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
+assert len(all_layers) == 13
+assert torch.all(all_layers[-1] == last_layer_features)
+```
+
+## Citation
+If you use our work, please cite:
+
+```bibtex
+@inproceedings{martin2020camembert,
+  title={CamemBERT: a Tasty French Language Model},
+  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+  year={2020}
+}
+```
--- a/fairseq/examples/constrained_decoding/README.md
+++ b/fairseq/examples/constrained_decoding/README.md
+# (Vectorized) Lexically constrained decoding with dynamic beam allocation
+
+This page explains how to use lexically constrained decoding in Fairseq.
+Fairseq implements the algorithms described in the following papers:
+
+* [Fast Lexically Constrained Decoding With Dynamic Beam Allocation](https://www.aclweb.org/anthology/N18-1119/) (Post & Vilar, 2018)
+* [Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting](https://www.aclweb.org/anthology/N19-1090/) (Hu et al., 2019)
+
+## Quick start
+
+Constrained search is enabled by adding the command-line argument `--constraints` to `fairseq-interactive`.
+Constraints are appended to each line of input, separated by tabs. Each constraint (one or more tokens)
+is a separate field.
+
+The following command, using [Fairseq's WMT19 German--English model](https://github.com/pytorch/fairseq/blob/main/examples/wmt19/README.md),
+translates the sentence *Die maschinelle Übersetzung ist schwer zu kontrollieren.* with the constraints
+"hard" and "to influence".
+
+    echo -e "Die maschinelle Übersetzung ist schwer zu kontrollieren.\thard\tto influence" \
+    | normalize.py | tok.py \
+    | fairseq-interactive /path/to/model \
+      --path /path/to/model/model1.pt \
+      --bpe fastbpe \
+      --bpe-codes /path/to/model/bpecodes \
+      --constraints \
+      -s de -t en \
+      --beam 10
+
+(`tok.py` and `normalize.py` can be found in the same directory as this README; they are just shortcuts around Fairseq's WMT19 preprocessing.)
+This will generate the following output:
+
+    [snip]
+    S-0     Die masch@@ in@@ elle Über@@ setzung ist schwer zu kontrollieren .
+    W-0     1.844   seconds
+    C-0     hard
+    C-0     influence
+    H-0     -1.5333266258239746     Mach@@ ine trans@@ lation is hard to influence .
+    D-0     -1.5333266258239746     Machine translation is hard to influence .
+    P-0     -0.5434 -0.1423 -0.1930 -0.1415 -0.2346 -1.8031 -0.1701 -11.7727 -0.1815 -0.1511
+
+By default, constraints are generated in the order supplied, with any number (zero or more) of tokens generated
+between constraints. If you wish for the decoder to order the constraints, then use `--constraints unordered`.
+Note that you may want to use a larger beam.
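The input format described above (a source sentence followed by its constraints, each in its own tab-separated field) can be assembled programmatically before piping lines into `fairseq-interactive`. A minimal sketch; the helper name `format_constrained_input` is hypothetical and not part of Fairseq:

```python
from typing import Iterable


def format_constrained_input(sentence: str, constraints: Iterable[str]) -> str:
    """Join a source sentence and its constraints into one tab-separated line."""
    fields = [sentence.strip()]
    for constraint in constraints:
        constraint = constraint.strip()
        if "\t" in constraint:
            # A tab inside a constraint would be read as a field separator.
            raise ValueError("a constraint must not contain a tab character")
        if constraint:  # skip empty constraints
            fields.append(constraint)
    return "\t".join(fields)


line = format_constrained_input(
    "Die maschinelle Übersetzung ist schwer zu kontrollieren.",
    ["hard", "to influence"],
)
print(line)
```

Each constraint may contain several space-separated tokens; only tabs separate one constraint from the next.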
+
+## Implementation details
+
+The heart of the implementation is in `fairseq/search.py`, which adds a `LexicallyConstrainedBeamSearch` class.
+This variant of beam search tracks the progress of each hypothesis in the beam through the set of constraints
+provided for each input sentence. It does this using one of two classes, both found in `fairseq/token_generation_constraints.py`:
+
+* OrderedConstraintState: assumes the `C` input constraints will be generated in the provided order
+* UnorderedConstraintState: tries to apply `C` (phrasal) constraints in all `C!` orders
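To make the difference between the two modes concrete, the sketch below checks a finished hypothesis against a list of phrasal constraints, first in the supplied order and then in any of the `C!` orders. This is a deliberately simplified illustration with hypothetical function names: Fairseq's constraint-state classes track progress incrementally during beam search and do not work by post-hoc checking like this.

```python
from itertools import permutations
from typing import List, Sequence


def find_phrase(tokens: Sequence[str], phrase: Sequence[str], start: int) -> int:
    """Return the index just past the first match of `phrase` at or after `start`, or -1."""
    n = len(phrase)
    for i in range(start, len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(phrase):
            return i + n
    return -1


def satisfied_in_order(tokens: Sequence[str], constraints: List[List[str]]) -> bool:
    """True if every constraint appears, non-overlapping, in the given order."""
    pos = 0
    for phrase in constraints:
        pos = find_phrase(tokens, phrase, pos)
        if pos < 0:
            return False
    return True


def satisfied_in_any_order(tokens: Sequence[str], constraints: List[List[str]]) -> bool:
    """True if some ordering of the constraints is satisfied (tries all C! orders)."""
    return any(
        satisfied_in_order(tokens, list(order)) for order in permutations(constraints)
    )


hyp = "Machine translation is hard to influence .".split()
print(satisfied_in_order(hyp, [["hard"], ["to", "influence"]]))
print(satisfied_in_order(hyp, [["to", "influence"], ["hard"]]))
print(satisfied_in_any_order(hyp, [["to", "influence"], ["hard"]]))
```

The factorial number of orderings in the unordered check is one reason a larger beam can help when the decoder is left to order the constraints itself.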
+
+## Differences from Sockeye
+
+There are a number of [differences from Sockeye's implementation](https://awslabs.github.io/sockeye/inference.html#lexical-constraints).
+
+* Generating constraints in the order supplied (the default option here) is not available in Sockeye.
+* Due to an improved beam allocation method, there is no need to prune the beam.
+* Again due to better allocation, beam sizes as low as 10 or even 5 are often sufficient.
+* [The vector extensions described in Hu et al.](https://github.com/edwardjhu/sockeye/tree/trie_constraints) (NAACL 2019) were never merged
+  into the main Sockeye branch.
+
+## Citation
+
+The paper first describing lexical constraints for seq2seq decoding is:
+
+```bibtex
+@inproceedings{hokamp-liu-2017-lexically,
+  title = "Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search",
+  author = "Hokamp, Chris  and
+    Liu, Qun",
+  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+  month = jul,
+  year = "2017",
+  address = "Vancouver, Canada",
+  publisher = "Association for Computational Linguistics",
+  url = "https://www.aclweb.org/anthology/P17-1141",
+  doi = "10.18653/v1/P17-1141",
+  pages = "1535--1546",
+}
+```
+
+The fairseq implementation uses the extensions described in
+
+```bibtex
+@inproceedings{post-vilar-2018-fast,
+    title = "Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation",
+    author = "Post, Matt  and
+      Vilar, David",
+    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
+    month = jun,
+    year = "2018",
+    address = "New Orleans, Louisiana",
+    publisher = "Association for Computational Linguistics",
+    url = "https://www.aclweb.org/anthology/N18-1119",
+    doi = "10.18653/v1/N18-1119",
+    pages = "1314--1324",
+}
+```
+
+and
+
+```bibtex
+@inproceedings{hu-etal-2019-improved,
+  title = "Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting",
+  author = "Hu, J. Edward  and
+    Khayrallah, Huda  and
+    Culkin, Ryan  and
+    Xia, Patrick  and
+    Chen, Tongfei  and
+    Post, Matt  and
+    Van Durme, Benjamin",
+  booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
+  month = jun,
+  year = "2019",
+  address = "Minneapolis, Minnesota",
+  publisher = "Association for Computational Linguistics",
+  url = "https://www.aclweb.org/anthology/N19-1090",
+  doi = "10.18653/v1/N19-1090",
+  pages = "839--850",
+}
+```
--- a/fairseq/examples/constrained_decoding/normalize.py
+++ b/fairseq/examples/constrained_decoding/normalize.py
+#!/usr/bin/env python3
+#
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import sys
+
+from sacremoses.normalize import MosesPunctNormalizer
+
+
+def main(args):
+    normalizer = MosesPunctNormalizer(lang=args.lang, penn=args.penn)
+    for line in sys.stdin:
+        print(normalizer.normalize(line.rstrip()), flush=True)
+
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--lang", "-l", default="en")
+    parser.add_argument("--penn", "-p", action="store_true")
+    args = parser.parse_args()
+
+    main(args)
--- a/fairseq/examples/constrained_decoding/tok.py
+++ b/fairseq/examples/constrained_decoding/tok.py
+#!/usr/bin/env python3
+#
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import sys
+
+import sacremoses
+
+
+def main(args):
+    """Tokenizes, preserving tabs"""
+    mt = sacremoses.MosesTokenizer(lang=args.lang)
+
+    def tok(s):
+        return mt.tokenize(s, return_str=True)
+
+    for line in sys.stdin:
+        parts = list(map(tok, line.split("\t")))
+        print(*parts, sep="\t", flush=True)
+
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--lang", "-l", default="en")
+    parser.add_argument("--penn", "-p", action="store_true")
+    parser.add_argument("--fields", "-f", help="fields to tokenize")
+    args = parser.parse_args()
+
+    main(args)
--- a/fairseq/examples/conv_seq2seq/README.md
+++ b/fairseq/examples/conv_seq2seq/README.md
+# Convolutional Sequence to Sequence Learning (Gehring et al., 2017)
+
+## Pre-trained models
+
+Description | Dataset | Model | Test set(s)
+---|---|---|---
+Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2) <br> newstest2012/2013: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.ntst1213.tar.bz2)
+Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-de.newstest2014.tar.bz2)
+Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT17 English-German](http://statmt.org/wmt17/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt17.v2.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.v2.en-de.newstest2014.tar.bz2)
+
+## Example usage
+
+See the [translation README](../translation/README.md) for instructions on reproducing results for WMT'14 En-De and
+WMT'14 En-Fr using the `fconv_wmt_en_de` and `fconv_wmt_en_fr` model architectures.
+
+## Citation
+
+```bibtex
+@inproceedings{gehring2017convs2s,
+  title = {Convolutional Sequence to Sequence Learning},
+  author = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
+  booktitle = {Proc. of ICML},
+  year = 2017,
+}
+```
--- a/fairseq/examples/criss/README.md
+++ b/fairseq/examples/criss/README.md
+# Cross-lingual Retrieval for Iterative Self-Supervised Training
+
+https://arxiv.org/pdf/2006.09526.pdf
+
+## Introduction
+
+CRISS is a multilingual sequence-to-sequence pretraining method in which mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.
+
+## Requirements:
+
+* faiss: https://github.com/facebookresearch/faiss
+* mosesdecoder: https://github.com/moses-smt/mosesdecoder
+* flores: https://github.com/facebookresearch/flores
+* LASER: https://github.com/facebookresearch/LASER
+
+## Unsupervised Machine Translation
+##### 1. Download and decompress CRISS checkpoints
+```
+cd examples/criss
+wget https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz
+tar -xf criss_3rd_checkpoints.tar.gz
+```
+##### 2. Download and preprocess Flores test dataset
+Make sure to run all scripts from the `examples/criss` directory.
+```
+bash download_and_preprocess_flores_test.sh
+```
+
+##### 3. Run Evaluation on Sinhala-English
+```
+bash unsupervised_mt/eval.sh
+```
+
+## Sentence Retrieval
+##### 1. Download and preprocess Tatoeba dataset
+```
+bash download_and_preprocess_tatoeba.sh
+```
+
+##### 2. Run Sentence Retrieval on Tatoeba Kazakh-English
+```
+bash sentence_retrieval/sentence_retrieval_tatoeba.sh
+```
+
+## Mining
+##### 1. Install faiss
+Follow instructions on https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
+##### 2. Mine pseudo-parallel data between Kazakh and English
+```
+bash mining/mine_example.sh
+```
+
+## Citation
+```bibtex
+@article{tran2020cross,
+  title={Cross-lingual retrieval for iterative self-supervised training},
+  author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
+  journal={arXiv preprint arXiv:2006.09526},
+  year={2020}
+}
+```
--- a/fairseq/examples/criss/download_and_preprocess_flores_test.sh
+++ b/fairseq/examples/criss/download_and_preprocess_flores_test.sh
+#!/bin/bash
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+SPM_ENCODE=flores/scripts/spm_encode.py
+DATA=data_tmp
+SPM_MODEL=criss_checkpoints/sentence.bpe.model
+DICT=criss_checkpoints/dict.txt
+
+download_data() {
+  CORPORA=$1
+  URL=$2
+
+  if [ -f $CORPORA ]; then
+    echo "$CORPORA already exists, skipping download"
+  else
+    echo "Downloading $URL"
+    wget $URL -O $CORPORA --no-check-certificate || rm -f $CORPORA
+    if [ -f $CORPORA ]; then
+      echo "$URL successfully downloaded."
+    else
+      echo "$URL not successfully downloaded."
+      rm -f $CORPORA
+    fi
+  fi
+}
+
+if [[ -d flores ]]; then
+  echo "flores already cloned"
+else
+  git clone https://github.com/facebookresearch/flores
+fi
+
+mkdir -p $DATA
+download_data $DATA/wikipedia_en_ne_si_test_sets.tgz "https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz"
+pushd $DATA
+pwd
+tar -vxf wikipedia_en_ne_si_test_sets.tgz
+popd
+
+
+for lang in ne_NP si_LK; do
+  datadir=$DATA/${lang}-en_XX-flores
+  rm -rf $datadir
+  mkdir -p $datadir
+  TEST_PREFIX=$DATA/wikipedia_en_ne_si_test_sets/wikipedia.test
+  python $SPM_ENCODE \
+    --model ${SPM_MODEL} \
+    --output_format=piece \
+    --inputs ${TEST_PREFIX}.${lang:0:2}-en.${lang:0:2} ${TEST_PREFIX}.${lang:0:2}-en.en \
+    --outputs $datadir/test.bpe.${lang}-en_XX.${lang} $datadir/test.bpe.${lang}-en_XX.en_XX
+
+  # binarize data
+  fairseq-preprocess \
+    --source-lang ${lang} --target-lang en_XX \
+    --testpref $datadir/test.bpe.${lang}-en_XX \
+    --destdir $datadir \
+    --srcdict ${DICT} \
+    --joined-dictionary \
+    --workers 4
+done
--- a/fairseq/examples/criss/download_and_preprocess_tatoeba.sh
+++ b/fairseq/examples/criss/download_and_preprocess_tatoeba.sh
+#!/bin/bash
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+SPM_ENCODE=flores/scripts/spm_encode.py
+DATA=data_tmp
+SPM_MODEL=criss_checkpoints/sentence.bpe.model
+DICT=criss_checkpoints/dict.txt
+
+if [[ -d flores ]]; then
+  echo "flores already cloned"
+else
+  git clone https://github.com/facebookresearch/flores
+fi
+if [[ -d LASER ]]; then
+  echo "LASER already cloned"
+else
+  git clone https://github.com/facebookresearch/LASER
+fi
+mkdir -p data_tmp
+declare -A lang_tatoeba_map=( ["ar_AR"]="ara" ["de_DE"]="deu"  ["es_XX"]="spa" ["et_EE"]="est" ["fi_FI"]="fin" ["fr_XX"]="fra" ["hi_IN"]="hin" ["it_IT"]="ita" ["ja_XX"]="jpn" ["ko_KR"]="kor" ["kk_KZ"]="kaz" ["nl_XX"]="nld" ["ru_RU"]="rus" ["tr_TR"]="tur" ["vi_VN"]="vie" ["zh_CN"]="cmn")
+for lang in ar_AR de_DE es_XX et_EE fi_FI fr_XX hi_IN it_IT ja_XX kk_KZ ko_KR nl_XX ru_RU tr_TR vi_VN zh_CN; do
+  lang_tatoeba=${lang_tatoeba_map[$lang]}
+  echo $lang_tatoeba
+  datadir=$DATA/${lang}-en_XX-tatoeba
+  rm -rf $datadir
+  mkdir -p $datadir
+  TEST_PREFIX=LASER/data/tatoeba/v1/tatoeba
+  python $SPM_ENCODE \
+    --model ${SPM_MODEL} \
+    --output_format=piece \
+    --inputs ${TEST_PREFIX}.${lang_tatoeba}-eng.${lang_tatoeba} ${TEST_PREFIX}.${lang_tatoeba}-eng.eng \
+    --outputs $datadir/test.bpe.${lang}-en_XX.${lang} $datadir/test.bpe.${lang}-en_XX.en_XX
+
+  # binarize data
+  fairseq-preprocess \
+    --source-lang ${lang} --target-lang en_XX \
+    --testpref $datadir/test.bpe.${lang}-en_XX \
+    --destdir $datadir \
+    --srcdict ${DICT} \
+    --joined-dictionary \
+    --workers 4
+done
--- a/fairseq/examples/criss/mining/mine.py
+++ b/fairseq/examples/criss/mining/mine.py
+#!/usr/bin/env python3 -u
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+import argparse
+import glob
+from subprocess import check_call
+
+try:
+    import faiss
+
+    has_faiss = True
+except ImportError:
+    has_faiss = False
+import numpy as np
+
+
+GB = 1024 * 1024 * 1024
+
+
+def call(cmd):
+    print(cmd)
+    check_call(cmd, shell=True)
+
+
+def get_batches(directory, lang, prefix="all_avg_pool"):
+    print(f"Finding in {directory}/{prefix}.{lang}*")
+    files = glob.glob(f"{directory}/{prefix}.{lang}*")
+    emb_files = []
+    txt_files = []
+    for emb_fi in files:
+        emb_files.append(emb_fi)
+        txt_fi = emb_fi.replace(prefix, "sentences")
+        txt_files.append(txt_fi)
+    return emb_files, txt_files
+
+
+def load_batch(emb_file, dim):
+    embeddings = np.fromfile(emb_file, dtype=np.float32)
+    num_rows = embeddings.shape[0] // dim
+    embeddings = embeddings.reshape((num_rows, dim))
+    faiss.normalize_L2(embeddings)
+    return embeddings
+
+
+def knnGPU_sharded(x_batches_f, y_batches_f, dim, k, direction="x2y"):
+    if not has_faiss:
+        raise ImportError("Please install Faiss")
+    sims = []
+    inds = []
+    xfrom = 0
+    xto = 0
+    for x_batch_f in x_batches_f:
+        yfrom = 0
+        yto = 0
+        x_batch = load_batch(x_batch_f, dim)
+        xto = xfrom + x_batch.shape[0]
+        bsims, binds = [], []
+        for y_batch_f in y_batches_f:
+            y_batch = load_batch(y_batch_f, dim)
+            neighbor_size = min(k, y_batch.shape[0])
+            yto = yfrom + y_batch.shape[0]
+            print("{}-{}  ->  {}-{}".format(xfrom, xto, yfrom, yto))
+            idx = faiss.IndexFlatIP(dim)
+            idx = faiss.index_cpu_to_all_gpus(idx)
+            idx.add(y_batch)
+            bsim, bind = idx.search(x_batch, neighbor_size)
+
+            bsims.append(bsim)
+            binds.append(bind + yfrom)
+            yfrom += y_batch.shape[0]
+            del idx
+            del y_batch
+        bsims = np.concatenate(bsims, axis=1)
+        binds = np.concatenate(binds, axis=1)
+        # Keep the k best neighbors across all y shards, best first.
+        aux = np.argsort(-bsims, axis=1)[:, :k]
+        sim_batch = np.take_along_axis(bsims, aux, axis=1)
+        ind_batch = np.take_along_axis(binds, aux, axis=1)
+        sims.append(sim_batch)
+        inds.append(ind_batch)
+        xfrom += x_batch.shape[0]
+        del x_batch
+    sim = np.concatenate(sims, axis=0)
+    ind = np.concatenate(inds, axis=0)
+    return sim, ind
+
+
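+def score(sim, fwd_mean, bwd_mean, margin):
+    # Margin score: similarity relative to the average of the forward and
+    # backward mean k-NN similarities.
+    return margin(sim, (fwd_mean + bwd_mean) / 2)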
+
+
+def score_candidates(
+    sim_mat, candidate_inds, fwd_mean, bwd_mean, margin, verbose=False
+):
+    print(" - scoring {:d} candidates".format(sim_mat.shape[0]))
+    scores = np.zeros(candidate_inds.shape)
+    for i in range(scores.shape[0]):
+        for j in range(scores.shape[1]):
+            k = int(candidate_inds[i, j])
+            scores[i, j] = score(sim_mat[i, j], fwd_mean[i], bwd_mean[k], margin)
+    return scores
+
+
+def load_text(files):
+    all_sentences = []
+    for fi in files:
+        with open(fi) as sentence_fi:
+            for line in sentence_fi:
+                all_sentences.append(line.strip())
+    print(f"Read {len(all_sentences)} sentences")
+    return all_sentences
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Mine bitext")
+    parser.add_argument("--src-lang", help="Source language")
+    parser.add_argument("--tgt-lang", help="Target language")
+    parser.add_argument(
+        "--dict-path", help="Path to dictionary file", default="dict.txt"
+    )
+    parser.add_argument(
+        "--spm-path", help="Path to SPM model file", default="sentence.bpe.model"
+    )
+    parser.add_argument("--dim", type=int, default=1024, help="Embedding dimension")
+    parser.add_argument("--mem", type=int, default=5, help="Memory in GB")
+    parser.add_argument("--src-dir", help="Source directory")
+    parser.add_argument("--tgt-dir", help="Target directory")
+    parser.add_argument("--output", help="Output path")
+    parser.add_argument(
+        "--neighborhood", type=int, default=4, help="Number of nearest neighbors (k)"
+    )
+    parser.add_argument(
+        "--threshold", type=float, default=1.06, help="Threshold on mined bitext"
+    )
+    parser.add_argument(
+        "--valid-size",
+        type=int,
+        default=2000,
+        help="Number of sentences used for validation set",
+    )
+    parser.add_argument(
+        "--min-count",
+        type=int,
+        default=50000,
+        help="Min num sentences used for each language",
+    )
+    args = parser.parse_args()
+
+    x_batches_f, x_sents_f = get_batches(args.src_dir, args.src_lang)
+    y_batches_f, y_sents_f = get_batches(args.tgt_dir, args.tgt_lang)
+    margin = lambda a, b: a / b
+    y2x_sim, y2x_ind = knnGPU_sharded(
+        y_batches_f, x_batches_f, args.dim, args.neighborhood, direction="y2x"
+    )
+    x2y_sim, x2y_ind = knnGPU_sharded(
+        x_batches_f, y_batches_f, args.dim, args.neighborhood, direction="x2y"
+    )
+
+    x2y_mean = x2y_sim.mean(axis=1)
+    y2x_mean = y2x_sim.mean(axis=1)
+    fwd_scores = score_candidates(x2y_sim, x2y_ind, x2y_mean, y2x_mean, margin)
+    bwd_scores = score_candidates(y2x_sim, y2x_ind, y2x_mean, x2y_mean, margin)
+    fwd_best = x2y_ind[np.arange(x2y_sim.shape[0]), fwd_scores.argmax(axis=1)]
+    bwd_best = y2x_ind[np.arange(y2x_sim.shape[0]), bwd_scores.argmax(axis=1)]
+    indices = np.stack(
+        (
+            np.concatenate((np.arange(x2y_ind.shape[0]), bwd_best)),
+            np.concatenate((fwd_best, np.arange(y2x_ind.shape[0]))),
+        ),
+        axis=1,
+    )
+    scores = np.concatenate((fwd_scores.max(axis=1), bwd_scores.max(axis=1)))
+
+    x_sentences = load_text(x_sents_f)
+    y_sentences = load_text(y_sents_f)
+
+    threshold = args.threshold
+    min_count = args.min_count
+    seen_src, seen_trg = set(), set()
+    directory = args.output
+    call(f"mkdir -p {directory}")
+    src_out = open(
+        f"{directory}/all.{args.src_lang}",
+        mode="w",
+        encoding="utf-8",
+        errors="surrogateescape",
+    )
+    tgt_out = open(
+        f"{directory}/all.{args.tgt_lang}",
+        mode="w",
+        encoding="utf-8",
+        errors="surrogateescape",
+    )
+    scores_out = open(
+        f"{directory}/all.scores", mode="w", encoding="utf-8", errors="surrogateescape"
+    )
+    count = 0
+    for i in np.argsort(-scores):
+        src_ind, trg_ind = indices[i]
+        if src_ind not in seen_src and trg_ind not in seen_trg:
+            seen_src.add(src_ind)
+            seen_trg.add(trg_ind)
+            if scores[i] > threshold or count < min_count:
+                if x_sentences[src_ind]:
+                    print(scores[i], file=scores_out)
+                    print(x_sentences[src_ind], file=src_out)
+                    print(y_sentences[trg_ind], file=tgt_out)
+                    count += 1
+                else:
+                    print(f"Ignoring sentence: {x_sentences[src_ind]}")
+    src_out.close()
+    tgt_out.close()
+    scores_out.close()
+
+    print(f"Found {count} pairs for threshold={threshold}")
+    with open(f"{directory}/all.{args.src_lang}") as all_s, open(
+        f"{directory}/all.{args.tgt_lang}"
+    ) as all_t, open(f"{directory}/valid.{args.src_lang}", "w") as valid_s, open(
+        f"{directory}/valid.{args.tgt_lang}", "w"
+    ) as valid_t, open(
+        f"{directory}/train.{args.src_lang}", "w"
+    ) as train_s, open(
+        f"{directory}/train.{args.tgt_lang}", "w"
+    ) as train_t:
+        count = 0
+        for s_line, t_line in zip(all_s, all_t):
+            s_line = s_line.split("\t")[1]
+            t_line = t_line.split("\t")[1]
+            if count >= args.valid_size:
+                train_s.write(s_line)
+                train_t.write(t_line)
+            else:
+                valid_s.write(s_line)
+                valid_t.write(t_line)
+                count += 1
--- a/fairseq/examples/criss/mining/mine_example.sh
+++ b/fairseq/examples/criss/mining/mine_example.sh
+#!/bin/bash
+# Copyright (c) Facebook, Inc. and its affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+#
+source_lang=kk_KZ
+target_lang=en_XX
+MODEL=criss_checkpoints/criss.3rd.pt
+SPM=criss_checkpoints/sentence.bpe.model
+SPLIT=test
+LANG_DICT=criss_checkpoints/lang_dict.txt
+SPM_ENCODE=flores/scripts/spm_encode.py
+SAVE_ENCODER=save_encoder.py
+ENCODER_SAVE_ROOT=sentence_embeddings/$MODEL
+DICT=criss_checkpoints/dict.txt
+THRESHOLD=1.02
+MIN_COUNT=500
+
+DATA_DIR=data_tmp
+SAVE_DIR=mining/${source_lang}_${target_lang}_mined
+ENCODER_SAVE_DIR=${ENCODER_SAVE_ROOT}/${source_lang}-${target_lang}
+INPUT_DIR=$DATA_DIR/${source_lang}-${target_lang}-tatoeba
+
+mkdir -p $ENCODER_SAVE_DIR/${target_lang}
+mkdir -p $ENCODER_SAVE_DIR/${source_lang}
+mkdir -p $SAVE_DIR
+
+## Save encoder outputs
+
+# Save encoder outputs for source sentences
+python $SAVE_ENCODER \
+  ${INPUT_DIR} \
+  --path ${MODEL} \
+  --task translation_multi_simple_epoch \
+  --lang-pairs ${source_lang}-${target_lang} \
+  --lang-dict ${LANG_DICT} \
+  --gen-subset ${SPLIT} \
+  --bpe 'sentencepiece' \
+  -s ${source_lang} -t ${target_lang} \
+  --sentencepiece-model ${SPM} \
+  --remove-bpe 'sentencepiece' \
+  --beam 1 \
+  --lang-tok-style mbart \
+  --encoder-save-dir ${ENCODER_SAVE_DIR}/${source_lang}
+
+# Save encoder outputs for target sentences
+python $SAVE_ENCODER \
+  ${INPUT_DIR} \
+  --path ${MODEL} \
+  --lang-pairs ${source_lang}-${target_lang} \
+  --lang-dict ${LANG_DICT} \
+  --task translation_multi_simple_epoch \
+  --gen-subset ${SPLIT} \
+  --bpe 'sentencepiece' \
+  -t ${source_lang} -s ${target_lang} \
+  --sentencepiece-model ${SPM} \
+  --remove-bpe 'sentencepiece' \
+  --beam 1 \
+  --lang-tok-style mbart \
+  --encoder-save-dir ${ENCODER_SAVE_DIR}/${target_lang}
+
+## Mining
+python mining/mine.py \
+  --src-lang ${source_lang} \
+  --tgt-lang ${target_lang} \
+  --dim 1024 \
+  --mem 10 \
+  --neighborhood 4 \
+  --src-dir ${ENCODER_SAVE_DIR}/${source_lang} \
+  --tgt-dir ${ENCODER_SAVE_DIR}/${target_lang} \
+  --output $SAVE_DIR \
+  --threshold ${THRESHOLD} \
+  --min-count ${MIN_COUNT} \
+  --valid-size 100 \
+  --dict-path ${DICT} \
+  --spm-path ${SPM}
+
+
+## Process and binarize mined data
+python $SPM_ENCODE \
+  --model ${SPM} \
+  --output_format=piece \
+  --inputs $SAVE_DIR/train.${source_lang} $SAVE_DIR/train.${target_lang} \
+  --outputs $SAVE_DIR/train.bpe.${source_lang} $SAVE_DIR/train.bpe.${target_lang}
+
+python $SPM_ENCODE \
+  --model ${SPM} \
+  --output_format=piece \
+  --inputs $SAVE_DIR/valid.${source_lang} $SAVE_DIR/valid.${target_lang} \
+  --outputs $SAVE_DIR/valid.bpe.${source_lang} $SAVE_DIR/valid.bpe.${target_lang}
+
+
+fairseq-preprocess \
+  --source-lang ${source_lang} \
+  --target-lang ${target_lang} \
+  --trainpref $SAVE_DIR/train.bpe \
+  --validpref $SAVE_DIR/valid.bpe \
+  --destdir $SAVE_DIR \
+  --srcdict ${DICT} \
+  --joined-dictionary \
+  --workers 8