add fairseq0.10.2

Commit 7df61696 authored Jul 28, 2023 by Sugon_ldc
20 changed files
--- a/examples/unsupervised_quality_estimation/README.md
+++ b/examples/unsupervised_quality_estimation/README.md
+# Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)
+
+This page includes instructions for reproducing results from the paper [Unsupervised Quality Estimation for Neural
+Machine Translation (Fomicheva et al., 2020)](https://arxiv.org/abs/2005.10608)
+
+## Requirements:
+
+* mosesdecoder: https://github.com/moses-smt/mosesdecoder
+* subword-nmt: https://github.com/rsennrich/subword-nmt
+* flores: https://github.com/facebookresearch/flores
+
+## Download Models and Test Data
+
+Download translation models and test data from [MLQE dataset repository](https://github.com/facebookresearch/mlqe).
+
+## Set up:
+
+Given a testset consisting of source sentences and reference translations:
+
+* `SRC_LANG`: source language
+* `TGT_LANG`: target language
+* `INPUT`: input prefix, such that the file `$INPUT.$SRC_LANG` contains source sentences and `$INPUT.$TGT_LANG`
+contains the reference sentences
+* `OUTPUT_DIR`: output path to store results
+* `MOSES_DECODER`: path to mosesdecoder installation
+* `BPE_ROOT`: path to subword-nmt installation
+* `BPE`: path to BPE model
+* `MODEL_DIR`: directory containing the NMT model `.pt` file as well as the source and target vocabularies.
+* `TMP`: directory for intermediate temporary files
+* `GPU`: if translating with GPU, id of the GPU to use for inference
+* `DROPOUT_N`: number of stochastic forward passes
+
+`$DROPOUT_N` is set to 30 in the experiments reported in the paper. However, we observed that increasing it beyond 10
+does not bring substantial improvements.
+
+## Translate the data using standard decoding
+
+Preprocess the input data:
+```
+for LANG in $SRC_LANG $TGT_LANG; do
+  perl $MOSES_DECODER/scripts/tokenizer/tokenizer.perl -threads 80 -a -l $LANG < $INPUT.$LANG > $TMP/preprocessed.tok.$LANG
+  python $BPE_ROOT/apply_bpe.py -c ${BPE} < $TMP/preprocessed.tok.$LANG > $TMP/preprocessed.tok.bpe.$LANG
+done
+```
+
+Binarize the data for faster translation:
+
+```
+fairseq-preprocess --srcdict $MODEL_DIR/dict.$SRC_LANG.txt --tgtdict $MODEL_DIR/dict.$TGT_LANG.txt \
+  --source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref $TMP/preprocessed.tok.bpe --destdir $TMP/bin --workers 4
+```
+
+Translate
+
+```
+CUDA_VISIBLE_DEVICES=$GPU fairseq-generate $TMP/bin --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 \
+  --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 > $TMP/fairseq.out
+grep ^H $TMP/fairseq.out | cut -f3- > $TMP/mt.out
+```
+
+Post-process
+
+```
+sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/mt.out | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl \
+  -l $TGT_LANG > $OUTPUT_DIR/mt.out
+```
+
+## Produce uncertainty estimates
+
+### Scoring
+
+Make temporary files to store the translations repeated N times.
+
+```
+python ${SCRIPTS}/scripts/uncertainty/repeat_lines.py -i $TMP/preprocessed.tok.bpe.$SRC_LANG -n $DROPOUT_N \
+  -o $TMP/repeated.$SRC_LANG
+python ${SCRIPTS}/scripts/uncertainty/repeat_lines.py -i $TMP/mt.out -n $DROPOUT_N -o $TMP/repeated.$TGT_LANG
+
+fairseq-preprocess --srcdict ${MODEL_DIR}/dict.${SRC_LANG}.txt --tgtdict ${MODEL_DIR}/dict.${TGT_LANG}.txt \
+  --source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref ${TMP}/repeated --destdir ${TMP}/bin-repeated
+```
+
+Produce model scores for the generated translations using `--retain-dropout` option to apply dropout at inference time:
+
+```
+CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 \
+  --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 --score-reference --retain-dropout \
+  --retain-dropout-modules TransformerModel TransformerEncoder TransformerDecoder TransformerEncoderLayer \
+  TransformerDecoderLayer --seed 46 > $TMP/dropout.scoring.out
+
+grep ^H $TMP/dropout.scoring.out | cut -f2 > $TMP/dropout.scores
+
+```
+
+Use `--retain-dropout-modules` to specify the modules in which dropout stays active. By default, dropout is applied
+in the same places as during training.
+
+Compute the mean of the resulting output distribution:
+
+```
+python $SCRIPTS/scripts/uncertainty/aggregate_scores.py -i $TMP/dropout.scores -o $OUTPUT_DIR/dropout.scores.mean \
+  -n $DROPOUT_N
+```
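The aggregation step can be pictured with a short sketch. It is not the script shipped in this commit: it assumes the scores arrive as a flat list in which every source sentence contributes `n` consecutive values, and the helper name `aggregate_groups` is hypothetical.

```python
import statistics


def aggregate_groups(scores, n, func=statistics.mean):
    """Collapse every run of n consecutive scores into a single value.

    Hypothetical helper for illustration only. Trailing scores that do
    not fill a complete group of n are ignored.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [func(scores[i:i + n]) for i in range(0, len(scores) - n + 1, n)]


# Two source sentences, three stochastic forward passes each (made-up values).
scores = [-0.5, -0.7, -0.6, -1.0, -1.2, -1.1]
means = aggregate_groups(scores, n=3)
# Population variance, i.e. the same convention as NumPy's default np.var.
variances = aggregate_groups(scores, n=3, func=statistics.pvariance)
print(means, variances)
```

Swapping `func` gives the other statistics, which is roughly what the `-f` option of `aggregate_scores.py` selects.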
+
+### Generation
+
+Produce multiple translation hypotheses for the same source using `--retain-dropout` option:
+
+```
+CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt \
+  --beam 5 --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --retain-dropout \
+  --unkpen 5 --retain-dropout-modules TransformerModel TransformerEncoder TransformerDecoder \
+  TransformerEncoderLayer TransformerDecoderLayer --seed 46 > $TMP/dropout.generation.out
+
+grep ^H $TMP/dropout.generation.out | cut -f3- > $TMP/dropout.hypotheses_
+
+sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/dropout.hypotheses_ | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl \
+  -l $TGT_LANG > $TMP/dropout.hypotheses
+```
+
+Compute similarity between multiple hypotheses corresponding to the same source sentence using Meteor
+evaluation metric:
+```
+python meteor.py -i $TMP/dropout.hypotheses -m <path_to_meteor_installation> -n $DROPOUT_N -o \
+  $OUTPUT_DIR/dropout.gen.sim.meteor
+```
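The pairwise comparison behind this step can be sketched as follows. This is an illustration under stated assumptions: `token_f1` is a toy stand-in for Meteor (it is not Meteor and will give different numbers), and both function names are hypothetical. The point is only the pairing scheme: N hypotheses for one source sentence yield N(N-1)/2 unordered pairs, so 30 hypotheses give 435 pairs.

```python
from itertools import combinations


def token_f1(hyp, ref):
    """Toy similarity: F1 over unique tokens. A stand-in for Meteor, NOT Meteor."""
    hyp_tokens, ref_tokens = set(hyp.split()), set(ref.split())
    overlap = len(hyp_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_similarity(hypotheses, sim=token_f1):
    """Average sim over all unordered pairs of hypotheses for one source sentence."""
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        raise ValueError("need at least two hypotheses")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Three made-up hypotheses for the same source sentence.
hyps = ["the cat sat", "the cat sat", "a cat sat"]
score = mean_pairwise_similarity(hyps)
print(score)
```

A lower average indicates that the dropout-perturbed translations disagree more with each other.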
--- a/examples/unsupervised_quality_estimation/aggregate_scores.py
+++ b/examples/unsupervised_quality_estimation/aggregate_scores.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import argparse
+import sys
+
+import numpy as np
+
+
+aggregate_funcs = {
+    "std": np.std,
+    "var": np.var,
+    "median": np.median,
+    "mean": np.mean,
+    "min": np.min,
+    "max": np.max,
+}
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-i", "--input_file", required=True, type=str)
+    parser.add_argument("-n", "--repeat_times", required=True, type=int)
+    parser.add_argument("-o", "--output_file", required=False)
+    parser.add_argument("-f", "--func", required=False, default="mean")
+    args = parser.parse_args()
+
+    stream = open(args.output_file, "w") if args.output_file else sys.stdout
+
+    segment_scores = []
+    for line in open(args.input_file):
+        segment_scores.append(float(line.strip()))
+        if len(segment_scores) == args.repeat_times:
+            stream.write("{}\n".format(aggregate_funcs[args.func](segment_scores)))
+            segment_scores = []
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/unsupervised_quality_estimation/meteor.py
+++ b/examples/unsupervised_quality_estimation/meteor.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import argparse
+import math
+import os
+import subprocess
+import sys
+import tempfile
+from collections import defaultdict
+from itertools import combinations
+
+
+def read_translations(path, n_repeats):
+    segment_counter = 0
+    segment_translations = []
+    translations = defaultdict(list)
+    for line in open(path):
+        segment_translations.append(" ".join(line.split()))
+        if len(segment_translations) == n_repeats:
+            translations[segment_counter] = segment_translations
+            segment_translations = []
+            segment_counter += 1
+    return translations
+
+
+def generate_input(translations, n_repeats):
+    _, ref_path = tempfile.mkstemp()
+    _, mt_path = tempfile.mkstemp()
+    ref_fh = open(ref_path, "w")
+    mt_fh = open(mt_path, "w")
+    for segid in sorted(translations.keys()):
+        assert len(translations[segid]) == n_repeats
+        indexes = combinations(range(n_repeats), 2)
+        for idx1, idx2 in indexes:
+            mt_fh.write(translations[segid][idx1].strip() + "\n")
+            ref_fh.write(translations[segid][idx2].strip() + "\n")
+    sys.stderr.write("\nSaved translations to %s and %s" % (ref_path, mt_path))
+    return ref_path, mt_path
+
+
+def run_meteor(ref_path, mt_path, metric_path, lang="en"):
+    _, out_path = tempfile.mkstemp()
+    subprocess.call(
+        [
+            "java",
+            "-Xmx2G",
+            "-jar",
+            metric_path,
+            mt_path,
+            ref_path,
+            "-p",
+            "0.5 0.2 0.6 0.75",  # default parameters, only changed alpha to give equal weight to P and R
+            "-norm",
+            "-l",
+            lang,
+        ],
+        stdout=open(out_path, "w"),
+    )
+    os.remove(ref_path)
+    os.remove(mt_path)
+    sys.stderr.write("\nSaved Meteor output to %s" % out_path)
+    return out_path
+
+
+def read_output(meteor_output_path, n_repeats):
+    # Number of unordered hypothesis pairs per segment, kept as an int
+    n_combinations = math.factorial(n_repeats) // (
+        math.factorial(2) * math.factorial(n_repeats - 2)
+    )
+    raw_scores = []
+    average_scores = []
+    for line in open(meteor_output_path):
+        if not line.startswith("Segment "):
+            continue
+        score = float(line.strip().split("\t")[1])
+        raw_scores.append(score)
+        if len(raw_scores) == n_combinations:
+            average_scores.append(sum(raw_scores) / n_combinations)
+            raw_scores = []
+    os.remove(meteor_output_path)
+    return average_scores
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-i", "--input")
+    parser.add_argument("-n", "--repeat_times", type=int)
+    parser.add_argument("-m", "--meteor")
+    parser.add_argument("-o", "--output")
+    args = parser.parse_args()
+
+    # Attribute names must match the parser options: --input and --repeat_times
+    translations = read_translations(args.input, args.repeat_times)
+    sys.stderr.write("\nGenerating input for Meteor...")
+    ref_path, mt_path = generate_input(translations, args.repeat_times)
+    sys.stderr.write("\nRunning Meteor...")
+    out_path = run_meteor(ref_path, mt_path, args.meteor)
+    sys.stderr.write("\nReading output...")
+    scores = read_output(out_path, args.repeat_times)
+    sys.stderr.write("\nWriting results...")
+    with open(args.output, "w") as o:
+        for scr in scores:
+            o.write("{}\n".format(scr))
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/unsupervised_quality_estimation/repeat_lines.py
+++ b/examples/unsupervised_quality_estimation/repeat_lines.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import argparse
+import sys
+
+
+def _normalize_spaces(line):
+    return " ".join(line.split())
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-i", "--input_file", required=True, type=str)
+    parser.add_argument("-n", "--repeat_times", required=True, type=int)
+    parser.add_argument("-o", "--output_file", required=False, type=str)
+    args = parser.parse_args()
+    stream = open(args.output_file, "w") if args.output_file else sys.stdout
+
+    for line in open(args.input_file):
+        for _ in range(args.repeat_times):
+            stream.write(_normalize_spaces(line) + "\n")
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/wav2vec/README.md
+++ b/examples/wav2vec/README.md
+# wav2vec 2.0
+
+wav2vec 2.0 learns speech representations on unlabeled data as described in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](https://arxiv.org/abs/2006.11477).
+
+## Pre-trained models
+
+Model | Finetuning split | Dataset | Download
+---|---|---|---
+Wav2Vec 2.0 Base | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt)
+Wav2Vec 2.0 Base | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_10m.pt)
+Wav2Vec 2.0 Base | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_100h.pt)
+Wav2Vec 2.0 Base | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt)
+Wav2Vec 2.0 Large | No finetuning | [Librispeech](http://www.openslr.org/12)  | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/libri960_big.pt)
+Wav2Vec 2.0 Large | 10 minutes | [Librispeech](http://www.openslr.org/12)  | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_10m.pt)
+Wav2Vec 2.0 Large | 100 hours | [Librispeech](http://www.openslr.org/12)  | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_100h.pt)
+Wav2Vec 2.0 Large | 960 hours | [Librispeech](http://www.openslr.org/12)  | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_960h.pt)
+Wav2Vec 2.0 Large (LV-60) | No finetuning | [Libri-Light](https://github.com/facebookresearch/libri-light) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox.pt)
+Wav2Vec 2.0 Large (LV-60) | 10 minutes | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_10m.pt)
+Wav2Vec 2.0 Large (LV-60) | 100 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_100h.pt)
+Wav2Vec 2.0 Large (LV-60) | 960 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h.pt)
+
+## Training a new model with the CLI tools
+
+Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length):
+
+### Prepare training data manifest:
+
+First, install the `soundfile` library:
+```shell script
+pip install soundfile
+```
+
+Next, run:
+
+```shell script
+$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
+```
+
+$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.
+
+$valid should be set to some reasonable percentage (like 0.01) of training data to use for validation.
+To use a pre-defined validation set (like dev-other from Librispeech), set it to 0 and then overwrite valid.tsv with a
+separately pre-processed manifest file.
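The helper scripts in this commit (`read_manifest` in `vq-wav2vec_featurize.py` and the loop in `libri_labels.py`) read the manifest as a tsv whose first line is the root directory and whose remaining lines start with a path relative to that root. A minimal sketch of that reading logic, assuming only this layout; the second column in the demo file is a hypothetical placeholder and is ignored:

```python
import os
import tempfile


def read_manifest(path):
    """Return the audio paths listed in a manifest tsv, joined with its root."""
    with open(path) as f:
        root = next(f).strip()  # first line: root directory
        return [
            os.path.join(root, line.strip().split("\t")[0])
            for line in f
            if line.strip()  # skip blank lines
        ]


# Build a tiny made-up manifest to show the expected shape.
with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "train.tsv")
    with open(manifest, "w") as f:
        f.write("/data/waves\n")
        f.write("spk1/utt1.flac\t16000\n")  # extra column is illustrative only
        f.write("spk1/utt2.flac\t32000\n")
    files = read_manifest(manifest)

print(files)
```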
+
+### Train a wav2vec 2.0 base model:
+
+This configuration was used for the base model trained on the Librispeech dataset in the wav2vec 2.0 paper.
+
+Note that this was tested with PyTorch 1.4.0, and the input is expected to be single-channel audio sampled at 16 kHz.
+
+```shell script
+$ python train.py --distributed-world-size 64 --distributed-port $PORT /manifest/path \
+--save-dir /model/path --fp16 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
+--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
+--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
+--latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam \
+--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 \
+--lr 0.0005 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
+--encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 \
+--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
+--max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
+--max-tokens 1400000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
+```
+
+Note: you can simulate 64 GPUs by using k GPUs and setting --update-freq 64/k
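The arithmetic behind this note is a simple division: `--update-freq` accumulates gradients over that many batches before each optimizer step, so k GPUs with an update frequency of 64/k see the same number of batches per update as 64 GPUs. A small sketch, with the hypothetical helper name `update_freq_for`:

```python
def update_freq_for(target_gpus, available_gpus):
    """Value for --update-freq that makes available_gpus mimic target_gpus.

    Hypothetical helper; requires an exact division so the effective
    batch size matches the reference configuration.
    """
    if available_gpus <= 0 or target_gpus % available_gpus != 0:
        raise ValueError("target_gpus must be a positive multiple of available_gpus")
    return target_gpus // available_gpus


base_on_8_gpus = update_freq_for(64, 8)
print(base_on_8_gpus)  # 8
```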
+
+### Train a wav2vec 2.0 large model:
+
+This configuration was used for the large model trained on the Libri-light dataset in the wav2vec 2.0 paper
+
+```shell script
+$ python train.py --distributed-world-size 128 --distributed-port $PORT /manifest/path \
+--save-dir /model/path --fp16 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
+--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
+--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 768 --latent-vars 320 \
+--latent-groups 2 --latent-temp '(2.0,0.1,0.999995)' --infonce --optimizer adam \
+--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 600000 \
+--lr 0.0003 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
+--encoder-layerdrop 0.0 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.03 \
+--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --encoder-layers 24 --encoder-embed-dim 1024 \
+--encoder-ffn-embed-dim 4096 --encoder-attention-heads 16 --num-negatives 100 --cross-sample-negatives 0 \
+--max-sample-size 320000 --min-sample-size 32000 --dropout 0.0 --attention-dropout 0.1 --weight-decay 0.01 \
+--max-tokens 1200000 --max-update 600000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
+```
+
+Note: you can simulate 128 GPUs by using k GPUs and setting --update-freq 128/k
+
+### Fine-tune a pre-trained model with CTC:
+
+Fine-tuning a model requires parallel audio and label files, as well as a vocabulary file in fairseq format.
+A letter vocabulary can be downloaded [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
+An example [script](libri_labels.py) that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:
+
+```shell script
+split=train
+$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
+```
+
+Fine-tuning on 100h of Librispeech with letter targets:
+```shell script
+valid_subset=dev_other
+python train.py --distributed-world-size 24 --distributed-port $PORT /path/to/training_data --save-dir /model/path --fp16 \
+--wer-args '("/path/to/lm/4-gram.bin","/path/to/lexicon",2,-1)' \
+--post-process letter --valid-subset $valid_subset --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 4 \
+--max-update 80000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc --w2v-path /path/to/pretrained/model \
+--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
+--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
+--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
+--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
+--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
+--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d
+```
+
+Note: you can simulate 24 GPUs by using k GPUs and setting --update-freq 24/k
+
+Decoding with a language model during training requires wav2letter [python bindings](https://github.com/facebookresearch/wav2letter/wiki/Building-Python-bindings).
+Alternatively, simply omit the --wer-args flag.
+
+For hyper-parameters to fine-tune other Librispeech splits (10 minutes, 1 hour, etc.), please refer to the table in Appendix B of the wav2vec 2.0 paper.
+The main changes are to adjust --max-update, and then to adjust --warmup-steps, --hold-steps, and --decay-steps so that they use 0.1/0.4/0.5 of --max-update respectively. You then need to adjust --mask-prob and --mask-channel-prob. Each should be set to mask-length * x, where x is the corresponding number in the table and mask-length is the value of --mask-length (10 in this example) for --mask-prob, or the value of --mask-channel-length for --mask-channel-prob.
+
+For example, for 10 hours, the paper gives a timestep mask probability of 0.065, so we set --mask-prob to 10 * 0.065 = 0.65. The channel mask probability is 0.004, so we set --mask-channel-prob to 64 * 0.004 = 0.256. We then set --max-update to 20000 and change --warmup-steps to 20000 * 0.1 = 2000, --hold-steps to 8000, and --decay-steps to 10000.
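The same arithmetic as a small sketch. The function name `finetune_flags` and its argument names are hypothetical; the 0.1/0.4/0.5 split and the mask-length multiplication come from the text above, and the inputs reproduce the 10-hour example.

```python
def finetune_flags(max_update, mask_length, timestep_mask,
                   mask_channel_length, channel_mask):
    """Derive schedule and masking flags from the values in the paper's table.

    Hypothetical helper for illustration; it only restates the rules
    described in the README text.
    """
    return {
        "--max-update": max_update,
        "--warmup-steps": round(max_update * 0.1),
        "--hold-steps": round(max_update * 0.4),
        "--decay-steps": round(max_update * 0.5),
        # Table value times the mask length; rounded to hide float noise.
        "--mask-prob": round(mask_length * timestep_mask, 6),
        "--mask-channel-prob": round(mask_channel_length * channel_mask, 6),
    }


flags = finetune_flags(20000, 10, 0.065, 64, 0.004)
for name, value in flags.items():
    print(name, value)
```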
+
+### Evaluating a CTC model:
+
+Evaluating a CTC model with a language model requires wav2letter [python bindings](https://github.com/facebookresearch/wav2letter/wiki/Building-Python-bindings) to be installed.
+
+Fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the [wav2letter model repository](https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019).
+Be sure to upper-case the language model vocab after downloading it.
+
+Letter dictionary for pre-trained models can be found [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
+
+Next, run the evaluation command:
+
+```shell script
+subset=dev_other
+python examples/speech_recognition/infer.py /path/to/data --task audio_pretraining \
+--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results/for/sclite --w2l-decoder kenlm \
+--lm-model /path/to/kenlm.bin --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
+--post-process letter
+```
+
+To get raw numbers, use --w2l-decoder viterbi and omit the lexicon. To use the transformer language model, use --w2l-decoder fairseqlm.
+
+# wav2vec
+
+Example to train a wav2vec model as described in [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](https://arxiv.org/abs/1904.05862).
+
+## Pre-trained models
+
+Description | Dataset | Model
+---|---|---
+Wav2Vec large | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt)
+
+#### Example usage:
+```python
+import torch
+from fairseq.models.wav2vec import Wav2VecModel
+
+cp = torch.load('/path/to/wav2vec.pt')
+model = Wav2VecModel.build_model(cp['args'], task=None)
+model.load_state_dict(cp['model'])
+model.eval()
+
+wav_input_16khz = torch.randn(1,10000)
+z = model.feature_extractor(wav_input_16khz)
+c = model.feature_aggregator(z)
+```
+
+## Training a new model with the CLI tools
+
+Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length):
+
+### Prepare training data manifest:
+
+```
+$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
+```
+
+### Train a wav2vec model:
+
+```
+$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
+--arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 --optimizer adam --max-lr 0.005 --lr-scheduler cosine \
+--conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]' \
+--conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' \
+--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
+--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test
+```
+
+### Extract embeddings from the downstream task data:
+
+```
+$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/wav2vec_featurize.py --input /path/to/task/waves --output /path/to/output \
+--model /model/path/checkpoint_best.pt --split train valid test
+```
+
+# vq-wav2vec
+
+Example to train a vq-wav2vec model as described in [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (Baevski et al., 2019)](https://arxiv.org/abs/1910.05453).
+
+These models are also used in [Effectiveness of self-supervised pre-training for speech recognition (Baevski et al., 2019)](https://arxiv.org/abs/1911.03912).
+
+## Pre-trained models
+
+Description | Dataset | Model
+---|---|---
+vq-wav2vec Gumbel | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec.pt)
+vq-wav2vec K-means | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec_kmeans.pt)
+Roberta on K-means codes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/bert_kmeans.tar)
+
+#### Example usage:
+```python
+import torch
+from fairseq.models.wav2vec import Wav2VecModel
+
+cp = torch.load('/path/to/vq-wav2vec.pt')
+model = Wav2VecModel.build_model(cp['args'], task=None)
+model.load_state_dict(cp['model'])
+model.eval()
+
+wav_input_16khz = torch.randn(1,10000)
+z = model.feature_extractor(wav_input_16khz)
+_, idxs = model.vector_quantizer.forward_idx(z)
+print(idxs.shape) # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indexes corresponding to 2 groups in the model
+```
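The 60 timesteps in the comment above can be reproduced by hand. The sketch below rests on two assumptions that are not stated in this README: that each feature-extractor layer is an unpadded 1-D convolution, and that the pre-trained checkpoint uses the `--conv-feature-layers` spec from the gumbel training command later in this file. The helper name is hypothetical.

```python
def conv_output_length(n_samples, layers):
    """Number of output frames after a stack of unpadded 1-D convolutions.

    Each layer is (channels, kernel_size, stride); channels do not
    affect the length. Assumes no padding and no dilation.
    """
    length = n_samples
    for _channels, kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


# Layer spec copied from the gumbel vq-wav2vec training command.
vq_layers = [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2),
             (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)]

frames = conv_output_length(10000, vq_layers)
print(frames)  # 60
```

Under the same assumptions the strides multiply to 160 samples, so each frame advances by 10 ms of 16 kHz audio.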
+
+## Training a new model with the CLI tools
+
+Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length):
+
+### Prepare training data manifest:
+
+```
+$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
+```
+
+### Train a gumbel vq-wav2vec model:
+
+```
+$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 \
+--save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 \
+--optimizer adam --max-lr 1e-05 --lr-scheduler cosine \
+--conv-feature-layers '[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)]' \
+--conv-aggregator-layers '[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]' \
+--activation gelu --offset auto --skip-connections-agg --residual-scale 0.5 \
+--log-keys '["prob_perplexity","code_perplexity","temp"]' --vq-type gumbel --vq-groups 2 --vq-depth 2 \
+--combine-groups --vq-vars 320 --vq-temp '(2,0.5,0.999995)' --prediction-steps 12 --warmup-updates 1000 \
+--warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 --max-sample-size 150000 \
+--max-tokens 300000 --cross-sample-negatives 0 --update-freq 1 --seed 2 --skip-invalid-size-inputs-valid-test
+```
+
+For k-means training, set --vq-type to kmeans and add the --loss-weights [1] argument. The pre-trained models were trained on 16 GPUs.
+
+### Tokenize audio data (e.g. for BERT training):
+
+```
+$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/vq-wav2vec_featurize.py --data-dir /manifest/path --output-dir /path/to/output \
+--checkpoint /model/path/checkpoint_best.pt --split train valid test --extension tsv
+```
--- a/examples/wav2vec/libri_labels.py
+++ b/examples/wav2vec/libri_labels.py
+#!/usr/bin/env python3
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""
+Helper script to generate letter (.ltr) and word (.wrd) labels for Librispeech from a manifest tsv
+"""
+
+import argparse
+import os
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("tsv")
+    parser.add_argument("--output-dir", required=True)
+    parser.add_argument("--output-name", required=True)
+    args = parser.parse_args()
+
+    os.makedirs(args.output_dir, exist_ok=True)
+
+    transcriptions = {}
+
+    with open(args.tsv, "r") as tsv, open(
+        os.path.join(args.output_dir, args.output_name + ".ltr"), "w"
+    ) as ltr_out, open(
+        os.path.join(args.output_dir, args.output_name + ".wrd"), "w"
+    ) as wrd_out:
+        root = next(tsv).strip()
+        for line in tsv:
+            line = line.strip()
+            dir = os.path.dirname(line)
+            if dir not in transcriptions:
+                parts = dir.split(os.path.sep)
+                trans_path = f"{parts[-2]}-{parts[-1]}.trans.txt"
+                path = os.path.join(root, dir, trans_path)
+                assert os.path.exists(path)
+                texts = {}
+                with open(path, "r") as trans_f:
+                    for tline in trans_f:
+                        items = tline.strip().split()
+                        texts[items[0]] = " ".join(items[1:])
+                transcriptions[dir] = texts
+            part = os.path.basename(line).split(".")[0]
+            assert part in transcriptions[dir]
+            print(transcriptions[dir][part], file=wrd_out)
+            print(
+                " ".join(list(transcriptions[dir][part].replace(" ", "|"))) + " |",
+                file=ltr_out,
+            )
+
+
+if __name__ == "__main__":
+    main()
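The letter-label format written by `libri_labels.py` is easiest to see on a single transcript. The sketch below isolates the same expression the script uses for the `.ltr` file; the function name `to_letter_target` and the sample transcript are illustrative.

```python
def to_letter_target(transcript):
    """Turn a word transcript into a space-separated letter sequence.

    Word boundaries become '|', and a final '|' closes the last word,
    mirroring the expression used for the .ltr output in libri_labels.py.
    """
    return " ".join(list(transcript.replace(" ", "|"))) + " |"


letters = to_letter_target("HELLO WORLD")
print(letters)  # H E L L O | W O R L D |
```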
--- a/examples/wav2vec/vq-wav2vec_featurize.py
+++ b/examples/wav2vec/vq-wav2vec_featurize.py
+#!/usr/bin/env python3
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""
+Helper script to pre-compute embeddings for a wav2letter++ dataset
+"""
+
+import argparse
+import glob
+import os
+import os.path as osp
+import pprint
+
+import soundfile as sf
+import torch
+import tqdm
+from fairseq.models.wav2vec.wav2vec import Wav2VecModel
+from torch import nn
+from torch.utils.data import DataLoader
+
+
+try:
+    import tqdm
+except ImportError:
+    print("Install tqdm to use --log-format=tqdm")
+
+
+class FilesDataset:
+    def __init__(self, files, labels):
+        self.files = files
+        if labels and osp.exists(labels):
+            with open(labels, "r") as lbl_f:
+                self.labels = [line.rstrip() for line in lbl_f]
+        else:
+            self.labels = labels
+
+    def __len__(self):
+        return len(self.files)
+
+    def __getitem__(self, index):
+        fname = self.files[index]
+
+        wav, sr = sf.read(fname)
+        assert sr == 16000
+
+        wav = torch.from_numpy(wav).float()
+        lbls = None
+        if self.labels:
+            if isinstance(self.labels, str):
+                lbl_file = osp.splitext(fname)[0] + "." + self.labels
+                with open(lbl_file, "r") as lblf:
+                    lbls = lblf.readline()
+                    assert lbls is not None
+            else:
+                lbls = self.labels[index]
+        return wav, lbls
+
+    def collate(self, batch):
+        return batch
+
+
+class ArgTypes:
+    @staticmethod
+    def existing_path(arg):
+        arg = str(arg)
+        assert osp.exists(arg), f"File {arg} does not exist"
+        return arg
+
+    @staticmethod
+    def mkdir(arg):
+        arg = str(arg)
+        os.makedirs(arg, exist_ok=True)
+        return arg
+
+
+class DatasetWriter:
+    def __init__(self):
+
+        self.args = self.load_config()
+        pprint.pprint(self.args.__dict__)
+
+        self.model = self.load_model()
+
+    def __getattr__(self, attr):
+        return getattr(self.args, attr)
+
+    def read_manifest(self, fname):
+
+        with open(fname, "r") as fp:
+            lines = fp.read().split("\n")
+            root = lines.pop(0).strip()
+            fnames = [
+                osp.join(root, line.split("\t")[0]) for line in lines if len(line) > 0
+            ]
+
+        return fnames
+
+    def process_splits(self):
+
+        if self.args.shard is not None or self.args.num_shards is not None:
+            assert self.args.shard is not None and self.args.num_shards is not None
+
+        for split in self.splits:
+            print(split)
+
+            if self.extension == "tsv":
+                datadir = osp.join(self.data_dir, f"{split}.{self.extension}")
+                print("Reading manifest file: ", datadir)
+                files = self.read_manifest(datadir)
+            else:
+                datadir = osp.join(self.data_dir, split, f"**/*.{self.extension}")
+                files = glob.glob(datadir, recursive=True)
+
+            assert len(files) > 0
+
+            if self.args.shard is not None:
+                files = files[self.args.shard :: self.args.num_shards]
+
+            lbls = []
+            with open(self.data_file(split), "w") as srcf:
+                for line, lbl in self.iterate(files):
+                    print(line, file=srcf)
+                    if self.args.labels:
+                        lbls.append(lbl + "\n")
+
+            if self.args.labels:
+                assert all(a is not None for a in lbls)
+                with open(self.lbl_file(split), "w") as lblf:
+                    lblf.writelines(lbls)
+
+    def iterate(self, files):
+
+        data = self.load_data(files)
+        for samples in tqdm.tqdm(data, total=len(files) // 32):
+
+            for wav, lbl in samples:
+                x = wav.unsqueeze(0).float().cuda()
+
+                div = 1
+                while x.size(-1) // div > self.args.max_size:
+                    div += 1
+
+                xs = x.chunk(div, dim=-1)
+
+                result = []
+                for x in xs:
+                    torch.cuda.empty_cache()
+                    x = self.model.feature_extractor(x)
+                    if self.quantize_location == "encoder":
+                        with torch.no_grad():
+                            _, idx = self.model.vector_quantizer.forward_idx(x)
+                            idx = idx.squeeze(0).cpu()
+                    else:
+                        with torch.no_grad():
+                            z = self.model.feature_aggregator(x)
+                            _, idx = self.model.vector_quantizer.forward_idx(z)
+                            idx = idx.squeeze(0).cpu()
+                    result.append(idx)
+
+                idx = torch.cat(result, dim=0)
+                yield " ".join("-".join(map(str, a.tolist())) for a in idx), lbl
+
+    def lbl_file(self, name):
+        shard_part = "" if self.args.shard is None else f".{self.args.shard}"
+        return osp.join(self.output_dir, f"{name}.lbl{shard_part}")
+
+    def data_file(self, name):
+        shard_part = "" if self.args.shard is None else f".{self.args.shard}"
+        return osp.join(self.output_dir, f"{name}.src{shard_part}")
+
+    def var_file(self):
+        return osp.join(self.output_dir, "vars.pt")
+
+    def load_config(self):
+
+        parser = argparse.ArgumentParser("Vector Quantized wav2vec features")
+
+        # Model Arguments
+        parser.add_argument("--checkpoint", type=ArgTypes.existing_path, required=True)
+        parser.add_argument("--data-parallel", action="store_true")
+
+        # Output Arguments
+        parser.add_argument("--output-dir", type=ArgTypes.mkdir, required=True)
+
+        # Data Arguments
+        parser.add_argument("--data-dir", type=ArgTypes.existing_path, required=True)
+        parser.add_argument("--splits", type=str, nargs="+", required=True)
+        parser.add_argument("--extension", type=str, required=True)
+        parser.add_argument("--labels", type=str, required=False)
+
+        parser.add_argument("--shard", type=int, default=None)
+        parser.add_argument("--num-shards", type=int, default=None)
+        parser.add_argument("--max-size", type=int, default=1300000)
+
+        # Logger Arguments
+        parser.add_argument(
+            "--log-format", type=str, choices=["none", "simple", "tqdm"]
+        )
+
+        return parser.parse_args()
+
+    def load_data(self, fnames):
+
+        dataset = FilesDataset(fnames, self.args.labels)
+        loader = DataLoader(
+            dataset, batch_size=32, collate_fn=dataset.collate, num_workers=8
+        )
+        return loader
+
+    def load_model(self):
+        cp = torch.load(self.checkpoint, map_location=lambda x, _: x)
+
+        model = Wav2VecModel.build_model(cp["args"], None)
+
+        self.quantize_location = getattr(cp["args"], "vq", "encoder")
+
+        model.load_state_dict(cp["model"])
+        model.eval().float()
+        model.cuda()
+
+        if self.data_parallel:
+            model = nn.DataParallel(model)
+
+        return model
+
+    def __call__(self):
+
+        self.process_splits()
+
+        if hasattr(self.model.feature_extractor, "vars") and (
+            self.args.shard is None or self.args.shard == 0
+        ):
+            vars = (
+                self.model.feature_extractor.vars.view(
+                    self.model.feature_extractor.banks,
+                    self.model.feature_extractor.num_vars,
+                    -1,
+                )
+                .cpu()
+                .detach()
+            )
+            print("writing learned latent variable embeddings: ", vars.shape)
+            torch.save(vars, self.var_file())
+
+
+if __name__ == "__main__":
+    write_data = DatasetWriter()
+
+    write_data()
+    print("Done.")
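A minimal, torch-free sketch of three pieces of bookkeeping in `DatasetWriter` above: how `iterate` picks the number of chunks for a long waveform, how `process_splits` assigns files to a shard, and how quantized indices are serialized to one text line. The helper names `num_chunks`, `shard_files` and `format_indices` are hypothetical and are not part of fairseq; they only mirror the loop, the slice and the join shown in the script.

```python
def num_chunks(num_samples, max_size):
    """Smallest div with num_samples // div <= max_size (mirrors the while loop in iterate)."""
    div = 1
    while num_samples // div > max_size:
        div += 1
    return div


def shard_files(files, shard=None, num_shards=None):
    """Strided selection files[shard::num_shards], as done in process_splits."""
    if shard is None:
        return list(files)
    return list(files[shard::num_shards])


def format_indices(idx_rows):
    """Join the per-group codebook indices of each frame with '-', and frames with ' '."""
    return " ".join("-".join(map(str, row)) for row in idx_rows)


if __name__ == "__main__":
    print(num_chunks(1300001, 1300000))
    print(shard_files(["a", "b", "c", "d", "e", "f", "g"], shard=1, num_shards=3))
    print(format_indices([[1, 2], [3, 4]]))
```

With the default `--max-size` of 1300000, a waveform only one sample longer than the limit is already split into two chunks, so each chunk stays well under the limit.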
--- a/examples/wav2vec/wav2vec_featurize.py
+++ b/examples/wav2vec/wav2vec_featurize.py
+#!/usr/bin/env python3
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""
+Helper script to pre-compute embeddings for a wav2letter++ dataset
+"""
+
+import argparse
+import glob
+import os
+from shutil import copy
+
+import h5py
+import numpy as np
+import soundfile as sf
+import torch
+import tqdm
+from fairseq.models.wav2vec.wav2vec import Wav2VecModel
+from torch import nn
+
+
+def read_audio(fname):
+    """ Load an audio file and return PCM along with the sample rate """
+
+    wav, sr = sf.read(fname)
+    assert sr == 16e3
+
+    return wav, 16e3
+
+
+class PretrainedWav2VecModel(nn.Module):
+    def __init__(self, fname):
+        super().__init__()
+
+        checkpoint = torch.load(fname)
+        self.args = checkpoint["args"]
+        model = Wav2VecModel.build_model(self.args, None)
+        model.load_state_dict(checkpoint["model"])
+        model.eval()
+
+        self.model = model
+
+    def forward(self, x):
+        with torch.no_grad():
+            z = self.model.feature_extractor(x)
+            if isinstance(z, tuple):
+                z = z[0]
+            c = self.model.feature_aggregator(z)
+        return z, c
+
+
+class EmbeddingWriterConfig(argparse.ArgumentParser):
+    def __init__(self):
+        super().__init__("Pre-compute embeddings for wav2letter++ datasets")
+
+        kwargs = {"action": "store", "type": str, "required": True}
+
+        self.add_argument("--input", "-i", help="Input Directory", **kwargs)
+        self.add_argument("--output", "-o", help="Output Directory", **kwargs)
+        self.add_argument("--model", help="Path to model checkpoint", **kwargs)
+        self.add_argument("--split", help="Dataset Splits", nargs="+", **kwargs)
+        self.add_argument(
+            "--ext", default="wav", required=False, help="Audio file extension"
+        )
+
+        self.add_argument(
+            "--no-copy-labels",
+            action="store_true",
+            help="Do not copy label files. Useful for large datasets; use --targetdir in wav2letter instead.",
+        )
+        self.add_argument(
+            "--use-feat",
+            action="store_true",
+            help="Use the feature vector ('z') instead of context vector ('c') for features",
+        )
+        self.add_argument("--gpu", help="GPU to use", default=0, type=int)
+
+
+class Prediction:
+    """ Lightweight wrapper around a fairspeech embedding model """
+
+    def __init__(self, fname, gpu=0):
+        self.gpu = gpu
+        self.model = PretrainedWav2VecModel(fname).cuda(gpu)
+
+    def __call__(self, x):
+        x = torch.from_numpy(x).float().cuda(self.gpu)
+        with torch.no_grad():
+            z, c = self.model(x.unsqueeze(0))
+
+        return z.squeeze(0).cpu().numpy(), c.squeeze(0).cpu().numpy()
+
+
+class H5Writer:
+    """ Write features as hdf5 file in wav2letter++ compatible format """
+
+    def __init__(self, fname):
+        self.fname = fname
+        os.makedirs(os.path.dirname(self.fname), exist_ok=True)
+
+    def write(self, data):
+        channel, T = data.shape
+
+        with h5py.File(self.fname, "w") as out_ds:
+            data = data.T.flatten()
+            out_ds["features"] = data
+            out_ds["info"] = np.array([16e3 // 160, T, channel])
+
+
+class EmbeddingDatasetWriter(object):
+    """Given a model and a wav2letter++ dataset, pre-compute and store embeddings
+
+    Args:
+        input_root, str :
+            Path to the wav2letter++ dataset
+        output_root, str :
+            Desired output directory. Will be created if non-existent
+        split, str :
+            Dataset split
+    """
+
+    def __init__(
+        self,
+        input_root,
+        output_root,
+        split,
+        model_fname,
+        extension="wav",
+        gpu=0,
+        verbose=False,
+        use_feat=False,
+    ):
+
+        assert os.path.exists(model_fname)
+
+        self.model_fname = model_fname
+        self.model = Prediction(self.model_fname, gpu)
+
+        self.input_root = input_root
+        self.output_root = output_root
+        self.split = split
+        self.verbose = verbose
+        self.extension = extension
+        self.use_feat = use_feat
+
+        assert os.path.exists(self.input_path), "Input path '{}' does not exist".format(
+            self.input_path
+        )
+
+    def _progress(self, iterable, **kwargs):
+        if self.verbose:
+            return tqdm.tqdm(iterable, **kwargs)
+        return iterable
+
+    def require_output_path(self, fname=None):
+        path = self.get_output_path(fname)
+        os.makedirs(path, exist_ok=True)
+
+    @property
+    def input_path(self):
+        return self.get_input_path()
+
+    @property
+    def output_path(self):
+        return self.get_output_path()
+
+    def get_input_path(self, fname=None):
+        if fname is None:
+            return os.path.join(self.input_root, self.split)
+        return os.path.join(self.get_input_path(), fname)
+
+    def get_output_path(self, fname=None):
+        if fname is None:
+            return os.path.join(self.output_root, self.split)
+        return os.path.join(self.get_output_path(), fname)
+
+    def copy_labels(self):
+        self.require_output_path()
+
+        labels = list(
+            filter(
+                lambda x: self.extension not in x, glob.glob(self.get_input_path("*"))
+            )
+        )
+        for fname in tqdm.tqdm(labels):
+            copy(fname, self.output_path)
+
+    @property
+    def input_fnames(self):
+        return sorted(glob.glob(self.get_input_path("*.{}".format(self.extension))))
+
+    def __len__(self):
+        return len(self.input_fnames)
+
+    def write_features(self):
+
+        paths = self.input_fnames
+
+        fnames_context = map(
+            lambda x: os.path.join(
+                self.output_path, x.replace("." + self.extension, ".h5context")
+            ),
+            map(os.path.basename, paths),
+        )
+
+        for name, target_fname in self._progress(
+            zip(paths, fnames_context), total=len(self)
+        ):
+            wav, sr = read_audio(name)
+            z, c = self.model(wav)
+            feat = z if self.use_feat else c
+            writer = H5Writer(target_fname)
+            writer.write(feat)
+
+    def __repr__(self):
+
+        return "EmbeddingDatasetWriter ({n_files} files)\n\tinput:\t{input_root}\n\toutput:\t{output_root}\n\tsplit:\t{split}".format(
+            n_files=len(self), **self.__dict__
+        )
+
+
+if __name__ == "__main__":
+
+    args = EmbeddingWriterConfig().parse_args()
+
+    for split in args.split:
+
+        writer = EmbeddingDatasetWriter(
+            input_root=args.input,
+            output_root=args.output,
+            split=split,
+            model_fname=args.model,
+            gpu=args.gpu,
+            extension=args.ext,
+            use_feat=args.use_feat,
+        )
+
+        print(writer)
+        writer.require_output_path()
+
+        print("Writing Features...")
+        writer.write_features()
+        print("Done.")
+
+        if not args.no_copy_labels:
+            print("Copying label data...")
+            writer.copy_labels()
+            print("Done.")
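The layout written by `H5Writer.write` above can be sketched without h5py or NumPy. The features arrive with shape `(channel, T)`, are transposed and flattened so that all channels of one time step are stored together, and the `info` entry records `[16e3 // 160, T, channel]`. In this sketch nested Python lists stand in for arrays, and the helper names `flatten_time_major` and `unflatten` are hypothetical, not part of fairseq or wav2letter++.

```python
def flatten_time_major(features):
    """Mimic data.T.flatten() for a (channel, T) nested list, plus the info triple."""
    channel = len(features)
    T = len(features[0])
    flat = [features[c][t] for t in range(T) for c in range(channel)]
    info = [16e3 // 160, T, channel]
    return flat, info


def unflatten(flat, info):
    """Recover the (channel, T) layout from the flat buffer and its info triple."""
    _, T, channel = info
    T, channel = int(T), int(channel)
    return [[flat[t * channel + c] for t in range(T)] for c in range(channel)]


if __name__ == "__main__":
    feats = [[1, 2, 3], [4, 5, 6]]  # 2 channels, 3 time steps
    flat, info = flatten_time_major(feats)
    print(flat, info)
    print(unflatten(flat, info) == feats)
```

A reader of the `.h5context` file therefore needs the `info` entry to undo the flattening, because the flat `features` buffer alone does not carry the shape.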
--- a/examples/wav2vec/wav2vec_manifest.py
+++ b/examples/wav2vec/wav2vec_manifest.py
+#!/usr/bin/env python3
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+Data pre-processing: build train/valid manifest files (.tsv) listing audio files and their frame counts.
+"""
+
+import argparse
+import glob
+import os
+import random
+
+import soundfile
+
+
+def get_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "root", metavar="DIR", help="root directory containing flac files to index"
+    )
+    parser.add_argument(
+        "--valid-percent",
+        default=0.01,
+        type=float,
+        metavar="D",
+        help="fraction of data to use as validation set (between 0 and 1)",
+    )
+    parser.add_argument(
+        "--dest", default=".", type=str, metavar="DIR", help="output directory"
+    )
+    parser.add_argument(
+        "--ext", default="flac", type=str, metavar="EXT", help="extension to look for"
+    )
+    parser.add_argument("--seed", default=42, type=int, metavar="N", help="random seed")
+    parser.add_argument(
+        "--path-must-contain",
+        default=None,
+        type=str,
+        metavar="FRAG",
+        help="if set, path must contain this substring for a file to be included in the manifest",
+    )
+    return parser
+
+
+def main(args):
+    assert 0 <= args.valid_percent <= 1.0
+
+    dir_path = os.path.realpath(args.root)
+    search_path = os.path.join(dir_path, "**/*." + args.ext)
+    rand = random.Random(args.seed)
+
+    with open(os.path.join(args.dest, "train.tsv"), "w") as train_f, open(
+        os.path.join(args.dest, "valid.tsv"), "w"
+    ) as valid_f:
+        print(dir_path, file=train_f)
+        print(dir_path, file=valid_f)
+
+        for fname in glob.iglob(search_path, recursive=True):
+            file_path = os.path.realpath(fname)
+
+            if args.path_must_contain and args.path_must_contain not in file_path:
+                continue
+
+            frames = soundfile.info(fname).frames
+            dest = train_f if rand.random() > args.valid_percent else valid_f
+            print(
+                "{}\t{}".format(os.path.relpath(file_path, dir_path), frames), file=dest
+            )
+
+
+if __name__ == "__main__":
+    parser = get_parser()
+    args = parser.parse_args()
+    main(args)
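The manifest format produced by `main` above is a root directory on the first line, followed by one `relative_path<TAB>num_frames` line per audio file. A minimal sketch of a reader for that format follows; it is modeled on `read_manifest` in `vq-wav2vec_featurize.py`, but the function name `parse_manifest` and the choice to also return the frame counts are assumptions made for illustration.

```python
import os


def parse_manifest(text):
    """Parse manifest text into (root, [(absolute_path, num_frames), ...])."""
    lines = text.split("\n")
    root = lines.pop(0).strip()
    entries = []
    for line in lines:
        if len(line) == 0:
            continue  # skip the trailing empty line left by the final newline
        rel_path, frames = line.split("\t")
        entries.append((os.path.join(root, rel_path), int(frames)))
    return root, entries


if __name__ == "__main__":
    sample = "/data/audio\na.flac\t16000\nspk1/b.flac\t32000\n"
    root, entries = parse_manifest(sample)
    print(root)
    for path, frames in entries:
        print(path, frames)
```

Storing paths relative to the root keeps the `.tsv` files small and lets a dataset be relocated by editing only the first line.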
--- a/examples/wmt19/README.md
+++ b/examples/wmt19/README.md
+# WMT 19
+
+This page provides pointers to the models of Facebook-FAIR's WMT'19 news translation task submission [(Ng et al., 2019)](https://arxiv.org/abs/1907.06616).
+
+## Pre-trained models
+
+Model | Description | Download
+---|---|---
+`transformer.wmt19.en-de` | En->De Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
+`transformer.wmt19.de-en` | De->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
+`transformer.wmt19.en-ru` | En->Ru Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
+`transformer.wmt19.ru-en` | Ru->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
+`transformer_lm.wmt19.en` | En Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
+`transformer_lm.wmt19.de` | De Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
+`transformer_lm.wmt19.ru` | Ru Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
+
+## Pre-trained single models before finetuning
+
+Model | Description | Download
+---|---|---
+`transformer.wmt19.en-de` | En->De Single, no finetuning | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.ffn8192.tar.gz)
+`transformer.wmt19.de-en` | De->En Single, no finetuning  | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.ffn8192.tar.gz)
+`transformer.wmt19.en-ru` | En->Ru Single, no finetuning | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ffn8192.tar.gz)
+`transformer.wmt19.ru-en` | Ru->En Single, no finetuning  | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ffn8192.tar.gz)
+
+## Example usage (torch.hub)
+
+#### Requirements
+
+We require a few additional Python dependencies for preprocessing:
+```bash
+pip install fastBPE sacremoses
+```
+
+#### Translation
+
+```python
+import torch
+
+# English to German translation
+en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+                       tokenizer='moses', bpe='fastbpe')
+en2de.translate("Machine learning is great!")  # 'Maschinelles Lernen ist großartig!'
+
+# German to English translation
+de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+                       tokenizer='moses', bpe='fastbpe')
+de2en.translate("Maschinelles Lernen ist großartig!")  # 'Machine learning is great!'
+
+# English to Russian translation
+en2ru = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+                       tokenizer='moses', bpe='fastbpe')
+en2ru.translate("Machine learning is great!")  # 'Машинное обучение - это здорово!'
+
+# Russian to English translation
+ru2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.ru-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
+                       tokenizer='moses', bpe='fastbpe')
+ru2en.translate("Машинное обучение - это здорово!")  # 'Machine learning is great!'
+```
+
+#### Language Modeling
+
+```python
+# Sample from the English LM
+en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
+en_lm.sample("Machine learning is")  # 'Machine learning is the future of computing, says Microsoft boss Satya Nadella ...'
+
+# Sample from the German LM
+de_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.de', tokenizer='moses', bpe='fastbpe')
+de_lm.sample("Maschinelles lernen ist")  # 'Maschinelles lernen ist das A und O (neues-deutschland.de) Die Arbeitsbedingungen für Lehrerinnen und Lehrer sind seit Jahren verbesserungswürdig ...'
+
+# Sample from the Russian LM
+ru_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.ru', tokenizer='moses', bpe='fastbpe')
+ru_lm.sample("машинное обучение это")  # 'машинное обучение это то, что мы называем "искусственным интеллектом".'
+```
+
+## Citation
+```bibtex
+@inproceedings{ng2019facebook,
+  title = {Facebook FAIR's WMT19 News Translation Task Submission},
+  author = {Ng, Nathan and Yee, Kyra and Baevski, Alexei and Ott, Myle and Auli, Michael and Edunov, Sergey},
+  booktitle = {Proc. of WMT},
+  year = 2019,
+}
+```
--- a/examples/xlmr/README.md
+++ b/examples/xlmr/README.md
+# Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)
+https://arxiv.org/pdf/1911.02116.pdf
+
+## Introduction
+
+XLM-R (XLM-RoBERTa) is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5TB of filtered CommonCrawl data in 100 languages (list below).
+
+Language | Language | Language | Language | Language
+---|---|---|---|---
+Afrikaans | Albanian | Amharic | Arabic | Armenian 
+Assamese | Azerbaijani | Basque | Belarusian | Bengali 
+Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese 
+Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian 
+Czech | Danish | Dutch | English | Esperanto 
+Estonian | Filipino | Finnish | French | Galician
+Georgian | German | Greek | Gujarati | Hausa
+Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic
+Indonesian | Irish | Italian | Japanese | Javanese
+Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji)
+Kyrgyz | Lao | Latin | Latvian | Lithuanian
+Macedonian | Malagasy | Malay | Malayalam | Marathi
+Mongolian | Nepali | Norwegian | Oriya | Oromo
+Pashto | Persian | Polish | Portuguese | Punjabi
+Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian
+Sindhi | Sinhala | Slovak | Slovenian | Somali
+Spanish | Sundanese | Swahili | Swedish | Tamil
+Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish
+Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek
+Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish
+
+## Pre-trained models
+
+Model | Description | #params | vocab size | Download
+---|---|---|---|---
+`xlmr.base` | XLM-R using the BERT-base architecture | 250M | 250k | [xlmr.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz)
+`xlmr.large` | XLM-R using the BERT-large architecture | 560M | 250k | [xlmr.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz)
+
+(Note: The checkpoints above are the final models. If you were using the previously released `v0` version, we recommend switching to these; they have the same architecture and dictionary.)
+
+## Results
+
+**[XNLI (Conneau et al., 2018)](https://arxiv.org/abs/1809.05053)**
+
+Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur
+---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
+`roberta.large.mnli` _(TRANSLATE-TEST)_ | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8
+`xlmr.large` _(TRANSLATE-TRAIN-ALL)_ | **83.6** | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1
+
+**[MLQA (Lewis et al., 2019)](https://arxiv.org/abs/1910.07475)**
+
+Model | average | en | es | de | ar | hi | vi | zh
+---|---|---|---|---|---|---|---|---
+`BERT-large` | - | 80.2 / 67.4 | - | - | - | - | - | -
+`mBERT` | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3
+`xlmr.large` | **70.7 / 52.7** | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4
+
+
+## Example usage
+
+##### Load XLM-R from torch.hub (PyTorch >= 1.1):
+```python
+import torch
+xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
+xlmr.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Load XLM-R (for PyTorch 1.0 or custom models):
+```python
+# Download the xlmr.large model first (run these two commands in a shell, not in Python):
+#   wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
+#   tar -xzvf xlmr.large.tar.gz
+
+# Load the model in fairseq
+from fairseq.models.roberta import XLMRModel
+xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
+xlmr.eval()  # disable dropout (or leave in train mode to finetune)
+```
+
+##### Apply sentence-piece-model (SPM) encoding to input text:
+```python
+en_tokens = xlmr.encode('Hello world!')
+assert en_tokens.tolist() == [0, 35378,  8999, 38, 2]
+xlmr.decode(en_tokens)  # 'Hello world!'
+
+zh_tokens = xlmr.encode('你好，世界')
+assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
+xlmr.decode(zh_tokens)  # '你好，世界'
+
+hi_tokens = xlmr.encode('नमस्ते दुनिया')
+assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
+xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'
+
+ar_tokens = xlmr.encode('مرحبا بالعالم')
+assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
+xlmr.decode(ar_tokens) # 'مرحبا بالعالم'
+
+fr_tokens = xlmr.encode('Bonjour le monde')
+assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
+xlmr.decode(fr_tokens) # 'Bonjour le monde'
+```
+
+##### Extract features from XLM-R:
+```python
+# Extract the last layer's features
+last_layer_features = xlmr.extract_features(zh_tokens)
+assert last_layer_features.size() == torch.Size([1, 6, 1024])
+
+# Extract all layers' features (layer 0 is the embedding layer)
+all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
+assert len(all_layers) == 25
+assert torch.all(all_layers[-1] == last_layer_features)
+```
+
+## Citation
+
+```bibtex
+@article{conneau2019unsupervised,
+  title={Unsupervised Cross-lingual Representation Learning at Scale},
+  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
+  journal={arXiv preprint arXiv:1911.02116},
+  year={2019}
+}
+```
--- a/fairseq/__init__.py
+++ b/fairseq/__init__.py
+# Copyright (c) Facebook, Inc. and its affiliates.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+"""isort:skip_file"""
+
+__all__ = ["pdb"]
+__version__ = "0.10.2"
+
+import sys
+
+# backwards compatibility to support `from fairseq.meters import AverageMeter`
+from fairseq.logging import meters, metrics, progress_bar  # noqa
+
+sys.modules["fairseq.meters"] = meters
+sys.modules["fairseq.metrics"] = metrics
+sys.modules["fairseq.progress_bar"] = progress_bar
+
+import fairseq.criterions  # noqa
+import fairseq.models  # noqa
+import fairseq.modules  # noqa
+import fairseq.optim  # noqa
+import fairseq.optim.lr_scheduler  # noqa
+import fairseq.pdb  # noqa
+import fairseq.scoring  # noqa
+import fairseq.tasks  # noqa
+import fairseq.token_generation_constraints  # noqa
+
+import fairseq.benchmark  # noqa
+import fairseq.model_parallel  # noqa
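The backwards-compatibility lines in `fairseq/__init__.py` above rely on the fact that Python's import system consults `sys.modules` before searching for a module on disk, so registering an existing module object under an old name keeps old import statements working. A minimal sketch of that mechanism, using stand-in module names (`demo_logging_meters` and `demo_meters` are hypothetical and unrelated to fairseq's real modules):

```python
import importlib
import sys
import types

# Stand-in for the module at its new location (e.g. a relocated "meters" module).
new_home = types.ModuleType("demo_logging_meters")
new_home.AverageMeter = type("AverageMeter", (), {})
sys.modules["demo_logging_meters"] = new_home

# Register the same module object under the legacy name.
sys.modules["demo_meters"] = new_home

# Both import styles now resolve the legacy name to the relocated module.
legacy = importlib.import_module("demo_meters")
from demo_meters import AverageMeter

if __name__ == "__main__":
    print(legacy is new_home)
    print(AverageMeter is new_home.AverageMeter)
```

Because both names point at one module object, state such as registered classes is shared rather than duplicated.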
--- a/fairseq/__pycache__/__init__.cpython-38.pyc
+++ b/fairseq/__pycache__/__init__.cpython-38.pyc
--- a/fairseq/__pycache__/binarizer.cpython-38.pyc
+++ b/fairseq/__pycache__/binarizer.cpython-38.pyc
--- a/fairseq/__pycache__/checkpoint_utils.cpython-38.pyc
+++ b/fairseq/__pycache__/checkpoint_utils.cpython-38.pyc
--- a/fairseq/__pycache__/distributed_utils.cpython-38.pyc
+++ b/fairseq/__pycache__/distributed_utils.cpython-38.pyc
--- a/fairseq/__pycache__/file_io.cpython-38.pyc
+++ b/fairseq/__pycache__/file_io.cpython-38.pyc
--- a/fairseq/__pycache__/file_utils.cpython-38.pyc
+++ b/fairseq/__pycache__/file_utils.cpython-38.pyc
--- a/fairseq/__pycache__/incremental_decoding_utils.cpython-38.pyc
+++ b/fairseq/__pycache__/incremental_decoding_utils.cpython-38.pyc
--- a/fairseq/__pycache__/iterative_refinement_generator.cpython-38.pyc
+++ b/fairseq/__pycache__/iterative_refinement_generator.cpython-38.pyc