Commit 7df61696 authored by Sugon_ldc's avatar Sugon_ldc
Browse files

add fairseq0.10.2

parents
Pipeline #471 failed with stages
in 0 seconds
# Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)
This page includes instructions for reproducing results from the paper [Unsupervised Quality Estimation for Neural
Machine Translation (Fomicheva et al., 2020)](https://arxiv.org/abs/2005.10608)
## Requirements:
* mosesdecoder: https://github.com/moses-smt/mosesdecoder
* subword-nmt: https://github.com/rsennrich/subword-nmt
* flores: https://github.com/facebookresearch/flores
## Download Models and Test Data
Download translation models and test data from [MLQE dataset repository](https://github.com/facebookresearch/mlqe).
## Set up:
Given a testset consisting of source sentences and reference translations:
* `SRC_LANG`: source language
* `TGT_LANG`: target language
* `INPUT`: input prefix, such that the file `$INPUT.$SRC_LANG` contains source sentences and `$INPUT.$TGT_LANG`
contains the reference sentences
* `OUTPUT_DIR`: output path to store results
* `MOSES_DECODER`: path to mosesdecoder installation
* `BPE_ROOT`: path to subword-nmt installation
* `BPE`: path to BPE model
* `MODEL_DIR`: directory containing the NMT model `.pt` file as well as the source and target vocabularies.
* `TMP`: directory for intermediate temporary files
* `GPU`: if translating with GPU, id of the GPU to use for inference
* `DROPOUT_N`: number of stochastic forward passes
`$DROPOUT_N` is set to 30 in the experiments reported in the paper. However, we observed that increasing it beyond 10
does not bring substantial improvements.
## Translate the data using standard decoding
Preprocess the input data:
```
for LANG in $SRC_LANG $TGT_LANG; do
perl $MOSES_DECODER/scripts/tokenizer/tokenizer.perl -threads 80 -a -l $LANG < $INPUT.$LANG > $TMP/preprocessed.tok.$LANG
python $BPE_ROOT/apply_bpe.py -c ${BPE} < $TMP/preprocessed.tok.$LANG > $TMP/preprocessed.tok.bpe.$LANG
done
```
Binarize the data for faster translation:
```
fairseq-preprocess --srcdict $MODEL_DIR/dict.$SRC_LANG.txt --tgtdict $MODEL_DIR/dict.$TGT_LANG.txt
--source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref $TMP/preprocessed.tok.bpe --destdir $TMP/bin --workers 4
```
Translate
```
CUDA_VISIBLE_DEVICES=$GPU fairseq-generate $TMP/bin --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5
--source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 > $TMP/fairseq.out
grep ^H $TMP/fairseq.out | cut -f3- > $TMP/mt.out
```
Post-process
```
sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/mt.out | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl
-l $TGT_LANG > $OUTPUT_DIR/mt.out
```
## Produce uncertainty estimates
### Scoring
Make temporary files to store the translations repeated N times.
```
python ${SCRIPTS}/scripts/uncertainty/repeat_lines.py -i $TMP/preprocessed.tok.bpe.$SRC_LANG -n $DROPOUT_N
-o $TMP/repeated.$SRC_LANG
python ${SCRIPTS}/scripts/uncertainty/repeat_lines.py -i $TMP/mt.out -n $DROPOUT_N -o $TMP/repeated.$TGT_LANG
fairseq-preprocess --srcdict ${MODEL_DIR}/dict.${SRC_LANG}.txt $TGT_DIC --source-lang ${SRC_LANG}
--target-lang ${TGT_LANG} --testpref ${TMP}/repeated --destdir ${TMP}/bin-repeated
```
Produce model scores for the generated translations using `--retain-dropout` option to apply dropout at inference time:
```
CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${LP}.pt --beam 5
--source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 --score-reference --retain-dropout
--retain-dropout-modules TransformerModel TransformerEncoder TransformerDecoder TransformerEncoderLayer
TransformerDecoderLayer --seed 46 > $TMP/dropout.scoring.out
grep ^H $TMP/dropout.scoring.out | cut -f2- > $TMP/dropout.scores
```
Use `--retain-dropout-modules` to specify the modules. By default, dropout is applied in the same places
as for training.
Compute the mean of the resulting output distribution:
```
python $SCRIPTS/scripts/uncertainty/aggregate_scores.py -i $TMP/dropout.scores -o $OUTPUT_DIR/dropout.scores.mean
-n $DROPOUT_N
```
### Generation
Produce multiple translation hypotheses for the same source using `--retain-dropout` option:
```
CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${LP}.pt
--beam 5 --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --retain-dropout
--unkpen 5 --retain-dropout-modules TransformerModel TransformerEncoder TransformerDecoder
TransformerEncoderLayer TransformerDecoderLayer --seed 46 > $TMP/dropout.generation.out
grep ^H $TMP/dropout.generation.out | cut -f3- > $TMP/dropout.hypotheses_
sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/dropout.hypotheses_ | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl
-l $TGT_LANG > $TMP/dropout.hypotheses
```
Compute similarity between multiple hypotheses corresponding to the same source sentence using Meteor
evaluation metric:
```
python meteor.py -i $TMP/dropout.hypotheses -m <path_to_meteor_installation> -n $DROPOUT_N -o
$OUTPUT_DIR/dropout.gen.sim.meteor
```
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import sys
import numpy as np
aggregate_funcs = {
"std": np.std,
"var": np.var,
"median": np.median,
"mean": np.mean,
"min": np.min,
"max": np.max,
}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input_file", required=True, type=str)
parser.add_argument("-n", "--repeat_times", required=True, type=int)
parser.add_argument("-o", "--output_file", required=False)
parser.add_argument("-f", "--func", required=False, default="mean")
args = parser.parse_args()
stream = open(args.output_file, "w") if args.output_file else sys.stdout
segment_scores = []
for line in open(args.input_file):
segment_scores.append(float(line.strip()))
if len(segment_scores) == args.repeat_times:
stream.write("{}\n".format(aggregate_funcs[args.func](segment_scores)))
segment_scores = []
if __name__ == "__main__":
main()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import math
import os
import subprocess
import sys
import tempfile
from collections import defaultdict
from itertools import combinations
def read_translations(path, n_repeats):
segment_counter = 0
segment_translations = []
translations = defaultdict(list)
for line in open(path):
segment_translations.append(" ".join(line.split()))
if len(segment_translations) == n_repeats:
translations[segment_counter] = segment_translations
segment_translations = []
segment_counter += 1
return translations
def generate_input(translations, n_repeats):
_, ref_path = tempfile.mkstemp()
_, mt_path = tempfile.mkstemp()
ref_fh = open(ref_path, "w")
mt_fh = open(mt_path, "w")
for segid in sorted(translations.keys()):
assert len(translations[segid]) == n_repeats
indexes = combinations(range(n_repeats), 2)
for idx1, idx2 in indexes:
mt_fh.write(translations[segid][idx1].strip() + "\n")
ref_fh.write(translations[segid][idx2].strip() + "\n")
sys.stderr.write("\nSaved translations to %s and %s" % (ref_path, mt_path))
return ref_path, mt_path
def run_meteor(ref_path, mt_path, metric_path, lang="en"):
_, out_path = tempfile.mkstemp()
subprocess.call(
[
"java",
"-Xmx2G",
"-jar",
metric_path,
mt_path,
ref_path,
"-p",
"0.5 0.2 0.6 0.75", # default parameters, only changed alpha to give equal weight to P and R
"-norm",
"-l",
lang,
],
stdout=open(out_path, "w"),
)
os.remove(ref_path)
os.remove(mt_path)
sys.stderr.write("\nSaved Meteor output to %s" % out_path)
return out_path
def read_output(meteor_output_path, n_repeats):
n_combinations = math.factorial(n_repeats) / (
math.factorial(2) * math.factorial(n_repeats - 2)
)
raw_scores = []
average_scores = []
for line in open(meteor_output_path):
if not line.startswith("Segment "):
continue
score = float(line.strip().split("\t")[1])
raw_scores.append(score)
if len(raw_scores) == n_combinations:
average_scores.append(sum(raw_scores) / n_combinations)
raw_scores = []
os.remove(meteor_output_path)
return average_scores
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input")
parser.add_argument("-n", "--repeat_times", type=int)
parser.add_argument("-m", "--meteor")
parser.add_argument("-o", "--output")
args = parser.parse_args()
translations = read_translations(args.infile, args.repetitions)
sys.stderr.write("\nGenerating input for Meteor...")
ref_path, mt_path = generate_input(translations, args.repetitions)
sys.stderr.write("\nRunning Meteor...")
out_path = run_meteor(ref_path, mt_path, args.meteor)
sys.stderr.write("\nReading output...")
scores = read_output(out_path, args.repetitions)
sys.stderr.write("\nWriting results...")
with open(args.output, "w") as o:
for scr in scores:
o.write("{}\n".format(scr))
o.close()
if __name__ == "__main__":
main()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import sys
def _normalize_spaces(line):
return " ".join(line.split())
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input_file", required=True, type=str)
parser.add_argument("-n", "--repeat_times", required=True, type=int)
parser.add_argument("-o", "--output_file", required=False, type=str)
args = parser.parse_args()
stream = open(args.output_file, "w") if args.output_file else sys.stdout
for line in open(args.input_file):
for _ in range(args.repeat_times):
stream.write(_normalize_spaces(line) + "\n")
if __name__ == "__main__":
main()
# wav2vec 2.0
wav2vec 2.0 learns speech representations on unlabeled data as described in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](https://arxiv.org/abs/2006.11477).
## Pre-trained models
Model | Finetuning split | Dataset | Model
|---|---|---|---
Wav2Vec 2.0 Base | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt)
Wav2Vec 2.0 Base | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_10m.pt)
Wav2Vec 2.0 Base | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_100h.pt)
Wav2Vec 2.0 Base | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt)
Wav2Vec 2.0 Large | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/libri960_big.pt)
Wav2Vec 2.0 Large | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_10m.pt)
Wav2Vec 2.0 Large | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_100h.pt)
Wav2Vec 2.0 Large | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_960h.pt)
Wav2Vec 2.0 Large (LV-60) | No finetuning | [Libri-Light](https://github.com/facebookresearch/libri-light) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox.pt)
Wav2Vec 2.0 Large (LV-60) | 10 minutes | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_10m.pt)
Wav2Vec 2.0 Large (LV-60) | 100 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_100h.pt)
Wav2Vec 2.0 Large (LV-60) | 960 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h.pt)
## Training a new model with the CLI tools
Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate file 10 to 30 seconds in length)
### Prepare training data manifest:
First, install the `soundfile` library:
```shell script
pip install soundfile
```
Next, run:
```shell script
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
```
$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.
$valid should be set to some reasonable percentage (like 0.01) of training data to use for validation.
To use a pre-defined validation set (like dev-other from librispeech), set to it 0 and then overwrite valid.tsv with a
separately pre-processed manifest file.
### Train a wav2vec 2.0 base model:
This configuration was used for the base model trained on the Librispeech dataset in the wav2vec 2.0 paper
Note that this was tested with pytorch 1.4.0 and the input is expected to be single channel, sampled at 16 kHz
```shell script
$ python train.py --distributed-world-size 64 --distributed-port $PORT /manifest/path \
--save-dir /model/path --fp16 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 \
--lr 0.0005 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1400000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
```
Note: you can simulate 64 GPUs by using k GPUs and setting --update-freq 64/k
### Train a wav2vec 2.0 large model:
This configuration was used for the large model trained on the Libri-light dataset in the wav2vec 2.0 paper
```shell script
$ python train.py --distributed-world-size 128 --distributed-port $PORT /manifest/path \
--save-dir /model/path --fp16 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 768 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2.0,0.1,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 600000 \
--lr 0.0003 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.0 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.03 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --encoder-layers 24 --encoder-embed-dim 1024 \
--encoder-ffn-embed-dim 4096 --encoder-attention-heads 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 320000 --min-sample-size 32000 --dropout 0.0 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1200000 --max-update 600000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
```
Note: you can simulate 128 GPUs by using k GPUs and setting --update-freq 128/k
### Fine-tune a pre-trained model with CTC:
Fine-tuning a model requires parallel audio and labels file, as well as a vocabulary file in fairseq format.
A letter vocabulary can be downloaded [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
An example [script](libri_labels.py) that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:
```shell script
split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
```
Fine-tuning on 100h of Librispeech with letter targets:
```shell script
valid_subset=dev_other
python train.py --distributed-world-size 24 --distributed-port $PORT /path/to/training_data --save-dir /model/path --fp16 \
--wer-args '("/path/to/lm/4-gram.bin","/path/to/lexicon",2,-1)' \
--post-process letter --valid-subset $valid_subset --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 4 \
--max-update 80000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc --w2v-path /path/to/pretrained/model \
--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d
```
Note: you can simulate 24 GPUs by using k GPUs and setting --update-freq 24/k
Decoding with a language model during training requires wav2letter [python bindings](https://github.com/facebookresearch/wav2letter/wiki/Building-Python-bindings).
Alternatively, simply omit the --wer-args flag.
For hyper-parameters to fine-tune other Librispeech splits (10 minutes, 1 hour, etc) please refer to the table in Appendix B in the wav2vec 2.0 paper.
The main changes to make are adjusting --max-update, and then adjusting --warmup-steps, --hold-steps, and --decay steps so that they use 0.1/0.4/0.5 of max-update respectively. You then need to adjust --mask-prob and --mask-channel-prob. This should be set to the mask-length * x where x is the number in the table and mask-length is what you use for --mask-length (10 in this example. Use --mask-channel-length value for --mask-channel-prob).
For example, for 10 hours, we see in the paper that timestep mask prob should be 0.065, so we set --mask-prob to 10* 0.065 = 0.65. channel mask prob is 0.004, so we set it to 64 * 0.004 = 0.256. then we set --max-updates to 20000 and change --warmup-steps to 20000 * 0.1 = 2000, --hold-steps to 8000 and --decay-steps to 10000.
### Evaluating a CTC model:
Evaluating a CTC model with a language model requires wav2letter [python bindings](https://github.com/facebookresearch/wav2letter/wiki/Building-Python-bindings) to be installed.
Fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the [wav2letter model repository](https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019).
Be sure to upper-case the language model vocab after downloading it.
Letter dictionary for pre-trained models can be found [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
Next, run the evaluation command:
```shell script
$subset=dev_other
python examples/speech_recognition/infer.py /checkpoint/abaevski/data/speech/libri/10h/wav2vec/raw --task audio_pretraining \
--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results/for/sclite --w2l-decoder kenlm \
--lm-model /path/to/kenlm.bin --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
--post-process letter
```
To get raw numbers, use --w2l-decoder viterbi and omit the lexicon. To use the transformer language model, use --w2l-decoder fairseqlm.
# wav2vec
Example to train a wav2vec model as described in [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](https://arxiv.org/abs/1904.05862).
## Pre-trained models
Description | Dataset | Model
---|---|---
Wav2Vec large | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt)
#### Example usage:
```python
import torch
from fairseq.models.wav2vec import Wav2VecModel
cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
c = model.feature_aggregator(z)
```
## Training a new model with the CLI tools
Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length)
### Prepare training data manifest:
```
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
```
### Train a wav2vec model:
```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 --optimizer adam --max-lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test
```
### Extract embeddings from the downstream task data:
```
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/wav2vec_featurize.py --input /path/to/task/waves --output /path/to/output \
--model /model/path/checkpoint_best.pt --split train valid test
```
# vq-wav2vec
Example to train a vq-wav2vec model as described in [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (Baevski et al., 2019)](https://arxiv.org/abs/1910.05453).
These models are also used in [Effectiveness of self-supervised pre-training for speech recognition (Baevski et al., 2019)](https://arxiv.org/abs/1911.03912).
## Pre-trained models
Description | Dataset | Model
---|---|---
vq-wav2vec Gumbel | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec.pt)
vq-wav2vec K-means | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec_kmeans.pt)
Roberta on K-means codes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/bert_kmeans.tar)
#### Example usage:
```python
import torch
from fairseq.models.wav2vec import Wav2VecModel
cp = torch.load('/path/to/vq-wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()
wav_input_16khz = torch.randn(1,10000)
z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape) # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indexes corresponding to 2 groups in the model
```
## Training a new model with the CLI tools
Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate file 10 to 30 seconds in length)
### Prepare training data manifest:
```
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
```
### Train a gumbel vq-wav2vec model:
```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 \
--save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 \
--optimizer adam --max-lr 1e-05 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--activation gelu --offset auto --skip-connections-agg --residual-scale 0.5 \
--log-keys ["prob_perplexity","code_perplexity","temp"] --vq-type gumbel --vq-groups 2 --vq-depth 2 \
--combine-groups --vq-vars 320 --vq-temp (2,0.5,0.999995) --prediction-steps 12 --warmup-updates 1000 \
--warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 --max-sample-size 150000 \
--max-tokens 300000 --cross-sample-negatives 0 --update-freq 1 --seed 2 --skip-invalid-size-inputs-valid-test
```
for k-means training, set vq-type with "kmeans" and add --loss-weights [1] argument. Pre-trained models were trained on 16 GPUs.
### Tokenize audio data (e.g. for BERT training):
```
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/vq-wav2vec_featurize.py --data-dir /manifest/path --output-dir /path/to/output \
--checkpoint /model/path/checkpoint_best.pt --split train valid test --extension tsv
```
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Helper script to pre-compute embeddings for a wav2letter++ dataset
"""
import argparse
import os
def main():
parser = argparse.ArgumentParser()
parser.add_argument("tsv")
parser.add_argument("--output-dir", required=True)
parser.add_argument("--output-name", required=True)
args = parser.parse_args()
os.makedirs(args.output_dir, exist_ok=True)
transcriptions = {}
with open(args.tsv, "r") as tsv, open(
os.path.join(args.output_dir, args.output_name + ".ltr"), "w"
) as ltr_out, open(
os.path.join(args.output_dir, args.output_name + ".wrd"), "w"
) as wrd_out:
root = next(tsv).strip()
for line in tsv:
line = line.strip()
dir = os.path.dirname(line)
if dir not in transcriptions:
parts = dir.split(os.path.sep)
trans_path = f"{parts[-2]}-{parts[-1]}.trans.txt"
path = os.path.join(root, dir, trans_path)
assert os.path.exists(path)
texts = {}
with open(path, "r") as trans_f:
for tline in trans_f:
items = tline.strip().split()
texts[items[0]] = " ".join(items[1:])
transcriptions[dir] = texts
part = os.path.basename(line).split(".")[0]
assert part in transcriptions[dir]
print(transcriptions[dir][part], file=wrd_out)
print(
" ".join(list(transcriptions[dir][part].replace(" ", "|"))) + " |",
file=ltr_out,
)
if __name__ == "__main__":
main()
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Helper script to pre-compute embeddings for a wav2letter++ dataset
"""
import argparse
import glob
import os
import os.path as osp
import pprint
import soundfile as sf
import torch
import tqdm
from fairseq.models.wav2vec.wav2vec import Wav2VecModel
from torch import nn
from torch.utils.data import DataLoader
try:
import tqdm
except:
print("Install tqdm to use --log-format=tqdm")
class FilesDataset:
def __init__(self, files, labels):
self.files = files
if labels and osp.exists(labels):
with open(labels, "r") as lbl_f:
self.labels = [line.rstrip() for line in lbl_f]
else:
self.labels = labels
def __len__(self):
return len(self.files)
def __getitem__(self, index):
fname = self.files[index]
wav, sr = sf.read(fname)
assert sr == 16000
wav = torch.from_numpy(wav).float()
lbls = None
if self.labels:
if isinstance(self.labels, str):
lbl_file = osp.splitext(fname)[0] + "." + self.labels
with open(lbl_file, "r") as lblf:
lbls = lblf.readline()
assert lbls is not None
else:
lbls = self.labels[index]
return wav, lbls
def collate(self, batch):
return batch
class ArgTypes:
@staticmethod
def existing_path(arg):
arg = str(arg)
assert osp.exists(arg), f"File {arg} does not exist"
return arg
@staticmethod
def mkdir(arg):
arg = str(arg)
os.makedirs(arg, exist_ok=True)
return arg
class DatasetWriter:
def __init__(self):
self.args = self.load_config()
pprint.pprint(self.args.__dict__)
self.model = self.load_model()
def __getattr__(self, attr):
return getattr(self.args, attr)
def read_manifest(self, fname):
with open(fname, "r") as fp:
lines = fp.read().split("\n")
root = lines.pop(0).strip()
fnames = [
osp.join(root, line.split("\t")[0]) for line in lines if len(line) > 0
]
return fnames
def process_splits(self):
if self.args.shard is not None or self.args.num_shards is not None:
assert self.args.shard is not None and self.args.num_shards is not None
for split in self.splits:
print(split)
if self.extension == "tsv":
datadir = osp.join(self.data_dir, f"{split}.{self.extension}")
print("Reading manifest file: ", datadir)
files = self.read_manifest(datadir)
else:
datadir = osp.join(self.data_dir, split, f"**/*.{self.extension}")
files = glob.glob(datadir, recursive=True)
assert len(files) > 0
if self.args.shard is not None:
files = files[self.args.shard :: self.args.num_shards]
lbls = []
with open(self.data_file(split), "w") as srcf:
for line, lbl in self.iterate(files):
print(line, file=srcf)
if self.args.labels:
lbls.append(lbl + "\n")
if self.args.labels:
assert all(a is not None for a in lbls)
with open(self.lbl_file(split), "w") as lblf:
lblf.writelines(lbls)
def iterate(self, files):
data = self.load_data(files)
for samples in tqdm.tqdm(data, total=len(files) // 32):
for wav, lbl in samples:
x = wav.unsqueeze(0).float().cuda()
div = 1
while x.size(-1) // div > self.args.max_size:
div += 1
xs = x.chunk(div, dim=-1)
result = []
for x in xs:
torch.cuda.empty_cache()
x = self.model.feature_extractor(x)
if self.quantize_location == "encoder":
with torch.no_grad():
_, idx = self.model.vector_quantizer.forward_idx(x)
idx = idx.squeeze(0).cpu()
else:
with torch.no_grad():
z = self.model.feature_aggregator(x)
_, idx = self.model.vector_quantizer.forward_idx(z)
idx = idx.squeeze(0).cpu()
result.append(idx)
idx = torch.cat(result, dim=0)
yield " ".join("-".join(map(str, a.tolist())) for a in idx), lbl
def lbl_file(self, name):
shard_part = "" if self.args.shard is None else f".{self.args.shard}"
return osp.join(self.output_dir, f"{name}.lbl{shard_part}")
def data_file(self, name):
shard_part = "" if self.args.shard is None else f".{self.args.shard}"
return osp.join(self.output_dir, f"{name}.src{shard_part}")
def var_file(self):
return osp.join(self.output_dir, f"vars.pt")
def load_config(self):
parser = argparse.ArgumentParser("Vector Quantized wav2vec features")
# Model Arguments
parser.add_argument("--checkpoint", type=ArgTypes.existing_path, required=True)
parser.add_argument("--data-parallel", action="store_true")
# Output Arguments
parser.add_argument("--output-dir", type=ArgTypes.mkdir, required=True)
# Data Arguments
parser.add_argument("--data-dir", type=ArgTypes.existing_path, required=True)
parser.add_argument("--splits", type=str, nargs="+", required=True)
parser.add_argument("--extension", type=str, required=True)
parser.add_argument("--labels", type=str, required=False)
parser.add_argument("--shard", type=int, default=None)
parser.add_argument("--num-shards", type=int, default=None)
parser.add_argument("--max-size", type=int, default=1300000)
# Logger Arguments
parser.add_argument(
"--log-format", type=str, choices=["none", "simple", "tqdm"]
)
return parser.parse_args()
def load_data(self, fnames):
dataset = FilesDataset(fnames, self.args.labels)
loader = DataLoader(
dataset, batch_size=32, collate_fn=dataset.collate, num_workers=8
)
return loader
def load_model(self):
cp = torch.load(self.checkpoint, map_location=lambda x, _: x)
model = Wav2VecModel.build_model(cp["args"], None)
self.quantize_location = getattr(cp["args"], "vq", "encoder")
model.load_state_dict(cp["model"])
model.eval().float()
model.cuda()
if self.data_parallel:
model = nn.DataParallel(model)
return model
def __call__(self):
self.process_splits()
if hasattr(self.model.feature_extractor, "vars") and (
self.args.shard is None or self.args.shard == 0
):
vars = (
self.model.feature_extractor.vars.view(
self.model.feature_extractor.banks,
self.model.feature_extractor.num_vars,
-1,
)
.cpu()
.detach()
)
print("writing learned latent variable embeddings: ", vars.shape)
torch.save(vars, self.var_file())
if __name__ == "__main__":
write_data = DatasetWriter()
write_data()
print("Done.")
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Helper script to pre-compute embeddings for a wav2letter++ dataset
"""
import argparse
import glob
import os
from shutil import copy
import h5py
import numpy as np
import soundfile as sf
import torch
import tqdm
from fairseq.models.wav2vec.wav2vec import Wav2VecModel
from torch import nn
def read_audio(fname):
""" Load an audio file and return PCM along with the sample rate """
wav, sr = sf.read(fname)
assert sr == 16e3
return wav, 16e3
class PretrainedWav2VecModel(nn.Module):
def __init__(self, fname):
super().__init__()
checkpoint = torch.load(fname)
self.args = checkpoint["args"]
model = Wav2VecModel.build_model(self.args, None)
model.load_state_dict(checkpoint["model"])
model.eval()
self.model = model
def forward(self, x):
with torch.no_grad():
z = self.model.feature_extractor(x)
if isinstance(z, tuple):
z = z[0]
c = self.model.feature_aggregator(z)
return z, c
class EmbeddingWriterConfig(argparse.ArgumentParser):
def __init__(self):
super().__init__("Pre-compute embeddings for wav2letter++ datasets")
kwargs = {"action": "store", "type": str, "required": True}
self.add_argument("--input", "-i", help="Input Directory", **kwargs)
self.add_argument("--output", "-o", help="Output Directory", **kwargs)
self.add_argument("--model", help="Path to model checkpoint", **kwargs)
self.add_argument("--split", help="Dataset Splits", nargs="+", **kwargs)
self.add_argument(
"--ext", default="wav", required=False, help="Audio file extension"
)
self.add_argument(
"--no-copy-labels",
action="store_true",
help="Do not copy label files. Useful for large datasets, use --targetdir in wav2letter then.",
)
self.add_argument(
"--use-feat",
action="store_true",
help="Use the feature vector ('z') instead of context vector ('c') for features",
)
self.add_argument("--gpu", help="GPU to use", default=0, type=int)
class Prediction:
""" Lightweight wrapper around a fairspeech embedding model """
def __init__(self, fname, gpu=0):
self.gpu = gpu
self.model = PretrainedWav2VecModel(fname).cuda(gpu)
def __call__(self, x):
x = torch.from_numpy(x).float().cuda(self.gpu)
with torch.no_grad():
z, c = self.model(x.unsqueeze(0))
return z.squeeze(0).cpu().numpy(), c.squeeze(0).cpu().numpy()
class H5Writer:
""" Write features as hdf5 file in wav2letter++ compatible format """
def __init__(self, fname):
self.fname = fname
os.makedirs(os.path.dirname(self.fname), exist_ok=True)
def write(self, data):
channel, T = data.shape
with h5py.File(self.fname, "w") as out_ds:
data = data.T.flatten()
out_ds["features"] = data
out_ds["info"] = np.array([16e3 // 160, T, channel])
class EmbeddingDatasetWriter(object):
"""Given a model and a wav2letter++ dataset, pre-compute and store embeddings
Args:
input_root, str :
Path to the wav2letter++ dataset
output_root, str :
Desired output directory. Will be created if non-existent
split, str :
Dataset split
"""
def __init__(
self,
input_root,
output_root,
split,
model_fname,
extension="wav",
gpu=0,
verbose=False,
use_feat=False,
):
assert os.path.exists(model_fname)
self.model_fname = model_fname
self.model = Prediction(self.model_fname, gpu)
self.input_root = input_root
self.output_root = output_root
self.split = split
self.verbose = verbose
self.extension = extension
self.use_feat = use_feat
assert os.path.exists(self.input_path), "Input path '{}' does not exist".format(
self.input_path
)
def _progress(self, iterable, **kwargs):
if self.verbose:
return tqdm.tqdm(iterable, **kwargs)
return iterable
def require_output_path(self, fname=None):
path = self.get_output_path(fname)
os.makedirs(path, exist_ok=True)
@property
def input_path(self):
return self.get_input_path()
@property
def output_path(self):
return self.get_output_path()
def get_input_path(self, fname=None):
if fname is None:
return os.path.join(self.input_root, self.split)
return os.path.join(self.get_input_path(), fname)
def get_output_path(self, fname=None):
if fname is None:
return os.path.join(self.output_root, self.split)
return os.path.join(self.get_output_path(), fname)
def copy_labels(self):
self.require_output_path()
labels = list(
filter(
lambda x: self.extension not in x, glob.glob(self.get_input_path("*"))
)
)
for fname in tqdm.tqdm(labels):
copy(fname, self.output_path)
@property
def input_fnames(self):
return sorted(glob.glob(self.get_input_path("*.{}".format(self.extension))))
def __len__(self):
return len(self.input_fnames)
def write_features(self):
paths = self.input_fnames
fnames_context = map(
lambda x: os.path.join(
self.output_path, x.replace("." + self.extension, ".h5context")
),
map(os.path.basename, paths),
)
for name, target_fname in self._progress(
zip(paths, fnames_context), total=len(self)
):
wav, sr = read_audio(name)
z, c = self.model(wav)
feat = z if self.use_feat else c
writer = H5Writer(target_fname)
writer.write(feat)
def __repr__(self):
return "EmbeddingDatasetWriter ({n_files} files)\n\tinput:\t{input_root}\n\toutput:\t{output_root}\n\tsplit:\t{split})".format(
n_files=len(self), **self.__dict__
)
if __name__ == "__main__":
args = EmbeddingWriterConfig().parse_args()
for split in args.split:
writer = EmbeddingDatasetWriter(
input_root=args.input,
output_root=args.output,
split=split,
model_fname=args.model,
gpu=args.gpu,
extension=args.ext,
use_feat=args.use_feat,
)
print(writer)
writer.require_output_path()
print("Writing Features...")
writer.write_features()
print("Done.")
if not args.no_copy_labels:
print("Copying label data...")
writer.copy_labels()
print("Done.")
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Data pre-processing: build vocabularies and binarize training data.
"""
import argparse
import glob
import os
import random
import soundfile
def get_parser():
parser = argparse.ArgumentParser()
parser.add_argument(
"root", metavar="DIR", help="root directory containing flac files to index"
)
parser.add_argument(
"--valid-percent",
default=0.01,
type=float,
metavar="D",
help="percentage of data to use as validation set (between 0 and 1)",
)
parser.add_argument(
"--dest", default=".", type=str, metavar="DIR", help="output directory"
)
parser.add_argument(
"--ext", default="flac", type=str, metavar="EXT", help="extension to look for"
)
parser.add_argument("--seed", default=42, type=int, metavar="N", help="random seed")
parser.add_argument(
"--path-must-contain",
default=None,
type=str,
metavar="FRAG",
help="if set, path must contain this substring for a file to be included in the manifest",
)
return parser
def main(args):
assert args.valid_percent >= 0 and args.valid_percent <= 1.0
dir_path = os.path.realpath(args.root)
search_path = os.path.join(dir_path, "**/*." + args.ext)
rand = random.Random(args.seed)
with open(os.path.join(args.dest, "train.tsv"), "w") as train_f, open(
os.path.join(args.dest, "valid.tsv"), "w"
) as valid_f:
print(dir_path, file=train_f)
print(dir_path, file=valid_f)
for fname in glob.iglob(search_path, recursive=True):
file_path = os.path.realpath(fname)
if args.path_must_contain and args.path_must_contain not in file_path:
continue
frames = soundfile.info(fname).frames
dest = train_f if rand.random() > args.valid_percent else valid_f
print(
"{}\t{}".format(os.path.relpath(file_path, dir_path), frames), file=dest
)
if __name__ == "__main__":
parser = get_parser()
args = parser.parse_args()
main(args)
# WMT 19
This page provides pointers to the models of Facebook-FAIR's WMT'19 news translation task submission [(Ng et al., 2019)](https://arxiv.org/abs/1907.06616).
## Pre-trained models
Model | Description | Download
---|---|---
`transformer.wmt19.en-de` | En->De Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
`transformer.wmt19.de-en` | De->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
`transformer.wmt19.en-ru` | En->Ru Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
`transformer.wmt19.ru-en` | Ru->En Ensemble | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
`transformer_lm.wmt19.en` | En Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz)
`transformer_lm.wmt19.de` | De Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.de.tar.gz)
`transformer_lm.wmt19.ru` | Ru Language Model | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.ru.tar.gz)
## Pre-trained single models before finetuning
Model | Description | Download
---|---|---
`transformer.wmt19.en-de` | En->De Single, no finetuning | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.ffn8192.tar.gz)
`transformer.wmt19.de-en` | De->En Single, no finetuning | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.ffn8192.tar.gz)
`transformer.wmt19.en-ru` | En->Ru Single, no finetuning | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ffn8192.tar.gz)
`transformer.wmt19.ru-en` | Ru->En Single, no finetuning | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ffn8192.tar.gz)
## Example usage (torch.hub)
#### Requirements
We require a few additional Python dependencies for preprocessing:
```bash
pip install fastBPE sacremoses
```
#### Translation
```python
import torch
# English to German translation
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
en2de.translate("Machine learning is great!") # 'Maschinelles Lernen ist großartig!'
# German to English translation
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
de2en.translate("Maschinelles Lernen ist großartig!") # 'Machine learning is great!'
# English to Russian translation
en2ru = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-ru', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
en2ru.translate("Machine learning is great!") # 'Машинное обучение - это здорово!'
# Russian to English translation
ru2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.ru-en', checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
tokenizer='moses', bpe='fastbpe')
ru2en.translate("Машинное обучение - это здорово!") # 'Machine learning is great!'
```
#### Language Modeling
```python
# Sample from the English LM
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
en_lm.sample("Machine learning is") # 'Machine learning is the future of computing, says Microsoft boss Satya Nadella ...'
# Sample from the German LM
de_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.de', tokenizer='moses', bpe='fastbpe')
de_lm.sample("Maschinelles lernen ist") # 'Maschinelles lernen ist das A und O (neues-deutschland.de) Die Arbeitsbedingungen für Lehrerinnen und Lehrer sind seit Jahren verbesserungswürdig ...'
# Sample from the Russian LM
ru_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.ru', tokenizer='moses', bpe='fastbpe')
ru_lm.sample("машинное обучение это") # 'машинное обучение это то, что мы называем "искусственным интеллектом".'
```
## Citation
```bibtex
@inproceedings{ng2019facebook},
title = {Facebook FAIR's WMT19 News Translation Task Submission},
author = {Ng, Nathan and Yee, Kyra and Baevski, Alexei and Ott, Myle and Auli, Michael and Edunov, Sergey},
booktitle = {Proc. of WMT},
year = 2019,
}
```
# Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)
https://arxiv.org/pdf/1911.02116.pdf
## Introduction
XLM-R (XLM-RoBERTa) is a generic cross lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It is trained on 2.5T of filtered CommonCrawl data in 100 languages (list below).
Language | Language|Language |Language | Language
---|---|---|---|---
Afrikaans | Albanian | Amharic | Arabic | Armenian
Assamese | Azerbaijani | Basque | Belarusian | Bengali
Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese
Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian
Czech | Danish | Dutch | English | Esperanto
Estonian | Filipino | Finnish | French | Galician
Georgian | German | Greek | Gujarati | Hausa
Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic
Indonesian | Irish | Italian | Japanese | Javanese
Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji)
Kyrgyz | Lao | Latin | Latvian | Lithuanian
Macedonian | Malagasy | Malay | Malayalam | Marathi
Mongolian | Nepali | Norwegian | Oriya | Oromo
Pashto | Persian | Polish | Portuguese | Punjabi
Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian
Sindhi | Sinhala | Slovak | Slovenian | Somali
Spanish | Sundanese | Swahili | Swedish | Tamil
Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish
Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek
Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish
## Pre-trained models
Model | Description | #params | vocab size | Download
---|---|---|---|---
`xlmr.base` | XLM-R using the BERT-base architecture | 250M | 250k | [xlm.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr.base.tar.gz)
`xlmr.large` | XLM-R using the BERT-large architecture | 560M | 250k | [xlm.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz)
(Note: Above are final model checkpoints. If you were using previously released `v0` version, we recommend using above. They have same architecture and dictionary.)
## Results
**[XNLI (Conneau et al., 2018)](https://arxiv.org/abs/1809.05053)**
Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
`roberta.large.mnli` _(TRANSLATE-TEST)_ | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8
`xlmr.large` _(TRANSLATE-TRAIN-ALL)_ | **83.6** | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1
**[MLQA (Lewis et al., 2018)](https://arxiv.org/abs/1910.07475)**
Model | average | en | es | de | ar | hi | vi | zh
---|---|---|---|---|---|---|---|---
`BERT-large` | - | 80.2/67.4 | - | - | - | - | - | -
`mBERT` | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8| 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3
`xlmr.large` | **70.7 / 52.7** | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4
## Example usage
##### Load XLM-R from torch.hub (PyTorch >= 1.1):
```python
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval() # disable dropout (or leave in train mode to finetune)
```
##### Load XLM-R (for PyTorch 1.0 or custom models):
```python
# Download xlmr.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
tar -xzvf xlmr.large.tar.gz
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval() # disable dropout (or leave in train mode to finetune)
```
##### Apply sentence-piece-model (SPM) encoding to input text:
```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens) # 'Hello world!'
zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens) # '你好,世界'
hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens) # 'नमस्ते दुनिया'
ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens) # 'مرحبا بالعالم'
fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens) # 'Bonjour le monde'
```
##### Extract features from XLM-R:
```python
# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])
# Extract all layer's features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
## Citation
```bibtex
@article{conneau2019unsupervised,
title={Unsupervised Cross-lingual Representation Learning at Scale},
author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
journal={arXiv preprint arXiv:1911.02116},
year={2019}
}
```
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""isort:skip_file"""
__all__ = ["pdb"]
__version__ = "0.10.2"
import sys
# backwards compatibility to support `from fairseq.meters import AverageMeter`
from fairseq.logging import meters, metrics, progress_bar # noqa
sys.modules["fairseq.meters"] = meters
sys.modules["fairseq.metrics"] = metrics
sys.modules["fairseq.progress_bar"] = progress_bar
import fairseq.criterions # noqa
import fairseq.models # noqa
import fairseq.modules # noqa
import fairseq.optim # noqa
import fairseq.optim.lr_scheduler # noqa
import fairseq.pdb # noqa
import fairseq.scoring # noqa
import fairseq.tasks # noqa
import fairseq.token_generation_constraints # noqa
import fairseq.benchmark # noqa
import fairseq.model_parallel # noqa
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment