# Hierarchical Neural Story Generation (Fan et al., 2018)
The following commands provide an example of pre-processing data, training a model, and generating text for story generation with the WritingPrompts dataset.
## Pre-trained models
Description | Dataset | Model | Test set(s)
---|---|---|---
Stories with Convolutional Model <br> ([Fan et al., 2018](https://arxiv.org/abs/1805.04833)) | [WritingPrompts](https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/stories_checkpoint.tar.bz2) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/stories_test.tar.bz2)
We provide sample stories generated by the [convolutional seq2seq model](https://dl.fbaipublicfiles.com/fairseq/data/seq2seq_stories.txt) and [fusion model](https://dl.fbaipublicfiles.com/fairseq/data/fusion_stories.txt) from [Fan et al., 2018](https://arxiv.org/abs/1805.04833). The corresponding prompts for the fusion model can be found [here](https://dl.fbaipublicfiles.com/fairseq/data/fusion_prompts.txt). Note that the files contain `<unk>` tokens, as we modeled a small full-word vocabulary (no BPE or pre-training). We did not use prompts containing `<unk>` for human evaluation.
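If you want to apply the same filtering yourself, a minimal sketch (assuming the downloaded `fusion_prompts.txt` marks unknown words literally as `<unk>`; adjust the token string if your copy differs):
```python
# Drop prompts that contain <unk> tokens (sketch, not part of the original release).
with open("fusion_prompts.txt") as f:
    prompts = [line.rstrip("\n") for line in f]

clean = [p for p in prompts if "<unk>" not in p]
print(f"kept {len(clean)} of {len(prompts)} prompts")

with open("fusion_prompts.clean.txt", "w") as o:
    o.write("\n".join(clean) + "\n")
```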
## Dataset
The dataset can be downloaded like this:
```bash
cd examples/stories
curl https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz | tar xvzf -
```
and contains train, test, and valid splits. The dataset is described here: https://arxiv.org/abs/1805.04833. We model only the first 1000 words of each story, including one newline token.
## Example usage
First we will preprocess the dataset. Note that the released dataset contains the full stories, while the paper models only the first 1000 words of each story. Here is example code that trims the dataset to the first 1000 words of each story:
```python
data = ["train", "test", "valid"]
for name in data:
with open(name + ".wp_target") as f:
stories = f.readlines()
stories = [" ".join(i.split()[0:1000]) for i in stories]
with open(name + ".wp_target", "w") as o:
for line in stories:
o.write(line.strip() + "\n")
```
Once we've trimmed the data we can binarize it and train our model:
```bash
# Binarize the dataset:
export TEXT=examples/stories/writingPrompts
fairseq-preprocess --source-lang wp_source --target-lang wp_target \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/writingPrompts --padding-factor 1 --thresholdtgt 10 --thresholdsrc 10
# Train the model:
fairseq-train data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --optimizer nag --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained False
# Train a fusion model:
# add the arguments: --pretrained True --pretrained-checkpoint path/to/checkpoint
# Generate:
# Note: to load the pretrained model at generation time, you need to pass a
# --model-overrides argument that tells the fusion model where the pretrained
# checkpoint is located. By default it will look for the exact path recorded at
# training time, so use --model-overrides if you have moved the pretrained model
# (or are using our provided models). If you are generating from a non-fusion
# model, --model-overrides is not necessary.
fairseq-generate data-bin/writingPrompts --path /path/to/trained/model/checkpoint_best.pt --batch-size 32 --beam 1 --sampling --sampling-topk 10 --temperature 0.8 --nbest 1 --model-overrides "{'pretrained_checkpoint':'/path/to/pretrained/model/checkpoint'}"
```
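To experiment with your own prompts without binarizing a test set, `fairseq-interactive` can be pointed at the same checkpoint. This is an untested sketch using the sampling settings from above; the prompt must be tokenized like the training data, and fusion models additionally need the `--model-overrides` argument described above:
```bash
echo "A dragon wakes up in the middle of a public library ." \
  | fairseq-interactive data-bin/writingPrompts \
    --source-lang wp_source --target-lang wp_target \
    --path /path/to/trained/model/checkpoint_best.pt \
    --beam 1 --sampling --sampling-topk 10 --temperature 0.8 --nbest 1
```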
## Citation
```bibtex
@inproceedings{fan2018hierarchical,
  title = {Hierarchical Neural Story Generation},
  author = {Fan, Angela and Lewis, Mike and Dauphin, Yann},
  booktitle = {Conference of the Association for Computational Linguistics (ACL)},
  year = 2018,
}
```
# Neural Machine Translation
This README contains instructions for [using pretrained translation models](#example-usage-torchhub)
as well as [training new models](#training-a-new-model).
## Pre-trained models
Model | Description | Dataset | Download
---|---|---|---
`conv.wmt14.en-fr` | Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2) <br> newstest2012/2013: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.ntst1213.tar.bz2)
`conv.wmt14.en-de` | Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-de.fconv-py.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-de.newstest2014.tar.bz2)
`conv.wmt17.en-de` | Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT17 English-German](http://statmt.org/wmt17/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt17.v2.en-de.fconv-py.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.v2.en-de.newstest2014.tar.bz2)
`transformer.wmt14.en-fr` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
`transformer.wmt16.en-de` | Transformer <br> ([Ott et al., 2018](https://arxiv.org/abs/1806.00187)) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
`transformer.wmt19.en-de` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 English-German](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz)
`transformer.wmt19.de-en` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 German-English](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.de-en.joined-dict.ensemble.tar.gz)
`transformer.wmt19.en-ru` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 English-Russian](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz)
`transformer.wmt19.ru-en` | Transformer <br> ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)) <br> WMT'19 winner | [WMT'19 Russian-English](http://www.statmt.org/wmt19/translation-task.html) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt19.ru-en.ensemble.tar.gz)
## Example usage (torch.hub)
We require a few additional Python dependencies for preprocessing:
```bash
pip install fastBPE sacremoses subword_nmt
```
Interactive translation via PyTorch Hub:
```python
import torch
# List available models
torch.hub.list('pytorch/fairseq') # [..., 'transformer.wmt16.en-de', ... ]
# Load a transformer trained on WMT'16 En-De
# Note: WMT'19 models use fastBPE instead of subword_nmt, see instructions below
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt16.en-de',
                       tokenizer='moses', bpe='subword_nmt')
en2de.eval() # disable dropout
# The underlying model is available under the *models* attribute
assert isinstance(en2de.models[0], fairseq.models.transformer.TransformerModel)
# Move model to GPU for faster translation
en2de.cuda()
# Translate a sentence
en2de.translate('Hello world!')
# 'Hallo Welt!'
# Batched translation
en2de.translate(['Hello world!', 'The cat sat on the mat.'])
# ['Hallo Welt!', 'Die Katze saß auf der Matte.']
```
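Since `translate()` also accepts a list of sentences, translating a whole file in batches only takes a few lines. A minimal sketch (the filenames `input.en`/`output.de` and the batch size of 32 are arbitrary assumptions):
```python
batch_size = 32  # arbitrary; tune for your GPU memory
with open('input.en') as f:
    sentences = [line.strip() for line in f]

translations = []
for i in range(0, len(sentences), batch_size):
    # translate() handles tokenization, BPE and detokenization internally
    translations.extend(en2de.translate(sentences[i:i + batch_size]))

with open('output.de', 'w') as f:
    f.write('\n'.join(translations) + '\n')
```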
Loading custom models:
```python
from fairseq.models.transformer import TransformerModel
zh2en = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt17_zh_en_full',
    bpe='subword_nmt',
    bpe_codes='data-bin/wmt17_zh_en_full/zh.code'
)
zh2en.translate('你好 世界')
# 'Hello World'
```
If you are using one of the `transformer.wmt19` models, you will need to set the `bpe`
argument to `'fastbpe'` and (optionally) load the 4-model ensemble:
```python
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')
en2de.eval() # disable dropout
```
## Example usage (CLI tools)
Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:
```bash
mkdir -p data-bin
curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
curl https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2 | tar xvjf - -C data-bin
fairseq-generate data-bin/wmt14.en-fr.newstest2014 \
--path data-bin/wmt14.en-fr.fconv-py/model.pt \
--beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
# ...
# | Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
# | Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
# Compute BLEU score
grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
fairseq-score --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
# BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
```
## Training a new model
### IWSLT'14 German to English (Transformer)
The following instructions can be used to train a Transformer model on the [IWSLT'14 German to English dataset](http://workshop2014.iwslt.org/downloads/proceeding.pdf).
First download and preprocess the data:
```bash
# Download and prepare the data
cd examples/translation/
bash prepare-iwslt14.sh
cd ../..
# Preprocess/binarize the data
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en \
--workers 20
```
Next we'll train a Transformer translation model over this data:
```bash
CUDA_VISIBLE_DEVICES=0 fairseq-train \
data-bin/iwslt14.tokenized.de-en \
--arch transformer_iwslt_de_en --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric
```
Finally we can evaluate our trained model:
```bash
fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/checkpoint_best.pt \
--batch-size 128 --beam 5 --remove-bpe
```
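As in the WMT'14 En-Fr example above, BLEU can also be computed by separating hypotheses and references from the generation log. A sketch, assuming the output of the command above was saved (e.g. by appending `| tee /tmp/iwslt.gen.out`):
```bash
grep ^H /tmp/iwslt.gen.out | cut -f3- > /tmp/iwslt.gen.out.sys
grep ^T /tmp/iwslt.gen.out | cut -f2- > /tmp/iwslt.gen.out.ref
fairseq-score --sys /tmp/iwslt.gen.out.sys --ref /tmp/iwslt.gen.out.ref
```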
### WMT'14 English to German (Convolutional)
The following instructions can be used to train a Convolutional translation model on the WMT English to German dataset.
See the [Scaling NMT README](../scaling_nmt/README.md) for instructions to train a Transformer translation model on this data.
The WMT English to German dataset can be preprocessed using the `prepare-wmt14en2de.sh` script.
By default it will produce a dataset that was modeled after [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), but with additional news-commentary-v12 data from WMT'17.
To use only data available in WMT'14 or to replicate results obtained in the original [Convolutional Sequence to Sequence Learning (Gehring et al., 2017)](https://arxiv.org/abs/1705.03122) paper, please use the `--icml17` option.
```bash
# Download and prepare the data
cd examples/translation/
# WMT'17 data:
bash prepare-wmt14en2de.sh
# or to use WMT'14 data:
# bash prepare-wmt14en2de.sh --icml17
cd ../..
# Binarize the dataset
TEXT=examples/translation/wmt17_en_de
fairseq-preprocess \
--source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt17_en_de --thresholdtgt 0 --thresholdsrc 0 \
--workers 20
# Train the model
mkdir -p checkpoints/fconv_wmt_en_de
fairseq-train \
data-bin/wmt17_en_de \
--arch fconv_wmt_en_de \
--dropout 0.2 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer nag --clip-norm 0.1 \
--lr 0.5 --lr-scheduler fixed --force-anneal 50 \
--max-tokens 4000 \
--save-dir checkpoints/fconv_wmt_en_de
# Evaluate
fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/fconv_wmt_en_de/checkpoint_best.pt \
--beam 5 --remove-bpe
```
### WMT'14 English to French
```bash
# Download and prepare the data
cd examples/translation/
bash prepare-wmt14en2fr.sh
cd ../..
# Binarize the dataset
TEXT=examples/translation/wmt14_en_fr
fairseq-preprocess \
--source-lang en --target-lang fr \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0 \
--workers 60
# Train the model
mkdir -p checkpoints/fconv_wmt_en_fr
fairseq-train \
data-bin/wmt14_en_fr \
--arch fconv_wmt_en_fr \
--dropout 0.1 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer nag --clip-norm 0.1 \
--lr 0.5 --lr-scheduler fixed --force-anneal 50 \
--max-tokens 3000 \
--save-dir checkpoints/fconv_wmt_en_fr
# Evaluate
fairseq-generate \
data-bin/fconv_wmt_en_fr \
--path checkpoints/fconv_wmt_en_fr/checkpoint_best.pt \
--beam 5 --remove-bpe
```
## Multilingual Translation
We also support training multilingual translation models. In this example we'll
train a multilingual `{de,fr}-en` translation model using the IWSLT'17 datasets.
Note that we use slightly different preprocessing here than for the IWSLT'14
En-De data above. In particular we learn a joint BPE code for all three
languages and use fairseq-interactive and sacrebleu for scoring the test set.
```bash
# First install sacrebleu and sentencepiece
pip install sacrebleu sentencepiece
# Then download and preprocess the data
cd examples/translation/
bash prepare-iwslt17-multilingual.sh
cd ../..
# Binarize the de-en dataset
TEXT=examples/translation/iwslt17.de_fr.en.bpe16k
fairseq-preprocess --source-lang de --target-lang en \
--trainpref $TEXT/train.bpe.de-en \
--validpref $TEXT/valid0.bpe.de-en,$TEXT/valid1.bpe.de-en,$TEXT/valid2.bpe.de-en,$TEXT/valid3.bpe.de-en,$TEXT/valid4.bpe.de-en,$TEXT/valid5.bpe.de-en \
--destdir data-bin/iwslt17.de_fr.en.bpe16k \
--workers 10
# Binarize the fr-en dataset
# NOTE: it's important to reuse the en dictionary from the previous step
fairseq-preprocess --source-lang fr --target-lang en \
--trainpref $TEXT/train.bpe.fr-en \
--validpref $TEXT/valid0.bpe.fr-en,$TEXT/valid1.bpe.fr-en,$TEXT/valid2.bpe.fr-en,$TEXT/valid3.bpe.fr-en,$TEXT/valid4.bpe.fr-en,$TEXT/valid5.bpe.fr-en \
--tgtdict data-bin/iwslt17.de_fr.en.bpe16k/dict.en.txt \
--destdir data-bin/iwslt17.de_fr.en.bpe16k \
--workers 10
# Train a multilingual transformer model
# NOTE: the command below assumes 1 GPU, but accumulates gradients from
# 8 fwd/bwd passes to simulate training on 8 GPUs
mkdir -p checkpoints/multilingual_transformer
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
--max-epoch 50 \
--ddp-backend=legacy_ddp \
--task multilingual_translation --lang-pairs de-en,fr-en \
--arch multilingual_transformer_iwslt_de_en \
--share-decoders --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--warmup-updates 4000 --warmup-init-lr '1e-07' \
--label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
--dropout 0.3 --weight-decay 0.0001 \
--save-dir checkpoints/multilingual_transformer \
--max-tokens 4000 \
--update-freq 8
# Generate and score the test set with sacrebleu
SRC=de
sacrebleu --test-set iwslt17 --language-pair ${SRC}-en --echo src \
| python scripts/spm_encode.py --model examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model \
> iwslt17.test.${SRC}-en.${SRC}.bpe
cat iwslt17.test.${SRC}-en.${SRC}.bpe \
| fairseq-interactive data-bin/iwslt17.de_fr.en.bpe16k/ \
--task multilingual_translation --lang-pairs de-en,fr-en \
--source-lang ${SRC} --target-lang en \
--path checkpoints/multilingual_transformer/checkpoint_best.pt \
--buffer-size 2000 --batch-size 128 \
--beam 5 --remove-bpe=sentencepiece \
> iwslt17.test.${SRC}-en.en.sys
grep ^H iwslt17.test.${SRC}-en.en.sys | cut -f3 \
| sacrebleu --test-set iwslt17 --language-pair ${SRC}-en
```
##### Argument format during inference
During inference you must specify a single `--source-lang` and `--target-lang`,
which indicate the inference language direction. `--lang-pairs`,
`--encoder-langtok`, and `--decoder-langtok` must be set to the same values
used during training.
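For example, translating German with the multilingual model trained above might look like this (a sketch; paths follow the directory layout used earlier in this section):
```bash
fairseq-interactive data-bin/iwslt17.de_fr.en.bpe16k/ \
    --task multilingual_translation --lang-pairs de-en,fr-en \
    --source-lang de --target-lang en \
    --path checkpoints/multilingual_transformer/checkpoint_best.pt \
    --beam 5 --remove-bpe=sentencepiece
```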
#!/usr/bin/env bash
#
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=10000
URL="http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz"
GZ=de-en.tgz
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit
fi
src=de
tgt=en
lang=de-en
prep=iwslt14.tokenized.de-en
tmp=$prep/tmp
orig=orig
mkdir -p $orig $tmp $prep
echo "Downloading data from ${URL}..."
cd $orig
wget "$URL"
if [ -f $GZ ]; then
echo "Data successfully downloaded."
else
echo "Data not successfully downloaded."
exit
fi
tar zxvf $GZ
cd ..
echo "pre-processing train data..."
for l in $src $tgt; do
f=train.tags.$lang.$l
tok=train.tags.$lang.tok.$l
cat $orig/$lang/$f | \
grep -v '<url>' | \
grep -v '<talkid>' | \
grep -v '<keywords>' | \
sed -e 's/<title>//g' | \
sed -e 's/<\/title>//g' | \
sed -e 's/<description>//g' | \
sed -e 's/<\/description>//g' | \
perl $TOKENIZER -threads 8 -l $l > $tmp/$tok
echo ""
done
perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175
for l in $src $tgt; do
perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l
done
echo "pre-processing valid/test data..."
for l in $src $tgt; do
for o in `ls $orig/$lang/IWSLT14.TED*.$l.xml`; do
fname=${o##*/}
f=$tmp/${fname%.*}
echo $o $f
grep '<seg id' $o | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $TOKENIZER -threads 8 -l $l | \
perl $LC > $f
echo ""
done
done
echo "creating train, valid, test..."
for l in $src $tgt; do
awk '{if (NR%23 == 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/valid.$l
awk '{if (NR%23 != 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l
cat $tmp/IWSLT14.TED.dev2010.de-en.$l \
$tmp/IWSLT14.TEDX.dev2012.de-en.$l \
$tmp/IWSLT14.TED.tst2010.de-en.$l \
$tmp/IWSLT14.TED.tst2011.de-en.$l \
$tmp/IWSLT14.TED.tst2012.de-en.$l \
> $tmp/test.$l
done
TRAIN=$tmp/train.en-de
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done
echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f
done
done
#!/bin/bash
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
SRCS=(
"de"
"fr"
)
TGT=en
ROOT=$(dirname "$0")
SCRIPTS=$ROOT/../../scripts
SPM_TRAIN=$SCRIPTS/spm_train.py
SPM_ENCODE=$SCRIPTS/spm_encode.py
BPESIZE=16384
ORIG=$ROOT/iwslt17_orig
DATA=$ROOT/iwslt17.de_fr.en.bpe16k
mkdir -p "$ORIG" "$DATA"
TRAIN_MINLEN=1 # remove sentences with <1 BPE token
TRAIN_MAXLEN=250 # remove sentences with >250 BPE tokens
URLS=(
"https://wit3.fbk.eu/archive/2017-01-trnted/texts/de/en/de-en.tgz"
"https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz"
)
ARCHIVES=(
"de-en.tgz"
"fr-en.tgz"
)
VALID_SETS=(
"IWSLT17.TED.dev2010.de-en IWSLT17.TED.tst2010.de-en IWSLT17.TED.tst2011.de-en IWSLT17.TED.tst2012.de-en IWSLT17.TED.tst2013.de-en IWSLT17.TED.tst2014.de-en IWSLT17.TED.tst2015.de-en"
"IWSLT17.TED.dev2010.fr-en IWSLT17.TED.tst2010.fr-en IWSLT17.TED.tst2011.fr-en IWSLT17.TED.tst2012.fr-en IWSLT17.TED.tst2013.fr-en IWSLT17.TED.tst2014.fr-en IWSLT17.TED.tst2015.fr-en"
)
# download and extract data
for ((i=0;i<${#URLS[@]};++i)); do
ARCHIVE=$ORIG/${ARCHIVES[i]}
if [ -f "$ARCHIVE" ]; then
echo "$ARCHIVE already exists, skipping download"
else
URL=${URLS[i]}
wget -P "$ORIG" "$URL"
if [ -f "$ARCHIVE" ]; then
echo "$URL successfully downloaded."
else
echo "$URL not successfully downloaded."
exit 1
fi
fi
FILE=${ARCHIVE: -4}
if [ -e "$FILE" ]; then
echo "$FILE already exists, skipping extraction"
else
tar -C "$ORIG" -xzvf "$ARCHIVE"
fi
done
echo "pre-processing train data..."
for SRC in "${SRCS[@]}"; do
for LANG in "${SRC}" "${TGT}"; do
cat "$ORIG/${SRC}-${TGT}/train.tags.${SRC}-${TGT}.${LANG}" \
| grep -v '<url>' \
| grep -v '<talkid>' \
| grep -v '<keywords>' \
| grep -v '<speaker>' \
| grep -v '<reviewer' \
| grep -v '<translator' \
| grep -v '<doc' \
| grep -v '</doc>' \
| sed -e 's/<title>//g' \
| sed -e 's/<\/title>//g' \
| sed -e 's/<description>//g' \
| sed -e 's/<\/description>//g' \
| sed 's/^\s*//g' \
| sed 's/\s*$//g' \
> "$DATA/train.${SRC}-${TGT}.${LANG}"
done
done
echo "pre-processing valid data..."
for ((i=0;i<${#SRCS[@]};++i)); do
SRC=${SRCS[i]}
VALID_SET=(${VALID_SETS[i]})
for ((j=0;j<${#VALID_SET[@]};++j)); do
FILE=${VALID_SET[j]}
for LANG in "$SRC" "$TGT"; do
grep '<seg id' "$ORIG/${SRC}-${TGT}/${FILE}.${LANG}.xml" \
| sed -e 's/<seg id="[0-9]*">\s*//g' \
| sed -e 's/\s*<\/seg>\s*//g' \
| sed -e "s/\’/\'/g" \
> "$DATA/valid${j}.${SRC}-${TGT}.${LANG}"
done
done
done
# learn BPE with sentencepiece
TRAIN_FILES=$(for SRC in "${SRCS[@]}"; do echo $DATA/train.${SRC}-${TGT}.${SRC}; echo $DATA/train.${SRC}-${TGT}.${TGT}; done | tr "\n" ",")
echo "learning joint BPE over ${TRAIN_FILES}..."
python "$SPM_TRAIN" \
--input=$TRAIN_FILES \
--model_prefix=$DATA/sentencepiece.bpe \
--vocab_size=$BPESIZE \
--character_coverage=1.0 \
--model_type=bpe
# encode train/valid
echo "encoding train with learned BPE..."
for SRC in "${SRCS[@]}"; do
python "$SPM_ENCODE" \
--model "$DATA/sentencepiece.bpe.model" \
--output_format=piece \
--inputs $DATA/train.${SRC}-${TGT}.${SRC} $DATA/train.${SRC}-${TGT}.${TGT} \
--outputs $DATA/train.bpe.${SRC}-${TGT}.${SRC} $DATA/train.bpe.${SRC}-${TGT}.${TGT} \
--min-len $TRAIN_MINLEN --max-len $TRAIN_MAXLEN
done
echo "encoding valid with learned BPE..."
for ((i=0;i<${#SRCS[@]};++i)); do
SRC=${SRCS[i]}
VALID_SET=(${VALID_SETS[i]})
for ((j=0;j<${#VALID_SET[@]};++j)); do
python "$SPM_ENCODE" \
--model "$DATA/sentencepiece.bpe.model" \
--output_format=piece \
--inputs $DATA/valid${j}.${SRC}-${TGT}.${SRC} $DATA/valid${j}.${SRC}-${TGT}.${TGT} \
--outputs $DATA/valid${j}.bpe.${SRC}-${TGT}.${SRC} $DATA/valid${j}.bpe.${SRC}-${TGT}.${TGT}
done
done
#!/bin/bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=40000
URLS=(
"http://statmt.org/wmt13/training-parallel-europarl-v7.tgz"
"http://statmt.org/wmt13/training-parallel-commoncrawl.tgz"
"http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz"
"http://data.statmt.org/wmt17/translation-task/dev.tgz"
"http://statmt.org/wmt14/test-full.tgz"
)
FILES=(
"training-parallel-europarl-v7.tgz"
"training-parallel-commoncrawl.tgz"
"training-parallel-nc-v12.tgz"
"dev.tgz"
"test-full.tgz"
)
CORPORA=(
"training/europarl-v7.de-en"
"commoncrawl.de-en"
"training/news-commentary-v12.de-en"
)
# This will make the dataset compatible to the one used in "Convolutional Sequence to Sequence Learning"
# https://arxiv.org/abs/1705.03122
if [ "$1" == "--icml17" ]; then
URLS[2]="http://statmt.org/wmt14/training-parallel-nc-v9.tgz"
FILES[2]="training-parallel-nc-v9.tgz"
CORPORA[2]="training/news-commentary-v9.de-en"
OUTDIR=wmt14_en_de
else
OUTDIR=wmt17_en_de
fi
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit
fi
src=en
tgt=de
lang=en-de
prep=$OUTDIR
tmp=$prep/tmp
orig=orig
dev=dev/newstest2013
mkdir -p $orig $tmp $prep
cd $orig
for ((i=0;i<${#URLS[@]};++i)); do
file=${FILES[i]}
if [ -f $file ]; then
echo "$file already exists, skipping download"
else
url=${URLS[i]}
wget "$url"
if [ -f $file ]; then
echo "$url successfully downloaded."
else
echo "$url not successfully downloaded."
exit -1
fi
if [ ${file: -4} == ".tgz" ]; then
tar zxvf $file
elif [ ${file: -4} == ".tar" ]; then
tar xvf $file
fi
fi
done
cd ..
echo "pre-processing train data..."
for l in $src $tgt; do
rm $tmp/train.tags.$lang.tok.$l
for f in "${CORPORA[@]}"; do
cat $orig/$f.$l | \
perl $NORM_PUNC $l | \
perl $REM_NON_PRINT_CHAR | \
perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
done
done
echo "pre-processing test data..."
for l in $src $tgt; do
if [ "$l" == "$src" ]; then
t="src"
else
t="ref"
fi
grep '<seg id' $orig/test-full/newstest2014-deen-$t.$l.sgm | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l
echo ""
done
echo "splitting train and valid..."
for l in $src $tgt; do
awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
done
TRAIN=$tmp/train.de-en
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done
echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
done
done
perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250
perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt $prep/valid 1 250
for L in $src $tgt; do
cp $tmp/bpe.test.$L $prep/test.$L
done
#!/bin/bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=40000
URLS=(
"http://statmt.org/wmt13/training-parallel-europarl-v7.tgz"
"http://statmt.org/wmt13/training-parallel-commoncrawl.tgz"
"http://statmt.org/wmt13/training-parallel-un.tgz"
"http://statmt.org/wmt14/training-parallel-nc-v9.tgz"
"http://statmt.org/wmt10/training-giga-fren.tar"
"http://statmt.org/wmt14/test-full.tgz"
)
FILES=(
"training-parallel-europarl-v7.tgz"
"training-parallel-commoncrawl.tgz"
"training-parallel-un.tgz"
"training-parallel-nc-v9.tgz"
"training-giga-fren.tar"
"test-full.tgz"
)
CORPORA=(
"training/europarl-v7.fr-en"
"commoncrawl.fr-en"
"un/undoc.2000.fr-en"
"training/news-commentary-v9.fr-en"
"giga-fren.release2.fixed"
)
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit
fi
src=en
tgt=fr
lang=en-fr
prep=wmt14_en_fr
tmp=$prep/tmp
orig=orig
mkdir -p $orig $tmp $prep
cd $orig
for ((i=0;i<${#URLS[@]};++i)); do
file=${FILES[i]}
if [ -f $file ]; then
echo "$file already exists, skipping download"
else
url=${URLS[i]}
wget "$url"
if [ -f $file ]; then
echo "$url successfully downloaded."
else
echo "$url not successfully downloaded."
exit -1
fi
if [ ${file: -4} == ".tgz" ]; then
tar zxvf $file
elif [ ${file: -4} == ".tar" ]; then
tar xvf $file
fi
fi
done
gunzip giga-fren.release2.fixed.*.gz
cd ..
echo "pre-processing train data..."
for l in $src $tgt; do
rm $tmp/train.tags.$lang.tok.$l
for f in "${CORPORA[@]}"; do
cat $orig/$f.$l | \
perl $NORM_PUNC $l | \
perl $REM_NON_PRINT_CHAR | \
perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
done
done
echo "pre-processing test data..."
for l in $src $tgt; do
if [ "$l" == "$src" ]; then
t="src"
else
t="ref"
fi
grep '<seg id' $orig/test-full/newstest2014-fren-$t.$l.sgm | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l
echo ""
done
echo "splitting train and valid..."
for l in $src $tgt; do
awk '{if (NR%1333 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
awk '{if (NR%1333 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
done
TRAIN=$tmp/train.fr-en
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done
echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
done
done
perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250
perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt $prep/valid 1 250
for L in $src $tgt; do
cp $tmp/bpe.test.$L $prep/test.$L
done
# Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)
This page includes instructions for reproducing results from the paper [Mixture Models for Diverse Machine Translation: Tricks of the Trade (Shen et al., 2019)](https://arxiv.org/abs/1902.07816).
## Download data
First, follow the [instructions to download and preprocess the WMT'17 En-De dataset](../translation#prepare-wmt14en2desh).
Make sure to learn a joint vocabulary by passing the `--joined-dictionary` option to `fairseq-preprocess`.
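Concretely, the binarization step from the translation README would then look something like this (a sketch; paths assume the output of `prepare-wmt14en2de.sh` as used above):
```bash
TEXT=examples/translation/wmt17_en_de
fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt17_en_de --joined-dictionary \
    --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20
```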
## Train a model
Then we can train a mixture of experts model using the `translation_moe` task.
Use the `--method` flag to choose the MoE variant; we support hard mixtures with a learned or uniform prior (`--method hMoElp` and `hMoEup`, respectively) and soft mixtures (`--method sMoElp` and `sMoEup`).
The model is trained with online responsibility assignment and shared parameterization.
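As a rough illustration of what "online responsibility assignment" means here (this is not the repository's implementation, which lives in `translation_moe_src/translation_moe.py` below), the per-batch loss can be sketched as an EM-style step: expert responsibilities are computed from per-expert likelihoods without gradient, and the loss is either the winning expert's negative log-likelihood (hard mixtures) or a marginalized log-sum-exp (soft mixtures):
```python
import torch

def moe_loss(lprob_y, hard=True):
    """Sketch of the MoE training loss.

    lprob_y: B x K log-likelihoods of each target under each of K experts.
    """
    # E-step: responsibilities p(z | x, y), computed without gradient
    with torch.no_grad():
        posterior = torch.softmax(lprob_y, dim=1)
    if hard:
        # hard assignment: only the most responsible expert receives gradient
        winners = posterior.argmax(dim=1, keepdim=True)
        loss = -lprob_y.gather(1, winners).squeeze(1)
    else:
        # soft assignment: marginalize over experts (the repo routes the backward
        # pass through the fixed posterior via its LogSumExpMoE autograd function)
        loss = -torch.logsumexp(lprob_y, dim=1)
    return loss.sum()
```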
The following command will train a `hMoElp` model with `3` experts:
```bash
fairseq-train --ddp-backend='legacy_ddp' \
data-bin/wmt17_en_de \
--max-update 100000 \
--task translation_moe --user-dir examples/translation_moe/translation_moe_src \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--arch transformer_wmt_en_de --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0007 \
--dropout 0.1 --weight-decay 0.0 --criterion cross_entropy \
--max-tokens 3584
```
## Translate
Once a model is trained, we can generate translations from different experts using the `--gen-expert` option.
For example, to generate from expert 0:
```bash
fairseq-generate data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt \
--beam 1 --remove-bpe \
--task translation_moe --user-dir examples/translation_moe/translation_moe_src \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert 0
```
## Evaluate
First download a tokenized version of the WMT'14 En-De test set with multiple references:
```bash
wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
```
Next apply BPE on the fly and run generation for each expert:
```bash
BPE_CODE=examples/translation/wmt17_en_de/code
for EXPERT in $(seq 0 2); do \
cat wmt14-en-de.extra_refs.tok \
| grep ^S | cut -f 2 \
| fairseq-interactive data-bin/wmt17_en_de \
--path checkpoints/checkpoint_best.pt \
--beam 1 \
--bpe subword_nmt --bpe-codes $BPE_CODE \
--buffer-size 500 --max-tokens 6000 \
--task translation_moe --user-dir examples/translation_moe/translation_moe_src \
--method hMoElp --mean-pool-gating-network \
--num-experts 3 \
--gen-expert $EXPERT ; \
done > wmt14-en-de.extra_refs.tok.gen.3experts
```
Finally use `score.py` to compute pairwise BLEU and average oracle BLEU:
```bash
python examples/translation_moe/score.py --sys wmt14-en-de.extra_refs.tok.gen.3experts --ref wmt14-en-de.extra_refs.tok
# pairwise BLEU: 48.26
# #refs covered: 2.11
# multi-reference BLEU (leave-one-out): 59.46
```
This matches row 3 from Table 7 in the paper.
## Citation
```bibtex
@article{shen2019mixture,
  title = {Mixture Models for Diverse Machine Translation: Tricks of the Trade},
  author = {Tianxiao Shen and Myle Ott and Michael Auli and Marc'Aurelio Ranzato},
  journal = {International Conference on Machine Learning},
  year = 2019,
}
```
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Scoring script for computing pairwise BLEU and multi-ref BLEU over a set of
candidate hypotheses.
See `"Mixture Models for Diverse Machine Translation: Tricks of the Trade"
(Shen et al., 2019) <https://arxiv.org/abs/1902.07816>`_.
"""
import argparse
import random
import sys
from itertools import chain
import numpy as np
from sacrebleu import compute_bleu, corpus_bleu as _corpus_bleu
def main():
    parser = argparse.ArgumentParser(sys.argv[0])
    parser.add_argument(
        "--sys", nargs="*", default="", metavar="FILE", help="path to system output"
    )
    parser.add_argument("--ref", default="", metavar="FILE", help="path to references")
    parser.add_argument(
        "--output",
        default="",
        metavar="FILE",
        help="print outputs into a pretty format",
    )
    args = parser.parse_args()

    if args.sys:
        src, tgt, hypos, log_probs = load_sys(args.sys)
        print("pairwise BLEU: %.2f" % pairwise(hypos))
        if args.output:
            merge(src, tgt, hypos, log_probs, args.output)

    if args.ref:
        _, _, refs = load_ref(args.ref)
        if args.sys:
            multi_ref(refs, hypos)
        else:
            intra_ref(refs)


def dictolist(d):
    a = sorted(d.items(), key=lambda i: i[0])
    return [i[1] for i in a]


def load_sys(paths):
    src, tgt, hypos, log_probs = {}, {}, {}, {}
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.rstrip()
                # S: source
                # T: target
                # D: detokenized system output
                if line.startswith(("S-", "T-", "D-")):
                    i = int(line[line.find("-") + 1 : line.find("\t")])
                    if line.startswith("S-"):
                        src[i] = line.split("\t")[1]
                    if line.startswith("T-"):
                        tgt[i] = line.split("\t")[1]
                    if line.startswith("D-"):
                        if i not in hypos:
                            hypos[i] = []
                            log_probs[i] = []
                        hypos[i].append(line.split("\t")[2])
                        log_probs[i].append(float(line.split("\t")[1]))
    return dictolist(src), dictolist(tgt), dictolist(hypos), dictolist(log_probs)


def load_ref(path):
    with open(path) as f:
        lines = f.readlines()
    src, tgt, refs = [], [], []
    i = 0
    while i < len(lines):
        if lines[i].startswith("S-"):
            src.append(lines[i].split("\t")[1].rstrip())
            i += 1
        elif lines[i].startswith("T-"):
            tgt.append(lines[i].split("\t")[1].rstrip())
            i += 1
        else:
            a = []
            while i < len(lines) and lines[i].startswith("R"):
                a.append(lines[i].split("\t")[1].rstrip())
                i += 1
            refs.append(a)
    return src, tgt, refs


def merge(src, tgt, hypos, log_probs, path):
    with open(path, "w") as f:
        for s, t, hs, lps in zip(src, tgt, hypos, log_probs):
            f.write(s + "\n")
            f.write(t + "\n")
            f.write("\n")
            for h, lp in zip(hs, lps):
                f.write("\t%f\t%s\n" % (lp, h.strip()))
            f.write("------------------------------------------------------\n")


def corpus_bleu(sys_stream, ref_streams):
    bleu = _corpus_bleu(sys_stream, ref_streams, tokenize="none")
    return bleu.score


def sentence_bleu(hypothesis, reference):
    bleu = _corpus_bleu(hypothesis, reference)
    for i in range(1, 4):
        bleu.counts[i] += 1
        bleu.totals[i] += 1
    bleu = compute_bleu(
        bleu.counts,
        bleu.totals,
        bleu.sys_len,
        bleu.ref_len,
        smooth_method="exp",
    )
    return bleu.score


def pairwise(sents):
    _ref, _hypo = [], []
    for s in sents:
        for i in range(len(s)):
            for j in range(len(s)):
                if i != j:
                    _ref.append(s[i])
                    _hypo.append(s[j])
    return corpus_bleu(_hypo, [_ref])


def multi_ref(refs, hypos):
    _ref, _hypo = [], []
    ref_cnt = 0
    assert len(refs) == len(hypos)

    # count number of refs covered
    for rs, hs in zip(refs, hypos):
        a = set()
        for h in hs:
            s = [sentence_bleu(h, r) for r in rs]
            j = np.argmax(s)
            _ref.append(rs[j])
            _hypo.append(h)
            best = [k for k in range(len(rs)) if s[k] == s[j]]
            a.add(random.choice(best))
        ref_cnt += len(a)
    print("#refs covered: %.2f" % (ref_cnt / len(refs)))

    # transpose refs and hypos
    refs = list(zip(*refs))
    hypos = list(zip(*hypos))

    # compute multi-ref corpus BLEU (leave-one-out to be comparable to intra_ref)
    k = len(hypos)
    m = len(refs)
    flat_hypos = [hypos[j][i] for i in range(len(hypos[0])) for j in range(k)]
    duplicated_refs = [[ref for ref in refs_i for _ in range(k)] for refs_i in refs]
    loo_bleus = []
    for held_out_ref in range(m):
        remaining_refs = (
            duplicated_refs[:held_out_ref] + duplicated_refs[held_out_ref + 1 :]
        )
        assert len(remaining_refs) == m - 1
        loo_bleus.append(corpus_bleu(flat_hypos, remaining_refs))
    print("average multi-reference BLEU (leave-one-out): %.2f" % np.mean(loo_bleus))


def intra_ref(refs):
    print("ref pairwise BLEU: %.2f" % pairwise(refs))
    refs = list(zip(*refs))
    m = len(refs)
    concat_h = []
    concat_rest = [[] for j in range(m - 1)]
    for i, h in enumerate(refs):
        rest = refs[:i] + refs[i + 1 :]
        concat_h.append(h)
        for j in range(m - 1):
            concat_rest[j].extend(rest[j])
    concat_h = list(chain.from_iterable(concat_h))
    bleu = corpus_bleu(concat_h, concat_rest)
    print("multi-reference BLEU (leave-one-out): %.2f" % bleu)


if __name__ == "__main__":
    main()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from . import translation_moe # noqa
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch
class LogSumExpMoE(torch.autograd.Function):
"""Standard LogSumExp forward pass, but use *posterior* for the backward.
See `"Mixture Models for Diverse Machine Translation: Tricks of the Trade"
(Shen et al., 2019) <https://arxiv.org/abs/1902.07816>`_.
"""
@staticmethod
def forward(ctx, logp, posterior, dim=-1):
ctx.save_for_backward(posterior)
ctx.dim = dim
return torch.logsumexp(logp, dim=dim)
@staticmethod
def backward(ctx, grad_output):
(posterior,) = ctx.saved_tensors
grad_logp = grad_output.unsqueeze(ctx.dim) * posterior
return grad_logp, None, None
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch
import torch.nn.functional as F
class MeanPoolGatingNetwork(torch.nn.Module):
"""A simple mean-pooling gating network for selecting experts.
This module applies mean pooling over an encoder's output and returns
reponsibilities for each expert. The encoder format is expected to match
:class:`fairseq.models.transformer.TransformerEncoder`.
"""
def __init__(self, embed_dim, num_experts, dropout=None):
super().__init__()
self.embed_dim = embed_dim
self.num_experts = num_experts
self.fc1 = torch.nn.Linear(embed_dim, embed_dim)
self.dropout = torch.nn.Dropout(dropout) if dropout is not None else None
self.fc2 = torch.nn.Linear(embed_dim, num_experts)
def forward(self, encoder_out):
if not (
"encoder_out" in encoder_out
and "encoder_padding_mask" in encoder_out
and encoder_out["encoder_out"][0].size(2) == self.embed_dim
):
raise ValueError("Unexpected format for encoder_out")
# mean pooling over time
encoder_padding_mask = encoder_out["encoder_padding_mask"][0] # B x T
encoder_out = encoder_out["encoder_out"][0].transpose(0, 1) # B x T x C
if encoder_padding_mask is not None:
encoder_out = encoder_out.clone() # required because of transpose above
encoder_out[encoder_padding_mask] = 0
ntokens = torch.sum(~encoder_padding_mask, dim=1, keepdim=True)
x = torch.sum(encoder_out, dim=1) / ntokens.type_as(encoder_out)
else:
x = torch.mean(encoder_out, dim=1)
x = torch.tanh(self.fc1(x))
if self.dropout is not None:
x = self.dropout(x)
x = self.fc2(x)
return F.log_softmax(x, dim=-1, dtype=torch.float32).type_as(x)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from dataclasses import dataclass, field
import torch
from omegaconf import II
from fairseq import metrics, utils
from fairseq.dataclass import ChoiceEnum
from fairseq.tasks import register_task
from fairseq.tasks.translation import TranslationConfig, TranslationTask
from .logsumexp_moe import LogSumExpMoE
from .mean_pool_gating_network import MeanPoolGatingNetwork
METHOD_CHOICES = ChoiceEnum(["sMoElp", "sMoEup", "hMoElp", "hMoEup"])
@dataclass
class TranslationMoEConfig(TranslationConfig):
    method: METHOD_CHOICES = field(
        default="hMoEup",
        metadata={"help": "MoE method"},
    )
    num_experts: int = field(
        default=3,
        metadata={"help": "number of experts"},
    )
    mean_pool_gating_network: bool = field(
        default=False,
        metadata={"help": "use a simple mean-pooling gating network"},
    )
    mean_pool_gating_network_dropout: float = field(
        default=0,
        metadata={"help": "dropout for mean-pooling gating network"},
    )
    mean_pool_gating_network_encoder_dim: int = field(
        default=0,
        metadata={"help": "encoder output dim for mean-pooling gating network"},
    )
    gen_expert: int = field(
        default=0,
        metadata={"help": "which expert to use for generation"},
    )
    sentence_avg: bool = II("optimization.sentence_avg")
@register_task("translation_moe", dataclass=TranslationMoEConfig)
class TranslationMoETask(TranslationTask):
"""
Translation task for Mixture of Experts (MoE) models.
See `"Mixture Models for Diverse Machine Translation: Tricks of the Trade"
(Shen et al., 2019) <https://arxiv.org/abs/1902.07816>`_.
Args:
src_dict (~fairseq.data.Dictionary): dictionary for the source language
tgt_dict (~fairseq.data.Dictionary): dictionary for the target language
.. note::
The translation task is compatible with :mod:`fairseq-train`,
:mod:`fairseq-generate` and :mod:`fairseq-interactive`.
The translation task provides the following additional command-line
arguments:
.. argparse::
:ref: fairseq.tasks.translation_parser
:prog:
"""
cfg: TranslationMoEConfig
def __init__(self, cfg: TranslationMoEConfig, src_dict, tgt_dict):
if cfg.method == "sMoElp":
# soft MoE with learned prior
self.uniform_prior = False
self.hard_selection = False
elif cfg.method == "sMoEup":
# soft MoE with uniform prior
self.uniform_prior = True
self.hard_selection = False
elif cfg.method == "hMoElp":
# hard MoE with learned prior
self.uniform_prior = False
self.hard_selection = True
elif cfg.method == "hMoEup":
# hard MoE with uniform prior
self.uniform_prior = True
self.hard_selection = True
# add indicator tokens for each expert
for i in range(cfg.num_experts):
# add to both dictionaries in case we're sharing embeddings
src_dict.add_symbol("<expert_{}>".format(i))
tgt_dict.add_symbol("<expert_{}>".format(i))
super().__init__(cfg, src_dict, tgt_dict)
def build_model(self, cfg):
from fairseq import models
model = models.build_model(cfg, self)
if not self.uniform_prior and not hasattr(model, "gating_network"):
if self.cfg.mean_pool_gating_network:
if self.cfg.mean_pool_gating_network_encoder_dim > 0:
encoder_dim = self.cfg.mean_pool_gating_network_encoder_dim
elif getattr(cfg, "encoder_embed_dim", None):
# assume that encoder_embed_dim is the encoder's output dimension
encoder_dim = cfg.encoder_embed_dim
else:
raise ValueError(
"Must specify --mean-pool-gating-network-encoder-dim"
)
if self.cfg.mean_pool_gating_network_dropout > 0:
dropout = self.cfg.mean_pool_gating_network_dropout
elif getattr(cfg, "dropout", None):
dropout = cfg.dropout
else:
raise ValueError("Must specify task.mean_pool_gating_network_dropout")
model.gating_network = MeanPoolGatingNetwork(
encoder_dim,
self.cfg.num_experts,
dropout,
)
else:
raise ValueError(
"translation_moe task with learned prior requires the model to "
"have a gating network; try using --mean-pool-gating-network"
)
return model
def expert_index(self, i):
return i + self.tgt_dict.index("<expert_0>")
def _get_loss(self, sample, model, criterion):
assert hasattr(
criterion, "compute_loss"
), "translation_moe task requires the criterion to implement the compute_loss() method"
k = self.cfg.num_experts
bsz = sample["target"].size(0)
def get_lprob_y(encoder_out, prev_output_tokens_k):
net_output = model.decoder(
prev_output_tokens=prev_output_tokens_k,
encoder_out=encoder_out,
)
loss, _ = criterion.compute_loss(model, net_output, sample, reduce=False)
loss = loss.view(bsz, -1)
return -loss.sum(dim=1, keepdim=True) # -> B x 1
def get_lprob_yz(winners=None):
encoder_out = model.encoder(
src_tokens=sample["net_input"]["src_tokens"],
src_lengths=sample["net_input"]["src_lengths"],
)
if winners is None:
lprob_y = []
for i in range(k):
prev_output_tokens_k = sample["net_input"][
"prev_output_tokens"
].clone()
assert not prev_output_tokens_k.requires_grad
prev_output_tokens_k[:, 0] = self.expert_index(i)
lprob_y.append(get_lprob_y(encoder_out, prev_output_tokens_k))
lprob_y = torch.cat(lprob_y, dim=1) # -> B x K
else:
prev_output_tokens_k = sample["net_input"]["prev_output_tokens"].clone()
prev_output_tokens_k[:, 0] = self.expert_index(winners)
lprob_y = get_lprob_y(encoder_out, prev_output_tokens_k) # -> B
if self.uniform_prior:
lprob_yz = lprob_y
else:
lprob_z = model.gating_network(encoder_out) # B x K
if winners is not None:
lprob_z = lprob_z.gather(dim=1, index=winners.unsqueeze(-1))
lprob_yz = lprob_y + lprob_z.type_as(lprob_y) # B x K
return lprob_yz
# compute responsibilities without dropout
with utils.model_eval(model): # disable dropout
with torch.no_grad(): # disable autograd
lprob_yz = get_lprob_yz() # B x K
prob_z_xy = torch.nn.functional.softmax(lprob_yz, dim=1)
assert not prob_z_xy.requires_grad
# compute loss with dropout
if self.hard_selection:
winners = prob_z_xy.max(dim=1)[1]
loss = -get_lprob_yz(winners)
else:
lprob_yz = get_lprob_yz() # B x K
loss = -LogSumExpMoE.apply(lprob_yz, prob_z_xy, 1)
loss = loss.sum()
sample_size = (
sample["target"].size(0) if self.cfg.sentence_avg else sample["ntokens"]
)
logging_output = {
"loss": utils.item(loss.data),
"ntokens": sample["ntokens"],
"nsentences": bsz,
"sample_size": sample_size,
"posterior": prob_z_xy.float().sum(dim=0).cpu(),
}
return loss, sample_size, logging_output
def train_step(
self, sample, model, criterion, optimizer, update_num, ignore_grad=False
):
model.train()
loss, sample_size, logging_output = self._get_loss(sample, model, criterion)
if ignore_grad:
loss *= 0
optimizer.backward(loss)
return loss, sample_size, logging_output
def valid_step(self, sample, model, criterion):
model.eval()
with torch.no_grad():
loss, sample_size, logging_output = self._get_loss(sample, model, criterion)
return loss, sample_size, logging_output
def inference_step(
self,
generator,
models,
sample,
prefix_tokens=None,
expert=None,
constraints=None,
):
expert = expert or self.cfg.gen_expert
with torch.no_grad():
return generator.generate(
models,
sample,
prefix_tokens=prefix_tokens,
constraints=constraints,
bos_token=self.expert_index(expert),
)
def reduce_metrics(self, logging_outputs, criterion):
super().reduce_metrics(logging_outputs, criterion)
metrics.log_scalar(
"posterior",
sum(log["posterior"] for log in logging_outputs if "posterior" in log),
)
# Truncated Backpropagation Through Time (BPTT)
Truncated BPTT is a useful technique for training language models on very long
sequences. Typically a long sequence is split into chunks and a language model
is trained over the chunks sequentially. The LM may condition on previous
chunks, but gradients only flow through the current chunk. This technique was
the basis for the paper: [Transformer-XL: Attentive Language Models Beyond a
Fixed-Length Context](https://arxiv.org/abs/1901.02860), which achieved
state-of-the-art language modeling results at the time of publication.
It is slightly tricky to implement Truncated BPTT efficiently in fairseq, since
we need to iterate over the data sequentially and disable any batch shuffling
logic. The code provided in this example illustrates how to implement Truncated
BPTT in fairseq by overriding ``FairseqTask::get_batch_iterator`` to iterate
over the data sequentially. Crucially, this example supports batching and
multi-GPU (data parallel) training.
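The core idea can be sketched independently of fairseq: split each token stream into fixed-length chunks, iterate over the chunks in order, and detach any recurrent memory between chunks so that gradients only flow through the current one. A minimal, framework-agnostic sketch (not the task implementation included below, which additionally handles batching and sharding across GPUs; the `model(...)` call in the comments is a purely hypothetical signature):
```python
import torch

def bptt_chunks(token_ids: torch.Tensor, tokens_per_sample: int):
    """Yield consecutive (input, target) chunks from a 1-D tensor of token ids."""
    n_chunks = (token_ids.numel() - 1) // tokens_per_sample
    for i in range(n_chunks):
        start = i * tokens_per_sample
        inputs = token_ids[start : start + tokens_per_sample]
        targets = token_ids[start + 1 : start + tokens_per_sample + 1]
        yield inputs, targets

# Training-loop sketch: memory is carried across chunks but detached, so
# backpropagation is truncated at chunk boundaries.
#
# mems = None
# for inputs, targets in bptt_chunks(stream, tokens_per_sample=150):
#     logits, mems = model(inputs.unsqueeze(0), mems=mems)  # hypothetical signature
#     loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     mems = [m.detach() for m in mems]
```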
##### 0. Setup
First, see the general [language modeling README](README.md) for instructions on
preprocessing the WikiText-103 data.
##### 1. Train a Transformer-XL model on WikiText-103
We will train a 16-layer Transformer-XL model following the [hyperparameters
used in the original
paper](https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_wt103_base.sh).
The following command assumes 4 GPUs, so that the total batch size is 60
sequences (15 x 4). Training should take ~24 hours on 4 V100 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
--user-dir examples/truncated_bptt \
data-bin/wikitext-103/ \
--task truncated_bptt_lm --tokens-per-sample 150 \
--batch-size 15 --max-update 200000 \
--arch transformer_xl --n-layer 16 --d-model 410 --n-head 10 \
--d-head 41 --d-inner 2100 --dropout 0.1 --dropatt 0.0 --mem-len 150 \
--optimizer adam --clip-norm 0.25 \
--lr-scheduler cosine --warmup-updates 0 --min-lr 0.0 --lr 0.00025 \
--log-format json --log-interval 25 \
--fp16
```
If training on a single GPU, set `--update-freq=4` to accumulate 4x gradients
and simulate training on 4 GPUs.
##### 2. Evaluate
```bash
fairseq-eval-lm data-bin/wikitext-103/ \
--path checkpoints/checkpoint_best.pt \
--user-dir examples/truncated_bptt/ \
--task truncated_bptt_lm \
--batch-size 1 --required-batch-size-multiple 1 \
--model-overrides '{"mem_len":640,"clamp_len":400,"same_length":True}' \
--tokens-per-sample 64
# ... | INFO | fairseq_cli.eval_lm | num. model params: 151123537
# ... | INFO | fairseq_cli.eval_lm | Evaluated 245569 tokens in 83.1s (2956.82 tokens/s)
# ... | INFO | fairseq_cli.eval_lm | Loss (base 2): 4.5668, Perplexity: 23.70
# Compare to 24.0 test perplexity from the paper
```
*Note:* During training the model saw 150 tokens of context
(``--tokens-per-sample=150``) and 150 extra memory tokens (``--mem-len=150``).
During evaluation we measure perplexity on sequences of 64 tokens
(``--tokens-per-sample=64``) and increase the memory length
(``--model-overrides='{"mem_len":640}'``). These settings match the evaluation
settings from [the original
paper](https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_wt103_base.sh).
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from . import transformer_xl_model, truncated_bptt_lm_task # noqa
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import torch
from fairseq.dataclass import FairseqDataclass
from fairseq.models import (
    FairseqIncrementalDecoder,
    FairseqLanguageModel,
    register_model,
)
from fairseq.modules.checkpoint_activations import checkpoint_wrapper
from omegaconf import II
logger = logging.getLogger(__name__)
@dataclass
class TransformerXLConfig(FairseqDataclass):
    # defaults come from the original Transformer-XL code
    cutoffs: List[int] = field(default_factory=lambda: [20000, 40000, 200000])
    d_model: int = 500
    n_head: int = 10
    d_head: int = 50
    d_inner: int = 1000
    div_val: int = 1
    n_layer: int = 12
    mem_len: int = 0
    clamp_len: int = -1
    same_length: bool = False
    dropout: float = 0.0
    dropatt: float = 0.0
    checkpoint_activations: bool = False
    offload_activations: bool = False
    max_target_positions: int = II("task.max_target_positions")
@register_model("transformer_xl", dataclass=TransformerXLConfig)
class TransformerXLLanguageModel(FairseqLanguageModel):
    @classmethod
    def build_model(cls, cfg: TransformerXLConfig, task):
        return cls(TransformerXLDecoder(cfg, task))
class TransformerXLDecoder(FairseqIncrementalDecoder):
    def __init__(self, cfg, task):
        try:
            from transformers.models.transfo_xl import (
                TransfoXLConfig,
                TransfoXLLMHeadModel,
            )
        except ImportError:
            from transformers.configuration_transfo_xl import TransfoXLConfig
            from transformers.modeling_transfo_xl import TransfoXLLMHeadModel

        super().__init__(task.target_dictionary)
        self.cfg = cfg

        # remove any cutoffs larger than the vocab size
        cutoffs = [
            cutoff for cutoff in cfg.cutoffs if cutoff < len(task.target_dictionary)
        ]

        config = TransfoXLConfig(
            vocab_size=len(task.target_dictionary),
            cutoffs=cutoffs,
            d_model=cfg.d_model,
            d_embed=cfg.d_model,
            n_head=cfg.n_head,
            d_head=cfg.d_head,
            d_inner=cfg.d_inner,
            div_val=cfg.div_val,
            n_layer=cfg.n_layer,
            mem_len=cfg.mem_len,
            clamp_len=cfg.clamp_len,
            same_length=cfg.same_length,
            dropout=cfg.dropout,
            dropatt=cfg.dropatt,
        )
        logger.info(config)
        self.model = TransfoXLLMHeadModel(config)

        # Workaround a bug in huggingface's ``ProjectedAdaptiveLogSoftmax``
        # which adds ``None`` values to an ``nn.ParameterList``, which is not
        # supported in PyTorch. Instead we can replace this with an
        # ``nn.ModuleList``, which does support ``None`` values.
        try:
            if all(p is None for p in self.model.crit.out_projs._parameters.values()):
                self.model.crit.out_projs = torch.nn.ModuleList(
                    [None] * len(self.model.crit.out_projs._parameters)
                )
        except Exception:
            pass

        if cfg.checkpoint_activations or cfg.offload_activations:
            for i in range(len(self.model.transformer.layers)):
                self.model.transformer.layers[i] = checkpoint_wrapper(
                    self.model.transformer.layers[i],
                    offload_to_cpu=cfg.offload_activations,
                )
                # TODO: may save mem to wrap(layer.pos_ff.CoreNet[3])

        self._mems = None

    def forward(
        self,
        src_tokens,
        src_lengths=None,  # unused
        incremental_state: Optional[Dict[str, List[torch.Tensor]]] = None,
        encoder_out=None,
    ):
        if incremental_state is not None:  # used during inference
            mems = self.get_incremental_state(incremental_state, "mems")
            src_tokens = src_tokens[:, -1:]  # only keep the most recent token
        else:
            mems = self._mems

        output = self.model(
            input_ids=src_tokens,
            mems=mems,
            return_dict=False,
        )

        if len(output) >= 2:
            if incremental_state is not None:
                self.set_incremental_state(incremental_state, "mems", output[1])
            else:
                self._mems = output[1]

        return (output[0],)

    def max_positions(self):
        return self.cfg.max_target_positions

    def reorder_incremental_state(
        self,
        incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]],
        new_order: torch.Tensor,
    ):
        """Reorder incremental state.

        This will be called when the order of the input has changed from the
        previous time step. A typical use case is beam search, where the input
        order changes between time steps based on the selection of beams.
        """
        mems = self.get_incremental_state(incremental_state, "mems")
        if mems is not None:
            new_mems = [mems_i.index_select(1, new_order) for mems_i in mems]
            self.set_incremental_state(incremental_state, "mems", new_mems)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
import os
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import torch
from fairseq import utils
from fairseq.data import (
Dictionary,
TokenBlockDataset,
data_utils,
iterators,
)
from fairseq.dataclass import FairseqDataclass
from fairseq.distributed import utils as dist_utils
from fairseq.tasks import FairseqTask, register_task
from omegaconf import II
logger = logging.getLogger(__name__)
@dataclass
class TruncatedBPTTLMConfig(FairseqDataclass):
data: str = field(default="???", metadata={"help": "path to data directory"})
tokens_per_sample: int = field(
default=1024,
metadata={"help": "max number of tokens per sequence"},
)
batch_size: int = II("dataset.batch_size")
# Some models use *max_target_positions* to know how many positional
# embeddings to learn. We use II(...) to make it default to
# *tokens_per_sample*, but in principle there could be more positional
# embeddings than tokens in a single batch. This may also be irrelevant for
# custom model implementations.
max_target_positions: int = II("task.tokens_per_sample")
# these will be populated automatically if not provided
data_parallel_rank: Optional[int] = None
data_parallel_size: Optional[int] = None
@register_task("truncated_bptt_lm", dataclass=TruncatedBPTTLMConfig)
class TruncatedBPTTLMTask(FairseqTask):
def __init__(self, cfg: TruncatedBPTTLMConfig):
super().__init__(cfg)
if cfg.data_parallel_rank is None or cfg.data_parallel_size is None:
if torch.distributed.is_initialized():
cfg.data_parallel_rank = dist_utils.get_data_parallel_rank()
cfg.data_parallel_size = dist_utils.get_data_parallel_world_size()
else:
cfg.data_parallel_rank = 0
cfg.data_parallel_size = 1
# load the dictionary
paths = utils.split_paths(cfg.data)
assert len(paths) > 0
self.dictionary = Dictionary.load(os.path.join(paths[0], "dict.txt"))
logger.info("dictionary: {} types".format(len(self.dictionary)))
def load_dataset(self, split, epoch=1, combine=False, **kwargs):
"""Load a given dataset split (e.g., train, valid, test)"""
# support sharded datasets
paths = utils.split_paths(self.cfg.data)
assert len(paths) > 0
data_path = paths[(epoch - 1) % len(paths)]
split_path = os.path.join(data_path, split)
# each element of *data* will be a tensorized line from the original
# text dataset, similar to ``open(split_path).readlines()``
data = data_utils.load_indexed_dataset(
split_path, self.dictionary, combine=combine
)
if data is None:
raise FileNotFoundError(
"Dataset not found: {} ({})".format(split, split_path)
)
# this is similar to ``data.view(-1).split(tokens_per_sample)``
data = TokenBlockDataset(
data,
data.sizes,
block_size=self.cfg.tokens_per_sample,
pad=None, # unused
eos=None, # unused
break_mode="none",
)
self.datasets[split] = TruncatedBPTTDataset(
data=data,
bsz_per_shard=self.cfg.batch_size,
shard_id=self.cfg.data_parallel_rank,
num_shards=self.cfg.data_parallel_size,
)
def dataset(self, split):
return self.datasets[split]
def get_batch_iterator(
self, dataset, num_workers=0, epoch=1, data_buffer_size=0, **kwargs
):
return iterators.EpochBatchIterator(
dataset=dataset,
collate_fn=self._collate_fn,
num_workers=num_workers,
epoch=epoch,
buffer_size=data_buffer_size,
# we don't use the batching functionality from EpochBatchIterator;
# instead every item in *dataset* is a whole batch
batch_sampler=[[i] for i in range(len(dataset))],
disable_shuffling=True,
)
def _collate_fn(self, items: List[List[torch.Tensor]]):
# we don't use fairseq's batching functionality, so *items* should contain
# a single element: a tuple of (id, list of per-stream token tensors)
assert len(items) == 1
# item will have shape B x T (the last batch may have length < T)
id, item = items[0]
item = data_utils.collate_tokens(item, pad_idx=self.source_dictionary.pad())
B, T = item.size()
# shift item one position over and append a padding token for the target
target = torch.nn.functional.pad(
item[:, 1:], (0, 1, 0, 0), value=self.target_dictionary.pad()
)
# fairseq expects batches to have the following structure
return {
"id": torch.tensor([id]*item.size(0)),
"net_input": {
"src_tokens": item,
},
"target": target,
"nsentences": item.size(0),
"ntokens": item.numel(),
}
def build_dataset_for_inference(
self, src_tokens: List[torch.Tensor], src_lengths: List[int], **kwargs
) -> torch.utils.data.Dataset:
eos = self.source_dictionary.eos()
dataset = TokenBlockDataset(
src_tokens,
src_lengths,
block_size=None, # ignored for "eos" break mode
pad=self.source_dictionary.pad(),
eos=eos,
break_mode="eos",
)
class Dataset(torch.utils.data.Dataset):
def __getitem__(self, i):
item = dataset[i]
if item[-1] == eos:
# remove eos to support generating with a prefix
item = item[:-1]
return (i, [item])
def __len__(self):
return len(dataset)
return Dataset()
def inference_step(
self, generator, models, sample, prefix_tokens=None, constraints=None
):
with torch.no_grad():
if constraints is not None:
raise NotImplementedError
# SequenceGenerator doesn't use *src_tokens* directly, we need to
# pass the *prefix_tokens* argument instead.
if prefix_tokens is None and sample["net_input"]["src_tokens"].nelement():
prefix_tokens = sample["net_input"]["src_tokens"]
# begin generation with the end-of-sentence token
bos_token = self.source_dictionary.eos()
return generator.generate(
models, sample, prefix_tokens=prefix_tokens, bos_token=bos_token
)
def eval_lm_dataloader(
self,
dataset,
max_tokens: Optional[int] = 36000,
batch_size: Optional[int] = None,
max_positions: Optional[int] = None,
num_shards: int = 1,
shard_id: int = 0,
num_workers: int = 1,
data_buffer_size: int = 10,
context_window: int = 0,
):
if context_window > 0:
raise NotImplementedError(
"Transformer-XL doesn't need --context-window, try "
"--model-overrides '{\"mem_len\":42}' instead "
)
return self.get_batch_iterator(
dataset=dataset,
max_tokens=max_tokens,
max_sentences=batch_size,
max_positions=max_positions,
ignore_invalid_inputs=True,
num_shards=num_shards,
shard_id=shard_id,
num_workers=num_workers,
data_buffer_size=data_buffer_size,
).next_epoch_itr(shuffle=False)
@property
def source_dictionary(self):
return self.dictionary
@property
def target_dictionary(self):
return self.dictionary
class TruncatedBPTTDataset(torch.utils.data.Dataset):
def __init__(
self,
data: List[torch.Tensor], # ordered list of items
bsz_per_shard, # number of items processed per GPU per forward pass
shard_id, # current GPU ID
num_shards, # number of GPUs
):
super().__init__()
self.data = data
def batchify(data, bsz):
# Work out how cleanly we can divide the dataset into bsz parts.
nbatch = data.size(0) // bsz
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * bsz)
# Evenly divide the data across the bsz batches.
data = data.view(bsz, -1).contiguous()
return data
# total number of sequences processed by all GPUs in each forward pass
global_batch_size = bsz_per_shard * num_shards
"""
With a 16 item dataset, bsz_per_shard=2 and num_shards=3,
*indices* might look like:
indices = [[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9],
[10, 11]]
The size of the TruncatedBPTTDataset instance will be 2,
and shard 1 will see items:
[(0, [data[4], data[6]]),
(1, [data[5], data[7]])]
"""
indices = batchify(torch.arange(len(data)), global_batch_size)
assert indices.size(0) == global_batch_size
self.my_indices = indices[
shard_id * bsz_per_shard : (shard_id + 1) * bsz_per_shard
]
assert self.my_indices.size(0) == bsz_per_shard
def __len__(self):
return self.my_indices.size(1)
def __getitem__(self, i) -> Tuple[int, List[torch.Tensor]]:
return (i, [self.data[idx] for idx in self.my_indices[:, i]])
# Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)
This page includes instructions for reproducing results from the paper [Unsupervised Quality Estimation for Neural Machine Translation (Fomicheva et al., 2020)](https://arxiv.org/abs/2005.10608).
## Requirements
* mosesdecoder: https://github.com/moses-smt/mosesdecoder
* subword-nmt: https://github.com/rsennrich/subword-nmt
* flores: https://github.com/facebookresearch/flores
## Download Models and Test Data
Download translation models and test data from [MLQE dataset repository](https://github.com/facebookresearch/mlqe).
## Setup
Given a test set consisting of source sentences and reference translations, define:
* `SRC_LANG`: source language
* `TGT_LANG`: target language
* `INPUT`: input prefix, such that the file `$INPUT.$SRC_LANG` contains source sentences and `$INPUT.$TGT_LANG`
contains the reference sentences
* `OUTPUT_DIR`: output path to store results
* `MOSES_DECODER`: path to mosesdecoder installation
* `BPE_ROOT`: path to subword-nmt installation
* `BPE`: path to BPE model
* `MODEL_DIR`: directory containing the NMT model `.pt` file as well as the source and target vocabularies.
* `TMP`: directory for intermediate temporary files
* `GPU`: if translating with GPU, id of the GPU to use for inference
* `SCRIPTS`: root directory containing the uncertainty estimation scripts (referenced below as `$SCRIPTS/scripts/uncertainty/`)
* `DROPOUT_N`: number of stochastic forward passes
`$DROPOUT_N` is set to 30 in the experiments reported in the paper. However, we observed that increasing it beyond 10
does not bring substantial improvements.
## Translate the data using standard decoding
Preprocess the input data:
```bash
for LANG in $SRC_LANG $TGT_LANG; do
perl $MOSES_DECODER/scripts/tokenizer/tokenizer.perl -threads 80 -a -l $LANG < $INPUT.$LANG > $TMP/preprocessed.tok.$LANG
python $BPE_ROOT/apply_bpe.py -c ${BPE} < $TMP/preprocessed.tok.$LANG > $TMP/preprocessed.tok.bpe.$LANG
done
```
Binarize the data for faster translation:
```bash
fairseq-preprocess --srcdict $MODEL_DIR/dict.$SRC_LANG.txt --tgtdict $MODEL_DIR/dict.$TGT_LANG.txt \
  --source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref $TMP/preprocessed.tok.bpe --destdir $TMP/bin --workers 4
```
Translate:
```bash
CUDA_VISIBLE_DEVICES=$GPU fairseq-generate $TMP/bin --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 \
  --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 > $TMP/fairseq.out

grep ^H $TMP/fairseq.out | cut -d- -f2- | sort -n | cut -f3- > $TMP/mt.out
```
Post-process:
```bash
sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/mt.out | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl \
  -l $TGT_LANG > $OUTPUT_DIR/mt.out
```
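The `sed` expression above just strips the `@@ ` BPE continuation markers before detokenization. For illustration, a minimal Python equivalent of the same substitution (the example sentence is made up):

```python
import re

def remove_bpe(line: str) -> str:
    # mirror the sed expression above: drop "@@ " joiners and a trailing " @@"
    return re.sub(r"(@@ )| (@@ ?$)", "", line)

print(remove_bpe("un@@ certain@@ ty estimation"))  # -> "uncertainty estimation"
```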
## Produce uncertainty estimates
### Scoring
Make temporary files in which the source sentences and their translations are each repeated N times.
```bash
python ${SCRIPTS}/scripts/uncertainty/repeat_lines.py -i $TMP/preprocessed.tok.bpe.$SRC_LANG -n $DROPOUT_N \
  -o $TMP/repeated.$SRC_LANG
python ${SCRIPTS}/scripts/uncertainty/repeat_lines.py -i $TMP/mt.out -n $DROPOUT_N -o $TMP/repeated.$TGT_LANG

fairseq-preprocess --srcdict ${MODEL_DIR}/dict.${SRC_LANG}.txt --tgtdict ${MODEL_DIR}/dict.${TGT_LANG}.txt \
  --source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref ${TMP}/repeated --destdir ${TMP}/bin-repeated
```
Produce model scores for the generated translations, using the `--retain-dropout` option to apply dropout at inference time:
```bash
CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 \
  --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 --score-reference --retain-dropout \
  --retain-dropout-modules TransformerModel TransformerEncoder TransformerDecoder \
  TransformerEncoderLayer TransformerDecoderLayer --seed 46 > $TMP/dropout.scoring.out

grep ^H $TMP/dropout.scoring.out | cut -d- -f2- | sort -n | cut -f2 > $TMP/dropout.scores
```
Use `--retain-dropout-modules` to specify in which modules dropout should remain active at inference time. By default, dropout is applied in the same places as during training.
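Conceptually, `--retain-dropout` keeps dropout sampling in the selected modules while the rest of the model runs in inference mode, so repeatedly scoring the same input produces different stochastic forward passes. A minimal PyTorch sketch of this idea (illustrative only, not fairseq's actual implementation; `model` is a placeholder):

```python
import torch.nn as nn

def retain_dropout(model: nn.Module,
                   module_names=("TransformerEncoderLayer", "TransformerDecoderLayer")):
    # put the whole model into eval mode, then switch the selected module types
    # (and their submodules) back to train mode so their dropout layers keep
    # sampling at inference time
    model.eval()
    for m in model.modules():
        if type(m).__name__ in module_names:
            m.train()
```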
Compute the mean of the resulting output distribution:
```bash
python $SCRIPTS/scripts/uncertainty/aggregate_scores.py -i $TMP/dropout.scores -o $OUTPUT_DIR/dropout.scores.mean \
  -n $DROPOUT_N
```
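`aggregate_scores.py` (included below) groups the scores into blocks of `DROPOUT_N` consecutive lines, one block per source segment, and applies an aggregation function; `mean` is the default, and `std`, `var`, `median`, `min` and `max` are also available via `-f`. A minimal sketch of that grouping, assuming one score per line:

```python
import numpy as np

def aggregate(scores, n, func=np.mean):
    # scores: flat list with n stochastic-pass scores per segment, in order
    assert len(scores) % n == 0
    return [float(func(scores[i:i + n])) for i in range(0, len(scores), n)]

# e.g. aggregate(all_scores, n=DROPOUT_N) reproduces dropout.scores.mean
```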
### Generation
Produce multiple translation hypotheses for the same source using the `--retain-dropout` option:
```bash
CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt \
  --beam 5 --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --retain-dropout \
  --unkpen 5 --retain-dropout-modules TransformerModel TransformerEncoder TransformerDecoder \
  TransformerEncoderLayer TransformerDecoderLayer --seed 46 > $TMP/dropout.generation.out

grep ^H $TMP/dropout.generation.out | cut -d- -f2- | sort -n | cut -f3- > $TMP/dropout.hypotheses_

sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/dropout.hypotheses_ | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl \
  -l $TGT_LANG > $TMP/dropout.hypotheses
```
Compute the similarity between the multiple hypotheses corresponding to the same source sentence using the Meteor evaluation metric:
```bash
python meteor.py -i $TMP/dropout.hypotheses -m <path_to_meteor_installation> -n $DROPOUT_N \
  -o $OUTPUT_DIR/dropout.gen.sim.meteor
```
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import sys
import numpy as np
aggregate_funcs = {
"std": np.std,
"var": np.var,
"median": np.median,
"mean": np.mean,
"min": np.min,
"max": np.max,
}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input_file", required=True, type=str)
parser.add_argument("-n", "--repeat_times", required=True, type=int)
parser.add_argument("-o", "--output_file", required=False)
parser.add_argument("-f", "--func", required=False, default="mean")
args = parser.parse_args()
stream = open(args.output_file, "w") if args.output_file else sys.stdout
segment_scores = []
for line in open(args.input_file):
segment_scores.append(float(line.strip()))
if len(segment_scores) == args.repeat_times:
stream.write("{}\n".format(aggregate_funcs[args.func](segment_scores)))
segment_scores = []
if __name__ == "__main__":
main()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import math
import os
import subprocess
import sys
import tempfile
from collections import defaultdict
from itertools import combinations
def read_translations(path, n_repeats):
segment_counter = 0
segment_translations = []
translations = defaultdict(list)
for line in open(path):
segment_translations.append(" ".join(line.split()))
if len(segment_translations) == n_repeats:
translations[segment_counter] = segment_translations
segment_translations = []
segment_counter += 1
return translations
def generate_input(translations, n_repeats):
_, ref_path = tempfile.mkstemp()
_, mt_path = tempfile.mkstemp()
ref_fh = open(ref_path, "w")
mt_fh = open(mt_path, "w")
for segid in sorted(translations.keys()):
assert len(translations[segid]) == n_repeats
indexes = combinations(range(n_repeats), 2)
for idx1, idx2 in indexes:
mt_fh.write(translations[segid][idx1].strip() + "\n")
ref_fh.write(translations[segid][idx2].strip() + "\n")
sys.stderr.write("\nSaved translations to %s and %s" % (ref_path, mt_path))
return ref_path, mt_path
def run_meteor(ref_path, mt_path, metric_path, lang="en"):
_, out_path = tempfile.mkstemp()
subprocess.call(
[
"java",
"-Xmx2G",
"-jar",
metric_path,
mt_path,
ref_path,
"-p",
"0.5 0.2 0.6 0.75", # default parameters, only changed alpha to give equal weight to P and R
"-norm",
"-l",
lang,
],
stdout=open(out_path, "w"),
)
os.remove(ref_path)
os.remove(mt_path)
sys.stderr.write("\nSaved Meteor output to %s" % out_path)
return out_path
def read_output(meteor_output_path, n_repeats):
n_combinations = math.factorial(n_repeats) / (
math.factorial(2) * math.factorial(n_repeats - 2)
)
raw_scores = []
average_scores = []
for line in open(meteor_output_path):
if not line.startswith("Segment "):
continue
score = float(line.strip().split("\t")[1])
raw_scores.append(score)
if len(raw_scores) == n_combinations:
average_scores.append(sum(raw_scores) / n_combinations)
raw_scores = []
os.remove(meteor_output_path)
return average_scores
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--infile")
parser.add_argument("-n", "--repeat_times", type=int)
parser.add_argument("-m", "--meteor")
parser.add_argument("-o", "--output")
args = parser.parse_args()
translations = read_translations(args.infile, args.repeat_times)
sys.stderr.write("\nGenerating input for Meteor...")
ref_path, mt_path = generate_input(translations, args.repeat_times)
sys.stderr.write("\nRunning Meteor...")
out_path = run_meteor(ref_path, mt_path, args.meteor)
sys.stderr.write("\nReading output...")
scores = read_output(out_path, args.repeat_times)
sys.stderr.write("\nWriting results...")
with open(args.output, "w") as o:
for scr in scores:
o.write("{}\n".format(scr))
if __name__ == "__main__":
main()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import sys
def _normalize_spaces(line):
return " ".join(line.split())
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input_file", required=True, type=str)
parser.add_argument("-n", "--repeat_times", required=True, type=int)
parser.add_argument("-o", "--output_file", required=False, type=str)
args = parser.parse_args()
stream = open(args.output_file, "w") if args.output_file else sys.stdout
for line in open(args.input_file):
for _ in range(args.repeat_times):
stream.write(_normalize_spaces(line) + "\n")
if __name__ == "__main__":
main()