# Beyond English-Centric Multilingual Machine Translation
## Introduction
In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively with the best single systems of WMT.
If you are new to using fairseq, read the following walkthrough. Otherwise, skip to the sections below.
0. **Generation Data**
To download the generation data, follow the commands below. Note that all datasets need to be detokenized *before* applying SPM in the data preprocessing step. If you use these evaluation datasets, please cite their associated papers.
```bash
# WMT - use sacrebleu, example here:
sacrebleu -t wmt14 -l fr-en --echo src > wmt.test.fr-en.fr
sacrebleu -t wmt14 -l fr-en --echo ref > wmt.test.fr-en.en
# WAT
wget http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/wat2019.my-en.zip
unzip wat2019.my-en.zip
# FLORES
# download from: https://github.com/facebookresearch/flores
# TED - need to detokenize with Moses!
# from: https://github.com/neulab/word-embeddings-for-nmt
wget http://phontron.com/data/ted_talks.tar.gz
# Autshumato
# request to download: https://repo.sadilar.org/handle/20.500.12185/397
# Tatoeba Challenge
# available here: https://github.com/Helsinki-NLP/Tatoeba-Challenge
```
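The detokenization note above matters for pre-tokenized sets such as the TED data. A minimal sketch of Moses detokenization with `sacremoses` (the same library used by the `detok` helper later in this repository); reading one tokenized sentence per line from stdin is an assumption for illustration:
```python
# Minimal detokenization sketch (assumption: sacremoses is installed and the
# input arrives one tokenized sentence per line on stdin).
import sys
import sacremoses

detok = sacremoses.MosesDetokenizer()
for line in sys.stdin:
    print(detok.detokenize(line.strip().split()))
```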
1. **Training Data**
To produce the training data, we use a combination of [CCMatrix](https://arxiv.org/abs/1911.04944) and [CCAligned](https://arxiv.org/abs/1911.06154). Check out the instructions [here](https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix) to download the raw data.
2. **Preprocess Data**
After downloading the raw data, you will need to run the postprocessing scripts below, then apply SPM, then binarize. Note that it is very important to run the postprocessing script, because it removes any instance of the evaluation data that appears in the mined training data.
```bash
# preprocess data
# remove sentences with more than 50% punctuation
python /path/to/fairseq/examples/m2m_100/process_data/remove_too_much_punc.py
# deduplicate training data
paste /path/to/datadir/train.$src /path/to/datadir/train.$tgt | awk '!x[$0]++' > /path/to/datadir/train.dedup
echo "keeping $(wc -l /path/to/datadir/train.dedup) bitext out of $(wc -l /path/to/datadir/train.$src)"
cut -f1 /path/to/datadir/train.dedup > /path/to/datadir/train.$src
cut -f2 /path/to/datadir/train.dedup > /path/to/datadir/train.$tgt
# remove all instances of evaluation data from the training data
python /path/to/fairseq/examples/m2m_100/process_data/dedup_data.py
# frequency cleaning
wget https://dl.fbaipublicfiles.com/m2m_100/histograms.tar.gz
tar -xvzf histograms.tar.gz
python /path/to/fairseq/examples/m2m_100/process_data/clean_histogram.py --src $src --tgt $tgt --src-file /path/to/source/file --tgt-file /path/to/output/file --src-output-file source_output.$src --tgt-output-file target_output.$tgt --histograms /path/to/histograms
# apply SPM
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model
python /path/to/fairseq/scripts/spm_encode.py \
--model spm.128k.model \
--output_format=piece \
--inputs=/path/to/input/file/here \
--outputs=/path/to/output/file/here
# length ratio cleaning
perl mosesdecoder/scripts/training/clean-corpus-n.perl --ratio 3 /path/to/training/data/train.spm.$src-$tgt $src $tgt /path/to/output/directory/train.spm.$src-$tgt 1 250
# binarize data
wget https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt
fairseq-preprocess \
--source-lang $src --target-lang $tgt \
--testpref spm.$src.$tgt \
--thresholdsrc 0 --thresholdtgt 0 \
--destdir data_bin \
--srcdict data_dict.128k.txt --tgtdict data_dict.128k.txt
```
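For reference, the `paste ... | awk '!x[$0]++'` deduplication step above keeps only the first occurrence of each (source, target) sentence pair. An equivalent Python sketch (file names are placeholders, not the actual paths):
```python
# Sketch of the deduplication step: keep the first occurrence of each
# (source, target) pair. "train.src"/"train.tgt" are placeholder paths.
seen = set()
with open("train.src", encoding="utf8") as fs, open("train.tgt", encoding="utf8") as ft, \
     open("train.dedup.src", "w", encoding="utf8") as out_s, \
     open("train.dedup.tgt", "w", encoding="utf8") as out_t:
    for s, t in zip(fs, ft):
        pair = (s, t)
        if pair not in seen:
            seen.add(pair)
            out_s.write(s)
            out_t.write(t)
```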
3. **Training Scripts**
To reproduce the training of our models, we train with fairseq-py's multilingual translation [task](https://github.com/pytorch/fairseq/tree/master/examples/multilingual). If you are interested in model parallel training, also check out [fairscale](https://github.com/facebookresearch/fairscale).
4. **Generation**
To generate from our models, follow the commands in the generation section below.
If you use any of the resources listed here, please cite:
```bibtex
@article{fan2020beyond,
title={Beyond English-Centric Multilingual Machine Translation},
author={Fan, Angela and Bhosale, Shruti and Schwenk, Holger and Ma, Zhiyi and El-Kishky, Ahmed and Goyal, Siddharth and Baines, Mandeep and Celebi, Onur and Wenzek, Guillaume and Chaudhary, Vishrav and Goyal, Naman and Birch, Tom and Liptchinsky, Vitaliy and Edunov, Sergey and Grave, Edouard and Auli, Michael and Joulin, Armand},
journal={arXiv preprint},
year={2020}
}
@article{schwenk2019ccmatrix,
title={Ccmatrix: Mining billions of high-quality parallel sentences on the web},
author={Schwenk, Holger and Wenzek, Guillaume and Edunov, Sergey and Grave, Edouard and Joulin, Armand},
journal={arXiv preprint arXiv:1911.04944},
year={2019}
}
@article{el2019massive,
title={A Massive Collection of Cross-Lingual Web-Document Pairs},
author={El-Kishky, Ahmed and Chaudhary, Vishrav and Guzman, Francisco and Koehn, Philipp},
journal={arXiv preprint arXiv:1911.06154},
year={2019}
}
```
## Trained Models
Looking for other trained models? Check back soon.
Model | Description | Download
---|---|---
`12b_last_checkpoint` | 12B parameter model trained on many-to-many training data for 100 languages | [12b_last_checkpoint](https://dl.fbaipublicfiles.com/m2m_100/12b_last_checkpoint.pt)
## SentencePiece Model
```bash
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model
```
## Generation with M2M-100
### Encode using our SentencePiece Model
Note: Install SentencePiece from [here](https://github.com/google/sentencepiece)
```bash
fairseq=/path/to/fairseq
cd $fairseq
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr
wget https://dl.fbaipublicfiles.com/m2m_100/spm.128k.model
for lang in de fr ; do
python scripts/spm_encode.py \
--model spm.128k.model \
--output_format=piece \
--inputs=raw_input.de-fr.${lang} \
--outputs=spm.de-fr.${lang}
done
```
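If you prefer to call SentencePiece from Python instead of the `spm_encode.py` script, the library exposes an equivalent API. A small sketch (the German example sentence is made up):
```python
# Sketch: encode raw text into SPM pieces with the sentencepiece Python API.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.128k.model")  # same model downloaded above
pieces = sp.EncodeAsPieces("Maschinelles Lernen ist großartig!")  # example sentence
print(" ".join(pieces))
```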
### Binarization
```bash
wget https://dl.fbaipublicfiles.com/m2m_100/data_dict.128k.txt
fairseq-preprocess \
--source-lang de --target-lang fr \
--testpref spm.de-fr \
--thresholdsrc 0 --thresholdtgt 0 \
--destdir data_bin \
--srcdict data_dict.128k.txt --tgtdict data_dict.128k.txt
```
### Generation on a V100 GPU
```bash
wget https://dl.fbaipublicfiles.com/m2m_100/model_dict.128k.txt
wget https://dl.fbaipublicfiles.com/m2m_100/language_pairs.txt
wget https://dl.fbaipublicfiles.com/m2m_100/12b_last_checkpoint.pt
fairseq-generate \
data_bin \
--batch-size 1 \
--path 12b_last_checkpoint.pt \
--fixed-dictionary model_dict.128k.txt \
-s de -t fr \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[26]' \
--pipeline-encoder-devices '[0]' \
--pipeline-decoder-balance '[1,24,1]' \
--pipeline-decoder-devices '[0,1,0]' > gen_out
```
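The `--pipeline-*-balance` and `--pipeline-*-devices` lists describe how the model is split for fairscale-style pipeline parallelism. A tiny sketch of that mapping, purely for illustration, under the assumption that each balance entry is the number of consecutive model partitions placed on the corresponding device id:
```python
# Illustration only (assumption: balance[i] consecutive partitions go on devices[i]).
def assign(balance, devices):
    placement = []
    part = 0
    for count, dev in zip(balance, devices):
        for _ in range(count):
            placement.append((part, dev))
            part += 1
    return placement

# decoder above: 1 partition on GPU 0, 24 on GPU 1, 1 back on GPU 0
print(assign([1, 24, 1], [0, 1, 0]))
```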
## Evaluation with M2M-100
### Tokenization
Note: Refer to tokenizers/README.md for more details on tokenization.
```bash
cd ${fairseq}/examples/m2m_100
cat ${fairseq}/gen_out | grep -P "^H" | sort -V | cut -f 3- | sh tok.sh fr > hyp
cat ${fairseq}/raw_input.de-fr.fr | sh tok.sh fr > ref
```
### BLEU
```bash
sacrebleu -tok 'none' ref < hyp
```
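The same score can be computed with sacrebleu's Python API; a small sketch, assuming `hyp` and `ref` are the files produced above:
```python
# Sketch: compute BLEU with tokenization disabled, mirroring `sacrebleu -tok 'none'`.
import sacrebleu

with open("hyp", encoding="utf8") as h, open("ref", encoding="utf8") as r:
    hyps = [line.strip() for line in h]
    refs = [line.strip() for line in r]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="none")
print(bleu.score)
```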
#!/usr/bin/env bash
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
CWD=`pwd`
INSTALL_PATH=$CWD/tokenizers/thirdparty
MOSES=$INSTALL_PATH/mosesdecoder
if [ ! -d $MOSES ]; then
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git $MOSES
cd $MOSES
# To deal with differences in handling ' vs "
git checkout 03578921cc1a03402
cd -
fi
WMT16_SCRIPTS=$INSTALL_PATH/wmt16-scripts
if [ ! -d $WMT16_SCRIPTS ]; then
echo 'Cloning Romanian tokenization scripts'
git clone https://github.com/rsennrich/wmt16-scripts.git $WMT16_SCRIPTS
fi
KYTEA=$INSTALL_PATH/kytea
if [ ! -f $KYTEA/bin/kytea ]; then
git clone https://github.com/neubig/kytea.git $KYTEA
cd $KYTEA
autoreconf -i
./configure --prefix=`pwd`
make
make install
cd ..
fi
export MECAB=$INSTALL_PATH/mecab-0.996-ko-0.9.2
if [ ! -f $MECAB/bin/mecab ]; then
cd $INSTALL_PATH
curl -LO https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
tar zxfv mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2/
./configure --prefix=`pwd`
make
make install
cd ..
curl -LO https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
tar zxfv mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720/
./autogen.sh
./configure --prefix=`pwd` --with-dicdir=$MECAB/lib/mecab/dic/mecab-ko-dic --with-mecab-config=$MECAB/bin/mecab-config
make
sh -c 'echo "dicdir=$MECAB/lib/mecab/dic/mecab-ko-dic" > $MECAB/etc/mecabrc'
make install
cd $CWD
fi
INDIC_RESOURCES_PATH=$INSTALL_PATH/indic_nlp_resources
if [ ! -d $INDIC_RESOURCES_PATH ]; then
echo 'Cloning indic_nlp_resources'
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git $INDIC_RESOURCES_PATH
fi
if [ ! -f $INSTALL_PATH/seg_my.py ]; then
cd $INSTALL_PATH
wget http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/wat2020.my-en.zip
unzip wat2020.my-en.zip
# switch to python3
cat wat2020.my-en/myseg.py |sed 's/^sys.std/###sys.std/g' | sed 's/### sys/sys/g' | sed 's/unichr/chr/g' > seg_my.py
cd $CWD
fi
pip install pythainlp sacrebleu indic-nlp-library
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--src', type=str, help='Source language')
parser.add_argument('--tgt', type=str, help='Target language')
parser.add_argument('--src-file', type=str, help='Input source file')
parser.add_argument('--tgt-file', type=str, help='Input target file')
parser.add_argument('--src-output-file', type=str, help='Output source file')
parser.add_argument('--tgt-output-file', type=str, help='Output target file')
parser.add_argument('--threshold', type=float, default=0.5, help='Threshold')
parser.add_argument('--threshold-character', type=str, default=']', help='Threshold character')
parser.add_argument('--histograms', type=str, help='Path to histograms')
args = parser.parse_args()
def read_hist(f):
ch = []
for line in f:
c = line[0]
if c == args.threshold_character:
break
ch.append(c)
return ch
with(open("{}/{}".format(args.histograms, args.src), 'r', encoding='utf8')) as f:
ch1 = read_hist(f)
with(open("{}/{}".format(args.histograms, args.tgt), 'r', encoding='utf8')) as f:
ch2 = read_hist(f)
print("Accepted characters for {}: {}".format(args.src, ch1))
print("Accepted characters for {}: {}".format(args.tgt, ch2))
with open(args.src_file, 'r', encoding='utf8') as fs1, open(args.tgt_file, 'r', encoding='utf8') as fs2, open(args.src_output_file, 'w', encoding='utf8') as fos1, open(args.tgt_output_file, 'w', encoding='utf8') as fos2:
ls1 = fs1.readline()
ls2 = fs2.readline()
while ls1 or ls2:
cnt1 = len([c for c in ls1.strip() if c in ch1])
cnt2 = len([c for c in ls2.strip() if c in ch2])
if cnt1 / len(ls1) > args.threshold and cnt2 / len(ls2) > args.threshold:
fos1.write(ls1)
fos2.write(ls2)
else:
print("{} {} {} \n{} {} {}".format(args.src, cnt1 / len(ls1), ls1.strip(), args.tgt, cnt2 / len(ls2), ls2.strip()))
ls1 = fs1.readline()
ls2 = fs2.readline()
import argparse
from collections import namedtuple
import os
DATADIR = "/path/to/train_data"
DEDUP_FROM_DIR = "/path/to/eval/data"
OUTPUT_DIR = "/path/to/output/data"
def main(args):
languages = set()
for language_directory in os.listdir(DATADIR):
if "_" in language_directory:
src, tgt = language_directory.split("_")
languages.add(LanguagePair(src=src, tgt=tgt))
data = existing_data()
train_languages = sorted(languages)
for language_pair in train_languages[args.start_index:args.start_index + args.size]:
print(language_pair)
dedup(language_pair, data)
LanguagePair = namedtuple("LanguagePair", ["src", "tgt"])
def existing_data():
data = set()
for file in os.listdir(DEDUP_FROM_DIR):
with open(os.path.join(DEDUP_FROM_DIR, file)) as f:
data |= set(f.readlines())
return data
def dedup(language_pair, data, verbose=True, output=True):
train_filenames = LanguagePair(
src=f"{DATADIR}/{language_pair.src}_{language_pair.tgt}/train.{language_pair.src}",
tgt=f"{DATADIR}/{language_pair.src}_{language_pair.tgt}/train.{language_pair.tgt}",
)
output_filenames = LanguagePair(
src=f"{OUTPUT_DIR}/train.dedup.{language_pair.src}-{language_pair.tgt}.{language_pair.src}",
tgt=f"{OUTPUT_DIR}/train.dedup.{language_pair.src}-{language_pair.tgt}.{language_pair.tgt}"
)
# If output exists, skip this pair. It has already been done.
if (os.path.exists(output_filenames.src) and
os.path.exists(output_filenames.tgt)):
if verbose:
print(f"{language_pair.src}-{language_pair.tgt} already done.")
return
if verbose:
print(f"{language_pair.src}-{language_pair.tgt} ready, will check dups.")
# If there is no output, no need to actually do the loop.
if not output:
return
if os.path.exists(train_filenames.src) and os.path.exists(train_filenames.tgt):
with open(train_filenames.src) as f:
train_source = f.readlines()
with open(train_filenames.tgt) as f:
train_target = f.readlines()
# do dedup
new_train_source = []
new_train_target = []
for i, train_line in enumerate(train_source):
if train_line not in data and train_target[i] not in data:
new_train_source.append(train_line)
new_train_target.append(train_target[i])
assert len(train_source) == len(train_target)
assert len(new_train_source) == len(new_train_target)
assert len(new_train_source) <= len(train_source)
with open(output_filenames.src, "w") as o:
for line in new_train_source:
o.write(line)
with open(output_filenames.tgt, "w") as o:
for line in new_train_target:
o.write(line)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("-s", "--start-index", required=True, type=int)
parser.add_argument("-n", "--size", required=True, type=int)
main(parser.parse_args())
import gzip
import argparse
from string import punctuation
def len_no_punc(s, punc):
return len([ch for ch in s if ch in punc])
def filter_overpunc(len_npunc, len_sen):
return len_npunc < 0.5*len_sen
def main(args):
punc = punctuation + "—|–"
print('Processing file {}'.format(args.input))
    with gzip.open(args.input, 'rt', encoding=args.encoding) as tsv:
        with open(args.bitext + '.' + args.src_lang, 'wt', encoding=args.encoding) as fsrc:
            with open(args.bitext + '.' + args.tgt_lang, 'wt', encoding=args.encoding) as ftgt:
                # each input line is tab-separated; column 0 is ignored here,
                # columns 1 and 2 hold the source and target sentences
                for line in tsv:
                    fields = line.split('\t')
                    src, tgt = fields[1], fields[2]
                    nchar_npunc_src = len_no_punc(src, punc)
                    nchar_npunc_tgt = len_no_punc(tgt, punc)
                    # keep the pair only if punctuation makes up less than half of each side
                    if filter_overpunc(nchar_npunc_src, len(src)) and filter_overpunc(nchar_npunc_tgt, len(tgt)):
                        fsrc.write(src.strip() + '\n')
                        ftgt.write(tgt.strip() + '\n')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True, type=str)
parser.add_argument('--encoding', default='utf-8', help='character encoding for input/output')
parser.add_argument('--bitext', type=str, required=True, help='language direction')
parser.add_argument('--src-lang', type=str, required=True, help='Source language')
parser.add_argument('--tgt-lang', type=str, required=True, help='Target language')
main(parser.parse_args())
#!/usr/bin/env bash
# Copyright (c) 2019-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
set -e
TOKENIZERS_SCRIPTS=tokenizers
INSTALL_PATH=$TOKENIZERS_SCRIPTS/thirdparty
N_THREADS=8
lg=$1
MOSES=$INSTALL_PATH/mosesdecoder
REPLACE_UNICODE_PUNCT=$MOSES/scripts/tokenizer/replace-unicode-punctuation.perl
NORM_PUNC=$MOSES/scripts/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$MOSES/scripts/tokenizer/remove-non-printing-char.perl
TOKENIZER=$MOSES/scripts/tokenizer/tokenizer.perl
# special tokenization for Romanian
WMT16_SCRIPTS=$INSTALL_PATH/wmt16-scripts
NORMALIZE_ROMANIAN=$WMT16_SCRIPTS/preprocess/normalise-romanian.py
REMOVE_DIACRITICS=$WMT16_SCRIPTS/preprocess/remove-diacritics.py
# Burmese
MY_SEGMENT=$INSTALL_PATH/seg_my.py
# Arabic
AR_TOKENIZER=$TOKENIZERS_SCRIPTS/tokenizer_ar.sh
# Korean
KO_SEGMENT=$TOKENIZERS_SCRIPTS/seg_ko.sh
# Japanese
JA_SEGMENT=$TOKENIZERS_SCRIPTS/seg_ja.sh
# Indic
IN_TOKENIZER=$TOKENIZERS_SCRIPTS/tokenize_indic.py
INDIC_RESOURCES_PATH=$INSTALL_PATH/indic_nlp_resources
# Thai
THAI_TOKENIZER=$TOKENIZERS_SCRIPTS/tokenize_thai.py
# Chinese
CHINESE_TOKENIZER=$TOKENIZERS_SCRIPTS/tokenize_zh.py
# Chinese
if [ "$lg" = "zh" ]; then
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | python $CHINESE_TOKENIZER
# Thai
elif [ "$lg" = "th" ]; then
cat - | python $THAI_TOKENIZER
# Japanese
elif [ "$lg" = "ja" ]; then
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | ${JA_SEGMENT}
# Korean
elif [ "$lg" = "ko" ]; then
cat - | $REM_NON_PRINT_CHAR | ${KO_SEGMENT}
# Romanian
elif [ "$lg" = "ro" ]; then
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $NORMALIZE_ROMANIAN | $REMOVE_DIACRITICS | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
# Burmese
elif [ "$lg" = "my" ]; then
cat - | python ${MY_SEGMENT}
# Arabic
elif [ "$lg" = "ar" ]; then
cat - | ${AR_TOKENIZER}
# Indic
elif [ "$lg" = "ne" ]; then
cat - | python ${IN_TOKENIZER} $lg
elif [ "$lg" = "si" ]; then
cat - | python ${IN_TOKENIZER} $lg
elif [ "$lg" = "hi" ]; then
cat - | python ${IN_TOKENIZER} $lg
# other languages
else
cat - | $REPLACE_UNICODE_PUNCT | $NORM_PUNC -l $lg | $REM_NON_PRINT_CHAR | $TOKENIZER -no-escape -threads $N_THREADS -l $lg
fi
# M2M-100 Tokenization
We apply different tokenization strategies for different languages, following the existing literature. Here we provide `tok.sh`, a tokenizer that can be used to reproduce our results.
To reproduce the results, follow these steps:
```bash
tgt_lang=...
reference_translation=...
cat generation_output | grep -P "^H" | sort -V | cut -f 3- | sh tok.sh $tgt_lang > hyp
cat $reference_translation |sh tok.sh $tgt_lang > ref
sacrebleu -tok 'none' ref < hyp
```
## Installation
Tools needed for all languages except Arabic can be installed by running `install_dependencies.sh`.
If you want to evaluate Arabic models, please follow the instructions provided at http://alt.qcri.org/tools/arabic-normalizer/ to install the required tools.
#!/usr/bin/env bash
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
SCRIPT=`realpath $0`
KYTEA=`dirname $SCRIPT`/thirdparty/kytea
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$KYTEA/lib:/usr/local/lib
export PATH=$PATH:"$KYTEA/bin"
cat - | tr -d "[:blank:]" | kytea -notags
#!/usr/bin/env bash
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
SCRIPT=`realpath $0`
MECAB=`dirname $SCRIPT`/thirdparty/mecab-0.996-ko-0.9.2
export PATH=$PATH:"$MECAB/bin":"$MECAB/lib"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"$MECAB/lib"
cat - | mecab -O wakati
seg_my.py
indic_nlp_library/
indic_nlp_resources/
kytea/
mecab-0.996-ko-0.9.2.tar.gz
mecab-0.996-ko-0.9.2/
mosesdecoder/
wat2020.my-en.zip
wat2020.my-en/
wmt16-scripts/
mecab-ko-dic-2.1.1-20180720/
mecab-ko-dic-2.1.1-20180720.tar.gz
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
# Use: echo {text} | python tokenize_indic.py {language}
import sys
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize.indic_tokenize import trivial_tokenize
factory = IndicNormalizerFactory()
normalizer = factory.get_normalizer(
sys.argv[1], remove_nuktas=False, nasals_mode="do_nothing"
)
for line in sys.stdin:
normalized_line = normalizer.normalize(line.strip())
tokenized_line = " ".join(trivial_tokenize(normalized_line, sys.argv[1]))
print(tokenized_line)
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import sys
from pythainlp import word_tokenize
for line in sys.stdin:
print(" ".join(word_tokenize(line.strip())))
#!/usr/bin/env python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import fileinput
import sacrebleu
for line in fileinput.input():
print(sacrebleu.tokenize_zh(line))
#!/usr/bin/env sh
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
# Please follow the instructions here http://alt.qcri.org/tools/arabic-normalizer/
# to install tools needed for Arabic
echo "Please install Arabic tools: http://alt.qcri.org/tools/arabic-normalizer/"
echo "Then update environment variables in tokenizer_ar.sh"
exit 1
SVMTOOL=...
GOMOSESGO=...
QCRI_ARABIC_NORMALIZER=...
export PERL5LIB="$SVMTOOL/lib":"$GOMOSESGO/bin/MADA-3.2":$PERL5LIB
tempfile=$(mktemp)
cat - > $tempfile
cd $QCRI_ARABIC_NORMALIZER
bash qcri_normalizer_mada3.2_aramorph1.2.1.sh $tempfile
cat $tempfile.mada_norm-aramorph.europarl_tok
# MBART: Multilingual Denoising Pre-training for Neural Machine Translation
[arXiv:2001.08210](https://arxiv.org/abs/2001.08210)
## Introduction
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, the decoder, or reconstructing parts of the text.
## Pre-trained models
Model | Description | # params | Download
---|---|---|---
`mbart.CC25` | mBART model with 12 encoder and decoder layers trained on 25 languages' monolingual corpus | 610M | [mbart.CC25.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz)
`mbart.ft.ro_en` | mBART CC25 model finetuned on the EN-RO language pair | 610M | [mbart.cc25.ft.enro.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.ft.enro.tar.gz)
## Results
**[WMT16 EN-RO](https://www.statmt.org/wmt16/translation-task.html)**
_(test set, no additional data used)_
Model | en-ro | ro-en
---|---|---
`Random` | 34.3 | 34.0
`mbart.cc25` | 37.7 | 37.8
`mbart.enro.bilingual` | 38.5 | 38.5
## BPE data
```bash
# download model
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz
tar -xzvf mbart.CC25.tar.gz
```
Install SPM from [here](https://github.com/google/sentencepiece), then apply it to the raw text:
```bash
SPM=/path/to/sentencepiece/build/src/spm_encode
MODEL=sentence.bpe.model
${SPM} --model=${MODEL} < ${DATA}/${TRAIN}.${SRC} > ${DATA}/${TRAIN}.spm.${SRC} &
${SPM} --model=${MODEL} < ${DATA}/${TRAIN}.${TGT} > ${DATA}/${TRAIN}.spm.${TGT} &
${SPM} --model=${MODEL} < ${DATA}/${VALID}.${SRC} > ${DATA}/${VALID}.spm.${SRC} &
${SPM} --model=${MODEL} < ${DATA}/${VALID}.${TGT} > ${DATA}/${VALID}.spm.${TGT} &
${SPM} --model=${MODEL} < ${DATA}/${TEST}.${SRC} > ${DATA}/${TEST}.spm.${SRC} &
${SPM} --model=${MODEL} < ${DATA}/${TEST}.${TGT} > ${DATA}/${TEST}.spm.${TGT} &
```
## Preprocess data
```bash
DICT=dict.txt
fairseq-preprocess \
--source-lang ${SRC} \
--target-lang ${TGT} \
--trainpref ${DATA}/${TRAIN}.spm \
--validpref ${DATA}/${VALID}.spm \
--testpref ${DATA}/${TEST}.spm \
--destdir ${DEST}/${NAME} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70
```
## Finetune on EN-RO
Finetune the mBART CC25 model on EN-RO:
```bash
PRETRAIN=mbart.cc25 # fix if you moved the downloaded checkpoint
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
fairseq-train path_2_data \
--encoder-normalize-before --decoder-normalize-before \
--arch mbart_large --layernorm-embedding \
--task translation_from_pretrained_bart \
--source-lang en_XX --target-lang ro_RO \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler polynomial_decay --lr 3e-05 --min-lr -1 --warmup-updates 2500 --total-num-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
--seed 222 --log-format simple --log-interval 2 \
--restore-file $PRETRAIN \
--reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
--langs $langs \
--ddp-backend no_c10d
```
## Generate on EN-RO
Compute sacrebleu on the finetuned EN-RO model. Get the tokenizer from [here](https://github.com/rsennrich/wmt16-scripts).
```bash
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.ft.enro.tar.gz
tar -xzvf mbart.cc25.ft.enro.tar.gz
```
```bash
model_dir=MBART_finetuned_enro # fix if you moved the checkpoint
fairseq-generate path_2_data \
--path $model_dir/model.pt \
--task translation_from_pretrained_bart \
--gen-subset test \
-t ro_RO -s en_XX \
--bpe 'sentencepiece' --sentencepiece-model $model_dir/sentence.bpe.model \
--sacrebleu --remove-bpe 'sentencepiece' \
--batch-size 32 --langs $langs > en_ro
cat en_ro | grep -P "^H" |sort -V |cut -f 3- | sed 's/\[ro_RO\]//g' |$TOKENIZER ro > en_ro.hyp
cat en_ro | grep -P "^T" |sort -V |cut -f 2- | sed 's/\[ro_RO\]//g' |$TOKENIZER ro > en_ro.ref
sacrebleu -tok 'none' -s 'none' en_ro.ref < en_ro.hyp
```
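The grep/sort/cut/sed pipeline above can also be expressed in Python, which makes the output format explicit: `H-<id>` lines carry the sample id, a score, and the hypothesis, while `T-<id>` lines carry the id and the reference. A sketch (the `en_ro` file name comes from the command above):
```python
# Sketch: pull hypothesis (H-) and reference (T-) lines out of the
# fairseq-generate output, order them by sample id, and strip the [ro_RO]
# language token, mirroring the shell pipeline above.
def extract(path, prefix, first_text_field):
    rows = []
    with open(path, encoding="utf8") as f:
        for line in f:
            if not line.startswith(prefix + "-"):
                continue
            cols = line.rstrip("\n").split("\t")
            idx = int(cols[0].split("-", 1)[1])
            text = "\t".join(cols[first_text_field:]).replace("[ro_RO]", "").strip()
            rows.append((idx, text))
    return [text for _, text in sorted(rows)]

hyps = extract("en_ro", "H", 2)   # H-<id> <tab> score <tab> hypothesis
refs = extract("en_ro", "T", 1)   # T-<id> <tab> reference
```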
## Citation
```bibtex
@article{liu2020multilingual,
title={Multilingual Denoising Pre-training for Neural Machine Translation},
author={Yinhan Liu and Jiatao Gu and Naman Goyal and Xian Li and Sergey Edunov and Marjan Ghazvininejad and Mike Lewis and Luke Zettlemoyer},
year={2020},
eprint={2001.08210},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
# Megatron-11b
Megatron-11b is a unidirectional language model with `11B` parameters based on [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf). Following the original Megatron work, we trained the model using intra-layer model parallelism with each layer's parameters split across 8 GPUs.
Megatron-11b is trained on the same data and uses the same byte-pair encoding (BPE) as [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf).
## Pre-trained models
Model | Description | # params | # filesize | Download
---|---|---|---|---
`megatron_11b` | megatron_11b unidirectional language model | 11B | 19Gb | [megatron_11b.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz)
#### Architecture:
Param | Value
---|---
embed_dim | 3072
ffn_dim | 3072 * 6
layers | 72
attention heads | 32
#### Training details:
Param | value
---|---
bsz | 512
num_updates | 300,000
peak_lr | 1.5e-04
lr scheduler | inverse_sqrt
clip norm | 0.0
## Example training command (model parallel)
Megatron-11b contains too many parameters to train on a single GPU. Following
the original Megatron work, we adopt an intra-layer model parallel training
approach in which each layer's parameters are split across multiple GPUs and
activations and gradients are communicated during the forward/backward pass,
respectively. We similarly split the loss computation using the
`vocab_parallel_cross_entropy` criterion.
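To make the idea concrete, here is a single-process sketch of intra-layer (tensor) model parallelism: one linear layer's weight matrix is split column-wise into shards, each shard computes its slice of the output, and the slices are concatenated. In the real setup each shard lives on its own GPU and the concatenation is a communication step; this is only an illustration, not the fairseq/Megatron implementation.
```python
# Illustrative single-process sketch of splitting one linear layer's weight
# column-wise across "shards" (stand-ins for GPUs) and recombining the output.
import torch

torch.manual_seed(0)
d_model, d_ffn, n_shards = 8, 24, 4
x = torch.randn(2, d_model)        # (batch, d_model)
w = torch.randn(d_model, d_ffn)    # full weight of one layer

full_out = x @ w                   # what a single GPU would compute
shards = torch.chunk(w, n_shards, dim=1)
shard_outs = [x @ w_i for w_i in shards]   # one partial output per "GPU"
parallel_out = torch.cat(shard_outs, dim=1)

assert torch.allclose(full_out, parallel_out, atol=1e-5)
```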
The following training command illustrates how to do model parallel training in
fairseq. We assume that each machine (node) has 8 GPUs among which to split the
model parameters (`--model-parallel-size 8`). If you have access to multiple
nodes, you may combine this with data parallel training by increasing
`--distributed-world-size`.
To train Megatron-11b on a single node:
```bash
fairseq-train <DATA_PATH> \
--distributed-world-size 8 \
--memory-efficient-fp16 \
--num-workers 2 \
--model-parallel-size 8 \
--criterion vocab_parallel_cross_entropy \
--task language_modeling \
--sample-break-mode none \
--tokens-per-sample 1024 \
--arch transformer_lm_megatron_11b \
--share-decoder-input-output-embed \
--optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --lr 0.00015 \
--warmup-updates 3000 --weight-decay 0.01 \
--dropout 0.1 --attention-dropout 0.1 \
--batch-size 2 \
--max-update 300000;
```
Note: the above was tested on a `DGX-1` box with `8xV100-32GB` GPUs.
## Results
**[Wikitext103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)**
Model | Valid perplexity | Test perplexity
---|---|---
`megatron_11b` | 10.64 | 10.54
## Evaluating `megatron_11b` on Wikitext-103
#### 1. Downloading Megatron-11b
```bash
# WARNING: this file is 19GB
wget https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz
tar -xzvf megatron_11b.tar.gz
```
#### 2. Download Wikitext-103
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```
#### 3. Detokenize test tokens
Megatron-11b uses a byte-level BPE that expects raw (untokenized) input. Since
the wikitext-103 dataset comes tokenized, we apply a simple detokenization
process to restore the untokenized test set:
```bash
python -m examples.megatron_11b.detok wikitext-103-raw/wiki.test.raw > wikitext-103-raw/wiki.test.detok
```
#### 4. BPE encoding
```bash
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json encoder.json \
--vocab-bpe vocab.bpe \
--inputs "wikitext-103-raw/wiki.test.detok" \
--outputs "wikitext-103-raw/wiki.test.bpe" \
--workers 60;
```
#### 5. Fairseq binarize
```bash
fairseq-preprocess \
--only-source \
--testpref wikitext-103-raw/wiki.test.bpe \
--srcdict megatron_11b/dict.txt \
--destdir wikitext103-bin;
```
#### 6. Evaluating perplexity.
We can now evaluate perplexity on the test set. Note that because we've modified
the test set (via detokenization and BPE), the perplexity reported by
`fairseq-eval-lm` needs to be renormalized.
Compute unnormalized perplexity:
```bash
DATA_PATH=wikitext103-bin/
fairseq-eval-lm \
$DATA_PATH \
--path megatron_11b/model.pt \
--task language_modeling \
--gen-subset test \
--batch-size 8 \
--criterion cross_entropy \
--context-window 992 \
--distributed-world-size 8 \
--model-parallel-size 8;
# Expected PPL (unnormalized_ppl): [8.46]
# Note: the eval command needs to run on 8 GPUs for the released model
```
Renormalizing formula: `2 ^ ( log_2(unnormalized_PPL) * (270847 / 245566))`.
PPL After normalization: `10.54`
To renormalize the perplexity, we must account for the change in token count
after detokenizing and applying BPE. The formula for this is:
`2 ^ ( log_2(unnormalized_PPL) * (new_token_cnt / orig_token_cnt))`
For the wikitext-103 test set, the original token count is `245566` and the
token count after detokenization and applying BPE is `270847`.
The perplexity after renormalization is:
`2 ^ ( log_2(8.46) * (270847 / 245566)) = 10.54`
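The renormalization is a one-line computation; checking it in Python reproduces the reported number:
```python
# Renormalize the perplexity for the change in token count (values from above).
import math

unnormalized_ppl = 8.46
orig_token_cnt = 245566    # original wikitext-103 test tokens
new_token_cnt = 270847     # tokens after detokenization + BPE
renormalized = 2 ** (math.log2(unnormalized_ppl) * (new_token_cnt / orig_token_cnt))
print(round(renormalized, 2))  # ~10.54
```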
#!/usr/bin/env python3 -u
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import fileinput
import sacremoses
def main():
parser = argparse.ArgumentParser(description="")
parser.add_argument("files", nargs="*", help="input files")
args = parser.parse_args()
detok = sacremoses.MosesDetokenizer()
for line in fileinput.input(args.files, openhook=fileinput.hook_compressed):
print(
detok.detokenize(line.strip().split(" "))
.replace(" @", "")
.replace("@ ", "")
.replace(" =", "=")
.replace("= ", "=")
.replace(" – ", "–")
)
if __name__ == "__main__":
main()
# Multilingual Translation
[Multilingual Translation with Extensible Multilingual Pretraining and Finetuning (arXiv:2008.00401)](https://arxiv.org/abs/2008.00401)
## Introduction
This work is for training multilingual translation models with multiple bitext datasets. The framework supports (see the [training section](#Training) and [finetuning section](#Finetuning) for examples):
* temperature-based sampling over unbalanced datasets of different translation directions (a small sketch of this sampling scheme follows the list)
  - `--sampling-method` with choices `['uniform', 'temperature', 'concat']`
  - `--sampling-temperature`
* automatically adding source and/or target language tokens to source/target sentences, using data prepared in the same way as for bilingual training
  - `--encoder-langtok` with choices `['src', 'tgt', None]` to specify whether to add source or target language tokens to the source sentences
  - `--decoder-langtok` (binary option) to specify whether to add target language tokens to the target sentences
* finetuning mBART pretrained models for multilingual translation
  - `--finetune-from-model` to specify the path from which to load the pretrained model
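As referenced in the first item above, temperature-based sampling typically sets each direction's sampling probability proportional to its dataset size raised to the power 1/T, so higher temperatures flatten the distribution toward low-resource directions. A small sketch of that common formulation, with made-up dataset sizes:
```python
# Sketch of temperature-based sampling over unbalanced directions
# (common formulation: p_i proportional to (n_i / N) ** (1 / T)).
sizes = {"en-fr": 40_000_000, "en-cs": 10_000_000, "en-hi": 1_000_000}  # made-up sizes
T = 1.5
total = sum(sizes.values())
weights = {k: (n / total) ** (1 / T) for k, n in sizes.items()}
z = sum(weights.values())
probs = {k: w / z for k, w in weights.items()}
print(probs)  # higher T flattens the distribution toward low-resource pairs
```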
## Preprocessing data
Multilingual training requires a joint BPE vocab. Please follow [mBART's preprocessing steps](https://github.com/pytorch/fairseq/tree/master/examples/mbart#bpe-data) to reuse our pretrained sentence-piece model.
You can also train a joint BPE model on your own dataset and then follow the steps in [[link]](https://github.com/pytorch/fairseq/tree/master/examples/translation#multilingual-translation).
## Training
```bash
lang_pairs=<language pairs to be trained, e.g. "en-cs,cs-en">
path_2_data=<set to data path>
lang_list=<a file which contains a list of languages separated by new lines>
fairseq-train $path_2_data \
--encoder-normalize-before --decoder-normalize-before \
--arch transformer --layernorm-embedding \
--task translation_multi_simple_epoch \
--sampling-method "temperature" \
--sampling-temperature 1.5 \
--encoder-langtok "src" \
--decoder-langtok \
--lang-dict "$lang_list" \
--lang-pairs "$lang_pairs" \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
--seed 222 --log-format simple --log-interval 2
```
## Finetuning
We can also finetune multilingual models from a pretrained model, e.g. [mBART](https://github.com/pytorch/fairseq/tree/master/examples/mbart).
```bash
lang_pairs=<language pairs to be trained, e.g. "en-cs,cs-en">
path_2_data=<set to data path>
lang_list=<a file which contains a list of languages separated by new lines>
pretrained_model=<path to the pretrained model, e.g. mbart or another trained multilingual model>
fairseq-train $path_2_data \
--finetune-from-model $pretrained_model \
--encoder-normalize-before --decoder-normalize-before \
--arch transformer --layernorm-embedding \
--task translation_multi_simple_epoch \
--sampling-method "temperature" \
--sampling-temperature 1.5 \
--encoder-langtok "src" \
--decoder-langtok \
--lang-dict "$lang_list" \
--lang-pairs "$lang_pairs" \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
--seed 222 --log-format simple --log-interval 2
```
## Generate
The following command uses the multilingual task (`translation_multi_simple_epoch`) to generate translations from $source_lang to $target_lang on the test dataset. During generation, the source language token is added to the source sentences and the target language token is used as the start token when decoding the target sentences. The options --lang-dict and --lang-pairs tell the generation process the ordered list of languages and the translation directions that the trained model is aware of; they need to be consistent with training.
```bash
model=<multilingual model>
source_lang=<source language>
target_lang=<target language>
fairseq-generate $path_2_data \
--path $model \
--task translation_multi_simple_epoch \
--gen-subset test \
--source-lang $source_lang \
--target-lang $target_lang \
--sacrebleu --remove-bpe 'sentencepiece' \
--batch-size 32 \
--encoder-langtok "src" \
--decoder-langtok \
--lang-dict "$lang_list" \
--lang-pairs "$lang_pairs" > ${source_lang}_${target_lang}.txt
```
Fairseq will write the translations to the file ${source_lang}_${target_lang}.txt, with the sacrebleu score at the end.
You can also use a customized tokenizer to compare performance with the literature. For example, you can get a tokenizer [here](https://github.com/rsennrich/wmt16-scripts) and do the following:
```bash
TOKENIZER=<path to a customized tokenizer for decoding evaluation>
TOK_CMD=<"$TOKENIZER $target_lang" or cat for sacrebleu>
cat ${source_lang}_${target_lang}.txt | grep -P "^H" |sort -V |cut -f 3- |$TOK_CMD > ${source_lang}_${target_lang}.hyp
cat ${source_lang}_${target_lang}.txt | grep -P "^T" |sort -V |cut -f 2- |$TOK_CMD > ${source_lang}_${target_lang}.ref
sacrebleu -tok 'none' -s 'none' ${source_lang}_${target_lang}.ref < ${source_lang}_${target_lang}.hyp
```
## Citation
```bibtex
@article{tang2020multilingual,
title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
year={2020},
eprint={2008.00401},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
#!/bin/bash
path_2_data=$1 # <path to data> which contains binarized data for each direction
lang_list=$2 # <path to a file which contains a list of languages separated by new lines>
lang_pairs=$3 # a comma-separated list of language pairs to train multilingual models on, e.g. "en-fr,en-cs,fr-en,cs-en"
# the pretrained model can be an mBART pretrained model as well
pretrained_model=$4 # <path to a pretrained model>
fairseq-train "$path_2_data" \
--encoder-normalize-before --decoder-normalize-before \
--arch transformer --layernorm-embedding \
--task translation_multi_simple_epoch \
--finetune-from-model "$pretrained_model" \
--sampling-method "temperature" \
--sampling-temperature "1.5" \
--encoder-langtok "src" \
--decoder-langtok \
--lang-dict "$lang_list" \
--lang-pairs "$lang_pairs" \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 3e-05 --min-lr -1 --warmup-updates 2500 --max-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
--seed 222 --log-format simple --log-interval 2
#!/bin/bash
lang_pairs="en-fr,en-cs,fr-en,cs-en"
path_2_data=$1 # <path to data>
lang_list=$2 # <path to a file which contains a list of languages separated by new lines>
model=$3 # <path to a trained model>
source_lang=cs
target_lang=en
fairseq-generate "$path_2_data" \
--path "$model" \
--task translation_multi_simple_epoch \
--gen-subset test \
--source-lang "$source_lang" \
--target-lang "$target_lang" \
--sacrebleu --remove-bpe 'sentencepiece' \
--batch-size 32 \
--encoder-langtok "src" \
--decoder-langtok \
--lang-dict "$lang_list" \
--lang-pairs "$lang_pairs"