Commit 7df61696 authored by Sugon_ldc's avatar Sugon_ldc
Browse files

add fairseq0.10.2

parents
Pipeline #471 failed with stages
in 0 seconds
# Neural Machine Translation with Byte-Level Subwords
https://arxiv.org/abs/1909.03341
We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as
example.
## Data
Get data and generate fairseq binary dataset:
```bash
bash ./get_data.sh
```
## Model Training
Train Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):
```bash
# VOCAB=bytes
# VOCAB=chars
VOCAB=bbpe2048
# VOCAB=bpe2048
# VOCAB=bbpe4096
# VOCAB=bpe4096
# VOCAB=bpe16384
```
```bash
fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
--arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
--batch-size 100 --max-update 100000 --update-freq 2
```
## Generation
`fairseq-generate` requires bytes (BBPE) decoder to convert byte-level representation back to characters:
```bash
# BPE=--bpe bytes
# BPE=--bpe characters
BPE=--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model
# BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model
# BPE=--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model
# BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model
# BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model
```
```bash
fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
--source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
--tokenizer moses --moses-target-lang en ${BPE}
```
When using `fairseq-interactive`, bytes (BBPE) encoder/decoder is required to tokenize input data and detokenize model predictions:
```bash
fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
--path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
--moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
```
## Results
| Vocabulary | Model | BLEU |
|:-------------:|:-------------:|:-------------:|
| Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |
## Citation
```
@misc{wang2019neural,
title={Neural Machine Translation with Byte-Level Subwords},
author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
year={2019},
eprint={1909.03341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contact
Changhan Wang ([changhan@fb.com](mailto:changhan@fb.com)),
Kyunghyun Cho ([kyunghyuncho@fb.com](mailto:kyunghyuncho@fb.com)),
Jiatao Gu ([jgu@fb.com](mailto:jgu@fb.com))
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import os
import os.path as op
from collections import namedtuple
from multiprocessing import cpu_count
from typing import List, Optional
import sentencepiece as sp
from fairseq.data.encoders.byte_bpe import ByteBPE
from fairseq.data.encoders.byte_utils import byte_encode
from fairseq.data.encoders.bytes import Bytes
from fairseq.data.encoders.characters import Characters
from fairseq.data.encoders.moses_tokenizer import MosesTokenizer
from fairseq.data.encoders.sentencepiece_bpe import SentencepieceBPE
SPLITS = ["train", "valid", "test"]
def _convert_xml(in_path: str, out_path: str):
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
ss = s.strip()
if not ss.startswith("<seg"):
continue
ss = ss.replace("</seg>", "").split('">')
assert len(ss) == 2
f_o.write(ss[1].strip() + "\n")
def _convert_train(in_path: str, out_path: str):
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
ss = s.strip()
if ss.startswith("<"):
continue
f_o.write(ss.strip() + "\n")
def _get_bytes(in_path: str, out_path: str):
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
f_o.write(Bytes.encode(s.strip()) + "\n")
def _get_chars(in_path: str, out_path: str):
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
f_o.write(Characters.encode(s.strip()) + "\n")
def pretokenize(in_path: str, out_path: str, src: str, tgt: str):
Args = namedtuple(
"Args",
[
"moses_source_lang",
"moses_target_lang",
"moses_no_dash_splits",
"moses_no_escape",
],
)
args = Args(
moses_source_lang=src,
moses_target_lang=tgt,
moses_no_dash_splits=False,
moses_no_escape=False,
)
pretokenizer = MosesTokenizer(args)
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
f_o.write(pretokenizer.encode(s.strip()) + "\n")
def _convert_to_bchar(in_path_prefix: str, src: str, tgt: str, out_path: str):
with open(out_path, "w") as f_o:
for lang in [src, tgt]:
with open(f"{in_path_prefix}.{lang}") as f:
for s in f:
f_o.write(byte_encode(s.strip()) + "\n")
def _get_bpe(in_path: str, model_prefix: str, vocab_size: int):
arguments = [
f"--input={in_path}",
f"--model_prefix={model_prefix}",
f"--model_type=bpe",
f"--vocab_size={vocab_size}",
"--character_coverage=1.0",
"--normalization_rule_name=identity",
f"--num_threads={cpu_count()}",
]
sp.SentencePieceTrainer.Train(" ".join(arguments))
def _apply_bbpe(model_path: str, in_path: str, out_path: str):
Args = namedtuple("Args", ["sentencepiece_model_path"])
args = Args(sentencepiece_model_path=model_path)
tokenizer = ByteBPE(args)
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
f_o.write(tokenizer.encode(s.strip()) + "\n")
def _apply_bpe(model_path: str, in_path: str, out_path: str):
Args = namedtuple("Args", ["sentencepiece_model"])
args = Args(sentencepiece_model=model_path)
tokenizer = SentencepieceBPE(args)
with open(in_path) as f, open(out_path, "w") as f_o:
for s in f:
f_o.write(tokenizer.encode(s.strip()) + "\n")
def _concat_files(in_paths: List[str], out_path: str):
with open(out_path, "w") as f_o:
for p in in_paths:
with open(p) as f:
for r in f:
f_o.write(r)
def preprocess_iwslt17(
root: str,
src: str,
tgt: str,
bpe_size: Optional[int],
need_chars: bool,
bbpe_size: Optional[int],
need_bytes: bool,
):
# extract bitext
in_root = op.join(root, f"{src}-{tgt}")
for lang in [src, tgt]:
_convert_train(
op.join(in_root, f"train.tags.{src}-{tgt}.{lang}"),
op.join(root, f"train.{lang}"),
)
_convert_xml(
op.join(in_root, f"IWSLT17.TED.dev2010.{src}-{tgt}.{lang}.xml"),
op.join(root, f"valid.{lang}"),
)
_convert_xml(
op.join(in_root, f"IWSLT17.TED.tst2015.{src}-{tgt}.{lang}.xml"),
op.join(root, f"test.{lang}"),
)
# pre-tokenize
for lang in [src, tgt]:
for split in SPLITS:
pretokenize(
op.join(root, f"{split}.{lang}"),
op.join(root, f"{split}.moses.{lang}"),
src,
tgt,
)
# tokenize with BPE vocabulary
if bpe_size is not None:
# learn vocabulary
concated_train_path = op.join(root, "train.all")
_concat_files(
[op.join(root, "train.moses.fr"), op.join(root, "train.moses.en")],
concated_train_path,
)
bpe_model_prefix = op.join(root, f"spm_bpe{bpe_size}")
_get_bpe(concated_train_path, bpe_model_prefix, bpe_size)
os.remove(concated_train_path)
# apply
for lang in [src, tgt]:
for split in SPLITS:
_apply_bpe(
bpe_model_prefix + ".model",
op.join(root, f"{split}.moses.{lang}"),
op.join(root, f"{split}.moses.bpe{bpe_size}.{lang}"),
)
# tokenize with bytes vocabulary
if need_bytes:
for lang in [src, tgt]:
for split in SPLITS:
_get_bytes(
op.join(root, f"{split}.moses.{lang}"),
op.join(root, f"{split}.moses.bytes.{lang}"),
)
# tokenize with characters vocabulary
if need_chars:
for lang in [src, tgt]:
for split in SPLITS:
_get_chars(
op.join(root, f"{split}.moses.{lang}"),
op.join(root, f"{split}.moses.chars.{lang}"),
)
# tokenize with byte-level BPE vocabulary
if bbpe_size is not None:
# learn vocabulary
bchar_path = op.join(root, "train.bchar")
_convert_to_bchar(op.join(root, "train.moses"), src, tgt, bchar_path)
bbpe_model_prefix = op.join(root, f"spm_bbpe{bbpe_size}")
_get_bpe(bchar_path, bbpe_model_prefix, bbpe_size)
os.remove(bchar_path)
# apply
for lang in [src, tgt]:
for split in SPLITS:
_apply_bbpe(
bbpe_model_prefix + ".model",
op.join(root, f"{split}.moses.{lang}"),
op.join(root, f"{split}.moses.bbpe{bbpe_size}.{lang}"),
)
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--root", type=str, default="data")
parser.add_argument(
"--bpe-vocab",
default=None,
type=int,
help="Generate tokenized bitext with BPE of size K."
"Default to None (disabled).",
)
parser.add_argument(
"--bbpe-vocab",
default=None,
type=int,
help="Generate tokenized bitext with BBPE of size K."
"Default to None (disabled).",
)
parser.add_argument(
"--byte-vocab",
action="store_true",
help="Generate tokenized bitext with bytes vocabulary",
)
parser.add_argument(
"--char-vocab",
action="store_true",
help="Generate tokenized bitext with chars vocabulary",
)
args = parser.parse_args()
preprocess_iwslt17(
args.root,
"fr",
"en",
args.bpe_vocab,
args.char_vocab,
args.bbpe_vocab,
args.byte_vocab,
)
if __name__ == "__main__":
main()
#!/bin/bash
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
PY_BIN_ROOT=
# PyPI dependency
${PY_BIN_ROOT}pip install sentencepiece sacremoses
# Get data
if [ ! -d "data" ]; then
mkdir data
fi
if [ ! -f "data/fr-en.tgz" ]; then
wget https://wit3.fbk.eu/archive/2017-01-trnted/texts/fr/en/fr-en.tgz -P data
tar xvf data/fr-en.tgz -C data
fi
${PY_BIN_ROOT}python get_bitext.py --bpe-vocab 16384 --byte-vocab --char-vocab
for VOCAB_SIZE in 2048 4096; do
${PY_BIN_ROOT}python get_bitext.py --bpe-vocab ${VOCAB_SIZE} --bbpe-vocab ${VOCAB_SIZE}
done
rm -r data/fr-en data/fr-en.tgz
# Generate binary dataset
${PY_BIN_ROOT}/fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_bpe16384 --joined-dictionary \
--workers "$(nproc)" --trainpref data/train.moses.bpe16384 --validpref data/valid.moses.bpe16384 \
--testpref data/test.moses.bpe16384
${PY_BIN_ROOT}/fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_bytes --joined-dictionary \
--workers "$(nproc)" --trainpref data/train.moses.bytes --validpref data/valid.moses.bytes \
--testpref data/test.moses.bytes
${PY_BIN_ROOT}/fairseq-preprocess --source-lang fr --target-lang en --destdir data/bin_chars --joined-dictionary \
--workers "$(nproc)" --trainpref data/train.moses.chars --validpref data/valid.moses.chars \
--testpref data/test.moses.chars
for VOCAB_SIZE in 2048 4096; do
for TYPE in bbpe bpe; do
${PY_BIN_ROOT}/fairseq-preprocess --source-lang fr --target-lang en --destdir "data/bin_${TYPE}${VOCAB_SIZE}" \
--joined-dictionary --workers "$(nproc)" --trainpref "data/train.moses.${TYPE}${VOCAB_SIZE}" \
--validpref "data/valid.moses.${TYPE}${VOCAB_SIZE}" --testpref "data/test.moses.${TYPE}${VOCAB_SIZE}"
done
done
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch.nn as nn
import torch.nn.functional as F
from fairseq.models import register_model, register_model_architecture
from fairseq.models.transformer import TransformerEncoder, TransformerModel
@register_model("gru_transformer")
class GRUTransformerModel(TransformerModel):
@classmethod
def build_encoder(cls, args, src_dict, embed_tokens):
return GRUTransformerEncoder(args, src_dict, embed_tokens)
class GRUTransformerEncoder(TransformerEncoder):
def __init__(self, args, dictionary, embed_tokens):
super().__init__(args, dictionary, embed_tokens)
self.emb_ctx = nn.GRU(
input_size=embed_tokens.embedding_dim,
hidden_size=embed_tokens.embedding_dim // 2,
num_layers=1,
bidirectional=True,
)
def forward_embedding(self, src_tokens):
# embed tokens and positions
x = embed = self.embed_scale * self.embed_tokens(src_tokens)
if self.embed_positions is not None:
x = embed + self.embed_positions(src_tokens)
# contextualize embeddings
x = x.transpose(0, 1)
x = self.dropout_module(x)
x, _ = self.emb_ctx.forward(x)
x = x.transpose(0, 1)
if self.layernorm_embedding is not None:
x = self.layernorm_embedding(x)
x = self.dropout_module(x)
return x, embed
@register_model_architecture("gru_transformer", "gru_transformer")
def gru_transformer_base_architecture(args):
args.encoder_embed_path = getattr(args, "encoder_embed_path", None)
args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 512)
args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 2048)
args.encoder_layers = getattr(args, "encoder_layers", 6)
args.encoder_attention_heads = getattr(args, "encoder_attention_heads", 8)
args.encoder_normalize_before = getattr(args, "encoder_normalize_before", False)
args.encoder_learned_pos = getattr(args, "encoder_learned_pos", False)
args.decoder_embed_path = getattr(args, "decoder_embed_path", None)
args.decoder_embed_dim = getattr(args, "decoder_embed_dim", args.encoder_embed_dim)
args.decoder_ffn_embed_dim = getattr(
args, "decoder_ffn_embed_dim", args.encoder_ffn_embed_dim
)
args.decoder_layers = getattr(args, "decoder_layers", 6)
args.decoder_attention_heads = getattr(args, "decoder_attention_heads", 8)
args.decoder_normalize_before = getattr(args, "decoder_normalize_before", False)
args.decoder_learned_pos = getattr(args, "decoder_learned_pos", False)
args.attention_dropout = getattr(args, "attention_dropout", 0.0)
args.activation_dropout = getattr(args, "activation_dropout", 0.0)
args.activation_fn = getattr(args, "activation_fn", "relu")
args.dropout = getattr(args, "dropout", 0.1)
args.adaptive_softmax_cutoff = getattr(args, "adaptive_softmax_cutoff", None)
args.adaptive_softmax_dropout = getattr(args, "adaptive_softmax_dropout", 0)
args.share_decoder_input_output_embed = getattr(
args, "share_decoder_input_output_embed", False
)
args.share_all_embeddings = getattr(args, "share_all_embeddings", False)
args.no_token_positional_embeddings = getattr(
args, "no_token_positional_embeddings", False
)
args.adaptive_input = getattr(args, "adaptive_input", False)
args.no_cross_attention = getattr(args, "no_cross_attention", False)
args.cross_self_attention = getattr(args, "cross_self_attention", False)
args.layer_wise_attention = getattr(args, "layer_wise_attention", False)
args.decoder_output_dim = getattr(
args, "decoder_output_dim", args.decoder_embed_dim
)
args.decoder_input_dim = getattr(args, "decoder_input_dim", args.decoder_embed_dim)
args.no_scale_embedding = getattr(args, "no_scale_embedding", False)
args.layernorm_embedding = getattr(args, "layernorm_embedding", False)
@register_model_architecture("gru_transformer", "gru_transformer_big")
def gru_transformer_big(args):
args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 1024)
args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 4096)
args.encoder_attention_heads = getattr(args, "encoder_attention_heads", 16)
args.encoder_normalize_before = getattr(args, "encoder_normalize_before", False)
args.decoder_embed_dim = getattr(args, "decoder_embed_dim", 1024)
args.decoder_ffn_embed_dim = getattr(args, "decoder_ffn_embed_dim", 4096)
args.decoder_attention_heads = getattr(args, "decoder_attention_heads", 16)
args.dropout = getattr(args, "dropout", 0.3)
gru_transformer_base_architecture(args)
# CamemBERT: a Tasty French Language Model
## Introduction
[CamemBERT](https://arxiv.org/abs/1911.03894) is a pretrained language model trained on 138GB of French text based on RoBERTa.
Also available in [github.com/huggingface/transformers](https://github.com/huggingface/transformers/).
## Pre-trained models
| Model | #params | Download | Arch. | Training data |
|--------------------------------|---------|--------------------------------------------------------------------------------------------------------------------------|-------|-----------------------------------|
| `camembert` / `camembert-base` | 110M | [camembert-base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base.tar.gz) | Base | OSCAR (138 GB of text) |
| `camembert-large` | 335M | [camembert-large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-large.tar.gz) | Large | CCNet (135 GB of text) |
| `camembert-base-ccnet` | 110M | [camembert-base-ccnet.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-ccnet.tar.gz) | Base | CCNet (135 GB of text) |
| `camembert-base-wikipedia-4gb` | 110M | [camembert-base-wikipedia-4gb.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-wikipedia-4gb.tar.gz) | Base | Wikipedia (4 GB of text) |
| `camembert-base-oscar-4gb` | 110M | [camembert-base-oscar-4gb.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-oscar-4gb.tar.gz) | Base | Subsample of OSCAR (4 GB of text) |
| `camembert-base-ccnet-4gb` | 110M | [camembert-base-ccnet-4gb.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/camembert-base-ccnet-4gb.tar.gz) | Base | Subsample of CCNet (4 GB of text) |
## Example usage
### fairseq
##### Load CamemBERT from torch.hub (PyTorch >= 1.1):
```python
import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert')
camembert.eval() # disable dropout (or leave in train mode to finetune)
```
##### Load CamemBERT (for PyTorch 1.0 or custom models):
```python
# Download camembert model
wget https://dl.fbaipublicfiles.com/fairseq/models/camembert-base.tar.gz
tar -xzvf camembert.tar.gz
# Load the model in fairseq
from fairseq.models.roberta import CamembertModel
camembert = CamembertModel.from_pretrained('/path/to/camembert')
camembert.eval() # disable dropout (or leave in train mode to finetune)
```
##### Filling masks:
```python
masked_line = 'Le camembert est <mask> :)'
camembert.fill_mask(masked_line, topk=3)
# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
# ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
# ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
```
##### Extract features from Camembert:
```python
# Extract the last layer's features
line = "J'aime le camembert !"
tokens = camembert.encode(line)
last_layer_features = camembert.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 10, 768])
# Extract all layer's features (layer 0 is the embedding layer)
all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)
```
## Citation
If you use our work, please cite:
```bibtex
@inproceedings{martin2020camembert,
title={CamemBERT: a Tasty French Language Model},
author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}
```
# (Vectorized) Lexically constrained decoding with dynamic beam allocation
This page provides instructions for how to use lexically constrained decoding in Fairseq.
Fairseq implements the code described in the following papers:
* [Fast Lexically Constrained Decoding With Dynamic Beam Allocation](https://www.aclweb.org/anthology/N18-1119/) (Post & Vilar, 2018)
* [Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting](https://www.aclweb.org/anthology/N19-1090/) (Hu et al., 2019)
## Quick start
Constrained search is enabled by adding the command-line argument `--constraints` to `fairseq-interactive`.
Constraints are appended to each line of input, separated by tabs. Each constraint (one or more tokens)
is a separate field.
The following command, using [Fairseq's WMT19 German--English model](https://github.com/pytorch/fairseq/blob/master/examples/wmt19/README.md),
translates the sentence *Die maschinelle Übersetzung ist schwer zu kontrollieren.* with the constraints
"hard" and "to influence".
echo -e "Die maschinelle Übersetzung ist schwer zu kontrollieren.\thard\ttoinfluence" \
| normalize.py | tok.py \
| fairseq-interactive /path/to/model \
--path /path/to/model/model1.pt \
--bpe fastbpe \
--bpe-codes /path/to/model/bpecodes \
--constraints \
-s de -t en \
--beam 10
(tok.py and normalize.py can be found in the same directory as this README; they are just shortcuts around Fairseq's WMT19 preprocessing).
This will generate the following output:
[snip]
S-0 Die masch@@ in@@ elle Über@@ setzung ist schwer zu kontrollieren .
W-0 1.844 seconds
C-0 hard
C-0 influence
H-0 -1.5333266258239746 Mach@@ ine trans@@ lation is hard to influence .
D-0 -1.5333266258239746 Machine translation is hard to influence .
P-0 -0.5434 -0.1423 -0.1930 -0.1415 -0.2346 -1.8031 -0.1701 -11.7727 -0.1815 -0.1511
By default, constraints are generated in the order supplied, with any number (zero or more) of tokens generated
between constraints. If you wish for the decoder to order the constraints, then use `--constraints unordered`.
Note that you may want to use a larger beam.
## Implementation details
The heart of the implementation is in `fairseq/search.py`, which adds a `LexicallyConstrainedBeamSearch` instance.
This instance of beam search tracks the progress of each hypothesis in the beam through the set of constraints
provided for each input sentence. It does this using one of two classes, both found in `fairseq/token_generation_contstraints.py`:
* OrderedConstraintState: assumes the `C` input constraints will be generated in the provided order
* UnorderedConstraintState: tries to apply `C` (phrasal) constraints in all `C!` orders
## Differences from Sockeye
There are a number of [differences from Sockeye's implementation](https://awslabs.github.io/sockeye/inference.html#lexical-constraints).
* Generating constraints in the order supplied (the default option here) is not available in Sockeye.
* Due to an improved beam allocation method, there is no need to prune the beam.
* Again due to better allocation, beam sizes as low as 10 or even 5 are often sufficient.
* [The vector extensions described in Hu et al.](https://github.com/edwardjhu/sockeye/tree/trie_constraints) (NAACL 2019) were never merged
into the main Sockeye branch.
## Citation
The paper first describing lexical constraints for seq2seq decoding is:
```bibtex
@inproceedings{hokamp-liu-2017-lexically,
title = "Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search",
author = "Hokamp, Chris and
Liu, Qun",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P17-1141",
doi = "10.18653/v1/P17-1141",
pages = "1535--1546",
}
```
The fairseq implementation uses the extensions described in
```bibtex
@inproceedings{post-vilar-2018-fast,
title = "Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation",
author = "Post, Matt and
Vilar, David",
booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
month = jun,
year = "2018",
address = "New Orleans, Louisiana",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/N18-1119",
doi = "10.18653/v1/N18-1119",
pages = "1314--1324",
}
```
and
```bibtex
@inproceedings{hu-etal-2019-improved,
title = "Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting",
author = "Hu, J. Edward and
Khayrallah, Huda and
Culkin, Ryan and
Xia, Patrick and
Chen, Tongfei and
Post, Matt and
Van Durme, Benjamin",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/N19-1090",
doi = "10.18653/v1/N19-1090",
pages = "839--850",
}
```
#!/usr/bin/env python3
#
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import sys
from sacremoses.normalize import MosesPunctNormalizer
def main(args):
normalizer = MosesPunctNormalizer(lang=args.lang, penn=args.penn)
for line in sys.stdin:
print(normalizer.normalize(line.rstrip()), flush=True)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--lang", "-l", default="en")
parser.add_argument("--penn", "-p", action="store_true")
args = parser.parse_args()
main(args)
#!/usr/bin/env python3
#
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import sys
import sacremoses
def main(args):
"""Tokenizes, preserving tabs"""
mt = sacremoses.MosesTokenizer(lang=args.lang)
def tok(s):
return mt.tokenize(s, return_str=True)
for line in sys.stdin:
parts = list(map(tok, line.split("\t")))
print(*parts, sep="\t", flush=True)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--lang", "-l", default="en")
parser.add_argument("--penn", "-p", action="store_true")
parser.add_argument("--fields", "-f", help="fields to tokenize")
args = parser.parse_args()
main(args)
# Convolutional Sequence to Sequence Learning (Gehring et al., 2017)
## Pre-trained models
Description | Dataset | Model | Test set(s)
---|---|---|---
Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.newstest2014.tar.bz2) <br> newstest2012/2013: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.v2.en-fr.ntst1213.tar.bz2)
Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-de.newstest2014.tar.bz2)
Convolutional <br> ([Gehring et al., 2017](https://arxiv.org/abs/1705.03122)) | [WMT17 English-German](http://statmt.org/wmt17/translation-task.html#Download) | [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/models/wmt17.v2.en-de.fconv-py.tar.bz2) | newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.v2.en-de.newstest2014.tar.bz2)
## Example usage
See the [translation README](../translation/README.md) for instructions on reproducing results for WMT'14 En-De and
WMT'14 En-Fr using the `fconv_wmt_en_de` and `fconv_wmt_en_fr` model architectures.
## Citation
```bibtex
@inproceedings{gehring2017convs2s,
title = {Convolutional Sequence to Sequence Learning},
author = {Gehring, Jonas, and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
booktitle = {Proc. of ICML},
year = 2017,
}
```
# Cross-lingual Retrieval for Iterative Self-Supervised Training
https://arxiv.org/pdf/2006.09526.pdf
## Introduction
CRISS is a multilingual sequence-to-sequnce pretraining method where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time.
## Unsupervised Machine Translation
##### 1. Download and decompress CRISS checkpoints
```
cd examples/criss
wget https://dl.fbaipublicfiles.com/fairseq/models/criss/criss_checkpoints.tar.gz
tar -xf criss_checkpoints.tar.gz
```
##### 2. Download and preprocess Flores test dataset
```
bash download_and_preprocess_flores_test.sh
```
##### 3. Run Evaluation on Sinhala-English
```
bash unsupervised_mt/eval.sh
```
## Sentence Retrieval
##### 1. Download and preprocess Tatoeba dataset
```
bash download_and_preprocess_tatoeba.sh
```
##### 2. Run Sentence Retrieval on Tatoeba Kazakh-English
```
bash sentence_retrieval/sentence_retrieval_tatoeba.sh
```
## Mining
##### 1. Mine pseudo-parallel
```
bash sentence_retrieval/sentence_retrieval_tatoeba.sh
```
## Citation
```bibtex
@article{tran2020cross,
title={Cross-lingual retrieval for iterative self-supervised training},
author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
journal={arXiv preprint arXiv:2006.09526},
year={2020}
}
```
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment