On top of the basic BPE implementation, this repository supports:
- BPE dropout (Provilkov, Emelianenko and Voita, 2019): https://arxiv.org/abs/1910.13267
Use the argument `--dropout 0.1` for `subword-nmt apply-bpe` to randomly drop out possible merges.
Applying dropout to the training corpus can improve the quality of the final system; at test time, use BPE without dropout.
To obtain reproducible results, set the random seed with the argument `--seed`.
**Note:** In the original paper, the authors applied BPE-Dropout to each new batch separately. To approximate this behavior, you can copy the training corpus several times before applying BPE dropout, so that the same sentence receives multiple segmentations.
- support for glossaries:
Use the argument `--glossaries` for `subword-nmt apply-bpe` to provide a list of words and/or regular expressions
that should always be passed to the output without subword segmentation.
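
To make the idea of dropping out merges concrete, here is a minimal, self-contained sketch. It is a toy illustration and not this repository's implementation: the function `segment`, its parameters, and the three-entry merge list are all hypothetical. Each applicable merge is skipped with probability `dropout`, so the same word can receive different segmentations across calls, while `dropout=0.0` gives the ordinary deterministic segmentation.

```python
import random


def segment(word, merges, dropout=0.0, rng=None):
    """Toy BPE segmentation with optional merge dropout (illustrative only)."""
    rng = rng or random.Random()
    # Earlier merges in the list have higher priority (lower rank).
    rank = {pair: i for i, pair in enumerate(merges)}
    # Start from characters, with the end-of-word marker attached to the last one.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        # Collect applicable merges; each one is dropped with probability `dropout`.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge list, assumed for this example only.
merges = [("n", "d</w>"), ("u", "nd</w>"), ("f", "und</w>")]
print(segment("fund", merges))               # ['fund</w>']
print(segment("fund", merges, dropout=1.0))  # ['f', 'u', 'n', 'd</w>']
# With 0 < dropout < 1 the result is random; a seeded generator makes it repeatable.
print(segment("fund", merges, dropout=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` plays the same role in this sketch as `--seed` does on the command line: repeated runs with the same seed yield the same segmentation.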
PUBLICATIONS
------------
The segmentation methods are described in:
Rico Sennrich, Barry Haddow and Alexandra Birch (2016):
Neural Machine Translation of Rare Words with Subword Units
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
HOW IMPLEMENTATION DIFFERS FROM Sennrich et al. (2016)
------------------------------------------------------
This repository implements the subword segmentation as described in Sennrich et al. (2016),
but since version 0.2, there is one core difference related to end-of-word tokens.
In Sennrich et al. (2016), the end-of-word token `</w>` is initially represented as a separate token, which can be merged with other subwords over time:
```
u n d </w>
f u n d </w>
```
Since 0.2, end-of-word tokens are initially concatenated with the word-final character:
```
u n d</w>
f u n d</w>
```
With the new representation, BPE codes learned from the above examples and then applied to new text are unambiguous: the subword unit `und` is always word-final, and `un` is always word-internal. This prevents a single BPE merge operation from producing up to two different subword units.
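
The two starting representations can be sketched as follows. The helper `initial_symbols` and its `version` parameter are hypothetical names used only for illustration; they are not taken from this repository.

```python
def initial_symbols(word, version=(0, 2)):
    """Split a word into its starting symbols, before any merges are applied."""
    if version == (0, 2):
        # Since 0.2: the end-of-word marker is attached to the final character.
        return list(word[:-1]) + [word[-1] + "</w>"]
    # Sennrich et al. (2016): the end-of-word marker is a separate symbol.
    return list(word) + ["</w>"]


print(initial_symbols("und", version=(0, 1)))  # ['u', 'n', 'd', '</w>']
print(initial_symbols("und"))                  # ['u', 'n', 'd</w>']
```

In the second form a merge of `n` and `d</w>` can only ever apply at the end of a word, which is what makes the learned units unambiguous.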
`apply_bpe.py` is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: `#version: 0.2`
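
One way such a check could look is sketched below. This is an assumption-laden illustration rather than the code in `apply_bpe.py`: the function name `read_codes_version` is hypothetical, and labelling old-style files as `(0, 1)` is a placeholder chosen for this example.

```python
import io


def read_codes_version(codes_file):
    """Return the version tuple of a BPE codes file (hypothetical helper)."""
    first_line = codes_file.readline()
    if first_line.startswith("#version:"):
        # New-style file, e.g. "#version: 0.2" -> (0, 2)
        return tuple(int(x) for x in first_line.split()[-1].split("."))
    # Old-style file: the first line is already a merge, so rewind to keep it.
    codes_file.seek(0)
    return (0, 1)


new_style = io.StringIO("#version: 0.2\nn d</w>\n")
old_style = io.StringIO("n d\n")
print(read_codes_version(new_style))  # (0, 2)
print(read_codes_version(old_style))  # (0, 1)
```

After the call, the file position is at the first merge in both cases, so the remaining lines can be read as merge operations.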
ACKNOWLEDGMENTS
---------------
This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).
"this script's location has moved to {0}. This symbolic link will be removed in a future version. Please point to the new location, or install the package and use the command 'subword-nmt'".format(newdir),
"""Compute chrF3 for machine translation evaluation
Reference:
Maja Popović (2015). chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal.
"this script's location has moved to {0}. This symbolic link will be removed in a future version. Please point to the new location, or install the package and use the command 'subword-nmt'".format(newdir),