Commit 13d9e2ba authored by Kevin, committed by Facebook GitHub Bot

Fix subword-nmt file locations (#1219)

Summary:
Solves https://github.com/pytorch/fairseq/issues/1218.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1219

Differential Revision: D18339541

Pulled By: myleott

fbshipit-source-id: 6d5bd7b60fa7fd30c038fdad54591343a01f228b
parent 37c9d96f
```diff
@@ -11,7 +11,7 @@ This model uses a `Byte Pair Encoding (BPE)
 vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
 the encoding to the source text before it can be translated. This can be
 done with the
-`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/apply_bpe.py>`__
+`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
 script using the ``wmt14.en-fr.fconv-cuda/bpecodes`` file. ``@@`` is
 used as a continuation marker and the original text can be easily
 recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
```
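The ``--remove-bpe`` post-processing mentioned in the docs above amounts to the same substitution as ``sed s/@@ //g``. A minimal sketch in Python (the function name ``remove_bpe`` and the trailing-space trick are illustrative, not fairseq's exact code):

```python
def remove_bpe(line: str, marker: str = "@@ ") -> str:
    """Undo BPE segmentation, equivalent to `sed s/@@ //g`.

    A trailing space is appended first so a continuation marker at the
    very end of the line is also removed, then stripped off again.
    """
    return (line + " ").replace(marker, "").rstrip()

print(remove_bpe("new@@ er transl@@ ations"))  # -> newer translations
```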
```diff
@@ -12,7 +12,7 @@ SCRIPTS=mosesdecoder/scripts
 TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
 LC=$SCRIPTS/tokenizer/lowercase.perl
 CLEAN=$SCRIPTS/training/clean-corpus-n.perl
-BPEROOT=subword-nmt
+BPEROOT=subword-nmt/subword_nmt
 BPE_TOKENS=10000
 URL="https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz"
```
```diff
@@ -12,7 +12,7 @@ TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
 CLEAN=$SCRIPTS/training/clean-corpus-n.perl
 NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
 REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
-BPEROOT=subword-nmt
+BPEROOT=subword-nmt/subword_nmt
 BPE_TOKENS=40000
 URLS=(
```
```diff
@@ -12,7 +12,7 @@ TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
 CLEAN=$SCRIPTS/training/clean-corpus-n.perl
 NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
 REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
-BPEROOT=subword-nmt
+BPEROOT=subword-nmt/subword_nmt
 BPE_TOKENS=40000
 URLS=(
```
````diff
@@ -52,7 +52,6 @@ wget dl.fbaipublicfiles.com/fairseq/data/wmt14-en-de.extra_refs.tok
 Next apply BPE on the fly and run generation for each expert:
 ```bash
-BPEROOT=examples/translation/subword-nmt/
 BPE_CODE=examples/translation/wmt17_en_de/code
 for EXPERT in $(seq 0 2); do \
     cat wmt14-en-de.extra_refs.tok \
````
```diff
@@ -24,7 +24,7 @@ class SubwordNMTBPE(object):
             raise ValueError('--bpe-codes is required for --bpe=subword_nmt')
         codes = file_utils.cached_path(args.bpe_codes)
         try:
-            from subword_nmt import apply_bpe
+            from subword_nmt.subword_nmt import apply_bpe
             bpe_parser = apply_bpe.create_parser()
             bpe_args = bpe_parser.parse_args([
                 '--codes', codes,
```
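The import fix above hard-codes the new module path. A more defensive pattern for this kind of upstream reorganization, sketched here under the assumption that both layouts may be in the wild (``import_first`` is a hypothetical helper, not part of fairseq), is to try the new path first and fall back to the old one:

```python
import importlib


def import_first(*module_paths):
    """Return the first module that can be imported from the given dotted paths."""
    for path in module_paths:
        try:
            return importlib.import_module(path)
        except ImportError:
            continue
    raise ImportError(f"none of {module_paths} could be imported")


# For subword-nmt, this would look like (not executed here, since the
# package may not be installed):
#     apply_bpe = import_first("subword_nmt.subword_nmt.apply_bpe",
#                              "subword_nmt.apply_bpe")
```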