Sample data processing scripts for the FAIR Sequence-to-Sequence Toolkit
These scripts show how to pre-process data for a neural machine translation (NMT) task.
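A property worth checking after any of the pre-processing scripts described here is that the source and target files remain line-aligned, since line i of one file is the translation of line i of the other. The following is a minimal sketch of such a check, not part of the toolkit: `check_parallel` is a hypothetical helper, and the file names in the commented usage line are assumptions about what a script might write.

```shell
#!/usr/bin/env bash
# check_parallel SRC TGT
# Hypothetical helper (not part of the toolkit): succeeds only if both files
# of a parallel corpus have the same number of lines.
check_parallel() {
  local src="$1" tgt="$2"
  local n_src n_tgt
  # tr strips the padding that some wc implementations add around the count
  n_src=$(wc -l < "$src" | tr -d ' ')
  n_tgt=$(wc -l < "$tgt" | tr -d ' ')
  if [ "$n_src" -ne "$n_tgt" ]; then
    echo "mismatch: $src has $n_src lines, $tgt has $n_tgt lines" >&2
    return 1
  fi
  echo "ok: $n_src sentence pairs"
}

# Usage (file names are assumed for illustration):
# check_parallel "$TEXT/train.de" "$TEXT/train.en"
```

A mismatch usually means a cleaning or filtering step was applied to one side of the corpus only, so it is cheaper to catch it here than after binarization.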
# prepare-iwslt14.sh
Provides an example of pre-processing for the IWSLT'14 German-to-English translation task: ["Report on the 11th IWSLT evaluation campaign" by Cettolo et al.](http://workshop2014.iwslt.org/downloads/proceeding.pdf)
Example usage:
```
$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..
# Binarize the dataset:
$ TEXT=data/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
```
# prepare-wmt14en2de.sh
Provides an example of pre-processing for the WMT'14 English-to-German translation task. By default it produces a dataset modeled after ["Attention Is All You Need" by Vaswani et al.](https://arxiv.org/abs/1706.03762), which includes news-commentary-v12 data.
To use only the data available in WMT'14, or to replicate the results of the original paper ["Convolutional Sequence to Sequence Learning" by Gehring et al.](https://arxiv.org/abs/1705.03122), run it with --icml17 instead:
```
$ bash prepare-wmt14en2de.sh --icml17
```
Example usage:
```
$ cd data/
$ bash prepare-wmt14en2de.sh
$ cd ..
# Binarize the dataset:
$ TEXT=data/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \