Commit c778a31e authored by Alexei Baevski, committed by Myle Ott

create examples dir and add conv lm + stories readme

parent a919570b
......@@ -4,10 +4,11 @@ Fairseq(-py) is a sequence modeling toolkit that allows researchers and develope
- **Convolutional Neural Networks (CNN)**
- [Gehring et al. (2017): Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
- [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://arxiv.org/abs/1711.04956)
- [Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](https://arxiv.org/abs/1612.08083)
- **Long Short-Term Memory (LSTM) networks**
- [Luong et al. (2015): Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
- [Wiseman and Rush (2016): Sequence-to-Sequence Learning as Beam-Search Optimization](https://arxiv.org/abs/1606.02960)
Fairseq features multi-GPU (distributed) training on one machine or across multiple machines, fast beam search generation on both CPU and GPU, and pre-trained models for several benchmark translation datasets.
![Model](fairseq.gif)
......@@ -38,6 +39,7 @@ The following command-line tools are provided:
* `python generate.py`: Translate pre-processed data with a trained model
* `python interactive.py`: Translate raw text with a trained model
* `python score.py`: BLEU scoring of generated translations against reference translations
* `python eval_lm.py`: Language model evaluation
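
`python eval_lm.py` scores a pre-processed test set with a trained language model and reports its loss and perplexity. As a quick reference, perplexity is the exponentiated average per-token negative log-likelihood; a minimal illustration (not fairseq code, and assuming a natural-log loss):

```python
import math

# Illustrative only: perplexity from an average per-token negative
# log-likelihood, assuming the loss is measured in nats (natural log).
def perplexity(avg_nll_nats):
    return math.exp(avg_nll_nats)

print(perplexity(4.2))  # ~66.7
```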
## Evaluating Pre-trained Models
First, download a pre-trained model along with its vocabularies:
......@@ -74,13 +76,16 @@ Check [below](#pre-trained-models) for a full list of pre-trained models availab
## Training a New Model
The following tutorial is for machine translation. For an example of how to use Fairseq for language modeling, please see the [language modeling example README](examples/language_model/README.md).
### Data Pre-processing
Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German).
To pre-process and binarize the IWSLT dataset:
```
$ cd data/
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ..
$ cd ../..
$ TEXT=examples/translation/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
......
Sample data processing scripts for the FAIR Sequence-to-Sequence Toolkit
These scripts provide an example of pre-processing data for the Language Modeling task.
# prepare-wikitext-103.sh
Provides an example of pre-processing for the [WikiText-103 language modeling task](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset).
Example usage:
```
$ cd examples/language_model/
$ bash prepare-wikitext-103.sh
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/language_model/wikitext-103
$ python preprocess.py --only-source \
--trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens --testpref $TEXT/wiki.test.tokens \
--destdir data-bin/wikitext-103
# Train the model:
# If training runs out of memory, try reducing --max-tokens and --max-target-positions
$ mkdir -p checkpoints/wikitext-103
$ python train.py data-bin/wikitext-103 --save-dir checkpoints/wikitext-103 \
--max-epoch 35 --arch fconv_lm --optimizer nag --lr 1.0 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
--decoder-layers '[(850, 6)] * 3 + [(850,1)] + [(850,5)] * 4 + [(850,1)] + [(850,4)] * 3 + [(1024,4)] + [(2048, 4)]' \
--decoder-embed-dim 280 --clip-norm 0.1 --dropout 0.2 --weight-decay 5e-06 --criterion adaptive_loss \
--adaptive-softmax-cutoff 10000,20000,200000 --max-tokens 1024 --max-target-positions 1024
# Evaluate:
$ python eval_lm.py data-bin/wikitext-103 --path 'checkpoints/wikitext-103/checkpoint_best.pt'
```
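
The `--decoder-layers` value above is an ordinary Python expression that expands into a list of pairs, one per convolutional block; each pair is read (roughly) as the block's number of units and its kernel width. A small sketch of the expansion, which is plain list arithmetic:

```python
# Evaluating the --decoder-layers string yields one (units, kernel_width)
# pair per convolutional block of the language model.
layers = [(850, 6)] * 3 + [(850, 1)] + [(850, 5)] * 4 + [(850, 1)] + \
         [(850, 4)] * 3 + [(1024, 4)] + [(2048, 4)]

print(len(layers))   # 14 blocks in total
print(layers[0])     # (850, 6)  -> first block: 850 units, kernel width 6
print(layers[-1])    # (2048, 4) -> final block: 2048 units, kernel width 4
```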
#!/bin/bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
URLS=(
"https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip"
)
FILES=(
"wikitext-103-v1.zip"
)
for ((i=0;i<${#URLS[@]};++i)); do
    file=${FILES[i]}
    if [ -f $file ]; then
        echo "$file already exists, skipping download"
    else
        url=${URLS[i]}
        wget "$url"
        if [ -f $file ]; then
            echo "$url successfully downloaded."
        else
            echo "$url not successfully downloaded."
            exit -1
        fi
        if [ ${file: -4} == ".tgz" ]; then
            tar zxvf $file
        elif [ ${file: -4} == ".tar" ]; then
            tar xvf $file
        elif [ ${file: -4} == ".zip" ]; then
            unzip $file
        fi
    fi
done
cd ..
FAIR Sequence-to-Sequence Toolkit for Story Generation
The following commands provide an example of pre-processing data, training a model, and generating text for story generation with the WritingPrompts dataset.
The dataset can be downloaded like this:
```
curl https://s3.amazonaws.com/fairseq-py/data/writingPrompts.tar.gz | tar xvzf -
```
It contains train, test, and valid splits. The dataset is described in https://arxiv.org/abs/1805.04833, where only the first 1,000 words of each story are modeled.
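
Since only the first 1,000 words of each story are modeled, you may want to truncate the target side of each split before binarizing. A minimal sketch in Python, assuming the extracted files follow the `<split>.wp_target` naming used by the preprocessing command below (adjust paths and names to what the tarball actually contains):

```python
# Hypothetical truncation helper: keep only the first 1,000 words per story.
# The <split>.wp_target file names are an assumption based on the
# preprocessing command below; adjust them to the actual dataset layout.
for split in ["train", "valid", "test"]:
    path = "examples/stories/writingPrompts/%s.wp_target" % split
    with open(path) as f:
        stories = f.readlines()
    stories = [" ".join(line.split()[:1000]) for line in stories]
    with open(path, "w") as f:
        for line in stories:
            f.write(line + "\n")
```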
Example usage:
```
# Binarize the dataset:
$ TEXT=examples/stories/writingPrompts
$ python preprocess.py --source-lang wp_source --target-lang wp_target \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/writingPrompts --thresholdtgt 10 --thresholdsrc 10
# Train the model:
$ python train.py data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 \
  --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True \
  --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 \
  --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True \
  --self-attention True --project-input True --pretrained False
# Train a fusion model:
# add the arguments: --pretrained True --pretrained-checkpoint path/to/checkpoint
# Generate:
$ python generate.py data-bin/writingPrompts --path /path/to/trained/model/checkpoint_best.pt \
  --batch-size 32 --beam 1 --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1
```
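
The generation command draws samples instead of returning the single best beam hypothesis: `--sampling-topk 10` restricts each step to the 10 highest-scoring next tokens, and `--sampling-temperature 0.8` rescales the scores before the softmax, with values below 1 sharpening the distribution. A standalone sketch of that sampling scheme (illustrative only, not fairseq's implementation):

```python
import numpy as np

# Conceptual top-k sampling with temperature; not fairseq's actual code.
def sample_top_k(logits, k=10, temperature=0.8, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature  # temperature < 1 sharpens
    top_k = np.argsort(scaled)[-k:]                         # k highest-scoring token ids
    probs = np.exp(scaled[top_k] - scaled[top_k].max())
    probs /= probs.sum()                                     # softmax over the top k only
    return rng.choice(top_k, p=probs)                        # sampled token id

next_token = sample_top_k(np.random.randn(10000))
```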
......@@ -8,9 +8,9 @@ Provides an example of pre-processing for IWSLT'14 German to English translation
Example usage:
```
$ cd data/
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ..
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/translation/iwslt14.tokenized.de-en
......@@ -47,9 +47,9 @@ $ bash prepare-wmt14en2de.sh --icml17
Example usage:
```
$ cd data/
$ cd examples/translation/
$ bash prepare-wmt14en2de.sh
$ cd ..
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_de
......@@ -79,9 +79,9 @@ Provides an example of pre-processing for the WMT'14 English to French translati
Example usage:
```
$ cd data/
$ cd examples/translation/
$ bash prepare-wmt14en2fr.sh
$ cd ..
$ cd ../..
# Binarize the dataset:
$ TEXT=examples/translation/wmt14_en_fr
......