Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks. It provides reference implementations of various sequence-to-sequence models, including:
- **Convolutional Neural Networks (CNN)**
  - [Gehring et al. (2017): Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)
  - [Edunov et al. (2018): Classical Structured Prediction Losses for Sequence to Sequence Learning](https://arxiv.org/abs/1711.04956)
  - [Dauphin et al. (2017): Language Modeling with Gated Convolutional Networks](https://arxiv.org/abs/1612.08083.pdf)
- **Long Short-Term Memory (LSTM) networks**
  - [Luong et al. (2015): Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
  - [Wiseman and Rush (2016): Sequence-to-Sequence Learning as Beam-Search Optimization](https://arxiv.org/abs/1606.02960)

Fairseq features multi-GPU (distributed) training on one machine or across multiple machines, fast beam search generation on both CPU and GPU, and includes pre-trained models for several benchmark translation datasets.


The following command-line tools are provided:
* `python generate.py`: Translate pre-processed data with a trained model
* `python interactive.py`: Translate raw text with a trained model
* `python score.py`: BLEU scoring of generated translations against reference translations
* `python eval_lm.py`: Language model evaluation
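As a rough, hedged illustration of how the generation and scoring tools fit together (the data directory, checkpoint path, and output file names below are assumptions for this sketch, not values taken from this README):

```
# Translate the binarized test set with a trained checkpoint (paths are illustrative).
$ python generate.py data-bin/iwslt14.tokenized.de-en \
  --path checkpoints/checkpoint_best.pt --batch-size 128 --beam 5 | tee /tmp/gen.out

# Pull the hypothesis (H) and reference (T) lines out of the generation log,
# then compute BLEU against the references.
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
```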
## Evaluating Pre-trained Models

First, download a pre-trained model along with its vocabularies:
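As a hedged sketch (the archive URL, file names, and flags below are assumptions rather than the exact commands from this README, modelled on the S3 bucket used elsewhere in this document), downloading a WMT'14 English-French convolutional model and translating with it interactively might look like:

```
# Download and unpack a pre-trained model plus its vocabularies (URL is an assumption).
$ curl https://s3.amazonaws.com/fairseq-py/models/wmt14.en-fr.fconv-py.tar.bz2 | tar xvjf -

# Translate raw text with the downloaded model; the positional argument points at the
# directory containing the model's dictionaries.
$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python interactive.py --path $MODEL_DIR/model.pt $MODEL_DIR --beam 5
```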
Check [below](#pre-trained-models) for a full list of pre-trained models available.
## Training a New Model
The following tutorial is for machine translation. For an example of how to use Fairseq for language modeling, please see the [language modeling example README](examples/language_model/README.md).
### Data Pre-processing

Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German).
To pre-process and binarize the IWSLT dataset:
```
# Download and tokenize the data.
$ cd examples/translation/
$ bash prepare-iwslt14.sh
$ cd ../..

# Binarize the tokenized data for training.
$ TEXT=examples/translation/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en
```
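Once the data is binarized, a model can be trained with `train.py`. A minimal sketch follows; the architecture name and hyper-parameters are illustrative assumptions, not values taken from this excerpt:

```
# Train a fully convolutional translation model on the binarized IWSLT data.
# (fconv_iwslt_de_en and the hyper-parameters below are illustrative choices.)
$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt14.tokenized.de-en \
  --arch fconv_iwslt_de_en --lr 0.25 --clip-norm 0.1 --dropout 0.2 \
  --max-tokens 4000 --save-dir checkpoints/fconv
```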
Sample data processing scripts for the FAIR Sequence-to-Sequence Toolkit
These scripts provide an example of pre-processing data for the Language Modeling task.
# prepare-wikitext-103.sh
Provides an example of pre-processing for the [WikiText-103 language modeling task](https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset):
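A hedged sketch of how the script might be combined with `preprocess.py` to binarize the data for language modeling (the paths and WikiText-103 file names below are assumptions; `--only-source` is the preprocess flag for monolingual corpora):

```
# Download and tokenize WikiText-103, then binarize it (paths are assumptions).
$ cd examples/language_model/
$ bash prepare-wikitext-103.sh
$ cd ../..
$ TEXT=examples/language_model/wikitext-103
$ python preprocess.py --only-source \
  --trainpref $TEXT/wiki.train.tokens --validpref $TEXT/wiki.valid.tokens \
  --testpref $TEXT/wiki.test.tokens --destdir data-bin/wikitext-103
```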
FAIR Sequence-to-Sequence Toolkit for Story Generation
The following commands provide an example of pre-processing data, training a model, and generating text for story generation with the WritingPrompts dataset.
The dataset can be downloaded like this:
```
curl https://s3.amazonaws.com/fairseq-py/data/writingPrompts.tar.gz | tar xvzf -
```
and contains a train, test, and valid split. The dataset is described here: https://arxiv.org/abs/1805.04833, where only the first 1000 words of each story are modeled.
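As a hedged sketch (the `wp_source`/`wp_target` file suffixes, paths, and vocabulary thresholds below are assumptions rather than commands taken from this README), binarizing the downloaded data for sequence-to-sequence training might look like:

```
# Binarize the WritingPrompts data; prompts are the source side, stories the target side.
# (File suffixes, paths, and thresholds are assumptions for this sketch.)
TEXT=examples/stories/writingPrompts
python preprocess.py --source-lang wp_source --target-lang wp_target \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/writingPrompts --thresholdtgt 10 --thresholdsrc 10
```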