Commit 52979660 authored by Allen Wang, committed by A. Unique TensorFlower

Update XLNet README with dataset processing and fine-tuning running commands.

PiperOrigin-RevId: 334860469
parent 4680f2fa
The academic paper which describes XLNet in detail and provides full results on
a number of tasks can be found here: https://arxiv.org/abs/1906.08237.
XLNet is a generalized autoregressive BERT-like pretraining language model that
enables learning bidirectional contexts by maximizing the expected likelihood
over all permutations of the factorization order. It can learn dependency beyond
a fixed length without disrupting temporal coherence by using the segment-level
recurrence mechanism and relative positional encoding scheme introduced in
Transformer-XL. XLNet outperforms BERT on 20 NLP benchmark tasks and achieves
state-of-the-art results on 18 tasks, including question answering, natural
language inference, sentiment analysis, and document ranking.
## Contents
* [Contents](#contents)
* [Set Up](#set-up)
* [Process Datasets](#process-datasets)
* [Fine-tuning with XLNet](#fine-tuning-with-xlnet)
## Set Up
To run XLNet on a Cloud TPU, you can first create a `tf-nightly` TPU with the
[ctpu tool](https://github.com/tensorflow/tpu/tree/master/tools/ctpu):
```shell
ctpu up --name=<instance name> --tf-version="nightly"
```
After SSH'ing into the VM (or if you're using an on-prem machine), setup
continues as follows:
```shell
export PYTHONPATH="$PYTHONPATH:/path/to/models"
```
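If the `models` repository is not already present on the machine, it can be
cloned first (the destination path below is illustrative):

```shell
# Clone the TensorFlow models repository; adjust the destination path to taste.
git clone https://github.com/tensorflow/models.git /path/to/models
```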
Install `tf-nightly-gpu` (or plain `tf-nightly` on machines without a GPU) to
get the latest updates:
```shell
pip install tf-nightly-gpu
```
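To confirm that the nightly build is the one being picked up, a quick sanity
check is:

```shell
# Should print a dev version string for the nightly build.
python3 -c "import tensorflow as tf; print(tf.__version__)"
```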
## Process Datasets
Dataset processing requires a
[SentencePiece](https://github.com/google/sentencepiece) model. A publicly
available one can be found in the GCS bucket at:
`gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model`.
Note that in order to train using Cloud TPUs, data must be stored in a GCS
bucket (a sketch of the copy command follows the setup commands below).
Setup commands:
```shell
export SPIECE_DIR=~/cased_spiece/
export SPIECE_MODEL=${SPIECE_DIR}/cased_spiece.model
export DATASETS_DIR=gs://some_bucket/datasets
mkdir -p ${SPIECE_DIR}
gsutil cp gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model ${SPIECE_DIR}
```
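If TFRecords are generated locally, remember to copy them to the GCS bucket
afterwards; a minimal sketch, assuming the records were written to a local
`./processed` directory:

```shell
# Copy locally generated TFRecords to the GCS datasets bucket in parallel.
gsutil -m cp -r ./processed/* ${DATASETS_DIR}/
```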
### Pre-training
Pre-training data can be converted into TFRecords using
[`preprocess_pretrain_data.py`](preprocess_pretrain_data.py). Inputs should
consist of a plain text file (or a file glob of plain text files) with one
sentence per line.
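For illustration, a minimal input file (contents made up) would look like:

```shell
# Each line of the input is a single sentence of plain text.
cat > /tmp/sample_pretrain.txt <<'EOF'
The quick brown fox jumps over the lazy dog.
XLNet was trained on large corpora of cased English text.
EOF
```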
To run the script, use the following command:
```shell
export INPUT_GLOB='path/to/wiki_cased/*.txt'
python3 preprocess_pretrain_data.py --bsz_per_host=32 --num_core_per_host=16 \
  --seq_len=512 --reuse_len=256 --input_glob=${INPUT_GLOB} \
  --save_dir=${DATASETS_DIR}/pretrain --bi_data=True --sp_path=${SPIECE_MODEL} \
  --mask_alpha=6 --mask_beta=1 --num_predict=85
```
Note that to make the memory mechanism work correctly, `bsz_per_host` and
`num_core_per_host` are fixed when the TFRecords are prepared; training must
use the same TPU settings.
### Fine-tuning
* Classification
To prepare classification TFRecords for the IMDB task, download and unpack the
[Large Movie Review (IMDB) dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
with the following commands:
```shell
export IMDB_DIR=~/imdb
mkdir -p ${IMDB_DIR}
cd ${IMDB_DIR}
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz -C ${IMDB_DIR}
rm aclImdb_v1.tar.gz
```
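After unpacking, the directory should contain the standard `aclImdb` layout
(`train/` and `test/` subdirectories holding `pos`/`neg` reviews):

```shell
ls ${IMDB_DIR}/aclImdb
# Expected output: README  imdb.vocab  imdbEr.txt  test  train
```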
Then, the dataset can be converted into TFRecords with the following command:
```shell
export TASK_NAME=imdb
python3 preprocess_classification_data.py \
  --max_seq_length=512 \
  --spiece_model_file=${SPIECE_MODEL} \
  --output_dir=${DATASETS_DIR}/${TASK_NAME} \
  --data_dir=${IMDB_DIR} \
  --task_name=${TASK_NAME}
```
Note: to match the state-of-the-art result on the IMDB dataset, a maximum
sequence length of 512 is necessary.
* SQuAD
The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
detailed information about the SQuAD datasets and evaluation.
To download the relevant files, use the following command:
```shell
export SQUAD_DIR=~/squad
mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
Then to process the dataset into TFRecords, run the following commands:
```shell
python3 preprocess_squad_data.py \
  --spiece_model_file=${SPIECE_MODEL} \
  --train_file=${SQUAD_DIR}/train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
  --output_dir=${DATASETS_DIR}/squad \
  --uncased=False \
  --max_seq_length=512 \
  --num_proc=1 \
  --proc_id=0
gsutil cp ${SQUAD_DIR}/dev-v2.0.json ${DATASETS_DIR}/squad
```
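The `--num_proc`/`--proc_id` flags shard the (slow) training-set conversion, so
preprocessing can be parallelized; a sketch, assuming four shards and enough
local cores to run one process per shard:

```shell
# Run four preprocessing shards concurrently; each proc_id handles one shard.
for i in 0 1 2 3; do
  python3 preprocess_squad_data.py \
    --spiece_model_file=${SPIECE_MODEL} \
    --train_file=${SQUAD_DIR}/train-v2.0.json \
    --predict_file=${SQUAD_DIR}/dev-v2.0.json \
    --output_dir=${DATASETS_DIR}/squad \
    --uncased=False \
    --max_seq_length=512 \
    --num_proc=4 \
    --proc_id=${i} &
done
wait  # Block until all shards finish.
```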
## Fine-tuning with XLNet
* Cloud Storage
The unzipped pre-trained model files can be found in the Google Cloud Storage
folder `gs://cloud-tpu-checkpoints/xlnet/keras_xlnet`. For example:
```shell
export XLNET_DIR=gs://cloud-tpu-checkpoints/xlnet/keras_xlnet
export MODEL_DIR=gs://some_bucket/my_output_dir
```
### Classification Task
This example fine-tunes `XLNet` on the IMDB dataset. On a v3-8 TPU, the first
500 steps' results appear after around 11 minutes, and the full run completes
in around 1 hour, reaching an accuracy between 96.15 and 96.33.
To run on a v3-8 TPU:
```shell
export TPU_NAME=my-tpu
python3 run_classifier.py \
--strategy_type=tpu \
--tpu=${TPU_NAME} \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--model_dir=${MODEL_DIR} \
--test_data_size=25024 \
--train_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.train.tf_record \
--test_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.dev.eval.tf_record \
--train_batch_size=32 \
--seq_len=512 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=true \
--n_class=2 \
--ff_activation=gelu \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--iterations=500 \
--bi_data=false \
--summary_type=last
```
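Training can be monitored by pointing TensorBoard at the output directory,
assuming summaries are written to `MODEL_DIR`:

```shell
# TensorBoard reads gs:// paths directly when GCS credentials are configured.
tensorboard --logdir=${MODEL_DIR}
```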
### SQuAD 2.0 Task
The Stanford Question Answering Dataset (SQuAD) is a popular question answering
benchmark dataset. See more in
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/).
We use `XLNet-LARGE` (cased_L-24_H-1024_A-16) running on a v3-8 as an example
for this workflow. It is expected to reach a `best_f1` score between 88.30 and
88.80. Reading the pickle file takes around 5 minutes, the first 1000 steps'
results appear after a further 18 minutes, and the full run completes in around
2 hours.
```shell
export TPU_NAME=my-tpu
python3 run_squad.py \
--strategy_type=tpu \
--tpu=${TPU_NAME} \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--model_dir=${MODEL_DIR} \
--train_tfrecord_path=${DATASETS_DIR}/squad/squad_cased \
--test_tfrecord_path=${DATASETS_DIR}/squad/squad_cased/12048.eval.tf_record \
--test_feature_path=${DATASETS_DIR}/squad/spiece.model.slen-512.qlen-64.eval.features.pkl \
--predict_dir=${MODEL_DIR} \
--predict_file=${DATASETS_DIR}/squad/dev-v2.0.json \
--train_batch_size=48 \
--seq_len=512 \
--reuse_len=256 \
--mem_len=0 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=true \
--ff_activation=gelu \
--learning_rate=.00003 \
--train_steps=8000 \
--warmup_steps=1000 \
--iterations=1000 \
--bi_data=false \
--query_len=64 \
--adam_epsilon=.000001 \
--lr_layer_decay_rate=0.75
```
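The job reports `best_f1` itself and writes predictions under `--predict_dir`;
for an independent check, the official evaluation script from the
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) can score a
predictions file (the output file name below is an assumption; check the
contents of `MODEL_DIR`):

```shell
# evaluate-v2.0.py is the official scorer downloadable from the SQuAD website.
gsutil cp ${MODEL_DIR}/predictions.json .  # hypothetical output file name
python3 evaluate-v2.0.py ${SQUAD_DIR}/dev-v2.0.json predictions.json
```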