Commit 52979660 authored by Allen Wang, committed by A. Unique TensorFlower

Update XLNet README with dataset processing and fine-tuning running commands.

PiperOrigin-RevId: 334860469
parent 4680f2fa
The academic paper which describes XLNet in detail and provides full results on
a number of tasks can be found here: https://arxiv.org/abs/1906.08237.
XLNet is a generalized autoregressive BERT-like pretraining language model that
enables learning bidirectional contexts by maximizing the expected likelihood
over all permutations of the factorization order. It can learn dependency beyond
a fixed length without disrupting temporal coherence by using the segment-level
recurrence mechanism and relative positional encoding scheme introduced in
Transformer-XL. XLNet outperforms BERT on 20 NLP benchmark tasks and achieves
state-of-the-art results on 18 tasks, including question answering, natural
language inference, sentiment analysis, and document ranking.
## Contents
* [Contents](#contents)
* [Set Up](#set-up)
* [Process Datasets](#process-datasets)
* [Fine-tuning with XLNet](#fine-tuning-with-xlnet)
## Set Up
To run XLNet on a Cloud TPU, you can first create a `tf-nightly` TPU with the
[ctpu tool](https://github.com/tensorflow/tpu/tree/master/tools/ctpu):
```shell
ctpu up --name=<instance name> --tf-version="nightly"
```
After SSH'ing into the VM (or if you're using an on-prem machine), setup
continues as follows:
```shell
export PYTHONPATH="$PYTHONPATH:/path/to/models"
```
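If the `models` repository is not already present on the machine, it can be
cloned first (the destination path below is illustrative):

```shell
# Clone the TensorFlow models repository; adjust the destination path to taste.
git clone https://github.com/tensorflow/models.git /path/to/models
```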
Install `tf-nightly-gpu` (or plain `tf-nightly` on machines without a GPU) to
get the latest updates:
```shell
pip install tf-nightly-gpu
```
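To confirm that the nightly build is the one being picked up, a quick sanity
check is:

```shell
# Should print a dev version string for the nightly build.
python3 -c "import tensorflow as tf; print(tf.__version__)"
```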
## Process Datasets
Dataset processing requires a
[SentencePiece](https://github.com/google/sentencepiece) model. A publicly
available one can be found in the GCS bucket at:
`gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model`.
Note that in order to train using Cloud TPUs, data must be stored in a GCS
bucket (a sketch of the copy command follows the setup commands below).
Setup commands:
```shell
export SPIECE_DIR=~/cased_spiece/
export SPIECE_MODEL=${SPIECE_DIR}/cased_spiece.model
export DATASETS_DIR=gs://some_bucket/datasets
mkdir -p ${SPIECE_DIR}
gsutil cp gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model ${SPIECE_DIR}
```
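If TFRecords are generated locally, remember to copy them to the GCS bucket
afterwards; a minimal sketch, assuming the records were written to a local
`./processed` directory:

```shell
# Copy locally generated TFRecords to the GCS datasets bucket in parallel.
gsutil -m cp -r ./processed/* ${DATASETS_DIR}/
```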
### Pre-training
Pre-training data can be converted into TFRecords using
[`preprocess_pretrain_data.py`](preprocess_pretrain_data.py). Inputs should
consist of a plain text file (or a file glob of plain text files) with one
sentence per line.
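For illustration, a minimal input file (contents made up) would look like:

```shell
# Each line of the input is a single sentence of plain text.
cat > /tmp/sample_pretrain.txt <<'EOF'
The quick brown fox jumps over the lazy dog.
XLNet was trained on large corpora of cased English text.
EOF
```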
To run the script, use the following command:
```shell
export INPUT_GLOB='path/to/wiki_cased/*.txt'
python3 preprocess_pretrain_data.py --bsz_per_host=32 --num_core_per_host=16 \
  --seq_len=512 --reuse_len=256 --input_glob=${INPUT_GLOB} \
  --save_dir=${DATASETS_DIR}/pretrain --bi_data=True --sp_path=${SPIECE_MODEL} \
  --mask_alpha=6 --mask_beta=1 --num_predict=85
```
Note that to make the memory mechanism work correctly, `bsz_per_host` and
`num_core_per_host` are fixed when the TFRecords are prepared; training must
use the same TPU settings.
### Fine-tuning
* Classification
To prepare classification TFRecords for the IMDB task, download and unpack the
[Large Movie Review (IMDB) dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
with the following commands:
```shell
export IMDB_DIR=~/imdb
mkdir -p ${IMDB_DIR}
cd ${IMDB_DIR}
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz -C ${IMDB_DIR}
rm aclImdb_v1.tar.gz
```
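After unpacking, the directory should contain the standard `aclImdb` layout
(`train/` and `test/` subdirectories holding `pos`/`neg` reviews):

```shell
ls ${IMDB_DIR}/aclImdb
# Expected output: README  imdb.vocab  imdbEr.txt  test  train
```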
Then, the dataset can be converted into TFRecords with the following command:
```shell
export TASK_NAME=imdb
python3 preprocess_classification_data.py \
  --max_seq_length=512 \
  --spiece_model_file=${SPIECE_MODEL} \
  --output_dir=${DATASETS_DIR}/${TASK_NAME} \
  --data_dir=${IMDB_DIR} \
  --task_name=${TASK_NAME}
```
Note: to match the state-of-the-art result on the IMDB dataset, a maximum
sequence length of 512 is necessary.
* SQuAD
The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
detailed information about the SQuAD datasets and evaluation.
To download the relevant files, use the following command:
```shell
export SQUAD_DIR=~/squad
mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
Then to process the dataset into TFRecords, run the following commands:
```shell
python3 preprocess_squad_data.py \
  --spiece_model_file=${SPIECE_MODEL} \
  --train_file=${SQUAD_DIR}/train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
  --output_dir=${DATASETS_DIR}/squad \
  --uncased=False \
  --max_seq_length=512 \
  --num_proc=1 \
  --proc_id=0
gsutil cp ${SQUAD_DIR}/dev-v2.0.json ${DATASETS_DIR}/squad
```
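The `--num_proc`/`--proc_id` flags shard the (slow) training-set conversion, so
preprocessing can be parallelized; a sketch, assuming four shards and enough
local cores to run one process per shard:

```shell
# Run four preprocessing shards concurrently; each proc_id handles one shard.
for i in 0 1 2 3; do
  python3 preprocess_squad_data.py \
    --spiece_model_file=${SPIECE_MODEL} \
    --train_file=${SQUAD_DIR}/train-v2.0.json \
    --predict_file=${SQUAD_DIR}/dev-v2.0.json \
    --output_dir=${DATASETS_DIR}/squad \
    --uncased=False \
    --max_seq_length=512 \
    --num_proc=4 \
    --proc_id=${i} &
done
wait  # Block until all shards finish.
```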
## Fine-tuning with XLNet
* Cloud Storage
The unzipped pre-trained model files can be found in the Google Cloud Storage
folder `gs://cloud-tpu-checkpoints/xlnet/keras_xlnet`. For example:
```shell
export XLNET_DIR=gs://cloud-tpu-checkpoints/xlnet/keras_xlnet
export MODEL_DIR=gs://some_bucket/my_output_dir
```
### Classification Task
This example fine-tunes `XLNet` on the IMDB dataset. On a v3-8 TPU, the first
500 steps' results appear after around 11 minutes, and the full run completes
in around 1 hour, reaching an accuracy between 96.15 and 96.33.
To run on a v3-8 TPU:
```shell
export TPU_NAME=my-tpu
python3 run_classifier.py \
--strategy_type=tpu \
--tpu=${TPU_NAME} \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--model_dir=${MODEL_DIR} \
--test_data_size=25024 \
--train_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.train.tf_record \
--test_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.dev.eval.tf_record \
--train_batch_size=32 \
--seq_len=512 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=true \
--n_class=2 \
--ff_activation=gelu \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--iterations=500 \
--bi_data=false \
--summary_type=last
```
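Training can be monitored by pointing TensorBoard at the output directory,
assuming summaries are written to `MODEL_DIR`:

```shell
# TensorBoard reads gs:// paths directly when GCS credentials are configured.
tensorboard --logdir=${MODEL_DIR}
```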
### SQuAD 2.0 Task
The Stanford Question Answering Dataset (SQuAD) is a popular question answering
benchmark dataset. See more in
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/).
We use `XLNet-LARGE` (cased_L-24_H-1024_A-16) running on a v3-8 as an example
for this workflow. It is expected to reach a `best_f1` score between 88.30 and
88.80. Reading the pickle file takes around 5 minutes, the first 1000 steps'
results appear after a further 18 minutes, and the full run completes in around
2 hours.
```shell
export TPU_NAME=my-tpu
python3 run_squad.py \
--strategy_type=tpu \
--tpu=${TPU_NAME} \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--model_dir=${MODEL_DIR} \
--train_tfrecord_path=${DATASETS_DIR}/squad/squad_cased \
--test_tfrecord_path=${DATASETS_DIR}/squad/squad_cased/12048.eval.tf_record \
--test_feature_path=${DATASETS_DIR}/squad/spiece.model.slen-512.qlen-64.eval.features.pkl \
--predict_dir=${MODEL_DIR} \
--predict_file=${DATASETS_DIR}/squad/dev-v2.0.json \
--train_batch_size=48 \
--seq_len=512 \
--reuse_len=256 \
--mem_len=0 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=true \
--ff_activation=gelu \
--learning_rate=.00003 \
--train_steps=8000 \
--warmup_steps=1000 \
--iterations=1000 \
--bi_data=false \
--query_len=64 \
--adam_epsilon=.000001 \
--lr_layer_decay_rate=0.75
```
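The job reports `best_f1` itself and writes predictions under `--predict_dir`;
for an independent check, the official evaluation script from the
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) can score a
predictions file (the output file name below is an assumption; check the
contents of `MODEL_DIR`):

```shell
# evaluate-v2.0.py is the official scorer downloadable from the SQuAD website.
gsutil cp ${MODEL_DIR}/predictions.json .  # hypothetical output file name
python3 evaluate-v2.0.py ${SQUAD_DIR}/dev-v2.0.json predictions.json
```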