# XLNet: Generalized Autoregressive Pretraining for Language Understanding

The academic paper which describes XLNet in detail and provides full results on
a number of tasks can be found here: https://arxiv.org/abs/1906.08237.

XLNet is a generalized autoregressive pretraining language model that learns
bidirectional contexts by maximizing the expected likelihood over all
permutations of the factorization order. Using the segment-level recurrence
mechanism and relative positional encoding scheme introduced in
[Transformer-XL](https://arxiv.org/pdf/1901.02860.pdf), it can also learn
dependencies beyond a fixed length without disrupting temporal coherence. XLNet
outperforms BERT on 20 NLP benchmark tasks and achieves state-of-the-art
results on 18 of them, including question answering, natural language
inference, sentiment analysis, and document ranking.

## Contents

*   [Contents](#contents)
*   [Set Up](#set-up)
*   [Process Datasets](#process-datasets)
*   [Fine-tuning with XLNet](#fine-tuning-with-xlnet)

## Set Up

To run XLNet on a Cloud TPU, you can first create a `tf-nightly` TPU with the
[ctpu tool](https://github.com/tensorflow/tpu/tree/master/tools/ctpu):

```shell
ctpu up -name <instance name> --tf-version="nightly"
```

After SSH'ing into the VM (or if you're using an on-prem machine), setup
continues as follows:

```shell
export PYTHONPATH="$PYTHONPATH:/path/to/models"
```

Install the nightly TensorFlow build to get the latest updates:

```shell
pip install tf-nightly-gpu  # or plain tf-nightly on a host without GPUs
```
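
To sanity-check the installation before continuing, you can print the installed
TensorFlow version (any nightly `dev` build should work):

```shell
python3 -c "import tensorflow as tf; print(tf.__version__)"
```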

## Process Datasets

Dataset processing requires a
[SentencePiece](https://github.com/google/sentencepiece) model. A publicly
available one can be found in the GCS bucket at
`gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model`.

Note that in order to train using Cloud TPUs, data must be stored on a GCS
bucket.
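
If you do not have a bucket yet, one can be created with the Cloud SDK;
`some_bucket` and the region below are placeholders to replace with your own:

```shell
# Requires an authenticated gcloud/gsutil setup with a default project.
gsutil mb -l us-central1 gs://some_bucket
```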

Setup commands:

```shell
export SPIECE_DIR=~/cased_spiece/
export SPIECE_MODEL=${SPIECE_DIR}/cased_spiece.model
export DATASETS_DIR=gs://some_bucket/datasets
mkdir -p ${SPIECE_DIR}
gsutil cp gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model ${SPIECE_DIR}
```
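
To confirm the model downloaded correctly:

```shell
ls -lh ${SPIECE_MODEL}  # should list the SentencePiece model file
```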


### Pre-training

Pre-training data can be converted into TFRecords using
[`preprocess_pretrain_data.py`](preprocess_pretrain_data.py). Inputs should
consist of a plain text file (or a file glob of plain text files) with one
sentence per line.

To run the script, use the following command:

```shell
export INPUT_GLOB='path/to/wiki_cased/*.txt'

python3 preprocess_pretrain_data.py \
  --bsz_per_host=32 \
  --num_core_per_host=16 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob=${INPUT_GLOB} \
  --save_dir=${DATASETS_DIR}/pretrain \
  --bi_data=True \
  --sp_path=${SPIECE_MODEL} \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85
```
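
Once the script finishes, the generated TFRecords should be visible in the save
directory (exact file names depend on the preprocessing settings):

```shell
gsutil ls ${DATASETS_DIR}/pretrain
```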

Note that for the memory mechanism to work correctly, `bsz_per_host` and
`num_core_per_host` are fixed at TFRecord preparation time and must match the
TPU settings used during training.
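
One way to keep the two steps in sync is to define the shared settings once and
reference them in both commands; `BSZ_PER_HOST` and `NUM_CORE_PER_HOST` below
are hypothetical shell variables, not flags the scripts define:

```shell
# Define once, reuse in both steps so the settings cannot drift apart.
export BSZ_PER_HOST=32
export NUM_CORE_PER_HOST=16
# Pass --bsz_per_host=${BSZ_PER_HOST} and --num_core_per_host=${NUM_CORE_PER_HOST}
# to preprocess_pretrain_data.py, then use the same values at training time.
```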

### Fine-tuning

*   Classification

To prepare classification TFRecords for the IMDB dataset, download and unpack
the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/) with the
following commands:

```shell
export IMDB_DIR=~/imdb
mkdir -p ${IMDB_DIR}

cd ${IMDB_DIR}
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz -C ${IMDB_DIR}
rm aclImdb_v1.tar.gz
```
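
After unpacking, the labeled splits should be under `aclImdb`:

```shell
ls ${IMDB_DIR}/aclImdb  # expect train/ and test/ directories, among other files
```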

Then, the dataset can be converted into TFRecords with the following command:

```shell
export TASK_NAME=imdb

python3 preprocess_classification_data.py \
  --max_seq_length=512 \
  --spiece_model_file=${SPIECE_MODEL} \
  --output_dir=${DATASETS_DIR}/${TASK_NAME} \
  --data_dir=${IMDB_DIR}/aclImdb \
  --task_name=${TASK_NAME}
```

Note: a sequence length of 512 is required to reproduce the state-of-the-art
result on the IMDB dataset.
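
The converted records are referenced later by file names that embed the
SentencePiece model name and the sequence length, so it is worth confirming
they exist:

```shell
gsutil ls ${DATASETS_DIR}/imdb
# expect files such as cased_spiece.model.len-512.train.tf_record
```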

*   SQuAD

The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
detailed information about the SQuAD datasets and evaluation.

To download the relevant files, use the following command:

```shell
export SQUAD_DIR=~/squad

mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```

Then to process the dataset into TFRecords, run the following commands:

```shell
python3 preprocess_squad_data.py \
  --spiece_model_file=${SPIECE_MODEL} \
  --train_file=${SQUAD_DIR}/train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
  --output_dir=${DATASETS_DIR}/squad \
  --uncased=False \
  --max_seq_length=512 \
  --num_proc=1 \
  --proc_id=0

gsutil cp ${SQUAD_DIR}/dev-v2.0.json ${DATASETS_DIR}/squad
```
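
Preprocessing with a single process can be slow. The `--num_proc` and
`--proc_id` flags suggest the work can be sharded across processes; below is a
sketch that assumes each shard writes its own output files:

```shell
# Run four shards in parallel (assumes --num_proc/--proc_id shard the input).
for i in 0 1 2 3; do
  python3 preprocess_squad_data.py \
    --spiece_model_file=${SPIECE_MODEL} \
    --train_file=${SQUAD_DIR}/train-v2.0.json \
    --predict_file=${SQUAD_DIR}/dev-v2.0.json \
    --output_dir=${DATASETS_DIR}/squad \
    --uncased=False \
    --max_seq_length=512 \
    --num_proc=4 \
    --proc_id=$i &
done
wait
```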

## Fine-tuning with XLNet

*   Cloud Storage

The unzipped pre-trained model files can be found in the Google Cloud Storage
folder `gs://cloud-tpu-checkpoints/xlnet/keras_xlnet`. For example:

```shell
export XLNET_DIR=gs://cloud-tpu-checkpoints/xlnet/keras_xlnet
export MODEL_DIR=gs://some_bucket/my_output_dir
```
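
Listing the folder shows which checkpoint files are available before
fine-tuning:

```shell
gsutil ls ${XLNET_DIR}
# expect checkpoint shards such as xlnet_model.ckpt.*
```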

### Classification task

This example fine-tunes `XLNet` on the IMDB dataset. On a v3-8 TPU, the first
500 steps' results appear after around 11 minutes and the full run completes in
around 1 hour. The expected accuracy is between 96.15% and 96.33%.

To run on a v3-8 TPU:

```shell
export TPU_NAME=my-tpu

python3 run_classifier.py \
--strategy_type=tpu \
--tpu=${TPU_NAME} \
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
--model_dir=${MODEL_DIR} \
--test_data_size=25024 \
--train_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.train.tf_record \
--test_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.dev.eval.tf_record \
--train_batch_size=32 \
--seq_len=512 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=true \
--n_class=2 \
--ff_activation=gelu \
--learning_rate=2e-5 \
--train_steps=4000 \
--warmup_steps=500 \
--iterations=500 \
--bi_data=false \
--summary_type=last
```
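
Training progress can be followed while the job runs; this assumes summaries
are written under `MODEL_DIR` (TensorBoard can read directly from GCS):

```shell
tensorboard --logdir=${MODEL_DIR} --port=6006
```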

### SQuAD 2.0 Task

The Stanford Question Answering Dataset (SQuAD) is a popular question answering
benchmark. See the
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) for details.

This example runs `XLNet-LARGE` (cased_L-24_H-1024_A-16) on a v3-8 TPU. It is
expected to reach a `best_f1` score between 88.30 and 88.80. Reading the pickle
file takes around 5 minutes, the first 1000 steps' results appear after around
18 minutes, and the full run completes in around 2 hours.

```shell
export TPU_NAME=my-tpu

python3 run_squad.py \
  --strategy_type=tpu \
  --tpu=${TPU_NAME} \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR} \
  --train_tfrecord_path=${DATASETS_DIR}/squad/squad_cased \
  --test_tfrecord_path=${DATASETS_DIR}/squad/squad_cased/12048.eval.tf_record \
  --test_feature_path=${DATASETS_DIR}/squad/spiece.model.slen-512.qlen-64.eval.features.pkl \
  --predict_dir=${MODEL_DIR} \
  --predict_file=${DATASETS_DIR}/squad/dev-v2.0.json \
  --train_batch_size=48 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=0 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=true \
  --ff_activation=gelu \
  --learning_rate=.00003 \
  --train_steps=8000 \
  --warmup_steps=1000 \
  --iterations=1000 \
  --bi_data=false \
  --query_len=64 \
  --adam_epsilon=.000001 \
  --lr_layer_decay_rate=0.75
```
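
When the run finishes, checkpoints and prediction outputs end up under
`MODEL_DIR` (which is also used as `predict_dir` above); exact file names
depend on the script version:

```shell
gsutil ls ${MODEL_DIR}
```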