Commit 52979660
Authored Oct 01, 2020 by Allen Wang; committed by A. Unique TensorFlower, Oct 01, 2020

Update XLNet README with dataset processing and fine-tuning running commands.

PiperOrigin-RevId: 334860469
parent 4680f2fa

Showing 1 changed file (`official/nlp/xlnet/README.md`) with 222 additions and 2 deletions.
The academic paper which describes XLNet in detail and provides full results on
a number of tasks can be found here: https://arxiv.org/abs/1906.08237.
XLNet is a generalized autoregressive BERT-like pretraining language model that
enables learning bidirectional contexts by maximizing the expected likelihood
over all permutations of the factorization order. It can learn dependency beyond
a fixed length by using the segment-level recurrence mechanism and relative
positional encoding scheme introduced in Transformer-XL. XLNet outperforms BERT
on 20 NLP benchmark tasks and achieves state-of-the-art results on 18 tasks
including question answering, natural language inference, sentiment analysis,
and document ranking.
## Contents

*   [Contents](#contents)
*   [Set Up](#set-up)
*   [Process Datasets](#process-datasets)
*   [Fine-tuning with XLNet](#fine-tuning-with-xlnet)
## Set Up

To run XLNet on a Cloud TPU, you can first create a `tf-nightly` TPU with the
[ctpu tool](https://github.com/tensorflow/tpu/tree/master/tools/ctpu):

```shell
ctpu up -name <instance name> --tf-version="nightly"
```
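Provisioning can take a few minutes; before SSH'ing in, the same tool can be
used to confirm that the TPU and its Compute Engine VM are both running (an
optional check, not part of the original instructions; flag names may vary
slightly across `ctpu` versions):

```shell
# Optional: verify that the TPU and VM finished provisioning.
ctpu status -name <instance name>
```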
After SSH'ing into the VM (or if you're using an on-prem machine), setup
continues as follows:

```shell
export PYTHONPATH="$PYTHONPATH:/path/to/models"
```
Install `tf-nightly` to get the latest updates:

```shell
pip install tf-nightly-gpu
```
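A minimal sanity check that the nightly build is the one being picked up
(adjust if you installed a different package):

```shell
# Print the installed TensorFlow version to confirm the nightly build is active.
python3 -c "import tensorflow as tf; print(tf.__version__)"
```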
## Process Datasets

Dataset processing requires a
[SentencePiece](https://github.com/google/sentencepiece) model. A publicly
available one can be found in the GCS bucket at
`gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model`.

Note that in order to train using Cloud TPUs, data must be stored in a GCS
bucket.
Setup commands:

```shell
export SPIECE_DIR=~/cased_spiece/
export SPIECE_MODEL=${SPIECE_DIR}/cased_spiece.model
export DATASETS_DIR=gs://some_bucket/datasets

mkdir -p ${SPIECE_DIR}
gsutil cp gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model ${SPIECE_DIR}
```
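To verify the downloaded model before processing data, it can be loaded with
the `sentencepiece` Python package (an optional check; the package itself is an
extra dependency, not part of the original setup):

```shell
pip install sentencepiece   # only needed for this optional check
python3 -c "
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('${SPIECE_MODEL}')  # load the cased SentencePiece model copied above
print('vocab size:', sp.GetPieceSize())
print(sp.EncodeAsPieces('XLNet enables learning bidirectional contexts.'))
"
```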
### Pre-training

Pre-training data can be converted into TFRecords using
[`preprocess_pretrain_data.py`](preprocess_pretrain_data.py). Inputs should
consist of a plain text file (or a file glob of plain text files) with one
sentence per line.

To run the script, use the following command:
```shell
export INPUT_GLOB='path/to/wiki_cased/*.txt'

python3 preprocess_pretrain_data.py \
  --bsz_per_host=32 \
  --num_core_per_host=16 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob="${INPUT_GLOB}" \
  --save_dir=${DATASETS_DIR}/pretrain \
  --bi_data=True \
  --sp_path=${SPIECE_MODEL} \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85
```
Note that to make the memory mechanism work correctly, `bsz_per_host` and
`num_core_per_host` must be *strictly specified* when preparing TFRecords, and
the same TPU settings should be used when training.
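After the script completes, the generated TFRecords can be confirmed directly
in the GCS output directory (an optional check; exact file names depend on the
preprocessing parameters):

```shell
# List the pre-training TFRecords written by the preprocessing script.
gsutil ls ${DATASETS_DIR}/pretrain
```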
### Fine-tuning

*   Classification

To prepare classification TFRecords for the IMDB dataset, users can download
and unpack the [IMDB dataset](https://www.imdb.com/interfaces/) with the
following commands:
```shell
export IMDB_DIR=~/imdb
mkdir -p ${IMDB_DIR}
cd ${IMDB_DIR}

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz -C ${IMDB_DIR}
rm aclImdb_v1.tar.gz
```
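The tarball unpacks into an `aclImdb/` directory containing the `train/` and
`test/` splits; a quick listing confirms the layout before conversion (an
optional check):

```shell
# Inspect the unpacked dataset layout.
ls ${IMDB_DIR}/aclImdb
# expected: README  imdb.vocab  imdbEr.txt  test  train
```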
Then, the dataset can be converted into TFRecords with the following command:

```shell
export TASK_NAME=imdb

python3 preprocess_classification_data.py \
  --max_seq_length=512 \
  --spiece_model_file=${SPIECE_MODEL} \
  --output_dir=${DATASETS_DIR}/${TASK_NAME} \
  --data_dir=${IMDB_DIR} \
  --task_name=${TASK_NAME}
```
Note: To obtain SOTA on the IMDB dataset, using a sequence length of 512 is
necessary.
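Since the fine-tuning command further below passes an explicit
`--test_data_size`, it can be useful to count the records in the generated eval
file (an optional check; the file name shown follows the pattern used in the
fine-tuning example and may differ in your setup):

```shell
python3 -c "
import tensorflow as tf
# Path follows the naming used in the classification fine-tuning example below.
path = '${DATASETS_DIR}/imdb/cased_spiece.model.len-512.dev.eval.tf_record'
print(sum(1 for _ in tf.data.TFRecordDataset(path)))
"
```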
*   SQuAD

The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
detailed information about the SQuAD datasets and evaluation.

To download the relevant files, use the following commands:
```shell
export SQUAD_DIR=~/squad
mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
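Both files are plain SQuAD-format JSON; a quick parse confirms the downloads
are intact (an optional check, run from `${SQUAD_DIR}`):

```shell
python3 -c "
import json
# Each SQuAD v2.0 file has top-level 'version' and 'data' fields.
for name in ('train-v2.0.json', 'dev-v2.0.json'):
    with open(name) as f:
        data = json.load(f)
    print(name, data['version'], 'articles:', len(data['data']))
"
```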
Then to process the dataset into TFRecords, run the following commands:

```shell
python3 preprocess_squad_data.py \
  --spiece_model_file=${SPIECE_MODEL} \
  --train_file=${SQUAD_DIR}/train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
  --output_dir=${DATASETS_DIR}/squad \
  --uncased=False \
  --max_seq_length=512 \
  --num_proc=1 \
  --proc_id=0

gsutil cp ${SQUAD_DIR}/dev-v2.0.json ${DATASETS_DIR}/squad
```
## Fine-tuning with XLNet

*   Cloud Storage

The unzipped pre-trained model files can be found in the Google Cloud Storage
folder `gs://cloud-tpu-checkpoints/xlnet/keras_xlnet`. For example:

```shell
export XLNET_DIR=gs://cloud-tpu-checkpoints/xlnet/keras_xlnet
export MODEL_DIR=gs://some_bucket/my_output_dir
```
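Before launching a fine-tuning job, it is worth confirming that the checkpoint
referenced by `--init_checkpoint` below actually exists under `${XLNET_DIR}`
(an optional check):

```shell
# List the pre-trained checkpoint files in the public bucket.
gsutil ls ${XLNET_DIR}
```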
### Classification task

This example code fine-tunes `XLNet` on the IMDB dataset. For this task, it
takes around 11 minutes to get the first 500 steps' results, and takes around 1
hour to complete on a v3-8. It is expected to obtain an accuracy between 96.15
and 96.33.

To run on a v3-8 TPU:
```shell
export TPU_NAME=my-tpu

python3 run_classifier.py \
  --strategy_type=tpu \
  --tpu=${TPU_NAME} \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR} \
  --test_data_size=25024 \
  --train_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.train.tf_record \
  --test_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.dev.eval.tf_record \
  --train_batch_size=32 \
  --seq_len=512 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=true \
  --n_class=2 \
  --ff_activation=gelu \
  --learning_rate=2e-5 \
  --train_steps=4000 \
  --warmup_steps=500 \
  --iterations=500 \
  --bi_data=false \
  --summary_type=last
```
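While the job runs, the summaries written under `--model_dir` can be monitored
with TensorBoard (an optional step; this assumes the training loop writes
TensorBoard summaries to the model directory, and TensorBoard can read `gs://`
paths directly):

```shell
# Monitor training summaries written to the model directory.
tensorboard --logdir=${MODEL_DIR}
```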
### SQuAD 2.0 Task

The Stanford Question Answering Dataset (SQuAD) is a popular question answering
benchmark dataset. See more on the
[SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/).

We use `XLNet-LARGE` (cased_L-24_H-1024_A-16) running on a v3-8 as an example to
run this workflow. It is expected to reach a `best_f1` score between 88.30 and
88.80. It should take around 5 minutes to read the pickle file, then around 18
minutes to get the first 1000 steps' results, and around 2 hours to complete.
```shell
export TPU_NAME=my-tpu

python3 run_squad.py \
  --strategy_type=tpu \
  --tpu=${TPU_NAME} \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR} \
  --train_tfrecord_path=${DATASETS_DIR}/squad/squad_cased \
  --test_tfrecord_path=${DATASETS_DIR}/squad/squad_cased/12048.eval.tf_record \
  --test_feature_path=${DATASETS_DIR}/squad/spiece.model.slen-512.qlen-64.eval.features.pkl \
  --predict_dir=${MODEL_DIR} \
  --predict_file=${DATASETS_DIR}/squad/dev-v2.0.json \
  --train_batch_size=48 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=0 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=true \
  --ff_activation=gelu \
  --learning_rate=.00003 \
  --train_steps=8000 \
  --warmup_steps=1000 \
  --iterations=1000 \
  --bi_data=false \
  --query_len=64 \
  --adam_epsilon=.000001 \
  --lr_layer_decay_rate=0.75
```
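When the run finishes, checkpoints, summaries, and the prediction files written
via `--predict_dir` all end up under `${MODEL_DIR}`; they can be listed or
pulled down for inspection (a generic sketch; exact file names depend on the
script version):

```shell
# List the run outputs, then copy them locally for inspection.
gsutil ls ${MODEL_DIR}
mkdir -p /tmp/xlnet_squad_outputs
gsutil -m cp -r ${MODEL_DIR}/* /tmp/xlnet_squad_outputs/
```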