Commit 52979660, authored Oct 01, 2020 by Allen Wang, committed by A. Unique TensorFlower on Oct 01, 2020.

Update XLNet README with dataset processing and fine-tuning running commands.

PiperOrigin-RevId: 334860469
Parent: 4680f2fa

Showing 1 changed file (official/nlp/xlnet/README.md) with 222 additions and 2 deletions.
The academic paper which describes XLNet in detail and provides full results on
a number of tasks can be found here: https://arxiv.org/abs/1906.08237.

XLNet is a generalized autoregressive, BERT-like pretraining language model that
enables learning bidirectional contexts by maximizing the expected likelihood
over all permutations of the factorization order. It can learn dependencies
beyond a fixed length by using the segment-level recurrence mechanism and
relative positional encoding scheme introduced in Transformer-XL. XLNet
outperforms BERT on 20 NLP benchmark tasks and achieves state-of-the-art
results on 18 tasks, including question answering, natural language inference,
sentiment analysis, and document ranking.
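For reference, the permutation language modeling objective mentioned above can
be written as follows, using the notation of the XLNet paper:

```latex
\max_{\theta}\;
  \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```

where `Z_T` is the set of all permutations of the index sequence `[1, 2, ..., T]`,
and `z_t` and `z_<t` denote the t-th element and the first t-1 elements of a
permutation `z`.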
## Contents

*   [Contents](#contents)
*   [Set Up](#set-up)
*   [Process Datasets](#process-datasets)
*   [Fine-tuning with XLNet](#fine-tuning-with-xlnet)
## Set up

To run XLNet on a Cloud TPU, you can first create a `tf-nightly` TPU with the
[ctpu tool](https://github.com/tensorflow/tpu/tree/master/tools/ctpu):

```shell
ctpu up -name <instance name> --tf-version="nightly"
```
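If you want to confirm that the TPU and its Compute Engine VM are ready before
connecting, the `ctpu status` subcommand reports the state of both (assuming
the same instance name used above):

```shell
ctpu status -name <instance name>
```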
After SSH'ing into the VM (or if you're using an on-prem machine), setup
continues as follows:

```shell
export PYTHONPATH="$PYTHONPATH:/path/to/models"
```

Install `tf-nightly` to get the latest updates:

```shell
pip install tf-nightly-gpu
```
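As an optional sanity check, you can confirm that the nightly build imports
correctly and that a GPU is visible to TensorFlow:

```shell
python3 -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"
```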
## Process Datasets

Dataset processing requires a [SentencePiece](https://github.com/google/sentencepiece)
model. One can be found in the publicly available GCS bucket at:
`gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model`.

Note that in order to train using Cloud TPUs, data must be stored in a GCS
bucket.

Setup commands:

```shell
export SPIECE_DIR=~/cased_spiece/
export SPIECE_MODEL=${SPIECE_DIR}/cased_spiece.model
export DATASETS_DIR=gs://some_bucket/datasets

mkdir -p ${SPIECE_DIR}
gsutil cp gs://cloud-tpu-checkpoints/xlnet/cased_spiece.model ${SPIECE_DIR}
```
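If the destination bucket does not exist yet, you can create it first and then
confirm that the SentencePiece model was downloaded (the bucket name here is
only a placeholder; substitute your own):

```shell
gsutil mb gs://some_bucket
ls ${SPIECE_DIR}
```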
### Pre-training

Pre-training data can be converted into TFRecords using
[`preprocess_pretrain_data.py`](preprocess_pretrain_data.py). Inputs should
consist of a plain text file (or a file glob of plain text files) with one
sentence per line.
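For illustration, a minimal input file matching this layout could look like the
following (the path and sentences are placeholders only):

```shell
mkdir -p path/to/wiki_cased
cat > path/to/wiki_cased/example.txt <<'EOF'
XLNet is a generalized autoregressive pretraining method.
It uses a permutation language modeling objective.
Each line of the input file holds exactly one sentence.
EOF
```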
To run the script, use the following command:
```shell
export INPUT_GLOB='path/to/wiki_cased/*.txt'

python3 preprocess_pretrain_data.py \
  --bsz_per_host=32 \
  --num_core_per_host=16 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob="${INPUT_GLOB}" \
  --save_dir=${DATASETS_DIR}/pretrain \
  --bi_data=True \
  --sp_path=${SPIECE_MODEL} \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85
```
Note that to make the memory mechanism work correctly, `bsz_per_host` and
`num_core_per_host` must be specified when preparing the TFRecords, and the
same TPU settings should be used when training.
### Fine-tuning

*   Classification

To prepare classification data TFRecords on the IMDB dataset, users can download
and unpack the [IMDB dataset](https://www.imdb.com/interfaces/) with the
following commands:

```shell
export IMDB_DIR=~/imdb
mkdir -p ${IMDB_DIR}
cd ${IMDB_DIR}

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz -C ${IMDB_DIR}
rm aclImdb_v1.tar.gz
```
Then, the dataset can be converted into TFRecords with the following command:
```shell
export TASK_NAME=imdb

python3 preprocess_classification_data.py \
  --max_seq_length=512 \
  --spiece_model_file=${SPIECE_MODEL} \
  --output_dir=${DATASETS_DIR}/${TASK_NAME} \
  --data_dir=${IMDB_DIR} \
  --task_name=${TASK_NAME}
```

Note: To obtain SOTA on the IMDB dataset, using a sequence length of 512 is
necessary.
*   SQuAD

The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
detailed information about the SQuAD datasets and evaluation.

To download the relevant files, use the following commands:

```shell
export SQUAD_DIR=~/squad
mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```
Then, to process the dataset into TFRecords, run the following commands:

```shell
python3 preprocess_squad_data.py \
  --spiece_model_file=${SPIECE_MODEL} \
  --train_file=${SQUAD_DIR}/train-v2.0.json \
  --predict_file=${SQUAD_DIR}/dev-v2.0.json \
  --output_dir=${DATASETS_DIR}/squad \
  --uncased=False \
  --max_seq_length=512 \
  --num_proc=1 \
  --proc_id=0

gsutil cp ${SQUAD_DIR}/dev-v2.0.json ${DATASETS_DIR}/squad
```
## Fine-tuning with XLNet

*   Cloud Storage

The unzipped pre-trained model files can be found in the Google Cloud Storage
folder `gs://cloud-tpu-checkpoints/xlnet/keras_xlnet`. For example:

```shell
export XLNET_DIR=gs://cloud-tpu-checkpoints/xlnet/keras_xlnet
export MODEL_DIR=gs://some_bucket/my_output_dir
```
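To confirm that the pre-trained checkpoint files are visible from your VM, you
can list the contents of the public folder:

```shell
gsutil ls ${XLNET_DIR}
```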
### Classification task

This example code fine-tunes `XLNet` on the IMDB dataset. For this task, it
takes around 11 minutes to get the first 500 steps' results, and around 1 hour
to complete on a v3-8. It is expected to obtain an accuracy between 96.15 and
96.33.

To run on a v3-8 TPU:
```shell
export TPU_NAME=my-tpu

python3 run_classifier.py \
  --strategy_type=tpu \
  --tpu=${TPU_NAME} \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR} \
  --test_data_size=25024 \
  --train_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.train.tf_record \
  --test_tfrecord_path=${DATASETS_DIR}/imdb/cased_spiece.model.len-512.dev.eval.tf_record \
  --train_batch_size=32 \
  --seq_len=512 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=true \
  --n_class=2 \
  --ff_activation=gelu \
  --learning_rate=2e-5 \
  --train_steps=4000 \
  --warmup_steps=500 \
  --iterations=500 \
  --bi_data=false \
  --summary_type=last
```
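Checkpoints and any training summaries are written under `${MODEL_DIR}`.
Assuming summaries are emitted there by the training loop, you can monitor
progress with TensorBoard:

```shell
tensorboard --logdir=${MODEL_DIR}
```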
### SQuAD 2.0 Task

The Stanford Question Answering Dataset (SQuAD) is a popular question answering
benchmark dataset. See the [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/)
for more details.

We use `XLNet-LARGE` (cased_L-24_H-1024_A-16) running on a v3-8 as an example
of this workflow. It is expected to reach a `best_f1` score between 88.30 and
88.80. It should take around 5 minutes to read the pickle file, then 18 minutes
to get the first 1000 steps' results, and around 2 hours to complete.
```shell
export TPU_NAME=my-tpu

python3 run_squad.py \
  --strategy_type=tpu \
  --tpu=${TPU_NAME} \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR} \
  --train_tfrecord_path=${DATASETS_DIR}/squad/squad_cased \
  --test_tfrecord_path=${DATASETS_DIR}/squad/squad_cased/12048.eval.tf_record \
  --test_feature_path=${DATASETS_DIR}/squad/spiece.model.slen-512.qlen-64.eval.features.pkl \
  --predict_dir=${MODEL_DIR} \
  --predict_file=${DATASETS_DIR}/squad/dev-v2.0.json \
  --train_batch_size=48 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=0 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=true \
  --ff_activation=gelu \
  --learning_rate=.00003 \
  --train_steps=8000 \
  --warmup_steps=1000 \
  --iterations=1000 \
  --bi_data=false \
  --query_len=64 \
  --adam_epsilon=.000001 \
  --lr_layer_decay_rate=0.75
```