Commit ece64b24 authored by Chen Chen, committed by A. Unique TensorFlower


Add an example of training/evaluating a SQuAD model using a checkpoint from docs/pretrained_models.md for OSS in train.md.

PiperOrigin-RevId: 350070343
parent 684c0d4d
task:
  hub_module_url: ''
  max_answer_length: 30
  n_best_size: 20
  null_score_diff_threshold: 0.0
  init_checkpoint: ''
  train_data:
    drop_remainder: true
    global_batch_size: 48
    input_path: ''
    is_training: true
    seq_length: 384
  validation_data:
    do_lower_case: true
    doc_stride: 128
    drop_remainder: false
    global_batch_size: 48
    input_path: ''
    is_training: false
    query_length: 64
    seq_length: 384
    tokenization: WordPiece
    version_2_with_negative: false
    vocab_file: ''
trainer:
  checkpoint_interval: 1000
  max_to_keep: 5
  optimizer_config:
    learning_rate:
      polynomial:
        decay_steps: 3699
        end_learning_rate: 0.0
        initial_learning_rate: 8.0e-05
        power: 1.0
      type: polynomial
    optimizer:
      type: adamw
    warmup:
      polynomial:
        power: 1
        warmup_steps: 370
      type: polynomial
  steps_per_loop: 1000
  summary_interval: 1000
  train_steps: 3699
  validation_interval: 1000
  validation_steps: 226
  best_checkpoint_export_subdir: 'best_ckpt'
  best_checkpoint_eval_metric: 'final_f1'
  best_checkpoint_metric_comp: 'higher'
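For reference, the schedule above roughly corresponds to two epochs over the SQuAD v1.1 training set at a global batch size of 48, with about 10% of the steps used for warmup. A minimal sanity check is sketched below; the feature counts are approximations of the preprocessed dataset sizes, not values produced by this change:

```shell
# Rough sanity check of the schedule above; feature counts are approximate
# assumptions for SQuAD v1.1 at seq_length=384, not outputs of this commit.
python3 -c "
train_features = 88700   # approx. training features after preprocessing
dev_features = 10800     # approx. dev features after preprocessing
batch_size = 48
epochs = 2
steps = epochs * train_features // batch_size
print('train_steps      ~', steps)                           # config: 3699
print('warmup_steps     ~', steps // 10)                      # config: 370
print('validation_steps ~', -(-dev_features // batch_size))   # config: 226
"
```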
@@ -64,9 +64,32 @@ This example fine-tunes BERT-base from TF-Hub on the Multi-Genre Natural
Language Inference (MultiNLI) corpus using TPUs.
Firstly, you can prepare the fine-tuning data using
the [`create_finetuning_data.py`](https://github.com/tensorflow/models/blob/master/official/nlp/data/create_finetuning_data.py) script.
For GLUE tasks, you can (1) download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`, (2) prepare the vocabulary file,
and (3) run the following command:
```shell
export GLUE_DIR=~/glue
export VOCAB_FILE=~/uncased_L-12_H-768_A-12/vocab.txt
export TASK_NAME=MNLI
export OUTPUT_DATA_DIR=gs://some_bucket/datasets
python3 data/create_finetuning_data.py \
--input_data_dir=${GLUE_DIR}/${TASK_NAME}/ \
--vocab_file=${VOCAB_FILE} \
--train_data_output_path=${OUTPUT_DATA_DIR}/${TASK_NAME}_train.tf_record \
--eval_data_output_path=${OUTPUT_DATA_DIR}/${TASK_NAME}_eval.tf_record \
--meta_data_file_path=${OUTPUT_DATA_DIR}/${TASK_NAME}_meta_data \
--fine_tuning_task_type=classification --max_seq_length=128 \
--classification_task_name=${TASK_NAME}
```
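Optionally, you can verify the generated files before training. The snippet below is a minimal sketch; it assumes TensorFlow is installed on the machine you run it from and that the output bucket is readable from there:

```shell
# Count the records written for the training split (optional sanity check).
python3 -c "
import tensorflow as tf
path = '${OUTPUT_DATA_DIR}/${TASK_NAME}_train.tf_record'
print(path, '->', sum(1 for _ in tf.data.TFRecordDataset(path)), 'records')
"
```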
Resulting training and evaluation datasets in `tf_record` format will later be
passed to [train.py](train.py). We will soon support reading datasets from
tensorflow_datasets (TFDS) and using tf.text for pre-processing.
Then you can execute the following commands to start the training and evaluation
job.
@@ -100,4 +123,59 @@ python3 train.py \
You can monitor the training progress in the console and find the output
models in `$OUTPUT_DIR`.
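Since the trainer also writes summaries and checkpoints to the model directory, you can, for example, point TensorBoard at it. This assumes TensorBoard and `gsutil` are installed and have access to the bucket:

```shell
# Optional: inspect summaries and checkpoints written to the model directory.
tensorboard --logdir=${OUTPUT_DIR}
gsutil ls ${OUTPUT_DIR}
```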
### Fine-tuning SQuAD with a pre-trained BERT checkpoint
This example fine-tunes a pre-trained BERT checkpoint on the
Stanford Question Answering Dataset (SQuAD) using TPUs.
The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
detailed information about the SQuAD datasets and evaluation. After downloading
the SQuAD datasets and the [pre-trained BERT checkpoints](https://github.com/tensorflow/models/blob/master/official/nlp/docs/pretrained_models.md),
you can run the following command to prepare the `tf_record` files:
```shell
export SQUAD_DIR=~/squad
export BERT_DIR=~/uncased_L-12_H-768_A-12
export OUTPUT_DATA_DIR=gs://some_bucket/datasets
python3 data/create_finetuning_data.py \
--squad_data_file=${SQUAD_DIR}/train-v1.1.json \
--vocab_file=${BERT_DIR}/vocab.txt \
--train_data_output_path=${OUTPUT_DATA_DIR}/train.tf_record \
--meta_data_file_path=${OUTPUT_DATA_DIR}/squad_meta_data \
--fine_tuning_task_type=squad --max_seq_length=384
```
Note: To create fine-tuning data with SQuAD 2.0, you need to add the flag `--version_2_with_negative=True`.
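For example, a SQuAD 2.0 variant of the command above could look like the following; the `train-v2.0.json` file name matches the standard SQuAD download, while the output file names are only illustrative:

```shell
# Hypothetical SQuAD 2.0 data preparation (output names are illustrative).
python3 data/create_finetuning_data.py \
  --squad_data_file=${SQUAD_DIR}/train-v2.0.json \
  --vocab_file=${BERT_DIR}/vocab.txt \
  --train_data_output_path=${OUTPUT_DATA_DIR}/train_v2.tf_record \
  --meta_data_file_path=${OUTPUT_DATA_DIR}/squad_v2_meta_data \
  --fine_tuning_task_type=squad --max_seq_length=384 \
  --version_2_with_negative=True
```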
Then, you can start the training and evaluation jobs:
```shell
export SQUAD_DIR=~/squad
export INPUT_DATA_DIR=gs://some_bucket/datasets
export OUTPUT_DIR=gs://some_bucket/my_output_dir
# See the following link for more pre-trained checkpoints:
# https://github.com/tensorflow/models/blob/master/official/nlp/docs/pretrained_models.md
export BERT_DIR=~/uncased_L-12_H-768_A-12
# Override configurations via command-line flags. Alternatively, you can edit
# `configs/experiments/squad_v1.1.yaml` directly to set the corresponding fields.
# Also note that the training data is the pre-processed tf_record file, while
# the validation file is the raw json file.
export PARAMS=task.train_data.input_path=$INPUT_DATA_DIR/train.tf_record
export PARAMS=$PARAMS,task.validation_data.input_path=$SQUAD_DIR/dev-v1.1.json
export PARAMS=$PARAMS,task.validation_data.vocab_file=$BERT_DIR/vocab.txt
export PARAMS=$PARAMS,task.init_checkpoint=$BERT_DIR/bert_model.ckpt
export PARAMS=$PARAMS,runtime.distribution_strategy=tpu
python3 train.py \
--experiment=bert/squad \
--mode=train_and_eval \
--model_dir=$OUTPUT_DIR \
--config_file=configs/experiments/squad_v1.1.yaml \
--tpu=${TPU_NAME} \
--params_override=$PARAMS
```
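Per the `best_checkpoint_*` fields in the config above, the checkpoint with the highest `final_f1` on the dev set should also be exported under the `best_ckpt` subdirectory of the model directory, e.g.:

```shell
# Optional: list the exported best checkpoint (assumes gsutil can access the bucket).
gsutil ls ${OUTPUT_DIR}/best_ckpt
```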
Note: More examples about pre-training will come soon.