"examples/speech_recognition/vscode:/vscode.git/clone" did not exist on "6ce55e4b011275e43404034832b40648b1483ff6"
Commit c0ece9cc authored by Chen Chen, committed by A. Unique TensorFlower

Move data pre-processing related files (classifier_data_lib.py,...

Move data pre-processing related files (classifier_data_lib.py, create_finetuning_data.py, create_pretraining_data.py, squad_lib.py, squad_lib_sp.py) to data folder.

PiperOrigin-RevId: 296254023
parent 73d8226d
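In practice, the move only changes import paths for downstream code. A minimal sketch of the before/after imports, covering just the modules touched in this diff:

```python
# Old location (before this commit):
# from official.nlp.bert import classifier_data_lib
# from official.nlp.bert import squad_lib as squad_lib_wp
# from official.nlp.bert import squad_lib_sp

# New location (after this commit):
from official.nlp.data import classifier_data_lib
# word-piece tokenizer based squad_lib
from official.nlp.data import squad_lib as squad_lib_wp
# sentence-piece tokenizer based squad_lib
from official.nlp.data import squad_lib_sp
```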
@@ -114,7 +114,7 @@ officially supported by Google Cloud TPU team yet until TF 2.1 released.
 ### Pre-training
 There is no change to generate pre-training data. Please use the script
-[`create_pretraining_data.py`](create_pretraining_data.py)
+[`../data/create_pretraining_data.py`](../data/create_pretraining_data.py)
 which is essentially branched from [BERT research repo](https://github.com/google-research/bert)
 to get processed pre-training data and it adapts to TF2 symbols and python3
 compatibility.
@@ -123,10 +123,10 @@ compatibility.
 ### Fine-tuning
 To prepare the fine-tuning data for final model training, use the
-[`create_finetuning_data.py`](./create_finetuning_data.py) script. Resulting
-datasets in `tf_record` format and training meta data should be later passed to
-training or evaluation scripts. The task-specific arguments are described in
-following sections:
+[`../data/create_finetuning_data.py`](../data/create_finetuning_data.py) script.
+Resulting datasets in `tf_record` format and training meta data should be later
+passed to training or evaluation scripts. The task-specific arguments are
+described in following sections:
 * GLUE
@@ -141,7 +141,7 @@ export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1
 export TASK_NAME=MNLI
 export OUTPUT_DIR=gs://some_bucket/datasets
-python create_finetuning_data.py \
+python ../data/create_finetuning_data.py \
   --input_data_dir=${GLUE_DIR}/${TASK_NAME}/ \
   --vocab_file=${BERT_BASE_DIR}/vocab.txt \
   --train_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
@@ -171,7 +171,7 @@ export SQUAD_VERSION=v1.1
 export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
 export OUTPUT_DIR=gs://some_bucket/datasets
-python create_finetuning_data.py \
+python ../data/create_finetuning_data.py \
   --squad_data_file=${SQUAD_DIR}/train-${SQUAD_VERSION}.json \
   --vocab_file=${BERT_BASE_DIR}/vocab.txt \
   --train_data_output_path=${OUTPUT_DIR}/squad_${SQUAD_VERSION}_train.tf_record \
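The `tf_record` files produced above are later fed to the training scripts. As a rough illustration of what consuming one looks like, here is a minimal sketch; the feature names and `max_seq_length` follow the usual BERT convention and are assumptions, not taken from this commit (the real values are recorded in the generated training meta data):

```python
import tensorflow as tf

# Assumed values; check the meta data emitted by create_finetuning_data.py.
max_seq_length = 128

name_to_features = {
    'input_ids': tf.io.FixedLenFeature([max_seq_length], tf.int64),
    'input_mask': tf.io.FixedLenFeature([max_seq_length], tf.int64),
    'segment_ids': tf.io.FixedLenFeature([max_seq_length], tf.int64),
    'label_ids': tf.io.FixedLenFeature([], tf.int64),
}

def _decode(record):
  # Parse one serialized tf.Example into a dict of dense tensors.
  return tf.io.parse_single_example(record, name_to_features)

# Path matches the GLUE/MNLI example above.
dataset = (tf.data.TFRecordDataset('gs://some_bucket/datasets/MNLI_train.tf_record')
           .map(_decode)
           .batch(32))
```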
@@ -13,6 +13,7 @@
 # limitations under the License.
 # ==============================================================================
 """Run BERT on SQuAD 1.1 and SQuAD 2.0 in TF 2.x."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
@@ -33,13 +34,14 @@ from official.nlp.bert import common_flags
 from official.nlp.bert import configs as bert_configs
 from official.nlp.bert import input_pipeline
 from official.nlp.bert import model_saving_utils
-from official.nlp.bert import squad_lib as squad_lib_wp
-from official.nlp.bert import squad_lib_sp
 from official.nlp.bert import tokenization
+# word-piece tokenizer based squad_lib
+from official.nlp.data import squad_lib as squad_lib_wp
+# sentence-piece tokenizer based squad_lib
+from official.nlp.data import squad_lib_sp
 from official.utils.misc import distribution_utils
 from official.utils.misc import keras_utils
 flags.DEFINE_enum(
     'mode', 'train_and_predict',
     ['train_and_predict', 'train', 'predict', 'export_only'],
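For orientation, the `mode` flag visible at the end of the SQuAD script hunk above is a standard absl enum flag. A hypothetical, self-contained sketch of how such a flag is defined and consumed (the dispatch below is illustrative only, not the actual script's logic):

```python
from absl import app
from absl import flags

flags.DEFINE_enum(
    'mode', 'train_and_predict',
    ['train_and_predict', 'train', 'predict', 'export_only'],
    'Which phase(s) of the script to run.')

FLAGS = flags.FLAGS


def main(_):
  # Illustrative dispatch on the enum value; the real script wires these
  # cases to its own training, prediction and export routines.
  if FLAGS.mode in ('train_and_predict', 'train'):
    print('train')
  if FLAGS.mode in ('train_and_predict', 'predict'):
    print('predict')
  if FLAGS.mode == 'export_only':
    print('export saved model')


if __name__ == '__main__':
  app.run(main)
```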
@@ -24,13 +24,12 @@ import json
 from absl import app
 from absl import flags
 import tensorflow as tf
-from official.nlp.bert import classifier_data_lib
+from official.nlp.bert import tokenization
+from official.nlp.data import classifier_data_lib
 # word-piece tokenizer based squad_lib
-from official.nlp.bert import squad_lib as squad_lib_wp
+from official.nlp.data import squad_lib as squad_lib_wp
 # sentence-piece tokenizer based squad_lib
-from official.nlp.bert import squad_lib_sp
-from official.nlp.bert import tokenization
+from official.nlp.data import squad_lib_sp
 FLAGS = flags.FLAGS