" \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/official_models/tutorials/fine_tune_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/official/colab/fine_tuning_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/official/colab/fine_tuning_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
"* Select \"GPU\" from the \"Hardware Accelerator\" drop-down list, then save."
"In this example, we will work through fine-tuning a BERT model using the `tensorflow-models` pip package.\n",
"\n",
"The pretrained BERT model this tutorial is based on is also available on [TensorFlow Hub](https://tensorflow.org/hub). To see how to use it, refer to the [Hub Appendix](#hub_bert)."
]
},
{
...
...
@@ -71,7 +86,7 @@
"id": "s2d9S2CSSO1z"
},
"source": [
"## Setup"
]
},
{
...
...
@@ -83,7 +98,7 @@
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-nightly` is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
]
},
...
...
@@ -97,7 +112,8 @@
},
"outputs": [],
"source": [
"!pip install -q tf-nightly\n",
"!pip install -q tf-models-nightly"
]
},
{
...
...
@@ -107,7 +123,7 @@
"id": "U-7qPCjWUAyy"
},
"source": [
"### Imports"
]
},
{
...
...
@@ -123,67 +139,176 @@
"import os\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import tensorflow as tf\n",
"\n",
"import tensorflow_hub as hub\n",
"import tensorflow_datasets as tfds\n",
"tfds.disable_progress_bar()\n",
"\n",
"from official.modeling import tf_utils\n",
"from official.nlp import optimization\n",
"from official.nlp.bert import configs as bert_configs\n",
"## The data\n",
"For this example we used the [GLUE MRPC dataset from TFDS](https://www.tensorflow.org/datasets/catalog/glue#gluemrpc).\n",
"\n",
"This dataset is not set up so that it can be directly fed into the BERT model, so this section also handles the necessary preprocessing."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "qfjcKj5FYQOp"
"id": "28DvUhC1YUiB"
},
"source": [
"### Get the dataset from TensorFlow Datasets\n",
"\n",
"The Microsoft Research Paraphrase Corpus (Dolan \u0026 Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.\n",
"\n",
"* Number of labels: 2.\n",
"* Size of training dataset: 3668.\n",
"* Size of evaluation dataset: 408.\n",
"* Maximum sequence length of training and evaluation dataset: 128.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Ijikx5OsH9AT"
},
"outputs": [],
"source": [
"glue, info = tfds.load('glue/mrpc', with_info=True,\n",
" # It's small, load the whole dataset\n",
" batch_size=-1)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xf9zz4vLYXjr"
},
"outputs": [],
"source": [
"list(glue.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "28DvUhC1YUiB"
"id": "ZgBg2r2nYT-K"
},
"source": [
"The `info` object describes the dataset and its features:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "IQrHxv7W7jH5"
},
"outputs": [],
"source": [
"info.features"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4PhRLWh9jaXp"
"id": "vhsVWYNxazz5"
},
"source": [
"To fine-tune a pre-trained model, you need to be sure that you're using exactly the same tokenization, vocabulary, and index mapping as were used during training.\n",
"\n",
"The BERT tokenizer used in this tutorial is written in pure Python (it's not built out of TensorFlow ops), so you can't just plug it into your model as a `keras.layer` like you can with `preprocessing.TextVectorization`.\n",
"\n",
"The following code rebuilds the tokenizer that was used by the base model:"
"This section manually preprocesses the dataset into the format expected by the model.\n",
"\n",
"This dataset is small, so preprocessing can be done quickly and easily in memory. For larger datasets, the `tf_models` library includes tools for preprocessing and re-serializing a dataset. See [Appendix: Re-encoding a large dataset](#re_encoding_tools) for details."
]
},
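Since the tokenization must match the pretraining vocabulary exactly, it helps to see the shape of the algorithm. The following is a minimal, self-contained sketch of greedy longest-match-first subword splitting in the spirit of WordPiece; the toy vocabulary and function name are assumptions for illustration, not the real vocab file or API shipped with the checkpoint.

```python
# Minimal greedy longest-match subword tokenizer in the spirit of
# WordPiece. The toy vocabulary used below is illustrative only; the
# real tokenizer loads the vocab file shipped with the BERT checkpoint.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Find the longest piece in the vocab starting at `start`.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-piece convention
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return [unk]  # no piece matched at this position
        start = end
    return pieces
```

For example, with a toy vocabulary containing `play` and `##ing`, the word `playing` splits into `["play", "##ing"]`, while a word with no matching pieces maps to `[UNK]`.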
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "62UTWLQd9-LB"
},
"source": [
"#### Encode the sentences\n",
"\n",
"The model expects its two input sentences to be concatenated together. This input is expected to start with a `[CLS]` \"This is a classification problem\" token, and each sentence should end with a `[SEP]` \"Separator\" token:"
"Now prepend a `[CLS]` token, and concatenate the ragged tensors to form a single `input_word_ids` tensor for each example. `RaggedTensor.to_tensor()` zero pads to the longest sequence."
"The mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as the `input_word_ids`, and contains a `1` anywhere the `input_word_ids` is not padding."
"The \"input type\" also has the same shape, but inside the non-padded region, contains a `0` or a `1` indicating which sentence the token is a part of. "
"Each subset of the data has been converted to a dictionary of features, and a set of labels. Each feature in the input dictionary has the same shape, and the number of labels should match:"
"The `config` defines the core BERT Model, which is a Keras model to predict the outputs of `num_classes` from the inputs with maximum sequence length `max_seq_length`.\n",
"\n",
"This function returns both the encoder and the classifier."
"Note: The pretrained `TransformerEncoder` is also available on [TensorFlow Hub](https://tensorflow.org/hub). See the [Hub appendix](#hub_bert) for details. "
]
},
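The packing described above (a leading `[CLS]`, each sentence ended by `[SEP]`, plus a mask and type ids over the same positions, zero-padded to a fixed length) can be sketched in plain Python. The helper name and toy token ids below are illustrative assumptions, not part of the tf-models API.

```python
# Illustrative sketch of packing a sentence pair for BERT:
# [CLS] + sentence1 + [SEP] + sentence2 + [SEP], zero-padded, with the
# matching mask and input-type ids. Token ids here are stand-ins.
CLS, SEP, PAD = 101, 102, 0  # conventional BERT special-token ids

def pack_pair(ids1, ids2, max_len):
    word_ids = [CLS] + ids1 + [SEP] + ids2 + [SEP]
    # Type ids: 0 over [CLS] + sentence1 + [SEP], 1 over sentence2 + [SEP].
    type_ids = [0] * (len(ids1) + 2) + [1] * (len(ids2) + 1)
    # Mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(word_ids)
    pad = max_len - len(word_ids)
    return (word_ids + [PAD] * pad,
            mask + [0] * pad,
            type_ids + [0] * pad)
```

Note how the mask is `1` exactly where `input_word_ids` is not padding, and the type ids flip from `0` to `1` at the start of the second sentence.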
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "115caFLMk-_l"
},
"source": [
"### Set up the optimizer\n",
"\n",
"BERT adopts the Adam optimizer with weight decay (aka \"[AdamW](https://arxiv.org/abs/1711.05101)\").\n",
"It also employs a learning rate schedule that firstly warms up from 0 and then decays to 0."
"In this tutorial, you re-encoded the dataset in memory for clarity.\n",
"\n",
"This was only possible because `glue/mrpc` is a very small dataset. To deal with larger datasets, the `tf_models` library includes some tools for processing and re-encoding a dataset for efficient training."
]
},
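The warmup-then-decay behavior can be sketched as a plain function of the step count; the function and argument names here are illustrative assumptions, not the tf-models `optimization` API (which builds this schedule for you).

```python
# Sketch of the schedule described above: the learning rate ramps
# linearly from 0 up to a peak over the warmup steps, then decays
# linearly back to 0 by the end of training.
def linear_warmup_then_decay(step, total_steps, warmup_steps, peak_lr):
    """Return the learning rate at a given training step."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - warmup_steps)
```

At step 0 the rate is 0, at `warmup_steps` it reaches `peak_lr`, and at `total_steps` it returns to 0.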
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2UTQrkyOT5wD"
},
"source": [
"You can get [the BERT model](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2) off the shelf from [TFHub](https://tensorflow.org/hub). It would not be hard to add a classification head on top of this `hub.KerasLayer`."
"The one downside to loading this model from TFHub is that the structure of the internal Keras layers is not restored, so it's more difficult to inspect or modify the model. The `TransformerEncoder` model is now a single layer:"
"If you need more control over the construction of the model, it's worth noting that the `classifier_model` function used earlier is really just a thin wrapper over the `nlp.modeling.networks.TransformerEncoder` and `nlp.modeling.models.BertClassifier` classes. Just remember that if you start modifying the architecture, it may not be correct or possible to reload the pre-trained checkpoint, so you'll need to retrain from scratch."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "0cgABEwDj06P"
},
"source": [
"Build the encoder:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "5r_yqhBFSVEM"
},
"outputs": [],
"source": [
"transformer_config = config_dict.copy()\n",
"\n",
"# You need to rename a few fields to make this work:\n",