Commit 0ab249df authored by Mark Daoust's avatar Mark Daoust Committed by A. Unique TensorFlower

Add explanation and examples to `fine_tune_bert.ipynb`

PiperOrigin-RevId: 316785333
parent 43587c64
...@@ -4,64 +4,79 @@ ...@@ -4,64 +4,79 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "YN2ACivEPxgD" "id": "vXLA5InzXydn"
}, },
"source": [ "source": [
"## How-to Guide: Using a PIP package for fine-tuning a BERT model\n", "##### Copyright 2019 The TensorFlow Authors."
"\n", ]
"Authors: [Chen Chen](https://github.com/chenGitHuber), [Claire Yao](https://github.com/claireyao-fen)\n", },
"\n", {
"In this example, we will work through fine-tuning a BERT model using the tensorflow-models PIP package." "cell_type": "code",
"execution_count": 0,
"metadata": {
"cellView": "form",
"colab": {},
"colab_type": "code",
"id": "RuRlpLL-X0R_"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "T7BBEc1-RNCQ" "id": "1mLJmVotXs64"
}, },
"source": [ "source": [
"## License\n", "# Fine-tuning a BERT model"
"\n",
"Copyright 2020 The TensorFlow Authors. All Rights Reserved.\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"you may not use this file except in compliance with the License.\n",
"You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
"Unless required by applicable law or agreed to in writing, software\n",
"distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"See the License for the specific language governing permissions and\n",
"limitations under the License."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "Pf6xzoKjywY_" "id": "hYEwGTeCXnnX"
}, },
"source": [ "source": [
"## Learning objectives\n", "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
"\n", " \u003ctd\u003e\n",
"In this Colab notebook, you will learn how to fine-tune a BERT model using the TensorFlow Model Garden PIP package." " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/official_models/tutorials/fine_tune_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/official/colab/fine_tuning_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/official/colab/fine_tuning_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/models/official/colab/fine_tuning_bert.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
" \u003c/td\u003e\n",
"\u003c/table\u003e"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "YHkmV89jRWkS" "id": "YN2ACivEPxgD"
}, },
"source": [ "source": [
"## Enable the GPU acceleration\n", "In this example, we will work through fine-tuning a BERT model using the tensorflow-models PIP package.\n",
"Please enable GPU for better performance.\n", "\n",
"* Navigate to Edit.\n", "The pretrained BERT model this tutorial is based on is also available on [TensorFlow Hub](https://tensorflow.org/hub), to see how to use it refer to the [Hub Appendix](#hub_bert)"
"* Find Notebook settings.\n",
"* Select GPU from the \"Hardware Accelerator\" drop-down list, save it."
] ]
}, },
{ {
...@@ -71,7 +86,7 @@ ...@@ -71,7 +86,7 @@
"id": "s2d9S2CSSO1z" "id": "s2d9S2CSSO1z"
}, },
"source": [ "source": [
"##Install and import" "## Setup"
] ]
}, },
{ {
...@@ -83,7 +98,7 @@ ...@@ -83,7 +98,7 @@
"source": [ "source": [
"### Install the TensorFlow Model Garden pip package\n", "### Install the TensorFlow Model Garden pip package\n",
"\n", "\n",
"* tf-models-nightly is the nightly Model Garden package created daily automatically.\n", "* `tf-models-nightly` is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically." "* pip will install all models and dependencies automatically."
] ]
}, },
...@@ -97,7 +112,8 @@ ...@@ -97,7 +112,8 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"!pip install tf-models-nightly" "!pip install -q tf-nightly\n",
"!pip install -q tf-models-nightly"
] ]
}, },
{ {
...@@ -107,7 +123,7 @@ ...@@ -107,7 +123,7 @@
"id": "U-7qPCjWUAyy" "id": "U-7qPCjWUAyy"
}, },
"source": [ "source": [
"### Import Tensorflow and other libraries" "### Imports"
] ]
}, },
{ {
...@@ -123,67 +139,176 @@ ...@@ -123,67 +139,176 @@
"import os\n", "import os\n",
"\n", "\n",
"import numpy as np\n", "import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import tensorflow as tf\n", "import tensorflow as tf\n",
"\n", "\n",
"import tensorflow_hub as hub\n",
"import tensorflow_datasets as tfds\n",
"tfds.disable_progress_bar()\n",
"\n",
"from official.modeling import tf_utils\n", "from official.modeling import tf_utils\n",
"from official.nlp import optimization\n", "from official import nlp\n",
"from official.nlp.bert import configs as bert_configs\n", "from official.nlp import bert\n",
"from official.nlp.bert import tokenization\n", "\n",
"from official.nlp.data import classifier_data_lib\n", "# Load the required submodules\n",
"from official.nlp.modeling import losses\n", "import official.nlp.optimization\n",
"from official.nlp.modeling import models\n", "import official.nlp.bert.bert_models\n",
"from official.nlp.modeling import networks" "import official.nlp.bert.configs\n",
"import official.nlp.bert.run_classifier\n",
"import official.nlp.bert.tokenization\n",
"import official.nlp.data.classifier_data_lib\n",
"import official.nlp.modeling.losses\n",
"import official.nlp.modeling.models\n",
"import official.nlp.modeling.networks"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "mbanlzTvJBsz"
},
"source": [
"### Resources"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "PpW0x8TpR8DT"
},
"source": [
"This directory contains the configuration, vocabulary, and a pre-trained checkpoint used in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "vzRHOLciR8eq"
},
"outputs": [],
"source": [
"gs_folder_bert = \"gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12\"\n",
"tf.io.gfile.listdir(gs_folder_bert)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9uFskufsR2LT"
},
"source": [
"You can get a pre-trained BERT encoder from TensorFlow Hub here:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "e0dAkUttJAzj"
},
"outputs": [],
"source": [
"hub_url_bert = \"https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2\""
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "C2drjD7OVCmh" "id": "Qv6abtRvH4xO"
}, },
"source": [ "source": [
"## Preprocess the raw data and output tf.record files" "## The data\n",
"For this example we used the [GLUE MRPC dataset from TFDS](https://www.tensorflow.org/datasets/catalog/glue#gluemrpc).\n",
"\n",
"This dataset is not set up so that it can be directly fed into the BERT model, so this section also handles the necessary preprocessing."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "qfjcKj5FYQOp" "id": "28DvUhC1YUiB"
}, },
"source": [ "source": [
"### Introduction of dataset\n", "### Get the dataset from TensorFlow Datasets\n",
"\n", "\n",
"The Microsoft Research Paraphrase Corpus (Dolan \u0026 Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.\n", "The Microsoft Research Paraphrase Corpus (Dolan \u0026 Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.\n",
"\n", "\n",
"* Number of labels: 2.\n", "* Number of labels: 2.\n",
"* Size of training dataset: 3668.\n", "* Size of training dataset: 3668.\n",
"* Size of evaluation dataset: 408.\n", "* Size of evaluation dataset: 408.\n",
"* Maximum sequence length of training and evaluation dataset: 128.\n", "* Maximum sequence length of training and evaluation dataset: 128.\n"
"* Please refer here for details: https://www.tensorflow.org/datasets/catalog/glue#gluemrpc" ]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Ijikx5OsH9AT"
},
"outputs": [],
"source": [
"glue, info = tfds.load('glue/mrpc', with_info=True,\n",
" # It's small, load the whole dataset\n",
" batch_size=-1)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xf9zz4vLYXjr"
},
"outputs": [],
"source": [
"list(glue.keys())"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "28DvUhC1YUiB" "id": "ZgBg2r2nYT-K"
}, },
"source": [ "source": [
"### Get dataset from TensorFlow Datasets (TFDS)\n", "The `info` object describes the dataset and it's features:"
"\n", ]
"For example, we used the GLUE MRPC dataset from TFDS: https://www.tensorflow.org/datasets/catalog/glue#gluemrpc." },
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "IQrHxv7W7jH5"
},
"outputs": [],
"source": [
"info.features"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "4PhRLWh9jaXp" "id": "vhsVWYNxazz5"
}, },
"source": [ "source": [
"### Preprocess the data and write to TensorFlow record file\n", "The two classes are:"
"\n"
] ]
}, },
{ {
...@@ -192,43 +317,21 @@ ...@@ -192,43 +317,21 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "FhcMdzsrjWzG" "id": "n0gfc_VTayfQ"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"gs_folder_bert = \"gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12\"\n", "info.features['label'].names"
"\n",
"# Set up tokenizer to generate Tensorflow dataset\n",
"tokenizer = tokenization.FullTokenizer(\n",
" vocab_file=os.path.join(gs_folder_bert, \"vocab.txt\"), do_lower_case=True)\n",
"\n",
"# Set up processor to generate Tensorflow dataset\n",
"processor = classifier_data_lib.TfdsProcessor(\n",
" tfds_params=\"dataset=glue/mrpc,text_key=sentence1,text_b_key=sentence2\",\n",
" process_text_fn=tokenization.convert_to_unicode)\n",
"\n",
"# Set up output of training and evaluation Tensorflow dataset\n",
"train_data_output_path=\"./mrpc_train.tf_record\"\n",
"eval_data_output_path=\"./mrpc_eval.tf_record\"\n",
"\n",
"# Generate and save training data into a tf record file\n",
"input_meta_data = classifier_data_lib.generate_tf_record_from_data_file(\n",
" processor=processor,\n",
" data_dir=None, # It is `None` because data is from tfds, not local dir.\n",
" tokenizer=tokenizer,\n",
" train_data_output_path=train_data_output_path,\n",
" eval_data_output_path=eval_data_output_path,\n",
" max_seq_length=128)"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "dbJ76vSJj77j" "id": "38zJcap6xkbC"
}, },
"source": [ "source": [
"### Create tf.dataset for training and evaluation\n" "Here is one example from the training set:"
] ]
}, },
{ {
...@@ -237,82 +340,38 @@ ...@@ -237,82 +340,38 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "gCvaLLAxPuMc" "id": "xON_i6SkwApW"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"def create_classifier_dataset(file_path, seq_length, batch_size, is_training):\n", "glue_train = glue['train']\n",
" \"\"\"Creates input dataset from (tf)records files for train/eval.\"\"\"\n",
" dataset = tf.data.TFRecordDataset(file_path)\n",
" if is_training:\n",
" dataset = dataset.shuffle(100)\n",
" dataset = dataset.repeat()\n",
"\n",
" def decode_record(record):\n",
" name_to_features = {\n",
" 'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
" 'input_mask': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
" 'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
" 'label_ids': tf.io.FixedLenFeature([], tf.int64),\n",
" }\n",
" return tf.io.parse_single_example(record, name_to_features)\n",
"\n",
" def _select_data_from_record(record):\n",
" x = {\n",
" 'input_word_ids': record['input_ids'],\n",
" 'input_mask': record['input_mask'],\n",
" 'input_type_ids': record['segment_ids']\n",
" }\n",
" y = record['label_ids']\n",
" return (x, y)\n",
"\n",
" dataset = dataset.map(decode_record,\n",
" num_parallel_calls=tf.data.experimental.AUTOTUNE)\n",
" dataset = dataset.map(\n",
" _select_data_from_record,\n",
" num_parallel_calls=tf.data.experimental.AUTOTUNE)\n",
" dataset = dataset.batch(batch_size, drop_remainder=is_training)\n",
" dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)\n",
" return dataset\n",
"\n",
"# Set up batch sizes\n",
"batch_size = 32\n",
"eval_batch_size = 32\n",
"\n",
"# Return Tensorflow dataset\n",
"training_dataset = create_classifier_dataset(\n",
" train_data_output_path,\n",
" input_meta_data['max_seq_length'],\n",
" batch_size,\n",
" is_training=True)\n",
"\n", "\n",
"evaluation_dataset = create_classifier_dataset(\n", "for key, value in glue_train.items():\n",
" eval_data_output_path,\n", " print(f\"{key:9s}: {value[0].numpy()}\")"
" input_meta_data['max_seq_length'],\n",
" eval_batch_size,\n",
" is_training=False)\n"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "Efrj3Cn1kLAp" "id": "9fbTyfJpNr7x"
}, },
"source": [ "source": [
"## Create, compile and train the model" "### The BERT tokenizer"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "96ldxDSwkVkj" "id": "wqeN54S61ZKQ"
}, },
"source": [ "source": [
"### Construct a Bert Model\n", "To fine tune a pre-trained model you need to be sure that you're using exactly the same tokenization, vocabulary, and index mapping as you used during training.\n",
"\n", "\n",
"Here, a Bert Model is constructed from the json file with parameters. The bert_config defines the core Bert Model, which is a Keras model to predict the outputs of *num_classes* from the inputs with maximum sequence length *max_seq_length*. " "The BERT tokenizer used in this tutorial is written in pure Python (It's not built out of TensorFlow ops). So you can't just plug it into your model as a `keras.layer` like you can with `preprocessing.TextVectorization`.\n",
"\n",
"The following code rebuilds the tokenizer that was used by the base model:"
] ]
}, },
{ {
...@@ -321,44 +380,26 @@ ...@@ -321,44 +380,26 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "Qgajw8WPYzJZ" "id": "idxyhmrCQcw5"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"bert_config_file = os.path.join(gs_folder_bert, \"bert_config.json\")\n", "# Set up tokenizer to generate Tensorflow dataset\n",
"bert_config = bert_configs.BertConfig.from_json_file(bert_config_file)\n", "tokenizer = bert.tokenization.FullTokenizer(\n",
"\n", " vocab_file=os.path.join(gs_folder_bert, \"vocab.txt\"),\n",
"bert_encoder = networks.TransformerEncoder(vocab_size=bert_config.vocab_size,\n", " do_lower_case=True)\n",
" hidden_size=bert_config.hidden_size,\n", "\n",
" num_layers=bert_config.num_hidden_layers,\n", "print(\"Vocab size:\", len(tokenizer.vocab))"
" num_attention_heads=bert_config.num_attention_heads,\n",
" intermediate_size=bert_config.intermediate_size,\n",
" activation=tf_utils.get_activation(bert_config.hidden_act),\n",
" dropout_rate=bert_config.hidden_dropout_prob,\n",
" attention_dropout_rate=bert_config.attention_probs_dropout_prob,\n",
" sequence_length=input_meta_data['max_seq_length'],\n",
" max_sequence_length=bert_config.max_position_embeddings,\n",
" type_vocab_size=bert_config.type_vocab_size,\n",
" embedding_width=bert_config.embedding_size,\n",
" initializer=tf.keras.initializers.TruncatedNormal(\n",
" stddev=bert_config.initializer_range))\n",
"\n",
"classifier_model = models.BertClassifier(\n",
" bert_encoder,\n",
" num_classes=input_meta_data['num_labels'],\n",
" dropout_rate=bert_config.hidden_dropout_prob,\n",
" initializer=tf.keras.initializers.TruncatedNormal(\n",
" stddev=bert_config.initializer_range))"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "pkSq1wbNXBaa" "id": "zYHDSquU2lDU"
}, },
"source": [ "source": [
"### Initialize the encoder from a pretrained model" "Tokenize a sentence:"
] ]
}, },
{ {
...@@ -367,26 +408,40 @@ ...@@ -367,26 +408,40 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "X6N9NEqfXJCx" "id": "L_OfOYPg853R"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"checkpoint = tf.train.Checkpoint(model=bert_encoder)\n", "tokens = tokenizer.tokenize(\"Hello TensorFlow!\")\n",
"checkpoint.restore(\n", "print(tokens)\n",
" os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()" "ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"print(ids)"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "115caFLMk-_l" "id": "kkAXLtuyWWDI"
}, },
"source": [ "source": [
"### Set up an optimizer for the model\n", "### Preprocess the data\n",
"\n", "\n",
"BERT model adopts the Adam optimizer with weight decay.\n", "The section manually preprocessed the dataset into the format expected by the model.\n",
"It also employs a learning rate schedule that firstly warms up from 0 and then decays to 0." "\n",
"This dataset is small, so preprocessing can be done quickly and easily in memory. For larger datasets the `tf_models` library includes some tools for preprocessing and re-serializing a dataset. See [Appendix: Re-encoding a large dataset](#re_encoding_tools) for details."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "62UTWLQd9-LB"
},
"source": [
"#### Encode the sentences\n",
"\n",
"The model expects its two inputs sentences to be concatenated together. This input is expected to start with a `[CLS]` \"This is a classification problem\" token, and each sentence should end with a `[SEP]` \"Separator\" token:"
] ]
}, },
{ {
...@@ -395,45 +450,21 @@ ...@@ -395,45 +450,21 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "2Hf2rpRXk89N" "id": "bdL-dRNRBRJT"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"# Set up epochs and steps\n", "tokenizer.convert_tokens_to_ids(['[CLS]', '[SEP]'])"
"epochs = 3\n",
"train_data_size = input_meta_data['train_data_size']\n",
"steps_per_epoch = int(train_data_size / batch_size)\n",
"num_train_steps = steps_per_epoch * epochs\n",
"warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)\n",
"\n",
"# Create learning rate schedule that firstly warms up from 0 and they decy to 0.\n",
"lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n",
" initial_learning_rate=2e-5,\n",
" decay_steps=num_train_steps,\n",
" end_learning_rate=0)\n",
"lr_schedule = optimization.WarmUp(\n",
" initial_learning_rate=2e-5,\n",
" decay_schedule_fn=lr_schedule,\n",
" warmup_steps=warmup_steps)\n",
"optimizer = optimization.AdamWeightDecay(\n",
" learning_rate=lr_schedule,\n",
" weight_decay_rate=0.01,\n",
" beta_1=0.9,\n",
" beta_2=0.999,\n",
" epsilon=1e-6,\n",
" exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'])"
] ]
}, },
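{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the packing concrete, here is a toy sentence pair encoded by hand (a quick illustration only, not part of the pipeline; the cells below do the same thing for the whole dataset):"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: pack one hand-written sentence pair the way the model expects.\n",
"toy_tokens = (['[CLS]'] +\n",
"    tokenizer.tokenize(\"Hello TensorFlow!\") + ['[SEP]'] +\n",
"    tokenizer.tokenize(\"Hello TF!\") + ['[SEP]'])\n",
"print(toy_tokens)\n",
"print(tokenizer.convert_tokens_to_ids(toy_tokens))"
]
},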
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "OTNcA0O0nSq9" "id": "UrPktnqpwqie"
}, },
"source": [ "source": [
"### Define metric_fn and loss_fn\n", "Start by encoding all the sentences while appending a `[SEP]` token, and packing them into ragged-tensors:"
"\n",
"The metric is accuracy and we use sparse categorical cross-entropy as loss."
] ]
}, },
{ {
...@@ -442,27 +473,43 @@ ...@@ -442,27 +473,43 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "ELHjRp87nVNH" "id": "BR7BmtU498Bh"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"def metric_fn():\n", "def encode_sentence(s):\n",
" return tf.keras.metrics.SparseCategoricalAccuracy(\n", " tokens = list(tokenizer.tokenize(s.numpy()))\n",
" 'accuracy', dtype=tf.float32)\n", " tokens.append('[SEP]')\n",
" return tokenizer.convert_tokens_to_ids(tokens)\n",
"\n", "\n",
"def classification_loss_fn(labels, logits):\n", "sentence1 = tf.ragged.constant([\n",
" return losses.weighted_sparse_categorical_crossentropy_loss(\n", " encode_sentence(s) for s in glue_train[\"sentence1\"]])\n",
" labels=labels, predictions=tf.nn.log_softmax(logits, axis=-1))\n" "sentence2 = tf.ragged.constant([\n",
" encode_sentence(s) for s in glue_train[\"sentence2\"]])"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "has42aUdfky-"
},
"outputs": [],
"source": [
"print(\"Sentence1 shape:\", sentence1.shape.as_list())\n",
"print(\"Sentence2 shape:\", sentence2.shape.as_list())"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "78FEUOOEkoP0" "id": "MU9lTWy_xXbb"
}, },
"source": [ "source": [
"### Compile and train the model" "Now prepend a `[CLS]` token, and concatenate the ragged tensors to form a single `input_word_ids` tensor for each example. `RaggedTensor.to_tensor()` zero pads to the longest sequence."
] ]
}, },
{ {
...@@ -471,29 +518,46 @@ ...@@ -471,29 +518,46 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "nzi8hjeTQTRs" "id": "USD8uihw-g4J"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"classifier_model.compile(optimizer=optimizer,\n", "cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]\n",
" loss=classification_loss_fn,\n", "input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)\n",
" metrics=[metric_fn()])\n", "_ = plt.pcolormesh(input_word_ids.to_tensor())"
"classifier_model.fit(\n",
" x=training_dataset,\n",
" validation_data=evaluation_dataset,\n",
" steps_per_epoch=steps_per_epoch,\n",
" epochs=epochs,\n",
" validation_steps=int(input_meta_data['eval_data_size'] / eval_batch_size))"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "fVo_AnT0l26j" "id": "xmNv4l4k-dBZ"
},
"source": [
"#### Mask and input type"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "DIWjNIKq-ldh"
},
"source": [
"The model expects two additional inputs:\n",
"\n",
"* The input mask\n",
"* The input type"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ulNZ4U96-8JZ"
}, },
"source": [ "source": [
"### Save the model" "The mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as the `input_word_ids`, and contains a `1` anywhere the `input_word_ids` is not padding."
] ]
}, },
{ {
...@@ -502,21 +566,23 @@ ...@@ -502,21 +566,23 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "Nl5x6nElZqkP" "id": "EezOO9qj91kP"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"classifier_model.save('./saved_model', include_optimizer=False, save_format='tf')" "input_mask = tf.ones_like(input_word_ids).to_tensor()\n",
"\n",
"plt.pcolormesh(input_mask)"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": { "metadata": {
"colab_type": "text", "colab_type": "text",
"id": "nWsE6yeyfW00" "id": "rxLenwAvCkBf"
}, },
"source": [ "source": [
"## Use the trained model to predict\n" "The \"input type\" also has the same shape, but inside the non-padded region, contains a `0` or a `1` indicating which sentence the token is a part of. "
] ]
}, },
{ {
...@@ -525,13 +591,1223 @@ ...@@ -525,13 +591,1223 @@
"metadata": { "metadata": {
"colab": {}, "colab": {},
"colab_type": "code", "colab_type": "code",
"id": "vz7YJY2QYAjP" "id": "2CetH_5C9P2m"
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"eval_predictions = classifier_model.predict(evaluation_dataset)\n", "type_cls = tf.zeros_like(cls)\n",
"for prediction in eval_predictions:\n", "type_s1 = tf.zeros_like(sentence1)\n",
" print(\"Predicted label id: %s\" % np.argmax(prediction))" "type_s2 = tf.ones_like(sentence2)\n",
"input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1).to_tensor()\n",
"\n",
"plt.pcolormesh(input_type_ids)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "P5UBnCn8Ii6s"
},
"source": [
"#### Put it all together\n",
"\n",
"Collect the above text parsing code into a single function, and apply it to each split of the `glue/mrpc` dataset."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "sDGiWYPLEd5a"
},
"outputs": [],
"source": [
"def encode_sentence(s, tokenizer):\n",
" tokens = list(tokenizer.tokenize(s))\n",
" tokens.append('[SEP]')\n",
" return tokenizer.convert_tokens_to_ids(tokens)\n",
"\n",
"def bert_encode(glue_dict, tokenizer):\n",
" num_examples = len(glue_dict[\"sentence1\"])\n",
" \n",
" sentence1 = tf.ragged.constant([\n",
" encode_sentence(s, tokenizer)\n",
" for s in np.array(glue_dict[\"sentence1\"])])\n",
" sentence2 = tf.ragged.constant([\n",
" encode_sentence(s, tokenizer)\n",
" for s in np.array(glue_dict[\"sentence2\"])])\n",
"\n",
" cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]\n",
" input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)\n",
"\n",
" input_mask = tf.ones_like(input_word_ids).to_tensor()\n",
"\n",
" type_cls = tf.zeros_like(cls)\n",
" type_s1 = tf.zeros_like(sentence1)\n",
" type_s2 = tf.ones_like(sentence2)\n",
" input_type_ids = tf.concat(\n",
" [type_cls, type_s1, type_s2], axis=-1).to_tensor()\n",
"\n",
" inputs = {\n",
" 'input_word_ids': input_word_ids.to_tensor(),\n",
" 'input_mask': input_mask,\n",
" 'input_type_ids': input_type_ids}\n",
"\n",
" return inputs"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yuLKxf6zHxw-"
},
"outputs": [],
"source": [
"glue_train = bert_encode(glue['train'], tokenizer)\n",
"glue_train_labels = glue['train']['label']\n",
"\n",
"glue_validation = bert_encode(glue['validation'], tokenizer)\n",
"glue_validation_labels = glue['validation']['label']\n",
"\n",
"glue_test = bert_encode(glue['test'], tokenizer)\n",
"glue_test_labels = glue['test']['label']"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7FC5aLVxKVKK"
},
"source": [
"Each subset of the data has been converted to a dictionary of features, and a set of labels. Each feature in the input dictionary has the same shape, and the number of labels should match:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jyjTdGpFhO_1"
},
"outputs": [],
"source": [
"for key, value in glue_train.items():\n",
" print(f'{key:15s} shape: {value.shape}')\n",
"\n",
"print(f'glue_train_labels shape: {glue_train_labels.shape}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "FSwymsbkbLDA"
},
"source": [
"## The model"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Efrj3Cn1kLAp"
},
"source": [
"### Build the model\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xxpOY5r2Ayq6"
},
"source": [
"The first step is to download the configuration for the pre-trained model.\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ujapVfZ_AKW7"
},
"outputs": [],
"source": [
"import json\n",
"\n",
"bert_config_file = os.path.join(gs_folder_bert, \"bert_config.json\")\n",
"config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())\n",
"\n",
"bert_config = bert.configs.BertConfig.from_dict(config_dict)\n",
"\n",
"config_dict"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "96ldxDSwkVkj"
},
"source": [
"The `config` defines the core BERT Model, which is a Keras model to predict the outputs of `num_classes` from the inputs with maximum sequence length `max_seq_length`.\n",
"\n",
"This function returns both the encoder and the classifier."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "cH682__U0FBv"
},
"outputs": [],
"source": [
"bert_classifier, bert_encoder = bert.bert_models.classifier_model(\n",
" bert_config, num_labels=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "XqKp3-5GIZlw"
},
"source": [
"The classifier has three inputs and one output:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "bAQblMIjwkvx"
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "sFmVG4SKZAw8"
},
"source": [
"Run it on a test batch of data 10 examples from the training set. The output is the logits for the two classes:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "VTjgPbp4ZDKo"
},
"outputs": [],
"source": [
"glue_batch = {key: val[:10] for key, val in glue_train.items()}\n",
"\n",
"bert_classifier(\n",
" glue_batch, training=True\n",
").numpy()"
]
},
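{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since these are logits, a softmax can be applied to read them as class probabilities (a quick side check, not part of the original flow):"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Side check only: convert the logits for the test batch to class probabilities.\n",
"tf.nn.softmax(bert_classifier(glue_batch, training=False), axis=-1).numpy()"
]
},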
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Q0NTdwZsQK8n"
},
"source": [
"The `TransformerEncoder` in the center of the classifier above **is** the `bert_encoder`.\n",
"\n",
"Inspecting the encoder, we see its stack of `Transformer` layers connected to those same three inputs:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "8L__-erBwLIQ"
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_encoder, show_shapes=True, dpi=48)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "mKAvkQc3heSy"
},
"source": [
"### Restore the encoder weights\n",
"\n",
"When built the encoder is randomly initialized. Restore the encoder's weights from the checkpoint:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "97Ll2Gichd_Y"
},
"outputs": [],
"source": [
"checkpoint = tf.train.Checkpoint(model=bert_encoder)\n",
"checkpoint.restore(\n",
" os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2oHOql35k3Dd"
},
"source": [
"Note: The pretrained `TransformerEncoder` is also available on [TensorFlow Hub](https://tensorflow.org/hub). See the [Hub appendix](#hub_bert) for details. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "115caFLMk-_l"
},
"source": [
"### Set up the optimizer\n",
"\n",
"BERT adopts the Adam optimizer with weight decay (aka \"[AdamW](https://arxiv.org/abs/1711.05101)\").\n",
"It also employs a learning rate schedule that firstly warms up from 0 and then decays to 0."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "w8qXKRZuCwW4"
},
"outputs": [],
"source": [
"# Set up epochs and steps\n",
"epochs = 3\n",
"batch_size = 32\n",
"eval_batch_size = 32\n",
"\n",
"train_data_size = len(glue_train_labels)\n",
"steps_per_epoch = int(train_data_size / batch_size)\n",
"num_train_steps = steps_per_epoch * epochs\n",
"warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)\n",
"\n",
"# creates an optimizer with learning rate schedule\n",
"optimizer = nlp.optimization.create_optimizer(\n",
" 2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "pXRGxiRNEHS2"
},
"source": [
"This returns an `AdamWeightDecay` optimizer with the learning rate schedule set:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "eQNA16bhDpky"
},
"outputs": [],
"source": [
"type(optimizer)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xqu_K71fJQB8"
},
"source": [
"To see an example of how to customize the optimizer and it's schedule, see the [Optimizer schedule appendix](#optiizer_schedule)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "78FEUOOEkoP0"
},
"source": [
"### Train the model"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OTNcA0O0nSq9"
},
"source": [
"The metric is accuracy and we use sparse categorical cross-entropy as loss."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "nzi8hjeTQTRs"
},
"outputs": [],
"source": [
"metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]\n",
"loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n",
"\n",
"bert_classifier.compile(\n",
" optimizer=optimizer,\n",
" loss=loss,\n",
" metrics=metrics)\n",
"\n",
"bert_classifier.fit(\n",
" glue_train, glue_train_labels,\n",
" validation_data=(glue_validation, glue_validation_labels),\n",
" batch_size=32,\n",
" epochs=epochs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IFtKFWbNKb0u"
},
"source": [
"Now run the fine-tuned model on a custom example to see that it works.\n",
"\n",
"Start by encoding some sentence pairs:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9ZoUgDUNJPz3"
},
"outputs": [],
"source": [
"my_examples = bert_encode(\n",
" glue_dict = {\n",
" 'sentence1':[\n",
" 'The rain in Spain falls mainly on the plain.',\n",
" 'Look I fine tuned BERT.'],\n",
" 'sentence2':[\n",
" 'It mostly rains on the flat lands of Spain.',\n",
" 'Is it working? This does not match.']\n",
" },\n",
" tokenizer=tokenizer)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7ynJibkBRTJF"
},
"source": [
"The model should report class `1` \"match\" for the first example and class `0` \"no-match\" for the second:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "umo0ttrgRYIM"
},
"outputs": [],
"source": [
"result = bert_classifier(my_examples, training=False)\n",
"\n",
"result = tf.argmax(result).numpy()\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "utGl0M3aZCE4"
},
"outputs": [],
"source": [
"np.array(info.features['label'].names)[result]"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fVo_AnT0l26j"
},
"source": [
"### Save the model\n",
"\n",
"Often the goal of training a model is to _use_ it for something, so export the model and then restore it to be sure that it works."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Nl5x6nElZqkP"
},
"outputs": [],
"source": [
"export_dir='./saved_model'\n",
"tf.saved_model.save(bert_classifier, export_dir=export_dir)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "y_ACvKPsVUXC"
},
"outputs": [],
"source": [
"reloaded = tf.saved_model.load(export_dir)\n",
"reloaded_result = reloaded([my_examples['input_word_ids'],\n",
" my_examples['input_mask'],\n",
" my_examples['input_type_ids']], training=False)\n",
"\n",
"original_result = bert_classifier(my_examples, training=False)\n",
"\n",
"# The results are (nearly) identical:\n",
"print(original_result.numpy())\n",
"print()\n",
"print(reloaded_result.numpy())"
]
},
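{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer a programmatic check that the exported model matches the original, something like the following should work (a sketch; the tolerance is an arbitrary choice):"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: verify the reloaded model numerically (tolerance chosen arbitrarily).\n",
"np.testing.assert_allclose(\n",
"    original_result.numpy(), reloaded_result.numpy(), atol=1e-5)"
]
},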
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "eQceYqRFT_Eg"
},
"source": [
"## Appendix"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "SaC1RlFawUpc"
},
"source": [
"\u003ca id=re_encoding_tools\u003e\u003c/a\u003e\n",
"### Re-encoding a large dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "CwUdjFBkzUgh"
},
"source": [
"This tutorial you re-encoded the dataset in memory, for clarity.\n",
"\n",
"This was only possible because `glue/mrpc` is a very small dataset. To deal with larger datasets `tf_models` library includes some tools for processing and re-encoding a dataset for efficient training."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2UTQrkyOT5wD"
},
"source": [
"The first step is to describe which features of the dataset should be transformed:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "XQeDFOzYR9Z9"
},
"outputs": [],
"source": [
"processor = nlp.data.classifier_data_lib.TfdsProcessor(\n",
" tfds_params=\"dataset=glue/mrpc,text_key=sentence1,text_b_key=sentence2\",\n",
" process_text_fn=bert.tokenization.convert_to_unicode)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "XrFQbfErUWxa"
},
"source": [
"Then apply the transformation to generate new TFRecord files."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ymw7GOHpSHKU"
},
"outputs": [],
"source": [
"# Set up output of training and evaluation Tensorflow dataset\n",
"train_data_output_path=\"./mrpc_train.tf_record\"\n",
"eval_data_output_path=\"./mrpc_eval.tf_record\"\n",
"\n",
"max_seq_length = 128\n",
"batch_size = 32\n",
"eval_batch_size = 32\n",
"\n",
"# Generate and save training data into a tf record file\n",
"input_meta_data = (\n",
" nlp.data.classifier_data_lib.generate_tf_record_from_data_file(\n",
" processor=processor,\n",
" data_dir=None, # It is `None` because data is from tfds, not local dir.\n",
" tokenizer=tokenizer,\n",
" train_data_output_path=train_data_output_path,\n",
" eval_data_output_path=eval_data_output_path,\n",
" max_seq_length=max_seq_length))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "uX_Sp-wTUoRm"
},
"source": [
"Finally create `tf.data` input pipelines from those TFRecord files:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "rkHxIK57SQ_r"
},
"outputs": [],
"source": [
"training_dataset = bert.run_classifier.get_dataset_fn(\n",
" train_data_output_path,\n",
" max_seq_length,\n",
" batch_size,\n",
" is_training=True)()\n",
"\n",
"evaluation_dataset = bert.run_classifier.get_dataset_fn(\n",
" eval_data_output_path,\n",
" max_seq_length,\n",
" eval_batch_size,\n",
" is_training=False)()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "stbaVouogvzS"
},
"source": [
"The resulting `tf.data.Datasets` return `(features, labels)` pairs, as expected by `keras.Model.fit`:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "gwhrlQl4gxVF"
},
"outputs": [],
"source": [
"training_dataset.element_spec"
]
},
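{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the training pipeline repeats indefinitely and both pipelines yield `(features, labels)` pairs, they could be passed straight to `Model.fit`. This is only a sketch, reusing the `bert_classifier`, `optimizer`, `loss`, `metrics`, `steps_per_epoch` and `epochs` objects defined earlier in the tutorial:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: train directly from the TFRecord-backed pipelines defined above,\n",
"# reusing the classifier, optimizer, loss and metrics built earlier.\n",
"bert_classifier.compile(optimizer=optimizer, loss=loss, metrics=metrics)\n",
"\n",
"bert_classifier.fit(\n",
"    training_dataset,\n",
"    validation_data=evaluation_dataset,\n",
"    steps_per_epoch=steps_per_epoch,\n",
"    epochs=epochs)"
]
},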
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dbJ76vSJj77j"
},
"source": [
"#### Create tf.data.Dataset for training and evaluation\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9J95LFRohiYw"
},
"source": [
"If you need to modify the data loading here is some code to get you started:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "gCvaLLAxPuMc"
},
"outputs": [],
"source": [
"def create_classifier_dataset(file_path, seq_length, batch_size, is_training):\n",
" \"\"\"Creates input dataset from (tf)records files for train/eval.\"\"\"\n",
" dataset = tf.data.TFRecordDataset(file_path)\n",
" if is_training:\n",
" dataset = dataset.shuffle(100)\n",
" dataset = dataset.repeat()\n",
"\n",
" def decode_record(record):\n",
" name_to_features = {\n",
" 'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
" 'input_mask': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
" 'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
" 'label_ids': tf.io.FixedLenFeature([], tf.int64),\n",
" }\n",
" return tf.io.parse_single_example(record, name_to_features)\n",
"\n",
" def _select_data_from_record(record):\n",
" x = {\n",
" 'input_word_ids': record['input_ids'],\n",
" 'input_mask': record['input_mask'],\n",
" 'input_type_ids': record['segment_ids']\n",
" }\n",
" y = record['label_ids']\n",
" return (x, y)\n",
"\n",
" dataset = dataset.map(decode_record,\n",
" num_parallel_calls=tf.data.experimental.AUTOTUNE)\n",
" dataset = dataset.map(\n",
" _select_data_from_record,\n",
" num_parallel_calls=tf.data.experimental.AUTOTUNE)\n",
" dataset = dataset.batch(batch_size, drop_remainder=is_training)\n",
" dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)\n",
" return dataset"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "rutkBadrhzdR"
},
"outputs": [],
"source": [
"# Set up batch sizes\n",
"batch_size = 32\n",
"eval_batch_size = 32\n",
"\n",
"# Return Tensorflow dataset\n",
"training_dataset = create_classifier_dataset(\n",
" train_data_output_path,\n",
" input_meta_data['max_seq_length'],\n",
" batch_size,\n",
" is_training=True)\n",
"\n",
"evaluation_dataset = create_classifier_dataset(\n",
" eval_data_output_path,\n",
" input_meta_data['max_seq_length'],\n",
" eval_batch_size,\n",
" is_training=False)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "59TVgt4Z7fuU"
},
"outputs": [],
"source": [
"training_dataset.element_spec"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QbklKt-w_CiI"
},
"source": [
"\u003ca id=\"hub_bert\"\u003e\u003c/a\u003e\n",
"\n",
"### TFModels BERT on TFHub\n",
"\n",
"You can get [the BERT model](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2) off the shelf from [TFHub](https://tensorflow.org/hub). It would not be hard to add a classification head on top of this `hub.KerasLayer`"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "lo6479At4sP1"
},
"outputs": [],
"source": [
"# Note: 350MB download.\n",
"import tensorflow_hub as hub\n",
"hub_encoder = hub.KerasLayer(hub_url_bert, trainable=True)\n",
"\n",
"print(f\"The Hub encoder has {len(hub_encoder.trainable_variables)} trainable variables\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iTzF574wivQv"
},
"source": [
"Test run it on a batch of data:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "XEcYrCR45Uwo"
},
"outputs": [],
"source": [
"result = hub_encoder(\n",
" inputs=[glue_train['input_word_ids'][:10],\n",
" glue_train['input_mask'][:10],\n",
" glue_train['input_type_ids'][:10],],\n",
" training=False,\n",
")\n",
"\n",
"print(\"Pooled output shape:\", result[0].shape)\n",
"print(\"Sequence output shape:\", result[1].shape)"
]
},
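{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch (not part of the original tutorial), a classification head could be added on top of the pooled output. The dropout rate, the `Dense` head, and the `manual_hub_classifier` name below are illustrative choices:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Rough sketch: an illustrative classification head on top of the hub encoder.\n",
"input_word_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name='input_word_ids')\n",
"input_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name='input_mask')\n",
"input_type_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name='input_type_ids')\n",
"\n",
"pooled_output, sequence_output = hub_encoder(\n",
"    [input_word_ids, input_mask, input_type_ids])\n",
"\n",
"# The dropout rate and two-unit Dense layer here are illustrative assumptions.\n",
"dropped = tf.keras.layers.Dropout(0.1)(pooled_output)\n",
"logits = tf.keras.layers.Dense(2, name='logits')(dropped)\n",
"\n",
"manual_hub_classifier = tf.keras.Model(\n",
"    inputs=[input_word_ids, input_mask, input_type_ids],\n",
"    outputs=logits)\n",
"manual_hub_classifier.summary()"
]
},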
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cjojn8SmLSRI"
},
"source": [
"At this point it would be simple to add a classification head yourself.\n",
"\n",
"The `bert_models.classifier_model` function can also build a classifier onto the encoder from TensorFlow Hub:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9nTDaApyLR70"
},
"outputs": [],
"source": [
"hub_classifier, hub_encoder = bert.bert_models.classifier_model(\n",
" # Caution: Most of `bert_config` is ignored if you pass a hub url.\n",
" bert_config=bert_config, hub_module_url=hub_url_bert, num_labels=2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xMJX3wV0_v7I"
},
"source": [
"The one downside to loading this model from TFHub is that the structure of internal keras layers is not restored. So it's more difficult to inspect or modify the model. The `TransformerEncoder` model is now a single layer:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "pD71dnvhM2QS"
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(hub_classifier, show_shapes=True, dpi=64)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "nLZD-isBzNKi"
},
"outputs": [],
"source": [
"try:\n",
" tf.keras.utils.plot_model(hub_encoder, show_shapes=True, dpi=64)\n",
" assert False\n",
"except Exception as e:\n",
" print(f\"{type(e).__name__}: {e}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZxSqH0dNAgXV"
},
"source": [
"\u003ca id=\"model_builder_functions\"\u003e\u003c/a\u003e\n",
"\n",
"### Low level model building\n",
"\n",
"If you need a more control over the construction of the model it's worth noting that the `classifier_model` function used earlier is really just a thin wrapper over the `nlp.modeling.networks.TransformerEncoder` and `nlp.modeling.models.BertClassifier` classes. Just remember that if you start modifying the architecture it may not be correct or possible to reload the pre-trained checkpoint so you'll need to retrain from scratch."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "0cgABEwDj06P"
},
"source": [
"Build the encoder:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "5r_yqhBFSVEM"
},
"outputs": [],
"source": [
"transformer_config = config_dict.copy()\n",
"\n",
"# You need to rename a few fields to make this work:\n",
"transformer_config['attention_dropout_rate'] = transformer_config.pop('attention_probs_dropout_prob')\n",
"transformer_config['activation'] = tf_utils.get_activation(transformer_config.pop('hidden_act'))\n",
"transformer_config['dropout_rate'] = transformer_config.pop('hidden_dropout_prob')\n",
"transformer_config['initializer'] = tf.keras.initializers.TruncatedNormal(\n",
" stddev=transformer_config.pop('initializer_range'))\n",
"transformer_config['max_sequence_length'] = transformer_config.pop('max_position_embeddings')\n",
"transformer_config['num_layers'] = transformer_config.pop('num_hidden_layers')\n",
"\n",
"transformer_config"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "rIO8MI7LLijh"
},
"outputs": [],
"source": [
"manual_encoder = nlp.modeling.networks.TransformerEncoder(**transformer_config)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4a4tFSg9krRi"
},
"source": [
"Restore the weights:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "X6N9NEqfXJCx"
},
"outputs": [],
"source": [
"checkpoint = tf.train.Checkpoint(model=manual_encoder)\n",
"checkpoint.restore(\n",
" os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1BPiPO4ykuwM"
},
"source": [
"Test run it:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "hlVdgJKmj389"
},
"outputs": [],
"source": [
"result = manual_encoder(my_examples, training=True)\n",
"\n",
"print(\"Sequence output shape:\", result[0].shape)\n",
"print(\"Pooled output shape:\", result[1].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "nJMXvVgJkyBv"
},
"source": [
"Wrap it in a classifier:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "tQX57GJ6wkAb"
},
"outputs": [],
"source": [
"manual_classifier = nlp.modeling.models.BertClassifier(\n",
" bert_encoder,\n",
" num_classes=2,\n",
" dropout_rate=transformer_config['dropout_rate'],\n",
" initializer=tf.keras.initializers.TruncatedNormal(\n",
" stddev=bert_config.initializer_range))"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "kB-nBWhQk0dS"
},
"outputs": [],
"source": [
"manual_classifier(my_examples, training=True).numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "E6AJlOSyIO1L"
},
"source": [
"\u003ca id=\"optiizer_schedule\"\u003e\u003c/a\u003e\n",
"\n",
"### Optimizers and schedules\n",
"\n",
"The optimizer used to train the model was created using the `nlp.optimization.create_optimizer` function:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "28Dv3BPRlFTD"
},
"outputs": [],
"source": [
"optimizer = nlp.optimization.create_optimizer(\n",
" 2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LRjcHr0UlT8c"
},
"source": [
"That high level wrapper sets up the learning rate schedules and the optimizer.\n",
"\n",
"The base learning rate schedule used here is a linear decay to zero over the training run:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "MHY8K6kDngQn"
},
"outputs": [],
"source": [
"epochs = 3\n",
"batch_size = 32\n",
"eval_batch_size = 32\n",
"\n",
"train_data_size = len(glue_train_labels)\n",
"steps_per_epoch = int(train_data_size / batch_size)\n",
"num_train_steps = steps_per_epoch * epochs"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "wKIcSprulu3P"
},
"outputs": [],
"source": [
"decay_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n",
" initial_learning_rate=2e-5,\n",
" decay_steps=num_train_steps,\n",
" end_learning_rate=0)\n",
"\n",
"plt.plot([decay_schedule(n) for n in range(num_train_steps)])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IMTC_gfAl_PZ"
},
"source": [
"This, in turn is wrapped in a `WarmUp` schedule that linearly increases the learning rate to the target value over the first 10% of training:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "YRt3VTmBmCBY"
},
"outputs": [],
"source": [
"warmup_steps = num_train_steps * 0.1\n",
"\n",
"warmup_schedule = nlp.optimization.WarmUp(\n",
" initial_learning_rate=2e-5,\n",
" decay_schedule_fn=decay_schedule,\n",
" warmup_steps=warmup_steps)\n",
"\n",
"# The warmup overshoots, because it warms up to the `initial_learning_rate`\n",
"# following the original implementation. You can set\n",
"# `initial_learning_rate=decay_schedule(warmup_steps)` if you don't like the\n",
"# overshoot.\n",
"plt.plot([warmup_schedule(n) for n in range(num_train_steps)])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "l8D9Lv3Bn740"
},
"source": [
"Then create the `nlp.optimization.AdamWeightDecay` using that schedule, configured for the BERT model:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "2Hf2rpRXk89N"
},
"outputs": [],
"source": [
"optimizer = nlp.optimization.AdamWeightDecay(\n",
" learning_rate=warmup_schedule,\n",
" weight_decay_rate=0.01,\n",
" epsilon=1e-6,\n",
" exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'])"
] ]
} }
], ],
...@@ -539,8 +1815,10 @@ ...@@ -539,8 +1815,10 @@
"accelerator": "GPU", "accelerator": "GPU",
"colab": { "colab": {
"collapsed_sections": [], "collapsed_sections": [],
"name": "How-to Guide: Using a PIP package for fine-tuning a BERT model", "name": "fine_tuning_bert.ipynb",
"provenance": [] "private_outputs": true,
"provenance": [],
"toc_visible": true
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3",
......