{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YN2ACivEPxgD"
      },
      "source": [
        "## How-to Guide: Using a PIP package for fine-tuning a BERT model\n",
        "\n",
        "Authors: [Chen Chen](https://github.com/chenGitHuber), [Claire Yao](https://github.com/claireyao-fen)\n",
        "\n",
        "In this example, we will work through fine-tuning a BERT model using the tensorflow-models PIP package."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "T7BBEc1-RNCQ"
      },
      "source": [
        "## License\n",
        "\n",
        "Copyright 2020 The TensorFlow Authors. All Rights Reserved.\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "you may not use this file except in compliance with the License.\n",
        "You may obtain a copy of the License at\n",
        "\n",
        "    http://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "Unless required by applicable law or agreed to in writing, software\n",
        "distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "See the License for the specific language governing permissions and\n",
        "limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Pf6xzoKjywY_"
      },
      "source": [
        "## Learning objectives\n",
        "\n",
        "In this Colab notebook, you will learn how to fine-tune a BERT model using the TensorFlow Model Garden PIP package."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YHkmV89jRWkS"
      },
      "source": [
        "## Enable GPU acceleration\n",
        "Please enable GPU acceleration for better performance.\n",
        "*   Navigate to the Edit menu.\n",
        "*   Open Notebook settings.\n",
        "*   Select GPU from the \"Hardware Accelerator\" drop-down list and save."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "s2d9S2CSSO1z"
      },
      "source": [
        "## Install and import"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fsACVQpVSifi"
      },
      "source": [
        "### Install the TensorFlow Model Garden pip package\n",
        "\n",
        "*  tf-models-nightly is the nightly Model Garden package, built automatically every day.\n",
        "*  pip will install all models and dependencies automatically."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "NvNr2svBM-p3"
      },
      "outputs": [],
      "source": [
        "!pip install tf-models-nightly"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "U-7qPCjWUAyy"
      },
      "source": [
        "### Import TensorFlow and other libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "lXsXev5MNr20"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "import numpy as np\n",
        "import tensorflow as tf\n",
        "\n",
        "from official.modeling import tf_utils\n",
        "from official.nlp import optimization\n",
        "from official.nlp.bert import configs as bert_configs\n",
        "from official.nlp.bert import tokenization\n",
        "from official.nlp.data import classifier_data_lib\n",
        "from official.nlp.modeling import losses\n",
        "from official.nlp.modeling import models\n",
        "from official.nlp.modeling import networks"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "C2drjD7OVCmh"
      },
      "source": [
        "## Preprocess the raw data and write TFRecord files"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "qfjcKj5FYQOp"
      },
      "source": [
        "### Introduction to the dataset\n",
        "\n",
        "The Microsoft Research Paraphrase Corpus (Dolan \u0026 Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the sentences in each pair are semantically equivalent.\n",
        "\n",
        "*   Number of labels: 2.\n",
        "*   Size of training dataset: 3668.\n",
        "*   Size of evaluation dataset: 408.\n",
        "*   Maximum sequence length of training and evaluation dataset: 128.\n",
        "*   For more details, see https://www.tensorflow.org/datasets/catalog/glue#gluemrpc"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "28DvUhC1YUiB"
      },
      "source": [
        "### Get dataset from TensorFlow Datasets (TFDS)\n",
        "\n",
        "In this example, we use the GLUE MRPC dataset from TFDS: https://www.tensorflow.org/datasets/catalog/glue#gluemrpc."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "4PhRLWh9jaXp"
      },
      "source": [
        "### Preprocess the data and write to TensorFlow record file\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "FhcMdzsrjWzG"
      },
      "outputs": [],
      "source": [
        "gs_folder_bert = \"gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12\"\n",
        "\n",
        "# Set up the tokenizer used to generate the TensorFlow dataset.\n",
        "tokenizer = tokenization.FullTokenizer(\n",
        "    vocab_file=os.path.join(gs_folder_bert, \"vocab.txt\"), do_lower_case=True)\n",
        "\n",
        "# Set up the processor used to generate the TensorFlow dataset.\n",
        "processor = classifier_data_lib.TfdsProcessor(\n",
        "    tfds_params=\"dataset=glue/mrpc,text_key=sentence1,text_b_key=sentence2\",\n",
        "    process_text_fn=tokenization.convert_to_unicode)\n",
        "\n",
        "# Set up output paths for the training and evaluation TFRecord files.\n",
        "train_data_output_path = \"./mrpc_train.tf_record\"\n",
        "eval_data_output_path = \"./mrpc_eval.tf_record\"\n",
        "\n",
        "# Generate and save the training and evaluation data as TFRecord files.\n",
        "input_meta_data = classifier_data_lib.generate_tf_record_from_data_file(\n",
        "    processor=processor,\n",
        "    data_dir=None,  # It is `None` because the data comes from TFDS, not a local dir.\n",
        "    tokenizer=tokenizer,\n",
        "    train_data_output_path=train_data_output_path,\n",
        "    eval_data_output_path=eval_data_output_path,\n",
        "    max_seq_length=128)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dbJ76vSJj77j"
      },
      "source": [
        "### Create tf.data datasets for training and evaluation\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "gCvaLLAxPuMc"
      },
      "outputs": [],
      "source": [
        "def create_classifier_dataset(file_path, seq_length, batch_size, is_training):\n",
        "  \"\"\"Creates input dataset from (tf)records files for train/eval.\"\"\"\n",
        "  dataset = tf.data.TFRecordDataset(file_path)\n",
        "  if is_training:\n",
        "    dataset = dataset.shuffle(100)\n",
        "    dataset = dataset.repeat()\n",
        "\n",
        "  def decode_record(record):\n",
        "    name_to_features = {\n",
        "      'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
        "      'input_mask': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
        "      'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64),\n",
        "      'label_ids': tf.io.FixedLenFeature([], tf.int64),\n",
        "    }\n",
        "    return tf.io.parse_single_example(record, name_to_features)\n",
        "\n",
        "  def _select_data_from_record(record):\n",
        "    x = {\n",
        "        'input_word_ids': record['input_ids'],\n",
        "        'input_mask': record['input_mask'],\n",
        "        'input_type_ids': record['segment_ids']\n",
        "    }\n",
        "    y = record['label_ids']\n",
        "    return (x, y)\n",
        "\n",
        "  dataset = dataset.map(decode_record,\n",
        "                        num_parallel_calls=tf.data.experimental.AUTOTUNE)\n",
        "  dataset = dataset.map(\n",
        "      _select_data_from_record,\n",
        "      num_parallel_calls=tf.data.experimental.AUTOTUNE)\n",
        "  dataset = dataset.batch(batch_size, drop_remainder=is_training)\n",
        "  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)\n",
        "  return dataset\n",
        "\n",
        "# Set up batch sizes\n",
        "batch_size = 32\n",
        "eval_batch_size = 32\n",
        "\n",
        "# Create the training and evaluation datasets.\n",
        "training_dataset = create_classifier_dataset(\n",
        "    train_data_output_path,\n",
        "    input_meta_data['max_seq_length'],\n",
        "    batch_size,\n",
        "    is_training=True)\n",
        "\n",
        "evaluation_dataset = create_classifier_dataset(\n",
        "    eval_data_output_path,\n",
        "    input_meta_data['max_seq_length'],\n",
        "    eval_batch_size,\n",
        "    is_training=False)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Efrj3Cn1kLAp"
      },
      "source": [
        "## Create, compile and train the model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "96ldxDSwkVkj"
      },
      "source": [
        "### Construct a BERT model\n",
        "\n",
        "Here, a BERT model is constructed from a JSON file of parameters. The `bert_config` defines the core BERT model, a Keras model that predicts outputs over *num_classes* classes from inputs with maximum sequence length *max_seq_length*."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Qgajw8WPYzJZ"
      },
      "outputs": [],
      "source": [
        "bert_config_file = os.path.join(gs_folder_bert, \"bert_config.json\")\n",
        "bert_config = bert_configs.BertConfig.from_json_file(bert_config_file)\n",
        "\n",
        "bert_encoder = networks.TransformerEncoder(vocab_size=bert_config.vocab_size,\n",
        "      hidden_size=bert_config.hidden_size,\n",
        "      num_layers=bert_config.num_hidden_layers,\n",
        "      num_attention_heads=bert_config.num_attention_heads,\n",
        "      intermediate_size=bert_config.intermediate_size,\n",
        "      activation=tf_utils.get_activation(bert_config.hidden_act),\n",
        "      dropout_rate=bert_config.hidden_dropout_prob,\n",
        "      attention_dropout_rate=bert_config.attention_probs_dropout_prob,\n",
        "      sequence_length=input_meta_data['max_seq_length'],\n",
        "      max_sequence_length=bert_config.max_position_embeddings,\n",
        "      type_vocab_size=bert_config.type_vocab_size,\n",
        "      embedding_width=bert_config.embedding_size,\n",
        "      initializer=tf.keras.initializers.TruncatedNormal(\n",
        "          stddev=bert_config.initializer_range))\n",
        "\n",
        "classifier_model = models.BertClassifier(\n",
        "        bert_encoder,\n",
        "        num_classes=input_meta_data['num_labels'],\n",
        "        dropout_rate=bert_config.hidden_dropout_prob,\n",
        "        initializer=tf.keras.initializers.TruncatedNormal(\n",
        "          stddev=bert_config.initializer_range))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "pkSq1wbNXBaa"
      },
      "source": [
        "### Initialize the encoder from a pretrained model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "X6N9NEqfXJCx"
      },
      "outputs": [],
      "source": [
        "checkpoint = tf.train.Checkpoint(model=bert_encoder)\n",
        "checkpoint.restore(\n",
        "    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "115caFLMk-_l"
      },
      "source": [
        "### Set up an optimizer for the model\n",
        "\n",
        "The BERT model adopts the Adam optimizer with weight decay.\n",
        "It also employs a learning rate schedule that first warms up from 0 and then decays to 0."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "2Hf2rpRXk89N"
      },
      "outputs": [],
      "source": [
        "# Set up epochs and steps\n",
        "epochs = 3\n",
        "train_data_size = input_meta_data['train_data_size']\n",
        "steps_per_epoch = int(train_data_size / batch_size)\n",
        "num_train_steps = steps_per_epoch * epochs\n",
        "warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)\n",
        "\n",
        "# Create a learning rate schedule that first warms up from 0 and then decays to 0.\n",
        "lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(\n",
        "      initial_learning_rate=2e-5,\n",
        "      decay_steps=num_train_steps,\n",
        "      end_learning_rate=0)\n",
        "lr_schedule = optimization.WarmUp(\n",
        "        initial_learning_rate=2e-5,\n",
        "        decay_schedule_fn=lr_schedule,\n",
        "        warmup_steps=warmup_steps)\n",
        "optimizer = optimization.AdamWeightDecay(\n",
        "        learning_rate=lr_schedule,\n",
        "        weight_decay_rate=0.01,\n",
        "        beta_1=0.9,\n",
        "        beta_2=0.999,\n",
        "        epsilon=1e-6,\n",
        "        exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "OTNcA0O0nSq9"
      },
      "source": [
        "### Define metric_fn and loss_fn\n",
        "\n",
        "The metric is accuracy, and we use sparse categorical cross-entropy as the loss."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "ELHjRp87nVNH"
      },
      "outputs": [],
      "source": [
        "def metric_fn():\n",
        "  return tf.keras.metrics.SparseCategoricalAccuracy(\n",
        "      'accuracy', dtype=tf.float32)\n",
        "\n",
        "def classification_loss_fn(labels, logits):\n",
        "  return losses.weighted_sparse_categorical_crossentropy_loss(\n",
        "    labels=labels, predictions=tf.nn.log_softmax(logits, axis=-1))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "78FEUOOEkoP0"
      },
      "source": [
        "### Compile and train the model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "nzi8hjeTQTRs"
      },
      "outputs": [],
      "source": [
        "classifier_model.compile(optimizer=optimizer,\n",
        "                         loss=classification_loss_fn,\n",
        "                         metrics=[metric_fn()])\n",
        "classifier_model.fit(\n",
        "      x=training_dataset,\n",
        "      validation_data=evaluation_dataset,\n",
        "      steps_per_epoch=steps_per_epoch,\n",
        "      epochs=epochs,\n",
        "      validation_steps=int(input_meta_data['eval_data_size'] / eval_batch_size))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fVo_AnT0l26j"
      },
      "source": [
        "### Save the model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Nl5x6nElZqkP"
      },
      "outputs": [],
      "source": [
        "classifier_model.save('./saved_model', include_optimizer=False, save_format='tf')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nWsE6yeyfW00"
      },
      "source": [
        "## Use the trained model to make predictions\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "vz7YJY2QYAjP"
      },
      "outputs": [],
      "source": [
        "eval_predictions = classifier_model.predict(evaluation_dataset)\n",
        "for prediction in eval_predictions:\n",
        "  print(\"Predicted label id: %s\" % np.argmax(prediction))"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "How-to Guide: Using a PIP package for fine-tuning a BERT model",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}