{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "80xnUmoI7fBX" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "8nvTnfs6Q692" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "WmfcMK5P5C1G" }, "source": [ "# Introduction to the TensorFlow Models NLP library" ] }, { "cell_type": "markdown", "metadata": { "id": "cH-oJ8R6AHMK" }, "source": [ "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", " \u003ctd\u003e\n", " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/tfmodels/nlp\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n", " \u003c/td\u003e\n", " \u003ctd\u003e\n", " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/docs/nlp/index.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", " \u003c/td\u003e\n", " \u003ctd\u003e\n", " \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/docs/nlp/index.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", " \u003c/td\u003e\n", " \u003ctd\u003e\n", " \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/models/docs/nlp/index.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n", " \u003c/td\u003e\n", "\u003c/table\u003e" ] }, { "cell_type": "markdown", "metadata": { "id": "0H_EFIhq4-MJ" }, "source": [ "## Learning objectives\n", "\n", "In this Colab notebook, you will learn how to build transformer-based models for common NLP tasks including pretraining, span labelling and classification using the building blocks from [NLP modeling library](https://github.com/tensorflow/models/tree/master/official/nlp/modeling)." ] }, { "cell_type": "markdown", "metadata": { "id": "2N97-dps_nUk" }, "source": [ "## Install and import" ] }, { "cell_type": "markdown", "metadata": { "id": "459ygAVl_rg0" }, "source": [ "### Install the TensorFlow Model Garden pip package\n", "\n", "* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n", "which is the nightly Model Garden package created daily automatically.\n", "* `pip` will install all models and dependencies automatically." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IAOmYthAzI7J" }, "outputs": [], "source": [ "!pip install -q opencv-python" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y-qGkdh6_sZc" }, "outputs": [], "source": [ "!pip install tf-models-official" ] }, { "cell_type": "markdown", "metadata": { "id": "e4huSSwyAG_5" }, "source": [ "### Import Tensorflow and other libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jqYXqtjBAJd9" }, "outputs": [], "source": [ "import numpy as np\n", "import tensorflow as tf\n", "\n", "from tensorflow_models import nlp" ] }, { "cell_type": "markdown", "metadata": { "id": "djBQWjvy-60Y" }, "source": [ "## BERT pretraining model\n", "\n", "BERT ([Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)) introduced the method of pre-training language representations on a large text corpus and then using that model for downstream NLP tasks.\n", "\n", "In this section, we will learn how to build a model to pretrain BERT on the masked language modeling task and next sentence prediction task. For simplicity, we only show the minimum example and use dummy data." ] }, { "cell_type": "markdown", "metadata": { "id": "MKuHVlsCHmiq" }, "source": [ "### Build a `BertPretrainer` model wrapping `BertEncoder`\n", "\n", "The `nlp.networks.BertEncoder` class implements the Transformer-based encoder as described in [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers (`nlp.layers.TransformerEncoderBlock`), but not the masked language model or classification task networks.\n", "\n", "The `nlp.models.BertPretrainer` class allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EXkcXz-9BwB3" }, "outputs": [], "source": [ "# Build a small transformer network.\n", "vocab_size = 100\n", "network = nlp.networks.BertEncoder(\n", " vocab_size=vocab_size, \n", " # The number of TransformerEncoderBlock layers\n", " num_layers=3)" ] }, { "cell_type": "markdown", "metadata": { "id": "0NH5irV5KTMS" }, "source": [ "Inspecting the encoder, we see it contains few embedding layers, stacked `nlp.layers.TransformerEncoderBlock` layers and are connected to three input layers:\n", "\n", "`input_word_ids`, `input_type_ids` and `input_mask`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lZNoZkBrIoff" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(network, show_shapes=True, expand_nested=True, dpi=48)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "o7eFOZXiIl-b" }, "outputs": [], "source": [ "# Create a BERT pretrainer with the created network.\n", "num_token_predictions = 8\n", "bert_pretrainer = nlp.models.BertPretrainer(\n", " network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')" ] }, { "cell_type": "markdown", "metadata": { "id": "d5h5HT7gNHx_" }, "source": [ "Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `nlp.layers.ClassificationHead` heads." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2tcNfm03IBF7" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, expand_nested=True, dpi=48)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "F2oHrXGUIS0M" }, "outputs": [], "source": [ "# We can feed some dummy data to get masked language model and sentence output.\n", "sequence_length = 16\n", "batch_size = 2\n", "\n", "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "masked_lm_positions_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n", "\n", "outputs = bert_pretrainer(\n", " [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n", "lm_output = outputs[\"masked_lm\"]\n", "sentence_output = outputs[\"classification\"]\n", "print(f'lm_output: shape={lm_output.shape}, dtype={lm_output.dtype!r}')\n", "print(f'sentence_output: shape={sentence_output.shape}, dtype={sentence_output.dtype!r}')" ] }, { "cell_type": "markdown", "metadata": { "id": "bnx3UCHniCS5" }, "source": [ "### Compute loss\n", "Next, we can use `lm_output` and `sentence_output` to compute `loss`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k30H4Q86f52x" }, "outputs": [], "source": [ "masked_lm_ids_data = np.random.randint(vocab_size, size=(batch_size, num_token_predictions))\n", "masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n", "next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n", "\n", "mlm_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n", " labels=masked_lm_ids_data,\n", " predictions=lm_output,\n", " weights=masked_lm_weights_data)\n", "sentence_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n", " labels=next_sentence_labels_data,\n", " predictions=sentence_output)\n", "loss = mlm_loss + sentence_loss\n", "\n", "print(loss)" ] }, { "cell_type": "markdown", "metadata": { "id": "wrmSs8GjHxVw" }, "source": [ "With the loss, you can optimize the model.\n", "After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/legacy/bert/run_pretraining.py) for the full example.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "k8cQVFvBCV4s" }, "source": [ "## Span labeling model\n", "\n", "Span labeling is the task to assign labels to a span of the text, for example, label a span of text as the answer of a given question.\n", "\n", "In this section, we will learn how to build a span labeling model. Again, we use dummy data for simplicity." 
] }, { "cell_type": "markdown", "metadata": { "id": "xrLLEWpfknUW" }, "source": [ "### Build a BertSpanLabeler wrapping BertEncoder\n", "\n", "The `nlp.models.BertSpanLabeler` class implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n", "\n", "Note that `nlp.models.BertSpanLabeler` wraps a `nlp.networks.BertEncoder`, the weights of which can be restored from the above pretraining model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "B941M4iUCejO" }, "outputs": [], "source": [ "network = nlp.networks.BertEncoder(\n", " vocab_size=vocab_size, num_layers=2)\n", "\n", "# Create a BERT trainer with the created network.\n", "bert_span_labeler = nlp.models.BertSpanLabeler(network)" ] }, { "cell_type": "markdown", "metadata": { "id": "QpB9pgj4PpMg" }, "source": [ "Inspecting the `bert_span_labeler`, we see it wraps the encoder with additional `SpanLabeling` that outputs `start_position` and `end_position`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RbqRNJCLJu4H" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, expand_nested=True, dpi=48)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fUf1vRxZJwio" }, "outputs": [], "source": [ "# Create a set of 2-dimensional data tensors to feed into the model.\n", "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "\n", "# Feed the data to the model.\n", "start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n", "\n", "print(f'start_logits: shape={start_logits.shape}, dtype={start_logits.dtype!r}')\n", "print(f'end_logits: shape={end_logits.shape}, dtype={end_logits.dtype!r}')" ] }, { "cell_type": "markdown", "metadata": { "id": "WqhgQaN1lt-G" }, "source": [ "### Compute loss\n", "With `start_logits` and `end_logits`, we can compute loss:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "waqs6azNl3Nn" }, "outputs": [], "source": [ "start_positions = np.random.randint(sequence_length, size=(batch_size))\n", "end_positions = np.random.randint(sequence_length, size=(batch_size))\n", "\n", "start_loss = tf.keras.losses.sparse_categorical_crossentropy(\n", " start_positions, start_logits, from_logits=True)\n", "end_loss = tf.keras.losses.sparse_categorical_crossentropy(\n", " end_positions, end_logits, from_logits=True)\n", "\n", "total_loss = (tf.reduce_mean(start_loss) + tf.reduce_mean(end_loss)) / 2\n", "print(total_loss)" ] }, { "cell_type": "markdown", "metadata": { "id": "Zdf03YtZmd_d" }, "source": [ "With the `loss`, you can optimize the model. Please see [run_squad.py](https://github.com/tensorflow/models/blob/master/official/legacy/bert/run_squad.py) for the full example." ] }, { "cell_type": "markdown", "metadata": { "id": "0A1XnGSTChg9" }, "source": [ "## Classification model\n", "\n", "In the last section, we show how to build a text classification model.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MSK8OpZgnQa9" }, "source": [ "### Build a BertClassifier model wrapping BertEncoder\n", "\n", "`nlp.models.BertClassifier` implements a [CLS] token classification model containing a single classification head." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cXXCsffkCphk" }, "outputs": [], "source": [ "network = nlp.networks.BertEncoder(\n", " vocab_size=vocab_size, num_layers=2)\n", "\n", "# Create a BERT trainer with the created network.\n", "num_classes = 2\n", "bert_classifier = nlp.models.BertClassifier(\n", " network, num_classes=num_classes)" ] }, { "cell_type": "markdown", "metadata": { "id": "8tZKueKYP4bB" }, "source": [ "Inspecting the `bert_classifier`, we see it wraps the `encoder` with additional `Classification` head." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "snlutm9ZJgEZ" }, "outputs": [], "source": [ "tf.keras.utils.plot_model(bert_classifier, show_shapes=True, expand_nested=True, dpi=48)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yyHPHsqBJkCz" }, "outputs": [], "source": [ "# Create a set of 2-dimensional data tensors to feed into the model.\n", "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "\n", "# Feed the data to the model.\n", "logits = bert_classifier([word_id_data, mask_data, type_id_data])\n", "print(f'logits: shape={logits.shape}, dtype={logits.dtype!r}')" ] }, { "cell_type": "markdown", "metadata": { "id": "w--a2mg4nzKm" }, "source": [ "### Compute loss\n", "\n", "With `logits`, we can compute `loss`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9X0S1DoFn_5Q" }, "outputs": [], "source": [ "labels = np.random.randint(num_classes, size=(batch_size))\n", "\n", "loss = tf.keras.losses.sparse_categorical_crossentropy(\n", " labels, logits, from_logits=True)\n", "print(loss)" ] }, { "cell_type": "markdown", "metadata": { "id": "mzBqOylZo3og" }, "source": [ "With the `loss`, you can optimize the model. Please see [run_classifier.py](https://github.com/tensorflow/models/blob/master/official/legacy/bert/run_classifier.py) or the [Fine tune_bert](https://www.tensorflow.org/text/tutorials/fine_tune_bert) notebook for the full example." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "nlp_modeling_library_intro.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }