Unverified Commit 44f6d511 authored by Srihari Humbarwadi, committed by GitHub

Merge branch 'tensorflow:master' into panoptic-deeplab

parents 686a287d 8bc5a1a5
...@@ -3,7 +3,8 @@
</div>
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg?style=plastic)](https://badge.fury.io/py/tensorflow)
[![tf-models-official PyPI](https://badge.fury.io/py/tf-models-official.svg)](https://badge.fury.io/py/tf-models-official)
# Welcome to the Model Garden for TensorFlow
...@@ -32,7 +33,8 @@ To install the current release of tensorflow-models, please follow any one of th
<details>
**tf-models-official** is the stable Model Garden package. Please check out the [releases](https://github.com/tensorflow/models/releases) to see which modules are available.
pip will install all models and dependencies automatically.
```shell
......
...@@ -19,7 +19,7 @@ This repository provides a curated list of the GitHub repositories with machine
| [ResNet 101](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet101) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50v1.5](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50v1_5) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference<br/>• FP32 Training | [Intel](https://github.com/IntelAI) |
| EfficientNet [v1](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v1) [v2](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v2) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) | • Automatic mixed precision<br/>• Horovod Multi-GPU training (NCCL)<br/>• Multi-node training on a Pyxis/Enroot Slurm cluster<br/>• XLA | [NVIDIA](https://github.com/NVIDIA) |
### Object Detection
......
...@@ -38,16 +38,15 @@ In the near future, we will add:
## Models and Implementations
### [Computer Vision](vision/README.md)
#### Image Classification
| Model | Reference (Paper) |
|-------|-------------------|
| [MNIST](legacy/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) |
| [ResNet](vision/MODEL_GARDEN.md) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
| [ResNet-RS](vision/MODEL_GARDEN.md) | [Revisiting ResNets: Improved Training and Scaling Strategies](https://arxiv.org/abs/2103.07579) |
| [EfficientNet](vision/MODEL_GARDEN.md) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
| [Vision Transformer](vision/MODEL_GARDEN.md) | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) |
#### Object Detection and Segmentation
...@@ -56,7 +55,6 @@ In the near future, we will add:
| Model | Reference (Paper) |
|-------|-------------------|
| [RetinaNet](vision/MODEL_GARDEN.md) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
| [Mask R-CNN](vision/MODEL_GARDEN.md) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
| [ShapeMask](legacy/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) |
| [SpineNet](vision/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
| [Cascade RCNN-RS and RetinaNet-RS](vision/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
...@@ -66,7 +64,7 @@ In the near future, we will add:
| Model | Reference (Paper) |
|-------|-------------------|
| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
### [Natural Language Processing](nlp/README.md)
| Model | Reference (Paper) |
|-------|-------------------|
...@@ -74,7 +72,6 @@ In the near future, we will add:
| [BERT (Bidirectional Encoder Representations from Transformers)](nlp/MODEL_GARDEN.md#available-model-configs) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| [NHNet (News Headline generation model)](projects/nhnet) | [Generating Representative Headlines for News Stories](https://arxiv.org/abs/2001.09386) |
| [Transformer](nlp/MODEL_GARDEN.md#available-model-configs) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| [XLNet](nlp/xlnet) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
| [MobileBERT](projects/mobilebert) | [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) |
### Recommendation
......
...@@ -34,14 +34,10 @@
{
"cell_type": "markdown",
"metadata": {
"id": "2X-XaMSVcLua"
},
"source": [
"# Decoding API"
]
},
{
...@@ -66,6 +62,30 @@
"\u003c/table\u003e"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fsACVQpVSifi"
},
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n",
"which is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G4BhAu01HZcM"
},
"outputs": [],
"source": [
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
...@@ -74,7 +94,7 @@
},
"outputs": [],
"source": [
"!pip install tf-models-official"
]
},
{
...@@ -92,9 +112,20 @@
"\n",
"import tensorflow as tf\n",
"\n",
"from tensorflow_models import nlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T92ccAzlnGqh"
},
"outputs": [],
"source": [
"def length_norm(length, dtype):\n",
" \"\"\"Return length normalization factor.\"\"\"\n",
" return tf.pow(((5. + tf.cast(length, dtype)) / 6.), 0.0)"
]
},
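The `length_norm` closure above raises the GNMT-style length penalty `((5 + length) / 6) ** alpha` to the power `0.0`, so it always returns `1.0`, i.e. length normalization is effectively disabled. A plain-Python sketch (the `alpha` parameter here is our illustrative addition, not part of the notebook's closure) makes this visible:

```python
def length_norm(length, alpha=0.0):
    # GNMT-style length penalty; alpha=0.0 reproduces the notebook's
    # closure, which always yields 1.0 (normalization disabled).
    return ((5.0 + length) / 6.0) ** alpha

print(length_norm(4))                       # 1.0: no penalty applied
print(round(length_norm(4, alpha=0.6), 4))  # 1.2754: penalty grows with length
```

Raising `alpha` above zero makes the normalization factor grow with sequence length, which counteracts beam search's bias toward short hypotheses.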
{
...@@ -103,7 +134,8 @@
"id": "0AWgyo-IQ5sP"
},
"source": [
"## Overview\n",
"\n",
"This API provides an interface to experiment with different decoding strategies used for auto-regressive models.\n",
"\n",
"1. The following sampling strategies are provided in sampling_module.py, which inherits from the base Decoding class:\n",
...@@ -182,7 +214,7 @@
"id": "lV1RRp6ihnGX"
},
"source": [
"## Initialize the Model Hyper-parameters"
]
},
{
...@@ -193,44 +225,32 @@
},
"outputs": [],
"source": [
"params = {\n",
"    'num_heads': 2,\n",
"    'num_layers': 2,\n",
"    'batch_size': 2,\n",
"    'n_dims': 256,\n",
"    'max_decode_length': 4}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CYXkoplAij01"
},
"source": [
"## Initialize cache"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UGvmd0_dRFYI"
},
"source": [
"In auto-regressive architectures like Transformer-based [Encoder-Decoder](https://arxiv.org/abs/1706.03762) models,\n",
"the cache is used for fast sequential decoding.\n",
"It is a nested dictionary storing pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks) for every layer."
]
},
{
...@@ -243,35 +263,15 @@
"source": [
"cache = {\n",
"    'layer_%d' % layer: {\n",
"        'k': tf.zeros(\n",
"            shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
"            dtype=tf.float32),\n",
"        'v': tf.zeros(\n",
"            shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
"            dtype=tf.float32)\n",
"    } for layer in range(params['num_layers'])\n",
"}\n",
"print(\"cache key shape for layer 1 :\", cache['layer_1']['k'].shape)"
]
},
{
...@@ -280,15 +280,14 @@
"id": "syl7I5nURPgW"
},
"source": [
"### Create model_fn\n",
" In practice, this will be replaced by an actual model implementation such as [here](https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py#L236)\n",
"```\n",
"Args:\n",
"i : Step that is being decoded.\n",
"Returns:\n",
"  logit probabilities of size [batch_size, 1, vocab_size]\n",
"```\n"
]
},
{
...@@ -307,15 +306,6 @@
"  return probabilities[:, i, :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DBMUkaVmVZBg"
},
"source": [
"# Initialize symbols_to_logits_fn\n"
]
},
{
"cell_type": "code",
"execution_count": null,
...@@ -339,7 +329,7 @@
"id": "R_tV3jyWVL47"
},
"source": [
"## Greedy\n",
"Greedy decoding selects the token id with the highest probability as its next id: $id_t = argmax_{id}P(id | id_{1:t-1})$ at each timestep $t$. The following sketch shows greedy decoding."
]
},
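The greedy rule described above can be sketched in a few lines of NumPy, independent of the notebook's `sampling_module` (the `greedy_step` helper is our illustrative name, not part of that API):

```python
import numpy as np

def greedy_step(logits):
    """Select the highest-probability token id for each batch element."""
    return np.argmax(logits, axis=-1)

# Toy batch of 2 sequences over a 4-token vocabulary.
logits = np.array([[0.1, 2.0, 0.3, 0.4],
                   [1.5, 0.2, 0.1, 0.2]])
print(greedy_step(logits))  # [1 0]
```

Because the argmax is deterministic, greedy decoding always returns the same sequence for the same model and input.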
...@@ -370,7 +360,7 @@
"id": "s4pTTsQXVz5O"
},
"source": [
"## top_k sampling\n",
"In *Top-K* sampling, the *K* most likely next token ids are filtered and the probability mass is redistributed among only those *K* ids."
]
},
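The filtering step described above can be sketched in plain NumPy (a minimal stand-in for what `sampling_module` does internally; `top_k_filter` is our illustrative name):

```python
import numpy as np

def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest to -inf, so a
    subsequent softmax redistributes all mass among the top k ids."""
    threshold = np.sort(logits)[-k]
    return np.where(logits < threshold, -np.inf, logits)

logits = np.array([3.0, 1.0, 0.5, 2.0])
filtered = top_k_filter(logits, k=2)
# Softmax over the filtered logits: masked ids get exactly zero probability.
probs = np.exp(filtered - filtered.max())
probs /= probs.sum()
print(probs.round(3))  # only ids 0 and 3 keep nonzero probability
```

Sampling then draws from `probs`, so low-probability tail tokens can never be selected.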
...@@ -404,7 +394,7 @@
"id": "Jp3G-eE_WI4Y"
},
"source": [
"## top_p sampling\n",
"Instead of sampling only from the most likely *K* token ids, *Top-p* sampling chooses from the smallest possible set of ids whose cumulative probability exceeds the probability *p*."
]
},
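The nucleus-selection step described above can be sketched with NumPy (an illustrative helper under our own name, not the `sampling_module` implementation):

```python
import numpy as np

def top_p_ids(probs, p):
    """Smallest set of token ids whose cumulative probability exceeds p."""
    order = np.argsort(probs)[::-1]       # most likely ids first
    cum = np.cumsum(probs[order])
    cutoff = int(np.argmax(cum > p)) + 1  # first position where cum exceeds p
    return order[:cutoff]

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_ids(probs, p=0.7))  # [0 1], since 0.5 + 0.3 = 0.8 > 0.7
```

Unlike Top-K, the number of candidate ids adapts to the shape of the distribution: a peaked distribution yields a small nucleus, a flat one a large nucleus.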
...@@ -438,7 +428,7 @@
"id": "2hcuyJ2VWjDz"
},
"source": [
"## Beam search decoding\n",
"Beam search reduces the risk of missing hidden high-probability token ids by keeping the most likely num_beams hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability."
]
},
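The expand-then-prune loop described above can be sketched in pure Python. This is a toy under a strong simplifying assumption: the per-step log-probabilities are a fixed table, whereas a real model conditions each step on the prefix (the `beam_search` name and signature are ours, not the library's):

```python
import numpy as np

def beam_search(step_log_probs, num_beams):
    """Expand every beam by every token, keep the num_beams best.

    step_log_probs: [num_steps, vocab_size] table of log-probs,
    assumed independent of the decoded prefix (toy setting only).
    """
    beams = [((), 0.0)]  # (token ids, cumulative log-prob)
    for log_probs in step_log_probs:
        candidates = [(ids + (tok,), score + lp)
                      for ids, score in beams
                      for tok, lp in enumerate(log_probs)]
        # Prune: keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams[0][0]

table = np.log(np.array([[0.6, 0.4], [0.3, 0.7]]))
print(beam_search(table, num_beams=2))  # (0, 1)
```

Note that greedy decoding would also pick (0, 1) here; beam search differs when a locally suboptimal first token leads to a globally better sequence, which the pruning loop can still recover as long as that prefix survives in the beam.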
......
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Customizing a Transformer Encoder",
"private_outputs": true,
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"cells": [
{
"cell_type": "markdown",
...@@ -26,10 +11,12 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "rxPj2Lsni9O4"
},
"outputs": [],
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
...@@ -42,9 +29,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
...@@ -61,20 +46,20 @@
"id": "Mwb9uw1cDXsa"
},
"source": [
"\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/official_models/nlp/customize_encoder\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/models/blob/master/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/models/blob/master/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"  \u003ctd\u003e\n",
"    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/models/official/colab/nlp/customize_encoder.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
"  \u003c/td\u003e\n",
"\u003c/table\u003e"
]
},
{
...@@ -87,7 +72,7 @@
"\n",
"The [TensorFlow Models NLP library](https://github.com/tensorflow/models/tree/master/official/nlp/modeling) is a collection of tools for building and training modern high performance natural language models.\n",
"\n",
"The `tfm.nlp.networks.EncoderScaffold` is the core of this library, and many new network architectures have been proposed to improve the encoder. In this Colab notebook, we will learn how to customize the encoder to employ new network architectures."
]
},
{
...@@ -114,14 +99,27 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mfHI5JyuJ1y9"
},
"outputs": [],
"source": [
"# Uninstall colab's opencv-python, it conflicts with `opencv-python-headless`\n",
"# which is installed by tf-models-official\n",
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "thsKZDjhswhR"
},
"outputs": [],
"source": [
"!pip install -q tf-models-nightly"
]
},
{
"cell_type": "markdown",
...@@ -134,19 +132,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "my4dp-RMssQe"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"import tensorflow_models as tfm\n",
"from tensorflow_models import nlp"
]
},
{
"cell_type": "markdown",
...@@ -156,14 +153,16 @@
"source": [
"## Canonical BERT encoder\n",
"\n",
"Before learning how to customize the encoder, let's first create a canonical BERT encoder and use it to instantiate a `bert_classifier.BertClassifier` for a classification task."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Oav8sbgstWc-"
},
"outputs": [],
"source": [
"cfg = {\n",
"    \"vocab_size\": 100,\n",
...@@ -171,22 +170,20 @@
"    \"num_layers\": 3,\n",
"    \"num_attention_heads\": 4,\n",
"    \"intermediate_size\": 64,\n",
"    \"activation\": tfm.utils.activations.gelu,\n",
"    \"dropout_rate\": 0.1,\n",
"    \"attention_dropout_rate\": 0.1,\n",
"    \"max_sequence_length\": 16,\n",
"    \"type_vocab_size\": 2,\n",
"    \"initializer\": tf.keras.initializers.TruncatedNormal(stddev=0.02),\n",
"}\n",
"bert_encoder = nlp.networks.BertEncoder(**cfg)\n",
"\n",
"def build_classifier(bert_encoder):\n",
"  return nlp.models.BertClassifier(bert_encoder, num_classes=2)\n",
"\n",
"canonical_classifier_model = build_classifier(bert_encoder)"
]
},
{
"cell_type": "markdown",
...@@ -201,9 +198,11 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "csED2d-Yt5h6"
},
"outputs": [],
"source": [
"def predict(model):\n",
"  batch_size = 3\n",
...@@ -216,9 +215,7 @@
"  print(model([word_ids, mask, type_ids], training=False))\n",
"\n",
"predict(canonical_classifier_model)"
]
},
{
"cell_type": "markdown",
...@@ -249,7 +246,7 @@
"source": [
"### Use EncoderScaffold\n",
"\n",
"`networks.EncoderScaffold` allows users to provide a custom embedding subnetwork\n",
" (which will replace the standard embedding logic) and/or a custom hidden layer class (which will replace the `Transformer` instantiation in the encoder)."
]
},
...@@ -261,30 +258,32 @@
"source": [
"#### Without Customization\n",
"\n",
"Without any customization, `networks.EncoderScaffold` behaves the same as the canonical `networks.BertEncoder`.\n",
"\n",
"As shown in the following example, `networks.EncoderScaffold` can load `networks.BertEncoder`'s weights and output the same values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ktNzKuVByZQf"
},
"outputs": [],
"source": [
"default_hidden_cfg = dict(\n",
"    num_attention_heads=cfg[\"num_attention_heads\"],\n",
"    intermediate_size=cfg[\"intermediate_size\"],\n",
"    intermediate_activation=cfg[\"activation\"],\n",
"    dropout_rate=cfg[\"dropout_rate\"],\n",
"    attention_dropout_rate=cfg[\"attention_dropout_rate\"],\n",
"    kernel_initializer=cfg[\"initializer\"],\n",
")\n",
"default_embedding_cfg = dict(\n",
"    vocab_size=cfg[\"vocab_size\"],\n",
"    type_vocab_size=cfg[\"type_vocab_size\"],\n",
"    hidden_size=cfg[\"hidden_size\"],\n",
"    initializer=cfg[\"initializer\"],\n",
"    dropout_rate=cfg[\"dropout_rate\"],\n",
"    max_seq_length=cfg[\"max_sequence_length\"]\n",
")\n",
...@@ -294,17 +293,15 @@
"    num_hidden_instances=cfg[\"num_layers\"],\n",
"    pooled_output_dim=cfg[\"hidden_size\"],\n",
"    return_all_layer_outputs=True,\n",
"    pooler_layer_initializer=cfg[\"initializer\"],\n",
")\n",
"\n",
"encoder_scaffold = nlp.networks.EncoderScaffold(**default_kwargs)\n",
"classifier_model_from_encoder_scaffold = build_classifier(encoder_scaffold)\n",
"classifier_model_from_encoder_scaffold.set_weights(\n",
"    canonical_classifier_model.get_weights())\n",
"predict(classifier_model_from_encoder_scaffold)"
]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -321,26 +318,26 @@ ...@@ -321,26 +318,26 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "LTinnaG6vcsw" "id": "LTinnaG6vcsw"
}, },
"outputs": [],
"source": [ "source": [
"word_ids = tf.keras.layers.Input(\n", "word_ids = tf.keras.layers.Input(\n",
" shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_word_ids\")\n", " shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_word_ids\")\n",
"mask = tf.keras.layers.Input(\n", "mask = tf.keras.layers.Input(\n",
" shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_mask\")\n", " shape=(cfg['max_sequence_length'],), dtype=tf.int32, name=\"input_mask\")\n",
"embedding_layer = modeling.layers.OnDeviceEmbedding(\n", "embedding_layer = nlp.layers.OnDeviceEmbedding(\n",
" vocab_size=cfg['vocab_size'],\n", " vocab_size=cfg['vocab_size'],\n",
" embedding_width=cfg['hidden_size'],\n", " embedding_width=cfg['hidden_size'],\n",
" initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),\n", " initializer=cfg[\"initializer\"],\n",
" name=\"word_embeddings\")\n", " name=\"word_embeddings\")\n",
"word_embeddings = embedding_layer(word_ids)\n", "word_embeddings = embedding_layer(word_ids)\n",
"attention_mask = layers.SelfAttentionMask()([word_embeddings, mask])\n", "attention_mask = nlp.layers.SelfAttentionMask()([word_embeddings, mask])\n",
"new_embedding_network = tf.keras.Model([word_ids, mask],\n", "new_embedding_network = tf.keras.Model([word_ids, mask],\n",
" [word_embeddings, attention_mask])" " [word_embeddings, attention_mask])"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -354,14 +351,14 @@ ...@@ -354,14 +351,14 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "fO9zKFE4OpHp" "id": "fO9zKFE4OpHp"
}, },
"outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(new_embedding_network, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(new_embedding_network, show_shapes=True, dpi=48)"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -374,9 +371,11 @@ ...@@ -374,9 +371,11 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "mtFDMNf2vIl9" "id": "mtFDMNf2vIl9"
}, },
"outputs": [],
"source": [ "source": [
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"\n", "\n",
...@@ -384,16 +383,14 @@ ...@@ -384,16 +383,14 @@
"kwargs['embedding_cls'] = new_embedding_network\n", "kwargs['embedding_cls'] = new_embedding_network\n",
"kwargs['embedding_data'] = embedding_layer.embeddings\n", "kwargs['embedding_data'] = embedding_layer.embeddings\n",
"\n", "\n",
"encoder_with_customized_embedding = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder_with_customized_embedding = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder_with_customized_embedding)\n", "classifier_model = build_classifier(encoder_with_customized_embedding)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"print(classifier_model.inputs)\n", "print(classifier_model.inputs)\n",
"\n", "\n",
"# Assert that there are only two inputs.\n", "# Assert that there are only two inputs.\n",
"assert len(classifier_model.inputs) == 2" "assert len(classifier_model.inputs) == 2"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -403,34 +400,34 @@ ...@@ -403,34 +400,34 @@
"source": [ "source": [
"#### Customized Transformer\n", "#### Customized Transformer\n",
"\n", "\n",
"Users can also override the [hidden_cls](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/encoder_scaffold.py#L103) argument in `EncoderScaffold`'s constructor to employ a customized Transformer layer.\n", "Users can also override the `hidden_cls` argument in `networks.EncoderScaffold`'s constructor to employ a customized Transformer layer.\n",
"\n", "\n",
"See [ReZeroTransformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/rezero_transformer.py) for how to implement a customized Transformer layer.\n", "See [the source of `nlp.layers.ReZeroTransformer`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/rezero_transformer.py) for how to implement a customized Transformer layer.\n",
"\n", "\n",
"Following is an example of using `ReZeroTransformer`:\n" "Following is an example of using `nlp.layers.ReZeroTransformer`:\n"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "uAIarLZgw6pA" "id": "uAIarLZgw6pA"
}, },
"outputs": [],
"source": [ "source": [
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"\n", "\n",
"# Use ReZeroTransformer.\n", "# Use ReZeroTransformer.\n",
"kwargs['hidden_cls'] = modeling.layers.ReZeroTransformer\n", "kwargs['hidden_cls'] = nlp.layers.ReZeroTransformer\n",
"\n", "\n",
"encoder_with_rezero_transformer = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder_with_rezero_transformer = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder_with_rezero_transformer)\n", "classifier_model = build_classifier(encoder_with_rezero_transformer)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)\n", "predict(classifier_model)\n",
"\n", "\n",
"# Assert that the variable `rezero_alpha` from ReZeroTransformer exists.\n", "# Assert that the variable `rezero_alpha` from ReZeroTransformer exists.\n",
"assert 'rezero_alpha' in ''.join([x.name for x in classifier_model.trainable_weights])" "assert 'rezero_alpha' in ''.join([x.name for x in classifier_model.trainable_weights])"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -438,10 +435,9 @@ ...@@ -438,10 +435,9 @@
"id": "6PMHFdvnxvR0" "id": "6PMHFdvnxvR0"
}, },
"source": [ "source": [
"### Use [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py)\n", "### Use `nlp.layers.TransformerScaffold`\n",
"\n", "\n",
"The above method of customizing `Transformer` requires rewriting the whole `Transformer` layer, while sometimes you may only want to customize either the attention layer or the feedforward block. In this case, [TransformerScaffold](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py) can be used.\n", "The above method of customizing the model requires rewriting the whole `nlp.layers.Transformer` layer, while sometimes you may only want to customize either the attention layer or the feedforward block. In this case, `nlp.layers.TransformerScaffold` can be used.\n"
"\n"
] ]
}, },
{ {
...@@ -452,37 +448,48 @@ ...@@ -452,37 +448,48 @@
"source": [ "source": [
"#### Customize Attention Layer\n", "#### Customize Attention Layer\n",
"\n", "\n",
"Users can also override the [attention_cls](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_scaffold.py#L45) argument in `TransformerScaffold`'s constructor to employ a customized Attention layer.\n", "Users can also override the `attention_cls` argument in `layers.TransformerScaffold`'s constructor to employ a customized Attention layer.\n",
"\n", "\n",
"See [TalkingHeadsAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) for how to implement a customized `Attention` layer.\n", "See [the source of `nlp.layers.TalkingHeadsAttention`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py) for how to implement a customized `Attention` layer.\n",
"\n", "\n",
"Following is an example of using [TalkingHeadsAttention](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/talking_heads_attention.py):" "Following is an example of using `nlp.layers.TalkingHeadsAttention`:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "nFrSMrZuyNeQ" "id": "nFrSMrZuyNeQ"
}, },
"outputs": [],
"source": [ "source": [
"# Use TalkingHeadsAttention\n", "# Use TalkingHeadsAttention\n",
"hidden_cfg = dict(default_hidden_cfg)\n", "hidden_cfg = dict(default_hidden_cfg)\n",
"hidden_cfg['attention_cls'] = modeling.layers.TalkingHeadsAttention\n", "hidden_cfg['attention_cls'] = nlp.layers.TalkingHeadsAttention\n",
"\n", "\n",
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"kwargs['hidden_cls'] = modeling.layers.TransformerScaffold\n", "kwargs['hidden_cls'] = nlp.layers.TransformerScaffold\n",
"kwargs['hidden_cfg'] = hidden_cfg\n", "kwargs['hidden_cfg'] = hidden_cfg\n",
"\n", "\n",
"encoder = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder)\n", "classifier_model = build_classifier(encoder)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)\n", "predict(classifier_model)\n",
"\n", "\n",
"# Assert that the variable `pre_softmax_weight` from TalkingHeadsAttention exists.\n", "# Assert that the variable `pre_softmax_weight` from TalkingHeadsAttention exists.\n",
"assert 'pre_softmax_weight' in ''.join([x.name for x in classifier_model.trainable_weights])" "assert 'pre_softmax_weight' in ''.join([x.name for x in classifier_model.trainable_weights])"
], ]
},
{
"cell_type": "code",
"execution_count": null, "execution_count": null,
"outputs": [] "metadata": {
"id": "tKkZ8spzYmpc"
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(encoder_with_rezero_transformer, show_shapes=True, dpi=48)"
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -494,35 +501,35 @@ ...@@ -494,35 +501,35 @@
"\n", "\n",
"Similarly, one could also customize the feedforward layer.\n", "Similarly, one could also customize the feedforward layer.\n",
"\n", "\n",
"See [GatedFeedforward](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py) for how to implement a customized feedforward layer.\n", "See [the source of `nlp.layers.GatedFeedforward`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py) for how to implement a customized feedforward layer.\n",
"\n", "\n",
"Following is an example of using [GatedFeedforward](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/gated_feedforward.py)." "Following is an example of using `nlp.layers.GatedFeedforward`:"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "XAbKy_l4y_-i" "id": "XAbKy_l4y_-i"
}, },
"outputs": [],
"source": [ "source": [
"# Use TalkingHeadsAttention\n", "# Use GatedFeedforward\n",
"hidden_cfg = dict(default_hidden_cfg)\n", "hidden_cfg = dict(default_hidden_cfg)\n",
"hidden_cfg['feedforward_cls'] = modeling.layers.GatedFeedforward\n", "hidden_cfg['feedforward_cls'] = nlp.layers.GatedFeedforward\n",
"\n", "\n",
"kwargs = dict(default_kwargs)\n", "kwargs = dict(default_kwargs)\n",
"kwargs['hidden_cls'] = modeling.layers.TransformerScaffold\n", "kwargs['hidden_cls'] = nlp.layers.TransformerScaffold\n",
"kwargs['hidden_cfg'] = hidden_cfg\n", "kwargs['hidden_cfg'] = hidden_cfg\n",
"\n", "\n",
"encoder_with_gated_feedforward = modeling.networks.EncoderScaffold(**kwargs)\n", "encoder_with_gated_feedforward = nlp.networks.EncoderScaffold(**kwargs)\n",
"classifier_model = build_classifier(encoder_with_gated_feedforward)\n", "classifier_model = build_classifier(encoder_with_gated_feedforward)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)\n", "predict(classifier_model)\n",
"\n", "\n",
"# Assert that the variable `gate` from GatedFeedforward exists.\n", "# Assert that the variable `gate` from GatedFeedforward exists.\n",
"assert 'gate' in ''.join([x.name for x in classifier_model.trainable_weights])" "assert 'gate' in ''.join([x.name for x in classifier_model.trainable_weights])"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -530,26 +537,28 @@ ...@@ -530,26 +537,28 @@
"id": "a_8NWUhkzeAq" "id": "a_8NWUhkzeAq"
}, },
"source": [ "source": [
"### Build a new Encoder using building blocks from KerasBERT.\n", "### Build a new Encoder\n",
"\n", "\n",
"Finally, you could also build a new encoder using building blocks in the modeling library.\n", "Finally, you could also build a new encoder using building blocks in the modeling library.\n",
"\n", "\n",
"See [AlbertEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/albert_encoder.py) as an example:\n" "See [the source for `nlp.networks.AlbertEncoder`](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/albert_encoder.py) as an example of how to do this.\n",
"\n",
"Here is an example using `nlp.networks.AlbertEncoder`:\n"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "xsiA3RzUzmUM" "id": "xsiA3RzUzmUM"
}, },
"outputs": [],
"source": [ "source": [
"albert_encoder = modeling.networks.AlbertEncoder(**cfg)\n", "albert_encoder = nlp.networks.AlbertEncoder(**cfg)\n",
"classifier_model = build_classifier(albert_encoder)\n", "classifier_model = build_classifier(albert_encoder)\n",
"# ... Train the model ...\n", "# ... Train the model ...\n",
"predict(classifier_model)" "predict(classifier_model)"
], ]
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
...@@ -562,14 +571,28 @@ ...@@ -562,14 +571,28 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": { "metadata": {
"id": "Uv_juT22HERW" "id": "Uv_juT22HERW"
}, },
"outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(albert_encoder, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(albert_encoder, show_shapes=True, dpi=48)"
]
}
], ],
"execution_count": null, "metadata": {
"outputs": [] "colab": {
"collapsed_sections": [],
"name": "customize_encoder.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
} }
] },
"nbformat": 4,
"nbformat_minor": 0
} }
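The notebook's customization hooks (`hidden_cls`/`hidden_cfg`, `attention_cls`, `feedforward_cls`) all rely on the same idea: the scaffold receives a layer *class* plus its constructor kwargs and instantiates it once per hidden layer, so swapping one argument swaps every block. A minimal, framework-free sketch of that injection pattern (all class and argument names below are illustrative stand-ins, not the Model Garden API):

```python
class PlainBlock:
    """Stand-in for a stock Transformer block."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def __call__(self, x):
        return [v * self.scale for v in x]


class ReZeroBlock(PlainBlock):
    """Stand-in for a ReZero-style block: output = x + alpha * f(x),
    with the gate `alpha` initialized to zero (so it starts as identity)."""

    def __init__(self, scale=1.0):
        super().__init__(scale)
        self.alpha = 0.0  # would be learnable in a real model

    def __call__(self, x):
        return [v + self.alpha * (v * self.scale) for v in x]


class Scaffold:
    """Builds `num_hidden_instances` copies of `hidden_cls(**hidden_cfg)`,
    mirroring how EncoderScaffold stacks its hidden layers."""

    def __init__(self, num_hidden_instances, hidden_cls=PlainBlock, hidden_cfg=None):
        cfg = hidden_cfg or {}
        self.blocks = [hidden_cls(**cfg) for _ in range(num_hidden_instances)]

    def __call__(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# Swapping the block type is a one-argument change, as in the notebook.
encoder = Scaffold(num_hidden_instances=2,
                   hidden_cls=ReZeroBlock,
                   hidden_cfg={"scale": 2.0})
out = encoder([1.0, 2.0])
```

With `alpha` initialized to zero, each `ReZeroBlock` is the identity, so `out` equals the input; that identity-at-init behavior is the point of the ReZero design.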
...@@ -95,6 +95,19 @@ ...@@ -95,6 +95,19 @@
"* `pip` will install all models and dependencies automatically." "* `pip` will install all models and dependencies automatically."
] ]
}, },
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IAOmYthAzI7J"
},
"outputs": [],
"source": [
"# Uninstall colab's opencv-python, it conflicts with `opencv-python-headless`\n",
"# which is installed by tf-models-official\n",
"!pip uninstall -y opencv-python"
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
...@@ -103,7 +116,7 @@ ...@@ -103,7 +116,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"!pip install -q tf-models-official==2.4.0" "!pip install tf-models-official"
] ]
}, },
{ {
...@@ -126,8 +139,7 @@ ...@@ -126,8 +139,7 @@
"import numpy as np\n", "import numpy as np\n",
"import tensorflow as tf\n", "import tensorflow as tf\n",
"\n", "\n",
"from official.nlp import modeling\n", "from tensorflow_models import nlp"
"from official.nlp.modeling import layers, losses, models, networks"
] ]
}, },
{ {
...@@ -151,9 +163,9 @@ ...@@ -151,9 +163,9 @@
"source": [ "source": [
"### Build a `BertPretrainer` model wrapping `BertEncoder`\n", "### Build a `BertPretrainer` model wrapping `BertEncoder`\n",
"\n", "\n",
"The [BertEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/bert_encoder.py) implements the Transformer-based encoder as described in the [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.\n", "The `nlp.networks.BertEncoder` class implements the Transformer-based encoder as described in the [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers (`nlp.layers.TransformerEncoderBlock`), but not the masked language model or classification task networks.\n",
"\n", "\n",
"The [BertPretrainer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_pretrainer.py) allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives." "The `nlp.models.BertPretrainer` class allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives."
] ]
}, },
{ {
...@@ -166,9 +178,10 @@ ...@@ -166,9 +178,10 @@
"source": [ "source": [
"# Build a small transformer network.\n", "# Build a small transformer network.\n",
"vocab_size = 100\n", "vocab_size = 100\n",
"sequence_length = 16\n", "network = nlp.networks.BertEncoder(\n",
"network = modeling.networks.BertEncoder(\n", " vocab_size=vocab_size, \n",
" vocab_size=vocab_size, num_layers=2, sequence_length=16)" " # The number of TransformerEncoderBlock layers\n",
" num_layers=3)"
] ]
}, },
{ {
...@@ -177,7 +190,7 @@ ...@@ -177,7 +190,7 @@
"id": "0NH5irV5KTMS" "id": "0NH5irV5KTMS"
}, },
"source": [ "source": [
"Inspecting the encoder, we see that it contains a few embedding layers and stacked `Transformer` layers, connected to three input layers:\n", "Inspecting the encoder, we see that it contains a few embedding layers and stacked `nlp.layers.TransformerEncoderBlock` layers, connected to three input layers:\n",
"\n", "\n",
"`input_word_ids`, `input_type_ids` and `input_mask`.\n" "`input_word_ids`, `input_type_ids` and `input_mask`.\n"
] ]
...@@ -190,7 +203,7 @@ ...@@ -190,7 +203,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(network, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(network, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -203,7 +216,7 @@ ...@@ -203,7 +216,7 @@
"source": [ "source": [
"# Create a BERT pretrainer with the created network.\n", "# Create a BERT pretrainer with the created network.\n",
"num_token_predictions = 8\n", "num_token_predictions = 8\n",
"bert_pretrainer = modeling.models.BertPretrainer(\n", "bert_pretrainer = nlp.models.BertPretrainer(\n",
" network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')" " network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')"
] ]
}, },
...@@ -213,7 +226,7 @@ ...@@ -213,7 +226,7 @@
"id": "d5h5HT7gNHx_" "id": "d5h5HT7gNHx_"
}, },
"source": [ "source": [
"Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `Classification` heads." "Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `nlp.layers.ClassificationHead` heads."
] ]
}, },
{ {
...@@ -224,7 +237,7 @@ ...@@ -224,7 +237,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -236,7 +249,9 @@ ...@@ -236,7 +249,9 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# We can feed some dummy data to get masked language model and sentence output.\n", "# We can feed some dummy data to get masked language model and sentence output.\n",
"sequence_length = 16\n",
"batch_size = 2\n", "batch_size = 2\n",
"\n",
"word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n", "word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n",
"mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
"type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n", "type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
...@@ -246,8 +261,8 @@ ...@@ -246,8 +261,8 @@
" [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n", " [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n",
"lm_output = outputs[\"masked_lm\"]\n", "lm_output = outputs[\"masked_lm\"]\n",
"sentence_output = outputs[\"classification\"]\n", "sentence_output = outputs[\"classification\"]\n",
"print(lm_output)\n", "print(f'lm_output: shape={lm_output.shape}, dtype={lm_output.dtype!r}')\n",
"print(sentence_output)" "print(f'sentence_output: shape={sentence_output.shape}, dtype={sentence_output.dtype!r}')"
] ]
}, },
{ {
...@@ -272,14 +287,15 @@ ...@@ -272,14 +287,15 @@
"masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n", "masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n",
"next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n", "next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n",
"\n", "\n",
"mlm_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", "mlm_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=masked_lm_ids_data,\n", " labels=masked_lm_ids_data,\n",
" predictions=lm_output,\n", " predictions=lm_output,\n",
" weights=masked_lm_weights_data)\n", " weights=masked_lm_weights_data)\n",
"sentence_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n", "sentence_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=next_sentence_labels_data,\n", " labels=next_sentence_labels_data,\n",
" predictions=sentence_output)\n", " predictions=sentence_output)\n",
"loss = mlm_loss + sentence_loss\n", "loss = mlm_loss + sentence_loss\n",
"\n",
"print(loss)" "print(loss)"
] ]
}, },
...@@ -290,8 +306,7 @@ ...@@ -290,8 +306,7 @@
}, },
"source": [ "source": [
"With the loss, you can optimize the model.\n", "With the loss, you can optimize the model.\n",
"After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n", "After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n"
"\n"
] ]
}, },
{ {
...@@ -315,9 +330,9 @@ ...@@ -315,9 +330,9 @@
"source": [ "source": [
"### Build a BertSpanLabeler wrapping BertEncoder\n", "### Build a BertSpanLabeler wrapping BertEncoder\n",
"\n", "\n",
"[BertSpanLabeler](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_span_labeler.py) implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n", "The `nlp.models.BertSpanLabeler` class implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n",
"\n", "\n",
"Note that `BertSpanLabeler` wraps a `BertEncoder`, the weights of which can be restored from the above pretraining model.\n" "Note that `nlp.models.BertSpanLabeler` wraps a `nlp.networks.BertEncoder`, the weights of which can be restored from the above pretraining model.\n"
] ]
}, },
{ {
...@@ -328,11 +343,11 @@ ...@@ -328,11 +343,11 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"network = modeling.networks.BertEncoder(\n", "network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n", " vocab_size=vocab_size, num_layers=2)\n",
"\n", "\n",
"# Create a BERT trainer with the created network.\n", "# Create a BERT trainer with the created network.\n",
"bert_span_labeler = modeling.models.BertSpanLabeler(network)" "bert_span_labeler = nlp.models.BertSpanLabeler(network)"
] ]
}, },
{ {
...@@ -352,7 +367,7 @@ ...@@ -352,7 +367,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -370,8 +385,9 @@ ...@@ -370,8 +385,9 @@
"\n", "\n",
"# Feed the data to the model.\n", "# Feed the data to the model.\n",
"start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n", "start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n",
"print(start_logits)\n", "\n",
"print(end_logits)" "print(f'start_logits: shape={start_logits.shape}, dtype={start_logits.dtype!r}')\n",
"print(f'end_logits: shape={end_logits.shape}, dtype={end_logits.dtype!r}')"
] ]
}, },
{ {
...@@ -432,7 +448,7 @@ ...@@ -432,7 +448,7 @@
"source": [ "source": [
"### Build a BertClassifier model wrapping BertEncoder\n", "### Build a BertClassifier model wrapping BertEncoder\n",
"\n", "\n",
"[BertClassifier](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_classifier.py) implements a [CLS] token classification model containing a single classification head." "`nlp.models.BertClassifier` implements a [CLS] token classification model containing a single classification head."
] ]
}, },
{ {
...@@ -443,12 +459,12 @@ ...@@ -443,12 +459,12 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"network = modeling.networks.BertEncoder(\n", "network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n", " vocab_size=vocab_size, num_layers=2)\n",
"\n", "\n",
"# Create a BERT trainer with the created network.\n", "# Create a BERT trainer with the created network.\n",
"num_classes = 2\n", "num_classes = 2\n",
"bert_classifier = modeling.models.BertClassifier(\n", "bert_classifier = nlp.models.BertClassifier(\n",
" network, num_classes=num_classes)" " network, num_classes=num_classes)"
] ]
}, },
...@@ -469,7 +485,7 @@ ...@@ -469,7 +485,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)" "tf.keras.utils.plot_model(bert_classifier, show_shapes=True, expand_nested=True, dpi=48)"
] ]
}, },
{ {
...@@ -487,7 +503,7 @@ ...@@ -487,7 +503,7 @@
"\n", "\n",
"# Feed the data to the model.\n", "# Feed the data to the model.\n",
"logits = bert_classifier([word_id_data, mask_data, type_id_data])\n", "logits = bert_classifier([word_id_data, mask_data, type_id_data])\n",
"print(logits)" "print(f'logits: shape={logits.shape}, dtype={logits.dtype!r}')"
] ]
}, },
{ {
...@@ -529,8 +545,7 @@ ...@@ -529,8 +545,7 @@
"metadata": { "metadata": {
"colab": { "colab": {
"collapsed_sections": [], "collapsed_sections": [],
"name": "Introduction to the TensorFlow Models NLP library", "name": "nlp_modeling_library_intro.ipynb",
"private_outputs": true,
"provenance": [], "provenance": [],
"toc_visible": true "toc_visible": true
}, },
......
...@@ -12,3 +12,15 @@ ...@@ -12,3 +12,15 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Core is shared by both `nlp` and `vision`."""
from official.core import actions
from official.core import base_task
from official.core import base_trainer
from official.core import config_definitions
from official.core import exp_factory
from official.core import export_base
from official.core import input_reader
from official.core import registry
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
...@@ -33,57 +33,6 @@ ExperimentConfig = config_definitions.ExperimentConfig ...@@ -33,57 +33,6 @@ ExperimentConfig = config_definitions.ExperimentConfig
TrainerConfig = config_definitions.TrainerConfig TrainerConfig = config_definitions.TrainerConfig
class Recovery:
"""Built-in model blowup recovery module.
Checks the loss value against the given threshold. If it is exceeded (or
the loss is NaN), recovers the model by restoring the checkpoint on disk.
"""
def __init__(self,
loss_upper_bound: float,
checkpoint_manager: tf.train.CheckpointManager,
recovery_begin_steps: int = 0,
recovery_max_trials: int = 3):
self.recover_counter = 0
self.recovery_begin_steps = recovery_begin_steps
self.recovery_max_trials = recovery_max_trials
self.loss_upper_bound = loss_upper_bound
self.checkpoint_manager = checkpoint_manager
def should_recover(self, loss_value, global_step):
if tf.math.is_nan(loss_value):
return True
if (global_step >= self.recovery_begin_steps and
loss_value > self.loss_upper_bound):
return True
return False
def maybe_recover(self, loss_value, global_step):
"""Conditionally recovers the training by triggering checkpoint restoration.
Args:
loss_value: the loss value as a float.
global_step: the number of global training steps.
Raises:
RuntimeError: when recovery happens more than the max number of trials,
the job should crash.
"""
if not self.should_recover(loss_value, global_step):
return
self.recover_counter += 1
if self.recover_counter > self.recovery_max_trials:
raise RuntimeError(
"The loss value is NaN or out of range after training loop and "
f"this happens {self.recover_counter} times.")
# Loads the previous good checkpoint.
checkpoint_path = self.checkpoint_manager.restore_or_initialize()
logging.warning(
"Recovering the model from checkpoint: %s. The loss value becomes "
"%f at step %d.", checkpoint_path, loss_value, global_step)
class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator): class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator):
"""Trainer class for both sync and async Strategy.""" """Trainer class for both sync and async Strategy."""
......
...@@ -150,30 +150,6 @@ class MockAsyncTrainer(trainer_lib._AsyncTrainer): ...@@ -150,30 +150,6 @@ class MockAsyncTrainer(trainer_lib._AsyncTrainer):
return self.eval_global_step.numpy() return self.eval_global_step.numpy()
class RecoveryTest(tf.test.TestCase):
def test_recovery_module(self):
ckpt = tf.train.Checkpoint(v=tf.Variable(1, dtype=tf.int32))
model_dir = self.get_temp_dir()
manager = tf.train.CheckpointManager(ckpt, model_dir, max_to_keep=1)
recovery_module = trainer_lib.Recovery(
loss_upper_bound=1.0,
checkpoint_manager=manager,
recovery_begin_steps=1,
recovery_max_trials=1)
self.assertFalse(recovery_module.should_recover(1.1, 0))
self.assertFalse(recovery_module.should_recover(0.1, 1))
self.assertTrue(recovery_module.should_recover(1.1, 2))
# First triggers the recovery once.
recovery_module.maybe_recover(1.1, 10)
# Second time, it raises.
with self.assertRaisesRegex(
RuntimeError, 'The loss value is NaN .*'):
recovery_module.maybe_recover(1.1, 10)
class TrainerTest(tf.test.TestCase, parameterized.TestCase):
def setUp(self):
...
...@@ -76,6 +76,10 @@ class DataConfig(base_config.Config):
features. The main use case is to skip the image/video decoding for better
performance.
seed: An optional seed to use for deterministic shuffling/preprocessing.
prefetch_buffer_size: An int specifying the buffer size of prefetch
datasets. If None, the buffer size is autotuned. Specifying this is useful
in case autotuning uses up too much memory by making the buffer size too
high.
""" """
input_path: Union[Sequence[str], str, base_config.Config] = "" input_path: Union[Sequence[str], str, base_config.Config] = ""
tfds_name: str = "" tfds_name: str = ""
...@@ -96,6 +100,7 @@ class DataConfig(base_config.Config): ...@@ -96,6 +100,7 @@ class DataConfig(base_config.Config):
tfds_as_supervised: bool = False tfds_as_supervised: bool = False
tfds_skip_decoding_feature: str = "" tfds_skip_decoding_feature: str = ""
seed: Optional[int] = None seed: Optional[int] = None
prefetch_buffer_size: Optional[int] = None
@dataclasses.dataclass
...@@ -190,8 +195,8 @@ class TrainerConfig(base_config.Config):
is only used in continuous_train_and_eval and continuous_eval modes.
Default value is 1 hour.
train_steps: number of train steps.
validation_steps: number of eval steps. If -1, the entire eval dataset is
used.
validation_interval: number of training steps to run between evaluations.
best_checkpoint_export_subdir: if set, the trainer will keep track of the
best evaluation metric, and export the corresponding best checkpoint under
...
...@@ -292,6 +292,8 @@ class InputReader:
self._transform_and_batch_fn = transform_and_batch_fn
self._postprocess_fn = postprocess_fn
self._seed = params.seed
self._prefetch_buffer_size = (params.prefetch_buffer_size or
tf.data.experimental.AUTOTUNE)
# When tf.data service is enabled, each data service worker should get
# different random seeds. Thus, we set `seed` to None.
...@@ -505,4 +507,4 @@ class InputReader:
options = tf.data.Options()
options.experimental_deterministic = self._deterministic
dataset = dataset.with_options(options)
return dataset.prefetch(self._prefetch_buffer_size)
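The `params.prefetch_buffer_size or tf.data.experimental.AUTOTUNE` fallback above has one subtlety worth noting: Python's `or` also maps an explicit `0` to autotuning. A minimal sketch of the resolution logic, using `-1` as a stand-in constant (that is the actual value of `tf.data.experimental.AUTOTUNE`):

```python
AUTOTUNE = -1  # stand-in; tf.data.experimental.AUTOTUNE is the constant -1

def resolve_prefetch_buffer_size(configured):
    """Mirror the InputReader fallback: `configured or AUTOTUNE`."""
    return configured or AUTOTUNE

print(resolve_prefetch_buffer_size(None))  # -1, i.e. autotune
print(resolve_prefetch_buffer_size(16))    # 16
print(resolve_prefetch_buffer_size(0))     # -1: `or` treats 0 as falsy too
```

So a caller who wants a literal buffer size of zero cannot express it through this config; any positive int is passed through unchanged.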
# Copyright 2022 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Adam optimizer with weight decay that exactly matches the original BERT."""
import re
from absl import logging
import tensorflow as tf
class AdamWeightDecay(tf.keras.optimizers.Adam):
"""Adam enables L2 weight decay and clip_by_global_norm on gradients.
[Warning!]: The Keras optimizer supports gradient clipping and has an AdamW
implementation. Please consider evaluating that option in the Keras package.
Just adding the square of the weights to the loss function is *not* the
correct way of using L2 regularization/weight decay with Adam, since that will
interact with the m and v parameters in strange ways.
Instead we want to decay the weights in a manner that doesn't interact with
the m/v parameters. This is equivalent to adding the square of the weights to
the loss with plain (non-momentum) SGD.
"""
def __init__(self,
learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-7,
amsgrad=False,
weight_decay_rate=0.0,
include_in_weight_decay=None,
exclude_from_weight_decay=None,
gradient_clip_norm=1.0,
name='AdamWeightDecay',
**kwargs):
super(AdamWeightDecay, self).__init__(learning_rate, beta_1, beta_2,
epsilon, amsgrad, name, **kwargs)
self.weight_decay_rate = weight_decay_rate
self.gradient_clip_norm = gradient_clip_norm
self._include_in_weight_decay = include_in_weight_decay
self._exclude_from_weight_decay = exclude_from_weight_decay
logging.info('AdamWeightDecay gradient_clip_norm=%f', gradient_clip_norm)
def _prepare_local(self, var_device, var_dtype, apply_state):
super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, # pytype: disable=attribute-error # typed-keras
apply_state)
apply_state[(var_device, var_dtype)]['weight_decay_rate'] = tf.constant(
self.weight_decay_rate, name='adam_weight_decay_rate')
def _decay_weights_op(self, var, learning_rate, apply_state):
do_decay = self._do_use_weight_decay(var.name)
if do_decay:
return var.assign_sub(
learning_rate * var *
apply_state[(var.device, var.dtype.base_dtype)]['weight_decay_rate'],
use_locking=self._use_locking)
return tf.no_op()
def apply_gradients(self,
grads_and_vars,
name=None,
experimental_aggregate_gradients=True):
grads, tvars = list(zip(*grads_and_vars))
if experimental_aggregate_gradients and self.gradient_clip_norm > 0.0:
# When experimental_aggregate_gradients = False, apply_gradients() no
# longer implicitly allreduces gradients; users manually allreduce
# gradients and pass the allreduced grads_and_vars. For now,
# clip_by_global_norm is moved to before the explicit allreduce to
# keep the math the same as TF 1 and pre-TF 2.2 implementations.
(grads, _) = tf.clip_by_global_norm(
grads, clip_norm=self.gradient_clip_norm)
return super(AdamWeightDecay, self).apply_gradients(
zip(grads, tvars),
name=name,
experimental_aggregate_gradients=experimental_aggregate_gradients)
def _get_lr(self, var_device, var_dtype, apply_state):
"""Retrieves the learning rate with the given state."""
if apply_state is None:
return self._decayed_lr_t[var_dtype], {}
apply_state = apply_state or {}
coefficients = apply_state.get((var_device, var_dtype))
if coefficients is None:
coefficients = self._fallback_apply_state(var_device, var_dtype)
apply_state[(var_device, var_dtype)] = coefficients
return coefficients['lr_t'], dict(apply_state=apply_state)
def _resource_apply_dense(self, grad, var, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_dense(grad, var, **kwargs) # pytype: disable=attribute-error # typed-keras
def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_sparse(grad, var, indices, **kwargs) # pytype: disable=attribute-error # typed-keras
def get_config(self):
config = super(AdamWeightDecay, self).get_config()
config.update({
'weight_decay_rate': self.weight_decay_rate,
})
return config
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if self.weight_decay_rate == 0:
return False
if self._include_in_weight_decay:
for r in self._include_in_weight_decay:
if re.search(r, param_name) is not None:
return True
if self._exclude_from_weight_decay:
for r in self._exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
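The include/exclude precedence in `_do_use_weight_decay` above (an include match wins over an exclude match, and exclusion only applies otherwise) can be isolated as a small, runnable sketch. `should_decay` and its parameter names are hypothetical stand-ins for illustration, not part of the optimizer's API:

```python
import re

def should_decay(param_name, include=None, exclude=None,
                 weight_decay_rate=0.01):
    """Mirror of AdamWeightDecay._do_use_weight_decay selection logic."""
    if weight_decay_rate == 0:
        return False
    # An include pattern match wins outright.
    if include:
        for r in include:
            if re.search(r, param_name) is not None:
                return True
    # Otherwise an exclude pattern match disables decay.
    if exclude:
        for r in exclude:
            if re.search(r, param_name) is not None:
                return False
    return True

# BERT-style setup: never decay LayerNorm or bias variables.
exclude = ["LayerNorm", "layer_norm", "bias"]
print(should_decay("encoder/layer_0/kernel", exclude=exclude))  # True
print(should_decay("encoder/layer_0/bias", exclude=exclude))    # False
```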
...@@ -18,20 +18,21 @@ from typing import Callable, Optional, Union, List, Tuple
import gin
import tensorflow as tf
import tensorflow_addons.optimizers as tfa_optimizers
from official.modeling.optimization import slide_optimizer
from official.modeling.optimization import adafactor_optimizer
from official.modeling.optimization import ema_optimizer
from official.modeling.optimization import lars_optimizer
from official.modeling.optimization import legacy_adamw
from official.modeling.optimization import lr_schedule
from official.modeling.optimization.configs import optimization_config as opt_cfg
from official.nlp import optimization as nlp_optimization
OPTIMIZERS_CLS = {
'sgd': tf.keras.optimizers.SGD,
# TODO(chenmoneygithub): experimental.SGD
'adam': tf.keras.optimizers.Adam,
# TODO(chenmoneygithub): experimental.Adam
'adamw': legacy_adamw.AdamWeightDecay,
'lamb': tfa_optimizers.LAMB,
'rmsprop': tf.keras.optimizers.RMSprop,
'lars': lars_optimizer.LARS,
...@@ -57,8 +58,8 @@ WARMUP_CLS = {
}
def register_optimizer_cls(key: str,
optimizer_config_cls: tf.keras.optimizers.Optimizer):
"""Registers a customized optimizer class.
The user will still need to subclass data classes in
...@@ -85,6 +86,8 @@ class OptimizerFactory:
(4) Build optimizer.
This is a typical example for using this class:
```
params = {
'optimizer': {
'type': 'sgd',
...@@ -104,6 +107,7 @@ class OptimizerFactory:
opt_factory = OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
```
""" """
def __init__(self, config: opt_cfg.OptimizationConfig):
...@@ -156,9 +160,12 @@ class OptimizerFactory:
def build_optimizer(
self,
lr: Union[tf.keras.optimizers.schedules.LearningRateSchedule, float],
gradient_aggregator: Optional[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]] = None,
gradient_transformers: Optional[List[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]]] = None,
postprocessor: Optional[Callable[[tf.keras.optimizers.Optimizer],
tf.keras.optimizers.Optimizer]] = None):
"""Build optimizer.
...@@ -170,6 +177,7 @@ class OptimizerFactory:
Args:
lr: A floating point value, or a
tf.keras.optimizers.schedules.LearningRateSchedule instance.
gradient_aggregator: Optional function to overwrite gradient aggregation.
gradient_transformers: Optional list of functions to use to transform
gradients before applying updates to Variables. The functions are
applied after gradient_aggregator. The functions should accept and
...@@ -193,6 +201,8 @@ class OptimizerFactory:
del optimizer_dict['global_clipnorm']
optimizer_dict['learning_rate'] = lr
if gradient_aggregator is not None:
optimizer_dict['gradient_aggregator'] = gradient_aggregator
if gradient_transformers is not None:
optimizer_dict['gradient_transformers'] = gradient_transformers
...
...@@ -49,6 +49,39 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
self.assertIsInstance(optimizer, optimizer_cls)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
def test_gradient_aggregator(self):
params = {
'optimizer': {
'type': 'adam',
},
'learning_rate': {
'type': 'constant',
'constant': {
'learning_rate': 1.0
}
}
}
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
# Dummy function to zero out gradients.
zero_grads = lambda gv: [(tf.zeros_like(g), v) for g, v in gv]
optimizer = opt_factory.build_optimizer(lr, gradient_aggregator=zero_grads)
var0 = tf.Variable([1.0, 2.0])
var1 = tf.Variable([3.0, 4.0])
grads0 = tf.constant([1.0, 1.0])
grads1 = tf.constant([1.0, 1.0])
grads_and_vars = list(zip([grads0, grads1], [var0, var1]))
optimizer.apply_gradients(grads_and_vars)
self.assertAllClose(np.array([1.0, 2.0]), var0.numpy())
self.assertAllClose(np.array([3.0, 4.0]), var1.numpy())
@parameterized.parameters((None, None), (1.0, None), (None, 1.0))
def test_gradient_clipping(self, clipnorm, clipvalue):
params = {
...@@ -418,7 +451,7 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
}
}
}
expected_lr_step_values = [[0, 0.0], [5000, 1e-4 / 2.0], [10000, 1e-4],
[20000, 9.994863e-05], [499999, 5e-05]]
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
...@@ -434,10 +467,12 @@ class OptimizerFactoryRegistryTest(tf.test.TestCase):
class MyClass():
pass
optimizer_factory.register_optimizer_cls('test', MyClass)
self.assertIn('test', optimizer_factory.OPTIMIZERS_CLS)
with self.assertRaisesRegex(ValueError, 'test already registered.*'):
optimizer_factory.register_optimizer_cls('test', MyClass)
if __name__ == '__main__':
tf.test.main()
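The duplicate-registration guard exercised by `OptimizerFactoryRegistryTest` follows a common registry pattern. The sketch below is a hedged reconstruction with a stubbed `OPTIMIZERS_CLS`, not the factory's exact code:

```python
OPTIMIZERS_CLS = {"sgd": object}  # stub registry keyed by optimizer name

def register_optimizer_cls(key, optimizer_cls):
    """Register a custom optimizer class; re-registering a key raises."""
    if key in OPTIMIZERS_CLS:
        raise ValueError(f"{key} already registered in OPTIMIZERS_CLS.")
    OPTIMIZERS_CLS[key] = optimizer_cls
```

Failing loudly on a duplicate key surfaces accidental name collisions at registration time instead of silently shadowing an existing optimizer.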
# TF-NLP Model Garden
⚠️ Disclaimer: the datasets hyperlinked from this page are not owned or
distributed by Google. They are made available by third parties.
Please review the terms and conditions made available by those third parties
before using the data.
This codebase provides a Natural Language Processing modeling toolkit written in
[TF2](https://www.tensorflow.org/guide/effective_tf2). It allows researchers and
...@@ -30,7 +35,10 @@ research ideas. Detailed instructions can be found in READMEs in each folder.
We provide SoTA model implementations, pre-trained models, training and
evaluation examples, and command lines. Detailed instructions can be found in
the READMEs for specific papers. Below are some papers implemented in the
repository; more NLP projects can be found in the
[`projects`](https://github.com/tensorflow/models/tree/master/official/projects)
folder:
1. [BERT](MODEL_GARDEN.md#available-model-configs): [BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/abs/1810.04805) by Devlin et al.,
...@@ -38,10 +46,10 @@ READMEs for specific papers.
2. [ALBERT](MODEL_GARDEN.md#available-model-configs):
[A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
by Lan et al., 2019
3. [XLNet](MODEL_GARDEN.md):
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
by Yang et al., 2019
4. [Transformer for translation](MODEL_GARDEN.md#available-model-configs):
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et
al., 2017
...
...@@ -17,4 +17,3 @@
from official.nlp.configs import finetuning_experiments
from official.nlp.configs import pretraining_experiments
from official.nlp.configs import wmt_transformer_experiments
from official.projects.teams import teams_experiments
...@@ -187,6 +187,8 @@ class AxProcessor(DataProcessor):
def _create_examples_tfds(self, dataset, set_type):
"""Creates examples for the training/dev/test sets."""
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -218,6 +220,8 @@ class ColaProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/cola", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -312,6 +316,8 @@ class MnliProcessor(DataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -343,6 +349,8 @@ class MrpcProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mrpc", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -453,6 +461,8 @@ class QnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -484,6 +494,8 @@ class QqpProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qqp", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -517,6 +529,8 @@ class RteProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/rte", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -548,6 +562,8 @@ class SstProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/sst2", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -574,6 +590,8 @@ class StsBProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/stsb", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...@@ -742,6 +760,8 @@ class WnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/wnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
...
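Each processor change above materializes the TFDS iterator and sorts it by the example `idx` before enumerating. The motivation: iterator order is not guaranteed to be stable across runs, while the generated GUIDs (`"%s-%s" % (set_type, i)`) depend on enumeration order. A standalone sketch with stubbed records (no TFDS download needed; the record contents are illustrative only):

```python
# Stub records mimicking tfds.load(...).as_numpy_iterator() output, whose
# order is not guaranteed to be stable across runs.
records = [
    {"idx": 2, "sentence": b"c"},
    {"idx": 0, "sentence": b"a"},
    {"idx": 1, "sentence": b"b"},
]

# Same pattern as every processor above: materialize the iterator and sort
# by "idx" so the enumerate-based GUIDs are deterministic.
dataset = list(records)
dataset.sort(key=lambda x: x["idx"])
guids = ["train-%s" % i for i, _ in enumerate(dataset)]
print(guids)  # ['train-0', 'train-1', 'train-2'], always in idx order
```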
...@@ -178,13 +178,13 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq=False,
begin_kernel=0,
scale=None,
scale_by_length=False,
**kwargs):
r"""Constructor of KernelAttention.
Args:
feature_transform: A non-linear transform of the keys and queries. Possible
transforms are "elu", "relu", "square", "exp", "expmod", "identity".
num_random_features: Number of random features to be used for projection.
if num_random_features <= 0, no projection is used before transform.
seed: The seed to begin drawing random features. Once the seed is set, the
...@@ -194,12 +194,16 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
redraw: Whether to redraw projection every forward pass during training.
The argument is only effective when num_random_features > 0.
is_short_seq: boolean predicate indicating whether input data consists of
very short sequences or not; in most cases this should be False (default
option).
begin_kernel: Apply kernel_attention after this sequence id and apply
softmax attention before this.
scale: The value to scale the dot product as described in `Attention Is
All You Need`. If None, we use 1/sqrt(dk) as described in the paper.
scale_by_length: boolean predicate indicating whether to additionally scale
the dot product based on key length. Set as log_512(n) to stabilize
attention entropy against length. Refer to
https://kexue.fm/archives/8823 for details.
**kwargs: The same arguments as the `MultiHeadAttention` layer.
"""
if feature_transform not in _TRANSFORM_MAP:
...@@ -214,6 +218,7 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
self._redraw = redraw
self._is_short_seq = is_short_seq
self._begin_kernel = begin_kernel
self._scale_by_length = scale_by_length
# We use the seed for two scenarios:
# 1. inference
# 2. no redraw
...@@ -252,9 +257,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq: boolean predicate indicating whether input data consists of
short or long sequences; usually short sequence is defined as having
length L <= 1024.
attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
to masked positions. Note that the mask is only applied to the keys.
User may want to mask the output if query contains pads.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (doing nothing).
numeric_stabler: A scalar value added to avoid divide by 0.
...@@ -270,17 +275,23 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
else:
projection_matrix = self._projection_matrix
if self._scale_by_length:
scale = tf.math.log(tf.reduce_sum(attention_mask,
axis=-1)) * self._scale / math.log(512)
scale = tf.reshape(scale, [-1, 1, 1, 1])
else:
scale = self._scale
if is_short_seq: if is_short_seq:
# Note: Applying scalar multiply at the smaller end of einsum improves # Note: Applying scalar multiply at the smaller end of einsum improves
# XLA performance, but may introduce slight numeric differences in # XLA performance, but may introduce slight numeric differences in
# the Transformer attention head. # the Transformer attention head.
query = query * self._scale query = query * scale
else: else:
# Note: we suspect spliting the scale to key, query yields smaller # Note: we suspect spliting the scale to key, query yields smaller
# approximation variance when random projection is used. # approximation variance when random projection is used.
# For simplicity, we also split when there's no random projection. # For simplicity, we also split when there's no random projection.
key *= math.sqrt(self._scale) key *= tf.math.sqrt(scale)
query *= math.sqrt(self._scale) query *= tf.math.sqrt(scale)
key = _TRANSFORM_MAP[feature_transform](key, projection_matrix) key = _TRANSFORM_MAP[feature_transform](key, projection_matrix)
query = _TRANSFORM_MAP[feature_transform](query, projection_matrix) query = _TRANSFORM_MAP[feature_transform](query, projection_matrix)
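The length-dependent scaling in the new `scale_by_length` branch can be sketched in plain Python. This is a minimal illustration, not the library code: it assumes a fully unmasked sequence (so the mask sum equals the sequence length) and uses 512 as the reference length, matching the `math.log(512)` denominator above.

```python
import math


def length_scale(seq_len, base_scale, ref_len=512):
    # Grows logarithmically with sequence length and equals base_scale
    # exactly when seq_len == ref_len, since log(ref_len)/log(ref_len) == 1.
    return math.log(seq_len) / math.log(ref_len) * base_scale


# A typical 1/sqrt(key_dim) attention scale for key_dim == 64.
base = 1.0 / math.sqrt(64)
scale_512 = length_scale(512, base)  # unchanged at the reference length
scale_128 = length_scale(128, base)  # shorter sequences get a smaller scale
```

This is why the test below expects identical outputs at `seq_length == 512` and different outputs otherwise.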
...@@ -330,9 +341,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
      value: Value `Tensor` of shape `[B, S, dim]`.
      key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will use
        `value` for both `key` and `value`, which is the most common case.
      attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
        to masked positions. Note that the mask is only applied to the keys. The
        user may want to mask the output if the query contains pads.
      training: Python boolean indicating whether the layer should behave in
        training mode (adding dropout) or in inference mode (doing nothing).
...@@ -373,9 +384,10 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
      attention_output = tf.concat(
          [attention_output_softmax, attention_output_kernel], axis=1)
    else:
      attention_output = self._compute_attention(query, key, value,
                                                 self._feature_transform,
                                                 self._is_short_seq,
                                                 attention_mask, training)
    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_output = self._dropout_layer(attention_output)
...
...@@ -30,8 +30,8 @@ _BEGIN_KERNEL = [0, 512]
class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
  @parameterized.parameters(
      itertools.product(_FEATURE_TRANSFORM, [127], _TRAINING, [True, False],
                        _IS_SHORT_SEQ, _BEGIN_KERNEL))
  def test_attention_projection(
      self, feature_transform, num_random_features, training, redraw, is_short,
...@@ -90,6 +90,32 @@ class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
        training=training)
    self.assertEqual(output.shape, [batch_size, seq_length, key_dim])
  @parameterized.parameters([128, 512])
  def test_attention_scale_by_length(self, seq_length):
    num_heads = 12
    key_dim = 64
    batch_size = 2
    test_layer = attention.KernelAttention(
        num_heads=num_heads,
        key_dim=key_dim,
        num_random_features=0,
        scale_by_length=True)
    query = tf.random.normal(shape=(batch_size, seq_length, key_dim))
    value = query
    encoder_inputs_mask = tf.ones((batch_size, seq_length), dtype=tf.int32)
    masks = tf.cast(encoder_inputs_mask, dtype=tf.float32)
    output_scale_by_length = test_layer(
        query=query, value=value, attention_mask=masks)
    test_layer._scale_by_length = False
    output_no_scale_by_length = test_layer(
        query=query, value=value, attention_mask=masks)
    if seq_length == 512:  # Equal because log(seq_length) / log(512) = 1.0
      self.assertAllClose(output_scale_by_length, output_no_scale_by_length)
    else:
      self.assertNotAllClose(output_scale_by_length, output_no_scale_by_length)
  def test_unsupported_feature_transform(self):
    with self.assertRaisesRegex(ValueError, 'Unsupported feature_transform.*'):
      _ = attention.KernelAttention(feature_transform='test')
...
...@@ -14,6 +14,7 @@
"""Keras-based TransformerEncoder block layer.""" """Keras-based TransformerEncoder block layer."""
from absl import logging
import tensorflow as tf

from official.nlp.modeling.layers import util
...@@ -176,9 +177,9 @@ class TransformerEncoderBlock(tf.keras.layers.Layer):
einsum_equation = "...bc,cd->...bd" einsum_equation = "...bc,cd->...bd"
hidden_size = input_tensor_shape[-1] hidden_size = input_tensor_shape[-1]
if hidden_size % self._num_heads != 0: if hidden_size % self._num_heads != 0:
raise ValueError( logging.warning(
"The input size (%d) is not a multiple of the number of attention " "The input size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, self._num_heads)) "heads (%d)", hidden_size, self._num_heads)
if self._key_dim is None: if self._key_dim is None:
self._key_dim = int(hidden_size // self._num_heads) self._key_dim = int(hidden_size // self._num_heads)
if self._output_last_dim is None: if self._output_last_dim is None:
...