"vscode:/vscode.git/clone" did not exist on "40cb07b4b9d66682b787b7825d22f01df80ce022"
Unverified Commit 44f6d511 authored by Srihari Humbarwadi, committed by GitHub

Merge branch 'tensorflow:master' into panoptic-deeplab

parents 686a287d 8bc5a1a5
......@@ -3,7 +3,8 @@
</div>
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg?style=plastic)](https://badge.fury.io/py/tensorflow)
[![PyPI](https://badge.fury.io/py/tensorflow.svg)](https://badge.fury.io/py/tensorflow)
[![tf-models-official PyPI](https://badge.fury.io/py/tf-models-official.svg)](https://badge.fury.io/py/tf-models-official)
# Welcome to the Model Garden for TensorFlow
......@@ -32,7 +33,8 @@ To install the current release of tensorflow-models, please follow any one of th
<details>
**tf-models-official** is the stable Model Garden package.
**tf-models-official** is the stable Model Garden package. Please check out the [releases](https://github.com/tensorflow/models/releases) to see what modules are available.
pip will install all models and dependencies automatically.
```shell
......
......@@ -19,7 +19,7 @@ This repository provides a curated list of the GitHub repositories with machine
| [ResNet 101](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet101) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50v1.5](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50v1_5) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference<br/>• FP32 Training | [Intel](https://github.com/IntelAI) |
| [EfficientNet](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) | • Automatic mixed precision<br/>• Horovod Multi-GPU training (NCCL)<br/>• Multi-node training on a Pyxis/Enroot Slurm cluster<br/>• XLA | [NVIDIA](https://github.com/NVIDIA) |
| EfficientNet [v1](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v1) [v2](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Classification/ConvNets/efficientnet_v2) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/pdf/1905.11946.pdf) | • Automatic mixed precision<br/>• Horovod Multi-GPU training (NCCL)<br/>• Multi-node training on a Pyxis/Enroot Slurm cluster<br/>• XLA | [NVIDIA](https://github.com/NVIDIA) |
### Object Detection
......
......@@ -38,16 +38,15 @@ In the near future, we will add:
## Models and Implementations
### Computer Vision
### [Computer Vision](vision/README.md)
#### Image Classification
| Model | Reference (Paper) |
|-------|-------------------|
| [MNIST](legacy/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) |
| [ResNet](vision/MODEL_GARDEN.md) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
| [ResNet-RS](vision/MODEL_GARDEN.md) | [Revisiting ResNets: Improved Training and Scaling Strategies](https://arxiv.org/abs/2103.07579) |
| [EfficientNet](legacy/image_classification) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
| [EfficientNet](vision/MODEL_GARDEN.md) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
| [Vision Transformer](vision/MODEL_GARDEN.md) | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) |
#### Object Detection and Segmentation
......@@ -56,7 +55,6 @@ In the near future, we will add:
|-------|-------------------|
| [RetinaNet](vision/MODEL_GARDEN.md) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
| [Mask R-CNN](vision/MODEL_GARDEN.md) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
| [ShapeMask](legacy/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) |
| [SpineNet](vision/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
| [Cascade RCNN-RS and RetinaNet-RS](vision/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
......@@ -66,7 +64,7 @@ In the near future, we will add:
|-------|-------------------|
| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
### Natural Language Processing
### [Natural Language Processing](nlp/README.md)
| Model | Reference (Paper) |
|-------|-------------------|
......@@ -74,7 +72,6 @@ In the near future, we will add:
| [BERT (Bidirectional Encoder Representations from Transformers)](nlp/MODEL_GARDEN.md#available-model-configs) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| [NHNet (News Headline generation model)](projects/nhnet) | [Generating Representative Headlines for News Stories](https://arxiv.org/abs/2001.09386) |
| [Transformer](nlp/MODEL_GARDEN.md#available-model-configs) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| [XLNet](nlp/xlnet) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
| [MobileBERT](projects/mobilebert) | [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) |
### Recommendation
......
......@@ -34,14 +34,10 @@
{
"cell_type": "markdown",
"metadata": {
"id": "fsACVQpVSifi"
"id": "2X-XaMSVcLua"
},
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n",
"which is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
"# Decoding API"
]
},
{
......@@ -66,6 +62,30 @@
"\u003c/table\u003e"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fsACVQpVSifi"
},
"source": [
"### Install the TensorFlow Model Garden pip package\n",
"\n",
"* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` github repo. To include latest changes, you may install `tf-models-nightly`,\n",
"which is the nightly Model Garden package created daily automatically.\n",
"* pip will install all models and dependencies automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G4BhAu01HZcM"
},
"outputs": [],
"source": [
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
......@@ -74,7 +94,7 @@
},
"outputs": [],
"source": [
"pip install tf-models-nightly"
"!pip install tf-models-official"
]
},
{
......@@ -92,9 +112,20 @@
"\n",
"import tensorflow as tf\n",
"\n",
"from official import nlp\n",
"from official.nlp.modeling.ops import sampling_module\n",
"from official.nlp.modeling.ops import beam_search"
"from tensorflow_models import nlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T92ccAzlnGqh"
},
"outputs": [],
"source": [
"def length_norm(length, dtype):\n",
" \"\"\"Return length normalization factor.\"\"\"\n",
" return tf.pow(((5. + tf.cast(length, dtype)) / 6.), 0.0)"
]
},
{
......@@ -103,7 +134,8 @@
"id": "0AWgyo-IQ5sP"
},
"source": [
"# Decoding API\n",
"## Overview\n",
"\n",
"This API provides an interface to experiment with different decoding strategies used for auto-regressive models.\n",
"\n",
"1. The following sampling strategies are provided in sampling_module.py, which inherits from the base Decoding class:\n",
......@@ -182,7 +214,7 @@
"id": "lV1RRp6ihnGX"
},
"source": [
"# Initialize the Model Hyper-parameters"
"## Initialize the Model Hyper-parameters"
]
},
{
......@@ -193,44 +225,32 @@
},
"outputs": [],
"source": [
"params = {}\n",
"params['num_heads'] = 2\n",
"params['num_layers'] = 2\n",
"params['batch_size'] = 2\n",
"params['n_dims'] = 256\n",
"params['max_decode_length'] = 4"
"params = {\n",
" 'num_heads': 2\n",
" 'num_layers': 2\n",
" 'batch_size': 2\n",
" 'n_dims': 256\n",
" 'max_decode_length': 4}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UGvmd0_dRFYI"
"id": "CYXkoplAij01"
},
"source": [
"## What is a Cache?\n",
"In auto-regressive architectures like Transformer based [Encoder-Decoder](https://arxiv.org/abs/1706.03762) models, \n",
"Cache is used for fast sequential decoding.\n",
"It is a nested dictionary storing pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) for every layer.\n",
"\n",
"```\n",
"{\n",
" 'layer_%d' % layer: {\n",
" 'k': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32),\n",
" 'v': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32)\n",
" } for layer in range(params['num_layers']),\n",
" 'model_specific_item' : Model specific tensor shape,\n",
"}\n",
"\n",
"```"
"## Initialize cache. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CYXkoplAij01"
"id": "UGvmd0_dRFYI"
},
"source": [
"# Initialize cache. "
"In auto-regressive architectures like Transformer based [Encoder-Decoder](https://arxiv.org/abs/1706.03762) models, \n",
"Cache is used for fast sequential decoding.\n",
"It is a nested dictionary storing pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) for every layer."
]
},
{
......@@ -243,35 +263,15 @@
"source": [
"cache = {\n",
" 'layer_%d' % layer: {\n",
" 'k': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32),\n",
" 'v': tf.zeros([params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims']/params['num_heads']], dtype=tf.float32)\n",
" 'k': tf.zeros(\n",
" shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
" dtype=tf.float32),\n",
" 'v': tf.zeros(\n",
" shape=[params['batch_size'], params['max_decode_length'], params['num_heads'], params['n_dims'] // params['num_heads']],\n",
" dtype=tf.float32)\n",
" } for layer in range(params['num_layers'])\n",
" }\n",
"print(\"cache key shape for layer 1 :\", cache['layer_1']['k'].shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nNY3Xn8SiblP"
},
"source": [
"# Define closure for length normalization. **optional.**\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T92ccAzlnGqh"
},
"outputs": [],
"source": [
"def length_norm(length, dtype):\n",
" \"\"\"Return length normalization factor.\"\"\"\n",
" return tf.pow(((5. + tf.cast(length, dtype)) / 6.), 0.0)"
"print(\"cache value shape for layer 1 :\", cache['layer_1']['k'].shape)"
]
},
{
......@@ -280,15 +280,14 @@
"id": "syl7I5nURPgW"
},
"source": [
"# Create model_fn\n",
"### Create model_fn\n",
" In practice, this will be replaced by an actual model implementation such as [here](https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py#L236)\n",
"```\n",
"Args:\n",
"i : Step that is being decoded.\n",
"Returns:\n",
" logit probabilities of size [batch_size, 1, vocab_size]\n",
"```\n",
"\n"
"```\n"
]
},
{
......@@ -307,15 +306,6 @@
" return probabilities[:, i, :]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DBMUkaVmVZBg"
},
"source": [
"# Initialize symbols_to_logits_fn\n"
]
},
{
"cell_type": "code",
"execution_count": null,
......@@ -339,7 +329,7 @@
"id": "R_tV3jyWVL47"
},
"source": [
"# Greedy \n",
"## Greedy \n",
"Greedy decoding selects the token id with the highest probability as its next id: $id_t = argmax_{w}P(id | id_{1:t-1})$ at each timestep $t$. The following sketch shows greedy decoding. "
]
},
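The selection rule itself can be sketched outside the notebook's library calls. The snippet below is a minimal illustration in plain TensorFlow (it is not the `sampling_module` API, and the logits values are made up): at every decoding step the id with the highest logit is chosen.

```python
import tensorflow as tf

# Greedy rule: at each step, take the argmax over the next-token logits.
logits = tf.constant([[0.1, 2.3, -1.0],   # batch element 0
                      [1.5, 0.2, 0.9]])   # batch element 1
next_ids = tf.argmax(logits, axis=-1)     # -> [1, 0]
print(next_ids.numpy())
```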
......@@ -370,7 +360,7 @@
"id": "s4pTTsQXVz5O"
},
"source": [
"# top_k sampling\n",
"## top_k sampling\n",
"In *Top-K* sampling, the *K* most likely next token ids are filtered and the probability mass is redistributed among only those *K* ids. "
]
},
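A rough, self-contained sketch of that filtering step in plain TensorFlow (not the `sampling_module` implementation; the example logits and `k` are made up): logits outside the top *K* are masked out before sampling.

```python
import tensorflow as tf

def top_k_filter(logits, k):
  """Keeps the k largest logits per row and masks the rest before sampling."""
  top_values, _ = tf.math.top_k(logits, k=k)
  kth_value = top_values[..., -1, tf.newaxis]
  return tf.where(logits < kth_value,
                  tf.fill(tf.shape(logits), logits.dtype.min),
                  logits)

logits = tf.constant([[1.0, 3.0, 0.5, 2.0]])
filtered = top_k_filter(logits, k=2)
# Sampling now only draws ids 1 or 3, the two most likely tokens.
next_id = tf.random.categorical(filtered, num_samples=1)
```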
......@@ -404,7 +394,7 @@
"id": "Jp3G-eE_WI4Y"
},
"source": [
"# top_p sampling\n",
"## top_p sampling\n",
"Instead of sampling only from the most likely *K* token ids, in *Top-p* sampling chooses from the smallest possible set of ids whose cumulative probability exceeds the probability *p*."
]
},
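A similar self-contained sketch of the nucleus filtering in plain TensorFlow (again not the `sampling_module` implementation; the example logits and `p` are made up): ids are kept while the probability mass of the better-ranked ids is still below *p*, and everything else is masked.

```python
import tensorflow as tf

def top_p_filter(logits, p):
  """Masks every id outside the smallest set whose cumulative probability exceeds p."""
  sorted_logits = tf.sort(logits, direction='DESCENDING', axis=-1)
  probs = tf.nn.softmax(sorted_logits, axis=-1)
  cumulative = tf.cumsum(probs, axis=-1)
  # An id stays in the nucleus while the mass of strictly better-ranked ids is below p.
  keep = (cumulative - probs) < p
  threshold = tf.reduce_min(
      tf.where(keep, sorted_logits,
               tf.fill(tf.shape(sorted_logits), sorted_logits.dtype.max)),
      axis=-1, keepdims=True)
  return tf.where(logits < threshold,
                  tf.fill(tf.shape(logits), logits.dtype.min),
                  logits)

logits = tf.constant([[2.0, 1.0, 0.1, -1.0]])
filtered = top_p_filter(logits, p=0.9)
next_id = tf.random.categorical(filtered, num_samples=1)
```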
......@@ -438,7 +428,7 @@
"id": "2hcuyJ2VWjDz"
},
"source": [
"# Beam search decoding\n",
"## Beam search decoding\n",
"Beam search reduces the risk of missing hidden high probability token ids by keeping the most likely num_beams of hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. "
]
},
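The `beam_search` module does this over real model logits; the toy sketch below only illustrates the bookkeeping with hypothetical per-step probabilities (it is not the Model Garden API): expand every hypothesis with every vocabulary id, score it by cumulative log probability, and keep the best `num_beams`.

```python
import numpy as np

# Toy beam-search bookkeeping over made-up step probabilities.
step_log_probs = np.log(np.array([
    [0.6, 0.3, 0.1],   # step 0: P(id | prefix) for a 3-id vocabulary
    [0.2, 0.5, 0.3],   # step 1: assumed independent of the prefix, for simplicity
]))
num_beams = 2
beams = [((), 0.0)]    # (ids so far, cumulative log probability)
for log_probs in step_log_probs:
  candidates = [(ids + (tok,), score + lp)
                for ids, score in beams
                for tok, lp in enumerate(log_probs)]
  beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
print(beams)  # the two hypotheses with the highest overall log probability
```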
......
......@@ -95,6 +95,19 @@
"* `pip` will install all models and dependencies automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IAOmYthAzI7J"
},
"outputs": [],
"source": [
"# Uninstall colab's opencv-python, it conflicts with `opencv-python-headless`\n",
"# which is installed by tf-models-official\n",
"!pip uninstall -y opencv-python"
]
},
{
"cell_type": "code",
"execution_count": null,
......@@ -103,7 +116,7 @@
},
"outputs": [],
"source": [
"!pip install -q tf-models-official==2.4.0"
"!pip install tf-models-official"
]
},
{
......@@ -126,8 +139,7 @@
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"from official.nlp import modeling\n",
"from official.nlp.modeling import layers, losses, models, networks"
"from tensorflow_models import nlp"
]
},
{
......@@ -151,9 +163,9 @@
"source": [
"### Build a `BertPretrainer` model wrapping `BertEncoder`\n",
"\n",
"The [BertEncoder](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/networks/bert_encoder.py) implements the Transformer-based encoder as described in [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers, but not the masked language model or classification task networks.\n",
"The `nlp.networks.BertEncoder` class implements the Transformer-based encoder as described in [BERT paper](https://arxiv.org/abs/1810.04805). It includes the embedding lookups and transformer layers (`nlp.layers.TransformerEncoderBlock`), but not the masked language model or classification task networks.\n",
"\n",
"The [BertPretrainer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_pretrainer.py) allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives."
"The `nlp.models.BertPretrainer` class allows a user to pass in a transformer stack, and instantiates the masked language model and classification networks that are used to create the training objectives."
]
},
{
......@@ -166,9 +178,10 @@
"source": [
"# Build a small transformer network.\n",
"vocab_size = 100\n",
"sequence_length = 16\n",
"network = modeling.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=16)"
"network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, \n",
" # The number of TransformerEncoderBlock layers\n",
" num_layers=3)"
]
},
{
......@@ -177,7 +190,7 @@
"id": "0NH5irV5KTMS"
},
"source": [
"Inspecting the encoder, we see it contains few embedding layers, stacked `Transformer` layers and are connected to three input layers:\n",
"Inspecting the encoder, we see it contains few embedding layers, stacked `nlp.layers.TransformerEncoderBlock` layers and are connected to three input layers:\n",
"\n",
"`input_word_ids`, `input_type_ids` and `input_mask`.\n"
]
......@@ -190,7 +203,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(network, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(network, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -203,7 +216,7 @@
"source": [
"# Create a BERT pretrainer with the created network.\n",
"num_token_predictions = 8\n",
"bert_pretrainer = modeling.models.BertPretrainer(\n",
"bert_pretrainer = nlp.models.BertPretrainer(\n",
" network, num_classes=2, num_token_predictions=num_token_predictions, output='predictions')"
]
},
......@@ -213,7 +226,7 @@
"id": "d5h5HT7gNHx_"
},
"source": [
"Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `Classification` heads."
"Inspecting the `bert_pretrainer`, we see it wraps the `encoder` with additional `MaskedLM` and `nlp.layers.ClassificationHead` heads."
]
},
{
......@@ -224,7 +237,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(bert_pretrainer, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -236,7 +249,9 @@
"outputs": [],
"source": [
"# We can feed some dummy data to get masked language model and sentence output.\n",
"sequence_length = 16\n",
"batch_size = 2\n",
"\n",
"word_id_data = np.random.randint(vocab_size, size=(batch_size, sequence_length))\n",
"mask_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
"type_id_data = np.random.randint(2, size=(batch_size, sequence_length))\n",
......@@ -246,8 +261,8 @@
" [word_id_data, mask_data, type_id_data, masked_lm_positions_data])\n",
"lm_output = outputs[\"masked_lm\"]\n",
"sentence_output = outputs[\"classification\"]\n",
"print(lm_output)\n",
"print(sentence_output)"
"print(f'lm_output: shape={lm_output.shape}, dtype={lm_output.dtype!r}')\n",
"print(f'sentence_output: shape={sentence_output.shape}, dtype={sentence_output.dtype!r}')"
]
},
{
......@@ -272,14 +287,15 @@
"masked_lm_weights_data = np.random.randint(2, size=(batch_size, num_token_predictions))\n",
"next_sentence_labels_data = np.random.randint(2, size=(batch_size))\n",
"\n",
"mlm_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n",
"mlm_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=masked_lm_ids_data,\n",
" predictions=lm_output,\n",
" weights=masked_lm_weights_data)\n",
"sentence_loss = modeling.losses.weighted_sparse_categorical_crossentropy_loss(\n",
"sentence_loss = nlp.losses.weighted_sparse_categorical_crossentropy_loss(\n",
" labels=next_sentence_labels_data,\n",
" predictions=sentence_output)\n",
"loss = mlm_loss + sentence_loss\n",
"\n",
"print(loss)"
]
},
......@@ -290,8 +306,7 @@
},
"source": [
"With the loss, you can optimize the model.\n",
"After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n",
"\n"
"After training, we can save the weights of TransformerEncoder for the downstream fine-tuning tasks. Please see [run_pretraining.py](https://github.com/tensorflow/models/blob/master/official/nlp/bert/run_pretraining.py) for the full example.\n"
]
},
{
......@@ -315,9 +330,9 @@
"source": [
"### Build a BertSpanLabeler wrapping BertEncoder\n",
"\n",
"[BertSpanLabeler](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_span_labeler.py) implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n",
"The `nlp.models.BertSpanLabeler` class implements a simple single-span start-end predictor (that is, a model that predicts two values: a start token index and an end token index), suitable for SQuAD-style tasks.\n",
"\n",
"Note that `BertSpanLabeler` wraps a `BertEncoder`, the weights of which can be restored from the above pretraining model.\n"
"Note that `nlp.models.BertSpanLabeler` wraps a `nlp.networks.BertEncoder`, the weights of which can be restored from the above pretraining model.\n"
]
},
{
......@@ -328,11 +343,11 @@
},
"outputs": [],
"source": [
"network = modeling.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n",
"network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2)\n",
"\n",
"# Create a BERT trainer with the created network.\n",
"bert_span_labeler = modeling.models.BertSpanLabeler(network)"
"bert_span_labeler = nlp.models.BertSpanLabeler(network)"
]
},
{
......@@ -352,7 +367,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(bert_span_labeler, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -370,8 +385,9 @@
"\n",
"# Feed the data to the model.\n",
"start_logits, end_logits = bert_span_labeler([word_id_data, mask_data, type_id_data])\n",
"print(start_logits)\n",
"print(end_logits)"
"\n",
"print(f'start_logits: shape={start_logits.shape}, dtype={start_logits.dtype!r}')\n",
"print(f'end_logits: shape={end_logits.shape}, dtype={end_logits.dtype!r}')"
]
},
{
......@@ -432,7 +448,7 @@
"source": [
"### Build a BertClassifier model wrapping BertEncoder\n",
"\n",
"[BertClassifier](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_classifier.py) implements a [CLS] token classification model containing a single classification head."
"`nlp.models.BertClassifier` implements a [CLS] token classification model containing a single classification head."
]
},
{
......@@ -443,12 +459,12 @@
},
"outputs": [],
"source": [
"network = modeling.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2, sequence_length=sequence_length)\n",
"network = nlp.networks.BertEncoder(\n",
" vocab_size=vocab_size, num_layers=2)\n",
"\n",
"# Create a BERT trainer with the created network.\n",
"num_classes = 2\n",
"bert_classifier = modeling.models.BertClassifier(\n",
"bert_classifier = nlp.models.BertClassifier(\n",
" network, num_classes=num_classes)"
]
},
......@@ -469,7 +485,7 @@
},
"outputs": [],
"source": [
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)"
"tf.keras.utils.plot_model(bert_classifier, show_shapes=True, expand_nested=True, dpi=48)"
]
},
{
......@@ -487,7 +503,7 @@
"\n",
"# Feed the data to the model.\n",
"logits = bert_classifier([word_id_data, mask_data, type_id_data])\n",
"print(logits)"
"print(f'logits: shape={logits.shape}, dtype={logits.dtype!r}')"
]
},
{
......@@ -529,8 +545,7 @@
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "Introduction to the TensorFlow Models NLP library",
"private_outputs": true,
"name": "nlp_modeling_library_intro.ipynb",
"provenance": [],
"toc_visible": true
},
......
......@@ -12,3 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Core is shared by both `nlp` and `vision`."""
from official.core import actions
from official.core import base_task
from official.core import base_trainer
from official.core import config_definitions
from official.core import exp_factory
from official.core import export_base
from official.core import input_reader
from official.core import registry
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
......@@ -33,57 +33,6 @@ ExperimentConfig = config_definitions.ExperimentConfig
TrainerConfig = config_definitions.TrainerConfig
class Recovery:
"""Built-in model blowup recovery module.
Checks the loss value by the given threshold. If applicable, recover the
model by reading the checkpoint on disk.
"""
def __init__(self,
loss_upper_bound: float,
checkpoint_manager: tf.train.CheckpointManager,
recovery_begin_steps: int = 0,
recovery_max_trials: int = 3):
self.recover_counter = 0
self.recovery_begin_steps = recovery_begin_steps
self.recovery_max_trials = recovery_max_trials
self.loss_upper_bound = loss_upper_bound
self.checkpoint_manager = checkpoint_manager
def should_recover(self, loss_value, global_step):
if tf.math.is_nan(loss_value):
return True
if (global_step >= self.recovery_begin_steps and
loss_value > self.loss_upper_bound):
return True
return False
def maybe_recover(self, loss_value, global_step):
"""Conditionally recovers the training by triggering checkpoint restoration.
Args:
loss_value: the loss value as a float.
global_step: the number of global training steps.
Raises:
RuntimeError: when recovery happens more than the max number of trials,
the job should crash.
"""
if not self.should_recover(loss_value, global_step):
return
self.recover_counter += 1
if self.recover_counter > self.recovery_max_trials:
raise RuntimeError(
"The loss value is NaN or out of range after training loop and "
f"this happens {self.recover_counter} times.")
# Loads the previous good checkpoint.
checkpoint_path = self.checkpoint_manager.restore_or_initialize()
logging.warning(
"Recovering the model from checkpoint: %s. The loss value becomes "
"%f at step %d.", checkpoint_path, loss_value, global_step)
class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator):
"""Trainer class for both sync and async Strategy."""
......
......@@ -150,30 +150,6 @@ class MockAsyncTrainer(trainer_lib._AsyncTrainer):
return self.eval_global_step.numpy()
class RecoveryTest(tf.test.TestCase):
def test_recovery_module(self):
ckpt = tf.train.Checkpoint(v=tf.Variable(1, dtype=tf.int32))
model_dir = self.get_temp_dir()
manager = tf.train.CheckpointManager(ckpt, model_dir, max_to_keep=1)
recovery_module = trainer_lib.Recovery(
loss_upper_bound=1.0,
checkpoint_manager=manager,
recovery_begin_steps=1,
recovery_max_trials=1)
self.assertFalse(recovery_module.should_recover(1.1, 0))
self.assertFalse(recovery_module.should_recover(0.1, 1))
self.assertTrue(recovery_module.should_recover(1.1, 2))
# First triggers the recovery once.
recovery_module.maybe_recover(1.1, 10)
# Second time, it raises.
with self.assertRaisesRegex(
RuntimeError, 'The loss value is NaN .*'):
recovery_module.maybe_recover(1.1, 10)
class TrainerTest(tf.test.TestCase, parameterized.TestCase):
def setUp(self):
......
......@@ -76,6 +76,10 @@ class DataConfig(base_config.Config):
features. The main use case is to skip the image/video decoding for better
performance.
seed: An optional seed to use for deterministic shuffling/preprocessing.
prefetch_buffer_size: An int specifying the buffer size of prefetch
datasets. If None, the buffer size is autotuned. Specifying this is useful
in case autotuning uses up too much memory by making the buffer size too
high.
"""
input_path: Union[Sequence[str], str, base_config.Config] = ""
tfds_name: str = ""
......@@ -96,6 +100,7 @@ class DataConfig(base_config.Config):
tfds_as_supervised: bool = False
tfds_skip_decoding_feature: str = ""
seed: Optional[int] = None
prefetch_buffer_size: Optional[int] = None
@dataclasses.dataclass
......@@ -190,8 +195,8 @@ class TrainerConfig(base_config.Config):
is only used in continuous_train_and_eval and continuous_eval modes. Default
value is 1 hour.
train_steps: number of train steps.
validation_steps: number of eval steps. If `None`, the entire eval dataset
is used.
validation_steps: number of eval steps. If -1, the entire eval dataset is
used.
validation_interval: number of training steps to run between evaluations.
best_checkpoint_export_subdir: if set, the trainer will keep track of the
best evaluation metric, and export the corresponding best checkpoint under
......
......@@ -292,6 +292,8 @@ class InputReader:
self._transform_and_batch_fn = transform_and_batch_fn
self._postprocess_fn = postprocess_fn
self._seed = params.seed
self._prefetch_buffer_size = (params.prefetch_buffer_size or
tf.data.experimental.AUTOTUNE)
# When tf.data service is enabled, each data service worker should get
# different random seeds. Thus, we set `seed` to None.
......@@ -505,4 +507,4 @@ class InputReader:
options = tf.data.Options()
options.experimental_deterministic = self._deterministic
dataset = dataset.with_options(options)
return dataset.prefetch(tf.data.experimental.AUTOTUNE)
return dataset.prefetch(self._prefetch_buffer_size)
# Copyright 2022 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Adam optimizer with weight decay that exactly matches the original BERT."""
import re
from absl import logging
import tensorflow as tf
class AdamWeightDecay(tf.keras.optimizers.Adam):
"""Adam enables L2 weight decay and clip_by_global_norm on gradients.
[Warning!]: Keras optimizer supports gradient clipping and has an AdamW
implementation. Please consider evaluating the choice in Keras package.
Just adding the square of the weights to the loss function is *not* the
correct way of using L2 regularization/weight decay with Adam, since that will
interact with the m and v parameters in strange ways.
Instead we want to decay the weights in a manner that doesn't interact with
the m/v parameters. This is equivalent to adding the square of the weights to
the loss with plain (non-momentum) SGD.
"""
def __init__(self,
learning_rate=0.001,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-7,
amsgrad=False,
weight_decay_rate=0.0,
include_in_weight_decay=None,
exclude_from_weight_decay=None,
gradient_clip_norm=1.0,
name='AdamWeightDecay',
**kwargs):
super(AdamWeightDecay, self).__init__(learning_rate, beta_1, beta_2,
epsilon, amsgrad, name, **kwargs)
self.weight_decay_rate = weight_decay_rate
self.gradient_clip_norm = gradient_clip_norm
self._include_in_weight_decay = include_in_weight_decay
self._exclude_from_weight_decay = exclude_from_weight_decay
logging.info('AdamWeightDecay gradient_clip_norm=%f', gradient_clip_norm)
def _prepare_local(self, var_device, var_dtype, apply_state):
super(AdamWeightDecay, self)._prepare_local(var_device, var_dtype, # pytype: disable=attribute-error # typed-keras
apply_state)
apply_state[(var_device, var_dtype)]['weight_decay_rate'] = tf.constant(
self.weight_decay_rate, name='adam_weight_decay_rate')
def _decay_weights_op(self, var, learning_rate, apply_state):
do_decay = self._do_use_weight_decay(var.name)
if do_decay:
return var.assign_sub(
learning_rate * var *
apply_state[(var.device, var.dtype.base_dtype)]['weight_decay_rate'],
use_locking=self._use_locking)
return tf.no_op()
def apply_gradients(self,
grads_and_vars,
name=None,
experimental_aggregate_gradients=True):
grads, tvars = list(zip(*grads_and_vars))
if experimental_aggregate_gradients and self.gradient_clip_norm > 0.0:
# When experimental_aggregate_gradients = False, apply_gradients() no
# longer implicitly allreduces gradients; users manually allreduce gradients
# and pass the allreduced grads_and_vars. For now, the
# clip_by_global_norm will be moved to before the explicit allreduce to
# keep the math the same as the TF 1 and pre-TF 2.2 implementations.
(grads, _) = tf.clip_by_global_norm(
grads, clip_norm=self.gradient_clip_norm)
return super(AdamWeightDecay, self).apply_gradients(
zip(grads, tvars),
name=name,
experimental_aggregate_gradients=experimental_aggregate_gradients)
def _get_lr(self, var_device, var_dtype, apply_state):
"""Retrieves the learning rate with the given state."""
if apply_state is None:
return self._decayed_lr_t[var_dtype], {}
apply_state = apply_state or {}
coefficients = apply_state.get((var_device, var_dtype))
if coefficients is None:
coefficients = self._fallback_apply_state(var_device, var_dtype)
apply_state[(var_device, var_dtype)] = coefficients
return coefficients['lr_t'], dict(apply_state=apply_state)
def _resource_apply_dense(self, grad, var, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_dense(grad, var, **kwargs) # pytype: disable=attribute-error # typed-keras
def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
lr_t, kwargs = self._get_lr(var.device, var.dtype.base_dtype, apply_state)
decay = self._decay_weights_op(var, lr_t, apply_state)
with tf.control_dependencies([decay]):
return super(AdamWeightDecay,
self)._resource_apply_sparse(grad, var, indices, **kwargs) # pytype: disable=attribute-error # typed-keras
def get_config(self):
config = super(AdamWeightDecay, self).get_config()
config.update({
'weight_decay_rate': self.weight_decay_rate,
})
return config
def _do_use_weight_decay(self, param_name):
"""Whether to use L2 weight decay for `param_name`."""
if self.weight_decay_rate == 0:
return False
if self._include_in_weight_decay:
for r in self._include_in_weight_decay:
if re.search(r, param_name) is not None:
return True
if self._exclude_from_weight_decay:
for r in self._exclude_from_weight_decay:
if re.search(r, param_name) is not None:
return False
return True
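For reference, a minimal usage sketch of the class defined above (hyperparameter values are illustrative, and the import assumes the Model Garden package is available): weight decay is applied directly to the variables, with LayerNorm and bias parameters excluded as in the original BERT recipe.

```python
import tensorflow as tf
from official.modeling.optimization import legacy_adamw

# Illustrative hyperparameters; decay weights directly, excluding
# LayerNorm and bias variables as in the original BERT setup.
optimizer = legacy_adamw.AdamWeightDecay(
    learning_rate=1e-4,
    weight_decay_rate=0.01,
    exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'],
    gradient_clip_norm=1.0)

var = tf.Variable([1.0, 2.0], name='dense/kernel')
with tf.GradientTape() as tape:
  loss = tf.reduce_sum(var ** 2)
grads = tape.gradient(loss, [var])
optimizer.apply_gradients(zip(grads, [var]))
print(var.numpy())  # updated by Adam and decayed by weight_decay_rate * lr
```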
......@@ -18,20 +18,21 @@ from typing import Callable, Optional, Union, List, Tuple
import gin
import tensorflow as tf
import tensorflow_addons.optimizers as tfa_optimizers
from official.modeling.optimization import slide_optimizer
from official.modeling.optimization import adafactor_optimizer
from official.modeling.optimization import ema_optimizer
from official.modeling.optimization import lars_optimizer
from official.modeling.optimization import legacy_adamw
from official.modeling.optimization import lr_schedule
from official.modeling.optimization.configs import optimization_config as opt_cfg
from official.nlp import optimization as nlp_optimization
OPTIMIZERS_CLS = {
'sgd': tf.keras.optimizers.SGD,
# TODO(chenmoneygithub): experimental.SGD
'adam': tf.keras.optimizers.Adam,
# TODO(chenmoneygithub): experimental.Adam
'adamw': nlp_optimization.AdamWeightDecay,
'adamw': legacy_adamw.AdamWeightDecay,
'lamb': tfa_optimizers.LAMB,
'rmsprop': tf.keras.optimizers.RMSprop,
'lars': lars_optimizer.LARS,
......@@ -57,8 +58,8 @@ WARMUP_CLS = {
}
def register_optimizer_cls(
key: str, optimizer_config_cls: tf.keras.optimizers.Optimizer):
def register_optimizer_cls(key: str,
optimizer_config_cls: tf.keras.optimizers.Optimizer):
"""Register customize optimizer cls.
The user will still need to subclass data classes in
......@@ -85,6 +86,8 @@ class OptimizerFactory:
(4) Build optimizer.
This is a typical example for using this class:
```
params = {
'optimizer': {
'type': 'sgd',
......@@ -104,6 +107,7 @@ class OptimizerFactory:
opt_factory = OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
```
"""
def __init__(self, config: opt_cfg.OptimizationConfig):
......@@ -156,9 +160,12 @@ class OptimizerFactory:
def build_optimizer(
self,
lr: Union[tf.keras.optimizers.schedules.LearningRateSchedule, float],
gradient_aggregator: Optional[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]] = None,
gradient_transformers: Optional[List[Callable[
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor, tf.Tensor]]
]]] = None,
[List[Tuple[tf.Tensor, tf.Tensor]]], List[Tuple[tf.Tensor,
tf.Tensor]]]]] = None,
postprocessor: Optional[Callable[[tf.keras.optimizers.Optimizer],
tf.keras.optimizers.Optimizer]] = None):
"""Build optimizer.
......@@ -170,6 +177,7 @@ class OptimizerFactory:
Args:
lr: A floating point value, or a
tf.keras.optimizers.schedules.LearningRateSchedule instance.
gradient_aggregator: Optional function to overwrite gradient aggregation.
gradient_transformers: Optional list of functions to use to transform
gradients before applying updates to Variables. The functions are
applied after gradient_aggregator. The functions should accept and
......@@ -193,6 +201,8 @@ class OptimizerFactory:
del optimizer_dict['global_clipnorm']
optimizer_dict['learning_rate'] = lr
if gradient_aggregator is not None:
optimizer_dict['gradient_aggregator'] = gradient_aggregator
if gradient_transformers is not None:
optimizer_dict['gradient_transformers'] = gradient_transformers
......
......@@ -49,6 +49,39 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
self.assertIsInstance(optimizer, optimizer_cls)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
def test_gradient_aggregator(self):
params = {
'optimizer': {
'type': 'adam',
},
'learning_rate': {
'type': 'constant',
'constant': {
'learning_rate': 1.0
}
}
}
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
# Dummy function to zero out gradients.
zero_grads = lambda gv: [(tf.zeros_like(g), v) for g, v in gv]
optimizer = opt_factory.build_optimizer(lr, gradient_aggregator=zero_grads)
var0 = tf.Variable([1.0, 2.0])
var1 = tf.Variable([3.0, 4.0])
grads0 = tf.constant([1.0, 1.0])
grads1 = tf.constant([1.0, 1.0])
grads_and_vars = list(zip([grads0, grads1], [var0, var1]))
optimizer.apply_gradients(grads_and_vars)
self.assertAllClose(np.array([1.0, 2.0]), var0.numpy())
self.assertAllClose(np.array([3.0, 4.0]), var1.numpy())
@parameterized.parameters((None, None), (1.0, None), (None, 1.0))
def test_gradient_clipping(self, clipnorm, clipvalue):
params = {
......@@ -418,7 +451,7 @@ class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
}
}
}
expected_lr_step_values = [[0, 0.0], [5000, 1e-4/2.0], [10000, 1e-4],
expected_lr_step_values = [[0, 0.0], [5000, 1e-4 / 2.0], [10000, 1e-4],
[20000, 9.994863e-05], [499999, 5e-05]]
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
......@@ -434,10 +467,12 @@ class OptimizerFactoryRegistryTest(tf.test.TestCase):
class MyClass():
pass
optimizer_factory.register_optimizer_cls('test', MyClass)
self.assertIn('test', optimizer_factory.OPTIMIZERS_CLS)
with self.assertRaisesRegex(ValueError, 'test already registered.*'):
optimizer_factory.register_optimizer_cls('test', MyClass)
if __name__ == '__main__':
tf.test.main()
# TensorFlow NLP Modelling Toolkit
# TF-NLP Model Garden
⚠️ Disclaimer: The datasets hyperlinked from this page are not owned or
distributed by Google; they are made available by third parties.
Please review the terms and conditions made available by the third parties
before using the data.
This codebase provides a Natural Language Processing modeling toolkit written in
[TF2](https://www.tensorflow.org/guide/effective_tf2). It allows researchers and
......@@ -30,7 +35,10 @@ research ideas. Detailed instructions can be found in READMEs in each folder.
We provide SoTA model implementations, pre-trained models, training and
evaluation examples, and command lines. Detailed instructions can be found in the
READMEs for specific papers.
READMEs for specific papers. Below are some of the papers implemented in this
repository; more NLP projects can be found in the
[`projects`](https://github.com/tensorflow/models/tree/master/official/projects)
folder:
1. [BERT](MODEL_GARDEN.md#available-model-configs): [BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding](https://arxiv.org/abs/1810.04805) by Devlin et al.,
......@@ -38,10 +46,10 @@ READMEs for specific papers.
2. [ALBERT](MODEL_GARDEN.md#available-model-configs):
[A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
by Lan et al., 2019
3. [XLNet](xlnet):
3. [XLNet](MODEL_GARDEN.md):
[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
by Yang et al., 2019
4. [Transformer for translation](transformer):
4. [Transformer for translation](MODEL_GARDEN.md#available-model-configs):
[Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et
al., 2017
......
......@@ -17,4 +17,3 @@
from official.nlp.configs import finetuning_experiments
from official.nlp.configs import pretraining_experiments
from official.nlp.configs import wmt_transformer_experiments
from official.projects.teams import teams_experiments
......@@ -187,6 +187,8 @@ class AxProcessor(DataProcessor):
def _create_examples_tfds(self, dataset, set_type):
"""Creates examples for the training/dev/test sets."""
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -218,6 +220,8 @@ class ColaProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/cola", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -312,6 +316,8 @@ class MnliProcessor(DataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -343,6 +349,8 @@ class MrpcProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/mrpc", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -453,6 +461,8 @@ class QnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -484,6 +494,8 @@ class QqpProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/qqp", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -517,6 +529,8 @@ class RteProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/rte", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -548,6 +562,8 @@ class SstProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/sst2", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -574,6 +590,8 @@ class StsBProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/stsb", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......@@ -742,6 +760,8 @@ class WnliProcessor(DefaultGLUEDataProcessor):
"""Creates examples for the training/dev/test sets."""
dataset = tfds.load(
"glue/wnli", split=set_type, try_gcs=True).as_numpy_iterator()
dataset = list(dataset)
dataset.sort(key=lambda x: x["idx"])
examples = []
for i, example in enumerate(dataset):
guid = "%s-%s" % (set_type, i)
......
......@@ -178,13 +178,13 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq=False,
begin_kernel=0,
scale=None,
scale_by_length=False,
**kwargs):
r"""Constructor of KernelAttention.
Args:
feature_transform: A non-linear transform of the keys and quries.
Possible transforms are "elu", "relu", "square", "exp", "expmod",
"identity".
feature_transform: A non-linear transform of the keys and queries. Possible
transforms are "elu", "relu", "square", "exp", "expmod", "identity".
num_random_features: Number of random features to be used for projection.
If num_random_features <= 0, no projection is used before the transform.
seed: The seed to begin drawing random features. Once the seed is set, the
......@@ -194,12 +194,16 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
redraw: Whether to redraw projection every forward pass during training.
The argument is only effective when num_random_features > 0.
is_short_seq: boolean predicate indicating whether input data consists of
very short sequences or not; in most cases this should be False
(default option).
very short sequences or not; in most cases this should be False (default
option).
begin_kernel: Apply kernel_attention after this sequence id and apply
softmax attention before this.
scale: The value to scale the dot product as described in `Attention Is
All You Need`. If None, we use 1/sqrt(dk) as described in the paper.
scale_by_length: boolean predicate indicating whether to additionally scale
the dot product based on key length. Set to log_512(n) to stabilize
attention entropy against length. Refer to
https://kexue.fm/archives/8823 for details.
**kwargs: The same arguments `MultiHeadAttention` layer.
"""
if feature_transform not in _TRANSFORM_MAP:
......@@ -214,6 +218,7 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
self._redraw = redraw
self._is_short_seq = is_short_seq
self._begin_kernel = begin_kernel
self._scale_by_length = scale_by_length
# We use the seed for two scenarios:
# 1. inference
# 2. no redraw
......@@ -252,9 +257,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
is_short_seq: boolean predicate indicating whether input data consists of
short or long sequences; usually short sequence is defined as having
length L <= 1024.
attention_mask: a boolean mask of shape `[B, S]`, that prevents
attenting to masked positions. Note that the mask is only appied to
the keys. User may want to mask the output if query contains pads.
attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
to masked positions. Note that the mask is only applied to the keys. User
may want to mask the output if query contains pads.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (doing nothing).
numeric_stabler: A scalar value added to avoid divide by 0.
......@@ -270,17 +275,23 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
else:
projection_matrix = self._projection_matrix
if self._scale_by_length:
scale = tf.math.log(tf.reduce_sum(attention_mask,
axis=-1)) * self._scale / math.log(512)
scale = tf.reshape(scale, [-1, 1, 1, 1])
else:
scale = self._scale
if is_short_seq:
# Note: Applying scalar multiply at the smaller end of einsum improves
# XLA performance, but may introduce slight numeric differences in
# the Transformer attention head.
query = query * self._scale
query = query * scale
else:
# Note: we suspect splitting the scale to key, query yields smaller
# approximation variance when random projection is used.
# For simplicity, we also split when there's no random projection.
key *= math.sqrt(self._scale)
query *= math.sqrt(self._scale)
key *= tf.math.sqrt(scale)
query *= tf.math.sqrt(scale)
key = _TRANSFORM_MAP[feature_transform](key, projection_matrix)
query = _TRANSFORM_MAP[feature_transform](query, projection_matrix)
......@@ -330,9 +341,9 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
value: Value `Tensor` of shape `[B, S, dim]`.
key: Optional key `Tensor` of shape `[B, S, dim]`. If not given, will use
`value` for both `key` and `value`, which is the most common case.
attention_mask: a boolean mask of shape `[B, S]`, that prevents
attenting to masked positions. Note that the mask is only appied to
the keys. User may want to mask the output if query contains pads.
attention_mask: a boolean mask of shape `[B, S]`, that prevents attending
to masked positions. Note that the mask is only applied to the keys. User
may want to mask the output if query contains pads.
training: Python boolean indicating whether the layer should behave in
training mode (adding dropout) or in inference mode (doing nothing).
......@@ -373,9 +384,10 @@ class KernelAttention(tf.keras.layers.MultiHeadAttention):
attention_output = tf.concat(
[attention_output_softmax, attention_output_kernel], axis=1)
else:
attention_output = self._compute_attention(
query, key, value, self._feature_transform,
self._is_short_seq, attention_mask, training)
attention_output = self._compute_attention(query, key, value,
self._feature_transform,
self._is_short_seq,
attention_mask, training)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_output = self._dropout_layer(attention_output)
......
......@@ -30,9 +30,9 @@ _BEGIN_KERNEL = [0, 512]
class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
@parameterized.parameters(itertools.product(
_FEATURE_TRANSFORM, [127], _TRAINING, [True, False],
_IS_SHORT_SEQ, _BEGIN_KERNEL))
@parameterized.parameters(
itertools.product(_FEATURE_TRANSFORM, [127], _TRAINING, [True, False],
_IS_SHORT_SEQ, _BEGIN_KERNEL))
def test_attention_projection(
self, feature_transform, num_random_features, training, redraw, is_short,
begin_kernel):
......@@ -90,6 +90,32 @@ class KernelAttentionTest(tf.test.TestCase, parameterized.TestCase):
training=training)
self.assertEqual(output.shape, [batch_size, seq_length, key_dim])
@parameterized.parameters([128, 512])
def test_attention_scale_by_length(self, seq_length):
num_heads = 12
key_dim = 64
batch_size = 2
test_layer = attention.KernelAttention(
num_heads=num_heads,
key_dim=key_dim,
num_random_features=0,
scale_by_length=True)
query = tf.random.normal(
shape=(batch_size, seq_length, key_dim))
value = query
encoder_inputs_mask = tf.ones((batch_size, seq_length), dtype=tf.int32)
masks = tf.cast(encoder_inputs_mask, dtype=tf.float32)
output_scale_by_length = test_layer(
query=query, value=value, attention_mask=masks)
test_layer._scale_by_length = False
output_no_scale_by_length = test_layer(
query=query, value=value, attention_mask=masks)
if seq_length == 512: # Equals because log(seq_length, base=512) = 1.0
self.assertAllClose(output_scale_by_length, output_no_scale_by_length)
else:
self.assertNotAllClose(output_scale_by_length, output_no_scale_by_length)
def test_unsupported_feature_transform(self):
with self.assertRaisesRegex(ValueError, 'Unsupported feature_transform.*'):
_ = attention.KernelAttention(feature_transform='test')
......
......@@ -14,6 +14,7 @@
"""Keras-based TransformerEncoder block layer."""
from absl import logging
import tensorflow as tf
from official.nlp.modeling.layers import util
......@@ -176,9 +177,9 @@ class TransformerEncoderBlock(tf.keras.layers.Layer):
einsum_equation = "...bc,cd->...bd"
hidden_size = input_tensor_shape[-1]
if hidden_size % self._num_heads != 0:
raise ValueError(
logging.warning(
"The input size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, self._num_heads))
"heads (%d)", hidden_size, self._num_heads)
if self._key_dim is None:
self._key_dim = int(hidden_size // self._num_heads)
if self._output_last_dim is None:
......