Release keras bert:

- Update classifier example. - Add new converted checkpoints. - Update benchmark, PiperOrigin-RevId: 279762797

Release keras bert:
- Update classifier example. - Add new converted checkpoints. - Update benchmark, PiperOrigin-RevId: 279762797
f1d35b4e · Hongkun Yu · A. Unique TensorFlower · 0351cb87 · f1d35b4e · f1d35b4e
Commit f1d35b4e authored Nov 11, 2019 by Hongkun Yu Committed by A. Unique TensorFlower Nov 11, 2019
20 changed files
--- a/official/benchmark/bert_benchmark.py
+++ b/official/benchmark/bert_benchmark.py
@@ -35,7 +35,7 @@ from official.nlp.bert import run_classifier
 from official.utils.misc import distribution_utils

 # pylint: disable=line-too-long
-PRETRAINED_CHECKPOINT_PATH = 'gs://cloud-tpu-checkpoints/bert/tf_20/uncased_L-24_H-1024_A-16/bert_model.ckpt'
+PRETRAINED_CHECKPOINT_PATH = 'placer/prod/home/tensorflow-performance-data/datasets/bert/keras_bert/bert_model.ckpt'
 CLASSIFIER_TRAIN_DATA_PATH = 'gs://tf-perfzero-data/bert/classification/mrpc_train.tf_record'
 CLASSIFIER_EVAL_DATA_PATH = 'gs://tf-perfzero-data/bert/classification/mrpc_eval.tf_record'
 CLASSIFIER_INPUT_META_DATA_PATH = 'gs://tf-perfzero-data/bert/classification/mrpc_meta_data'

--- a/official/benchmark/bert_squad_benchmark.py
+++ b/official/benchmark/bert_squad_benchmark.py
@@ -34,7 +34,7 @@ from official.nlp.bert import run_squad
 from official.utils.misc import distribution_utils

 # pylint: disable=line-too-long
-PRETRAINED_CHECKPOINT_PATH = 'gs://cloud-tpu-checkpoints/bert/tf_20/uncased_L-24_H-1024_A-16/bert_model.ckpt'
+PRETRAINED_CHECKPOINT_PATH = '/placer/prod/home/tensorflow-performance-data/datasets/bert/tf_20/uncased_L-24_H-1024_A-16/bert_model.ckpt'
 SQUAD_TRAIN_DATA_PATH = 'gs://tf-perfzero-data/bert/squad/squad_train.tf_record'
 SQUAD_PREDICT_FILE = 'gs://tf-perfzero-data/bert/squad/dev-v1.1.json'
 SQUAD_VOCAB_FILE = 'gs://tf-perfzero-data/bert/squad/vocab.txt'

--- a/official/nlp/bert/README.md
+++ b/official/nlp/bert/README.md
@@ -32,6 +32,11 @@ are going to release new pre-trained checkpoints soon.
 We provide checkpoints that are converted from [google-research/bert](https://github.com/google-research/bert),
 in order to keep consistent with BERT paper.

+The stable model checkpoints work with [v2.0 release](https://github.com/tensorflow/models/releases/tag/v2.0).
+
+**Note: these checkpoints are not compatible with the current master
+[run_classifier.py](run_classifier.py) example.**
+
 *   **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/wwm_uncased_L-24_H-1024_A-16.tar.gz)**:
    24-layer, 1024-hidden, 16-heads, 340M parameters
 *   **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/wwm_cased_L-24_H-1024_A-16.tar.gz)**:
@@ -45,12 +50,28 @@ in order to keep consistent with BERT paper.
 *   **[`BERT-Large, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/cased_L-24_H-1024_A-16.tar.gz)**:
    24-layer, 1024-hidden, 16-heads, 340M parameters

-We recommend to host checkpoints on Google Cloud storage buckets when you use
-Cloud GPU/TPU. For example, in the following tutorial, we use:
+**Note: We are in the middle of a transition stage to switch BERT implementation
+to use Keras functional-style networks in [nlp/modeling](../modeling).
+The checkpoint above will be deleted once transition is done.**

-```shell
-export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/tf_20/uncased_L-24_H-1024_A-16
-```
+The new checkpoints work with [run_classifier.py](run_classifier.py) example
+are:
+
+*   **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/wwm_uncased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/wwm_cased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Uncased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12.tar.gz)**:
+    12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Large, Uncased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/cased_L-12_H-768_A-12.tar.gz)**:
+    12-layer, 768-hidden, 12-heads , 110M parameters
+*   **[`BERT-Large, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/cased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+
+We recommend to host checkpoints on Google Cloud storage buckets when you use
+Cloud GPU/TPU.

 ### Restoring from Checkpoints

@@ -175,7 +196,7 @@ The unzipped pre-trained model files can also be found in the Google Cloud
 Storage folder `gs://cloud-tpu-checkpoints/bert/tf_20`. For example:

 ```shell
-export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/tf_20/uncased_L-24_H-1024_A-16
+export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
 export MODEL_DIR=gs://some_bucket/my_output_dir
 ```


--- a/official/nlp/bert_models.py
+++ b/official/nlp/bert_models.py
@@ -24,6 +24,8 @@ import tensorflow_hub as hub

 from official.modeling import tf_utils
 from official.nlp import bert_modeling as modeling
+from official.nlp.modeling import networks
+from official.nlp.modeling.networks import bert_classifier


 def gather_indexes(sequence_tensor, positions):
@@ -160,8 +162,8 @@ class BertPretrainLossAndMetricLayer(tf.keras.layers.Layer):
        lm_output, sentence_output, lm_label_ids, lm_label_weights,
        sentence_labels
    ])
-    return super(BertPretrainLossAndMetricLayer, self).__call__(
-        inputs, **kwargs)
+    return super(BertPretrainLossAndMetricLayer,
+                 self).__call__(inputs, **kwargs)

  def _add_metrics(self, lm_output, lm_labels, lm_label_weights,
                   lm_per_example_loss, sentence_output, sentence_labels,
@@ -354,9 +356,7 @@ def squad_model(bert_config,
      shape=(max_seq_length,), dtype=tf.int32, name='segment_ids')

  if hub_module_url:
-    core_model = hub.KerasLayer(
-        hub_module_url,
-        trainable=True)
+    core_model = hub.KerasLayer(hub_module_url, trainable=True)
    _, sequence_output = core_model(
        [input_word_ids, input_mask, input_type_ids])
    # Sets the shape manually due to a bug in TF shape inference.
@@ -417,32 +417,44 @@ def classifier_model(bert_config,
    Combined prediction model (words, mask, type) -> (one-hot labels)
    BERT sub-model (words, mask, type) -> (bert_outputs)
  """
-  input_word_ids = tf.keras.layers.Input(
-      shape=(max_seq_length,), dtype=tf.int32, name='input_word_ids')
-  input_mask = tf.keras.layers.Input(
-      shape=(max_seq_length,), dtype=tf.int32, name='input_mask')
-  input_type_ids = tf.keras.layers.Input(
-      shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')
-  if hub_module_url:
-    bert_model = hub.KerasLayer(hub_module_url, trainable=True)
-    pooled_output, _ = bert_model([input_word_ids, input_mask, input_type_ids])
-  else:
-    bert_model = modeling.get_bert_model(
-        input_word_ids,
-        input_mask,
-        input_type_ids,
-        config=bert_config,
-        float_type=float_type)
-    pooled_output = bert_model.outputs[0]
-
  if final_layer_initializer is not None:
    initializer = final_layer_initializer
  else:
    initializer = tf.keras.initializers.TruncatedNormal(
        stddev=bert_config.initializer_range)

+  if not hub_module_url:
+    bert_encoder = networks.TransformerEncoder(
+        vocab_size=bert_config.vocab_size,
+        hidden_size=bert_config.hidden_size,
+        num_layers=bert_config.num_hidden_layers,
+        num_attention_heads=bert_config.num_attention_heads,
+        intermediate_size=bert_config.intermediate_size,
+        activation=tf_utils.get_activation('gelu'),
+        dropout_rate=bert_config.hidden_dropout_prob,
+        attention_dropout_rate=bert_config.attention_probs_dropout_prob,
+        sequence_length=max_seq_length,
+        max_sequence_length=bert_config.max_position_embeddings,
+        type_vocab_size=bert_config.type_vocab_size,
+        initializer=tf.keras.initializers.TruncatedNormal(
+            stddev=bert_config.initializer_range))
+    return bert_classifier.BertClassifier(
+        bert_encoder,
+        num_classes=num_labels,
+        dropout_rate=bert_config.hidden_dropout_prob,
+        initializer=initializer), bert_encoder
+
+  input_word_ids = tf.keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_word_ids')
+  input_mask = tf.keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_mask')
+  input_type_ids = tf.keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')
+  bert_model = hub.KerasLayer(hub_module_url, trainable=True)
+  pooled_output, _ = bert_model([input_word_ids, input_mask, input_type_ids])
  output = tf.keras.layers.Dropout(rate=bert_config.hidden_dropout_prob)(
      pooled_output)
+
  output = tf.keras.layers.Dense(
      num_labels,
      kernel_initializer=initializer,

--- a/official/nlp/modeling/__init__.py
+++ b/official/nlp/modeling/__init__.py
+
--- a/official/nlp/modeling/layers/__init__.py
+++ b/official/nlp/modeling/layers/__init__.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Layers package definition."""
+from official.nlp.modeling.layers.attention import Attention
+from official.nlp.modeling.layers.dense_einsum import DenseEinsum
+from official.nlp.modeling.layers.masked_softmax import MaskedSoftmax
+from official.nlp.modeling.layers.on_device_embedding import OnDeviceEmbedding
+from official.nlp.modeling.layers.position_embedding import PositionEmbedding
+from official.nlp.modeling.layers.transformer import Transformer
--- a/official/nlp/modeling/layers/attention.py
+++ b/official/nlp/modeling/layers/attention.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras-based attention layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import math
+import tensorflow as tf
+
+from official.nlp.modeling.layers import dense_einsum
+from official.nlp.modeling.layers import masked_softmax
+
+
+@tf.keras.utils.register_keras_serializable(package="Text")
+class Attention(tf.keras.layers.Layer):
+  """Attention layer.
+
+  This is an implementation of multi-headed attention based on "Attention
+  is all you Need". If `from_tensor` and `to_tensor` are the same, then
+  this is self-attention. Each timestep in `from_tensor` attends to the
+  corresponding sequence in `to_tensor`, and returns a fixed-width vector.
+
+  This function first projects `from_tensor` into a "query" tensor and
+  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
+  of tensors of length `num_attention_heads`, where each tensor is of shape
+  [batch_size, seq_length, size_per_head].
+
+  Then, the query and key tensors are dot-producted and scaled. These are
+  softmaxed to obtain attention probabilities. The value tensors are then
+  interpolated by these probabilities, then concatenated back to a single
+  tensor and returned.
+
+  Attributes:
+    num_heads: Number of attention heads.
+    head_size: Size of each attention head.
+    dropout: Dropout probability.
+    kernel_initializer: Initializer for dense layer kernels.
+    bias_initializer: Initializer for dense layer biases.
+    kernel_regularizer: Regularizer for dense layer kernels.
+    bias_regularizer: Regularizer for dense layer biases.
+    activity_regularizer: Regularizer for dense layer activity.
+    kernel_constraint: Constraint for dense layer kernels.
+    bias_constraint: Constraint for dense layer kernels.
+  """
+
+  def __init__(self,
+               num_heads,
+               head_size,
+               dropout_rate=0.0,
+               kernel_initializer="glorot_uniform",
+               bias_initializer="zeros",
+               kernel_regularizer=None,
+               bias_regularizer=None,
+               activity_regularizer=None,
+               kernel_constraint=None,
+               bias_constraint=None,
+               **kwargs):
+    super(Attention, self).__init__(**kwargs)
+    self._num_heads = num_heads
+    self._head_size = head_size
+    self._dropout_rate = dropout_rate
+    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
+    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
+    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
+    self._bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
+    self._kernel_constraint = tf.keras.constraints.get(kernel_constraint)
+    self._bias_constraint = tf.keras.constraints.get(bias_constraint)
+
+    self._query_dense = dense_einsum.DenseEinsum(
+        output_shape=(self._num_heads, self._head_size),
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=self.dtype,
+        name="query")
+
+    self._key_dense = dense_einsum.DenseEinsum(
+        output_shape=(self._num_heads, self._head_size),
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=self.dtype,
+        name="key")
+
+    self._value_dense = dense_einsum.DenseEinsum(
+        output_shape=(self._num_heads, self._head_size),
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=self.dtype,
+        name="value")
+
+    self._masked_softmax = masked_softmax.MaskedSoftmax(mask_expansion_axes=[1])
+
+    self._dropout = tf.keras.layers.Dropout(
+        rate=self._dropout_rate, dtype=self.dtype)
+
+  def compute_output_shape(self, input_shape):
+    # TODO(momernick): validate tensor dimensioos
+    from_tensor_shape = tf.TensorShape(input_shape[0])
+    batch = from_tensor_shape[0]
+    from_tensor_length = from_tensor_shape[1]
+    return tf.TensorShape(
+        (batch, from_tensor_length, self._num_heads, self._head_size))
+
+  def get_config(self):
+    config = {
+        "num_heads":
+            self._num_heads,
+        "head_size":
+            self._head_size,
+        "dropout_rate":
+            self._dropout_rate,
+        "kernel_initializer":
+            tf.keras.initializers.serialize(self._kernel_initializer),
+        "bias_initializer":
+            tf.keras.initializers.serialize(self._bias_initializer),
+        "kernel_regularizer":
+            tf.keras.regularizers.serialize(self._kernel_regularizer),
+        "bias_regularizer":
+            tf.keras.regularizers.serialize(self._bias_regularizer),
+        "activity_regularizer":
+            tf.keras.regularizers.serialize(self._activity_regularizer),
+        "kernel_constraint":
+            tf.keras.constraints.serialize(self._kernel_constraint),
+        "bias_constraint":
+            tf.keras.constraints.serialize(self._bias_constraint)
+    }
+    base_config = super(Attention, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
+
+  def call(self, inputs):
+    from_tensor = inputs[0]
+    to_tensor = inputs[1]
+    attention_mask = inputs[2] if len(inputs) == 3 else None
+
+    # Scalar dimensions referenced here:
+    #   B = batch size (number of sequences)
+    #   F = `from_tensor` sequence length
+    #   T = `to_tensor` sequence length
+    #   N = `num_attention_heads`
+    #   H = `size_per_head`
+    # `query_tensor` = [B, F, N ,H]
+    query_tensor = self._query_dense(from_tensor)
+
+    # `key_tensor` = [B, T, N, H]
+    key_tensor = self._key_dense(to_tensor)
+
+    # `value_tensor` = [B, T, N, H]
+    value_tensor = self._value_dense(to_tensor)
+
+    # Take the dot product between "query" and "key" to get the raw
+    # attention scores.
+    attention_scores = tf.einsum("BTNH,BFNH->BNFT", key_tensor, query_tensor)
+    attention_scores = tf.multiply(attention_scores,
+                                   1.0 / math.sqrt(float(self._head_size)))
+
+    # Normalize the attention scores to probabilities.
+    # `attention_probs` = [B, N, F, T]
+    attention_probs = self._masked_softmax([attention_scores, attention_mask])
+
+    # This is actually dropping out entire tokens to attend to, which might
+    # seem a bit unusual, but is taken from the original Transformer paper.
+    attention_probs = self._dropout(attention_probs)
+
+    # `context_layer` = [B, F, N, H]
+    return tf.einsum("BNFT,BTNH->BFNH", attention_probs, value_tensor)
--- a/official/nlp/modeling/layers/attention_test.py
+++ b/official/nlp/modeling/layers/attention_test.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the attention layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.python.keras import keras_parameterized  # pylint: disable=g-direct-tensorflow-import
+from official.nlp.modeling.layers import attention
+
+
+# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
+# guarantees forward compatibility of this code for the V2 switchover.
+@keras_parameterized.run_all_keras_modes
+class AttentionLayerTest(keras_parameterized.TestCase):
+
+  def test_non_masked_attention(self):
+    """Test that the attention layer can be created without a mask tensor."""
+    test_layer = attention.Attention(num_heads=12, head_size=64)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    from_tensor = tf.keras.Input(shape=(40, 80))
+    to_tensor = tf.keras.Input(shape=(20, 80))
+    output = test_layer([from_tensor, to_tensor])
+    self.assertEqual(output.shape.as_list(), [None, 40, 12, 64])
+
+  def test_non_masked_self_attention(self):
+    """Test with one input (self-attenntion) and no mask tensor."""
+    test_layer = attention.Attention(num_heads=12, head_size=64)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    from_tensor = tf.keras.Input(shape=(40, 80))
+    output = test_layer([from_tensor, from_tensor])
+    self.assertEqual(output.shape.as_list(), [None, 40, 12, 64])
+
+  def test_masked_attention(self):
+    """Test with a mask tensor."""
+    test_layer = attention.Attention(num_heads=2, head_size=2)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    from_tensor = tf.keras.Input(shape=(4, 8))
+    to_tensor = tf.keras.Input(shape=(2, 8))
+    mask_tensor = tf.keras.Input(shape=(4, 2))
+    output = test_layer([from_tensor, to_tensor, mask_tensor])
+
+    # Create a model containing the test layer.
+    model = tf.keras.Model([from_tensor, to_tensor, mask_tensor], output)
+
+    # Generate data for the input (non-mask) tensors.
+    from_data = 10 * np.random.random_sample((3, 4, 8))
+    to_data = 10 * np.random.random_sample((3, 2, 8))
+
+    # Invoke the data with a random set of mask data. This should mask at least
+    # one element.
+    mask_data = np.random.randint(2, size=(3, 4, 2))
+    masked_output_data = model.predict([from_data, to_data, mask_data])
+
+    # Invoke the same data, but with a null mask (where no elements are masked).
+    null_mask_data = np.ones((3, 4, 2))
+    unmasked_output_data = model.predict([from_data, to_data, null_mask_data])
+
+    # Because one data is masked and one is not, the outputs should not be the
+    # same.
+    self.assertNotAllClose(masked_output_data, unmasked_output_data)
+
+  def test_initializer(self):
+    """Test with a specified initializer."""
+    test_layer = attention.Attention(
+        num_heads=12,
+        head_size=64,
+        kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
+    # Create a 3-dimensional input (the first dimension is implicit).
+    from_tensor = tf.keras.Input(shape=(40, 80))
+    output = test_layer([from_tensor, from_tensor])
+    self.assertEqual(output.shape.as_list(), [None, 40, 12, 64])
+
+
+if __name__ == '__main__':
+  tf.test.main()
--- a/official/nlp/modeling/layers/dense_einsum.py
+++ b/official/nlp/modeling/layers/dense_einsum.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras-based einsum layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import tensorflow as tf
+
+_CHR_IDX = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m"]
+
+
+@tf.keras.utils.register_keras_serializable(package="Text")
+class DenseEinsum(tf.keras.layers.Layer):
+  """A densely connected layer that uses tf.einsum as the backing computation.
+
+  This layer can perform einsum calculations of arbitrary dimensionality.
+
+  Attributes:
+    output_shape: Positive integer or tuple, dimensionality of the output space.
+    num_summed_dimensions: The number of dimensions to sum over. Standard 2D
+      matmul should use 1, 3D matmul should use 2, and so forth.
+    activation: Activation function to use. If you don't specify anything, no
+      activation is applied
+      (ie. "linear" activation: `a(x) = x`).
+    use_bias: Boolean, whether the layer uses a bias vector.
+    kernel_initializer: Initializer for the `kernel` weights matrix.
+    bias_initializer: Initializer for the bias vector.
+    kernel_regularizer: Regularizer function applied to the `kernel` weights
+      matrix.
+    bias_regularizer: Regularizer function applied to the bias vector.
+    activity_regularizer: Regularizer function applied to the output of the
+      layer (its "activation")..
+    kernel_constraint: Constraint function applied to the `kernel` weights
+      matrix.
+    bias_constraint: Constraint function applied to the bias vector.
+  Input shape:
+    N-D tensor with shape: `(batch_size, ..., input_dim)`. The most common
+      situation would be a 2D input with shape `(batch_size, input_dim)`.
+  Output shape:
+    N-D tensor with shape: `(batch_size, ..., units)`. For instance, for a 2D
+      input with shape `(batch_size, input_dim)`, the output would have shape
+      `(batch_size, units)`.
+  """
+
+  def __init__(self,
+               output_shape,
+               num_summed_dimensions=1,
+               activation=None,
+               use_bias=True,
+               kernel_initializer="glorot_uniform",
+               bias_initializer="zeros",
+               kernel_regularizer=None,
+               bias_regularizer=None,
+               activity_regularizer=None,
+               kernel_constraint=None,
+               bias_constraint=None,
+               **kwargs):
+    super(DenseEinsum, self).__init__(**kwargs)
+    self._output_shape = output_shape if isinstance(
+        output_shape, (list, tuple)) else (output_shape,)
+    self._activation = tf.keras.activations.get(activation)
+    self._use_bias = use_bias
+    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
+    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
+    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
+    self._bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
+    self._kernel_constraint = tf.keras.constraints.get(kernel_constraint)
+    self._bias_constraint = tf.keras.constraints.get(bias_constraint)
+    self._num_summed_dimensions = num_summed_dimensions
+    self._einsum_string = None
+
+  def _build_einsum_string(self, free_input_dims, bound_dims, output_dims):
+    input_str = ""
+    kernel_str = ""
+    output_str = ""
+    letter_offset = 0
+    for i in range(free_input_dims):
+      char = _CHR_IDX[i + letter_offset]
+      input_str += char
+      output_str += char
+
+    letter_offset += free_input_dims
+    for i in range(bound_dims):
+      char = _CHR_IDX[i + letter_offset]
+      input_str += char
+      kernel_str += char
+
+    letter_offset += bound_dims
+    for i in range(output_dims):
+      char = _CHR_IDX[i + letter_offset]
+      kernel_str += char
+      output_str += char
+
+    return input_str + "," + kernel_str + "->" + output_str
+
+  def build(self, input_shape):
+    input_shape = tf.TensorShape(input_shape)
+    input_rank = input_shape.rank
+    free_input_dims = input_rank - self._num_summed_dimensions
+    output_dims = len(self._output_shape)
+
+    self._einsum_string = self._build_einsum_string(free_input_dims,
+                                                    self._num_summed_dimensions,
+                                                    output_dims)
+
+    # This is only saved for testing purposes.
+    self._kernel_shape = (
+        input_shape[free_input_dims:].concatenate(self._output_shape))
+
+    self._kernel = self.add_weight(
+        "kernel",
+        shape=self._kernel_shape,
+        initializer=self._kernel_initializer,
+        regularizer=self._kernel_regularizer,
+        constraint=self._kernel_constraint,
+        dtype=self.dtype,
+        trainable=True)
+    if self._use_bias:
+      self._bias = self.add_weight(
+          "bias",
+          shape=self._output_shape,
+          initializer=self._bias_initializer,
+          regularizer=self._bias_regularizer,
+          constraint=self._bias_constraint,
+          dtype=self.dtype,
+          trainable=True)
+    else:
+      self._bias = None
+    super(DenseEinsum, self).build(input_shape)
+
+  def compute_output_shape(self, input_shape):
+    input_shape = tf.TensorShape(input_shape)
+    input_shape = input_shape.with_rank_at_least(self._num_summed_dimensions +
+                                                 1)
+    for i in range(self._num_summed_dimensions):
+      if tf.dimension_value(input_shape[-1 * i]) is None:
+        raise ValueError(
+            "The %s dimension of input_shape must be defined, but saw: %s" %
+            (-1 * i, input_shape))
+    return input_shape[:-1 * self._num_summed_dimensions].concatenate(
+        self._units)
+
+  def get_config(self):
+    config = {
+        "output_shape":
+            self._output_shape,
+        "activation":
+            tf.keras.activations.serialize(self._activation),
+        "use_bias":
+            self._use_bias,
+        "kernel_initializer":
+            tf.keras.initializers.serialize(self._kernel_initializer),
+        "bias_initializer":
+            tf.keras.initializers.serialize(self._bias_initializer),
+        "kernel_regularizer":
+            tf.keras.regularizers.serialize(self._kernel_regularizer),
+        "bias_regularizer":
+            tf.keras.regularizers.serialize(self._bias_regularizer),
+        "activity_regularizer":
+            tf.keras.regularizers.serialize(self._activity_regularizer),
+        "kernel_constraint":
+            tf.keras.constraints.serialize(self._kernel_constraint),
+        "bias_constraint":
+            tf.keras.constraints.serialize(self._bias_constraint)
+    }
+    base_config = super(DenseEinsum, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
+
+  def call(self, inputs):
+    ret = tf.einsum(self._einsum_string, inputs, self._kernel)
+    if self._use_bias:
+      ret += self._bias
+    if self._activation is not None:
+      ret = self._activation(ret)
+    return ret
--- a/official/nlp/modeling/layers/dense_einsum_test.py
+++ b/official/nlp/modeling/layers/dense_einsum_test.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Keras-based einsum layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.python.keras import keras_parameterized  # pylint: disable=g-direct-tensorflow-import
+from official.nlp.modeling.layers import dense_einsum
+
+
+# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
+# guarantees forward compatibility of this code for the V2 switchover.
+@keras_parameterized.run_all_keras_modes
+class DenseEinsumLayer(keras_parameterized.TestCase):
+
+  def test_3D_einsum_with_two_bound_dimensions(self):
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=(64,), num_summed_dimensions=2)
+    # Create a 4-dimensional input (the first dimension is implicit).
+    input_tensor = tf.keras.Input(shape=(None, 40, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(test_layer._einsum_string, "abcd,cde->abe")
+    self.assertEqual(test_layer._kernel_shape, (40, 80, 64))
+
+  def test_3D_einsum_with_one_bound_dimensions(self):
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=(64, 32), num_summed_dimensions=1)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(test_layer._einsum_string, "abc,cde->abde")
+    self.assertEqual(test_layer._kernel_shape, (80, 64, 32))
+
+  def test_2D_einsum_with_one_bound_dimensions(self):
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=(64,), num_summed_dimensions=1)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(test_layer._einsum_string, "abc,cd->abd")
+    self.assertEqual(test_layer._kernel_shape, (80, 64))
+
+  def test_bias_term_can_be_disabled(self):
+    # A layer created using the bias should have two weights.
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=64, num_summed_dimensions=1, use_bias=True)
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(2, len(test_layer.get_weights()))
+
+    # A layer created without the bias should have only one weight.
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=64, num_summed_dimensions=1, use_bias=False)
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(1, len(test_layer.get_weights()))
+
+  def test_activation(self):
+    # Create a model that does not use an activation.
+    no_activation_layer = dense_einsum.DenseEinsum(
+        output_shape=64, num_summed_dimensions=1, activation=None)
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    output_tensor = no_activation_layer(input_tensor)
+    no_activation_model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Create a model that uses a softmax activation.
+    activation_layer = dense_einsum.DenseEinsum(
+        output_shape=64, num_summed_dimensions=1, activation="softmax")
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    output_tensor = activation_layer(input_tensor)
+    activation_model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Make sure the models' weights are identical.
+    activation_model.set_weights(no_activation_model.get_weights())
+
+    # Predict using each model on the same input data. The output should be
+    # different, since one is using a softmax - even though the models' weights
+    # are the same.
+    input_values = 10 * np.random.random_sample((10, 4, 80))
+    non_activated_data = no_activation_model.predict(input_values)
+    activated_data = activation_model.predict(input_values)
+    self.assertNotAllClose(activated_data, non_activated_data)
+
+  def test_non_iterable_output_shape(self):
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=64, num_summed_dimensions=1)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    input_tensor = tf.keras.Input(shape=(None, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(test_layer._einsum_string, "abc,cd->abd")
+    self.assertEqual(test_layer._kernel_shape, (80, 64))
+
+  def test_with_explicit_initializer(self):
+    test_layer = dense_einsum.DenseEinsum(
+        output_shape=(64,),
+        num_summed_dimensions=2,
+        kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
+    # Create a 4-dimensional input (the first dimension is implicit).
+    input_tensor = tf.keras.Input(shape=(None, 40, 80))
+    _ = test_layer(input_tensor)
+    self.assertEqual(test_layer._einsum_string, "abcd,cde->abe")
+    self.assertEqual(test_layer._kernel_shape, (40, 80, 64))
+
+
+if __name__ == "__main__":
+  tf.test.main()
--- a/official/nlp/modeling/layers/masked_softmax.py
+++ b/official/nlp/modeling/layers/masked_softmax.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras-based softmax layer with optional masking."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import tensorflow as tf
+
+
+@tf.keras.utils.register_keras_serializable(package='Text')
+class MaskedSoftmax(tf.keras.layers.Layer):
+  """Performs a softmax with optional masking on a tensor.
+
+  Attributes:
+    mask_expansion_axes: Any axes that should be padded on the mask tensor.
+  """
+
+  def __init__(self, mask_expansion_axes=None, **kwargs):
+    self._mask_expansion_axes = mask_expansion_axes
+    super(MaskedSoftmax, self).__init__(**kwargs)
+
+  def call(self, inputs):
+    if isinstance(inputs, list) and len(inputs) == 2:
+      scores, mask = inputs
+    else:
+      scores, mask = (inputs, None)
+
+    if mask is not None:
+      if self._mask_expansion_axes is not None:
+        mask = tf.expand_dims(mask, axis=self._mask_expansion_axes)
+
+      # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
+      # masked positions, this operation will create a tensor which is 0.0 for
+      # positions we want to attend and -10000.0 for masked positions.
+      adder = (1.0 - tf.cast(mask, scores.dtype)) * -10000.0
+
+      # Since we are adding it to the raw scores before the softmax, this is
+      # effectively the same as removing these entirely.
+      scores += adder
+
+    return tf.nn.softmax(scores)
+
+  def get_config(self):
+    config = {'mask_expansion_axes': self._mask_expansion_axes}
+    base_config = super(MaskedSoftmax, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
--- a/official/nlp/modeling/layers/masked_softmax_test.py
+++ b/official/nlp/modeling/layers/masked_softmax_test.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Keras-based masked softmax layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.python.keras import keras_parameterized  # pylint: disable=g-direct-tensorflow-import
+from official.nlp.modeling.layers import masked_softmax
+
+
+# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
+# guarantees forward compatibility of this code for the V2 switchover.
+@keras_parameterized.run_all_keras_modes
+class MaskedSoftmaxLayerTest(keras_parameterized.TestCase):
+
+  def test_non_masked_softmax(self):
+    test_layer = masked_softmax.MaskedSoftmax()
+    input_tensor = tf.keras.Input(shape=(4, 8))
+    output = test_layer(input_tensor)
+    model = tf.keras.Model(input_tensor, output)
+
+    input_data = 10 * np.random.random_sample((3, 4, 8))
+    output_data = model.predict(input_data)
+    expected_data = tf.nn.softmax(input_data)
+    self.assertAllClose(expected_data, output_data)
+
+  def test_masked_softmax(self):
+    test_layer = masked_softmax.MaskedSoftmax()
+    input_tensor = tf.keras.Input(shape=(4, 8))
+    mask_tensor = tf.keras.Input(shape=(4, 8))
+    output = test_layer([input_tensor, mask_tensor])
+    model = tf.keras.Model([input_tensor, mask_tensor], output)
+
+    input_data = 10 * np.random.random_sample((3, 4, 8))
+    mask_data = np.random.randint(2, size=(3, 4, 8))
+
+    output_data = model.predict([input_data, mask_data])
+    expected_zeros = np.greater(mask_data, 0)
+    is_zeros = np.greater(output_data, 0)
+    self.assertAllEqual(expected_zeros, is_zeros)
+
+  def test_masked_softmax_with_none_mask(self):
+    test_layer = masked_softmax.MaskedSoftmax()
+    input_tensor = tf.keras.Input(shape=(4, 8))
+    output = test_layer([input_tensor, None])
+    model = tf.keras.Model(input_tensor, output)
+
+    input_data = 10 * np.random.random_sample((3, 4, 8))
+    output_data = model.predict(input_data)
+    expected_data = tf.nn.softmax(input_data)
+    self.assertAllClose(expected_data, output_data)
+
+  def test_softmax_with_axes_expansion(self):
+    test_layer = masked_softmax.MaskedSoftmax(mask_expansion_axes=[1])
+    input_tensor = tf.keras.Input(shape=(4, 8))
+    mask_tensor = tf.keras.Input(shape=(8))
+    output = test_layer([input_tensor, mask_tensor])
+    model = tf.keras.Model([input_tensor, mask_tensor], output)
+
+    input_data = 10 * np.random.random_sample((3, 4, 8))
+    mask_data = np.random.randint(2, size=(3, 8))
+
+    output_data = model.predict([input_data, mask_data])
+    expanded_mask = np.expand_dims(mask_data, axis=1) * np.ones_like(input_data)
+    expected_zeros = np.greater(expanded_mask, 0)
+    is_zeros = np.greater(output_data, 0)
+    self.assertAllEqual(expected_zeros, is_zeros)
+
+
+if __name__ == '__main__':
+  tf.test.main()
--- a/official/nlp/modeling/layers/on_device_embedding.py
+++ b/official/nlp/modeling/layers/on_device_embedding.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras-based one-hot embedding layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import tensorflow as tf
+
+from official.modeling import tf_utils
+
+
+@tf.keras.utils.register_keras_serializable(package="Text")
+class OnDeviceEmbedding(tf.keras.layers.Layer):
+  """Performs an embedding lookup suitable for accelerator devices.
+
+  This layer uses either tf.gather or tf.one_hot to translate integer indices to
+  float embeddings.
+
+  Attributes:
+    vocab_size: Number of elements in the vocabulary.
+    embedding_width: Output size of the embedding layer.
+    initializer: The initializer to use for the embedding weights. Defaults to
+      "glorot_uniform".
+    use_one_hot: Whether to use tf.one_hot over tf.gather for the embedding
+      lookup. Defaults to False (that is, using tf.gather). Setting this option
+      to True may improve performance, especially on small vocabulary sizes,
+      but will generally require more memory.
+  """
+
+  def __init__(self,
+               vocab_size,
+               embedding_width,
+               initializer="glorot_uniform",
+               use_one_hot=False,
+               **kwargs):
+    # We need to have a default dtype of float32, since the inputs (which Keras
+    # usually uses to infer the dtype) will always be int32.
+    if "dtype" not in kwargs:
+      kwargs["dtype"] = "float32"
+
+    super(OnDeviceEmbedding, self).__init__(**kwargs)
+    self._vocab_size = vocab_size
+    self._embedding_width = embedding_width
+    self._initializer = initializer
+    self._use_one_hot = use_one_hot
+
+  def get_config(self):
+    config = {
+        "vocab_size": self._vocab_size,
+        "embedding_width": self._embedding_width,
+        "initializer": self._initializer,
+        "use_one_hot": self._use_one_hot,
+    }
+    base_config = super(OnDeviceEmbedding, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
+
+  def build(self, input_shape):
+    self.embeddings = self.add_weight(
+        "embeddings",
+        shape=[self._vocab_size, self._embedding_width],
+        initializer=self._initializer)
+
+    super(OnDeviceEmbedding, self).build(input_shape)
+
+  def call(self, inputs):
+    input_shape = tf_utils.get_shape_list(inputs, expected_rank=2)
+    input_shape.append(self._embedding_width)
+    flat_inputs = tf.reshape(inputs, [-1])
+    if self._use_one_hot:
+      one_hot_data = tf.one_hot(
+          flat_inputs, depth=self._vocab_size, dtype=self._dtype)
+      embeddings = tf.matmul(one_hot_data, self.embeddings)
+    else:
+      embeddings = tf.gather(self.embeddings, flat_inputs)
+    embeddings = tf.reshape(embeddings, input_shape)
+
+    return embeddings
--- a/official/nlp/modeling/layers/on_device_embedding_test.py
+++ b/official/nlp/modeling/layers/on_device_embedding_test.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Keras-based one-hot embedding layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.python.keras import keras_parameterized  # pylint: disable=g-direct-tensorflow-import
+from official.nlp.modeling.layers import on_device_embedding
+
+
+# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
+# guarantees forward compatibility of this code for the V2 switchover.
+@keras_parameterized.run_all_keras_modes
+class OnDeviceEmbeddingTest(keras_parameterized.TestCase):
+
+  def test_layer_creation(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size, embedding_width=embedding_width)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # The output should be the same as the input, save that it has an extra
+    # embedding_width dimension on the end.
+    expected_output_shape = [None, sequence_length, embedding_width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+    self.assertEqual(output_tensor.dtype, tf.float32)
+
+  def test_layer_creation_with_float16_dtype(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size, embedding_width=embedding_width, dtype="float16")
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # The output should be the same as the input, save that it has an extra
+    # embedding_width dimension on the end.
+    expected_output_shape = [None, sequence_length, embedding_width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+    self.assertEqual(output_tensor.dtype, tf.float16)
+
+  def test_layer_invocation(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size, embedding_width=embedding_width)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # Create a model from the test layer.
+    model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 3
+    input_data = np.random.randint(
+        vocab_size, size=(batch_size, sequence_length))
+    output = model.predict(input_data)
+    self.assertEqual(tf.float32, output.dtype)
+
+  def test_layer_invocation_with_float16_dtype(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size, embedding_width=embedding_width, dtype="float16")
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # Create a model from the test layer.
+    model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 3
+    input_data = np.random.randint(
+        vocab_size, size=(batch_size, sequence_length))
+    output = model.predict(input_data)
+    self.assertEqual(tf.float16, output.dtype)
+
+  def test_one_hot_layer_creation(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size,
+        embedding_width=embedding_width,
+        use_one_hot=True)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # The output should be the same as the input, save that it has an extra
+    # embedding_width dimension on the end.
+    expected_output_shape = [None, sequence_length, embedding_width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+    self.assertEqual(output_tensor.dtype, tf.float32)
+
+  def test_one_hot_layer_creation_with_float16_dtype(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size,
+        embedding_width=embedding_width,
+        dtype="float16",
+        use_one_hot=True)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # The output should be the same as the input, save that it has an extra
+    # embedding_width dimension on the end.
+    expected_output_shape = [None, sequence_length, embedding_width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+    self.assertEqual(output_tensor.dtype, tf.float16)
+
+  def test_one_hot_layer_invocation(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size,
+        embedding_width=embedding_width,
+        use_one_hot=True)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # Create a model from the test layer.
+    model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 3
+    input_data = np.random.randint(
+        vocab_size, size=(batch_size, sequence_length))
+    output = model.predict(input_data)
+    self.assertEqual(tf.float32, output.dtype)
+
+  def test_one_hot_layer_invocation_with_float16_dtype(self):
+    vocab_size = 31
+    embedding_width = 27
+    test_layer = on_device_embedding.OnDeviceEmbedding(
+        vocab_size=vocab_size,
+        embedding_width=embedding_width,
+        dtype="float16",
+        use_one_hot=True)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    sequence_length = 23
+    input_tensor = tf.keras.Input(shape=(sequence_length), dtype=tf.int32)
+    output_tensor = test_layer(input_tensor)
+
+    # Create a model from the test layer.
+    model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 3
+    input_data = np.random.randint(
+        vocab_size, size=(batch_size, sequence_length))
+    output = model.predict(input_data)
+    self.assertEqual(tf.float16, output.dtype)
+
+
+if __name__ == "__main__":
+  tf.test.main()
--- a/official/nlp/modeling/layers/position_embedding.py
+++ b/official/nlp/modeling/layers/position_embedding.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras-based positional embedding layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import tensorflow as tf
+
+from official.modeling import tf_utils
+
+
+@tf.keras.utils.register_keras_serializable(package="Text")
+class PositionEmbedding(tf.keras.layers.Layer):
+  """Creates a positional embedding.
+
+  This layer creates a positional embedding as described in "BERT: Pre-training
+  of Deep Bidirectional Transformers for Language Understanding"
+  (https://arxiv.org/abs/1810.04805).
+
+  This layer can be set up to either create a statically shaped slice or a
+  dynamically shaped slice. If `use_dynamic_slicing` is True, the input tensor
+  can have a dynamic 1st dimension, while if `use_dynamic_slicing` is False the
+  input size must be fixed.
+
+  Attributes:
+    use_dynamic_slicing: Whether to use the dynamic slicing path.
+    max_sequence_length: The maximum size of the dynamic sequence. Only
+      applicable if `use_dynamic_slicing` is True.
+    initializer: The initializer to use for the embedding weights. Defaults to
+      "glorot_uniform".
+  """
+
+  def __init__(self,
+               initializer="glorot_uniform",
+               use_dynamic_slicing=False,
+               max_sequence_length=None,
+               **kwargs):
+    # We need to have a default dtype of float32, since the inputs (which Keras
+    # usually uses to infer the dtype) will always be int32.
+    if "dtype" not in kwargs:
+      kwargs["dtype"] = "float32"
+
+    super(PositionEmbedding, self).__init__(**kwargs)
+    if use_dynamic_slicing and max_sequence_length is None:
+      raise ValueError(
+          "If `use_dynamic_slicing` is True, `max_sequence_length` must be set."
+      )
+    self._max_sequence_length = max_sequence_length
+    self._initializer = tf.keras.initializers.get(initializer)
+    self._use_dynamic_slicing = use_dynamic_slicing
+
+  def get_config(self):
+    config = {
+        "max_sequence_length": self._max_sequence_length,
+        "initializer": tf.keras.initializers.serialize(self._initializer),
+        "use_dynamic_slicing": self._use_dynamic_slicing,
+    }
+    base_config = super(PositionEmbedding, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
+
+  def build(self, input_shape):
+    """Implements build() for the layer."""
+    dimension_list = input_shape.as_list()
+
+    if len(dimension_list) != 3:
+      raise ValueError("PositionEmbedding expects a 3-dimensional input tensor "
+                       "of shape [batch, sequence, width]")
+    seq_length = dimension_list[1]
+    width = dimension_list[2]
+
+    # If we are not using dynamic slicing, we must assume that the sequence
+    # length is fixed and max_sequence_length should not be specified.
+    if not self._use_dynamic_slicing:
+      if seq_length is None:
+        raise ValueError(
+            "PositionEmbedding must have `use_dynamic_slicing` set "
+            "to True (and max_sequence_length set) when the "
+            "sequence (1st) dimension of the input is None.")
+      if self._max_sequence_length is not None:
+        raise ValueError(
+            "When `use_dynamic_slicing` is False, max_sequence_length should "
+            "not be specified and we ought to use seq_length to get the "
+            "variable shape.")
+
+    if self._max_sequence_length is not None:
+      weight_sequence_length = self._max_sequence_length
+    else:
+      weight_sequence_length = seq_length
+
+    self._position_embeddings = self.add_weight(
+        "embeddings",
+        shape=[weight_sequence_length, width],
+        initializer=self._initializer)
+
+    super(PositionEmbedding, self).build(input_shape)
+
+  def call(self, inputs):
+    """Implements call() for the layer."""
+    if self._use_dynamic_slicing:
+      input_shape = tf_utils.get_shape_list(inputs, expected_rank=3)
+      seq_length = input_shape[1]
+      width = input_shape[2]
+
+      position_embeddings = tf.expand_dims(
+          tf.slice(self._position_embeddings, [0, 0], [seq_length, width]),
+          axis=0)
+    else:
+      position_embeddings = tf.expand_dims(self._position_embeddings, axis=0)
+
+    return position_embeddings
--- a/official/nlp/modeling/layers/position_embedding_test.py
+++ b/official/nlp/modeling/layers/position_embedding_test.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Keras-based positional embedding layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.python.keras import keras_parameterized  # pylint: disable=g-direct-tensorflow-import
+from official.nlp.modeling.layers import position_embedding
+
+
+# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
+# guarantees forward compatibility of this code for the V2 switchover.
+@keras_parameterized.run_all_keras_modes
+class PositionEmbeddingLayerTest(keras_parameterized.TestCase):
+
+  def test_static_layer_output_shape(self):
+    test_layer = position_embedding.PositionEmbedding()
+    # Create a 3-dimensional input (the first dimension is implicit).
+    sequence_length = 21
+    width = 30
+    input_tensor = tf.keras.Input(shape=(sequence_length, width))
+    output_tensor = test_layer(input_tensor)
+
+    # When using static positional embedding shapes, the output is expected
+    # to be the same as the input shape in all dimensions save batch.
+    expected_output_shape = [1, sequence_length, width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+    # The default output dtype for this layer should be tf.float32.
+    self.assertEqual(tf.float32, output_tensor.dtype)
+
+  def test_float16_dtype(self):
+    test_layer = position_embedding.PositionEmbedding(dtype="float16")
+    # Create a 3-dimensional input (the first dimension is implicit).
+    sequence_length = 21
+    width = 30
+    input_tensor = tf.keras.Input(shape=(sequence_length, width))
+    output_tensor = test_layer(input_tensor)
+
+    # When using static positional embedding shapes, the output is expected
+    # to be the same as the input shape in all dimensions save batch.
+    expected_output_shape = [1, sequence_length, width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+    # The default output dtype for this layer should be tf.float32.
+    self.assertEqual(tf.float16, output_tensor.dtype)
+
+  def test_dynamic_layer_output_shape(self):
+    max_sequence_length = 40
+    test_layer = position_embedding.PositionEmbedding(
+        use_dynamic_slicing=True, max_sequence_length=max_sequence_length)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    width = 30
+    input_tensor = tf.keras.Input(shape=(None, width))
+    output_tensor = test_layer(input_tensor)
+
+    # When using dynamic positional embedding shapes, the output is expected
+    # to be the same as the input shape in all dimensions - but may be None if
+    # the input shape is None there.
+    expected_output_shape = [1, None, width]
+    self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
+
+  def test_dynamic_layer_slicing(self):
+    max_sequence_length = 40
+    test_layer = position_embedding.PositionEmbedding(
+        use_dynamic_slicing=True, max_sequence_length=max_sequence_length)
+    # Create a 3-dimensional input (the first dimension is implicit).
+    width = 30
+    input_tensor = tf.keras.Input(shape=(None, width))
+    output_tensor = test_layer(input_tensor)
+
+    model = tf.keras.Model(input_tensor, output_tensor)
+
+    # Create input data that is shorter than max_sequence_length, which should
+    # trigger a down-slice.
+    input_length = 17
+    # Note: This test explicitly uses a batch size of 1. This is to get around
+    # Keras' restriction on Model invocations: inputs are expected to have the
+    # same batch cardinality as outputs. In practice, this layer should be used
+    # inside a model, where it can be projected when added to another tensor.
+    input_data = np.ones((1, input_length, width))
+    output_data = model.predict(input_data)
+
+    self.assertAllEqual([1, input_length, width], output_data.shape)
+
+
+if __name__ == "__main__":
+  tf.test.main()
--- a/official/nlp/modeling/layers/transformer.py
+++ b/official/nlp/modeling/layers/transformer.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Keras-based transformer block layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import tensorflow as tf
+
+from official.nlp.modeling.layers import attention
+from official.nlp.modeling.layers import dense_einsum
+
+
+@tf.keras.utils.register_keras_serializable(package="Text")
+class Transformer(tf.keras.layers.Layer):
+  """Transformer layer.
+
+  This layer implements the Transformer from "Attention Is All You Need".
+  (https://arxiv.org/abs/1706.03762).
+
+  Attributes:
+    num_attention_heads: Number of attention heads.
+    intermediate_size: Size of the intermediate layer.
+    intermediate_activation: Activation for the intermediate layer.
+    dropout_rate: Dropout probability for the post-attention and output dropout.
+    attention_dropout_rate: Dropout probability for within the attention layer.
+    kernel_initializer: Initializer for dense layer kernels.
+    bias_initializer: Initializer for dense layer biases.
+    kernel_regularizer: Regularizer for dense layer kernels.
+    bias_regularizer: Regularizer for dense layer biases.
+    activity_regularizer: Regularizer for dense layer activity.
+    kernel_constraint: Constraint for dense layer kernels.
+    bias_constraint: Constraint for dense layer kernels.
+  """
+
+  def __init__(self,
+               num_attention_heads,
+               intermediate_size,
+               intermediate_activation,
+               dropout_rate=0.0,
+               attention_dropout_rate=0.0,
+               kernel_initializer="glorot_uniform",
+               bias_initializer="zeros",
+               kernel_regularizer=None,
+               bias_regularizer=None,
+               activity_regularizer=None,
+               kernel_constraint=None,
+               bias_constraint=None,
+               **kwargs):
+    super(Transformer, self).__init__(**kwargs)
+
+    self._num_heads = num_attention_heads
+    self._intermediate_size = intermediate_size
+    self._intermediate_activation = intermediate_activation
+    self._attention_dropout_rate = attention_dropout_rate
+    self._dropout_rate = dropout_rate
+    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
+    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
+    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
+    self._bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
+    self._kernel_constraint = tf.keras.constraints.get(kernel_constraint)
+    self._bias_constraint = tf.keras.constraints.get(bias_constraint)
+
+  def build(self, input_shape):
+    input_tensor = input_shape[0] if len(input_shape) == 2 else input_shape
+    input_tensor_shape = tf.TensorShape(input_tensor)
+    if len(input_tensor_shape) != 3:
+      raise ValueError("TransformerLayer expects a three-dimensional input of "
+                       "shape [batch, sequence, width].")
+    batch_size, sequence_length, hidden_size = input_tensor_shape
+
+    if len(input_shape) == 2:
+      mask_tensor_shape = tf.TensorShape(input_shape[1])
+      expected_mask_tensor_shape = tf.TensorShape(
+          [batch_size, sequence_length, sequence_length])
+      if not expected_mask_tensor_shape.is_compatible_with(mask_tensor_shape):
+        raise ValueError("When passing a mask tensor to TransformerLayer, the "
+                         "mask tensor must be of shape [batch, "
+                         "sequence_length, sequence_length] (here %s). Got a "
+                         "mask tensor of shape %s." %
+                         (expected_mask_tensor_shape, mask_tensor_shape))
+    if hidden_size % self._num_heads != 0:
+      raise ValueError(
+          "The input size (%d) is not a multiple of the number of attention "
+          "heads (%d)" % (hidden_size, self._num_heads))
+    self._attention_head_size = int(hidden_size // self._num_heads)
+
+    self._attention_layer = attention.Attention(
+        num_heads=self._num_heads,
+        head_size=self._attention_head_size,
+        dropout_rate=self._attention_dropout_rate,
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=self.dtype,
+        name="self_attention")
+    self._attention_output_dense = dense_einsum.DenseEinsum(
+        output_shape=hidden_size,
+        num_summed_dimensions=2,
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=self.dtype,
+        name="self_attention_output")
+    self._attention_dropout = tf.keras.layers.Dropout(rate=self._dropout_rate)
+    self._attention_layer_norm = (
+        tf.keras.layers.LayerNormalization(
+            name="self_attention_layer_norm", axis=-1, epsilon=1e-12))
+    self._intermediate_dense = dense_einsum.DenseEinsum(
+        output_shape=self._intermediate_size,
+        activation=self._intermediate_activation,
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=tf.float32,  # This layer is always float32 for numeric stability.
+        name="intermediate")
+    self._output_dense = dense_einsum.DenseEinsum(
+        output_shape=hidden_size,
+        kernel_initializer=self._kernel_initializer,
+        bias_initializer=self._bias_initializer,
+        kernel_regularizer=self._kernel_regularizer,
+        bias_regularizer=self._bias_regularizer,
+        activity_regularizer=self._activity_regularizer,
+        kernel_constraint=self._kernel_constraint,
+        bias_constraint=self._bias_constraint,
+        dtype=self.dtype,
+        name="output")
+    self._output_dropout = tf.keras.layers.Dropout(rate=self._dropout_rate)
+    self._output_layer_norm = tf.keras.layers.LayerNormalization(
+        name="output_layer_norm", axis=-1, epsilon=1e-12)
+
+    super(Transformer, self).build(input_shape)
+
+  def compute_output_shape(self, input_shape):
+    data_tensor_shape = tf.TensorShape(input_shape[0])
+    batch = data_tensor_shape[0]
+    sequence_length = data_tensor_shape[1]
+
+    return tf.TensorShape((batch, sequence_length, self._output_einsum_shape))
+
+  def get_config(self):
+    config = {
+        "num_attention_heads":
+            self._num_heads,
+        "intermediate_size":
+            self._intermediate_size,
+        "intermediate_activation":
+            self._intermediate_activation,
+        "dropout_rate":
+            self._dropout_rate,
+        "attention_dropout_rate":
+            self._attention_dropout_rate,
+        "kernel_initializer":
+            tf.keras.initializers.serialize(self._kernel_initializer),
+        "bias_initializer":
+            tf.keras.initializers.serialize(self._bias_initializer),
+        "kernel_regularizer":
+            tf.keras.regularizers.serialize(self._kernel_regularizer),
+        "bias_regularizer":
+            tf.keras.regularizers.serialize(self._bias_regularizer),
+        "activity_regularizer":
+            tf.keras.regularizers.serialize(self._activity_regularizer),
+        "kernel_constraint":
+            tf.keras.constraints.serialize(self._kernel_constraint),
+        "bias_constraint":
+            tf.keras.constraints.serialize(self._bias_constraint)
+    }
+    base_config = super(Transformer, self).get_config()
+    return dict(list(base_config.items()) + list(config.items()))
+
+  def call(self, inputs):
+    if isinstance(inputs, (list, tuple)) and len(inputs) == 2:
+      input_tensor, attention_mask = inputs
+    else:
+      input_tensor, attention_mask = (inputs, None)
+
+    attention_inputs = [input_tensor, input_tensor]
+
+    if attention_mask is not None:
+      attention_inputs.append(attention_mask)
+
+    attention_output = self._attention_layer(attention_inputs)
+    attention_output = self._attention_output_dense(attention_output)
+    attention_output = self._attention_dropout(attention_output)
+    # Use float32 in keras layer norm and the gelu activation in the
+    # intermediate dense layer for numeric stability
+    if self.dtype == tf.float16:
+      input_tensor = tf.cast(input_tensor, tf.float32)
+      attention_output = tf.cast(attention_output, tf.float32)
+    attention_output = self._attention_layer_norm(input_tensor +
+                                                  attention_output)
+    intermediate_output = self._intermediate_dense(attention_output)
+    if self.dtype == tf.float16:
+      intermediate_output = tf.cast(intermediate_output, tf.float16)
+    layer_output = self._output_dense(intermediate_output)
+    layer_output = self._output_dropout(layer_output)
+    # Use float32 in keras layer norm for numeric stability
+    if self.dtype == tf.float16:
+      layer_output = tf.cast(layer_output, tf.float32)
+    layer_output = self._output_layer_norm(layer_output + attention_output)
+    if self.dtype == tf.float16:
+      layer_output = tf.cast(layer_output, tf.float16)
+
+    return layer_output
--- a/official/nlp/modeling/layers/transformer_test.py
+++ b/official/nlp/modeling/layers/transformer_test.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Keras-based transformer block layer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import tensorflow as tf
+
+from tensorflow.python.keras import keras_parameterized  # pylint: disable=g-direct-tensorflow-import
+from official.nlp.modeling.layers import transformer
+
+
+# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
+# guarantees forward compatibility of this code for the V2 switchover.
+@keras_parameterized.run_all_keras_modes
+class TransformerLayerTest(keras_parameterized.TestCase):
+
+  def test_layer_creation(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu')
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(shape=(sequence_length, width))
+    output_tensor = test_layer(data_tensor)
+    # The default output of a transformer layer should be the same as the input.
+    self.assertEqual(data_tensor.shape.as_list(), output_tensor.shape.as_list())
+
+  def test_layer_creation_with_mask(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu')
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(shape=(sequence_length, width))
+    # Create a 2-dimensional input (the first dimension is implicit).
+    mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length))
+    output_tensor = test_layer([data_tensor, mask_tensor])
+    # The default output of a transformer layer should be the same as the input.
+    self.assertEqual(data_tensor.shape.as_list(), output_tensor.shape.as_list())
+
+  def test_layer_creation_with_incorrect_mask_fails(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu')
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(shape=(sequence_length, width))
+    # Create a 2-dimensional input (the first dimension is implicit).
+    mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length - 3))
+    with self.assertRaisesRegex(ValueError, 'When passing a mask tensor.*'):
+      _ = test_layer([data_tensor, mask_tensor])
+
+  def test_layer_invocation(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu')
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(shape=(sequence_length, width))
+    output_tensor = test_layer(data_tensor)
+
+    # Create a model from the test layer.
+    model = tf.keras.Model(data_tensor, output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 6
+    input_data = 10 * np.random.random_sample(
+        (batch_size, sequence_length, width))
+    _ = model.predict(input_data)
+
+  def test_layer_invocation_with_mask(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu')
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(shape=(sequence_length, width))
+    # Create a 2-dimensional input (the first dimension is implicit).
+    mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length))
+    output_tensor = test_layer([data_tensor, mask_tensor])
+
+    # Create a model from the test layer.
+    model = tf.keras.Model([data_tensor, mask_tensor], output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 6
+    input_data = 10 * np.random.random_sample(
+        (batch_size, sequence_length, width))
+    # The attention mask should be of shape (batch, from_seq_len, to_seq_len),
+    # which here is (batch, sequence_length, sequence_length)
+    mask_data = np.random.randint(
+        2, size=(batch_size, sequence_length, sequence_length))
+    _ = model.predict([input_data, mask_data])
+
+  def test_layer_invocation_with_float16_dtype(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu',
+        dtype='float16')
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(
+        shape=(sequence_length, width), dtype=tf.float16)
+    # Create a 2-dimensional input (the first dimension is implicit).
+    mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length))
+    output_tensor = test_layer([data_tensor, mask_tensor])
+
+    # Create a model from the test layer.
+    model = tf.keras.Model([data_tensor, mask_tensor], output_tensor)
+
+    # Invoke the model on test data. We can't validate the output data itself
+    # (the NN is too complex) but this will rule out structural runtime errors.
+    batch_size = 6
+    input_data = (10 * np.random.random_sample(
+        (batch_size, sequence_length, width))).astype(np.float16)
+    # The attention mask should be of shape (batch, from_seq_len, to_seq_len),
+    # which here is (batch, sequence_length, sequence_length)
+    mask_data = np.random.randint(
+        2, size=(batch_size, sequence_length, sequence_length))
+    _ = model.predict([input_data, mask_data])
+
+  def test_transform_with_initializer(self):
+    test_layer = transformer.Transformer(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu',
+        kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
+    sequence_length = 21
+    width = 80
+    # Create a 3-dimensional input (the first dimension is implicit).
+    data_tensor = tf.keras.Input(shape=(sequence_length, width))
+    output = test_layer(data_tensor)
+    # The default output of a transformer layer should be the same as the input.
+    self.assertEqual(data_tensor.shape.as_list(), output.shape.as_list())
+
+
+if __name__ == '__main__':
+  tf.test.main()
--- a/official/nlp/modeling/losses/__init__.py
+++ b/official/nlp/modeling/losses/__init__.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Activations package definition. Subject to change."""
+from official.nlp.modeling.losses.weighted_sparse_categorical_crossentropy import loss as weighted_sparse_categorical_crossentropy_loss
+from official.nlp.modeling.losses.weighted_sparse_categorical_crossentropy import per_example_loss as weighted_sparse_categorical_crossentropy_per_example_loss
--- a/official/nlp/modeling/losses/weighted_sparse_categorical_crossentropy.py
+++ b/official/nlp/modeling/losses/weighted_sparse_categorical_crossentropy.py
+# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Sparse categorical cross-entropy losses."""
+
+from __future__ import absolute_import
+from __future__ import division
+# from __future__ import google_type_annotations
+from __future__ import print_function
+
+import tensorflow as tf
+
+
+def _adjust_labels(labels, predictions):
+  """Adjust the 'labels' tensor by squeezing it if needed."""
+  labels = tf.cast(labels, tf.int32)
+  if len(predictions.shape) == len(labels.shape):
+    labels = tf.squeeze(labels, [-1])
+  return labels, predictions
+
+
+def _validate_rank(labels, predictions, weights):
+  if weights is not None and len(weights.shape) != len(labels.shape):
+    raise RuntimeError(
+        ("Weight and label tensors were not of the same rank. weights.shape "
+         "was %s, and labels.shape was %s.") %
+        (predictions.shape, labels.shape))
+  if (len(predictions.shape) - 1) != len(labels.shape):
+    raise RuntimeError(
+        ("Weighted sparse categorical crossentropy expects `labels` to have a "
+         "rank of one less than `predictions`. labels.shape was %s, and "
+         "predictions.shape was %s.") % (labels.shape, predictions.shape))
+
+
+def per_example_loss(labels, predictions, weights=None):
+  """Calculate a per-example sparse categorical crossentropy loss.
+
+  This loss function assumes that the predictions are post-softmax.
+  Args:
+    labels: The labels to evaluate against. Should be a set of integer indices
+      ranging from 0 to (vocab_size-1).
+    predictions: The network predictions. Should have softmax already applied.
+    weights: An optional weight array of the same shape as the 'labels' array.
+      If None, all examples will be used.
+
+  Returns:
+    A tensor of shape predictions.shape[:-1] containing the per-example
+      loss.
+  """
+  # When using these functions with the Keras core API, we will need to squeeze
+  # the labels tensor - Keras adds a spurious inner dimension.
+  labels, predictions = _adjust_labels(labels, predictions)
+  _validate_rank(labels, predictions, weights)
+
+  labels_one_hot = tf.keras.backend.one_hot(labels, predictions.shape[-1])
+  labels_one_hot = tf.keras.backend.cast(labels_one_hot, predictions.dtype)
+  per_example_loss_data = -tf.keras.backend.sum(
+      predictions * labels_one_hot, axis=[-1])
+  if weights is not None:
+    weights = tf.keras.backend.cast(weights, per_example_loss_data.dtype)
+    per_example_loss_data = weights * per_example_loss_data
+  return per_example_loss_data
+
+
+def loss(labels, predictions, weights=None):
+  """Calculate a per-batch sparse categorical crossentropy loss.
+
+  This loss function assumes that the predictions are post-softmax.
+  Args:
+    labels: The labels to evaluate against. Should be a set of integer indices
+      ranging from 0 to (vocab_size-1).
+    predictions: The network predictions. Should have softmax already applied.
+    weights: An optional weight array of the same shape as the 'labels' array.
+      If None, all examples will be used.
+
+  Returns:
+    A loss scalar.
+
+  Raises:
+    RuntimeError if the passed tensors do not have the same rank.
+  """
+  # When using these functions with the Keras core API, we will need to squeeze
+  # the labels tensor - Keras adds a spurious inner dimension.
+  labels, predictions = _adjust_labels(labels, predictions)
+  _validate_rank(labels, predictions, weights)
+
+  per_example_loss_data = per_example_loss(labels, predictions, weights)
+
+  if weights is None:
+    return tf.keras.backend.mean(per_example_loss_data)
+  else:
+    numerator = tf.keras.backend.sum(per_example_loss_data)
+    weights = tf.keras.backend.cast(weights, predictions.dtype)
+    denominator = tf.keras.backend.sum(weights) + 1e-5
+    return numerator / denominator