"torchvision/transforms/_functional_tensor.py" did not exist on "217e26fcb168aeea00a6341c2a478237f7e75808"
Commit ec0d7d0b authored by Frederick Liu, committed by A. Unique TensorFlower

[keras_nlp] Delete keras_nlp from official/nlp since we have merged all modules back to official/nlp/modeling

PiperOrigin-RevId: 403977545
parent b037ae20
# keras-nlp
## Layers
Layers are the fundamental building blocks for NLP models. They can be used to
assemble new layers, networks, or models.
* [TransformerEncoderBlock](layers/transformer_encoder_block.py) implements
an optionally masked transformer as described in
["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).
* [OnDeviceEmbedding](layers/on_device_embedding.py) implements efficient
embedding lookups designed for TPU-based models.
* [PositionEmbedding](layers/position_embedding.py) creates a positional
embedding as described in ["BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805).
* [SelfAttentionMask](layers/self_attention_mask.py) creates a 3D attention
mask from a 2D tensor mask.
* [MaskedLM](layers/masked_lm.py) implements a masked language model. It
assumes the embedding table variable is passed to it.
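As a quick illustration, here is a minimal sketch (not taken from this
package's documentation) wiring several of these layers together; the
hyperparameters are arbitrary, and the list-style `SelfAttentionMask` call is
an assumption based on how `BertEncoder` invokes it.

```python
# Illustrative only: a tiny encoder stack built from keras-nlp layers.
import tensorflow as tf

from official.nlp.keras_nlp import layers

seq_len, vocab_size, width = 16, 100, 32
word_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32)
mask = tf.keras.Input(shape=(seq_len,), dtype=tf.int32)

# Token embeddings plus learned position embeddings.
embeddings = layers.OnDeviceEmbedding(
    vocab_size=vocab_size, embedding_width=width)(word_ids)
embeddings = embeddings + layers.PositionEmbedding(
    max_length=seq_len)(embeddings)

# Expand the 2D padding mask into a 3D self-attention mask, then encode.
attention_mask = layers.SelfAttentionMask()([embeddings, mask])
outputs = layers.TransformerEncoderBlock(
    num_attention_heads=2, inner_dim=64,
    inner_activation='relu')([embeddings, attention_mask])
model = tf.keras.Model([word_ids, mask], outputs)
```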
## Encoders
Encoders are combinations of layers (and possibly other encoders). They are
sub-units of models that would not be trained alone. They encapsulate common
network structures, like a classification head or a transformer encoder, into
an easily handled object with a standardized configuration.
* [BertEncoder](encoders/bert_encoder.py) implements a bi-directional
Transformer-based encoder as described in
["BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding"](https://arxiv.org/abs/1810.04805). It includes the embedding
lookups, transformer layers and pooling layer.
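For orientation (an illustrative sketch mirroring this commit's unit tests,
not official documentation), a small `BertEncoder` called on
`[word_ids, mask, type_ids]` returns a dictionary of outputs:

```python
import tensorflow as tf

from official.nlp.keras_nlp import encoders

encoder = encoders.BertEncoder(
    vocab_size=100, hidden_size=32, num_attention_heads=2, num_layers=3)
word_ids = tf.keras.Input(shape=(16,), dtype=tf.int32)
mask = tf.keras.Input(shape=(16,), dtype=tf.int32)
type_ids = tf.keras.Input(shape=(16,), dtype=tf.int32)

outputs = encoder([word_ids, mask, type_ids])
sequence_output = outputs["sequence_output"]  # [batch, seq_len, hidden_size]
pooled_output = outputs["pooled_output"]      # [batch, hidden_size]
per_layer = outputs["encoder_outputs"]        # one tensor per transformer layer
```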
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras-NLP package definition."""
# pylint: disable=wildcard-import
from official.nlp.keras_nlp import encoders
from official.nlp.keras_nlp import layers
## Contributing to KerasNLP
Patches to KerasNLP are welcome!
The source-of-truth repository lives under
[TF Model Garden NLP](https://github.com/tensorflow/models/tree/master/official/nlp/keras_nlp),
and is mirrored as a read-only repository under
[keras-team/keras-nlp](https://github.com/keras-team/keras-nlp).
Contributions should be made as PRs to the TF Model Garden repository.
This is to ensure the codebase is rigorously tested with state-of-the-art
models on different accelerators.
In the long run, we will move development to the `keras-team/keras-nlp` repository.
## :heavy_check_mark: Contributor checklist
1. Ensure you have signed the [Contributor License Agreement](https://cla.developers.google.com/about/google-individual?csw=1).
* All code contributors are required to sign a Contributor License Agreement.
* Please read this [troubleshooting guide](Contributor-License-Agreements#troubleshooting-clas)
if you encounter an issue.
2. Please review the [contribution guidelines](https://github.com/tensorflow/models/wiki/How-to-contribute).
3. Check if your changes are consistent with the [TensorFlow coding style](https://www.tensorflow.org/community/contribute/code_style).
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras-NLP layers package definition."""
from official.nlp.keras_nlp.encoders.bert_encoder import BertEncoder
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Bert encoder network."""
# pylint: disable=g-classes-have-attributes
import tensorflow as tf
from official.nlp.modeling import networks
@tf.keras.utils.register_keras_serializable(package='keras_nlp')
class BertEncoder(networks.BertEncoder):
"""Deprecated."""
def __init__(self, *args, **kwargs):
if 'dict_outputs' in kwargs:
kwargs.pop('dict_outputs')
super().__init__(*args, dict_outputs=True, **kwargs)
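A hedged usage note (illustrative, not part of the commit): because the
wrapper pops `dict_outputs` and forwards `dict_outputs=True`, both of the
following constructions return dictionary outputs.

```python
# Illustrative: any caller-supplied `dict_outputs` is ignored.
enc_default = BertEncoder(vocab_size=100, hidden_size=32,
                          num_attention_heads=2, num_layers=3)
enc_forced = BertEncoder(vocab_size=100, hidden_size=32,
                         num_attention_heads=2, num_layers=3,
                         dict_outputs=False)  # popped, then forced to True
```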
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for transformer-based bert encoder network."""
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from tensorflow.python.keras import keras_parameterized # pylint: disable=g-direct-tensorflow-import
from official.nlp.keras_nlp.encoders import bert_encoder
# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
# guarantees forward compatibility of this code for the V2 switchover.
@keras_parameterized.run_all_keras_modes
class BertEncoderTest(keras_parameterized.TestCase):
def tearDown(self):
super(BertEncoderTest, self).tearDown()
tf.keras.mixed_precision.set_global_policy("float32")
def test_network_creation(self):
hidden_size = 32
sequence_length = 21
# Create a small BertEncoder for testing.
test_network = bert_encoder.BertEncoder(
vocab_size=100,
hidden_size=hidden_size,
num_attention_heads=2,
num_layers=3)
# Create the inputs (note that the first dimension is implicit).
word_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
mask = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
type_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
dict_outputs = test_network([word_ids, mask, type_ids])
data = dict_outputs["sequence_output"]
pooled = dict_outputs["pooled_output"]
self.assertIsInstance(test_network.transformer_layers, list)
self.assertLen(test_network.transformer_layers, 3)
self.assertIsInstance(test_network.pooler_layer, tf.keras.layers.Dense)
expected_data_shape = [None, sequence_length, hidden_size]
expected_pooled_shape = [None, hidden_size]
self.assertAllEqual(expected_data_shape, data.shape.as_list())
self.assertAllEqual(expected_pooled_shape, pooled.shape.as_list())
# The default output dtype is float32.
self.assertAllEqual(tf.float32, data.dtype)
self.assertAllEqual(tf.float32, pooled.dtype)
def test_all_encoder_outputs_network_creation(self):
hidden_size = 32
sequence_length = 21
# Create a small BertEncoder for testing.
test_network = bert_encoder.BertEncoder(
vocab_size=100,
hidden_size=hidden_size,
num_attention_heads=2,
num_layers=3)
# Create the inputs (note that the first dimension is implicit).
word_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
mask = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
type_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
dict_outputs = test_network([word_ids, mask, type_ids])
all_encoder_outputs = dict_outputs["encoder_outputs"]
pooled = dict_outputs["pooled_output"]
expected_data_shape = [None, sequence_length, hidden_size]
expected_pooled_shape = [None, hidden_size]
self.assertLen(all_encoder_outputs, 3)
for data in all_encoder_outputs:
self.assertAllEqual(expected_data_shape, data.shape.as_list())
self.assertAllEqual(expected_pooled_shape, pooled.shape.as_list())
# The default output dtype is float32.
self.assertAllEqual(tf.float32, all_encoder_outputs[-1].dtype)
self.assertAllEqual(tf.float32, pooled.dtype)
def test_network_creation_with_float16_dtype(self):
hidden_size = 32
sequence_length = 21
tf.keras.mixed_precision.set_global_policy("mixed_float16")
# Create a small BertEncoder for testing.
test_network = bert_encoder.BertEncoder(
vocab_size=100,
hidden_size=hidden_size,
num_attention_heads=2,
num_layers=3)
# Create the inputs (note that the first dimension is implicit).
word_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
mask = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
type_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
dict_outputs = test_network([word_ids, mask, type_ids])
data = dict_outputs["sequence_output"]
pooled = dict_outputs["pooled_output"]
expected_data_shape = [None, sequence_length, hidden_size]
expected_pooled_shape = [None, hidden_size]
self.assertAllEqual(expected_data_shape, data.shape.as_list())
self.assertAllEqual(expected_pooled_shape, pooled.shape.as_list())
    # With the mixed_float16 policy, the data output is float32 (it comes from
    # a final layer norm) and the pooled output should be float16.
self.assertAllEqual(tf.float32, data.dtype)
self.assertAllEqual(tf.float16, pooled.dtype)
@parameterized.named_parameters(
("all_sequence", None, 21),
("output_range", 1, 1),
)
def test_network_invocation(self, output_range, out_seq_len):
hidden_size = 32
sequence_length = 21
vocab_size = 57
num_types = 7
# Create a small BertEncoder for testing.
test_network = bert_encoder.BertEncoder(
vocab_size=vocab_size,
hidden_size=hidden_size,
num_attention_heads=2,
num_layers=3,
type_vocab_size=num_types,
output_range=output_range)
# Create the inputs (note that the first dimension is implicit).
word_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
mask = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
type_ids = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
dict_outputs = test_network([word_ids, mask, type_ids])
data = dict_outputs["sequence_output"]
pooled = dict_outputs["pooled_output"]
# Create a model based off of this network:
model = tf.keras.Model([word_ids, mask, type_ids], [data, pooled])
# Invoke the model. We can't validate the output data here (the model is too
# complex) but this will catch structural runtime errors.
batch_size = 3
word_id_data = np.random.randint(
vocab_size, size=(batch_size, sequence_length))
mask_data = np.random.randint(2, size=(batch_size, sequence_length))
type_id_data = np.random.randint(
num_types, size=(batch_size, sequence_length))
outputs = model.predict([word_id_data, mask_data, type_id_data])
self.assertEqual(outputs[0].shape[1], out_seq_len)
# Creates a BertEncoder with max_sequence_length != sequence_length
max_sequence_length = 128
test_network = bert_encoder.BertEncoder(
vocab_size=vocab_size,
hidden_size=hidden_size,
max_sequence_length=max_sequence_length,
num_attention_heads=2,
num_layers=3,
type_vocab_size=num_types)
dict_outputs = test_network([word_ids, mask, type_ids])
data = dict_outputs["sequence_output"]
pooled = dict_outputs["pooled_output"]
model = tf.keras.Model([word_ids, mask, type_ids], [data, pooled])
outputs = model.predict([word_id_data, mask_data, type_id_data])
self.assertEqual(outputs[0].shape[1], sequence_length)
# Creates a BertEncoder with embedding_width != hidden_size
test_network = bert_encoder.BertEncoder(
vocab_size=vocab_size,
hidden_size=hidden_size,
max_sequence_length=max_sequence_length,
num_attention_heads=2,
num_layers=3,
type_vocab_size=num_types,
embedding_width=16)
dict_outputs = test_network([word_ids, mask, type_ids])
data = dict_outputs["sequence_output"]
pooled = dict_outputs["pooled_output"]
model = tf.keras.Model([word_ids, mask, type_ids], [data, pooled])
outputs = model.predict([word_id_data, mask_data, type_id_data])
self.assertEqual(outputs[0].shape[-1], hidden_size)
self.assertTrue(hasattr(test_network, "_embedding_projection"))
def test_serialize_deserialize(self):
# Create a network object that sets all of its config options.
kwargs = dict(
vocab_size=100,
hidden_size=32,
num_layers=3,
num_attention_heads=2,
max_sequence_length=21,
type_vocab_size=12,
inner_dim=1223,
inner_activation="relu",
output_dropout=0.05,
attention_dropout=0.22,
initializer="glorot_uniform",
output_range=-1,
embedding_width=16,
embedding_layer=None,
norm_first=False)
network = bert_encoder.BertEncoder(**kwargs)
expected_config = dict(kwargs)
expected_config["inner_activation"] = tf.keras.activations.serialize(
tf.keras.activations.get(expected_config["inner_activation"]))
expected_config["initializer"] = tf.keras.initializers.serialize(
tf.keras.initializers.get(expected_config["initializer"]))
# Validate that the config can be forced to JSON.
_ = network.to_json()
# Tests model saving/loading.
model_path = self.get_temp_dir() + "/model"
network.save(model_path)
_ = tf.keras.models.load_model(model_path)
if __name__ == "__main__":
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras-NLP layers package definition."""
from official.nlp.keras_nlp.layers.masked_lm import MaskedLM
from official.nlp.keras_nlp.layers.on_device_embedding import OnDeviceEmbedding
from official.nlp.keras_nlp.layers.position_embedding import PositionEmbedding
from official.nlp.keras_nlp.layers.self_attention_mask import SelfAttentionMask
from official.nlp.keras_nlp.layers.transformer_encoder_block import TransformerEncoderBlock
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Masked language model network."""
from official.nlp.modeling import layers
MaskedLM = layers.MaskedLM
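A hedged sketch of how this alias is typically wired (the `embedding_table=`
constructor argument and the `(sequence_output, masked_positions)` call
signature are assumptions based on how the modeling library uses `MaskedLM`):

```python
import tensorflow as tf

from official.nlp.keras_nlp import encoders
from official.nlp.keras_nlp import layers

encoder = encoders.BertEncoder(
    vocab_size=100, hidden_size=32, num_attention_heads=2, num_layers=3)
masked_lm = layers.MaskedLM(embedding_table=encoder.get_embedding_table())

word_ids = tf.keras.Input(shape=(16,), dtype=tf.int32)
mask = tf.keras.Input(shape=(16,), dtype=tf.int32)
type_ids = tf.keras.Input(shape=(16,), dtype=tf.int32)
masked_positions = tf.keras.Input(shape=(4,), dtype=tf.int32)

sequence_output = encoder([word_ids, mask, type_ids])["sequence_output"]
logits = masked_lm(sequence_output, masked_positions)  # [batch, 4, vocab_size]
```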
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras-based one-hot embedding layer."""
from official.nlp.modeling import layers
OnDeviceEmbedding = layers.OnDeviceEmbedding
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for Keras-based one-hot embedding layer."""
import numpy as np
import tensorflow as tf
from tensorflow.python.keras import keras_parameterized # pylint: disable=g-direct-tensorflow-import
from official.nlp.keras_nlp.layers import on_device_embedding
# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
# guarantees forward compatibility of this code for the V2 switchover.
@keras_parameterized.run_all_keras_modes
class OnDeviceEmbeddingTest(keras_parameterized.TestCase):
def test_layer_creation(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size, embedding_width=embedding_width)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# The output should be the same as the input, save that it has an extra
# embedding_width dimension on the end.
expected_output_shape = [None, sequence_length, embedding_width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
self.assertEqual(output_tensor.dtype, tf.float32)
def test_layer_creation_with_mixed_precision(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size, embedding_width=embedding_width,
dtype="mixed_float16")
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# The output should be the same as the input, save that it has an extra
# embedding_width dimension on the end.
expected_output_shape = [None, sequence_length, embedding_width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
self.assertEqual(output_tensor.dtype, tf.float16)
def test_layer_invocation(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size, embedding_width=embedding_width)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# Create a model from the test layer.
model = tf.keras.Model(input_tensor, output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 3
input_data = np.random.randint(
vocab_size, size=(batch_size, sequence_length))
output = model.predict(input_data)
self.assertEqual(tf.float32, output.dtype)
def test_layer_invocation_with_mixed_precision(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size, embedding_width=embedding_width,
dtype="mixed_float16")
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# Create a model from the test layer.
model = tf.keras.Model(input_tensor, output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 3
input_data = np.random.randint(
vocab_size, size=(batch_size, sequence_length))
output = model.predict(input_data)
self.assertEqual(tf.float16, output.dtype)
def test_one_hot_layer_creation(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size,
embedding_width=embedding_width,
use_one_hot=True)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# The output should be the same as the input, save that it has an extra
# embedding_width dimension on the end.
expected_output_shape = [None, sequence_length, embedding_width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
self.assertEqual(output_tensor.dtype, tf.float32)
def test_one_hot_layer_creation_with_mixed_precision(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size,
embedding_width=embedding_width,
dtype="mixed_float16",
use_one_hot=True)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# The output should be the same as the input, save that it has an extra
# embedding_width dimension on the end.
expected_output_shape = [None, sequence_length, embedding_width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
self.assertEqual(output_tensor.dtype, tf.float16)
def test_one_hot_layer_invocation(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size,
embedding_width=embedding_width,
use_one_hot=True)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# Create a model from the test layer.
model = tf.keras.Model(input_tensor, output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 3
input_data = np.random.randint(
vocab_size, size=(batch_size, sequence_length))
output = model.predict(input_data)
self.assertEqual(tf.float32, output.dtype)
def test_one_hot_layer_invocation_with_mixed_precision(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size,
embedding_width=embedding_width,
dtype="mixed_float16",
use_one_hot=True)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# Create a model from the test layer.
model = tf.keras.Model(input_tensor, output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 3
input_data = np.random.randint(
vocab_size, size=(batch_size, sequence_length))
output = model.predict(input_data)
self.assertEqual(tf.float16, output.dtype)
def test_use_scale_layer_invocation(self):
vocab_size = 31
embedding_width = 27
test_layer = on_device_embedding.OnDeviceEmbedding(
vocab_size=vocab_size, embedding_width=embedding_width,
scale_factor=embedding_width**0.5)
# Create a 2-dimensional input (the first dimension is implicit).
sequence_length = 23
    input_tensor = tf.keras.Input(shape=(sequence_length,), dtype=tf.int32)
output_tensor = test_layer(input_tensor)
# Create a model from the test layer.
model = tf.keras.Model(input_tensor, output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 3
input_data = np.random.randint(
vocab_size, size=(batch_size, sequence_length))
output = model.predict(input_data)
self.assertEqual(tf.float32, output.dtype)
if __name__ == "__main__":
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras-based positional embedding layer."""
from official.nlp.modeling import layers
PositionEmbedding = layers.PositionEmbedding
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for Keras-based positional embedding layer."""
import numpy as np
import tensorflow as tf
from tensorflow.python.keras import keras_parameterized # pylint: disable=g-direct-tensorflow-import
from official.nlp.keras_nlp.layers import position_embedding
# This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
# guarantees forward compatibility of this code for the V2 switchover.
@keras_parameterized.run_all_keras_modes
class PositionEmbeddingLayerTest(keras_parameterized.TestCase):
def test_static_layer_output_shape(self):
# Create a 3-dimensional input (the first dimension is implicit).
sequence_length = 21
test_layer = position_embedding.PositionEmbedding(
max_length=sequence_length)
width = 30
input_tensor = tf.keras.Input(shape=(sequence_length, width))
output_tensor = test_layer(input_tensor)
# When using static positional embedding shapes, the output is expected
# to be the same as the input shape in all dimensions save batch.
expected_output_shape = [None, sequence_length, width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
# The default output dtype for this layer should be tf.float32.
self.assertEqual(tf.float32, output_tensor.dtype)
def test_non_default_axis_static(self):
# Create a 3-dimensional input (the first dimension is implicit).
sequence_length = 21
test_layer = position_embedding.PositionEmbedding(
max_length=sequence_length, seq_axis=2)
width = 30
input_tensor = tf.keras.Input(shape=(width, sequence_length, width))
output_tensor = test_layer(input_tensor)
# When using static positional embedding shapes, the output is expected
# to be the same as the input shape in all dimensions save batch.
expected_output_shape = [None, width, sequence_length, width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
# The default output dtype for this layer should be tf.float32.
self.assertEqual(tf.float32, output_tensor.dtype)
def test_float16_dtype(self):
# Create a 3-dimensional input (the first dimension is implicit).
sequence_length = 21
test_layer = position_embedding.PositionEmbedding(
max_length=sequence_length, dtype="float16")
width = 30
input_tensor = tf.keras.Input(shape=(sequence_length, width))
output_tensor = test_layer(input_tensor)
# When using static positional embedding shapes, the output is expected
# to be the same as the input shape in all dimensions save batch.
expected_output_shape = [None, sequence_length, width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
    # With dtype="float16", the output dtype for this layer should be
    # tf.float16.
self.assertEqual(tf.float16, output_tensor.dtype)
def test_dynamic_layer_output_shape(self):
max_sequence_length = 40
test_layer = position_embedding.PositionEmbedding(
max_length=max_sequence_length)
# Create a 3-dimensional input (the first dimension is implicit).
width = 30
input_tensor = tf.keras.Input(shape=(None, width))
output_tensor = test_layer(input_tensor)
# When using dynamic positional embedding shapes, the output is expected
# to be the same as the input shape in all dimensions - but may be None if
# the input shape is None there.
expected_output_shape = [None, None, width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
def test_non_default_axis_dynamic(self):
max_sequence_length = 60
test_layer = position_embedding.PositionEmbedding(
max_length=max_sequence_length, seq_axis=2)
# Create a 3-dimensional input (the first dimension is implicit).
width = 30
input_tensor = tf.keras.Input(shape=(None, None, width))
output_tensor = test_layer(input_tensor)
# When using dynamic positional embedding shapes, the output is expected
# to be the same as the input shape in all dimensions - but may be None if
# the input shape is None there.
expected_output_shape = [None, None, None, width]
self.assertEqual(expected_output_shape, output_tensor.shape.as_list())
def test_dynamic_layer_slicing(self):
max_sequence_length = 40
test_layer = position_embedding.PositionEmbedding(
max_length=max_sequence_length)
# Create a 3-dimensional input (the first dimension is implicit).
width = 30
input_tensor = tf.keras.Input(shape=(None, width))
output_tensor = test_layer(input_tensor)
model = tf.keras.Model(input_tensor, output_tensor)
# Create input data that is shorter than max_sequence_length, which should
# trigger a down-slice.
input_length = 17
# Note: This test explicitly uses a batch size of 1. This is to get around
# Keras' restriction on Model invocations: inputs are expected to have the
# same batch cardinality as outputs. In practice, this layer should be used
# inside a model, where it can be projected when added to another tensor.
input_data = np.ones((1, input_length, width))
output_data = model.predict(input_data)
self.assertAllEqual([1, input_length, width], output_data.shape)
if __name__ == "__main__":
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras layer that creates a self-attention mask."""
from official.nlp.modeling import layers
SelfAttentionMask = layers.SelfAttentionMask
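A minimal sketch of the mask expansion (illustrative; the list-style call
matches how `BertEncoder` invokes the layer in this version of the code):

```python
import tensorflow as tf

from official.nlp.keras_nlp import layers

embeddings = tf.zeros([2, 4, 8])            # (batch, seq_len, width)
padding_mask = tf.constant([[1, 1, 1, 0],
                            [1, 1, 0, 0]])  # (batch, seq_len)
attention_mask = layers.SelfAttentionMask()([embeddings, padding_mask])
print(attention_mask.shape)                 # (2, 4, 4)
```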
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keras-based TransformerEncoder block layer."""
from official.nlp.modeling import layers
TransformerEncoderBlock = layers.TransformerEncoderBlock
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for Keras-based transformer block layer."""
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from tensorflow.python.keras import keras_parameterized # pylint: disable=g-direct-tensorflow-import
from official.nlp.keras_nlp.layers.transformer_encoder_block import TransformerEncoderBlock
@keras_parameterized.run_all_keras_modes
@parameterized.named_parameters(
('base', TransformerEncoderBlock))
class TransformerEncoderBlockLayerTest(keras_parameterized.TestCase):
def tearDown(self):
super(TransformerEncoderBlockLayerTest, self).tearDown()
tf.keras.mixed_precision.set_global_policy('float32')
def test_layer_creation(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048, inner_activation='relu')
sequence_length = 21
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(sequence_length, width))
output_tensor = test_layer(data_tensor)
    # The default output of a transformer layer should have the same shape as
    # the input.
self.assertEqual(data_tensor.shape.as_list(), output_tensor.shape.as_list())
def test_layer_creation_with_mask(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048, inner_activation='relu')
sequence_length = 21
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(sequence_length, width))
# Create a 2-dimensional input (the first dimension is implicit).
mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length))
output_tensor = test_layer([data_tensor, mask_tensor])
    # The default output of a transformer layer should have the same shape as
    # the input.
self.assertEqual(data_tensor.shape.as_list(), output_tensor.shape.as_list())
def test_layer_invocation(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048, inner_activation='relu')
sequence_length = 21
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(sequence_length, width))
output_tensor = test_layer(data_tensor)
# Create a model from the test layer.
model = tf.keras.Model(data_tensor, output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 6
input_data = 10 * np.random.random_sample(
(batch_size, sequence_length, width))
_ = model.predict(input_data)
def test_layer_invocation_with_mask(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048, inner_activation='relu')
sequence_length = 21
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(sequence_length, width))
# Create a 2-dimensional input (the first dimension is implicit).
mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length))
output_tensor = test_layer([data_tensor, mask_tensor])
# Create a model from the test layer.
model = tf.keras.Model([data_tensor, mask_tensor], output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 6
input_data = 10 * np.random.random_sample(
(batch_size, sequence_length, width))
# The attention mask should be of shape (batch, from_seq_len, to_seq_len),
# which here is (batch, sequence_length, sequence_length)
mask_data = np.random.randint(
2, size=(batch_size, sequence_length, sequence_length))
_ = model.predict([input_data, mask_data])
def test_layer_output_range(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048, inner_activation='relu')
sequence_length = 21
width = 80
batch_size = 6
input_data = 10 * np.random.random_sample(
(batch_size, sequence_length, width))
mask_data = np.random.randint(
2, size=(batch_size, sequence_length, sequence_length))
output_tensor = test_layer([input_data, mask_data])
# The layer only attends to the first token and outputs the first token
# embedding.
new_layer = transformer_cls(
num_attention_heads=10,
inner_dim=2048,
inner_activation='relu',
output_range=1)
_ = new_layer([input_data, mask_data])
new_layer.set_weights(test_layer.get_weights())
new_output_tensor = new_layer([input_data, mask_data])
self.assertAllClose(
new_output_tensor, output_tensor[:, 0:1, :], atol=5e-5, rtol=0.003)
def test_layer_output_range_without_mask(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048,
inner_activation='relu', norm_first=True)
sequence_length = 21
width = 80
batch_size = 6
input_data = 10 * np.random.random_sample(
(batch_size, sequence_length, width))
output_tensor = test_layer(input_data)
# The layer only attends to the first token and outputs the first token
# embedding.
new_layer = transformer_cls(
num_attention_heads=10,
inner_dim=2048,
inner_activation='relu',
output_range=1,
norm_first=True)
_ = new_layer(input_data)
new_layer.set_weights(test_layer.get_weights())
new_output_tensor = new_layer(input_data)
self.assertAllClose(
new_output_tensor, output_tensor[:, 0:1, :], atol=5e-5, rtol=0.003)
def test_layer_output_range_with_pre_norm(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048,
inner_activation='relu', norm_first=True)
sequence_length = 21
width = 80
batch_size = 6
input_data = 10 * np.random.random_sample(
(batch_size, sequence_length, width))
mask_data = np.random.randint(
2, size=(batch_size, sequence_length, sequence_length))
output_tensor = test_layer([input_data, mask_data])
# The layer only attends to the first token and outputs the first token
# embedding.
new_layer = transformer_cls(
num_attention_heads=10,
inner_dim=2048,
inner_activation='relu',
output_range=1,
norm_first=True)
_ = new_layer([input_data, mask_data])
new_layer.set_weights(test_layer.get_weights())
new_output_tensor = new_layer([input_data, mask_data])
self.assertAllClose(
new_output_tensor, output_tensor[:, 0:1, :], atol=5e-5, rtol=0.003)
def test_layer_invocation_with_float16_dtype(self, transformer_cls):
tf.keras.mixed_precision.set_global_policy('mixed_float16')
test_layer = transformer_cls(
num_attention_heads=10, inner_dim=2048, inner_activation='relu')
sequence_length = 21
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(sequence_length, width))
# Create a 2-dimensional input (the first dimension is implicit).
mask_tensor = tf.keras.Input(shape=(sequence_length, sequence_length))
output_tensor = test_layer([data_tensor, mask_tensor])
# Create a model from the test layer.
model = tf.keras.Model([data_tensor, mask_tensor], output_tensor)
# Invoke the model on test data. We can't validate the output data itself
# (the NN is too complex) but this will rule out structural runtime errors.
batch_size = 6
input_data = (10 * np.random.random_sample(
(batch_size, sequence_length, width)))
# The attention mask should be of shape (batch, from_seq_len, to_seq_len),
# which here is (batch, sequence_length, sequence_length)
mask_data = np.random.randint(
2, size=(batch_size, sequence_length, sequence_length))
_ = model.predict([input_data, mask_data])
def test_transform_with_initializer(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10,
inner_dim=2048,
inner_activation='relu',
kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
sequence_length = 21
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(sequence_length, width))
output = test_layer(data_tensor)
    # The default output of a transformer layer should have the same shape as
    # the input.
self.assertEqual(data_tensor.shape.as_list(), output.shape.as_list())
def test_dynamic_layer_sequence(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=10,
inner_dim=2048,
inner_activation='relu',
kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
# Create a 3-dimensional input (the first dimension is implicit).
width = 30
input_tensor = tf.keras.Input(shape=(None, width))
output_tensor = test_layer(input_tensor)
model = tf.keras.Model(input_tensor, output_tensor)
input_length = 17
input_data = np.ones((1, input_length, width))
output_data = model.predict(input_data)
self.assertAllEqual([1, input_length, width], output_data.shape)
def test_separate_qkv(self, transformer_cls):
test_layer = transformer_cls(
num_attention_heads=2,
inner_dim=128,
inner_activation='relu',
kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
# Forward path.
q_tensor = tf.zeros([2, 4, 16], dtype=tf.float32)
kv_tensor = tf.zeros([2, 8, 16], dtype=tf.float32)
dummy_mask = tf.zeros([2, 4, 8], dtype=tf.float32)
inputs = [q_tensor, kv_tensor, dummy_mask]
output = test_layer(inputs)
self.assertEqual(output.shape, q_tensor.shape)
@keras_parameterized.run_all_keras_modes
class TransformerArgumentTest(keras_parameterized.TestCase):
def test_use_bias_norm_first(self):
num_attention_heads = 2
hidden_size = 16
encoder_block = TransformerEncoderBlock(
num_attention_heads=num_attention_heads,
inner_dim=32,
inner_activation='relu',
output_dropout=0.1,
attention_dropout=0.1,
use_bias=False,
norm_first=True,
norm_epsilon=1e-6,
inner_dropout=0.1,
attention_initializer=tf.keras.initializers.RandomUniform(
minval=0., maxval=1.))
# Forward path.
dummy_tensor = tf.zeros([2, 4, 16], dtype=tf.float32)
dummy_mask = tf.zeros([2, 4, 4], dtype=tf.float32)
inputs = [dummy_tensor, dummy_mask]
output = encoder_block(inputs)
self.assertEqual(output.shape, (2, 4, hidden_size))
def test_get_config(self):
num_attention_heads = 2
encoder_block = TransformerEncoderBlock(
num_attention_heads=num_attention_heads,
inner_dim=32,
inner_activation='relu',
output_dropout=0.1,
attention_dropout=0.1,
use_bias=False,
norm_first=True,
norm_epsilon=1e-6,
inner_dropout=0.1,
attention_initializer=tf.keras.initializers.RandomUniform(
minval=0., maxval=1.))
encoder_block_config = encoder_block.get_config()
new_encoder_block = TransformerEncoderBlock.from_config(
encoder_block_config)
self.assertEqual(encoder_block_config, new_encoder_block.get_config())
@parameterized.parameters({'attention_axes': None}, {'attention_axes': [1]},
{'attention_axes': [2]}, {'attention_axes': [1, 2]})
def test_several_attention_axes(self, attention_axes):
test_layer = TransformerEncoderBlock(
inner_dim=32,
inner_activation='relu',
output_dropout=0.1,
attention_dropout=0.1,
use_bias=False,
norm_first=True,
norm_epsilon=1e-6,
inner_dropout=0.1,
num_attention_heads=10,
attention_axes=attention_axes)
num_rows = 21
num_cols = 13
width = 80
# Create a 3-dimensional input (the first dimension is implicit).
data_tensor = tf.keras.Input(shape=(num_rows, num_cols, width))
output_tensor = test_layer(data_tensor)
    # The default output of a transformer layer should have the same shape as
    # the input.
self.assertEqual(data_tensor.shape.as_list(), output_tensor.shape.as_list())
if __name__ == '__main__':
tf.test.main()
# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Setup script."""
import os
from setuptools import find_packages
from setuptools import setup
version = '0.0.1'
def _get_requirements():
"""Parses requirements.txt file."""
install_requires_tmp = []
dependency_links_tmp = []
  with open(
      os.path.join(os.path.dirname(__file__), 'requirements.txt'), 'r') as f:
for line in f:
package_name = line.strip()
      # Skip empty lines and comments starting with "#".
if not package_name or package_name[0] == '#':
continue
if package_name.startswith('-e '):
dependency_links_tmp.append(package_name[3:].strip())
else:
install_requires_tmp.append(package_name)
return install_requires_tmp, dependency_links_tmp
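# Illustration (hypothetical input, not shipped with this package): given a
# requirements.txt containing
#
#   numpy
#   # build tools
#   -e git+https://github.com/example/pkg.git
#
# _get_requirements() returns
#   (['numpy'], ['git+https://github.com/example/pkg.git']).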
install_requires, dependency_links = _get_requirements()
install_requires.append('tf-nightly')
setup(
name='keras-nlp',
version=version,
description='Keras Natural Language Processing Library',
url='https://github.com/keras-team/keras-nlp',
author='The Keras authors',
author_email='keras-team@google.com',
license='Apache License 2.0',
install_requires=install_requires,
classifiers=[
'Programming Language :: Python',
'Programming Language :: Python :: 3.6',
'Operating System :: Unix',
'Operating System :: Microsoft :: Windows',
'Operating System :: MacOS',
'Intended Audience :: Science/Research',
'Topic :: Scientific/Engineering',
'Topic :: Software Development'
],
packages=find_packages(exclude=('tests',)),
exclude_package_data={'': ['*_test.py',],},
dependency_links=dependency_links,
python_requires='>=3.6',
)