Introduce output_range for transformer layer and encoder last layer.

PiperOrigin-RevId: 310032518

Introduce output_range for transformer layer and encoder last layer.
PiperOrigin-RevId: 310032518
563d923a · Hongkun Yu · A. Unique TensorFlower · 9c1429cf · 563d923a · 563d923a
Commit 563d923a authored May 05, 2020 by Hongkun Yu Committed by A. Unique TensorFlower May 05, 2020
6 changed files
--- a/official/README.md
+++ b/official/README.md
@@ -6,68 +6,41 @@ The TensorFlow official models are a collection of models
 that use TensorFlow’s high-level APIs.
 They are intended to be well-maintained, tested, and kept up to date
 with the latest TensorFlow API.
-
 They should also be reasonably optimized for fast performance while still
 being easy to read.
 These models are used as end-to-end tests, ensuring that the models run
 with the same or improved speed and performance with each new TensorFlow build.

-## More models to come!
-
-The team is actively developing new models.
-In the near future, we will add:
-
-* State-of-the-art language understanding models:
-  More members in Transformer family
-* Start-of-the-art image classification models:
-  EfficientNet, MnasNet, and variants
-* A set of excellent objection detection models.
-
-## Table of Contents
+## Model Implementations

- [Models and Implementations](#models-and-implementations)
-  * [Computer Vision](#computer-vision)
-    + [Image Classification](#image-classification)
-    + [Object Detection and Segmentation](#object-detection-and-segmentation)
-  * [Natural Language Processing](#natural-language-processing)
-  * [Recommendation](#recommendation)
- [How to get started with the official models](#how-to-get-started-with-the-official-models)
+### Natural Language Processing

-## Models and Implementations
+| Model | Description | Reference |
+| ----- | ----------- | --------- |
+| [ALBERT](nlp/albert) | A Lite BERT for Self-supervised Learning of Language Representations | [arXiv:1909.11942](https://arxiv.org/abs/1909.11942) |
+| [BERT](nlp/bert) | A powerful pre-trained language representation model: BERT (Bidirectional Encoder Representations from Transformers) | [arXiv:1810.04805](https://arxiv.org/abs/1810.04805) |
+| [NHNet](nlp/nhnet) | A transformer-based multi-sequence to sequence model: Generating Representative Headlines for News Stories | [arXiv:2001.09386](https://arxiv.org/abs/2001.09386) |
+| [Transformer](nlp/transformer) | A transformer model to translate the WMT English to German dataset | [arXiv:1706.03762](https://arxiv.org/abs/1706.03762) |
+| [XLNet](nlp/xlnet) | XLNet: Generalized Autoregressive Pretraining for Language Understanding | [arXiv:1906.08237](https://arxiv.org/abs/1906.08237) |

 ### Computer Vision

-#### Image Classification
-
-| Model | Reference (Paper) |
-|-------|-------------------|
-| [MNIST](vision/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) |
-| [ResNet](vision/image_classification) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
+| Model | Description | Reference |
+| ----- | ----------- | --------- |
+| [MNIST](vision/image_classification) | A basic model to classify digits from the MNIST dataset | [Link](http://yann.lecun.com/exdb/mnist/) |
+| [ResNet](vision/image_classification) | A deep residual network for image recognition | [arXiv:1512.03385](https://arxiv.org/abs/1512.03385) |
+| [RetinaNet](vision/detection) | A fast and powerful object detector | [arXiv:1708.02002](https://arxiv.org/abs/1708.02002) |
+| [Mask R-CNN](vision/detection) | An object detection and instance segmentation model | [arXiv:1703.06870](https://arxiv.org/abs/1703.06870) |

-#### Object Detection and Segmentation
+### Other models

-| Model | Reference (Paper) |
-|-------|-------------------|
-| [RetinaNet](vision/detection) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
-| [Mask R-CNN](vision/detection) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
-
-### Natural Language Processing
+| Model | Description | Reference |
+| ----- | ----------- | --------- |
+| [NCF](recommendation) | Neural Collaborative Filtering model for recommendation tasks | [arXiv:1708.05031](https://arxiv.org/abs/1708.05031) |

-| Model | Reference (Paper) |
-|-------|-------------------|
-| [ALBERT (A Lite BERT)](nlp/albert) | [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) |
-| [BERT (Bidirectional Encoder Representations from Transformers)](nlp/bert) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
-| [NHNet (News Headline generation model)](nlp/nhnet) | [Generating Representative Headlines for News Stories](https://arxiv.org/abs/2001.09386) |
-| [Transformer](nlp/transformer) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
-| [XLNet](nlp/xlnet) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
+---

-### Recommendation
-
-| Model | Reference (Paper) |
-|-------|-------------------|
-| [NCF](recommendation) | [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031) |
-
-## How to get started with the official models
+## How to get started with the Model Garden official models

 * The models in the master branch are developed using TensorFlow 2,
 and they target the TensorFlow [nightly binaries](https://github.com/tensorflow/tensorflow#installation)
@@ -135,6 +108,44 @@ os.environ['PYTHONPATH'] += ":/path/to/models"
 pip3 install --user -r official/requirements.txt
 ```

+---
+
+## More models to come!
+
+The team is actively developing new models.
+In the near future, we will add:
+
+- State-of-the-art language understanding models:
+  More members in Transformer family
+- Start-of-the-art image classification models:
+  EfficientNet, MnasNet and variants.
+- A set of excellent objection detection models.
+
+If you would like to make any fixes or improvements to the models, please
+[submit a pull request](https://github.com/tensorflow/models/compare).
+
+---
+
 ## Contributions

-If you want to contribute, please review the [contribution guidelines](../../../wiki/How-to-contribute).
+Every model should follow our guidelines to uphold our objectives of readable,
+usable, and maintainable code.
+
+### General Guidelines
+
+- Code should be well documented and tested.
+- Runnable from a blank environment with ease.
+- Trainable on: single GPU/CPU (baseline), multiple GPUs & TPUs
+- Compatible with Python 3 (using [six](https://pythonhosted.org/six/)
+when being compatible with Python 2 is necessary)
+- Conform to
+  [Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md)
+
+### Implementation Guidelines
+
+These guidelines are to ensure consistent model implementations for
+better readability and maintainability.
+
+- Use [common utility functions](utils)
+- Export SavedModel at the end of the training.
+- Consistent flags and flag-parsing library ([read more here](utils/flags/guidelines.md))
--- a/official/nlp/bert/bert_models.py
+++ b/official/nlp/bert/bert_models.py
@@ -101,7 +101,8 @@ class BertPretrainLossAndMetricLayer(tf.keras.layers.Layer):
 @gin.configurable
 def get_transformer_encoder(bert_config,
                            sequence_length,
-                            transformer_encoder_cls=None):
+                            transformer_encoder_cls=None,
+                            output_range=None):
  """Gets a 'TransformerEncoder' object.

  Args:
@@ -109,6 +110,8 @@ def get_transformer_encoder(bert_config,
    sequence_length: Maximum sequence length of the training data.
    transformer_encoder_cls: A EncoderScaffold class. If it is None, uses the
      default BERT encoder implementation.
+    output_range: the sequence output range, [0, output_range). Default setting
+      is to return the entire sequence output.

  Returns:
    A networks.TransformerEncoder object.
@@ -161,6 +164,7 @@ def get_transformer_encoder(bert_config,
    return networks.AlbertTransformerEncoder(**kwargs)
  else:
    assert isinstance(bert_config, configs.BertConfig)
+    kwargs['output_range'] = output_range
    return networks.TransformerEncoder(**kwargs)


@@ -320,7 +324,8 @@ def classifier_model(bert_config,
        stddev=bert_config.initializer_range)

  if not hub_module_url:
-    bert_encoder = get_transformer_encoder(bert_config, max_seq_length)
+    bert_encoder = get_transformer_encoder(
+        bert_config, max_seq_length, output_range=1)
    return models.BertClassifier(
        bert_encoder,
        num_classes=num_labels,

--- a/official/nlp/modeling/layers/transformer.py
+++ b/official/nlp/modeling/layers/transformer.py
@@ -39,6 +39,8 @@ class Transformer(tf.keras.layers.Layer):
    intermediate_activation: Activation for the intermediate layer.
    dropout_rate: Dropout probability for the post-attention and output dropout.
    attention_dropout_rate: Dropout probability for within the attention layer.
+    output_range: the sequence output range, [0, output_range) by slicing the
+      target sequence. `None` means the target sequence is not sliced.
    kernel_initializer: Initializer for dense layer kernels.
    bias_initializer: Initializer for dense layer biases.
    kernel_regularizer: Regularizer for dense layer kernels.
@@ -54,6 +56,7 @@ class Transformer(tf.keras.layers.Layer):
               intermediate_activation,
               dropout_rate=0.0,
               attention_dropout_rate=0.0,
+               output_range=None,
               kernel_initializer="glorot_uniform",
               bias_initializer="zeros",
               kernel_regularizer=None,
@@ -69,6 +72,7 @@ class Transformer(tf.keras.layers.Layer):
    self._intermediate_activation = intermediate_activation
    self._attention_dropout_rate = attention_dropout_rate
    self._dropout_rate = dropout_rate
+    self._output_range = output_range
    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
@@ -168,6 +172,8 @@ class Transformer(tf.keras.layers.Layer):
            self._dropout_rate,
        "attention_dropout_rate":
            self._attention_dropout_rate,
+        "output_range":
+            self._output_range,
        "kernel_initializer":
            tf.keras.initializers.serialize(self._kernel_initializer),
        "bias_initializer":
@@ -192,11 +198,16 @@ class Transformer(tf.keras.layers.Layer):
    else:
      input_tensor, attention_mask = (inputs, None)

-    attention_inputs = [input_tensor, input_tensor]
+    if self._output_range:
+      target_tensor = input_tensor[:, 0:self._output_range, :]
+      attention_mask = attention_mask[:, 0:self._output_range, :]
+    else:
+      target_tensor = input_tensor
+    attention_inputs = [target_tensor, input_tensor]

    attention_output = self._attention_layer(attention_inputs, attention_mask)
    attention_output = self._attention_dropout(attention_output)
-    attention_output = self._attention_layer_norm(input_tensor +
+    attention_output = self._attention_layer_norm(target_tensor +
                                                  attention_output)
    intermediate_output = self._intermediate_dense(attention_output)
    intermediate_output = self._intermediate_activation_layer(

--- a/official/nlp/modeling/layers/transformer_test.py
+++ b/official/nlp/modeling/layers/transformer_test.py
@@ -29,8 +29,8 @@ from official.nlp.modeling.layers import transformer
 # This decorator runs the test in V1, V2-Eager, and V2-Functional mode. It
 # guarantees forward compatibility of this code for the V2 switchover.
 @keras_parameterized.run_all_keras_modes
-@parameterized.parameters(transformer.Transformer,
-                          transformer.CompiledTransformer)
+@parameterized.named_parameters(('base', transformer.Transformer),
+                                ('xla', transformer.CompiledTransformer))
 class TransformerLayerTest(keras_parameterized.TestCase):

  def tearDown(self):
@@ -127,6 +127,33 @@ class TransformerLayerTest(keras_parameterized.TestCase):
        2, size=(batch_size, sequence_length, sequence_length))
    _ = model.predict([input_data, mask_data])

+  def test_layer_output_range(self, transformer_cls):
+    test_layer = transformer_cls(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu')
+    sequence_length = 21
+    width = 80
+
+    batch_size = 6
+    input_data = 10 * np.random.random_sample(
+        (batch_size, sequence_length, width))
+    mask_data = np.random.randint(
+        2, size=(batch_size, sequence_length, sequence_length))
+    output_tensor = test_layer([input_data, mask_data])
+
+    # The layer only attends to the first token and outputs the first token
+    # embeeding.
+    new_layer = transformer_cls(
+        num_attention_heads=10,
+        intermediate_size=2048,
+        intermediate_activation='relu',
+        output_range=1)
+    _ = new_layer([input_data, mask_data])
+    new_layer.set_weights(test_layer.get_weights())
+    new_output_tensor = new_layer([input_data, mask_data])
+    self.assertAllClose(new_output_tensor, output_tensor[:, 0:1, :])
+
  def test_layer_invocation_with_float16_dtype(self, transformer_cls):
    tf.keras.mixed_precision.experimental.set_policy('mixed_float16')
    test_layer = transformer_cls(

--- a/official/nlp/modeling/networks/transformer_encoder.py
+++ b/official/nlp/modeling/networks/transformer_encoder.py
@@ -61,6 +61,10 @@ class TransformerEncoder(network.Network):
    initializer: The initialzer to use for all weights in this encoder.
    return_all_encoder_outputs: Whether to output sequence embedding outputs of
      all encoder transformer layers.
+    output_range: the sequence output range, [0, output_range), by slicing the
+      target sequence of the last transformer layer. `None` means the entire
+      target sequence will attend to the source sequence, which yeilds the full
+      output.
  """

  def __init__(self,
@@ -77,6 +81,7 @@ class TransformerEncoder(network.Network):
               attention_dropout_rate=0.1,
               initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),
               return_all_encoder_outputs=False,
+               output_range=None,
               **kwargs):
    activation = tf.keras.activations.get(activation)
    initializer = tf.keras.initializers.get(initializer)
@@ -98,6 +103,7 @@ class TransformerEncoder(network.Network):
        'attention_dropout_rate': attention_dropout_rate,
        'initializer': tf.keras.initializers.serialize(initializer),
        'return_all_encoder_outputs': return_all_encoder_outputs,
+        'output_range': output_range,
    }

    word_ids = tf.keras.layers.Input(
@@ -146,12 +152,17 @@ class TransformerEncoder(network.Network):
    attention_mask = layers.SelfAttentionMask()([data, mask])
    encoder_outputs = []
    for i in range(num_layers):
+      if i == num_layers - 1 and output_range is not None:
+        transformer_output_range = output_range
+      else:
+        transformer_output_range = None
      layer = layers.Transformer(
          num_attention_heads=num_attention_heads,
          intermediate_size=intermediate_size,
          intermediate_activation=activation,
          dropout_rate=dropout_rate,
          attention_dropout_rate=attention_dropout_rate,
+          output_range=transformer_output_range,
          kernel_initializer=initializer,
          name='transformer/layer_%d' % i)
      self._transformer_layers.append(layer)

--- a/official/nlp/modeling/networks/transformer_encoder_test.py
+++ b/official/nlp/modeling/networks/transformer_encoder_test.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function

+from absl.testing import parameterized
 import numpy as np
 import tensorflow as tf

@@ -114,7 +115,11 @@ class TransformerEncoderTest(keras_parameterized.TestCase):
    self.assertAllEqual(tf.float32, data.dtype)
    self.assertAllEqual(tf.float16, pooled.dtype)

-  def test_network_invocation(self):
+  @parameterized.named_parameters(
+      ("all_sequence", None, 21),
+      ("output_range", 1, 1),
+  )
+  def test_network_invocation(self, output_range, out_seq_len):
    hidden_size = 32
    sequence_length = 21
    vocab_size = 57
@@ -126,7 +131,8 @@ class TransformerEncoderTest(keras_parameterized.TestCase):
        sequence_length=sequence_length,
        num_attention_heads=2,
        num_layers=3,
-        type_vocab_size=num_types)
+        type_vocab_size=num_types,
+        output_range=output_range)
    self.assertTrue(
        test_network._position_embedding_layer._use_dynamic_slicing)
    # Create the inputs (note that the first dimension is implicit).
@@ -160,7 +166,8 @@ class TransformerEncoderTest(keras_parameterized.TestCase):
        type_vocab_size=num_types)
    self.assertTrue(test_network._position_embedding_layer._use_dynamic_slicing)
    model = tf.keras.Model([word_ids, mask, type_ids], [data, pooled])
-    _ = model.predict([word_id_data, mask_data, type_id_data])
+    outputs = model.predict([word_id_data, mask_data, type_id_data])
+    self.assertEqual(outputs[0].shape[1], out_seq_len)

  def test_serialize_deserialize(self):
    tf.keras.mixed_precision.experimental.set_policy("mixed_float16")
@@ -178,7 +185,8 @@ class TransformerEncoderTest(keras_parameterized.TestCase):
        dropout_rate=0.05,
        attention_dropout_rate=0.22,
        initializer="glorot_uniform",
-        return_all_encoder_outputs=False)
+        return_all_encoder_outputs=False,
+        output_range=-1)
    network = transformer_encoder.TransformerEncoder(**kwargs)

    expected_config = dict(kwargs)