Commit c4f34e58 authored by Tian Lin, committed by Toby Boyd

Merge Transformer V2 to Github (#6846)

* Merged commit includes the following changes:
249218656  by tianlin<tianlin@google.com>:

    Deal with imports, fix a typo and make unit tests fast.

--
249198645  by tianlin<tianlin@google.com>:

    Trivial: Remove one empty line before "import tensorflow"

--
249195490  by tianlin<tianlin@google.com>:

    Initialize Transformer TF V2 Model with Keras subclassing implementation. (Compatible with TF V1)

--
249195008  by tianlin<tianlin@google.com>:

    Internal change

249173564  by hongkuny<hongkuny@google.com>:

    Internal change

249079258  by hongkuny<hongkuny@google.com>:

    Internal change

247691534  by haoyuzhang<haoyuzhang@google.com>:

    Internal change

247533725  by haoyuzhang<haoyuzhang@google.com>:

    Internal change

247509295  by haoyuzhang<haoyuzhang@google.com>:

    Internal change

247311355  by wangtz<wangtz@google.com>:

    Internal change

247303127  by wangtz<wangtz@google.com>:

  ...
parent 4726c5b9
Copyright 2015 The TensorFlow Authors. All rights reserved.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2015, The TensorFlow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -26,7 +26,10 @@ from __future__ import print_function
import os
import sys
import tensorflow as tf # pylint: disable=g-bad-import-order
# pylint: disable=g-bad-import-order
from absl import app as absl_app
import tensorflow as tf
# pylint: enable=g-bad-import-order
# For open source environment, add grandparent directory for import
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(sys.path[0]))))
@@ -195,4 +198,4 @@ def main(argv):
if __name__ == "__main__":
tf.app.run()
absl_app.run()
@@ -190,13 +190,15 @@ class SequenceBeamSearch(object):
best_alive_scores = alive_log_probs[:, 0] / max_length_norm
# Compute worst score in finished sequences for each batch element
finished_scores *= tf.to_float(finished_flags) # set filler scores to zero
finished_scores *= tf.cast(finished_flags,
tf.float32) # set filler scores to zero
lowest_finished_scores = tf.reduce_min(finished_scores, axis=1)
# If there are no finished sequences in a batch element, then set the lowest
# finished score to -INF for that element.
finished_batches = tf.reduce_any(finished_flags, 1)
lowest_finished_scores += (1. - tf.to_float(finished_batches)) * -INF
lowest_finished_scores += (1.0 -
tf.cast(finished_batches, tf.float32)) * -INF
worst_finished_score_better_than_best_alive_score = tf.reduce_all(
tf.greater(lowest_finished_scores, best_alive_scores)
@@ -319,7 +321,7 @@ class SequenceBeamSearch(object):
"""
# To prevent finished sequences from being considered, set log probs to -INF
new_finished_flags = tf.equal(new_seq[:, :, -1], self.eos_id)
new_log_probs += tf.to_float(new_finished_flags) * -INF
new_log_probs += tf.cast(new_finished_flags, tf.float32) * -INF
top_alive_seq, top_alive_log_probs, top_alive_cache = _gather_topk_beams(
[new_seq, new_log_probs, new_cache], new_log_probs, self.batch_size,
@@ -364,7 +366,7 @@ class SequenceBeamSearch(object):
# Set the scores of the still-alive seq in new_seq to large negative values.
new_finished_flags = tf.equal(new_seq[:, :, -1], self.eos_id)
new_scores += (1. - tf.to_float(new_finished_flags)) * -INF
new_scores += (1. - tf.cast(new_finished_flags, tf.float32)) * -INF
# Combine sequences, scores, and flags.
finished_seq = tf.concat([finished_seq, new_seq], axis=1)
@@ -417,12 +419,12 @@ def sequence_beam_search(
def _log_prob_from_logits(logits):
return logits - tf.reduce_logsumexp(logits, axis=2, keep_dims=True)
return logits - tf.reduce_logsumexp(logits, axis=2, keepdims=True)
def _length_normalization(alpha, length):
"""Return length normalization factor."""
return tf.pow(((5. + tf.to_float(length)) / 6.), alpha)
return tf.pow(((5. + tf.cast(length, tf.float32)) / 6.), alpha)
def _expand_to_beam_size(tensor, beam_size):
@@ -42,13 +42,13 @@ def get_position_encoding(
Returns:
Tensor with shape [length, hidden_size]
"""
position = tf.to_float(tf.range(length))
position = tf.cast(tf.range(length), tf.float32)
num_timescales = hidden_size // 2
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(tf.to_float(num_timescales) - 1))
(tf.cast(num_timescales, tf.float32) - 1))
inv_timescales = min_timescale * tf.exp(
tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
return signal
@@ -68,7 +68,7 @@ def get_decoder_self_attention_bias(length):
float tensor of shape [1, 1, length, length]
"""
with tf.name_scope("decoder_self_attention_bias"):
valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
valid_locs = tf.linalg.band_part(tf.ones([length, length]), -1, 0)
valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
decoder_bias = _NEG_INF * (1.0 - valid_locs)
return decoder_bias
@@ -86,7 +86,7 @@ def get_padding(x, padding_value=0):
0 -> non-padding, 1 -> padding
"""
with tf.name_scope("padding"):
return tf.to_float(tf.equal(x, padding_value))
return tf.cast(tf.equal(x, padding_value), tf.float32)
def get_padding_bias(x):
@@ -63,7 +63,8 @@ class Subtokenizer(object):
def __init__(self, vocab_file, reserved_tokens=None):
"""Initializes class, creating a vocab file if data_files is provided."""
tf.logging.info("Initializing Subtokenizer from file %s." % vocab_file)
tf.compat.v1.logging.info("Initializing Subtokenizer from file %s." %
vocab_file)
if reserved_tokens is None:
reserved_tokens = RESERVED_TOKENS
@@ -106,17 +107,17 @@ class Subtokenizer(object):
if reserved_tokens is None:
reserved_tokens = RESERVED_TOKENS
if tf.gfile.Exists(vocab_file):
tf.logging.info("Vocab file already exists (%s)" % vocab_file)
if tf.io.gfile.exists(vocab_file):
tf.compat.v1.logging.info("Vocab file already exists (%s)" % vocab_file)
else:
tf.logging.info("Begin steps to create subtoken vocabulary...")
tf.compat.v1.logging.info("Begin steps to create subtoken vocabulary...")
token_counts = _count_tokens(files, file_byte_limit)
alphabet = _generate_alphabet_dict(token_counts)
subtoken_list = _generate_subtokens_with_target_vocab_size(
token_counts, alphabet, target_vocab_size, threshold, min_count,
reserved_tokens)
tf.logging.info("Generated vocabulary with %d subtokens." %
len(subtoken_list))
tf.compat.v1.logging.info("Generated vocabulary with %d subtokens." %
len(subtoken_list))
_save_vocab_file(vocab_file, subtoken_list)
return Subtokenizer(vocab_file)
@@ -394,22 +395,23 @@ def _generate_subtokens_with_target_vocab_size(
reserved_tokens = RESERVED_TOKENS
if min_count is not None:
tf.logging.info("Using min_count=%d to generate vocab with target size %d" %
(min_count, target_size))
tf.compat.v1.logging.info(
"Using min_count=%d to generate vocab with target size %d" %
(min_count, target_size))
return _generate_subtokens(
token_counts, alphabet, min_count, reserved_tokens=reserved_tokens)
def bisect(min_val, max_val):
"""Recursive function to binary search for subtoken vocabulary."""
cur_count = (min_val + max_val) // 2
tf.logging.info("Binary search: trying min_count=%d (%d %d)" %
(cur_count, min_val, max_val))
tf.compat.v1.logging.info("Binary search: trying min_count=%d (%d %d)" %
(cur_count, min_val, max_val))
subtoken_list = _generate_subtokens(
token_counts, alphabet, cur_count, reserved_tokens=reserved_tokens)
val = len(subtoken_list)
tf.logging.info("Binary search: min_count=%d resulted in %d tokens" %
(cur_count, val))
tf.compat.v1.logging.info(
"Binary search: min_count=%d resulted in %d tokens" % (cur_count, val))
within_threshold = abs(val - target_size) < threshold
if within_threshold or min_val >= max_val or cur_count < 2:
@@ -425,8 +427,8 @@ def _generate_subtokens_with_target_vocab_size(
return other_subtoken_list
return subtoken_list
tf.logging.info("Finding best min_count to get target size of %d" %
target_size)
tf.compat.v1.logging.info("Finding best min_count to get target size of %d" %
target_size)
return bisect(_MIN_MIN_COUNT, _MAX_MIN_COUNT)
@@ -594,7 +596,7 @@ def _generate_subtokens(
# subtoken_dict, count how often the resulting subtokens appear, and update
# the dictionary with subtokens w/ high enough counts.
for i in xrange(num_iterations):
tf.logging.info("\tGenerating subtokens: iteration %d" % i)
tf.compat.v1.logging.info("\tGenerating subtokens: iteration %d" % i)
# Generate new subtoken->id dictionary using the new subtoken list.
subtoken_dict = _list_to_index_dict(subtoken_list)
@@ -607,5 +609,5 @@ def _generate_subtokens(
subtoken_list, max_subtoken_length = _gen_new_subtoken_list(
subtoken_counts, min_count, alphabet, reserved_tokens)
tf.logging.info("\tVocab size: %d" % len(subtoken_list))
tf.compat.v1.logging.info("\tVocab size: %d" % len(subtoken_list))
return subtoken_list
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of multiheaded attention and self-attention layers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class Attention(tf.keras.layers.Layer):
"""Multi-headed attention layer."""
def __init__(self, hidden_size, num_heads, attention_dropout):
"""Initialize Attention.
Args:
hidden_size: int, output dim of hidden layer.
num_heads: int, number of heads to repeat the same attention structure.
attention_dropout: float, dropout rate inside attention for training.
"""
if hidden_size % num_heads:
raise ValueError(
"Hidden size ({}) must be divisible by the number of heads ({})."
.format(hidden_size, num_heads))
super(Attention, self).__init__()
self.hidden_size = hidden_size
self.num_heads = num_heads
self.attention_dropout = attention_dropout
def build(self, input_shape):
"""Builds the layer."""
# Layers for linearly projecting the queries, keys, and values.
self.q_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="q")
self.k_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="k")
self.v_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="v")
self.output_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=False, name="output_transform")
super(Attention, self).build(input_shape)
def get_config(self):
return {
"hidden_size": self.hidden_size,
"num_heads": self.num_heads,
"attention_dropout": self.attention_dropout,
}
def split_heads(self, x):
"""Split x into different heads, and transpose the resulting value.
The tensor is transposed to ensure the inner dimensions hold the correct
values during the matrix multiplication.
Args:
x: A tensor with shape [batch_size, length, hidden_size]
Returns:
A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
"""
with tf.name_scope("split_heads"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
# Calculate depth of last dimension after it has been split.
depth = (self.hidden_size // self.num_heads)
# Split the last dimension
x = tf.reshape(x, [batch_size, length, self.num_heads, depth])
# Transpose the result
return tf.transpose(x, [0, 2, 1, 3])
def combine_heads(self, x):
"""Combine tensor that has been split.
Args:
x: A tensor [batch_size, num_heads, length, hidden_size/num_heads]
Returns:
A tensor with shape [batch_size, length, hidden_size]
"""
with tf.name_scope("combine_heads"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[2]
x = tf.transpose(x, [0, 2, 1, 3]) # --> [batch, length, num_heads, depth]
return tf.reshape(x, [batch_size, length, self.hidden_size])
def call(self, x, y, bias, training, cache=None):
"""Apply attention mechanism to x and y.
Args:
x: a tensor with shape [batch_size, length_x, hidden_size]
y: a tensor with shape [batch_size, length_y, hidden_size]
bias: attention bias that will be added to the result of the dot product.
training: boolean, whether in training mode or not.
cache: (Used during prediction) dictionary with tensors containing results
of previous attentions. The dictionary must have the items:
{"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
Returns:
Attention layer output with shape [batch_size, length_x, hidden_size]
"""
# Linearly project the query (q), key (k) and value (v) using different
# learned projections. This is in preparation of splitting them into
# multiple heads. Multi-head attention uses multiple queries, keys, and
# values rather than regular attention (which uses a single q, k, v).
q = self.q_dense_layer(x)
k = self.k_dense_layer(y)
v = self.v_dense_layer(y)
if cache is not None:
# Combine cached keys and values with new keys and values.
k = tf.concat([cache["k"], k], axis=1)
v = tf.concat([cache["v"], v], axis=1)
# Update cache
cache["k"] = k
cache["v"] = v
# Split q, k, v into heads.
q = self.split_heads(q)
k = self.split_heads(k)
v = self.split_heads(v)
# Scale q to prevent the dot product between q and k from growing too large.
depth = (self.hidden_size // self.num_heads)
q *= depth ** -0.5
# Calculate dot product attention
logits = tf.matmul(q, k, transpose_b=True)
logits += bias
weights = tf.nn.softmax(logits, name="attention_weights")
if training:
weights = tf.nn.dropout(weights, rate=self.attention_dropout)
attention_output = tf.matmul(weights, v)
# Recombine heads --> [batch_size, length, hidden_size]
attention_output = self.combine_heads(attention_output)
# Run the combined outputs through another linear projection layer.
attention_output = self.output_dense_layer(attention_output)
return attention_output
class SelfAttention(Attention):
"""Multiheaded self-attention layer."""
def call(self, x, bias, training, cache=None):
return super(SelfAttention, self).call(x, x, bias, training, cache)
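# Illustrative usage sketch appended for clarity; it is not part of the
# original file. The layer sizes, dropout rate, and bias below are arbitrary
# assumptions chosen only to show the expected shapes.
if __name__ == "__main__":
  demo_layer = SelfAttention(hidden_size=64, num_heads=4, attention_dropout=0.1)
  demo_x = tf.ones([2, 10, 64])          # [batch_size, length, hidden_size]
  demo_bias = tf.zeros([2, 1, 1, 10])    # broadcastable attention bias (e.g. a padding bias)
  demo_out = demo_layer(demo_x, demo_bias, training=False)
  print(demo_out.shape)                  # -> (2, 10, 64)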
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Input pipeline for the transformer model to read, filter, and batch examples.
Two things to note in the pipeline:
1. Batching scheme
The examples encoded in the TFRecord files contain data in the format:
{"inputs": [variable length array of integers],
"targets": [variable length array of integers]}
Where integers in the arrays refer to tokens in the English and German vocab
file (named `vocab.ende.32768`).
Prior to batching, elements in the dataset are grouped by length (max between
"inputs" and "targets" length). Each group is then batched such that:
group_batch_size * length <= batch_size.
Another way to view batch_size is the maximum number of tokens in each batch.
Once batched, each element in the dataset will have the shape:
{"inputs": [group_batch_size, padded_input_length],
"targets": [group_batch_size, padded_target_length]}
Lengths are padded to the longest "inputs" or "targets" sequence in the batch
(padded_input_length and padded_target_length can be different).
This batching scheme decreases the fraction of padding tokens per training
batch, thus improving the training speed significantly.
2. Shuffling
While training, the dataset is shuffled in two places in the code. The first
is the list of training files. Second, while reading records using
`parallel_interleave`, the `sloppy` argument is used to generate randomness
in the order of the examples.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import os
import tensorflow as tf
# TODO(tianlin) Import internal library. Remove this when different behaviors
# of keras_model.fit(dataset, ...) for different TF versions are fixed.
from tensorflow.python import tf2 as tf2_internal
from official.utils.misc import model_helpers
# Buffer size for reading records from a TFRecord file. Each training file is
# 7.2 MB, so 8 MB allows an entire file to be kept in memory.
_READ_RECORD_BUFFER = 8 * 1000 * 1000
# Example grouping constants. Defines length boundaries for each group.
# These values are the defaults used in Tensor2Tensor.
_MIN_BOUNDARY = 8
_BOUNDARY_SCALE = 1.1
def _load_records(filename):
"""Read file and return a dataset of tf.Examples."""
return tf.data.TFRecordDataset(filename, buffer_size=_READ_RECORD_BUFFER)
def _parse_example(serialized_example):
"""Return inputs and targets Tensors from a serialized tf.Example."""
data_fields = {
"inputs": tf.io.VarLenFeature(tf.int64),
"targets": tf.io.VarLenFeature(tf.int64)
}
parsed = tf.io.parse_single_example(serialized_example, data_fields)
inputs = tf.sparse.to_dense(parsed["inputs"])
targets = tf.sparse.to_dense(parsed["targets"])
return inputs, targets
def _filter_max_length(example, max_length=256):
"""Indicates whether the example's length is lower than the maximum length."""
return tf.logical_and(tf.size(example[0]) <= max_length,
tf.size(example[1]) <= max_length)
def _get_example_length(example):
"""Returns the maximum length between the example inputs and targets."""
length = tf.maximum(tf.shape(example[0])[0], tf.shape(example[1])[0])
return length
def _create_min_max_boundaries(
max_length, min_boundary=_MIN_BOUNDARY, boundary_scale=_BOUNDARY_SCALE):
"""Create min and max boundary lists up to max_length.
For example, when max_length=24, min_boundary=4 and boundary_scale=2, the
returned values will be:
buckets_min = [0, 4, 8, 16]
buckets_max = [4, 8, 16, 25]
Args:
max_length: The maximum length of example in dataset.
min_boundary: Minimum length in boundary.
boundary_scale: Amount to scale consecutive boundaries in the list.
Returns:
min and max boundary lists
"""
# Create bucket boundaries list by scaling the previous boundary or adding 1
# (to ensure increasing boundary sizes).
bucket_boundaries = []
x = min_boundary
while x < max_length:
bucket_boundaries.append(x)
x = max(x + 1, int(x * boundary_scale))
# Create min and max boundary lists from the initial list.
buckets_min = [0] + bucket_boundaries
buckets_max = bucket_boundaries + [max_length + 1]
return buckets_min, buckets_max
def _batch_examples(dataset, batch_size, max_length):
"""Group examples by similar lengths, and return batched dataset.
Each batch of similar-length examples is padded to the same length, and batches
may contain a different number of elements, such that:
group_batch_size * padded_length <= batch_size.
This decreases the number of padding tokens per batch, which improves the
training speed.
Args:
dataset: Dataset of unbatched examples.
batch_size: Max number of tokens per batch of examples.
max_length: Max number of tokens in an example input or target sequence.
Returns:
Dataset of batched examples with similar lengths.
"""
# Get min and max boundary lists for each example. These are used to calculate
# the `bucket_id`, which is the index at which:
# buckets_min[bucket_id] <= len(example) < buckets_max[bucket_id]
# Note that using both min and max lists improves the performance.
buckets_min, buckets_max = _create_min_max_boundaries(max_length)
# Create list of batch sizes for each bucket_id, so that
# bucket_batch_size[bucket_id] * buckets_max[bucket_id] <= batch_size
bucket_batch_sizes = [batch_size // x for x in buckets_max]
# bucket_id will be a tensor, so convert this list to a tensor as well.
bucket_batch_sizes = tf.constant(bucket_batch_sizes, dtype=tf.int64)
def example_to_bucket_id(example_input, example_target):
"""Return int64 bucket id for this example, calculated based on length."""
seq_length = _get_example_length((example_input, example_target))
# TODO: investigate whether removing code branching improves performance.
conditions_c = tf.logical_and(
tf.less_equal(buckets_min, seq_length),
tf.less(seq_length, buckets_max))
bucket_id = tf.reduce_min(tf.where(conditions_c))
return bucket_id
def window_size_fn(bucket_id):
"""Return number of examples to be grouped when given a bucket id."""
return bucket_batch_sizes[bucket_id]
def batching_fn(bucket_id, grouped_dataset):
"""Batch and add padding to a dataset of elements with similar lengths."""
bucket_batch_size = window_size_fn(bucket_id)
# Batch the dataset and add padding so that all input sequences in the
# examples have the same length, and all target sequences have the same
# lengths as well. Resulting lengths of inputs and targets can differ.
return grouped_dataset.padded_batch(bucket_batch_size, ([None], [None]))
return dataset.apply(tf.data.experimental.group_by_window(
key_func=example_to_bucket_id,
reduce_func=batching_fn,
window_size=None,
window_size_func=window_size_fn))
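# Worked example added for clarity (not part of the original module): suppose
# batch_size is 4096 tokens and an example whose longer sequence has length 70
# lands in a bucket with buckets_max[bucket_id] = 128. Then
# bucket_batch_sizes[bucket_id] = 4096 // 128 = 32, so up to 32 examples from
# that bucket are grouped into one batch, each padded to the longest sequence
# in that batch, which keeps the token count per batch at or below 4096.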
def _read_and_batch_from_files(
file_pattern, batch_size, max_length, num_parallel_calls, shuffle, repeat,
static_batch=False):
"""Create dataset where each item is a dict of "inputs" and "targets".
Args:
file_pattern: String used to match the input TFRecord files.
batch_size: Maximum number of tokens per batch of examples
max_length: Maximum number of tokens per example
num_parallel_calls: Number of cpu cores for parallel input processing.
shuffle: If true, randomizes order of elements.
repeat: Number of times to repeat the dataset. If None, the dataset is
repeated forever.
static_batch: Whether the batches in the dataset should have static shapes.
If True, the input is batched so that every batch has the
shape [batch_size // max_length, max_length]. If False, the input is
grouped by length, and batched so that batches may have different
shapes [N, M], where:
N * M <= batch_size
M <= max_length
In general, this setting should be False. Dynamic shapes allow the inputs
to be grouped so that the number of padding tokens is minimized, which helps
model training. In cases where the input shape must be static
(e.g. running on TPU), this setting should be set to True.
Returns:
tf.data.Dataset object containing examples loaded from the files.
"""
dataset = tf.data.Dataset.list_files(file_pattern, shuffle=shuffle)
# Read files and interleave results. When training, the order of the examples
# will be non-deterministic.
dataset = dataset.interleave(
_load_records,
cycle_length=num_parallel_calls,
num_parallel_calls=num_parallel_calls)
# Parse each tf.Example into a dictionary
# TODO: Look into prefetch_input_elements for performance optimization.
dataset = dataset.map(_parse_example,
num_parallel_calls=num_parallel_calls)
# Remove examples where the input or target length exceeds the maximum length.
dataset = dataset.filter(lambda x, y: _filter_max_length((x, y), max_length))
if static_batch:
dataset = dataset.padded_batch(
batch_size // max_length, ([max_length], [max_length]),
drop_remainder=True)
else:
# Group and batch such that each batch has examples of similar length.
dataset = _batch_examples(dataset, batch_size, max_length)
dataset = dataset.repeat(repeat)
# Prefetch the next element to improve speed of input pipeline.
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
return dataset
def _generate_synthetic_data(params):
"""Create synthetic data based on the parameter batch size."""
batch = length = int(math.sqrt(params["batch_size"]))
return model_helpers.generate_synthetic_data(
input_shape=tf.TensorShape([batch, length]),
input_value=1,
input_dtype=tf.int32,
label_shape=tf.TensorShape([batch, length]),
label_value=1,
label_dtype=tf.int32,
)
def train_input_fn(params):
"""Load and return dataset of batched examples for use during training."""
file_pattern = os.path.join(params["data_dir"] or "", "*train*")
if params["use_synthetic_data"]:
return _generate_synthetic_data(params)
return _read_and_batch_from_files(
file_pattern, params["batch_size"], params["max_length"],
params["num_parallel_calls"], shuffle=True,
repeat=params["repeat_dataset"], static_batch=params["static_batch"])
def eval_input_fn(params):
"""Load and return dataset of batched examples for use during evaluation."""
file_pattern = os.path.join(params["data_dir"] or "", "*dev*")
if params["use_synthetic_data"]:
return _generate_synthetic_data(params)
return _read_and_batch_from_files(
file_pattern, params["batch_size"], params["max_length"],
params["num_parallel_calls"], shuffle=False, repeat=1,
static_batch=params["static_batch"])
def map_data_for_transformer_fn(x, y):
"""Maps data for training, and handles weried behaviors for different vers."""
# Will transform input x and targets y into tuple(x, y) as new model inputs.
if tf2_internal.enabled():
# For TF v2, the 2nd parameter is omitted to make Keras training work.
return ((x, y),)
else:
# For TF v1, Keras requires a dummy placeholder as the 2nd parameter.
return ((x, y), tf.constant(0.0))
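# Illustrative usage sketch appended for clarity; it is not part of the
# original module. The parameter values are arbitrary assumptions, but the
# dict keys match the ones read by train_input_fn above. Synthetic data is
# requested so the sketch does not need real TFRecord files.
if __name__ == "__main__":
  demo_params = {
      "data_dir": "/tmp/translate_ende",
      "use_synthetic_data": True,
      "batch_size": 4096,          # maximum number of tokens per batch
      "max_length": 256,
      "num_parallel_calls": 4,
      "repeat_dataset": 1,
      "static_batch": False,
  }
  demo_dataset = train_input_fn(demo_params)
  print(demo_dataset)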
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of embedding layer with shared weights."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class EmbeddingSharedWeights(tf.keras.layers.Layer):
"""Calculates input embeddings and pre-softmax linear with shared weights."""
def __init__(self, vocab_size, hidden_size):
"""Specify characteristic parameters of embedding layer.
Args:
vocab_size: Number of tokens in the embedding. (Typically ~32,000)
hidden_size: Dimensionality of the embedding. (Typically 512 or 1024)
"""
super(EmbeddingSharedWeights, self).__init__()
self.vocab_size = vocab_size
self.hidden_size = hidden_size
def build(self, input_shape):
with tf.name_scope("embedding_and_softmax"):
# Create and initialize weights. The random normal initializer was chosen
# arbitrarily, and works well.
self.shared_weights = self.add_weight(
"weights",
shape=[self.vocab_size, self.hidden_size],
dtype="float32",
initializer=tf.random_normal_initializer(
mean=0., stddev=self.hidden_size**-0.5))
super(EmbeddingSharedWeights, self).build(input_shape)
def get_config(self):
return {
"vocab_size": self.vocab_size,
"hidden_size": self.hidden_size,
}
def call(self, inputs, mode="embedding"):
"""Get token embeddings of inputs.
Args:
inputs: An int64 tensor with shape [batch_size, length]
mode: string, a valid value is one of "embedding" and "linear".
Returns:
outputs: (1) If mode == "embedding", output embedding tensor, float32 with
shape [batch_size, length, embedding_size]; (2) if mode == "linear", output
linear tensor, float32 with shape [batch_size, length, vocab_size].
Raises:
ValueError: if mode is not valid.
"""
if mode == "embedding":
return self._embedding(inputs)
elif mode == "linear":
return self._linear(inputs)
else:
raise ValueError("mode {} is not valid.".format(mode))
def _embedding(self, inputs):
"""Applies embedding based on inputs tensor."""
with tf.name_scope("embedding"):
# Create binary mask of size [batch_size, length]
mask = tf.cast(tf.not_equal(inputs, 0), tf.float32)
embeddings = tf.gather(self.shared_weights, inputs)
embeddings *= tf.expand_dims(mask, -1)
# Scale embedding by the sqrt of the hidden size
embeddings *= self.hidden_size ** 0.5
return embeddings
def _linear(self, inputs):
"""Computes logits by running inputs through a linear layer.
Args:
inputs: A float32 tensor with shape [batch_size, length, hidden_size]
Returns:
float32 tensor with shape [batch_size, length, vocab_size].
"""
with tf.name_scope("presoftmax_linear"):
batch_size = tf.shape(inputs)[0]
length = tf.shape(inputs)[1]
x = tf.reshape(inputs, [-1, self.hidden_size])
logits = tf.matmul(x, self.shared_weights, transpose_b=True)
return tf.reshape(logits, [batch_size, length, self.vocab_size])
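# Illustrative usage sketch appended for clarity; it is not part of the
# original module, and the vocabulary and hidden sizes are arbitrary
# assumptions. Token id 0 is treated as padding and embeds to zeros.
if __name__ == "__main__":
  demo_layer = EmbeddingSharedWeights(vocab_size=100, hidden_size=16)
  demo_ids = tf.constant([[5, 7, 9, 0]])                   # [batch_size=1, length=4]
  demo_embedded = demo_layer(demo_ids)                     # -> [1, 4, 16]
  demo_logits = demo_layer(demo_embedded, mode="linear")   # -> [1, 4, 100], same weights
  print(demo_embedded.shape, demo_logits.shape)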
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of fully connected network."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class FeedForwardNetwork(tf.keras.layers.Layer):
"""Fully connected feedforward network."""
def __init__(self, hidden_size, filter_size, relu_dropout):
"""Initialize FeedForwardNetwork.
Args:
hidden_size: int, output dim of hidden layer.
filter_size: int, filter size for the inner (first) dense layer.
relu_dropout: float, dropout rate for training.
"""
super(FeedForwardNetwork, self).__init__()
self.hidden_size = hidden_size
self.filter_size = filter_size
self.relu_dropout = relu_dropout
def build(self, input_shape):
self.filter_dense_layer = tf.keras.layers.Dense(
self.filter_size,
use_bias=True,
activation=tf.nn.relu,
name="filter_layer")
self.output_dense_layer = tf.keras.layers.Dense(
self.hidden_size, use_bias=True, name="output_layer")
super(FeedForwardNetwork, self).build(input_shape)
def get_config(self):
return {
"hidden_size": self.hidden_size,
"filter_size": self.filter_size,
"relu_dropout": self.relu_dropout,
}
def call(self, x, training):
"""Return outputs of the feedforward network.
Args:
x: tensor with shape [batch_size, length, hidden_size]
training: boolean, whether in training mode or not.
Returns:
Output of the feedforward network.
tensor with shape [batch_size, length, hidden_size]
"""
# Retrieve dynamically known shapes
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
output = self.filter_dense_layer(x)
if training:
output = tf.nn.dropout(output, rate=self.relu_dropout)
output = self.output_dense_layer(output)
return output
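# Illustrative usage sketch appended for clarity; it is not part of the
# original module, and the layer sizes are arbitrary assumptions.
if __name__ == "__main__":
  demo_network = FeedForwardNetwork(hidden_size=32, filter_size=64, relu_dropout=0.1)
  demo_inputs = tf.ones([2, 10, 32])        # [batch_size, length, hidden_size]
  demo_outputs = demo_network(demo_inputs, training=False)
  print(demo_outputs.shape)                 # -> (2, 10, 32), same as the input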
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Functions for calculating loss, accuracy, and other model metrics.
Metrics:
- Padded loss, accuracy, and negative log perplexity. Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/metrics.py
- BLEU approximation. Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py
- ROUGE score. Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/rouge.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import functools
import tensorflow as tf
def _pad_tensors_to_same_length(x, y):
"""Pad x and y so that the results have the same length (second dimension)."""
with tf.name_scope("pad_to_same_length"):
x_length = tf.shape(x)[1]
y_length = tf.shape(y)[1]
max_length = tf.maximum(x_length, y_length)
x = tf.pad(x, [[0, 0], [0, max_length - x_length], [0, 0]])
y = tf.pad(y, [[0, 0], [0, max_length - y_length]])
return x, y
def padded_cross_entropy_loss(logits, labels, smoothing, vocab_size):
"""Calculate cross entropy loss while ignoring padding.
Args:
logits: Tensor of size [batch_size, length_logits, vocab_size]
labels: Tensor of size [batch_size, length_labels]
smoothing: Label smoothing constant, used to determine the on and off values
vocab_size: int size of the vocabulary
Returns:
Returns the cross entropy loss and weight tensors: float32 tensors with
shape [batch_size, max(length_logits, length_labels)]
"""
with tf.name_scope("loss"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
# Calculate smoothing cross entropy
with tf.name_scope("smoothing_cross_entropy"):
confidence = 1.0 - smoothing
low_confidence = (1.0 - confidence) / tf.cast(vocab_size - 1, tf.float32)
soft_targets = tf.one_hot(
tf.cast(labels, tf.int32),
depth=vocab_size,
on_value=confidence,
off_value=low_confidence)
xentropy = tf.nn.softmax_cross_entropy_with_logits(
logits=logits, labels=soft_targets)
# Calculate the best (lowest) possible value of cross entropy, and
# subtract from the cross entropy loss.
normalizing_constant = -(
confidence * tf.math.log(confidence) +
tf.cast(vocab_size - 1, tf.float32) * low_confidence *
tf.math.log(low_confidence + 1e-20))
xentropy -= normalizing_constant
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
return xentropy * weights, weights
def padded_accuracy(logits, labels):
"""Percentage of times that predictions matches labels on non-0s."""
with tf.name_scope("padded_accuracy"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
outputs = tf.cast(tf.argmax(logits, axis=-1), tf.int32)
padded_labels = tf.cast(labels, tf.int32)
return tf.cast(tf.equal(outputs, padded_labels), tf.float32), weights
def padded_accuracy_topk(logits, labels, k):
"""Percentage of times that top-k predictions matches labels on non-0s."""
with tf.name_scope("padded_accuracy_topk"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
effective_k = tf.minimum(k, tf.shape(logits)[-1])
_, outputs = tf.nn.top_k(logits, k=effective_k)
outputs = tf.cast(outputs, tf.int32)
padded_labels = tf.cast(labels, tf.int32)
padded_labels = tf.expand_dims(padded_labels, axis=-1)
padded_labels += tf.zeros_like(outputs) # Pad to same shape.
same = tf.cast(tf.equal(outputs, padded_labels), tf.float32)
same_topk = tf.reduce_sum(same, axis=-1)
return same_topk, weights
def padded_accuracy_top5(logits, labels):
return padded_accuracy_topk(logits, labels, 5)
def padded_sequence_accuracy(logits, labels):
"""Percentage of times that predictions matches labels everywhere (non-0)."""
with tf.name_scope("padded_sequence_accuracy"):
logits, labels = _pad_tensors_to_same_length(logits, labels)
weights = tf.cast(tf.not_equal(labels, 0), tf.float32)
outputs = tf.cast(tf.argmax(logits, axis=-1), tf.int32)
padded_labels = tf.cast(labels, tf.int32)
not_correct = tf.cast(tf.not_equal(outputs, padded_labels),
tf.float32) * weights
axis = list(range(1, len(outputs.get_shape())))
correct_seq = 1.0 - tf.minimum(1.0, tf.reduce_sum(not_correct, axis=axis))
return correct_seq, tf.constant(1.0)
def padded_neg_log_perplexity(logits, labels, vocab_size):
"""Average log-perplexity excluding padding 0s. No smoothing."""
num, den = padded_cross_entropy_loss(logits, labels, 0, vocab_size)
return -num, den
class MetricLayer(tf.keras.layers.Layer):
"""Custom a layer of metrics for Transformer model."""
def __init__(self, vocab_size):
super(MetricLayer, self).__init__()
self.vocab_size = vocab_size
self.metric_mean_fns = []
def build(self, input_shape):
neg_log_perplexity = functools.partial(
padded_neg_log_perplexity, vocab_size=self.vocab_size)
self.metric_mean_fns = [
(tf.keras.metrics.Mean("accuracy"), padded_accuracy),
(tf.keras.metrics.Mean("accuracy_top5"), padded_accuracy_top5),
(tf.keras.metrics.Mean("accuracy_per_sequence"),
padded_sequence_accuracy),
(tf.keras.metrics.Mean("neg_log_perplexity"), neg_log_perplexity),
]
super(MetricLayer, self).build(input_shape)
def get_config(self):
return {"vocab_size": self.vocab_size}
def call(self, inputs):
logits, targets = inputs[0], inputs[1]
for mean, fn in self.metric_mean_fns:
m = mean(*fn(logits, targets))
self.add_metric(m)
return logits
def transformer_loss(logits, labels, smoothing, vocab_size):
"""Calculates total loss containing cross entropy with padding ignored.
Args:
logits: Tensor of size [batch_size, length_logits, vocab_size]
labels: Tensor of size [batch_size, length_labels]
smoothing: Label smoothing constant, used to determine the on and off values
vocab_size: int size of the vocabulary
Returns:
A scalar float tensor for loss.
"""
xentropy, weights = padded_cross_entropy_loss(logits, labels, smoothing,
vocab_size)
return tf.reduce_sum(xentropy) / tf.reduce_sum(weights)
class LossLayer(tf.keras.layers.Layer):
"""Custom a layer of transformer loss for Transformer model."""
def __init__(self, vocab_size, label_smoothing):
super(LossLayer, self).__init__()
self.vocab_size = vocab_size
self.label_smoothing = label_smoothing
def get_config(self):
return {
"vocab_size": self.vocab_size,
"label_smoothing": self.label_smoothing,
}
def call(self, inputs):
logits, targets = inputs[0], inputs[1]
loss = transformer_loss(logits, targets, self.label_smoothing,
self.vocab_size)
self.add_loss(loss)
return logits
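# Illustrative usage sketch appended for clarity; it is not part of the
# original module. The shapes, smoothing value, and vocabulary size are
# arbitrary assumptions.
if __name__ == "__main__":
  demo_logits = tf.zeros([2, 5, 100])        # [batch_size, length, vocab_size]
  demo_labels = tf.ones([2, 5], dtype=tf.int64)
  demo_loss = transformer_loss(demo_logits, demo_labels, smoothing=0.1, vocab_size=100)
  demo_acc, demo_weights = padded_accuracy(demo_logits, demo_labels)
  print(demo_loss, tf.reduce_sum(demo_acc * demo_weights) / tf.reduce_sum(demo_weights))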
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an 'AS IS' BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Misc for Transformer."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl import flags
from official.transformer.model import model_params
from official.utils.flags import core as flags_core
PARAMS_MAP = {
"tiny": model_params.TINY_PARAMS,
"base": model_params.BASE_PARAMS,
"big": model_params.BIG_PARAMS,
}
def get_model_params(param_set, num_gpus):
"""Gets predefined model params."""
if num_gpus > 1:
if param_set == "big":
return model_params.BIG_MULTI_GPU_PARAMS.copy()
elif param_set == "base":
return model_params.BASE_MULTI_GPU_PARAMS.copy()
else:
raise ValueError("Not valid params: param_set={} num_gpus={}".format(
param_set, num_gpus))
return PARAMS_MAP[param_set].copy()
def define_transformer_flags():
"""Add flags and flag validators for running transformer_main."""
# Add common flags (data_dir, model_dir, train_epochs, etc.).
flags_core.define_base()
flags_core.define_performance(
num_parallel_calls=True,
inter_op=False,
intra_op=False,
synthetic_data=True,
max_train_steps=False,
dtype=False,
all_reduce_alg=True
)
flags_core.define_benchmark()
flags_core.define_device(tpu=True)
# Set flags from the flags_core module as "key flags" so they're listed when
# the '-h' flag is used. Without this line, the flags defined above are
# only shown in the full `--helpful` help text.
flags.adopt_module_key_flags(flags_core)
# Add transformer-specific flags
flags.DEFINE_enum(
name="param_set", short_name="mp", default="big",
enum_values=PARAMS_MAP.keys(),
help=flags_core.help_wrap(
"Parameter set to use when creating and training the model. The "
"parameters define the input shape (batch size and max length), "
"model configuration (size of embedding, # of hidden layers, etc.), "
"and various other settings. The big parameter set increases the "
"default batch size, embedding/hidden size, and filter size. For a "
"complete list of parameters, please see model/model_params.py."))
flags.DEFINE_bool(
name="static_batch", default=False,
help=flags_core.help_wrap(
"Whether the batches in the dataset should have static shapes. In "
"general, this setting should be False. Dynamic shapes allow the "
"inputs to be grouped so that the number of padding tokens is "
"minimized, and helps model training. In cases where the input shape "
"must be static (e.g. running on TPU), this setting will be ignored "
"and static batching will always be used."))
# Flags for training with steps (may be used for debugging)
flags.DEFINE_integer(
name="steps_per_epoch", short_name="sbe", default=1000,
help=flags_core.help_wrap(
"The number of training steps for each epoch."))
flags.DEFINE_integer(
name="init_epoch", short_name="is", default=0,
help=flags_core.help_wrap("The number of initial epoch for training."))
flags.DEFINE_string(
name="init_weight_path", short_name="iwp", default=None,
help=flags_core.help_wrap("The initial model weights to load."))
flags.DEFINE_string(
name="init_logdir_timestamp", short_name="ilt", default=None,
help=flags_core.help_wrap("The initial timestamp for logdir."))
flags.DEFINE_integer(
name="validation_steps", short_name="vs", default=64,
help=flags_core.help_wrap("The number of steps used in validation."))
# BLEU score computation
flags.DEFINE_string(
name="bleu_source", short_name="bls", default=None,
help=flags_core.help_wrap(
"Path to source file containing text translate when calculating the "
"official BLEU score. Both --bleu_source and --bleu_ref must be set. "
"Use the flag --stop_threshold to stop the script based on the "
"uncased BLEU score."))
flags.DEFINE_string(
name="bleu_ref", short_name="blr", default=None,
help=flags_core.help_wrap(
"Path to source file containing text translate when calculating the "
"official BLEU score. Both --bleu_source and --bleu_ref must be set. "
"Use the flag --stop_threshold to stop the script based on the "
"uncased BLEU score."))
flags.DEFINE_string(
name="vocab_file", short_name="vf", default=None,
help=flags_core.help_wrap(
"Path to subtoken vocabulary file. If data_download.py was used to "
"download and encode the training data, look in the data_dir to find "
"the vocab file."))
flags.DEFINE_string(
name="mode", default="train",
help=flags_core.help_wrap("mode: train, eval, or predict"))
flags_core.set_defaults(data_dir="/tmp/translate_ende",
model_dir="/tmp/transformer_model",
batch_size=None,
train_epochs=10)
# pylint: disable=unused-variable
@flags.multi_flags_validator(
["mode", "train_epochs"],
message="--train_epochs must be defined in train mode")
def _check_train_limits(flag_dict):
if flag_dict["mode"] == "train":
return flag_dict["train_epochs"] is not None
return True
@flags.multi_flags_validator(
["bleu_source", "bleu_ref"],
message="Both or neither --bleu_source and --bleu_ref must be defined.")
def _check_bleu_files(flags_dict):
return (flags_dict["bleu_source"] is None) == (
flags_dict["bleu_ref"] is None)
@flags.multi_flags_validator(
["bleu_source", "bleu_ref", "vocab_file"],
message="--vocab_file must be defined if --bleu_source and --bleu_ref "
"are defined.")
def _check_bleu_vocab_file(flags_dict):
if flags_dict["bleu_source"] and flags_dict["bleu_ref"]:
return flags_dict["vocab_file"] is not None
return True
@flags.multi_flags_validator(
["export_dir", "vocab_file"],
message="--vocab_file must be defined if --export_dir is set.")
def _check_export_vocab_file(flags_dict):
if flags_dict["export_dir"]:
return flags_dict["vocab_file"] is not None
return True
# pylint: enable=unused-variable
flags_core.require_cloud_storage(["data_dir", "model_dir", "export_dir"])
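# Illustrative usage sketch appended for clarity; it is not part of the
# original module. The "base" parameter set and single-GPU setting below are
# arbitrary choices.
if __name__ == "__main__":
  define_transformer_flags()
  demo_params = get_model_params(param_set="base", num_gpus=1)
  print(type(demo_params))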
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Optimizer from addons and learning rate scheduler.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
K = tf.keras.backend
class LazyAdam(tf.keras.optimizers.Adam):
"""Variant of the Adam optimizer that handles sparse updates more efficiently.
The original Adam algorithm maintains two moving-average accumulators for
each trainable variable; the accumulators are updated at every step.
This class provides lazier handling of gradient updates for sparse
variables. It only updates moving-average accumulators for sparse variable
indices that appear in the current batch, rather than updating the
accumulators for all indices. Compared with the original Adam optimizer,
it can provide large improvements in model training throughput for some
applications. However, it provides slightly different semantics than the
original Adam algorithm, and may lead to different empirical results.
  Note: amsgrad is currently not supported; the argument can only be False.
This class is borrowed from:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lazy_adam.py
"""
def _resource_apply_sparse(self, grad, var, indices):
"""Applies grad for one step."""
var_dtype = var.dtype.base_dtype
lr_t = self._decayed_lr(var_dtype)
beta_1_t = self._get_hyper('beta_1', var_dtype)
beta_2_t = self._get_hyper('beta_2', var_dtype)
local_step = tf.cast(self.iterations + 1, var_dtype)
beta_1_power = tf.math.pow(beta_1_t, local_step)
beta_2_power = tf.math.pow(beta_2_t, local_step)
epsilon_t = tf.convert_to_tensor(self.epsilon, var_dtype)
lr = (lr_t * tf.math.sqrt(1 - beta_2_power) / (1 - beta_1_power))
# \\(m := beta1 * m + (1 - beta1) * g_t\\)
m = self.get_slot(var, 'm')
m_t_slice = beta_1_t * tf.gather(m, indices) + (1 - beta_1_t) * grad
m_update_kwargs = {
'resource': m.handle,
'indices': indices,
'updates': m_t_slice
}
m_update_op = tf.raw_ops.ResourceScatterUpdate(**m_update_kwargs)
# \\(v := beta2 * v + (1 - beta2) * (g_t * g_t)\\)
v = self.get_slot(var, 'v')
v_t_slice = (beta_2_t * tf.gather(v, indices) +
(1 - beta_2_t) * tf.math.square(grad))
v_update_kwargs = {
'resource': v.handle,
'indices': indices,
'updates': v_t_slice
}
v_update_op = tf.raw_ops.ResourceScatterUpdate(**v_update_kwargs)
# \\(variable -= learning_rate * m_t / (epsilon_t + sqrt(v_t))\\)
var_slice = lr * m_t_slice / (tf.math.sqrt(v_t_slice) + epsilon_t)
var_update_kwargs = {
'resource': var.handle,
'indices': indices,
'updates': var_slice
}
var_update_op = tf.raw_ops.ResourceScatterSub(**var_update_kwargs)
return tf.group(*[var_update_op, m_update_op, v_update_op])
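# Hedged usage sketch for LazyAdam above.  Gradients of a gather on an
# embedding-style table arrive as tf.IndexedSlices, so only the rows that
# appear in the batch are routed through _resource_apply_sparse; the shapes and
# values below are illustrative only.
toy_table = tf.Variable(tf.random.normal([100, 8]))
toy_opt = LazyAdam(learning_rate=1e-3)
with tf.GradientTape() as toy_tape:
  toy_rows = tf.gather(toy_table, [3, 7])          # only rows 3 and 7 participate
  toy_loss = tf.reduce_sum(tf.square(toy_rows))
toy_grad = toy_tape.gradient(toy_loss, toy_table)  # an IndexedSlices, not a dense tensor
toy_opt.apply_gradients([(toy_grad, toy_table)])   # "m"/"v" slots touched only at rows 3, 7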
class LearningRateFn(object):
"""Creates learning rate function."""
def __init__(self, learning_rate, hidden_size, warmup_steps):
self.learning_rate = learning_rate
self.hidden_size = hidden_size
self.warmup_steps = float(warmup_steps)
def __call__(self, global_step):
"""Calculate learning rate with linear warmup and rsqrt decay."""
step = float(global_step)
learning_rate = self.learning_rate
learning_rate *= (self.hidden_size ** -0.5)
# Apply linear warmup
learning_rate *= np.minimum(1.0, step / self.warmup_steps)
# Apply rsqrt decay
learning_rate /= np.sqrt(np.maximum(step, self.warmup_steps))
return learning_rate
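# Hedged numeric sketch of LearningRateFn above.  The effective rate is
#   learning_rate * hidden_size**-0.5 * min(1, step / warmup) / sqrt(max(step, warmup)),
# i.e. linear warmup up to warmup_steps, then 1/sqrt(step) decay.  The values
# below are illustrative only.
toy_lr_fn = LearningRateFn(learning_rate=2.0, hidden_size=512, warmup_steps=4000)
toy_lrs = [toy_lr_fn(s) for s in (1, 2000, 4000, 16000)]
# toy_lrs rises linearly until step 4000 and then decays as 1/sqrt(step); at
# step == warmup_steps the two regimes meet.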
class LearningRateScheduler(tf.keras.callbacks.Callback):
"""Keras callback to schedule learning rate.
TODO(tianlin): Refactor this scheduler and LearningRateBatchScheduler in
official/resnet/keras/keras_common.py.
"""
def __init__(self, schedule, init_steps=None, verbose=False):
super(LearningRateScheduler, self).__init__()
self.schedule = schedule
self.verbose = verbose
if init_steps is None:
init_steps = 0.0
self.steps = float(init_steps) # Total steps during training.
def on_epoch_begin(self, epoch, logs=None):
if not hasattr(self.model.optimizer, 'lr'):
raise ValueError('Optimizer must have a "lr" attribute.')
if not hasattr(self.model.optimizer, 'iterations'):
      raise ValueError('Optimizer must have an "iterations" attribute.')
def on_train_batch_begin(self, batch, logs=None):
if self.verbose > 0:
iterations = K.get_value(self.model.optimizer.iterations)
print('Original iteration %d' % iterations)
self.steps += 1.0
try: # new API
lr = float(K.get_value(self.model.optimizer.lr))
lr = self.schedule(self.steps, lr)
except TypeError: # Support for old API for backward compatibility
lr = self.schedule(self.steps)
if not isinstance(lr, (float, np.float32, np.float64)):
      raise ValueError('The output of the "schedule" function '
                       'should be a float.')
K.set_value(self.model.optimizer.lr, lr)
K.set_value(self.model.optimizer.iterations, self.steps)
if self.verbose > 0:
print('Batch %05d Step %05d: LearningRateScheduler setting learning '
'rate to %s.' % (batch + 1, self.steps, lr))
def on_epoch_end(self, epoch, logs=None):
logs = logs or {}
logs['lr'] = K.get_value(self.model.optimizer.lr)
logs['steps'] = self.steps
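if __name__ == "__main__":
  # Hedged end-to-end sketch (illustration only, run as a script): wiring
  # LearningRateFn into LearningRateScheduler for a Keras fit() loop, the same
  # pattern TransformerTask._create_callbacks uses later in this commit.  The
  # tiny Dense model and random data are placeholders, not Transformer code.
  toy_model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
  toy_model.compile(optimizer=LazyAdam(1e-3), loss="mse")
  toy_schedule = LearningRateFn(learning_rate=2.0, hidden_size=512, warmup_steps=100)
  toy_callback = LearningRateScheduler(toy_schedule, init_steps=0)
  toy_model.fit(tf.random.normal([32, 8]), tf.random.normal([32, 4]),
                batch_size=8, epochs=1, callbacks=[toy_callback])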
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines the Transformer model in TF 2.0.
Model paper: https://arxiv.org/pdf/1706.03762.pdf
Transformer model code source: https://github.com/tensorflow/tensor2tensor
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.model import beam_search
from official.transformer.model import model_utils
from official.transformer.utils.tokenizer import EOS_ID
from official.transformer.v2 import attention_layer
from official.transformer.v2 import embedding_layer
from official.transformer.v2 import ffn_layer
from official.transformer.v2 import metrics
def create_model(params, is_train):
"""Creates transformer model."""
with tf.name_scope("model"):
if is_train:
inputs = tf.keras.layers.Input((None,), dtype="int64", name="inputs")
targets = tf.keras.layers.Input((None,), dtype="int64", name="targets")
internal_model = Transformer(params, name="transformer_v2")
logits = internal_model([inputs, targets], training=is_train)
vocab_size = params["vocab_size"]
label_smoothing = params["label_smoothing"]
logits = metrics.MetricLayer(vocab_size)([logits, targets])
logits = metrics.LossLayer(vocab_size, label_smoothing)([logits, targets])
logits = tf.keras.layers.Lambda(lambda x: x, name="logits")(logits)
return tf.keras.Model([inputs, targets], logits)
else:
inputs = tf.keras.layers.Input((None,), dtype="int64", name="inputs")
internal_model = Transformer(params, name="transformer_v2")
ret = internal_model([inputs], training=is_train)
outputs, scores = ret["outputs"], ret["scores"]
return tf.keras.Model(inputs, [outputs, scores])
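# Hedged construction sketch, kept as comments so the module's import behavior
# is unchanged.  It mirrors transformer_test.py later in this commit
# (TINY_PARAMS is assumed to be a mutable mapping, as that test treats it):
#
#   from official.transformer.model import model_params
#   params = model_params.TINY_PARAMS
#   params["batch_size"] = params["default_batch_size"] = 16
#   params["hidden_size"], params["num_hidden_layers"] = 12, 2
#   params["filter_size"], params["num_heads"], params["vocab_size"] = 14, 2, 41
#   params["extra_decode_length"], params["beam_size"] = 2, 3
#   train_model = create_model(params, is_train=True)   # 2 int64 inputs, 1 logits output
#   infer_model = create_model(params, is_train=False)  # 1 input; [outputs, scores] outputs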
class Transformer(tf.keras.Model):
"""Transformer model with Keras.
Implemented as described in: https://arxiv.org/pdf/1706.03762.pdf
The Transformer model consists of an encoder and decoder. The input is an int
sequence (or a batch of sequences). The encoder produces a continuous
representation, and the decoder uses the encoder output to generate
probabilities for the output sequence.
"""
def __init__(self, params, name=None):
"""Initialize layers to build Transformer model.
Args:
params: hyperparameter object defining layer sizes, dropout values, etc.
name: name of the model.
"""
super(Transformer, self).__init__(name=name)
self.params = params
self.embedding_softmax_layer = embedding_layer.EmbeddingSharedWeights(
params["vocab_size"], params["hidden_size"])
self.encoder_stack = EncoderStack(params)
self.decoder_stack = DecoderStack(params)
def get_config(self):
return {
"params": self.params,
}
def call(self, x, training):
"""Calculate target logits or inferred target sequences.
Args:
x: input tensor list of size 1 or 2.
First item, inputs: int tensor with shape [batch_size, input_length].
Second item (optional), targets: None or int tensor with shape
[batch_size, target_length].
training: boolean, whether in training mode or not.
Returns:
      If targets is defined, then return logits for each word in the target
      sequence: a float tensor with shape [batch_size, target_length, vocab_size].
      If targets is None, then generate the output sequence one token at a time
      and return a dictionary {
        outputs: int tensor with shape [batch_size, decoded_length]
        scores: float tensor with shape [batch_size]}
"""
if len(x) == 2:
inputs, targets = x[0], x[1]
else:
inputs, targets = x[0], None
# Variance scaling is used here because it seems to work in many problems.
# Other reasonable initializers may also work just as well.
with tf.name_scope("Transformer"):
# Calculate attention bias for encoder self-attention and decoder
# multi-headed attention layers.
attention_bias = model_utils.get_padding_bias(inputs)
# Run the inputs through the encoder layer to map the symbol
# representations to continuous representations.
encoder_outputs = self.encode(inputs, attention_bias, training)
# Generate output sequence if targets is None, or return logits if target
# sequence is known.
if targets is None:
return self.predict(encoder_outputs, attention_bias, training)
else:
logits = self.decode(targets, encoder_outputs, attention_bias, training)
return logits
def encode(self, inputs, attention_bias, training):
"""Generate continuous representation for inputs.
Args:
inputs: int tensor with shape [batch_size, input_length].
attention_bias: float tensor with shape [batch_size, 1, 1, input_length].
training: boolean, whether in training mode or not.
Returns:
float tensor with shape [batch_size, input_length, hidden_size]
"""
with tf.name_scope("encode"):
# Prepare inputs to the layer stack by adding positional encodings and
# applying dropout.
embedded_inputs = self.embedding_softmax_layer(inputs)
inputs_padding = model_utils.get_padding(inputs)
with tf.name_scope("add_pos_encoding"):
length = tf.shape(embedded_inputs)[1]
pos_encoding = model_utils.get_position_encoding(
length, self.params["hidden_size"])
encoder_inputs = embedded_inputs + pos_encoding
if training:
encoder_inputs = tf.nn.dropout(
encoder_inputs, rate=self.params["layer_postprocess_dropout"])
return self.encoder_stack(
encoder_inputs, attention_bias, inputs_padding, training=training)
def decode(self, targets, encoder_outputs, attention_bias, training):
"""Generate logits for each value in the target sequence.
Args:
targets: target values for the output sequence. int tensor with shape
[batch_size, target_length]
encoder_outputs: continuous representation of input sequence. float tensor
with shape [batch_size, input_length, hidden_size]
attention_bias: float tensor with shape [batch_size, 1, 1, input_length]
training: boolean, whether in training mode or not.
Returns:
float32 tensor with shape [batch_size, target_length, vocab_size]
"""
with tf.name_scope("decode"):
# Prepare inputs to decoder layers by shifting targets, adding positional
# encoding and applying dropout.
decoder_inputs = self.embedding_softmax_layer(targets)
with tf.name_scope("shift_targets"):
# Shift targets to the right, and remove the last element
decoder_inputs = tf.pad(decoder_inputs,
[[0, 0], [1, 0], [0, 0]])[:, :-1, :]
with tf.name_scope("add_pos_encoding"):
length = tf.shape(decoder_inputs)[1]
decoder_inputs += model_utils.get_position_encoding(
length, self.params["hidden_size"])
if training:
decoder_inputs = tf.nn.dropout(
decoder_inputs, rate=self.params["layer_postprocess_dropout"])
# Run values
decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
length)
outputs = self.decoder_stack(
decoder_inputs,
encoder_outputs,
decoder_self_attention_bias,
attention_bias,
training=training)
logits = self.embedding_softmax_layer(outputs, mode="linear")
return logits
def _get_symbols_to_logits_fn(self, max_decode_length, training):
"""Returns a decoding function that calculates logits of the next tokens."""
timing_signal = model_utils.get_position_encoding(
max_decode_length + 1, self.params["hidden_size"])
decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
max_decode_length)
def symbols_to_logits_fn(ids, i, cache):
"""Generate logits for next potential IDs.
Args:
ids: Current decoded sequences. int tensor with shape [batch_size *
beam_size, i + 1]
i: Loop index
cache: dictionary of values storing the encoder output, encoder-decoder
attention bias, and previous decoder attention values.
Returns:
Tuple of
(logits with shape [batch_size * beam_size, vocab_size],
updated cache values)
"""
# Set decoder input to the last generated IDs
decoder_input = ids[:, -1:]
# Preprocess decoder input by getting embeddings and adding timing signal.
decoder_input = self.embedding_softmax_layer(decoder_input)
decoder_input += timing_signal[i:i + 1]
self_attention_bias = decoder_self_attention_bias[:, :, i:i + 1, :i + 1]
decoder_outputs = self.decoder_stack(
decoder_input,
cache.get("encoder_outputs"),
self_attention_bias,
cache.get("encoder_decoder_attention_bias"),
training=training,
cache=cache)
logits = self.embedding_softmax_layer(decoder_outputs, mode="linear")
logits = tf.squeeze(logits, axis=[1])
return logits, cache
return symbols_to_logits_fn
def predict(self, encoder_outputs, encoder_decoder_attention_bias, training):
"""Return predicted sequence."""
batch_size = tf.shape(encoder_outputs)[0]
input_length = tf.shape(encoder_outputs)[1]
max_decode_length = input_length + self.params["extra_decode_length"]
symbols_to_logits_fn = self._get_symbols_to_logits_fn(
max_decode_length, training)
# Create initial set of IDs that will be passed into symbols_to_logits_fn.
initial_ids = tf.zeros([batch_size], dtype=tf.int32)
# Create cache storing decoder attention values for each layer.
# pylint: disable=g-complex-comprehension
cache = {
"layer_%d" % layer: {
"k": tf.zeros([batch_size, 0, self.params["hidden_size"]]),
"v": tf.zeros([batch_size, 0, self.params["hidden_size"]])
} for layer in range(self.params["num_hidden_layers"])
}
# pylint: enable=g-complex-comprehension
# Add encoder output and attention bias to the cache.
cache["encoder_outputs"] = encoder_outputs
cache["encoder_decoder_attention_bias"] = encoder_decoder_attention_bias
# Use beam search to find the top beam_size sequences and scores.
decoded_ids, scores = beam_search.sequence_beam_search(
symbols_to_logits_fn=symbols_to_logits_fn,
initial_ids=initial_ids,
initial_cache=cache,
vocab_size=self.params["vocab_size"],
beam_size=self.params["beam_size"],
alpha=self.params["alpha"],
max_decode_length=max_decode_length,
eos_id=EOS_ID)
# Get the top sequence for each batch element
top_decoded_ids = decoded_ids[:, 0, 1:]
top_scores = scores[:, 0]
return {"outputs": top_decoded_ids, "scores": top_scores}
class LayerNormalization(tf.keras.layers.Layer):
"""Applies layer normalization."""
def __init__(self, hidden_size):
super(LayerNormalization, self).__init__()
self.hidden_size = hidden_size
def build(self, input_shape):
"""Builds the layer."""
self.scale = self.add_weight(
"layer_norm_scale",
shape=[self.hidden_size],
dtype="float32",
initializer=tf.ones_initializer())
self.bias = self.add_weight(
"layer_norm_bias",
shape=[self.hidden_size],
dtype="float32",
initializer=tf.zeros_initializer())
super(LayerNormalization, self).build(input_shape)
def get_config(self):
return {
"hidden_size": self.hidden_size,
}
def call(self, x, epsilon=1e-6):
mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
norm_x = (x - mean) * tf.math.rsqrt(variance + epsilon)
return norm_x * self.scale + self.bias
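# Hedged sanity-check sketch: with unit scale and zero bias, the hand-written
# LayerNormalization above should agree with tf.keras.layers.LayerNormalization
# on the last axis (same epsilon), assuming that layer exists in the installed
# TF build; agreement is only up to floating-point tolerance.
toy_x = tf.random.normal([2, 5, 16])
toy_manual_ln = LayerNormalization(hidden_size=16)
toy_keras_ln = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6)
toy_diff = tf.reduce_max(tf.abs(toy_manual_ln(toy_x) - toy_keras_ln(toy_x)))
# toy_diff is expected to be on the order of 1e-6.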
class PrePostProcessingWrapper(tf.keras.layers.Layer):
"""Wrapper class that applies layer pre-processing and post-processing."""
def __init__(self, layer, params):
super(PrePostProcessingWrapper, self).__init__()
self.layer = layer
self.params = params
self.postprocess_dropout = params["layer_postprocess_dropout"]
def build(self, input_shape):
# Create normalization layer
self.layer_norm = LayerNormalization(self.params["hidden_size"])
super(PrePostProcessingWrapper, self).build(input_shape)
def get_config(self):
return {
"params": self.params,
}
def call(self, x, *args, **kwargs):
"""Calls wrapped layer with same parameters."""
# Preprocessing: apply layer normalization
training = kwargs["training"]
y = self.layer_norm(x)
# Get layer output
y = self.layer(y, *args, **kwargs)
# Postprocessing: apply dropout and residual connection
if training:
y = tf.nn.dropout(y, rate=self.postprocess_dropout)
return x + y
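# Hedged usage sketch for PrePostProcessingWrapper above: any sublayer is
# applied as x + dropout(sublayer(layer_norm(x))) (pre-norm residual).  The toy
# params dict and the FeedForwardNetwork arguments (hidden_size, filter_size,
# relu_dropout) mirror how EncoderStack below constructs its sublayers.
toy_wrap_params = {"hidden_size": 16, "layer_postprocess_dropout": 0.1}
toy_ffn = ffn_layer.FeedForwardNetwork(16, 32, 0.1)
toy_wrapped = PrePostProcessingWrapper(toy_ffn, toy_wrap_params)
toy_wrap_out = toy_wrapped(tf.ones([2, 5, 16]), training=False)  # shape preserved: [2, 5, 16]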
class EncoderStack(tf.keras.layers.Layer):
"""Transformer encoder stack.
The encoder stack is made up of N identical layers. Each layer is composed
of the sublayers:
1. Self-attention layer
2. Feedforward network (which is 2 fully-connected layers)
"""
def __init__(self, params):
super(EncoderStack, self).__init__()
self.params = params
self.layers = []
def build(self, input_shape):
"""Builds the encoder stack."""
params = self.params
for _ in range(params["num_hidden_layers"]):
# Create sublayers for each layer.
self_attention_layer = attention_layer.SelfAttention(
params["hidden_size"], params["num_heads"],
params["attention_dropout"])
feed_forward_network = ffn_layer.FeedForwardNetwork(
params["hidden_size"], params["filter_size"], params["relu_dropout"])
self.layers.append([
PrePostProcessingWrapper(self_attention_layer, params),
PrePostProcessingWrapper(feed_forward_network, params)
])
# Create final layer normalization layer.
self.output_normalization = LayerNormalization(params["hidden_size"])
super(EncoderStack, self).build(input_shape)
def get_config(self):
return {
"params": self.params,
}
def call(self, encoder_inputs, attention_bias, inputs_padding, training):
"""Return the output of the encoder layer stacks.
Args:
encoder_inputs: tensor with shape [batch_size, input_length, hidden_size]
attention_bias: bias for the encoder self-attention layer. [batch_size, 1,
1, input_length]
      inputs_padding: float tensor with shape [batch_size, input_length], with 1
        at padded positions and 0 elsewhere.
training: boolean, whether in training mode or not.
Returns:
Output of encoder layer stack.
float32 tensor with shape [batch_size, input_length, hidden_size]
"""
for n, layer in enumerate(self.layers):
# Run inputs through the sublayers.
self_attention_layer = layer[0]
feed_forward_network = layer[1]
with tf.name_scope("layer_%d" % n):
with tf.name_scope("self_attention"):
encoder_inputs = self_attention_layer(
encoder_inputs, attention_bias, training=training)
with tf.name_scope("ffn"):
encoder_inputs = feed_forward_network(
encoder_inputs, training=training)
return self.output_normalization(encoder_inputs)
class DecoderStack(tf.keras.layers.Layer):
"""Transformer decoder stack.
Like the encoder stack, the decoder stack is made up of N identical layers.
Each layer is composed of the sublayers:
1. Self-attention layer
2. Multi-headed attention layer combining encoder outputs with results from
the previous self-attention layer.
3. Feedforward network (2 fully-connected layers)
"""
def __init__(self, params):
super(DecoderStack, self).__init__()
self.params = params
self.layers = []
def build(self, input_shape):
"""Builds the decoder stack."""
params = self.params
for _ in range(params["num_hidden_layers"]):
self_attention_layer = attention_layer.SelfAttention(
params["hidden_size"], params["num_heads"],
params["attention_dropout"])
enc_dec_attention_layer = attention_layer.Attention(
params["hidden_size"], params["num_heads"],
params["attention_dropout"])
feed_forward_network = ffn_layer.FeedForwardNetwork(
params["hidden_size"], params["filter_size"], params["relu_dropout"])
self.layers.append([
PrePostProcessingWrapper(self_attention_layer, params),
PrePostProcessingWrapper(enc_dec_attention_layer, params),
PrePostProcessingWrapper(feed_forward_network, params)
])
self.output_normalization = LayerNormalization(params["hidden_size"])
super(DecoderStack, self).build(input_shape)
def get_config(self):
return {
"params": self.params,
}
def call(self,
decoder_inputs,
encoder_outputs,
decoder_self_attention_bias,
attention_bias,
training,
cache=None):
"""Return the output of the decoder layer stacks.
Args:
decoder_inputs: tensor with shape [batch_size, target_length, hidden_size]
encoder_outputs: tensor with shape [batch_size, input_length, hidden_size]
decoder_self_attention_bias: bias for decoder self-attention layer. [1, 1,
        target_length, target_length]
attention_bias: bias for encoder-decoder attention layer. [batch_size, 1,
1, input_length]
training: boolean, whether in training mode or not.
cache: (Used for fast decoding) A nested dictionary storing previous
decoder self-attention values. The items are:
{layer_n: {"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]},
...}
Returns:
Output of decoder layer stack.
float32 tensor with shape [batch_size, target_length, hidden_size]
"""
for n, layer in enumerate(self.layers):
self_attention_layer = layer[0]
enc_dec_attention_layer = layer[1]
feed_forward_network = layer[2]
# Run inputs through the sublayers.
layer_name = "layer_%d" % n
layer_cache = cache[layer_name] if cache is not None else None
with tf.name_scope(layer_name):
with tf.name_scope("self_attention"):
decoder_inputs = self_attention_layer(
decoder_inputs,
decoder_self_attention_bias,
training=training,
cache=layer_cache)
with tf.name_scope("encdec_attention"):
decoder_inputs = enc_dec_attention_layer(
decoder_inputs,
encoder_outputs,
attention_bias,
training=training)
with tf.name_scope("ffn"):
decoder_inputs = feed_forward_network(
decoder_inputs, training=training)
return self.output_normalization(decoder_inputs)
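if __name__ == "__main__":
  # Hedged smoke-test sketch (illustration only, run as a script): a tiny
  # DecoderStack forward pass invoked the same way Transformer.decode does, to
  # show the expected tensor shapes.  Hyperparameter values are arbitrary.
  _toy_params = {
      "num_hidden_layers": 2, "hidden_size": 16, "num_heads": 2,
      "attention_dropout": 0.1, "filter_size": 32, "relu_dropout": 0.1,
      "layer_postprocess_dropout": 0.1,
  }
  _toy_decoder = DecoderStack(_toy_params)
  _toy_out = _toy_decoder(
      tf.ones([2, 3, 16]),                             # decoder_inputs
      tf.ones([2, 5, 16]),                             # encoder_outputs
      model_utils.get_decoder_self_attention_bias(3),  # [1, 1, 3, 3]
      tf.zeros([2, 1, 1, 5]),                          # encoder-decoder attention bias
      training=False)
  print(_toy_out.shape)                                # expected: (2, 3, 16)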
"""Tests for layers in Transformer."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.v2 import attention_layer
from official.transformer.v2 import embedding_layer
from official.transformer.v2 import ffn_layer
from official.transformer.v2 import metrics
class TransformerLayersTest(tf.test.TestCase):
def test_attention_layer(self):
hidden_size = 64
num_heads = 4
dropout = 0.5
layer = attention_layer.SelfAttention(hidden_size, num_heads, dropout)
self.assertDictEqual(layer.get_config(), {
"hidden_size": hidden_size,
"num_heads": num_heads,
"attention_dropout": dropout,
})
length = 2
x = tf.ones([1, length, hidden_size])
bias = tf.ones([1])
cache = {
"k": tf.zeros([1, 0, hidden_size]),
"v": tf.zeros([1, 0, hidden_size]),
}
y = layer(x, bias, training=True, cache=cache)
self.assertEqual(y.shape, (1, length, 64,))
self.assertEqual(cache["k"].shape, (1, length, 64,))
self.assertEqual(cache["v"].shape, (1, length, 64,))
def test_embedding_shared_weights(self):
vocab_size = 50
hidden_size = 64
length = 2
layer = embedding_layer.EmbeddingSharedWeights(vocab_size, hidden_size)
self.assertDictEqual(layer.get_config(), {
"vocab_size": 50,
"hidden_size": 64,
})
idx = tf.ones([1, length], dtype="int32")
y = layer(idx)
self.assertEqual(y.shape, (1, length, hidden_size,))
x = tf.ones([1, length, hidden_size])
output = layer(x, "linear")
self.assertEqual(output.shape, (1, length, vocab_size,))
def test_feed_forward_network(self):
hidden_size = 64
filter_size = 32
relu_dropout = 0.5
layer = ffn_layer.FeedForwardNetwork(hidden_size, filter_size, relu_dropout)
self.assertDictEqual(layer.get_config(), {
"hidden_size": hidden_size,
"filter_size": filter_size,
"relu_dropout": relu_dropout,
})
length = 2
x = tf.ones([1, length, hidden_size])
y = layer(x, training=True)
self.assertEqual(y.shape, (1, length, hidden_size,))
def test_metric_layer(self):
vocab_size = 50
logits = tf.keras.layers.Input((None, vocab_size),
dtype="float32",
name="logits")
targets = tf.keras.layers.Input((None,), dtype="int64", name="targets")
output_logits = metrics.MetricLayer(vocab_size)([logits, targets])
self.assertEqual(output_logits.shape.as_list(), [None, None, vocab_size,])
def test_loss_layer(self):
vocab_size, label_smoothing = 50, 0.1
logits = tf.keras.layers.Input((None, vocab_size),
dtype="float32",
name="logits")
targets = tf.keras.layers.Input((None,), dtype="int64", name="targets")
output_logits = metrics.LossLayer(vocab_size,
label_smoothing)([logits, targets])
self.assertEqual(output_logits.shape.as_list(), [None, None, vocab_size,])
if __name__ == "__main__":
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Train and evaluate the Transformer model.
See README for description of setting the training schedule and evaluating the
BLEU score.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import datetime
import os
import tempfile
# pylint: disable=g-bad-import-order
from absl import flags
import tensorflow as tf
# pylint: enable=g-bad-import-order
from official.transformer import compute_bleu
from official.transformer.utils import tokenizer
from official.transformer.v2 import data_pipeline
from official.transformer.v2 import misc
from official.transformer.v2 import optimizer
from official.transformer.v2 import transformer
from official.transformer.v2 import translate
from official.utils.flags import core as flags_core
from official.utils.logs import logger
INF = int(1e9)
BLEU_DIR = "bleu"
_SINGLE_SAMPLE = 1
def translate_and_compute_bleu(model, subtokenizer, bleu_source, bleu_ref):
"""Translate file and report the cased and uncased bleu scores."""
# Create temporary file to store translation.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp_filename = tmp.name
translate.translate_file(
model,
subtokenizer,
bleu_source,
output_file=tmp_filename,
print_all_translations=False)
  # Compute uncased and cased BLEU scores.
uncased_score = compute_bleu.bleu_wrapper(bleu_ref, tmp_filename, False)
cased_score = compute_bleu.bleu_wrapper(bleu_ref, tmp_filename, True)
os.remove(tmp_filename)
return uncased_score, cased_score
def evaluate_and_log_bleu(model, bleu_source, bleu_ref, vocab_file):
"""Calculate and record the BLEU score."""
subtokenizer = tokenizer.Subtokenizer(vocab_file)
uncased_score, cased_score = translate_and_compute_bleu(
model, subtokenizer, bleu_source, bleu_ref)
tf.compat.v1.logging.info("Bleu score (uncased): %s", uncased_score)
tf.compat.v1.logging.info("Bleu score (cased): %s", cased_score)
return uncased_score, cased_score
class TransformerTask(object):
"""Main entry of Transformer model."""
def __init__(self, flags_obj):
"""Init function of TransformerMain.
Args:
flags_obj: Object containing parsed flag values, i.e., FLAGS.
"""
self.flags_obj = flags_obj
# Add flag-defined parameters to params object
num_gpus = flags_core.get_num_gpus(flags_obj)
self.params = params = misc.get_model_params(flags_obj.param_set, num_gpus)
params["data_dir"] = flags_obj.data_dir
params["model_dir"] = flags_obj.model_dir
params["num_parallel_calls"] = (
flags_obj.num_parallel_calls or tf.data.experimental.AUTOTUNE)
params["use_synthetic_data"] = flags_obj.use_synthetic_data
params["batch_size"] = flags_obj.batch_size or params["default_batch_size"]
params["repeat_dataset"] = None
def train(self):
"""Trains the model."""
params, flags_obj, is_train = self.params, self.flags_obj, True
model = transformer.create_model(params, is_train)
opt = self._create_optimizer()
model.compile(opt, target_tensors=[])
model.summary()
self._load_weights_if_possible(model, flags_obj.init_weight_path)
cur_log_dir = _get_log_dir_or_default(flags_obj)
_ensure_dir(cur_log_dir)
map_data_fn = data_pipeline.map_data_for_transformer_fn
train_ds = data_pipeline.train_input_fn(params)
train_ds = train_ds.map(
map_data_fn, num_parallel_calls=params["num_parallel_calls"])
valid_ds = data_pipeline.eval_input_fn(params)
valid_ds = valid_ds.map(
map_data_fn, num_parallel_calls=params["num_parallel_calls"])
init_epoch = flags_obj.init_epoch or 0
init_steps = init_epoch * flags_obj.steps_per_epoch
callbacks = self._create_callbacks(cur_log_dir, init_steps, params)
history = model.fit(
train_ds,
initial_epoch=init_epoch,
epochs=flags_obj.train_epochs,
steps_per_epoch=flags_obj.steps_per_epoch,
validation_data=valid_ds,
validation_steps=flags_obj.validation_steps,
callbacks=callbacks)
tf.compat.v1.logging.info("\nTrain history: {}".format(history.history))
save_weight_path = os.path.join(cur_log_dir, "saves-model-weights.hdf5")
save_model_path = os.path.join(cur_log_dir, "saves-model.hdf5")
model.save_weights(save_weight_path)
model.save(save_model_path)
def eval(self):
"""Evaluates the model."""
params, flags_obj, is_train = self.params, self.flags_obj, False
with tf.name_scope("model"):
model = transformer.create_model(params, is_train)
self._load_weights_if_possible(model, flags_obj.init_weight_path)
model.summary()
evaluate_and_log_bleu(model, flags_obj.bleu_source, flags_obj.bleu_ref,
flags_obj.vocab_file)
def predict(self):
"""Predicts result from the model."""
params, flags_obj, is_train = self.params, self.flags_obj, False
with tf.name_scope("model"):
model = transformer.create_model(params, is_train)
self._load_weights_if_possible(model, flags_obj.init_weight_path)
model.summary()
subtokenizer = tokenizer.Subtokenizer(flags_obj.vocab_file)
ds = data_pipeline.eval_input_fn(params)
ds = ds.map(lambda x, y: x).take(_SINGLE_SAMPLE)
ret = model.predict(ds)
val_outputs, _ = ret
length = len(val_outputs)
for i in range(length):
translate.translate_from_input(val_outputs[i], subtokenizer)
def _create_callbacks(self, cur_log_dir, init_steps, params):
"""Creates a list of callbacks."""
sfunc = optimizer.LearningRateFn(params["learning_rate"],
params["hidden_size"],
params["learning_rate_warmup_steps"])
scheduler_callback = optimizer.LearningRateScheduler(sfunc, init_steps)
tb_logdir = os.path.join(cur_log_dir, "logs")
save_path = os.path.join(cur_log_dir,
"weights-epoch-{epoch:02d}-loss-{loss:.4f}.hdf5")
csv_path = os.path.join(cur_log_dir, "result.csv")
return [
scheduler_callback,
tf.keras.callbacks.TensorBoard(tb_logdir),
tf.keras.callbacks.ModelCheckpoint(save_path, save_weights_only=True),
tf.keras.callbacks.CSVLogger(csv_path, append=True),
]
def _load_weights_if_possible(self, model, init_weight_path=None):
"""Loads model weights when it is provided."""
if init_weight_path:
tf.compat.v1.logging.info("Load weights: {}".format(init_weight_path))
model.load_weights(init_weight_path, by_name=True)
def _create_optimizer(self):
"""Creates optimizer."""
params = self.params
opt = optimizer.LazyAdam(
params["learning_rate"],
params["optimizer_adam_beta1"],
params["optimizer_adam_beta2"],
epsilon=params["optimizer_adam_epsilon"])
return opt
def _get_log_dir_or_default(flags_obj):
"""Gets init_logdir_timestamp if it is given, otherwise use current time."""
if flags_obj.init_logdir_timestamp is not None:
timestamp = flags_obj.init_logdir_timestamp
else:
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
return os.path.join(flags_obj.model_dir, timestamp)
def _ensure_dir(log_dir):
"""Makes log dir if not existed."""
if not os.path.exists(log_dir):
os.makedirs(log_dir)
def main(_):
flags_obj = flags.FLAGS
with logger.benchmark_context(flags_obj):
task = TransformerTask(flags_obj)
if flags_obj.mode == "train":
task.train()
elif flags_obj.mode == "predict":
task.predict()
elif flags_obj.mode == "eval":
task.eval()
else:
raise ValueError("Invalid mode {}".format(flags_obj.mode))
if __name__ == "__main__":
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
misc.define_transformer_flags()
tf.compat.v1.app.run(main)
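# Hedged invocation sketch, kept as comments.  Flag names are taken from the
# flag definitions and tests in this commit; paths and values are placeholders:
#
#   python -m official.transformer.v2.transformer_main \
#       --mode=train --param_set=tiny --use_synthetic_data=true \
#       --train_epochs=1 --steps_per_epoch=1 --validation_steps=1 \
#       --batch_size=8 --model_dir=/tmp/transformer_model
#
#   # BLEU evaluation additionally needs --vocab_file, --bleu_source and
#   # --bleu_ref (enforced by the validators in misc.py):
#   python -m official.transformer.v2.transformer_main \
#       --mode=eval --param_set=tiny --vocab_file=/tmp/translate_ende/vocab \
#       --bleu_source=newstest.src --bleu_ref=newstest.ref \
#       --init_weight_path=/tmp/transformer_model/<timestamp>/saves-model-weights.hdf5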
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test Transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import re
from absl import flags
import tensorflow as tf
from tensorflow.python.framework import test_util
from official.transformer.v2 import misc
from official.transformer.v2 import transformer_main as tm
FLAGS = flags.FLAGS
FIXED_TIMESTAMP = "my_time_stamp"
WEIGHT_PATTERN = re.compile(r"weights-epoch-.+\.hdf5")
def _generate_file(filepath, lines):
with open(filepath, "w") as f:
for l in lines:
f.write("{}\n".format(l))
class TransformerTaskTest(tf.test.TestCase):
def setUp(self):
temp_dir = self.get_temp_dir()
FLAGS.model_dir = temp_dir
FLAGS.init_logdir_timestamp = FIXED_TIMESTAMP
FLAGS.param_set = param_set = "tiny"
FLAGS.use_synthetic_data = True
FLAGS.steps_per_epoch = 1
FLAGS.validation_steps = 1
FLAGS.train_epochs = 1
FLAGS.batch_size = 8
FLAGS.init_weight_path = None
self.cur_log_dir = os.path.join(temp_dir, FIXED_TIMESTAMP)
self.vocab_file = os.path.join(self.cur_log_dir, "vocab")
self.vocab_size = misc.get_model_params(param_set, 0)["vocab_size"]
self.bleu_source = os.path.join(self.cur_log_dir, "bleu_source")
self.bleu_ref = os.path.join(self.cur_log_dir, "bleu_ref")
self.flags_file = os.path.join(self.cur_log_dir, "flags")
def _assert_exists(self, filepath):
self.assertTrue(os.path.exists(filepath))
def test_train(self):
t = tm.TransformerTask(FLAGS)
t.train()
# Test model dir.
self._assert_exists(self.cur_log_dir)
# Test saving models.
self._assert_exists(
os.path.join(self.cur_log_dir, "saves-model-weights.hdf5"))
self._assert_exists(os.path.join(self.cur_log_dir, "saves-model.hdf5"))
# Test callbacks:
# TensorBoard file.
self._assert_exists(os.path.join(self.cur_log_dir, "logs"))
# CSVLogger file.
self._assert_exists(os.path.join(self.cur_log_dir, "result.csv"))
# Checkpoint file.
filenames = os.listdir(self.cur_log_dir)
matched_weight_file = any([WEIGHT_PATTERN.match(f) for f in filenames])
self.assertTrue(matched_weight_file)
def _prepare_files_and_flags(self, *extra_flags):
# Make log dir.
if not os.path.exists(self.cur_log_dir):
os.makedirs(self.cur_log_dir)
# Fake vocab, bleu_source and bleu_ref.
tokens = [
"'<pad>'", "'<EOS>'", "'_'", "'a'", "'b'", "'c'", "'d'", "'a_'", "'b_'",
"'c_'", "'d_'"
]
tokens += ["'{}'".format(i) for i in range(self.vocab_size - len(tokens))]
_generate_file(self.vocab_file, tokens)
_generate_file(self.bleu_source, ["a b", "c d"])
_generate_file(self.bleu_ref, ["a b", "d c"])
# Update flags.
update_flags = [
"ignored_program_name",
"--vocab_file={}".format(self.vocab_file),
"--bleu_source={}".format(self.bleu_source),
"--bleu_ref={}".format(self.bleu_ref),
]
if extra_flags:
update_flags.extend(extra_flags)
FLAGS(update_flags)
@test_util.run_v1_only("V1 should work. Issue: V2 w/ graph transformed.")
def test_predict(self):
self._prepare_files_and_flags()
t = tm.TransformerTask(FLAGS)
t.predict()
@test_util.run_v1_only("V1 should work. Issue: V2 w/ graph transformed.")
def test_eval(self):
self._prepare_files_and_flags()
t = tm.TransformerTask(FLAGS)
t.eval()
if __name__ == "__main__":
misc.define_transformer_flags()
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test Transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.model import model_params
from official.transformer.v2 import transformer
class TransformerV2Test(tf.test.TestCase):
def setUp(self):
self.params = params = model_params.TINY_PARAMS
params["batch_size"] = params["default_batch_size"] = 16
params["use_synthetic_data"] = True
params["hidden_size"] = 12
params["num_hidden_layers"] = 2
params["filter_size"] = 14
params["num_heads"] = 2
params["vocab_size"] = 41
params["extra_decode_length"] = 2
params["beam_size"] = 3
def test_create_model_train(self):
model = transformer.create_model(self.params, True)
inputs, outputs = model.inputs, model.outputs
self.assertEqual(len(inputs), 2)
self.assertEqual(len(outputs), 1)
self.assertEqual(inputs[0].shape.as_list(), [None, None])
self.assertEqual(inputs[0].dtype, tf.int64)
self.assertEqual(inputs[1].shape.as_list(), [None, None])
self.assertEqual(inputs[1].dtype, tf.int64)
self.assertEqual(outputs[0].shape.as_list(), [None, None, 41])
self.assertEqual(outputs[0].dtype, tf.float32)
def test_create_model_not_train(self):
model = transformer.create_model(self.params, False)
inputs, outputs = model.inputs, model.outputs
self.assertEqual(len(inputs), 1)
self.assertEqual(len(outputs), 2)
self.assertEqual(inputs[0].shape.as_list(), [None, None])
self.assertEqual(inputs[0].dtype, tf.int64)
self.assertEqual(outputs[0].shape.as_list(), [None, None])
self.assertEqual(outputs[0].dtype, tf.int32)
self.assertEqual(outputs[1].shape.as_list(), [None])
self.assertEqual(outputs[1].dtype, tf.float32)
if __name__ == "__main__":
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Translate text or files using trained transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from official.transformer.utils import tokenizer
_DECODE_BATCH_SIZE = 32
_EXTRA_DECODE_LENGTH = 100
_BEAM_SIZE = 4
_ALPHA = 0.6
def _get_sorted_inputs(filename):
"""Read and sort lines from the file sorted by decreasing length.
Args:
filename: String name of file to read inputs from.
Returns:
    Sorted list of inputs, and a list mapping each element's original index to
    its position in the sorted list.
"""
with tf.io.gfile.GFile(filename) as f:
records = f.read().split("\n")
inputs = [record.strip() for record in records]
if not inputs[-1]:
inputs.pop()
input_lens = [(i, len(line.split())) for i, line in enumerate(inputs)]
sorted_input_lens = sorted(input_lens, key=lambda x: x[1], reverse=True)
sorted_inputs = [None] * len(sorted_input_lens)
sorted_keys = [0] * len(sorted_input_lens)
for i, (index, _) in enumerate(sorted_input_lens):
sorted_inputs[i] = inputs[index]
sorted_keys[index] = i
return sorted_inputs, sorted_keys
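# Hedged micro-example of the index bookkeeping above (pure Python, toy data):
# translations are produced in length-descending order and then written back in
# the original order via sorted_keys, exactly as translate_file does below.
_toy_inputs = ["a b c", "x", "d e"]
_toy_sorted_inputs = ["a b c", "d e", "x"]   # what _get_sorted_inputs would return
_toy_sorted_keys = [0, 2, 1]                 # original index -> position in the sorted list
_toy_translations = ["T(" + s + ")" for s in _toy_sorted_inputs]
_toy_restored = [_toy_translations[k] for k in _toy_sorted_keys]
# _toy_restored == ["T(a b c)", "T(x)", "T(d e)"], i.e. back in the original input order.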
def _encode_and_add_eos(line, subtokenizer):
"""Encode line with subtokenizer, and add EOS id to the end."""
return subtokenizer.encode(line) + [tokenizer.EOS_ID]
def _trim_and_decode(ids, subtokenizer):
"""Trim EOS and PAD tokens from ids, and decode to return a string."""
try:
index = list(ids).index(tokenizer.EOS_ID)
return subtokenizer.decode(ids[:index])
except ValueError: # No EOS found in sequence
return subtokenizer.decode(ids)
def translate_file(
model, subtokenizer, input_file, output_file=None,
print_all_translations=True):
"""Translate lines in file, and save to output file if specified.
Args:
model: Keras model used to generate the translations.
subtokenizer: Subtokenizer object for encoding and decoding source and
translated lines.
input_file: file containing lines to translate
output_file: file that stores the generated translations.
print_all_translations: If true, all translations are printed to stdout.
Raises:
ValueError: if output file is invalid.
"""
batch_size = _DECODE_BATCH_SIZE
# Read and sort inputs by length. Keep dictionary (original index-->new index
# in sorted list) to write translations in the original order.
sorted_inputs, sorted_keys = _get_sorted_inputs(input_file)
total_samples = len(sorted_inputs)
num_decode_batches = (total_samples - 1) // batch_size + 1
def input_generator():
"""Yield encoded strings from sorted_inputs."""
for i in range(num_decode_batches):
lines = [
sorted_inputs[j + i * batch_size]
for j in range(batch_size)
if j + i * batch_size < total_samples
]
lines = [_encode_and_add_eos(l, subtokenizer) for l in lines]
batch = tf.keras.preprocessing.sequence.pad_sequences(
lines, dtype="int64", padding="post")
tf.compat.v1.logging.info("Decoding batch %d out of %d.", i,
num_decode_batches)
yield batch
translations = []
for i, text in enumerate(input_generator()):
val_outputs, _ = model.predict(text)
length = len(val_outputs)
for j in range(length):
translation = _trim_and_decode(val_outputs[j], subtokenizer)
translations.append(translation)
if print_all_translations:
tf.compat.v1.logging.info(
"Translating:\n\tInput: %s\n\tOutput: %s" %
(sorted_inputs[j + i * batch_size], translation))
# Write translations in the order they appeared in the original file.
if output_file is not None:
if tf.io.gfile.isdir(output_file):
raise ValueError("File output is a directory, will not save outputs to "
"file.")
tf.compat.v1.logging.info("Writing to file %s" % output_file)
with tf.compat.v1.gfile.Open(output_file, "w") as f:
for i in sorted_keys:
f.write("%s\n" % translations[i])
def translate_from_text(model, subtokenizer, txt):
encoded_txt = _encode_and_add_eos(txt, subtokenizer)
result = model.predict(encoded_txt)
outputs = result["outputs"]
tf.compat.v1.logging.info("Original: \"%s\"" % txt)
translate_from_input(outputs, subtokenizer)
def translate_from_input(outputs, subtokenizer):
translation = _trim_and_decode(outputs, subtokenizer)
tf.compat.v1.logging.info("Translation: \"%s\"" % translation)