"vscode:/vscode.git/clone" did not exist on "43f5f42759b4226919761b635489ba2abd1b0ff9"
Unverified Commit 3fca8afe authored by Katherine Wu's avatar Katherine Wu Committed by GitHub
Browse files

Add transformer model (#4148)

parent dea7ecf6
# Transformer Translation Model
This is an implementation of the Transformer translation model as described in the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. Based on the code provided by the authors: [Transformer code](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py) from [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor).
Transformer is a neural network architecture that solves sequence-to-sequence problems using attention mechanisms. Unlike traditional neural seq2seq models, Transformer does not involve recurrent connections. The attention mechanism learns dependencies between tokens in two sequences. Since attention weights apply to all tokens in the sequences, the Transformer model is able to easily capture long-distance dependencies.
Transformer's overall structure follows the standard encoder-decoder pattern. The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated tokens as inputs.
The model also applies embeddings on the input and output tokens, and adds a constant positional encoding. The positional encoding adds information about the position of each token.
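For intuition, here is a minimal NumPy sketch of the sinusoidal encoding described in the paper (the model's actual TensorFlow implementation lives in [model_utils.py](model/model_utils.py); `position_encoding` is an illustrative name, not part of the codebase):
```
import numpy as np

def position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
  # Sinusoids with geometrically increasing wavelengths; each position gets a
  # unique pattern across the hidden dimensions.
  position = np.arange(length, dtype=np.float32)
  num_timescales = hidden_size // 2
  log_increment = np.log(max_timescale / min_timescale) / (num_timescales - 1)
  inv_timescales = min_timescale * np.exp(
      np.arange(num_timescales, dtype=np.float32) * -log_increment)
  scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
  # First half of the channels use sine, the second half cosine.
  return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

print(position_encoding(64, 512).shape)  # (64, 512)
```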
## Contents
* [Contents](#contents)
* [Walkthrough](#walkthrough)
* [Benchmarks](#benchmarks)
* [Training times](#training-times)
* [Evaluation results](#evaluation-results)
* [Detailed instructions](#detailed-instructions)
* [Export variables (optional)](#export-variables-optional)
* [Download and preprocess datasets](#download-and-preprocess-datasets)
* [Model training and evaluation](#model-training-and-evaluation)
* [Translate using the model](#translate-using-the-model)
* [Compute official BLEU score](#compute-official-bleu-score)
* [Implementation overview](#implementation-overview)
* [Model Definition](#model-definition)
* [Model Estimator](#model-estimator)
* [Other scripts](#other-scripts)
* [Test dataset](#test-dataset)
* [Term definitions](#term-definitions)
## Walkthrough
Below are the commands for running the Transformer model. See the [Detailed instructions](#detailed-instructions) for more details on running the model.
```
PARAMS=big
DATA_DIR=$HOME/transformer/data
MODEL_DIR=$HOME/transformer/model_$PARAMS
# Download training/evaluation datasets
python data_download.py --data_dir=$DATA_DIR
# Train the model for 10 epochs, and evaluate after every epoch.
python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
--params=$PARAMS --bleu_source=test_data/newstest2014.en --bleu_ref=test_data/newstest2014.de
# Run during training in a separate process to get continuous updates,
# or after training is complete.
tensorboard --logdir=$MODEL_DIR
# Translate some text using the trained model
python translate.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
--params=$PARAMS --text="hello world"
# Compute model's BLEU score using the newstest2014 dataset.
python translate.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
--params=$PARAMS --file=test_data/newstest2014.en --file_out=translation.en
python compute_bleu.py --translation=translation.en --reference=test_data/newstest2014.de
```
## Benchmarks
### Training times
Currently, both big and base params run on a single GPU. The measurements below
are reported from running the model on a P100 GPU.
Params | batches/sec | batches per epoch | time per epoch
--- | --- | --- | ---
base | 4.8 | 83244 | 4 hr
big | 1.1 | 41365 | 10 hr
### Evaluation results
Below are the case-insensitive BLEU scores after 10 epochs.
Params | Score
--- | ---
base | 27.7
big | 28.9
## Detailed instructions
0. ### Export variables (optional)
Export the following variables, or modify the values in each of the snippets below:
```
PARAMS=big
DATA_DIR=$HOME/transformer/data
MODEL_DIR=$HOME/transformer/model_$PARAMS
```
1. ### Download and preprocess datasets
[data_download.py](data_download.py) downloads and preprocesses the training and evaluation WMT datasets. After the data is downloaded and extracted, the training data is used to generate a vocabulary of subtokens. The evaluation and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords.
1.75GB of compressed data will be downloaded. In total, the raw files (compressed, extracted, and combined files) take up 8.4GB of disk space. The resulting TFRecord and vocabulary files are 722MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.
Command to run:
```
python data_download.py --data_dir=$DATA_DIR
```
Arguments:
    * `--data_dir`: Path where the preprocessed TFRecord data and vocab file will be saved.
* Use the `--help` or `-h` flag to get a full list of possible arguments.
2. ### Model training and evaluation
    [transformer_main.py](transformer_main.py) creates a Transformer model, and trains it using the TensorFlow `Estimator`.
Command to run:
```
python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --params=$PARAMS
```
Arguments:
    * `--data_dir`: This should be set to the same directory given to `data_download.py`'s `data_dir` argument.
* `--model_dir`: Directory to save Transformer model training checkpoints.
* `--params`: Parameter set to use when creating and training the model. Options are `base` and `big` (default).
* Use the `--help` or `-h` flag to get a full list of possible arguments.
#### Customizing training schedule
By default, the model will train for 10 epochs, and evaluate after every epoch. The training schedule may be defined through the flags:
* Training with epochs (default):
        * `--train_epochs`: The total number of complete passes to make through the dataset.
* `--epochs_between_eval`: The number of epochs to train between evaluations.
* Training with steps:
        * `--train_steps`: The total number of training steps to run.
* `--steps_between_eval`: Number of training steps to run between evaluations.
Only one of `train_epochs` or `train_steps` may be set. Since the default option is to evaluate the model after training for an epoch, it may take 4 or more hours between model evaluations. To get more frequent evaluations, use the flags `--train_steps=250000 --steps_between_eval=1000`.
    Note: At the beginning of each training session, the training dataset is reloaded and shuffled. Stopping the training before completing an epoch may result in worse model quality, since some examples may be seen more often than others. Therefore, it is recommended to use epochs when model quality is important.
#### Compute BLEU score during model evaluation
    Use these flags to compute the BLEU score when the model evaluates:
* `--bleu_source`: Path to file containing text to translate.
* `--bleu_ref`: Path to file containing the reference translation.
* `--bleu_threshold`: Train until the BLEU score reaches this lower bound. This setting overrides the `--train_steps` and `--train_epochs` flags.
The test source and reference files located in the `test_data` directory are extracted from the preprocessed dataset from the [NMT Seq2Seq tutorial](https://google.github.io/seq2seq/nmt/#download-data).
When running `transformer_main.py`, use the flags: `--bleu_source=test_data/newstest2014.en --bleu_ref=test_data/newstest2014.de`
    #### TensorBoard
    Training and evaluation metrics (loss, accuracy, approximate BLEU score, etc.) are logged, and can be displayed in the browser using TensorBoard.
```
tensorboard --logdir=$MODEL_DIR
```
    The values are displayed at [localhost:6006](http://localhost:6006).
3. ### Translate using the model
    [translate.py](translate.py) contains the script that uses the trained model to translate input text or a file. Each line in the file is translated separately.
Command to run:
```
python translate.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --params=$PARAMS --text="hello world"
```
Arguments for initializing the Subtokenizer and trained model:
* `--data_dir`: Used to locate the vocabulary file to create a Subtokenizer, which encodes the input and decodes the model output.
    * `--model_dir` and `--params`: These parameters are used to rebuild the trained model.
Arguments for specifying what to translate:
    * `--text`: Text to translate.
    * `--file`: Path to a file containing text to translate.
* `--file_out`: If `--file` is set, then this file will store the input file's translations.
To translate the newstest2014 data, run:
```
python translate.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
--params=$PARAMS --file=test_data/newstest2014.en --file_out=translation.en
```
    Translating the file takes around 15 minutes on a GTX 1080, or 5 minutes on a P100.
4. ### Compute official BLEU score
    Use [compute_bleu.py](compute_bleu.py) to compute the BLEU score by comparing generated translations to the reference translation.
Command to run:
```
python compute_bleu.py --translation=translation.en --reference=test_data/newstest2014.de
```
Arguments:
* `--translation`: Path to file containing generated translations.
* `--reference`: Path to file containing reference translations.
* Use the `--help` or `-h` flag to get a full list of possible arguments.
## Implementation overview
A brief look at each component in the code:
### Model Definition
The [model](model) subdirectory contains the implementation of the Transformer model. The following files define the Transformer model and its layers:
* [transformer.py](model/transformer.py): Defines the transformer model and its encoder/decoder layer stacks.
* [embedding_layer.py](model/embedding_layer.py): Contains the layer that calculates the embeddings. The embedding weights are also used to calculate the pre-softmax probabilities from the decoder output.
* [attention_layer.py](model/attention_layer.py): Defines the multi-headed attention and self-attention layers that are used in the encoder/decoder stacks (a minimal sketch of the core computation appears after these lists).
* [ffn_layer.py](model/ffn_layer.py): Defines the feedforward network that is used in the encoder/decoder stacks. The network is composed of 2 fully connected layers.
Other files:
* [beam_search.py](model/beam_search.py) contains the beam search implementation, which is used during model inference to find high scoring translations.
* [model_params.py](model/model_params.py) contains the parameters used for the big and base models.
* [model_utils.py](model/model_utils.py) defines some helper functions used in the model (calculating padding, bias, etc.).
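As referenced in the list above, the core computation inside the attention layers is scaled dot-product attention. A minimal single-head NumPy sketch of that computation (head splitting, caching, and dropout are omitted, and the function name is illustrative):
```
import numpy as np

def scaled_dot_product_attention(q, k, v, bias=0.0):
  # q, k, v: [batch, length, depth]; bias masks out illegal positions.
  depth = q.shape[-1]
  logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(depth) + bias
  logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
  weights = np.exp(logits)
  weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
  return weights @ v                              # weighted sum of the values

q = k = v = np.random.rand(2, 5, 8).astype(np.float32)
print(scaled_dot_product_attention(q, k, v).shape)  # (2, 5, 8)
```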
### Model Estimator
[transformer_main.py](transformer_main.py) creates an `Estimator` to train and evaluate the model.
Helper functions:
* [utils/dataset.py](utils/dataset.py): contains functions for creating a `dataset` that is passed to the `Estimator`.
* [utils/metrics.py](utils/metrics.py): defines metrics functions used by the `Estimator` to evaluate the model.
### Other scripts
Aside from the main file to train the Transformer model, we provide other scripts for using the model or downloading the data:
#### Data download and preprocessing
[data_download.py](data_download.py) downloads and extracts data, then uses `Subtokenizer` to tokenize strings into arrays of int IDs. The int arrays are converted to `tf.Example`s and saved in the TFRecord format.
The data is downloaded from the Workshop on Machine Translation (WMT) [news translation task](http://www.statmt.org/wmt17/translation-task.html). The following datasets are used:
* Europarl v7
* Common Crawl corpus
* News Commentary v12
See the [download section](http://www.statmt.org/wmt17/translation-task.html#download) to explore the raw datasets. The parameters in this model are tuned to fit the English-German translation data, so the EN-DE texts are extracted from the downloaded compressed files.
The text is transformed into arrays of integer IDs using the `Subtokenizer` defined in [`utils/tokenizer.py`](utils/tokenizer.py). During initialization of the `Subtokenizer`, the raw training data is used to generate a vocabulary list containing common subtokens.
The target vocabulary size of the WMT dataset is 32,768. The set of subtokens is found through binary search on the minimum number of times a subtoken appears in the data. The actual vocabulary size is 33,708, and is stored in a 324kB file.
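The search itself is a standard binary search over the minimum count. A sketch of the idea, where `vocab_size_for` stands in for a hypothetical routine that builds a vocabulary with the given minimum count and reports its size:
```
def find_min_count(vocab_size_for, target_size, threshold, lo=1, hi=1000):
  # A larger min_count admits fewer subtokens, so the vocabulary size
  # decreases monotonically as min_count grows.
  while lo < hi:
    mid = (lo + hi) // 2
    size = vocab_size_for(mid)
    if abs(size - target_size) <= threshold:
      return mid
    if size > target_size:
      lo = mid + 1   # too many subtokens: require more occurrences
    else:
      hi = mid - 1   # too few subtokens: relax the requirement
  return lo

# Toy monotone curve standing in for real vocabulary generation:
print(find_min_count(lambda m: 100000 // m, 32768, 327))  # 4 for this toy curve
```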
#### Translation
Translation is defined in [translate.py](translate.py). First, `Subtokenizer` tokenizes the input, using the same vocabulary file that was used to tokenize the training/eval files. Next, beam search finds the combination of tokens that maximizes the probability output by the model decoder. The tokens are then converted back to strings with `Subtokenizer`.
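In outline (the names below are illustrative, not the script's actual API):
```
# Hypothetical sketch of the per-line flow in translate.py.
def translate_line(line, subtokenizer, run_beam_search):
  ids = subtokenizer.encode(line, add_eos=True)  # text -> subtoken IDs
  output_ids = run_beam_search(ids)              # decoder + beam search
  return subtokenizer.decode(output_ids)         # subtoken IDs -> text
```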
#### BLEU computation
[compute_bleu.py](compute_bleu.py): Implementation from [https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py).
### Test dataset
The [newstest2014 files](test_data) are extracted from the [NMT Seq2Seq tutorial](https://google.github.io/seq2seq/nmt/#download-data). The raw text files are converted from the SGM format of the [WMT 2016](http://www.statmt.org/wmt16/translation-task.html) test sets.
## Term definitions
**Steps / Epochs**:
* Step: unit for processing a single batch of data
* Epoch: a complete run through the dataset
Example: Consider training a model on a dataset with 100 examples that is divided into 20 batches, with 5 examples per batch. A single training step trains the model on one batch. After 20 training steps, the model will have trained on every batch in the dataset, or one epoch.
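The same arithmetic in code:
```
examples, batch_size = 100, 5
steps_per_epoch = examples // batch_size  # 20 steps make up one epoch
print(steps_per_epoch * 10)               # 200 steps = 10 epochs
```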
**Subtoken**: Words are referred to as tokens, and parts of words are referred to as 'subtokens'. For example, the word 'inclined' may be split into `['incline', 'd_']`. The '\_' indicates the end of the token. The subtoken vocabulary list is guaranteed to contain the alphabet (including numbers and special characters), so all words can be tokenized.
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Script to compute official BLEU score.
Source:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import re
import sys
import unicodedata
# pylint: disable=g-bad-import-order
import six
import tensorflow as tf
# pylint: enable=g-bad-import-order
from official.transformer.utils import metrics
class UnicodeRegex(object):
"""Ad-hoc hack to recognize all punctuation and symbols."""
def __init__(self):
punctuation = self.property_chars("P")
self.nondigit_punct_re = re.compile(r"([^\d])([" + punctuation + r"])")
self.punct_nondigit_re = re.compile(r"([" + punctuation + r"])([^\d])")
self.symbol_re = re.compile("([" + self.property_chars("S") + "])")
def property_chars(self, prefix):
return "".join(six.unichr(x) for x in range(sys.maxunicode)
if unicodedata.category(six.unichr(x)).startswith(prefix))
uregex = UnicodeRegex()
def bleu_tokenize(string):
r"""Tokenize a string following the official BLEU implementation.
  See https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L954-L983
In our case, the input string is expected to be just one line
and no HTML entities de-escaping is needed.
So we just tokenize on punctuation and symbols,
except when a punctuation is preceded and followed by a digit
(e.g. a comma/dot as a thousand/decimal separator).
  Note that a number (e.g. a year) followed by a dot at the end of a sentence
is NOT tokenized,
i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g`
does not match this case (unless we add a space after each sentence).
However, this error is already in the original mteval-v14.pl
and we want to be consistent with it.
Args:
string: the input string
Returns:
a list of tokens
"""
string = uregex.nondigit_punct_re.sub(r"\1 \2 ", string)
string = uregex.punct_nondigit_re.sub(r" \1 \2", string)
string = uregex.symbol_re.sub(r" \1 ", string)
return string.split()
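# Example (mirrors the unit test): bleu_tokenize("Test0, 1 two, 3") returns
# ["Test0", ",", "1", "two", ",", "3"]. A comma between two digits, as in
# "1,000", stays attached to the number.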
def bleu_wrapper(ref_filename, hyp_filename, case_sensitive=False):
"""Compute BLEU for two files (reference and hypothesis translation)."""
ref_lines = tf.gfile.Open(ref_filename).read().strip().splitlines()
hyp_lines = tf.gfile.Open(hyp_filename).read().strip().splitlines()
if len(ref_lines) != len(hyp_lines):
raise ValueError("Reference and translation files have different number of "
"lines.")
if not case_sensitive:
ref_lines = [x.lower() for x in ref_lines]
hyp_lines = [x.lower() for x in hyp_lines]
ref_tokens = [bleu_tokenize(x) for x in ref_lines]
hyp_tokens = [bleu_tokenize(x) for x in hyp_lines]
return metrics.compute_bleu(ref_tokens, hyp_tokens) * 100
def main(unused_argv):
if FLAGS.bleu_variant is None or "uncased" in FLAGS.bleu_variant:
score = bleu_wrapper(FLAGS.reference, FLAGS.translation, False)
print("Case-insensitive results:", score)
if FLAGS.bleu_variant is None or "cased" in FLAGS.bleu_variant:
score = bleu_wrapper(FLAGS.reference, FLAGS.translation, True)
print("Case-sensitive results:", score)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--translation", "-t", type=str, default=None, required=True,
help="[default: %(default)s] File containing translated text.",
metavar="<T>")
parser.add_argument(
"--reference", "-r", type=str, default=None, required=True,
help="[default: %(default)s] File containing reference translation",
metavar="<R>")
parser.add_argument(
"--bleu_variant", "-bv", type=str, choices=["uncased", "cased"],
nargs="*", default=None,
help="Specify one or more BLEU variants to calculate (both are "
"calculated by default. Variants: \"cased\" or \"uncased\".",
metavar="<BV>")
FLAGS, unparsed = parser.parse_known_args()
main(sys.argv)
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test functions in compute_blue.py."""
import tempfile
import unittest
import tensorflow as tf # pylint: disable=g-bad-import-order
from official.transformer import compute_bleu
class ComputeBleuTest(unittest.TestCase):
def _create_temp_file(self, text):
temp_file = tempfile.NamedTemporaryFile(delete=False)
with tf.gfile.Open(temp_file.name, 'w') as w:
w.write(text)
return temp_file.name
def test_bleu_same(self):
ref = self._create_temp_file("test 1 two 3\nmore tests!")
hyp = self._create_temp_file("test 1 two 3\nmore tests!")
uncased_score = compute_bleu.bleu_wrapper(ref, hyp, False)
cased_score = compute_bleu.bleu_wrapper(ref, hyp, True)
self.assertEqual(100, uncased_score)
self.assertEqual(100, cased_score)
def test_bleu_same_different_case(self):
ref = self._create_temp_file("Test 1 two 3\nmore tests!")
hyp = self._create_temp_file("test 1 two 3\nMore tests!")
uncased_score = compute_bleu.bleu_wrapper(ref, hyp, False)
cased_score = compute_bleu.bleu_wrapper(ref, hyp, True)
self.assertEqual(100, uncased_score)
self.assertLess(cased_score, 100)
def test_bleu_different(self):
ref = self._create_temp_file("Testing\nmore tests!")
hyp = self._create_temp_file("Dog\nCat")
uncased_score = compute_bleu.bleu_wrapper(ref, hyp, False)
cased_score = compute_bleu.bleu_wrapper(ref, hyp, True)
self.assertLess(uncased_score, 100)
self.assertLess(cased_score, 100)
def test_bleu_tokenize(self):
s = "Test0, 1 two, 3"
tokenized = compute_bleu.bleu_tokenize(s)
self.assertEqual(["Test0", ",", "1", "two", ",", "3"], tokenized)
if __name__ == "__main__":
unittest.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Download and preprocess WMT17 ende training and evaluation datasets."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import random
import sys
import tarfile
import urllib
# pylint: disable=g-bad-import-order
import six
import tensorflow as tf
# pylint: enable=g-bad-import-order
from official.transformer.utils import tokenizer
# Data sources for training/evaluating the transformer translation model.
# If any of the training sources are changed, then either:
# 1) use the flag `--search` to find the best min count or
# 2) update the _TRAIN_DATA_MIN_COUNT constant.
# min_count is the minimum number of times a token must appear in the data
# before it is added to the vocabulary. "Best min count" refers to the value
# that generates a vocabulary set that is closest in size to _TARGET_VOCAB_SIZE.
_TRAIN_DATA_SOURCES = [
{
"url": "http://data.statmt.org/wmt17/translation-task/"
"training-parallel-nc-v12.tgz",
"input": "news-commentary-v12.de-en.en",
"target": "news-commentary-v12.de-en.de",
},
{
"url": "http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz",
"input": "commoncrawl.de-en.en",
"target": "commoncrawl.de-en.de",
},
{
"url": "http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz",
"input": "europarl-v7.de-en.en",
"target": "europarl-v7.de-en.de",
},
]
# Use pre-defined minimum count to generate subtoken vocabulary.
_TRAIN_DATA_MIN_COUNT = 6
_EVAL_DATA_SOURCES = [
{
"url": "http://data.statmt.org/wmt17/translation-task/dev.tgz",
"input": "newstest2013.en",
"target": "newstest2013.de",
}
]
# Vocabulary constants
_TARGET_VOCAB_SIZE = 32768 # Number of subtokens in the vocabulary list.
_TARGET_THRESHOLD = 327 # Accept vocabulary if size is within this threshold
VOCAB_FILE = "vocab.ende.%d" % _TARGET_VOCAB_SIZE
# Strings to include in the generated files.
_PREFIX = "wmt32k"
_TRAIN_TAG = "train"
_EVAL_TAG = "dev" # Following WMT and Tensor2Tensor conventions, in which the
# evaluation datasets are tagged as "dev" for development.
# Number of files to split train and evaluation data
_TRAIN_SHARDS = 100
_EVAL_SHARDS = 1
def find_file(path, filename, max_depth=5):
"""Returns full filepath if the file is in path or a subdirectory."""
for root, dirs, files in os.walk(path):
if filename in files:
return os.path.join(root, filename)
# Don't search past max_depth
depth = root[len(path) + 1:].count(os.sep)
if depth > max_depth:
del dirs[:] # Clear dirs
return None
###############################################################################
# Download and extraction functions
###############################################################################
def get_raw_files(raw_dir, data_source):
"""Return raw files from source. Downloads/extracts if needed.
Args:
raw_dir: string directory to store raw files
data_source: dictionary with
{"url": url of compressed dataset containing input and target files
"input": file with data in input language
"target": file with data in target language}
Returns:
dictionary with
{"inputs": list of files containing data in input language
"targets": list of files containing corresponding data in target language
}
"""
raw_files = {
"inputs": [],
"targets": [],
  }
for d in data_source:
input_file, target_file = download_and_extract(
raw_dir, d["url"], d["input"], d["target"])
raw_files["inputs"].append(input_file)
raw_files["targets"].append(target_file)
return raw_files
def download_report_hook(count, block_size, total_size):
"""Report hook for download progress.
Args:
count: current block number
block_size: block size
total_size: total size
"""
percent = int(count * block_size * 100 / total_size)
print("\r%d%%" % percent + " completed", end="\r")
def download_from_url(path, url):
"""Download content from a url.
Args:
path: string directory where file will be downloaded
url: string url
Returns:
Full path to downloaded file
"""
filename = url.split("/")[-1]
found_file = find_file(path, filename, max_depth=0)
if found_file is None:
filename = os.path.join(path, filename)
tf.logging.info("Downloading from %s to %s." % (url, filename))
inprogress_filepath = filename + ".incomplete"
inprogress_filepath, _ = urllib.urlretrieve(
url, inprogress_filepath, reporthook=download_report_hook)
# Print newline to clear the carriage return from the download progress.
print()
tf.gfile.Rename(inprogress_filepath, filename)
return filename
else:
tf.logging.info("Already downloaded: %s (at %s)." % (url, found_file))
return found_file
def download_and_extract(path, url, input_filename, target_filename):
"""Extract files from downloaded compressed archive file.
Args:
path: string directory where the files will be downloaded
url: url containing the compressed input and target files
input_filename: name of file containing data in source language
target_filename: name of file containing data in target language
Returns:
Full paths to extracted input and target files.
Raises:
    OSError: if the download/extraction fails.
"""
# Check if extracted files already exist in path
input_file = find_file(path, input_filename)
target_file = find_file(path, target_filename)
if input_file and target_file:
tf.logging.info("Already downloaded and extracted %s." % url)
return input_file, target_file
# Download archive file if it doesn't already exist.
compressed_file = download_from_url(path, url)
# Extract compressed files
tf.logging.info("Extracting %s." % compressed_file)
with tarfile.open(compressed_file, "r:gz") as corpus_tar:
corpus_tar.extractall(path)
# Return filepaths of the requested files.
input_file = find_file(path, input_filename)
target_file = find_file(path, target_filename)
if input_file and target_file:
return input_file, target_file
raise OSError("Download/extraction failed for url %s to path %s" %
(url, path))
def txt_line_iterator(path):
"""Iterate through lines of file."""
with tf.gfile.Open(path) as f:
for line in f:
yield line.strip()
def compile_files(raw_dir, raw_files, tag):
"""Compile raw files into a single file for each language.
Args:
raw_dir: Directory containing downloaded raw files.
raw_files: Dict containing filenames of input and target data.
{"inputs": list of files containing data in input language
"targets": list of files containing corresponding data in target language
}
tag: String to append to the compiled filename.
Returns:
Full path of compiled input and target files.
"""
tf.logging.info("Compiling files with tag %s." % tag)
filename = "%s-%s" % (_PREFIX, tag)
input_compiled_file = os.path.join(raw_dir, filename + ".lang1")
target_compiled_file = os.path.join(raw_dir, filename + ".lang2")
with tf.gfile.Open(input_compiled_file, mode="w") as input_writer:
with tf.gfile.Open(target_compiled_file, mode="w") as target_writer:
for i in range(len(raw_files["inputs"])):
input_file = raw_files["inputs"][i]
target_file = raw_files["targets"][i]
tf.logging.info("Reading files %s and %s." % (input_file, target_file))
write_file(input_writer, input_file)
write_file(target_writer, target_file)
return input_compiled_file, target_compiled_file
def write_file(writer, filename):
"""Write all of lines from file using the writer."""
for line in txt_line_iterator(filename):
writer.write(line)
writer.write("\n")
###############################################################################
# Data preprocessing
###############################################################################
def encode_and_save_files(
subtokenizer, data_dir, raw_files, tag, total_shards):
"""Save data from files as encoded Examples in TFrecord format.
Args:
subtokenizer: Subtokenizer object that will be used to encode the strings.
data_dir: The directory in which to write the examples
raw_files: A tuple of (input, target) data files. Each line in the input and
      the corresponding line in the target file will be saved in a tf.Example.
tag: String that will be added onto the file names.
total_shards: Number of files to divide the data into.
Returns:
List of all files produced.
"""
# Create a file for each shard.
filepaths = [shard_filename(data_dir, tag, n + 1, total_shards)
for n in range(total_shards)]
if all_exist(filepaths):
tf.logging.info("Files with tag %s already exist." % tag)
return filepaths
tf.logging.info("Saving files with tag %s." % tag)
input_file = raw_files[0]
target_file = raw_files[1]
# Write examples to each shard in round robin order.
tmp_filepaths = [fname + ".incomplete" for fname in filepaths]
writers = [tf.python_io.TFRecordWriter(fname) for fname in tmp_filepaths]
counter, shard = 0, 0
for counter, (input_line, target_line) in enumerate(zip(
txt_line_iterator(input_file), txt_line_iterator(target_file))):
if counter > 0 and counter % 100000 == 0:
tf.logging.info("\tSaving case %d." % counter)
example = dict_to_example(
{"inputs": subtokenizer.encode(input_line, add_eos=True),
"targets": subtokenizer.encode(target_line, add_eos=True)})
writers[shard].write(example.SerializeToString())
shard = (shard + 1) % total_shards
for writer in writers:
writer.close()
for tmp_name, final_name in zip(tmp_filepaths, filepaths):
tf.gfile.Rename(tmp_name, final_name)
tf.logging.info("Saved %d Examples", counter)
return filepaths
def shard_filename(path, tag, shard_num, total_shards):
"""Create filename for data shard."""
return os.path.join(
path, "%s-%s-%.5d-of-%.5d" % (_PREFIX, tag, shard_num, total_shards))
def shuffle_records(fname):
"""Shuffle records in a single file."""
tf.logging.info("Shuffling records in file %s" % fname)
# Rename file prior to shuffling
tmp_fname = fname + ".unshuffled"
tf.gfile.Rename(fname, tmp_fname)
reader = tf.python_io.tf_record_iterator(tmp_fname)
records = []
for record in reader:
records.append(record)
if len(records) % 100000 == 0:
tf.logging.info("\tRead: %d", len(records))
random.shuffle(records)
# Write shuffled records to original file name
with tf.python_io.TFRecordWriter(fname) as w:
for count, record in enumerate(records):
w.write(record)
if count > 0 and count % 100000 == 0:
tf.logging.info("\tWriting record: %d" % count)
tf.gfile.Remove(tmp_fname)
def dict_to_example(dictionary):
"""Converts a dictionary of string->int to a tf.Example."""
features = {}
for k, v in six.iteritems(dictionary):
features[k] = tf.train.Feature(int64_list=tf.train.Int64List(value=v))
return tf.train.Example(features=tf.train.Features(feature=features))
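# Example: dict_to_example({"inputs": [1, 2], "targets": [3]}) yields a
# tf.Example whose "inputs" and "targets" features are int64 lists.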
def all_exist(filepaths):
"""Returns true if all files in the list exist."""
for fname in filepaths:
if not tf.gfile.Exists(fname):
return False
return True
def make_dir(path):
if not tf.gfile.Exists(path):
tf.logging.info("Creating directory %s" % path)
tf.gfile.MakeDirs(path)
def main(unused_argv):
"""Obtain training and evaluation data for the Transformer model."""
tf.logging.set_verbosity(tf.logging.INFO)
make_dir(FLAGS.raw_dir)
make_dir(FLAGS.data_dir)
# Get paths of download/extracted training and evaluation files.
tf.logging.info("Step 1/4: Downloading data from source")
train_files = get_raw_files(FLAGS.raw_dir, _TRAIN_DATA_SOURCES)
eval_files = get_raw_files(FLAGS.raw_dir, _EVAL_DATA_SOURCES)
# Create subtokenizer based on the training files.
tf.logging.info("Step 2/4: Creating subtokenizer and building vocabulary")
train_files_flat = train_files["inputs"] + train_files["targets"]
vocab_file = os.path.join(FLAGS.data_dir, VOCAB_FILE)
subtokenizer = tokenizer.Subtokenizer.init_from_files(
vocab_file, train_files_flat, _TARGET_VOCAB_SIZE, _TARGET_THRESHOLD,
min_count=None if FLAGS.search else _TRAIN_DATA_MIN_COUNT)
tf.logging.info("Step 3/4: Compiling training and evaluation data")
compiled_train_files = compile_files(FLAGS.raw_dir, train_files, _TRAIN_TAG)
compiled_eval_files = compile_files(FLAGS.raw_dir, eval_files, _EVAL_TAG)
# Tokenize and save data as Examples in the TFRecord format.
tf.logging.info("Step 4/4: Preprocessing and saving data")
train_tfrecord_files = encode_and_save_files(
subtokenizer, FLAGS.data_dir, compiled_train_files, _TRAIN_TAG,
_TRAIN_SHARDS)
encode_and_save_files(
subtokenizer, FLAGS.data_dir, compiled_eval_files, _EVAL_TAG,
_EVAL_SHARDS)
for fname in train_tfrecord_files:
shuffle_records(fname)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--data_dir", "-dd", type=str, default="/tmp/translate_ende",
help="[default: %(default)s] Directory for where the "
"translate_ende_wmt32k dataset is saved.",
metavar="<DD>")
parser.add_argument(
"--raw_dir", "-rd", type=str, default="/tmp/translate_ende_raw",
help="[default: %(default)s] Path where the raw data will be downloaded "
"and extracted.",
metavar="<RD>")
parser.add_argument(
"--search", action="store_true",
help="If set, use binary search to find the vocabulary set with size"
"closest to the target size (%d)." % _TARGET_VOCAB_SIZE)
FLAGS, unparsed = parser.parse_known_args()
main(sys.argv)
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of multiheaded attention and self-attention layers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class Attention(tf.layers.Layer):
"""Multi-headed attention layer."""
def __init__(self, hidden_size, num_heads, attention_dropout, train):
if hidden_size % num_heads != 0:
raise ValueError("Hidden size must be evenly divisible by the number of "
"heads.")
super(Attention, self).__init__()
self.hidden_size = hidden_size
self.num_heads = num_heads
self.attention_dropout = attention_dropout
self.train = train
# Layers for linearly projecting the queries, keys, and values.
self.q_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="q")
self.k_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="k")
self.v_dense_layer = tf.layers.Dense(hidden_size, use_bias=False, name="v")
self.output_dense_layer = tf.layers.Dense(hidden_size, use_bias=False,
name="output_transform")
def split_heads(self, x):
"""Split x into different heads, and transpose the resulting value.
    The tensor is transposed to ensure the inner dimensions hold the correct
values during the matrix multiplication.
Args:
x: A tensor with shape [batch_size, length, hidden_size]
Returns:
A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
"""
with tf.name_scope("split_heads"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
# Calculate depth of last dimension after it has been split.
depth = (self.hidden_size // self.num_heads)
# Split the last dimension
x = tf.reshape(x, [batch_size, length, self.num_heads, depth])
# Transpose the result
return tf.transpose(x, [0, 2, 1, 3])
def combine_heads(self, x):
"""Combine tensor that has been split.
Args:
x: A tensor [batch_size, num_heads, length, hidden_size/num_heads]
Returns:
A tensor with shape [batch_size, length, hidden_size]
"""
with tf.name_scope("combine_heads"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[2]
x = tf.transpose(x, [0, 2, 1, 3]) # --> [batch, length, num_heads, depth]
return tf.reshape(x, [batch_size, length, self.hidden_size])
def call(self, x, y, bias, cache=None):
"""Apply attention mechanism to x and y.
Args:
x: a tensor with shape [batch_size, length_x, hidden_size]
y: a tensor with shape [batch_size, length_y, hidden_size]
bias: attention bias that will be added to the result of the dot product.
cache: (Used during prediction) dictionary with tensors containing results
of previous attentions. The dictionary must have the items:
{"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
Returns:
Attention layer output with shape [batch_size, length_x, hidden_size]
"""
# Linearly project the query (q), key (k) and value (v) using different
# learned projections. This is in preparation of splitting them into
# multiple heads. Multi-head attention uses multiple queries, keys, and
# values rather than regular attention (which uses a single q, k, v).
q = self.q_dense_layer(x)
k = self.k_dense_layer(y)
v = self.v_dense_layer(y)
if cache is not None:
# Combine cached keys and values with new keys and values.
k = tf.concat([cache["k"], k], axis=1)
v = tf.concat([cache["v"], v], axis=1)
# Update cache
cache["k"] = k
cache["v"] = v
# Split q, k, v into heads.
q = self.split_heads(q)
k = self.split_heads(k)
v = self.split_heads(v)
# Scale q to prevent the dot product between q and k from growing too large.
depth = (self.hidden_size // self.num_heads)
q *= depth ** -0.5
# Calculate dot product attention
logits = tf.matmul(q, k, transpose_b=True)
logits += bias
weights = tf.nn.softmax(logits, name="attention_weights")
if self.train:
weights = tf.nn.dropout(weights, 1.0 - self.attention_dropout)
attention_output = tf.matmul(weights, v)
# Recombine heads --> [batch_size, length, hidden_size]
attention_output = self.combine_heads(attention_output)
# Run the combined outputs through another linear projection layer.
attention_output = self.output_dense_layer(attention_output)
return attention_output
class SelfAttention(Attention):
"""Multiheaded self-attention layer."""
def call(self, x, bias, cache=None):
return super(SelfAttention, self).call(x, x, bias, cache)
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test beam search helper methods."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf # pylint: disable=g-bad-import-order
from official.transformer.model import beam_search
class BeamSearchHelperTests(tf.test.TestCase):
def test_expand_to_beam_size(self):
x = tf.ones([7, 4, 2, 5])
x = beam_search._expand_to_beam_size(x, 3)
with self.test_session() as sess:
shape = sess.run(tf.shape(x))
self.assertAllEqual([7, 3, 4, 2, 5], shape)
def test_shape_list(self):
y = tf.constant(4.0)
x = tf.ones([7, tf.to_int32(tf.sqrt(y)), 2, 5])
shape = beam_search._shape_list(x)
self.assertIsInstance(shape[0], int)
self.assertIsInstance(shape[1], tf.Tensor)
self.assertIsInstance(shape[2], int)
self.assertIsInstance(shape[3], int)
def test_get_shape_keep_last_dim(self):
y = tf.constant(4.0)
x = tf.ones([7, tf.to_int32(tf.sqrt(y)), 2, 5])
shape = beam_search._get_shape_keep_last_dim(x)
self.assertAllEqual([None, None, None, 5],
shape.as_list())
def test_flatten_beam_dim(self):
x = tf.ones([7, 4, 2, 5])
x = beam_search._flatten_beam_dim(x)
with self.test_session() as sess:
shape = sess.run(tf.shape(x))
self.assertAllEqual([28, 2, 5], shape)
def test_unflatten_beam_dim(self):
x = tf.ones([28, 2, 5])
x = beam_search._unflatten_beam_dim(x, 7, 4)
with self.test_session() as sess:
shape = sess.run(tf.shape(x))
self.assertAllEqual([7, 4, 2, 5], shape)
def test_gather_beams(self):
x = tf.reshape(tf.range(24), [2, 3, 4])
# x looks like: [[[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
#
# [[12 13 14 15]
# [16 17 18 19]
# [20 21 22 23]]]
y = beam_search._gather_beams(x, [[1, 2], [0, 2]], 2, 2)
with self.test_session() as sess:
y = sess.run(y)
self.assertAllEqual([[[4, 5, 6, 7],
[8, 9, 10, 11]],
[[12, 13, 14, 15],
[20, 21, 22, 23]]],
y)
def test_gather_topk_beams(self):
x = tf.reshape(tf.range(24), [2, 3, 4])
x_scores = [[0, 1, 1], [1, 0, 1]]
y = beam_search._gather_topk_beams(x, x_scores, 2, 2)
with self.test_session() as sess:
y = sess.run(y)
self.assertAllEqual([[[4, 5, 6, 7],
[8, 9, 10, 11]],
[[12, 13, 14, 15],
[20, 21, 22, 23]]],
y)
if __name__ == "__main__":
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of embedding layer with shared weights."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf # pylint: disable=g-bad-import-order
from official.transformer.model import model_utils
class EmbeddingSharedWeights(tf.layers.Layer):
"""Calculates input embeddings and pre-softmax linear with shared weights."""
def __init__(self, vocab_size, hidden_size):
super(EmbeddingSharedWeights, self).__init__()
self.vocab_size = vocab_size
self.hidden_size = hidden_size
def build(self, _):
with tf.variable_scope("embedding_and_softmax", reuse=tf.AUTO_REUSE):
# Create and initialize weights. The random normal initializer was chosen
# randomly, and works well.
self.shared_weights = tf.get_variable(
"weights", [self.vocab_size, self.hidden_size],
initializer=tf.random_normal_initializer(
0., self.hidden_size ** -0.5))
self.built = True
def call(self, x):
"""Get token embeddings of x.
Args:
x: An int64 tensor with shape [batch_size, length]
Returns:
embeddings: float32 tensor with shape [batch_size, length, embedding_size]
padding: float32 tensor with shape [batch_size, length] indicating the
locations of the padding tokens in x.
"""
with tf.name_scope("embedding"):
embeddings = tf.gather(self.shared_weights, x)
# Scale embedding by the sqrt of the hidden size
embeddings *= self.hidden_size ** 0.5
# Create binary array of size [batch_size, length]
# where 1 = padding, 0 = not padding
padding = model_utils.get_padding(x)
# Set all padding embedding values to 0
embeddings *= tf.expand_dims(1 - padding, -1)
return embeddings
def linear(self, x):
"""Computes logits by running x through a linear layer.
Args:
x: A float32 tensor with shape [batch_size, length, hidden_size]
Returns:
float32 tensor with shape [batch_size, length, vocab_size].
"""
with tf.name_scope("presoftmax_linear"):
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
x = tf.reshape(x, [-1, self.hidden_size])
logits = tf.matmul(x, self.shared_weights, transpose_b=True)
return tf.reshape(logits, [batch_size, length, self.vocab_size])
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of fully connected network."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class FeedFowardNetwork(tf.layers.Layer):
"""Fully connected feedforward network."""
def __init__(self, hidden_size, filter_size, relu_dropout, train):
super(FeedFowardNetwork, self).__init__()
self.hidden_size = hidden_size
self.filter_size = filter_size
self.relu_dropout = relu_dropout
self.train = train
self.filter_dense_layer = tf.layers.Dense(
filter_size, use_bias=True, activation=tf.nn.relu, name="filter_layer")
self.output_dense_layer = tf.layers.Dense(
hidden_size, use_bias=True, name="output_layer")
def call(self, x, padding=None):
"""Return outputs of the feedforward network.
Args:
x: tensor with shape [batch_size, length, hidden_size]
padding: (optional) If set, the padding values are temporarily removed
from x. The padding values are placed back in the output tensor in the
same locations. shape [batch_size, length]
Returns:
Output of the feedforward network.
tensor with shape [batch_size, length, hidden_size]
"""
# Retrieve dynamically known shapes
batch_size = tf.shape(x)[0]
length = tf.shape(x)[1]
if padding is not None:
with tf.name_scope("remove_padding"):
# Flatten padding to [batch_size*length]
pad_mask = tf.reshape(padding, [-1])
nonpad_ids = tf.to_int32(tf.where(pad_mask < 1e-9))
# Reshape x to [batch_size*length, hidden_size] to remove padding
x = tf.reshape(x, [-1, self.hidden_size])
x = tf.gather_nd(x, indices=nonpad_ids)
# Reshape x from 2 dimensions to 3 dimensions.
x.set_shape([None, self.hidden_size])
x = tf.expand_dims(x, axis=0)
output = self.filter_dense_layer(x)
if self.train:
output = tf.nn.dropout(output, 1.0 - self.relu_dropout)
output = self.output_dense_layer(output)
if padding is not None:
with tf.name_scope("re_add_padding"):
output = tf.squeeze(output, axis=0)
output = tf.scatter_nd(
indices=nonpad_ids,
updates=output,
shape=[batch_size * length, self.hidden_size]
)
output = tf.reshape(output, [batch_size, length, self.hidden_size])
return output
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines Transformer model parameters."""
class TransformerBaseParams(object):
"""Parameters for the base Transformer model."""
# Input params
batch_size = 2048 # Maximum number of tokens per batch of examples.
max_length = 256 # Maximum number of tokens per example.
# Model params
initializer_gain = 1.0 # Used in trainable variable initialization.
vocab_size = 33708 # Number of tokens defined in the vocabulary file.
hidden_size = 512 # Model dimension in the hidden layers.
num_hidden_layers = 6 # Number of layers in the encoder and decoder stacks.
num_heads = 8 # Number of heads to use in multi-headed attention.
filter_size = 2048 # Inner layer dimensionality in the feedforward network.
# Dropout values (only used when training)
layer_postprocess_dropout = 0.1
attention_dropout = 0.1
relu_dropout = 0.1
# Training params
label_smoothing = 0.1
learning_rate = 2.0
learning_rate_decay_rate = 1.0
learning_rate_warmup_steps = 16000
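  # In the schedule from the paper (section 5.3), the learning rate rises
  # linearly over the warmup steps, then decays with the inverse square root
  # of the step number.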
# Optimizer params
optimizer_adam_beta1 = 0.9
optimizer_adam_beta2 = 0.997
optimizer_adam_epsilon = 1e-09
# Default prediction params
extra_decode_length = 50
beam_size = 4
alpha = 0.6 # used to calculate length normalization in beam search
class TransformerBigParams(TransformerBaseParams):
"""Parameters for the big Transformer model."""
batch_size = 4096
hidden_size = 1024
filter_size = 4096
num_heads = 16
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Transformer model helper methods."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import tensorflow as tf
_NEG_INF = -1e9
def get_position_encoding(
length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
"""Return positional encoding.
Calculates the position encoding as a mix of sine and cosine functions with
geometrically increasing wavelengths.
  Defined and formulated in "Attention is All You Need", section 3.5.
Args:
length: Sequence length.
    hidden_size: Size of the hidden layers (last dimension of the encoding).
    min_timescale: Minimum scale that will be applied at each position.
    max_timescale: Maximum scale that will be applied at each position.
Returns:
Tensor with shape [length, hidden_size]
"""
position = tf.to_float(tf.range(length))
num_timescales = hidden_size // 2
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(tf.to_float(num_timescales) - 1))
inv_timescales = min_timescale * tf.exp(
tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)
scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
return signal
def get_decoder_self_attention_bias(length):
"""Calculate bias for decoder that maintains model's autoregressive property.
Creates a tensor that masks out locations that correspond to illegal
connections, so prediction at position i cannot draw information from future
positions.
Args:
length: int length of sequences in batch.
Returns:
float tensor of shape [1, 1, length, length]
"""
with tf.name_scope("decoder_self_attention_bias"):
valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
decoder_bias = _NEG_INF * (1.0 - valid_locs)
return decoder_bias
def get_padding(x, padding_value=0):
"""Return float tensor representing the padding values in x.
Args:
x: int tensor with any shape
    padding_value: int value that denotes padding in x.
  Returns:
    float tensor with same shape as x containing values 0 or 1.
0 -> non-padding, 1 -> padding
"""
with tf.name_scope("padding"):
return tf.to_float(tf.equal(x, padding_value))
def get_padding_bias(x):
"""Calculate bias tensor from padding values in tensor.
Bias tensor that is added to the pre-softmax multi-headed attention logits,
which has shape [batch_size, num_heads, length, length]. The tensor is zero at
non-padding locations, and -1e9 (negative infinity) at padding locations.
Args:
x: int tensor with shape [batch_size, length]
Returns:
Attention bias tensor of shape [batch_size, 1, 1, length].
"""
with tf.name_scope("attention_bias"):
padding = get_padding(x)
attention_bias = padding * _NEG_INF
attention_bias = tf.expand_dims(
tf.expand_dims(attention_bias, axis=1), axis=1)
return attention_bias
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Test Transformer model helper methods."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf # pylint: disable=g-bad-import-order
from official.transformer.model import model_utils
NEG_INF = -1e9
class ModelUtilsTest(tf.test.TestCase):
def test_get_padding(self):
x = tf.constant([[1, 0, 0, 0, 2], [3, 4, 0, 0, 0], [0, 5, 6, 0, 7]])
padding = model_utils.get_padding(x, padding_value=0)
with self.test_session() as sess:
padding = sess.run(padding)
self.assertAllEqual([[0, 1, 1, 1, 0], [0, 0, 1, 1, 1], [1, 0, 0, 1, 0]],
padding)
def test_get_padding_bias(self):
x = tf.constant([[1, 0, 0, 0, 2], [3, 4, 0, 0, 0], [0, 5, 6, 0, 7]])
bias = model_utils.get_padding_bias(x)
bias_shape = tf.shape(bias)
flattened_bias = tf.reshape(bias, [3, 5])
with self.test_session() as sess:
flattened_bias, bias_shape = sess.run((flattened_bias, bias_shape))
self.assertAllEqual([[0, NEG_INF, NEG_INF, NEG_INF, 0],
[0, 0, NEG_INF, NEG_INF, NEG_INF],
[NEG_INF, 0, 0, NEG_INF, 0]],
flattened_bias)
self.assertAllEqual([3, 1, 1, 5], bias_shape)
def test_get_decoder_self_attention_bias(self):
length = 5
bias = model_utils.get_decoder_self_attention_bias(length)
with self.test_session() as sess:
bias = sess.run(bias)
self.assertAllEqual([[[[0, NEG_INF, NEG_INF, NEG_INF, NEG_INF],
[0, 0, NEG_INF, NEG_INF, NEG_INF],
[0, 0, 0, NEG_INF, NEG_INF],
[0, 0, 0, 0, NEG_INF],
[0, 0, 0, 0, 0]]]],
bias)
if __name__ == "__main__":
tf.test.main()
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines the Transformer model, and its encoder and decoder stacks.
Model paper: https://arxiv.org/pdf/1706.03762.pdf
Transformer model code source: https://github.com/tensorflow/tensor2tensor
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf # pylint: disable=g-bad-import-order
from official.transformer.model import attention_layer
from official.transformer.model import beam_search
from official.transformer.model import embedding_layer
from official.transformer.model import ffn_layer
from official.transformer.model import model_utils
from official.transformer.utils.tokenizer import EOS_ID
_NEG_INF = -1e9
class Transformer(object):
"""Transformer model for sequence to sequence data.
Implemented as described in: https://arxiv.org/pdf/1706.03762.pdf
The Transformer model consists of an encoder and decoder. The input is an int
  sequence (or a batch of sequences). The encoder produces a continuous
representation, and the decoder uses the encoder output to generate
probabilities for the output sequence.
"""
def __init__(self, params, train):
"""Initialize layers to build Transformer model.
Args:
params: hyperparameter object defining layer sizes, dropout values, etc.
train: boolean indicating whether the model is in training mode. Used to
determine if dropout layers should be added.
"""
self.train = train
self.params = params
self.embedding_softmax_layer = embedding_layer.EmbeddingSharedWeights(
params.vocab_size, params.hidden_size)
self.encoder_stack = EncoderStack(params, train)
self.decoder_stack = DecoderStack(params, train)
def __call__(self, inputs, targets=None):
"""Calculate target logits or inferred target sequences.
Args:
inputs: int tensor with shape [batch_size, input_length].
targets: None or int tensor with shape [batch_size, target_length].
Returns:
      If targets is defined, then return logits for each word in the target
      sequence. float tensor with shape [batch_size, target_length, vocab_size]
      If targets is None, then generate output sequence one token at a time.
        returns a dictionary {
          outputs: int tensor with shape [batch_size, decoded_length]
          scores: float tensor with shape [batch_size]}
"""
# Variance scaling is used here because it seems to work in many problems.
# Other reasonable initializers may also work just as well.
initializer = tf.variance_scaling_initializer(
self.params.initializer_gain, mode="fan_avg", distribution="uniform")
with tf.variable_scope("Transformer", initializer=initializer):
# Calculate attention bias for encoder self-attention and decoder
# multi-headed attention layers.
attention_bias = model_utils.get_padding_bias(inputs)
# Run the inputs through the encoder layer to map the symbol
# representations to continuous representations.
encoder_outputs = self.encode(inputs, attention_bias)
# Generate output sequence if targets is None, or return logits if target
# sequence is known.
if targets is None:
return self.predict(encoder_outputs, attention_bias)
else:
logits = self.decode(targets, encoder_outputs, attention_bias)
return logits
def encode(self, inputs, attention_bias):
"""Generate continuous representation for inputs.
Args:
inputs: int tensor with shape [batch_size, input_length].
attention_bias: float tensor with shape [batch_size, 1, 1, input_length]
Returns:
float tensor with shape [batch_size, input_length, hidden_size]
"""
with tf.name_scope("encode"):
# Prepare inputs to the layer stack by adding positional encodings and
# applying dropout.
embedded_inputs = self.embedding_softmax_layer(inputs)
inputs_padding = model_utils.get_padding(inputs)
with tf.name_scope("add_pos_encoding"):
length = tf.shape(embedded_inputs)[1]
pos_encoding = model_utils.get_position_encoding(
length, self.params.hidden_size)
encoder_inputs = embedded_inputs + pos_encoding
if self.train:
encoder_inputs = tf.nn.dropout(
encoder_inputs, 1 - self.params.layer_postprocess_dropout)
return self.encoder_stack(encoder_inputs, attention_bias, inputs_padding)
def decode(self, targets, encoder_outputs, attention_bias):
"""Generate logits for each value in the target sequence.
Args:
targets: target values for the output sequence.
int tensor with shape [batch_size, target_length]
encoder_outputs: continuous representation of input sequence.
float tensor with shape [batch_size, input_length, hidden_size]
attention_bias: float tensor with shape [batch_size, 1, 1, input_length]
Returns:
float32 tensor with shape [batch_size, target_length, vocab_size]
"""
with tf.name_scope("decode"):
# Prepare inputs to decoder layers by shifting targets, adding positional
# encoding and applying dropout.
decoder_inputs = self.embedding_softmax_layer(targets)
with tf.name_scope("shift_targets"):
# Shift targets to the right, and remove the last element
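        # E.g. embeddings for positions [w1, w2, w3] become [0, w1, w2], so
        # the decoder predicts each token only from the tokens before it.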
decoder_inputs = tf.pad(
decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
with tf.name_scope("add_pos_encoding"):
length = tf.shape(decoder_inputs)[1]
decoder_inputs += model_utils.get_position_encoding(
length, self.params.hidden_size)
if self.train:
decoder_inputs = tf.nn.dropout(
decoder_inputs, 1 - self.params.layer_postprocess_dropout)
      # Run decoder inputs and encoder outputs through the decoder stack.
decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
length)
outputs = self.decoder_stack(
decoder_inputs, encoder_outputs, decoder_self_attention_bias,
attention_bias)
logits = self.embedding_softmax_layer.linear(outputs)
return logits
def _get_symbols_to_logits_fn(self, max_decode_length):
"""Returns a decoding function that calculates logits of the next tokens."""
timing_signal = model_utils.get_position_encoding(
max_decode_length + 1, self.params.hidden_size)
decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
max_decode_length)
def symbols_to_logits_fn(ids, i, cache):
"""Generate logits for next potential IDs.
Args:
ids: Current decoded sequences.
int tensor with shape [batch_size * beam_size, i + 1]
i: Loop index
cache: dictionary of values storing the encoder output, encoder-decoder
attention bias, and previous decoder attention values.
Returns:
Tuple of
(logits with shape [batch_size * beam_size, vocab_size],
updated cache values)
"""
# Set decoder input to the last generated IDs
decoder_input = ids[:, -1:]
# Preprocess decoder input by getting embeddings and adding timing signal.
decoder_input = self.embedding_softmax_layer(decoder_input)
decoder_input += timing_signal[i:i + 1]
self_attention_bias = decoder_self_attention_bias[:, :, i:i + 1, :i + 1]
decoder_outputs = self.decoder_stack(
decoder_input, cache.get("encoder_outputs"), self_attention_bias,
cache.get("encoder_decoder_attention_bias"), cache)
logits = self.embedding_softmax_layer.linear(decoder_outputs)
logits = tf.squeeze(logits, axis=[1])
return logits, cache
return symbols_to_logits_fn
def predict(self, encoder_outputs, encoder_decoder_attention_bias):
"""Return predicted sequence."""
batch_size = tf.shape(encoder_outputs)[0]
input_length = tf.shape(encoder_outputs)[1]
max_decode_length = input_length + self.params.extra_decode_length
symbols_to_logits_fn = self._get_symbols_to_logits_fn(max_decode_length)
# Create initial set of IDs that will be passed into symbols_to_logits_fn.
initial_ids = tf.zeros([batch_size], dtype=tf.int32)
# Create cache storing decoder attention values for each layer.
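    # The "k" and "v" tensors start with length 0 along the time axis and grow
    # by one row per decoded token, so earlier attention values are reused.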
cache = {
"layer_%d" % layer: {
"k": tf.zeros([batch_size, 0, self.params.hidden_size]),
"v": tf.zeros([batch_size, 0, self.params.hidden_size]),
} for layer in range(self.params.num_hidden_layers)}
# Add encoder output and attention bias to the cache.
cache["encoder_outputs"] = encoder_outputs
cache["encoder_decoder_attention_bias"] = encoder_decoder_attention_bias
# Use beam search to find the top beam_size sequences and scores.
decoded_ids, scores = beam_search.sequence_beam_search(
symbols_to_logits_fn=symbols_to_logits_fn,
initial_ids=initial_ids,
initial_cache=cache,
vocab_size=self.params.vocab_size,
beam_size=self.params.beam_size,
alpha=self.params.alpha,
max_decode_length=max_decode_length,
eos_id=EOS_ID)
# Get the top sequence for each batch element
top_decoded_ids = decoded_ids[:, 0, 1:]
top_scores = scores[:, 0]
return {"outputs": top_decoded_ids, "scores": top_scores}
class LayerNormalization(tf.layers.Layer):
"""Applies layer normalization."""
def __init__(self, hidden_size):
super(LayerNormalization, self).__init__()
self.hidden_size = hidden_size
def build(self, _):
self.scale = tf.get_variable("layer_norm_scale", [self.hidden_size],
initializer=tf.ones_initializer())
self.bias = tf.get_variable("layer_norm_bias", [self.hidden_size],
initializer=tf.zeros_initializer())
self.built = True
def call(self, x, epsilon=1e-6):
mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
return norm_x * self.scale + self.bias
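# Worked example of the arithmetic in LayerNormalization.call: for a row
# x = [1., 2., 3.], mean = 2 and variance = 2/3, so before scale and bias the
# normalized row is roughly [-1.22, 0., 1.22], i.e. (x - 2) * rsqrt(2/3 + 1e-6).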
class PrePostProcessingWrapper(object):
"""Wrapper class that applies layer pre-processing and post-processing."""
def __init__(self, layer, params, train):
self.layer = layer
self.postprocess_dropout = params.layer_postprocess_dropout
self.train = train
# Create normalization layer
self.layer_norm = LayerNormalization(params.hidden_size)
def __call__(self, x, *args, **kwargs):
# Preprocessing: apply layer normalization
y = self.layer_norm(x)
# Get layer output
y = self.layer(y, *args, **kwargs)
# Postprocessing: apply dropout and residual connection
if self.train:
y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
return x + y
class EncoderStack(tf.layers.Layer):
"""Transformer encoder stack.
The encoder stack is made up of N identical layers. Each layer is composed
of the sublayers:
1. Self-attention layer
2. Feedforward network (which is 2 fully-connected layers)
"""
def __init__(self, params, train):
super(EncoderStack, self).__init__()
self.layers = []
for _ in range(params.num_hidden_layers):
# Create sublayers for each layer.
self_attention_layer = attention_layer.SelfAttention(
params.hidden_size, params.num_heads, params.attention_dropout, train)
feed_forward_network = ffn_layer.FeedFowardNetwork(
params.hidden_size, params.filter_size, params.relu_dropout, train)
self.layers.append([
PrePostProcessingWrapper(self_attention_layer, params, train),
PrePostProcessingWrapper(feed_forward_network, params, train)])
# Create final layer normalization layer.
self.output_normalization = LayerNormalization(params.hidden_size)
def call(self, encoder_inputs, attention_bias, inputs_padding):
"""Return the output of the encoder layer stacks.
Args:
encoder_inputs: tensor with shape [batch_size, input_length, hidden_size]
attention_bias: bias for the encoder self-attention layer.
[batch_size, 1, 1, input_length]
      inputs_padding: float tensor with shape [batch_size, input_length], with
        ones at padding locations and zeros elsewhere.
Returns:
Output of encoder layer stack.
float32 tensor with shape [batch_size, input_length, hidden_size]
"""
for n, layer in enumerate(self.layers):
# Run inputs through the sublayers.
self_attention_layer = layer[0]
feed_forward_network = layer[1]
with tf.variable_scope("layer_%d" % n):
with tf.variable_scope("self_attention"):
encoder_inputs = self_attention_layer(encoder_inputs, attention_bias)
with tf.variable_scope("ffn"):
encoder_inputs = feed_forward_network(encoder_inputs, inputs_padding)
return self.output_normalization(encoder_inputs)
class DecoderStack(tf.layers.Layer):
"""Transformer decoder stack.
Like the encoder stack, the decoder stack is made up of N identical layers.
Each layer is composed of the sublayers:
1. Self-attention layer
2. Multi-headed attention layer combining encoder outputs with results from
the previous self-attention layer.
3. Feedforward network (2 fully-connected layers)
"""
def __init__(self, params, train):
super(DecoderStack, self).__init__()
self.layers = []
for _ in range(params.num_hidden_layers):
self_attention_layer = attention_layer.SelfAttention(
params.hidden_size, params.num_heads, params.attention_dropout, train)
enc_dec_attention_layer = attention_layer.Attention(
params.hidden_size, params.num_heads, params.attention_dropout, train)
feed_forward_network = ffn_layer.FeedFowardNetwork(
params.hidden_size, params.filter_size, params.relu_dropout, train)
self.layers.append([
PrePostProcessingWrapper(self_attention_layer, params, train),
PrePostProcessingWrapper(enc_dec_attention_layer, params, train),
PrePostProcessingWrapper(feed_forward_network, params, train)])
self.output_normalization = LayerNormalization(params.hidden_size)
def call(self, decoder_inputs, encoder_outputs, decoder_self_attention_bias,
attention_bias, cache=None):
"""Return the output of the decoder layer stacks.
Args:
decoder_inputs: tensor with shape [batch_size, target_length, hidden_size]
encoder_outputs: tensor with shape [batch_size, input_length, hidden_size]
decoder_self_attention_bias: bias for decoder self-attention layer.
        [1, 1, target_length, target_length]
attention_bias: bias for encoder-decoder attention layer.
[batch_size, 1, 1, input_length]
cache: (Used for fast decoding) A nested dictionary storing previous
decoder self-attention values. The items are:
{layer_n: {"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]},
...}
Returns:
Output of decoder layer stack.
float32 tensor with shape [batch_size, target_length, hidden_size]
"""
for n, layer in enumerate(self.layers):
self_attention_layer = layer[0]
enc_dec_attention_layer = layer[1]
feed_forward_network = layer[2]
# Run inputs through the sublayers.
layer_name = "layer_%d" % n
layer_cache = cache[layer_name] if cache is not None else None
with tf.variable_scope(layer_name):
with tf.variable_scope("self_attention"):
decoder_inputs = self_attention_layer(
decoder_inputs, decoder_self_attention_bias, cache=layer_cache)
with tf.variable_scope("encdec_attention"):
decoder_inputs = enc_dec_attention_layer(
decoder_inputs, encoder_outputs, attention_bias)
with tf.variable_scope("ffn"):
decoder_inputs = feed_forward_network(decoder_inputs)
return self.output_normalization(decoder_inputs)
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Creates an estimator to train the Transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import sys
import tempfile
# pylint: disable=g-bad-import-order
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
# pylint: enable=g-bad-import-order
from official.transformer import compute_bleu
from official.transformer import translate
from official.transformer.data_download import VOCAB_FILE
from official.transformer.model import model_params
from official.transformer.model import transformer
from official.transformer.utils import dataset
from official.transformer.utils import metrics
from official.transformer.utils import tokenizer
DEFAULT_TRAIN_EPOCHS = 10
BLEU_DIR = "bleu"
INF = int(1e9)
def model_fn(features, labels, mode, params):
"""Defines how to train, evaluate and predict from the transformer model."""
with tf.variable_scope("model"):
inputs, targets = features, labels
# Create model and get output logits.
model = transformer.Transformer(params, mode == tf.estimator.ModeKeys.TRAIN)
output = model(inputs, targets)
    # When in prediction mode, the labels/targets are None, and the model
    # output is the prediction.
if mode == tf.estimator.ModeKeys.PREDICT:
return tf.estimator.EstimatorSpec(
tf.estimator.ModeKeys.PREDICT,
predictions=output)
logits = output
# Calculate model loss.
xentropy, weights = metrics.padded_cross_entropy_loss(
logits, targets, params.label_smoothing, params.vocab_size)
loss = tf.reduce_sum(xentropy * weights) / tf.reduce_sum(weights)
if mode == tf.estimator.ModeKeys.EVAL:
return tf.estimator.EstimatorSpec(
mode=mode, loss=loss, predictions={"predictions": logits},
eval_metric_ops=metrics.get_eval_metrics(logits, labels, params))
else:
train_op = get_train_op(loss, params)
return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
def get_learning_rate(learning_rate, hidden_size, learning_rate_warmup_steps):
"""Calculate learning rate with linear warmup and rsqrt decay."""
with tf.name_scope("learning_rate"):
warmup_steps = tf.to_float(learning_rate_warmup_steps)
step = tf.to_float(tf.train.get_or_create_global_step())
learning_rate *= (hidden_size ** -0.5)
# Apply linear warmup
learning_rate *= tf.minimum(1.0, step / warmup_steps)
# Apply rsqrt decay
learning_rate *= tf.rsqrt(tf.maximum(step, warmup_steps))
# Save learning rate value to TensorBoard summary.
tf.summary.scalar("learning_rate", learning_rate)
return learning_rate
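# A plain-Python sketch of the same schedule (illustrative values only; the
# real base rate, hidden size, and warmup steps come from the params object):
def _example_learning_rate(step, base_lr=2.0, hidden_size=512, warmup=16000):
  """Scalar value the graph above produces at a given global step."""
  lr = base_lr * hidden_size ** -0.5
  lr *= min(1.0, step / float(warmup))  # linear warmup
  lr *= max(step, warmup) ** -0.5       # rsqrt decay
  return lr
# E.g. _example_learning_rate(8000) is half the peak rate reached at step 16000.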
def get_train_op(loss, params):
"""Generate training operation that updates variables based on loss."""
with tf.variable_scope("get_train_op"):
learning_rate = get_learning_rate(
params.learning_rate, params.hidden_size,
params.learning_rate_warmup_steps)
    # Create optimizer. Use LazyAdamOptimizer from TF contrib, which applies
    # sparse (embedding) gradient updates lazily and is faster here than the
    # TF core Adam optimizer.
optimizer = tf.contrib.opt.LazyAdamOptimizer(
learning_rate,
beta1=params.optimizer_adam_beta1,
beta2=params.optimizer_adam_beta2,
epsilon=params.optimizer_adam_epsilon)
# Calculate and apply gradients using LazyAdamOptimizer.
global_step = tf.train.get_global_step()
tvars = tf.trainable_variables()
gradients = optimizer.compute_gradients(
loss, tvars, colocate_gradients_with_ops=True)
train_op = optimizer.apply_gradients(
gradients, global_step=global_step, name="train")
    # Save gradient norm to TensorBoard summary.
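    # gradients is a list of (gradient, variable) pairs, so
    # list(zip(*gradients))[0] is the tuple of gradient tensors.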
tf.summary.scalar("global_norm/gradient_norm",
tf.global_norm(list(zip(*gradients))[0]))
return train_op
def translate_and_compute_bleu(estimator, subtokenizer, bleu_source, bleu_ref):
"""Translate file and report the cased and uncased bleu scores."""
# Create temporary file to store translation.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp_filename = tmp.name
translate.translate_file(
estimator, subtokenizer, bleu_source, output_file=tmp_filename,
print_all_translations=False)
# Compute uncased and cased bleu scores.
uncased_score = compute_bleu.bleu_wrapper(bleu_ref, tmp_filename, False)
cased_score = compute_bleu.bleu_wrapper(bleu_ref, tmp_filename, True)
os.remove(tmp_filename)
return uncased_score, cased_score
def get_global_step(estimator):
"""Return estimator's last checkpoint."""
return int(estimator.latest_checkpoint().split("-")[-1])
def evaluate_and_log_bleu(estimator, bleu_writer, bleu_source, bleu_ref):
"""Calculate and record the BLEU score."""
subtokenizer = tokenizer.Subtokenizer(
os.path.join(FLAGS.data_dir, FLAGS.vocab_file))
uncased_score, cased_score = translate_and_compute_bleu(
estimator, subtokenizer, bleu_source, bleu_ref)
print("Bleu score (uncased):", uncased_score)
print("Bleu score (cased):", cased_score)
summary = tf.Summary(value=[
tf.Summary.Value(tag="bleu/uncased", simple_value=uncased_score),
tf.Summary.Value(tag="bleu/cased", simple_value=cased_score),
])
bleu_writer.add_summary(summary, get_global_step(estimator))
bleu_writer.flush()
return uncased_score, cased_score
def train_schedule(
estimator, train_eval_iterations, single_iteration_train_steps=None,
single_iteration_train_epochs=None, bleu_source=None, bleu_ref=None,
bleu_threshold=None):
"""Train and evaluate model, and optionally compute model's BLEU score.
**Step vs. Epoch vs. Iteration**
Steps and epochs are canonical terms used in TensorFlow and general machine
learning. They are used to describe running a single process (train/eval):
- Step refers to running the process through a single or batch of examples.
- Epoch refers to running the process through an entire dataset.
  E.g., suppose a dataset has 100 examples that are divided into 20 batches
  of 5 examples each. A single training step trains the model on one batch.
  After 20 training steps, the model will have trained on every batch in the
  dataset, i.e. one epoch.
Meanwhile, iteration is used in this implementation to describe running
multiple processes (training and eval).
- A single iteration:
1. trains the model for a specific number of steps or epochs.
2. evaluates the model.
      3. (if source and ref files are provided) computes the BLEU score.
This function runs through multiple train+eval+bleu iterations.
Args:
estimator: tf.Estimator containing model to train.
train_eval_iterations: Number of times to repeat the train+eval iteration.
single_iteration_train_steps: Number of steps to train in one iteration.
single_iteration_train_epochs: Number of epochs to train in one iteration.
bleu_source: File containing text to be translated for BLEU calculation.
bleu_ref: File containing reference translations for BLEU calculation.
bleu_threshold: minimum BLEU score before training is stopped.
Raises:
ValueError: if both or none of single_iteration_train_steps and
single_iteration_train_epochs were defined.
"""
# Ensure that exactly one of single_iteration_train_steps and
# single_iteration_train_epochs is defined.
if single_iteration_train_steps is None:
if single_iteration_train_epochs is None:
raise ValueError(
"Exactly one of single_iteration_train_steps or "
"single_iteration_train_epochs must be defined. Both were none.")
else:
if single_iteration_train_epochs is not None:
raise ValueError(
"Exactly one of single_iteration_train_steps or "
"single_iteration_train_epochs must be defined. Both were defined.")
evaluate_bleu = bleu_source is not None and bleu_ref is not None
# Print out training schedule
print("Training schedule:")
if single_iteration_train_epochs is not None:
print("\t1. Train for %d epochs." % single_iteration_train_epochs)
else:
print("\t1. Train for %d steps." % single_iteration_train_steps)
print("\t2. Evaluate model.")
if evaluate_bleu:
print("\t3. Compute BLEU score.")
if bleu_threshold is not None:
print("Repeat above steps until the BLEU score reaches", bleu_threshold)
if not evaluate_bleu or bleu_threshold is None:
print("Repeat above steps %d times." % train_eval_iterations)
if evaluate_bleu:
# Set summary writer to log bleu score.
bleu_writer = tf.summary.FileWriter(
os.path.join(estimator.model_dir, BLEU_DIR))
if bleu_threshold is not None:
# Change loop stopping condition if bleu_threshold is defined.
train_eval_iterations = INF
# Loop training/evaluation/bleu cycles
for i in xrange(train_eval_iterations):
print("Starting iteration", i + 1)
# Train the model for single_iteration_train_steps or until the input fn
# runs out of examples (if single_iteration_train_steps is None).
estimator.train(dataset.train_input_fn, steps=single_iteration_train_steps)
eval_results = estimator.evaluate(dataset.eval_input_fn)
print("Evaluation results (iter %d/%d):" % (i + 1, train_eval_iterations),
eval_results)
if evaluate_bleu:
uncased_score, _ = evaluate_and_log_bleu(
estimator, bleu_writer, bleu_source, bleu_ref)
if bleu_threshold is not None and uncased_score > bleu_threshold:
bleu_writer.close()
break
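# Sketch of the iteration bookkeeping used by main() below (illustrative
# numbers): --train_steps=250000 with --steps_between_eval=1000 yields
# train_eval_iterations = 250, i.e. 250 cycles of (train 1000 steps, evaluate,
# and optionally compute BLEU).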
def main(_):
# Set logging level to INFO to display training progress (logged by the
# estimator)
tf.logging.set_verbosity(tf.logging.INFO)
if FLAGS.params == "base":
params = model_params.TransformerBaseParams
elif FLAGS.params == "big":
params = model_params.TransformerBigParams
else:
raise ValueError("Invalid parameter set defined: %s."
"Expected 'base' or 'big.'" % FLAGS.params)
# Determine training schedule based on flags.
if FLAGS.train_steps is not None and FLAGS.train_epochs is not None:
raise ValueError("Both --train_steps and --train_epochs were set. Only one "
"may be defined.")
if FLAGS.train_steps is not None:
train_eval_iterations = FLAGS.train_steps // FLAGS.steps_between_eval
single_iteration_train_steps = FLAGS.steps_between_eval
single_iteration_train_epochs = None
else:
if FLAGS.train_epochs is None:
FLAGS.train_epochs = DEFAULT_TRAIN_EPOCHS
train_eval_iterations = FLAGS.train_epochs // FLAGS.epochs_between_eval
single_iteration_train_steps = None
single_iteration_train_epochs = FLAGS.epochs_between_eval
  # Make sure the BLEU source and ref files exist, if they are set.
if FLAGS.bleu_source is not None and FLAGS.bleu_ref is not None:
if not tf.gfile.Exists(FLAGS.bleu_source):
raise ValueError("BLEU source file %s does not exist" % FLAGS.bleu_source)
if not tf.gfile.Exists(FLAGS.bleu_ref):
raise ValueError("BLEU source file %s does not exist" % FLAGS.bleu_ref)
# Add flag-defined parameters to params object
params.data_dir = FLAGS.data_dir
params.num_cpu_cores = FLAGS.num_cpu_cores
params.epochs_between_eval = FLAGS.epochs_between_eval
params.repeat_dataset = single_iteration_train_epochs
estimator = tf.estimator.Estimator(
model_fn=model_fn, model_dir=FLAGS.model_dir, params=params)
train_schedule(
estimator, train_eval_iterations, single_iteration_train_steps,
single_iteration_train_epochs, FLAGS.bleu_source, FLAGS.bleu_ref,
FLAGS.bleu_threshold)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--data_dir", "-dd", type=str, default="/tmp/translate_ende",
help="[default: %(default)s] Directory containing training and "
"evaluation data, and vocab file used for encoding.",
metavar="<DD>")
parser.add_argument(
"--vocab_file", "-vf", type=str, default=VOCAB_FILE,
help="[default: %(default)s] Name of vocabulary file.",
metavar="<vf>")
parser.add_argument(
"--model_dir", "-md", type=str, default="/tmp/transformer_model",
help="[default: %(default)s] Directory to save Transformer model "
"training checkpoints",
metavar="<MD>")
parser.add_argument(
"--params", "-p", type=str, default="big", choices=["base", "big"],
help="[default: %(default)s] Parameter set to use when creating and "
"training the model.",
metavar="<P>")
parser.add_argument(
"--num_cpu_cores", "-nc", type=int, default=4,
help="[default: %(default)s] Number of CPU cores to use in the input "
"pipeline.",
metavar="<NC>")
# Flags for training with epochs. (default)
parser.add_argument(
"--train_epochs", "-te", type=int, default=None,
help="The number of epochs used to train. If both --train_epochs and "
"--train_steps are not set, the model will train for %d epochs." %
DEFAULT_TRAIN_EPOCHS,
metavar="<TE>")
parser.add_argument(
"--epochs_between_eval", "-ebe", type=int, default=1,
help="[default: %(default)s] The number of training epochs to run "
"between evaluations.",
metavar="<TE>")
# Flags for training with steps (may be used for debugging)
parser.add_argument(
"--train_steps", "-ts", type=int, default=None,
help="Total number of training steps. If both --train_epochs and "
"--train_steps are not set, the model will train for %d epochs." %
DEFAULT_TRAIN_EPOCHS,
metavar="<TS>")
parser.add_argument(
"--steps_between_eval", "-sbe", type=int, default=1000,
help="[default: %(default)s] Number of training steps to run between "
"evaluations.",
metavar="<SBE>")
# BLEU score computation
parser.add_argument(
"--bleu_source", "-bs", type=str, default=None,
help="Path to source file containing text translate when calculating the "
"official BLEU score. Both --bleu_source and --bleu_ref must be "
"set. The BLEU score will be calculated during model evaluation.",
metavar="<BS>")
parser.add_argument(
"--bleu_ref", "-br", type=str, default=None,
help="Path to file containing the reference translation for calculating "
"the official BLEU score. Both --bleu_source and --bleu_ref must be "
"set. The BLEU score will be calculated during model evaluation.",
metavar="<BR>")
parser.add_argument(
"--bleu_threshold", "-bt", type=float, default=None,
help="Stop training when the uncased BLEU score reaches this value. "
"Setting this overrides the total number of steps or epochs set by "
"--train_steps or --train_epochs.",
metavar="<BT>")
FLAGS, unparsed = parser.parse_known_args()
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Translate text or files using trained transformer model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import sys
# pylint: disable=g-bad-import-order
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
# pylint: enable=g-bad-import-order
from official.transformer.data_download import VOCAB_FILE
from official.transformer.model import model_params
from official.transformer.utils import tokenizer
_DECODE_BATCH_SIZE = 32
_EXTRA_DECODE_LENGTH = 100
_BEAM_SIZE = 4
_ALPHA = 0.6
def _get_sorted_inputs(filename):
"""Read and sort lines from the file sorted by decreasing length.
Args:
filename: String name of file to read inputs from.
Returns:
Sorted list of inputs, and dictionary mapping original index->sorted index
of each element.
"""
with tf.gfile.Open(filename) as f:
records = f.read().split("\n")
inputs = [record.strip() for record in records]
if not inputs[-1]:
inputs.pop()
input_lens = [(i, len(line.split())) for i, line in enumerate(inputs)]
sorted_input_lens = sorted(input_lens, key=lambda x: x[1], reverse=True)
sorted_inputs = []
sorted_keys = {}
for i, (index, _) in enumerate(sorted_input_lens):
sorted_inputs.append(inputs[index])
sorted_keys[index] = i
return sorted_inputs, sorted_keys
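# Illustrative example of the bookkeeping above: for inputs
# ["a b", "c d e", "f"], sorting by decreasing token count gives
# sorted_inputs = ["c d e", "a b", "f"] and sorted_keys = {0: 1, 1: 0, 2: 2},
# so translations[sorted_keys[i]] restores the original line order.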
def _encode_and_add_eos(line, subtokenizer):
"""Encode line with subtokenizer, and add EOS id to the end."""
return subtokenizer.encode(line) + [tokenizer.EOS_ID]
def _trim_and_decode(ids, subtokenizer):
"""Trim EOS and PAD tokens from ids, and decode to return a string."""
try:
index = list(ids).index(tokenizer.EOS_ID)
return subtokenizer.decode(ids[:index])
except ValueError: # No EOS found in sequence
return subtokenizer.decode(ids)
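# E.g. assuming EOS_ID == 1 and PAD id 0, ids [5, 8, 1, 0, 0] decode from
# [5, 8] only; the EOS id and everything after it are dropped.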
def translate_file(
estimator, subtokenizer, input_file, output_file=None,
print_all_translations=True):
"""Translate lines in file, and save to output file if specified.
Args:
estimator: tf.Estimator used to generate the translations.
subtokenizer: Subtokenizer object for encoding and decoding source and
translated lines.
input_file: file containing lines to translate
output_file: file that stores the generated translations.
print_all_translations: If true, all translations are printed to stdout.
Raises:
ValueError: if output file is invalid.
"""
batch_size = _DECODE_BATCH_SIZE
# Read and sort inputs by length. Keep dictionary (original index-->new index
# in sorted list) to write translations in the original order.
sorted_inputs, sorted_keys = _get_sorted_inputs(input_file)
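  # Ceiling division: e.g. 65 sorted inputs with batch_size 32 give 3 batches.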
num_decode_batches = (len(sorted_inputs) - 1) // batch_size + 1
def input_generator():
"""Yield encoded strings from sorted_inputs."""
for i, line in enumerate(sorted_inputs):
if i % batch_size == 0:
batch_num = (i // batch_size) + 1
print("Decoding batch %d out of %d." % (batch_num, num_decode_batches))
yield _encode_and_add_eos(line, subtokenizer)
def input_fn():
"""Created batched dataset of encoded inputs."""
ds = tf.data.Dataset.from_generator(
input_generator, tf.int64, tf.TensorShape([None]))
ds = ds.padded_batch(batch_size, [None])
return ds
translations = []
for i, prediction in enumerate(estimator.predict(input_fn)):
translation = _trim_and_decode(prediction["outputs"], subtokenizer)
translations.append(translation)
if print_all_translations:
print("Translating:")
print("\tInput: %s" % sorted_inputs[i])
print("\tOutput: %s\n" % translation)
print("=" * 100)
# Write translations in the order they appeared in the original file.
if output_file is not None:
if tf.gfile.IsDirectory(output_file):
raise ValueError("File output is a directory, will not save outputs to "
"file.")
tf.logging.info("Writing to file %s" % output_file)
with tf.gfile.Open(output_file, "w") as f:
for index in xrange(len(sorted_keys)):
f.write("%s\n" % translations[sorted_keys[index]])
def translate_text(estimator, subtokenizer, txt):
"""Translate a single string."""
encoded_txt = _encode_and_add_eos(txt, subtokenizer)
def input_fn():
ds = tf.data.Dataset.from_tensors(encoded_txt)
ds = ds.batch(_DECODE_BATCH_SIZE)
return ds
predictions = estimator.predict(input_fn)
translation = next(predictions)["outputs"]
translation = _trim_and_decode(translation, subtokenizer)
print("Translation of \"%s\": \"%s\"" % (txt, translation))
def main(unused_argv):
from official.transformer import transformer_main
tf.logging.set_verbosity(tf.logging.INFO)
if FLAGS.text is None and FLAGS.file is None:
tf.logging.warn("Nothing to translate. Make sure to call this script using "
"flags --text or --file.")
return
subtokenizer = tokenizer.Subtokenizer(
os.path.join(FLAGS.data_dir, FLAGS.vocab_file))
if FLAGS.params == "base":
params = model_params.TransformerBaseParams
elif FLAGS.params == "big":
params = model_params.TransformerBigParams
else:
raise ValueError("Invalid parameter set defined: %s."
"Expected 'base' or 'big.'" % FLAGS.params)
# Set up estimator and params
params.beam_size = _BEAM_SIZE
params.alpha = _ALPHA
params.extra_decode_length = _EXTRA_DECODE_LENGTH
params.batch_size = _DECODE_BATCH_SIZE
estimator = tf.estimator.Estimator(
model_fn=transformer_main.model_fn, model_dir=FLAGS.model_dir,
params=params)
if FLAGS.text is not None:
tf.logging.info("Translating text: %s" % FLAGS.text)
translate_text(estimator, subtokenizer, FLAGS.text)
if FLAGS.file is not None:
input_file = os.path.abspath(FLAGS.file)
tf.logging.info("Translating file: %s" % input_file)
    if not tf.gfile.Exists(input_file):
raise ValueError("File does not exist: %s" % input_file)
output_file = None
if FLAGS.file_out is not None:
output_file = os.path.abspath(FLAGS.file_out)
tf.logging.info("File output specified: %s" % output_file)
translate_file(estimator, subtokenizer, input_file, output_file)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Model arguments
parser.add_argument(
"--data_dir", "-dd", type=str, default="/tmp/data/translate_ende",
help="[default: %(default)s] Directory where vocab file is stored.",
metavar="<DD>")
parser.add_argument(
"--vocab_file", "-vf", type=str, default=VOCAB_FILE,
help="[default: %(default)s] Name of vocabulary file.",
metavar="<vf>")
parser.add_argument(
"--model_dir", "-md", type=str, default="/tmp/transformer_model",
help="[default: %(default)s] Directory containing Transformer model "
"checkpoints.",
metavar="<MD>")
parser.add_argument(
"--params", "-p", type=str, default="big", choices=["base", "big"],
help="[default: %(default)s] Parameter used for trained model.",
metavar="<P>")
# Flags for specifying text/file to be translated.
parser.add_argument(
"--text", "-t", type=str, default=None,
help="[default: %(default)s] Text to translate. Output will be printed "
"to console.",
metavar="<T>")
parser.add_argument(
"--file", "-f", type=str, default=None,
help="[default: %(default)s] File containing text to translate. "
"Translation will be printed to console and, if --file_out is "
"provided, saved to an output file.",
metavar="<F>")
parser.add_argument(
"--file_out", "-fo", type=str, default=None,
help="[default: %(default)s] If --file flag is specified, save "
"translation to this file.",
metavar="<FO>")
FLAGS, unparsed = parser.parse_known_args()
main(sys.argv)