Merge branch 'master' into ncf_f16

5b0ef1fc · Nimit Nigania · 1cba90f3 · bf748370 · 5b0ef1fc · 5b0ef1fc
Commit 5b0ef1fc authored Aug 23, 2019 by Nimit Nigania
20 changed files
--- a/official/README.md
+++ b/official/README.md
 # TensorFlow Official Models

-The TensorFlow official models are a collection of example models that use TensorFlow's high-level APIs. They are intended to be well-maintained, tested, and kept up to date with the latest TensorFlow API. They should also be reasonably optimized for fast performance while still being easy to read.
+The TensorFlow official models are a collection of example models that use
+TensorFlow's high-level APIs. They are intended to be well-maintained, tested,
+and kept up to date with the latest TensorFlow API. They should also be
+reasonably optimized for fast performance while still being easy to read.

-These models are used as end-to-end tests, ensuring that the models run with the same speed and performance with each new TensorFlow build.
+These models are used as end-to-end tests, ensuring that the models run with the
+same speed and performance with each new TensorFlow build.

 ## Tensorflow releases
-The master branch of the models are **in development**, and they target the [nightly binaries](https://github.com/tensorflow/tensorflow#installation) built from the [master branch of TensorFlow](https://github.com/tensorflow/tensorflow/tree/master). We aim to keep them backwards compatible with the latest release when possible (currently TensorFlow 1.5), but we cannot always guarantee compatibility.

-**Stable versions** of the official models targeting releases of TensorFlow are available as tagged branches or [downloadable releases](https://github.com/tensorflow/models/releases). Model repository version numbers match the target TensorFlow release, such that [branch r1.4.0](https://github.com/tensorflow/models/tree/r1.4.0) and [release v1.4.0](https://github.com/tensorflow/models/releases/tag/v1.4.0) are compatible with [TensorFlow v1.4.0](https://github.com/tensorflow/tensorflow/releases/tag/v1.4.0).
-
-If you are on a version of TensorFlow earlier than 1.4, please [update your installation](https://www.tensorflow.org/install/).
+The master branch of the models are **in development**, and they target the
+[nightly binaries](https://github.com/tensorflow/tensorflow#installation) built
+from the
+[master branch of TensorFlow](https://github.com/tensorflow/tensorflow/tree/master).
+We aim to keep them backwards compatible with the latest release when possible
+(currently TensorFlow 1.5), but we cannot always guarantee compatibility.
+
+**Stable versions** of the official models targeting releases of TensorFlow are
+available as tagged branches or
+[downloadable releases](https://github.com/tensorflow/models/releases). Model
+repository version numbers match the target TensorFlow release, such that
+[branch r1.4.0](https://github.com/tensorflow/models/tree/r1.4.0) and
+[release v1.4.0](https://github.com/tensorflow/models/releases/tag/v1.4.0) are
+compatible with
+[TensorFlow v1.4.0](https://github.com/tensorflow/tensorflow/releases/tag/v1.4.0).
+
+If you are on a version of TensorFlow earlier than 1.4, please
+[update your installation](https://www.tensorflow.org/install/).

 ## Requirements
+
 Please follow the below steps before running models in this repo:

+1.  TensorFlow
+    [nightly binaries](https://github.com/tensorflow/tensorflow#installation)
+
+2.  Add the top-level ***/models*** folder to the Python path with the command:
+    `export PYTHONPATH="$PYTHONPATH:/path/to/models"`

-1. TensorFlow [nightly binaries](https://github.com/tensorflow/tensorflow#installation)
+    Using Colab: `import os os.environ['PYTHONPATH'] += ":/path/to/models"`

-2. Add the top-level ***/models*** folder to the Python path with the command:
-   ```
-   export PYTHONPATH="$PYTHONPATH:/path/to/models"
-   ```
+3.  Install dependencies: `pip3 install --user -r official/requirements.txt` or
+    `pip install --user -r official/requirements.txt`

-   Using Colab:
-   ```
-   import os
-   os.environ['PYTHONPATH'] += ":/path/to/models"
-   ```
+To make Official Models easier to use, we are planning to create a pip
+installable Official Models package. This is being tracked in
+[#917](https://github.com/tensorflow/models/issues/917).

-3. Install dependencies:
-   ```
-   pip3 install --user -r official/requirements.txt
-   ```
-   or
-   ```
-   pip install --user -r official/requirements.txt
-   ```
+## Available models

+**NOTE:** Please make sure to follow the steps in the
+[Requirements](#requirements) section.

-To make Official Models easier to use, we are planning to create a pip installable Official Models package. This is being tracked in [#917](https://github.com/tensorflow/models/issues/917).
+*   [bert](bert): A powerful pre-trained language representation model: BERT,
+    which stands for Bidirectional Encoder Representations from Transformers.
+*   [mnist](mnist): A basic model to classify digits from the MNIST dataset.
+*   [resnet](vision/image_classification): A deep residual network that can be
+    used to classify both CIFAR-10 and ImageNet's dataset of 1000 classes.
+*   [transformer](transformer): A transformer model to translate the WMT English
+    to German dataset.
+*   [ncf](recommendation): Neural Collaborative Filtering model for
+    recommendation tasks.

+Models that will not update to TensorFlow 2.x stay inside R1 directory:

-## Available models
+*   [boosted_trees](r1/boosted_trees): A Gradient Boosted Trees model to
+    classify higgs boson process from HIGGS Data Set.
+*   [wide_deep](r1/wide_deep): A model that combines a wide model and deep
+    network to classify census income data.
+
+## More models to come!

-**NOTE:** Please make sure to follow the steps in the [Requirements](#requirements) section.
+We are in the progress to revamp official model garden with TensorFlow 2.0 and
+Keras. In the near future, we will bring:

-* [bert](bert): A powerful pre-trained language representation model: BERT, which
-  stands for Bidirectional Encoder Representations from Transformers.
-* [boosted_trees](boosted_trees): A Gradient Boosted Trees model to classify higgs boson process from HIGGS Data Set.
-* [mnist](mnist): A basic model to classify digits from the MNIST dataset.
-* [resnet](resnet): A deep residual network that can be used to classify both CIFAR-10 and ImageNet's dataset of 1000 classes.
-* [transformer](transformer): A transformer model to translate the WMT English to German dataset.
-* [wide_deep](wide_deep): A model that combines a wide model and deep network to classify census income data.
-* More models to come!
+*   State-of-the-art language understanding models: XLNet, GPT2, and more
+    members in Transformer family.
+*   Start-of-the-art image classification models: EfficientNet, MnasNet and
+    variants.
+*   A set of excellent objection detection models.

-If you would like to make any fixes or improvements to the models, please [submit a pull request](https://github.com/tensorflow/models/compare).
+If you would like to make any fixes or improvements to the models, please
+[submit a pull request](https://github.com/tensorflow/models/compare).

 ## New Models

-The team is actively working to add new models to the repository. Every model should follow the following guidelines, to uphold the
-our objectives of readable, usable, and maintainable code.
+The team is actively working to add new models to the repository. Every model
+should follow the following guidelines, to uphold the our objectives of
+readable, usable, and maintainable code.

-**General guidelines**
-* Code should be well documented and tested.
-* Runnable from a blank environment with relative ease.
-* Trainable on: single GPU/CPU (baseline), multiple GPUs, TPU
-* Compatible with Python 2 and 3 (using [six](https://pythonhosted.org/six/) when necessary)
-* Conform to [Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md)
+**General guidelines** * Code should be well documented and tested. * Runnable
+from a blank environment with relative ease. * Trainable on: single GPU/CPU
+(baseline), multiple GPUs, TPU * Compatible with Python 2 and 3 (using
+[six](https://pythonhosted.org/six/) when necessary) * Conform to
+[Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md)

 **Implementation guidelines**

-These guidelines exist so the model implementations are consistent for better readability and maintainability.
+These guidelines exist so the model implementations are consistent for better
+readability and maintainability.

-* Use [common utility functions](utils)
-* Export SavedModel at the end of training.
-* Consistent flags and flag-parsing library ([read more here](utils/flags/guidelines.md))
-* Produce benchmarks and logs ([read more here](utils/logs/guidelines.md))
+*   Use [common utility functions](utils)
+*   Export SavedModel at the end of training.
+*   Consistent flags and flag-parsing library
+    ([read more here](utils/flags/guidelines.md))
+*   Produce benchmarks and logs ([read more here](utils/logs/guidelines.md))
--- a/official/resnet/keras/keras_benchmark.py
+++ b/official/resnet/keras/keras_benchmark.py
--- a/official/resnet/keras/keras_cifar_benchmark.py
+++ b/official/resnet/keras/keras_cifar_benchmark.py
@@ -22,8 +22,8 @@ import time
 from absl import flags
 import tensorflow as tf  # pylint: disable=g-bad-import-order

-from official.resnet.keras import keras_benchmark
-from official.resnet.keras import keras_cifar_main
+from official.benchmark import keras_benchmark
+from official.vision.image_classification import resnet_cifar_main

 MIN_TOP_1_ACCURACY = 0.929
 MAX_TOP_1_ACCURACY = 0.938
@@ -47,7 +47,7 @@ class Resnet56KerasAccuracy(keras_benchmark.KerasBenchmark):
    """

    self.data_dir = os.path.join(root_data_dir, CIFAR_DATA_DIR_NAME)
-    flag_methods = [keras_cifar_main.define_cifar_flags]
+    flag_methods = [resnet_cifar_main.define_cifar_flags]

    super(Resnet56KerasAccuracy, self).__init__(
        output_dir=output_dir, flag_methods=flag_methods)
@@ -199,7 +199,7 @@ class Resnet56KerasAccuracy(keras_benchmark.KerasBenchmark):

  def _run_and_report_benchmark(self):
    start_time_sec = time.time()
-    stats = keras_cifar_main.run(FLAGS)
+    stats = resnet_cifar_main.run(FLAGS)
    wall_time_sec = time.time() - start_time_sec

    super(Resnet56KerasAccuracy, self)._report_benchmark(
@@ -215,7 +215,7 @@ class Resnet56KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
  """Short performance tests for ResNet56 via Keras and CIFAR-10."""

  def __init__(self, output_dir=None, default_flags=None):
-    flag_methods = [keras_cifar_main.define_cifar_flags]
+    flag_methods = [resnet_cifar_main.define_cifar_flags]

    super(Resnet56KerasBenchmarkBase, self).__init__(
        output_dir=output_dir,
@@ -224,7 +224,7 @@ class Resnet56KerasBenchmarkBase(keras_benchmark.KerasBenchmark):

  def _run_and_report_benchmark(self):
    start_time_sec = time.time()
-    stats = keras_cifar_main.run(FLAGS)
+    stats = resnet_cifar_main.run(FLAGS)
    wall_time_sec = time.time() - start_time_sec

    super(Resnet56KerasBenchmarkBase, self)._report_benchmark(
@@ -248,6 +248,7 @@ class Resnet56KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    self._setup()
    FLAGS.num_gpus = 1
    FLAGS.enable_eager = True
+    FLAGS.run_eagerly = False
    FLAGS.enable_xla = True
    FLAGS.distribution_strategy = 'default'
    FLAGS.model_dir = self._get_model_dir('benchmark_1_gpu_xla')
@@ -270,6 +271,7 @@ class Resnet56KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    self._setup()
    FLAGS.num_gpus = 1
    FLAGS.enable_eager = False
+    FLAGS.run_eagerly = False
    FLAGS.distribution_strategy = 'default'
    FLAGS.model_dir = self._get_model_dir('benchmark_graph_1_gpu')
    FLAGS.batch_size = 128
@@ -340,6 +342,7 @@ class Resnet56KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    self._setup()
    FLAGS.num_gpus = 2
    FLAGS.enable_eager = True
+    FLAGS.run_eagerly = False
    FLAGS.distribution_strategy = 'default'
    FLAGS.model_dir = self._get_model_dir('benchmark_2_gpu')
    FLAGS.batch_size = 128 * 2  # 2 GPUs
@@ -350,6 +353,7 @@ class Resnet56KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    self._setup()
    FLAGS.num_gpus = 2
    FLAGS.enable_eager = False
+    FLAGS.run_eagerly = False
    FLAGS.distribution_strategy = 'default'
    FLAGS.model_dir = self._get_model_dir('benchmark_graph_2_gpu')
    FLAGS.batch_size = 128 * 2  # 2 GPUs

--- a/official/resnet/keras/keras_imagenet_benchmark.py
+++ b/official/resnet/keras/keras_imagenet_benchmark.py
@@ -21,8 +21,8 @@ import time
 from absl import flags
 import tensorflow as tf  # pylint: disable=g-bad-import-order

-from official.resnet.keras import keras_benchmark
-from official.resnet.keras import keras_imagenet_main
+from official.benchmark import keras_benchmark
+from official.vision.image_classification import resnet_imagenet_main

 MIN_TOP_1_ACCURACY = 0.76
 MAX_TOP_1_ACCURACY = 0.77
@@ -44,7 +44,7 @@ class Resnet50KerasAccuracy(keras_benchmark.KerasBenchmark):
                named arguments before updating the constructor.
    """

-    flag_methods = [keras_imagenet_main.define_imagenet_keras_flags]
+    flag_methods = [resnet_imagenet_main.define_imagenet_keras_flags]

    self.data_dir = os.path.join(root_data_dir, 'imagenet')
    super(Resnet50KerasAccuracy, self).__init__(
@@ -112,32 +112,6 @@ class Resnet50KerasAccuracy(keras_benchmark.KerasBenchmark):
    FLAGS.use_tensor_lr = True
    self._run_and_report_benchmark()

-  def benchmark_8_gpu_mlperf_like_tweaked(self):
-    """Test similar to the rules for MLPerf 0.5.
-
-    Listed below are reasons this comparison is not to the MLSpec, but this is
-    still a decent directional measurement:
-      - Eval is every 4 epochs and again at the end. ~2 extra times.
-      - Learning rate is not tuned to hit 75%, but we know the model is correct.
-      - We measure total time and MLPerf 0.5 excluded some startup time.
-      - Eval is not on the total set, need to set eval batch_size where
-        8*batch_size/50K is even. 250 is a good number.
-      - Not sure if we are doing any extra or too few steps due to epoch bleed.
-    """
-    self._setup()
-    FLAGS.num_gpus = 8
-    FLAGS.data_dir = self.data_dir
-    FLAGS.batch_size = 256 * 8
-    FLAGS.train_epochs = 61
-    FLAGS.epochs_between_evals = 4
-    FLAGS.model_dir = self._get_model_dir('benchmark_8_gpu_mlperf_like_tweaked')
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = True
-    FLAGS.enable_xla = True
-    FLAGS.use_tensor_lr = True
-    FLAGS.tf_gpu_thread_mode = 'gpu_private'
-    self._run_and_report_benchmark(top_1_min=0.736)
-
  def benchmark_8_gpu_mlperf_like(self):
    """Test similar to the rules for MLPerf 0.5.

@@ -184,7 +158,7 @@ class Resnet50KerasAccuracy(keras_benchmark.KerasBenchmark):
                                top_1_min=MIN_TOP_1_ACCURACY,
                                top_1_max=MAX_TOP_1_ACCURACY):
    start_time_sec = time.time()
-    stats = keras_imagenet_main.run(flags.FLAGS)
+    stats = resnet_imagenet_main.run(flags.FLAGS)
    wall_time_sec = time.time() - start_time_sec

    super(Resnet50KerasAccuracy, self)._report_benchmark(
@@ -203,7 +177,7 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
  """Resnet50 benchmarks."""

  def __init__(self, output_dir=None, default_flags=None):
-    flag_methods = [keras_imagenet_main.define_imagenet_keras_flags]
+    flag_methods = [resnet_imagenet_main.define_imagenet_keras_flags]

    super(Resnet50KerasBenchmarkBase, self).__init__(
        output_dir=output_dir,
@@ -212,7 +186,7 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):

  def _run_and_report_benchmark(self):
    start_time_sec = time.time()
-    stats = keras_imagenet_main.run(FLAGS)
+    stats = resnet_imagenet_main.run(FLAGS)
    wall_time_sec = time.time() - start_time_sec
    # Number of logged step time entries that are excluded in performance
    # report. We keep results from last 100 batches in this case.
@@ -277,48 +251,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.batch_size = 64
    self._run_and_report_benchmark()

-  def benchmark_1_gpu_no_dist_strat_force_v1_path_run_eagerly(self):
-    """Forced v1 execution in tf.compile path and force eager."""
-    self._setup()
-
-    FLAGS.num_gpus = 1
-    FLAGS.enable_eager = True
-    FLAGS.run_eagerly = True
-    FLAGS.distribution_strategy = 'off'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_1_gpu_no_dist_strat_force_v1_path_run_eagerly')
-    FLAGS.batch_size = 64
-    FLAGS.force_v2_in_keras_compile = False
-    self._run_and_report_benchmark()
-
-  def benchmark_1_gpu_no_dist_strat_force_v1_path_run_eagerly_tweaked(self):
-    """Forced v1 execution in tf.compile path and force eager."""
-    self._setup()
-
-    FLAGS.num_gpus = 1
-    FLAGS.enable_eager = True
-    FLAGS.run_eagerly = True
-    FLAGS.explicit_gpu_placement = True
-    FLAGS.distribution_strategy = 'off'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_1_gpu_no_dist_strat_force_v1_path_run_eagerly_tweaked')
-    FLAGS.batch_size = 64
-    FLAGS.force_v2_in_keras_compile = False
-    self._run_and_report_benchmark()
-
-  def benchmark_1_gpu_no_dist_strat_force_v1_path(self):
-    """No dist strat but forced v1 execution tf.compile path."""
-    self._setup()
-
-    FLAGS.num_gpus = 1
-    FLAGS.enable_eager = True
-    FLAGS.distribution_strategy = 'off'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_1_gpu_no_dist_strat_force_v1_path')
-    FLAGS.batch_size = 128
-    FLAGS.force_v2_in_keras_compile = False
-    self._run_and_report_benchmark()
-
  def benchmark_1_gpu_no_dist_strat_run_eagerly_fp16(self):
    """Test with 1 GPU, no distribution strategy, fp16, run eagerly."""
    self._setup()
@@ -437,20 +369,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.tf_gpu_thread_mode = 'gpu_private'
    self._run_and_report_benchmark()

-  def benchmark_xla_1_gpu_fp16_slack(self):
-    """Test Keras model tf.data's experimental_slack functionality."""
-    self._setup()
-
-    FLAGS.num_gpus = 1
-    FLAGS.enable_eager = True
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir('benchmark_xla_1_gpu_fp16_slack')
-    FLAGS.dtype = 'fp16'
-    FLAGS.batch_size = 256
-    FLAGS.tf_data_experimental_slack = True
-    self._run_and_report_benchmark()
-
  def benchmark_xla_1_gpu_fp16_dynamic(self):
    """Test Keras model with XLA, 1 GPU, fp16, and dynamic loss scaling."""
    self._setup()
@@ -529,21 +447,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.tf_gpu_thread_mode = 'gpu_private'
    self._run_and_report_benchmark()

-  def benchmark_graph_xla_1_gpu_fp16_slack(self):
-    """Test model in legacy graph with tf.data's experimental_slack."""
-    self._setup()
-
-    FLAGS.num_gpus = 1
-    FLAGS.enable_eager = False
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_graph_xla_1_gpu_fp16_slack')
-    FLAGS.dtype = 'fp16'
-    FLAGS.batch_size = 256
-    FLAGS.tf_data_experimental_slack = True
-    self._run_and_report_benchmark()
-
  def benchmark_8_gpu(self):
    """Test Keras model with 8 GPUs."""
    self._setup()
@@ -568,18 +471,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.datasets_num_private_threads = 14
    self._run_and_report_benchmark()

-  def benchmark_8_gpu_slack(self):
-    """Test Keras model with tf.data's experimental_slack and 8 GPUs."""
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.enable_eager = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir('benchmark_8_gpu_slack')
-    FLAGS.batch_size = 128 * 8  # 8 GPUs
-    FLAGS.tf_data_experimental_slack = True
-    self._run_and_report_benchmark()
-
  def benchmark_xla_8_gpu(self):
    """Test Keras model with XLA and 8 GPUs."""
    self._setup()
@@ -649,24 +540,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.tf_gpu_thread_mode = 'gpu_private'
    self._run_and_report_benchmark()

-  def benchmark_xla_8_gpu_fp16_optional_next(self):
-    """Test Keras model with XLA, 8 GPUs and fp16.
-
-    This test also enables get_next_as_optional.
-    """
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = True
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_xla_8_gpu_fp16_optional_next')
-    FLAGS.batch_size = 256 * 8  # 8 GPUs
-    FLAGS.enable_get_next_as_optional = True
-    self._run_and_report_benchmark()
-
  def benchmark_xla_8_gpu_fp16(self):
    """Test Keras model with XLA, 8 GPUs and fp16."""
    self._setup()
@@ -716,44 +589,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.train_steps = 310
    self._run_and_report_benchmark()

-  def benchmark_xla_8_gpu_fp16_tweaked_optional_next(self):
-    """Test Keras model with manual config tuning, XLA, 8 GPUs, fp16.
-
-    This test also enables get_next_as_optional.
-    """
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = True
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_xla_8_gpu_fp16_tweaked_optional_next')
-    FLAGS.batch_size = 256 * 8  # 8 GPUs
-    FLAGS.use_tensor_lr = True
-    FLAGS.tf_gpu_thread_mode = 'gpu_private'
-    FLAGS.datasets_num_private_threads = 48
-    FLAGS.enable_get_next_as_optional = True
-    self._run_and_report_benchmark()
-
-  def benchmark_xla_8_gpu_fp16_slack(self):
-    """Test Keras model with XLA, 8 GPUs and fp16.
-
-    This test also enable tf.data's experimental_slack functionality.
-    """
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = True
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir('benchmark_xla_8_gpu_fp16_slack')
-    FLAGS.batch_size = 256 * 8  # 8 GPUs
-    FLAGS.tf_data_experimental_slack = True
-    self._run_and_report_benchmark()
-
  def benchmark_xla_8_gpu_fp16_dynamic_tweaked(self):
    """Test Keras model with config tuning, XLA, 8 GPUs and dynamic fp16."""
    self._setup()
@@ -772,24 +607,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.datasets_num_private_threads = 48
    self._run_and_report_benchmark()

-  def benchmark_xla_8_gpu_fp16_tensorboard_tweaked(self):
-    """Test to track Tensorboard performance overhead."""
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = True
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_xla_8_gpu_fp16_tensorboard_tweaked')
-    FLAGS.batch_size = 256 * 8  # 8 GPUs
-    FLAGS.use_tensor_lr = True
-    FLAGS.tf_gpu_thread_mode = 'gpu_private'
-    FLAGS.datasets_num_private_threads = 48
-    FLAGS.enable_tensorboard = True
-    self._run_and_report_benchmark()
-
  def benchmark_graph_8_gpu(self):
    """Test Keras model in legacy graph mode with 8 GPUs."""
    self._setup()
@@ -888,41 +705,6 @@ class Resnet50KerasBenchmarkBase(keras_benchmark.KerasBenchmark):
    FLAGS.train_steps = 310
    self._run_and_report_benchmark()

-  def benchmark_graph_xla_8_gpu_fp16_tweaked_optional_next(self):
-    """Test in legacy graph mode with manual config tuning, XLA, 8 GPUs, fp16.
-
-    This test also enables get_next_as_optional.
-    """
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = False
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_graph_xla_8_gpu_fp16_tweaked_optional_next')
-    FLAGS.batch_size = 256 * 8  # 8 GPUs
-    FLAGS.use_tensor_lr = True
-    FLAGS.tf_gpu_thread_mode = 'gpu_private'
-    FLAGS.enable_get_next_as_optional = True
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_xla_8_gpu_fp16_slack(self):
-    """Test legacy graph mode with tf.data's experimental_slack."""
-    self._setup()
-
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.enable_eager = False
-    FLAGS.enable_xla = True
-    FLAGS.distribution_strategy = 'default'
-    FLAGS.model_dir = self._get_model_dir(
-        'benchmark_graph_xla_8_gpu_fp16_slack')
-    FLAGS.batch_size = 256 * 8  # 8 GPUs
-    FLAGS.tf_data_experimental_slack = True
-    self._run_and_report_benchmark()
-
  def benchmark_graph_8_gpu_fp16_dynamic_tweaked(self):
    """Test graph Keras with config tuning, 8 GPUs and dynamic fp16."""
    self._setup()
@@ -997,7 +779,7 @@ class TrivialKerasBenchmarkReal(keras_benchmark.KerasBenchmark):
  """Trivial model with real data benchmark tests."""

  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
-    flag_methods = [keras_imagenet_main.define_imagenet_keras_flags]
+    flag_methods = [resnet_imagenet_main.define_imagenet_keras_flags]

    def_flags = {}
    def_flags['use_trivial_model'] = True
@@ -1017,7 +799,7 @@ class TrivialKerasBenchmarkReal(keras_benchmark.KerasBenchmark):

  def _run_and_report_benchmark(self):
    start_time_sec = time.time()
-    stats = keras_imagenet_main.run(FLAGS)
+    stats = resnet_imagenet_main.run(FLAGS)
    wall_time_sec = time.time() - start_time_sec

    super(TrivialKerasBenchmarkReal, self)._report_benchmark(
@@ -1114,5 +896,111 @@ class TrivialKerasBenchmarkReal(keras_benchmark.KerasBenchmark):
        log_steps=FLAGS.log_steps)


+class Resnet50MultiWorkerKerasBenchmark(Resnet50KerasBenchmarkBase):
+  """Resnet50 distributed benchmark tests with multiple workers."""
+
+  def __init__(self, output_dir=None, default_flags=None):
+    super(Resnet50MultiWorkerKerasBenchmark, self).__init__(
+        output_dir=output_dir, default_flags=default_flags)
+
+  def _benchmark_common(self, eager, num_workers, all_reduce_alg):
+    """Common to all benchmarks in this class."""
+    self._setup()
+
+    num_gpus = 8
+    FLAGS.num_gpus = num_gpus
+    FLAGS.dtype = 'fp16'
+    FLAGS.enable_eager = eager
+    FLAGS.enable_xla = False
+    FLAGS.distribution_strategy = 'multi_worker_mirrored'
+    FLAGS.use_tensor_lr = True
+    FLAGS.tf_gpu_thread_mode = 'gpu_private'
+    FLAGS.model_dir = self._get_model_dir(
+        'benchmark_graph_8_gpu_{}_worker_fp16_{}_tweaked'.format(
+            num_workers, all_reduce_alg))
+    FLAGS.batch_size = 256 * num_gpus * num_workers
+    FLAGS.all_reduce_alg = all_reduce_alg
+
+    self._run_and_report_benchmark()
+
+  def benchmark_graph_8_gpu_1_worker_fp16_ring_tweaked(self):
+    """Legacy graph, 8 GPUs per worker, 1 worker, fp16, ring all-reduce."""
+    self._benchmark_common(eager=False, num_workers=1, all_reduce_alg='ring')
+
+  def benchmark_graph_8_gpu_1_worker_fp16_nccl_tweaked(self):
+    """Legacy graph, 8 GPUs per worker, 1 worker, fp16, nccl all-reduce."""
+    self._benchmark_common(eager=False, num_workers=1, all_reduce_alg='nccl')
+
+  def benchmark_graph_8_gpu_2_workers_fp16_ring_tweaked(self):
+    """Legacy graph, 8 GPUs per worker, 2 workers, fp16, ring all-reduce."""
+    self._benchmark_common(eager=False, num_workers=2, all_reduce_alg='ring')
+
+  def benchmark_graph_8_gpu_2_workers_fp16_nccl_tweaked(self):
+    """Legacy graph, 8 GPUs per worker, 2 workers, fp16, nccl all-reduce."""
+    self._benchmark_common(eager=False, num_workers=2, all_reduce_alg='nccl')
+
+  def benchmark_graph_8_gpu_8_workers_fp16_ring_tweaked(self):
+    """Legacy graph, 8 GPUs per worker, 8 workers, fp16, ring all-reduce."""
+    self._benchmark_common(eager=False, num_workers=8, all_reduce_alg='ring')
+
+  def benchmark_graph_8_gpu_8_workers_fp16_nccl_tweaked(self):
+    """Legacy graph, 8 GPUs per worker, 8 workers, fp16, nccl all-reduce."""
+    self._benchmark_common(eager=False, num_workers=8, all_reduce_alg='nccl')
+
+  def benchmark_eager_8_gpu_1_worker_fp16_ring_tweaked(self):
+    """Eager, 8 GPUs per worker, 1 worker, fp16, ring all-reduce."""
+    self._benchmark_common(eager=True, num_workers=1, all_reduce_alg='ring')
+
+  def benchmark_eager_8_gpu_1_worker_fp16_nccl_tweaked(self):
+    """Eager, 8 GPUs per worker, 1 worker, fp16, nccl all-reduce."""
+    self._benchmark_common(eager=True, num_workers=1, all_reduce_alg='nccl')
+
+  def benchmark_eager_8_gpu_2_workers_fp16_ring_tweaked(self):
+    """Eager, 8 GPUs per worker, 2 workers, fp16, ring all-reduce."""
+    self._benchmark_common(eager=True, num_workers=2, all_reduce_alg='ring')
+
+  def benchmark_eager_8_gpu_2_workers_fp16_nccl_tweaked(self):
+    """Eager, 8 GPUs per worker, 2 workers, fp16, nccl all-reduce."""
+    self._benchmark_common(eager=True, num_workers=2, all_reduce_alg='nccl')
+
+  def benchmark_eager_8_gpu_8_workers_fp16_ring_tweaked(self):
+    """Eager, 8 GPUs per worker, 8 workers, fp16, ring all-reduce."""
+    self._benchmark_common(eager=True, num_workers=8, all_reduce_alg='ring')
+
+  def benchmark_eager_8_gpu_8_workers_fp16_nccl_tweaked(self):
+    """Eager, 8 GPUs per worker, 8 workers, fp16, nccl all-reduce."""
+    self._benchmark_common(eager=True, num_workers=8, all_reduce_alg='nccl')
+
+
+class Resnet50MultiWorkerKerasBenchmarkSynth(Resnet50MultiWorkerKerasBenchmark):
+  """Resnet50 multi-worker synthetic data benchmark tests."""
+
+  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
+    def_flags = {}
+    def_flags['skip_eval'] = True
+    def_flags['report_accuracy_metrics'] = False
+    def_flags['use_synthetic_data'] = True
+    def_flags['train_steps'] = 110
+    def_flags['log_steps'] = 10
+
+    super(Resnet50MultiWorkerKerasBenchmarkSynth, self).__init__(
+        output_dir=output_dir, default_flags=def_flags)
+
+
+class Resnet50MultiWorkerKerasBenchmarkReal(Resnet50MultiWorkerKerasBenchmark):
+  """Resnet50 multi-worker real data benchmark tests."""
+
+  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
+    def_flags = {}
+    def_flags['skip_eval'] = True
+    def_flags['report_accuracy_metrics'] = False
+    def_flags['data_dir'] = os.path.join(root_data_dir, 'imagenet')
+    def_flags['train_steps'] = 110
+    def_flags['log_steps'] = 10
+
+    super(Resnet50MultiWorkerKerasBenchmarkReal, self).__init__(
+        output_dir=output_dir, default_flags=def_flags)
+
+
 if __name__ == '__main__':
  tf.test.main()
--- a/official/boosted_trees/__init__.py
+++ b/official/boosted_trees/__init__.py
--- a/official/resnet/keras/trivial_model.py
+++ b/official/resnet/keras/trivial_model.py
--- a/official/bert/README.md
+++ b/official/bert/README.md
 # BERT (Bidirectional Encoder Representations from Transformers)

-Note> Please do not create pull request. This model is still under development
-and testing.
-
 The academic paper which describes BERT in detail and provides full results on a
 number of tasks can be found here: https://arxiv.org/abs/1810.04805.

@@ -30,6 +27,31 @@ Our current released checkpoints are exactly the same as TF 1.x official BERT
 repository, thus inside `BertConfig`, there is `backward_compatible=True`. We
 are going to release new pre-trained checkpoints soon.

+### Access to Pretrained Checkpoints
+
+We provide checkpoints that are converted from [google-research/bert](https://github.com/google-research/bert),
+in order to keep consistent with BERT paper.
+
+*   **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/wwm_uncased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/wwm_cased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Uncased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/uncased_L-12_H-768_A-12.tar.gz)**:
+    12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Large, Uncased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/uncased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/cased_L-12_H-768_A-12.tar.gz)**:
+    12-layer, 768-hidden, 12-heads , 110M parameters
+*   **[`BERT-Large, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/tf_20/cased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+
+We recommend to host checkpoints on Google Cloud storage buckets when you use
+Cloud GPU/TPU. For example, in the following tutorial, we use:
+
+```shell
+export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/tf_20/uncased_L-24_H-1024_A-16
+```
+
 ### Restoring from Checkpoints

 `tf.train.Checkpoint` is used to manage model checkpoints in TF 2.0. To restore
@@ -70,9 +92,9 @@ Second, you need to install TF 2.0 `tf-night` on your VM:
 pip install tf-nightly-2.0-preview
 ```

-Warning: More details TPU-specific set-up instructions and tutorial for TF 2.0
-are coming. Note that this repo is not officially supported by Google Cloud TPU
-team yet.
+Warning: More details TPU-specific set-up instructions and tutorial should come
+along with official TF 2.x release for TPU. Note that this repo is not officially
+supported by Google Cloud TPU team yet.

 ## Process Datasets


--- a/official/bert/common_flags.py
+++ b/official/bert/common_flags.py
@@ -32,10 +32,10 @@ def define_common_bert_flags():
      'init_checkpoint', None,
      'Initial checkpoint (usually from a pre-trained BERT model).')
  flags.DEFINE_enum(
-      'strategy_type', 'mirror', ['tpu', 'mirror'],
+      'strategy_type', 'mirror', ['tpu', 'mirror', 'multi_worker_mirror'],
      'Distribution Strategy type to use for training. `tpu` uses '
-      'TPUStrategy for running on TPUs, `mirror` uses GPUs with '
-      'single host.')
+      'TPUStrategy for running on TPUs, `mirror` uses GPUs with single host, '
+      '`multi_worker_mirror` uses CPUs or GPUs with multiple hosts.')
  flags.DEFINE_integer('num_train_epochs', 3,
                       'Total number of training epochs to perform.')
  flags.DEFINE_integer(

--- a/official/bert/model_training_utils.py
+++ b/official/bert/model_training_utils.py
@@ -25,18 +25,12 @@ from absl import logging
 import tensorflow as tf
 from tensorflow.python.util import object_identity
 from official.utils.misc import distribution_utils
+from official.utils.misc import tpu_lib

 _SUMMARY_TXT = 'training_summary.txt'
 _MIN_SUMMARY_STEPS = 10


-def get_primary_cpu_task(use_remote_tpu=False):
-  """Returns primary CPU task to which input pipeline Ops are put."""
-
-  # Remote Eager Borg job configures the TPU worker with job name 'worker'.
-  return '/job:worker' if use_remote_tpu else ''
-
-
 def _save_checkpoint(checkpoint, model_dir, checkpoint_prefix):
  """Saves model to with provided checkpoint prefix."""

@@ -195,7 +189,7 @@ def run_customized_training_loop(

  # To reduce unnecessary send/receive input pipeline operation, we place input
  # pipeline ops in worker task.
-  with tf.device(get_primary_cpu_task(use_remote_tpu)):
+  with tf.device(tpu_lib.get_primary_cpu_task(use_remote_tpu)):
    train_iterator = _get_input_iterator(train_input_fn, strategy)

    with distribution_utils.get_strategy_scope(strategy):

--- a/official/bert/modeling.py
+++ b/official/bert/modeling.py
@@ -165,6 +165,7 @@ class BertModel(tf.keras.layers.Layer):
        max_position_embeddings=self.config.max_position_embeddings,
        dropout_prob=self.config.hidden_dropout_prob,
        initializer_range=self.config.initializer_range,
+        dtype=tf.float32,
        name="embedding_postprocessor")
    self.encoder = Transformer(
        num_hidden_layers=self.config.num_hidden_layers,
@@ -316,8 +317,9 @@ class EmbeddingPostprocessor(tf.keras.layers.Layer):
          dtype=self.dtype)

    self.output_layer_norm = tf.keras.layers.LayerNormalization(
-        name="layer_norm", axis=-1, epsilon=1e-12)
-    self.output_dropout = tf.keras.layers.Dropout(rate=self.dropout_prob)
+        name="layer_norm", axis=-1, epsilon=1e-12, dtype=tf.float32)
+    self.output_dropout = tf.keras.layers.Dropout(rate=self.dropout_prob,
+                                                  dtype=tf.float32)
    super(EmbeddingPostprocessor, self).build(input_shapes)

  def __call__(self, word_embeddings, token_type_ids=None, **kwargs):
@@ -714,11 +716,15 @@ class TransformerBlock(tf.keras.layers.Layer):
        rate=self.hidden_dropout_prob)
    self.attention_layer_norm = (
        tf.keras.layers.LayerNormalization(
-            name="self_attention_layer_norm", axis=-1, epsilon=1e-12))
+            name="self_attention_layer_norm", axis=-1, epsilon=1e-12,
+            # We do layer norm in float32 for numeric stability.
+            dtype=tf.float32))
    self.intermediate_dense = Dense2DProjection(
        output_size=self.intermediate_size,
        kernel_initializer=get_initializer(self.initializer_range),
        activation=self.intermediate_activation,
+        # Uses float32 so that gelu activation is done in float32.
+        dtype=tf.float32,
        name="intermediate")
    self.output_dense = Dense2DProjection(
        output_size=self.hidden_size,
@@ -726,7 +732,7 @@ class TransformerBlock(tf.keras.layers.Layer):
        name="output")
    self.output_dropout = tf.keras.layers.Dropout(rate=self.hidden_dropout_prob)
    self.output_layer_norm = tf.keras.layers.LayerNormalization(
-        name="output_layer_norm", axis=-1, epsilon=1e-12)
+        name="output_layer_norm", axis=-1, epsilon=1e-12, dtype=tf.float32)
    super(TransformerBlock, self).build(unused_input_shapes)

  def common_layers(self):
@@ -753,6 +759,10 @@ class TransformerBlock(tf.keras.layers.Layer):
    attention_output = self.attention_dropout(attention_output)
    # Use float32 in keras layer norm and the gelu activation in the
    # intermediate dense layer for numeric stability
+    # TODO(reedwm): These casts are probably unnecessary, as we passed
+    # dtype=tf.float32 to the layer norm constructor, so it will cast its inputs
+    # to float32 automatically. These manual casts additionally do the "+"
+    # operator in float32, but "+" is numerically stable in float16.
    if self.float_type == tf.float16:
      input_tensor = tf.cast(input_tensor, tf.float32)
      attention_output = tf.cast(attention_output, tf.float32)

--- a/official/bert/optimization.py
+++ b/official/bert/optimization.py
@@ -105,12 +105,14 @@ class AdamWeightDecay(tf.keras.optimizers.Adam):
               epsilon=1e-7,
               amsgrad=False,
               weight_decay_rate=0.0,
+               include_in_weight_decay=None,
               exclude_from_weight_decay=None,
               name='AdamWeightDecay',
               **kwargs):
    super(AdamWeightDecay, self).__init__(
        learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)
    self.weight_decay_rate = weight_decay_rate
+    self._include_in_weight_decay = include_in_weight_decay
    self._exclude_from_weight_decay = exclude_from_weight_decay

  @classmethod
@@ -178,6 +180,12 @@ class AdamWeightDecay(tf.keras.optimizers.Adam):
    """Whether to use L2 weight decay for `param_name`."""
    if self.weight_decay_rate == 0:
      return False
+
+    if self._include_in_weight_decay:
+      for r in self._include_in_weight_decay:
+        if re.search(r, param_name) is not None:
+          return True
+
    if self._exclude_from_weight_decay:
      for r in self._exclude_from_weight_decay:
        if re.search(r, param_name) is not None:

--- a/official/bert/run_classifier.py
+++ b/official/bert/run_classifier.py
@@ -210,7 +210,7 @@ def run_bert(strategy, input_meta_data):
      run_eagerly=FLAGS.run_eagerly)

  if FLAGS.model_export_path:
-    with tf.device(model_training_utils.get_primary_cpu_task(use_remote_tpu)):
+    with tf.device(tpu_lib.get_primary_cpu_task(use_remote_tpu)):
      model_saving_utils.export_bert_model(
          FLAGS.model_export_path, model=trained_model)
  return trained_model

--- a/official/bert/run_squad.py
+++ b/official/bert/run_squad.py
@@ -139,6 +139,8 @@ def predict_squad_customized(strategy, input_meta_data, bert_config,
        strategy.experimental_distribute_dataset(predict_dataset))

    with strategy.scope():
+      # Prediction always uses float32, even if training uses mixed precision.
+      tf.keras.mixed_precision.experimental.set_policy('float32')
      squad_model, _ = bert_models.squad_model(
          bert_config, input_meta_data['max_seq_length'], float_type=tf.float32)

@@ -187,7 +189,7 @@ def train_squad(strategy,

  use_float16 = common_flags.use_float16()
  if use_float16:
-    policy = tf.keras.mixed_precision.experimental.Policy('infer_float32_vars')
+    policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
    tf.keras.mixed_precision.experimental.set_policy(policy)

  bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
@@ -212,6 +214,9 @@ def train_squad(strategy,
    squad_model.optimizer = optimization.create_optimizer(
        FLAGS.learning_rate, steps_per_epoch * epochs, warmup_steps)
    if use_float16:
+      # Wraps optimizer with a LossScaleOptimizer. This is done automatically
+      # in compile() with the "mixed_float16" policy, but since we do not call
+      # compile(), we must wrap the optimizer manually.
      squad_model.optimizer = (
          tf.keras.mixed_precision.experimental.LossScaleOptimizer(
              squad_model.optimizer, loss_scale=common_flags.get_loss_scale()))
@@ -316,6 +321,8 @@ def main(_):
  strategy = None
  if FLAGS.strategy_type == 'mirror':
    strategy = tf.distribute.MirroredStrategy()
+  elif FLAGS.strategy_type == 'multi_worker_mirror':
+    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  elif FLAGS.strategy_type == 'tpu':
    # Initialize TPU System.
    cluster_resolver = tpu_lib.tpu_initialize(FLAGS.tpu)

--- a/official/keras_application_models/README.md
+++ b/official/keras_application_models/README.md
-# Keras Application Models Benchmark
-## Overview
-This provides a single scaffold to benchmark the Keras built-in application [models](https://keras.io/applications/). All the models are for image classification applications, and include:
-
- - Xception
- - VGG16
- - VGG19
- - ResNet50
- - InceptionV3
- - InceptionResNetV2
- - MobileNet
- - DenseNet
- - NASNet
-
-## Dataset
-Synthetic dataset is used for the benchmark.
-
-## Callbacks
-Two custom callbacks are provided for model benchmarking: ExamplesPerSecondCallback and LoggingMetricCallback. For each callback, `epoch_based` and `batch_based` options are available to set the benchmark level. Check [model_callbacks.py](model_callbacks.py) for more details.
-
-## Running Code
-To benchmark a model, use `--model` to specify the model name. To perform the benchmark with eager execution, issue the following command:
-```
-python benchmark_main.py --model resnet50 --eager
-```
-Note that, if eager execution is enabled, only one GPU is utilized even if multiple GPUs are provided and multi_gpu_model is used.
-
-
-To use distribution strategy in the benchmark, run the following:
-```
-python benchmark_main.py --model resnet50 --dist_strat
-```
-Currently, only one of the --eager and --dist_strat arguments can be defined, as DistributionStrategy is not supported in Eager execution now.
-
-Arguments:
-  * `--model`: Which model to be benchmarked. The model name is defined as the keys of `MODELS` in [benchmark_main.py](benchmark_main.py).
-  * `--callbacks`: To specify a list of callbacks.
-
-Use the `--help` or `-h` flag to get a full list of possible arguments.
--- a/official/keras_application_models/benchmark_main.py
+++ b/official/keras_application_models/benchmark_main.py
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Benchmark on the keras built-in application models."""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-# pylint: disable=g-bad-import-order
-import numpy as np
-from absl import app as absl_app
-from absl import flags
-import tensorflow as tf
-# pylint: enable=g-bad-import-order
-
-from official.keras_application_models import dataset
-from official.keras_application_models import model_callbacks
-from official.utils.flags import core as flags_core
-from official.utils.logs import logger
-from official.utils.misc import distribution_utils
-
-# Define a dictionary that maps model names to their model classes inside Keras
-MODELS = {
-    "vgg16": tf.keras.applications.VGG16,
-    "vgg19": tf.keras.applications.VGG19,
-    "inceptionv3": tf.keras.applications.InceptionV3,
-    "xception": tf.keras.applications.Xception,
-    "resnet50": tf.keras.applications.ResNet50,
-    "inceptionresnetv2": tf.keras.applications.InceptionResNetV2,
-    "mobilenet": tf.keras.applications.MobileNet,
-    "densenet121": tf.keras.applications.DenseNet121,
-    "densenet169": tf.keras.applications.DenseNet169,
-    "densenet201": tf.keras.applications.DenseNet201,
-    "nasnetlarge": tf.keras.applications.NASNetLarge,
-    "nasnetmobile": tf.keras.applications.NASNetMobile,
-}
-
-
-def run_keras_model_benchmark(_):
-  """Run the benchmark on keras model."""
-  # Ensure a valid model name was supplied via command line argument
-  if FLAGS.model not in MODELS.keys():
-    raise AssertionError("The --model command line argument should "
-                         "be a key in the `MODELS` dictionary.")
-
-  # Check if eager execution is enabled
-  if FLAGS.eager:
-    tf.logging.info("Eager execution is enabled...")
-    tf.enable_eager_execution()
-
-  # Load the model
-  tf.logging.info("Benchmark on {} model...".format(FLAGS.model))
-  keras_model = MODELS[FLAGS.model]
-
-  # Get dataset
-  dataset_name = "ImageNet"
-  if FLAGS.use_synthetic_data:
-    tf.logging.info("Using synthetic dataset...")
-    dataset_name += "_Synthetic"
-    train_dataset = dataset.generate_synthetic_input_dataset(
-        FLAGS.model, FLAGS.batch_size)
-    val_dataset = dataset.generate_synthetic_input_dataset(
-        FLAGS.model, FLAGS.batch_size)
-    model = keras_model(weights=None)
-  else:
-    tf.logging.info("Using CIFAR-10 dataset...")
-    dataset_name = "CIFAR-10"
-    ds = dataset.Cifar10Dataset(FLAGS.batch_size)
-    train_dataset = ds.train_dataset
-    val_dataset = ds.test_dataset
-    model = keras_model(
-        weights=None, input_shape=ds.input_shape, classes=ds.num_classes)
-
-  num_gpus = flags_core.get_num_gpus(FLAGS)
-
-  distribution = None
-  # Use distribution strategy
-  if FLAGS.dist_strat:
-    distribution = distribution_utils.get_distribution_strategy(
-        distribution_strategy=FLAGS.distribution_strategy,
-        num_gpus=num_gpus)
-  elif num_gpus > 1:
-    # Run with multi_gpu_model
-    # If eager execution is enabled, only one GPU is utilized even if multiple
-    # GPUs are provided.
-    if FLAGS.eager:
-      tf.logging.warning(
-          "{} GPUs are provided, but only one GPU is utilized as "
-          "eager execution is enabled.".format(num_gpus))
-    model = tf.keras.utils.multi_gpu_model(model, gpus=num_gpus)
-
-  # Adam optimizer and some other optimizers doesn't work well with
-  # distribution strategy (b/113076709)
-  # Use GradientDescentOptimizer here
-  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
-  model.compile(loss="categorical_crossentropy",
-                optimizer=optimizer,
-                metrics=["accuracy"],
-                distribute=distribution)
-
-  # Create benchmark logger for benchmark logging
-  run_params = {
-      "batch_size": FLAGS.batch_size,
-      "synthetic_data": FLAGS.use_synthetic_data,
-      "train_epochs": FLAGS.train_epochs,
-      "num_train_images": FLAGS.num_train_images,
-      "num_eval_images": FLAGS.num_eval_images,
-  }
-
-  benchmark_logger = logger.get_benchmark_logger()
-  benchmark_logger.log_run_info(
-      model_name=FLAGS.model,
-      dataset_name=dataset_name,
-      run_params=run_params,
-      test_id=FLAGS.benchmark_test_id)
-
-  # Create callbacks that log metric values about the training and evaluation
-  callbacks = model_callbacks.get_model_callbacks(
-      FLAGS.callbacks,
-      batch_size=FLAGS.batch_size,
-      metric_logger=benchmark_logger)
-  # Train and evaluate the model
-  history = model.fit(
-      train_dataset,
-      epochs=FLAGS.train_epochs,
-      callbacks=callbacks,
-      validation_data=val_dataset,
-      steps_per_epoch=int(np.ceil(FLAGS.num_train_images / FLAGS.batch_size)),
-      validation_steps=int(np.ceil(FLAGS.num_eval_images / FLAGS.batch_size))
-  )
-
-  tf.logging.info("Logging the evaluation results...")
-  for epoch in range(FLAGS.train_epochs):
-    eval_results = {
-        "accuracy": history.history["val_acc"][epoch],
-        "loss": history.history["val_loss"][epoch],
-        tf.GraphKeys.GLOBAL_STEP: (epoch + 1) * np.ceil(
-            FLAGS.num_eval_images/FLAGS.batch_size)
-    }
-    benchmark_logger.log_evaluation_result(eval_results)
-
-  # Clear the session explicitly to avoid session delete error
-  tf.keras.backend.clear_session()
-
-
-def define_keras_benchmark_flags():
-  """Add flags for keras built-in application models."""
-  flags_core.define_base(hooks=False)
-  flags_core.define_performance()
-  flags_core.define_image()
-  flags_core.define_benchmark()
-  flags.adopt_module_key_flags(flags_core)
-
-  flags_core.set_defaults(
-      data_format="channels_last",
-      use_synthetic_data=True,
-      batch_size=32,
-      train_epochs=2)
-
-  flags.DEFINE_enum(
-      name="model", default=None,
-      enum_values=MODELS.keys(), case_sensitive=False,
-      help=flags_core.help_wrap(
-          "Model to be benchmarked."))
-
-  flags.DEFINE_integer(
-      name="num_train_images", default=1000,
-      help=flags_core.help_wrap(
-          "The number of synthetic images for training. The default value is "
-          "1000."))
-
-  flags.DEFINE_integer(
-      name="num_eval_images", default=50,
-      help=flags_core.help_wrap(
-          "The number of synthetic images for evaluation. The default value is "
-          "50."))
-
-  flags.DEFINE_boolean(
-      name="eager", default=False, help=flags_core.help_wrap(
-          "To enable eager execution. Note that if eager execution is enabled, "
-          "only one GPU is utilized even if multiple GPUs are provided and "
-          "multi_gpu_model is used."))
-
-  flags.DEFINE_boolean(
-      name="dist_strat", default=False, help=flags_core.help_wrap(
-          "To enable distribution strategy for model training and evaluation. "
-          "Number of GPUs used for distribution strategy can be set by the "
-          "argument --num_gpus."))
-
-  flags.DEFINE_list(
-      name="callbacks",
-      default=["ExamplesPerSecondCallback", "LoggingMetricCallback"],
-      help=flags_core.help_wrap(
-          "A list of (case insensitive) strings to specify the names of "
-          "callbacks. For example: `--callbacks ExamplesPerSecondCallback,"
-          "LoggingMetricCallback`"))
-
-  @flags.multi_flags_validator(
-      ["eager", "dist_strat"],
-      message="Both --eager and --dist_strat were set. Only one can be "
-              "defined, as DistributionStrategy is not supported in Eager "
-              "execution currently.")
-  # pylint: disable=unused-variable
-  def _check_eager_dist_strat(flag_dict):
-    return not(flag_dict["eager"] and flag_dict["dist_strat"])
-
-
-def main(_):
-  with logger.benchmark_context(FLAGS):
-    run_keras_model_benchmark(FLAGS)
-
-
-if __name__ == "__main__":
-  tf.logging.set_verbosity(tf.logging.INFO)
-  define_keras_benchmark_flags()
-  FLAGS = flags.FLAGS
-  absl_app.run(main)
--- a/official/keras_application_models/dataset.py
+++ b/official/keras_application_models/dataset.py
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Prepare dataset for keras model benchmark."""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-import tensorflow as tf
-from official.utils.misc import model_helpers  # pylint: disable=g-bad-import-order
-
-# Default values for dataset.
-_NUM_CHANNELS = 3
-_NUM_CLASSES = 1000
-
-
-def _get_default_image_size(model):
-  """Provide default image size for each model."""
-  image_size = (224, 224)
-  if model in ["inceptionv3", "xception", "inceptionresnetv2"]:
-    image_size = (299, 299)
-  elif model in ["nasnetlarge"]:
-    image_size = (331, 331)
-  return image_size
-
-
-def generate_synthetic_input_dataset(model, batch_size):
-  """Generate synthetic dataset."""
-  image_size = _get_default_image_size(model)
-  image_shape = (batch_size,) + image_size + (_NUM_CHANNELS,)
-  label_shape = (batch_size, _NUM_CLASSES)
-
-  dataset = model_helpers.generate_synthetic_data(
-      input_shape=tf.TensorShape(image_shape),
-      label_shape=tf.TensorShape(label_shape),
-  )
-  return dataset
-
-
-class Cifar10Dataset(object):
-  """CIFAR10 dataset, including train and test set.
-
-  Each sample consists of a 32x32 color image, and label is from 10 classes.
-  """
-
-  def __init__(self, batch_size):
-    """Initializes train/test datasets.
-
-    Args:
-      batch_size: int, the number of batch size.
-    """
-    self.input_shape = (32, 32, 3)
-    self.num_classes = 10
-    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
-    x_train, x_test = x_train / 255.0, x_test / 255.0
-    y_train, y_test = y_train.astype(np.int64), y_test.astype(np.int64)
-    y_train = tf.keras.utils.to_categorical(y_train, self.num_classes)
-    y_test = tf.keras.utils.to_categorical(y_test, self.num_classes)
-    self.train_dataset = tf.data.Dataset.from_tensor_slices(
-        (x_train, y_train)).shuffle(2000).batch(batch_size).repeat()
-    self.test_dataset = tf.data.Dataset.from_tensor_slices(
-        (x_test, y_test)).shuffle(2000).batch(batch_size).repeat()
--- a/official/keras_application_models/model_callbacks.py
+++ b/official/keras_application_models/model_callbacks.py
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Callbacks for Keras built-in application models.
-
-Note that, in the callbacks, the global_step is initialized in the __init__ of
-each callback rather than on_train_begin. As on_train_begin gets called in
-the fit_loop, and it will be reset with each call to fit(). To keep the
-global_step persistent across all training sessions, it should be initialized in
-the __init__.
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import time
-
-import tensorflow as tf  # pylint: disable=g-bad-import-order
-
-from official.utils.logs import logger
-
-# Metrics to log after each batch and epoch
-_PER_BATCH_METRICS = {
-    "loss": "train_loss",
-    "acc": "train_accuracy",
-}
-_PER_EPOCH_METRICS = {
-    "loss": "train_loss",
-    "acc": "train_accuracy",
-    "val_loss": "loss",
-    "val_acc": "accuracy"
-}
-
-
-class ExamplesPerSecondCallback(tf.keras.callbacks.Callback):
-  """ExamplesPerSecond callback.
-
-  This callback records the average_examples_per_sec and
-  current_examples_per_sec during training.
-  """
-
-  def __init__(self, batch_size, every_n_steps=1, metric_logger=None):
-    self._batch_size = batch_size
-    self._every_n_steps = every_n_steps
-    self._logger = metric_logger or logger.BaseBenchmarkLogger()
-    self._global_step = 0  # Initialize it in __init__
-    super(ExamplesPerSecondCallback, self).__init__()
-
-  def on_train_begin(self, logs=None):
-    self._train_start_time = time.time()
-    self._last_recorded_time = time.time()
-
-  def on_batch_end(self, batch, logs=None):
-    """Log the examples_per_sec metric every_n_steps."""
-    self._global_step += 1
-    current_time = time.time()
-
-    if self._global_step % self._every_n_steps == 0:
-      average_examples_per_sec = self._batch_size * (
-          self._global_step / (current_time - self._train_start_time))
-      self._logger.log_metric(
-          "average_examples_per_sec", average_examples_per_sec,
-          global_step=self._global_step)
-
-      current_examples_per_sec = self._batch_size * (
-          self._every_n_steps / (current_time - self._last_recorded_time))
-      self._logger.log_metric(
-          "current_examples_per_sec", current_examples_per_sec,
-          global_step=self._global_step)
-      self._last_recorded_time = current_time  # Update last_recorded_time
-
-
-class LoggingMetricCallback(tf.keras.callbacks.Callback):
-  """LoggingMetric callback.
-
-  Log the predefined _PER_BATCH_METRICS after each batch, and log the predefined
-  _PER_EPOCH_METRICS after each epoch.
-  """
-
-  def __init__(self, metric_logger=None):
-    self._logger = metric_logger or logger.BaseBenchmarkLogger()
-    self._per_batch_metrics = _PER_BATCH_METRICS
-    self._per_epoch_metrics = _PER_EPOCH_METRICS
-    self._global_step = 0  # Initialize it in __init__
-    super(LoggingMetricCallback, self).__init__()
-
-  def on_batch_end(self, batch, logs=None):
-    """Log metrics after each batch."""
-    self._global_step += 1
-    for metric in _PER_BATCH_METRICS:
-      self._logger.log_metric(
-          _PER_BATCH_METRICS[metric],
-          logs.get(metric),
-          global_step=self._global_step)
-
-  def on_epoch_end(self, epoch, logs=None):
-    """Log metrics after each epoch."""
-    for metric in _PER_EPOCH_METRICS:
-      self._logger.log_metric(
-          _PER_EPOCH_METRICS[metric],
-          logs.get(metric),
-          global_step=self._global_step)
-
-
-def get_model_callbacks(name_list, **kwargs):
-  """Factory for getting a list of TensorFlow hooks for training by name.
-
-  Args:
-    name_list: a list of strings to name desired callback classes. Allowed:
-      ExamplesPerSecondCallback, LoggingMetricCallback, which are defined
-      as keys in CALLBACKS.
-    **kwargs: a dictionary of arguments to the callbacks.
-
-  Returns:
-    list of instantiated callbacks, ready to be used in a classifier.train call.
-
-  Raises:
-    ValueError: if an unrecognized name is passed.
-  """
-
-  if not name_list:
-    return []
-
-  callbacks = []
-  for name in name_list:
-    callback_name = CALLBACKS.get(name.strip().lower())
-    if callback_name is None:
-      raise ValueError(
-          "Unrecognized training callback requested: {}".format(name))
-    else:
-      callbacks.append(callback_name(**kwargs))
-
-  return callbacks
-
-
-def get_examples_per_second_callback(
-    every_n_steps=1, batch_size=32, metric_logger=None, **kwargs):  # pylint: disable=unused-argument
-  """Function to get ExamplesPerSecondCallback."""
-  return ExamplesPerSecondCallback(
-      batch_size=batch_size, every_n_steps=every_n_steps,
-      metric_logger=metric_logger or logger.get_benchmark_logger())
-
-
-def get_logging_metric_callback(metric_logger=None, **kwargs):  # pylint: disable=unused-argument
-  """Function to get LoggingMetricCallback."""
-  return LoggingMetricCallback(
-      metric_logger=metric_logger or logger.get_benchmark_logger())
-
-
-# A dictionary to map the callback name and its corresponding function
-CALLBACKS = {
-    "examplespersecondcallback": get_examples_per_second_callback,
-    "loggingmetriccallback": get_logging_metric_callback,
-}
--- a/official/mnist/mnist.py
+++ b/official/mnist/mnist.py
@@ -89,7 +89,8 @@ def create_model(data_format):

 def define_mnist_flags():
  flags_core.define_base()
-  flags_core.define_performance(num_parallel_calls=False)
+  flags_core.define_performance(inter_op=True, intra_op=True,
+                                num_parallel_calls=False)
  flags_core.define_image()
  flags.adopt_module_key_flags(flags_core)
  flags_core.set_defaults(data_dir='/tmp/mnist_data',

--- a/official/boosted_trees/README.md
+++ b/official/boosted_trees/README.md
--- a/official/keras_application_models/__init__.py
+++ b/official/keras_application_models/__init__.py