Commit 47bc1813 authored by syiming

Merge remote-tracking branch 'upstream/master' into add_multilevel_crop_and_resize

parents d8611151 b035a227
> :memo: A README.md template for releasing a paper code implementation to a GitHub repository.
>
> * Template version: 1.0.2020.125
> * Template version: 1.0.2020.170
> * Please modify sections depending on needs.
# Model name, Paper title, or Project Name
> :memo: Add a badge for the ArXiv identifier of your paper (arXiv:YYMM.NNNNN)
[![Paper](http://img.shields.io/badge/paper-arXiv.YYMM.NNNNN-B3181B.svg)](https://arxiv.org/abs/...)
[![Paper](http://img.shields.io/badge/Paper-arXiv.YYMM.NNNNN-B3181B?logo=arXiv)](https://arxiv.org/abs/...)
This repository is the official or unofficial implementation of the following paper.
@@ -28,8 +28,8 @@ This repository is the official or unofficial implementation of the following pa
> :memo: Provide maintainer information.
* Last name, First name ([@GitHub username](https://github.com/username))
* Last name, First name ([@GitHub username](https://github.com/username))
* Full name ([@GitHub username](https://github.com/username))
* Full name ([@GitHub username](https://github.com/username))
## Table of Contents
@@ -37,8 +37,8 @@ This repository is the official or unofficial implementation of the following pa
## Requirements
[![TensorFlow 2.1](https://img.shields.io/badge/tensorflow-2.1-brightgreen)](https://github.com/tensorflow/tensorflow/releases/tag/v2.1.0)
[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![TensorFlow 2.1](https://img.shields.io/badge/TensorFlow-2.1-FF6F00?logo=tensorflow)](https://github.com/tensorflow/tensorflow/releases/tag/v2.1.0)
[![Python 3.6](https://img.shields.io/badge/Python-3.6-3776AB)](https://www.python.org/downloads/release/python-360/)
> :memo: Provide details of the software required.
>
@@ -54,6 +54,8 @@ pip install -r requirements.txt
## Results
[![TensorFlow Hub](https://img.shields.io/badge/TF%20Hub-Models-FF6F00?logo=tensorflow)](https://tfhub.dev/...)
> :memo: Provide a table with results. (e.g., accuracy, latency)
>
> * Provide links to the pre-trained models (checkpoint, SavedModel files).
@@ -104,6 +106,8 @@ python3 ...
## License
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
> :memo: Place your license text in a file named LICENSE in the root of the repository.
>
> * Include information about your license.
......
@@ -2,7 +2,8 @@
# Welcome to the Model Garden for TensorFlow
The TensorFlow Model Garden is a repository with a number of different implementations of state-of-the-art (SOTA) models and modeling solutions for TensorFlow users. We aim to demonstrate the best practices for modeling so that TensorFlow users can take full advantage of TensorFlow for their research and product development.
The TensorFlow Model Garden is a repository with a number of different implementations of state-of-the-art (SOTA) models and modeling solutions for TensorFlow users. We aim to demonstrate the best practices for modeling so that TensorFlow users
can take full advantage of TensorFlow for their research and product development.
| Directory | Description |
|-----------|-------------|
@@ -10,20 +11,28 @@ The TensorFlow Model Garden is a repository with a number of different implement
| [research](research) | • A collection of research model implementations in TensorFlow 1 or 2 by researchers<br />• Maintained and supported by researchers |
| [community](community) | • A curated list of the GitHub repositories with machine learning models and implementations powered by TensorFlow 2 |
## [Announcements](../../wiki/Announcements)
## [Announcements](https://github.com/tensorflow/models/wiki/Announcements)
| Date | News |
|------|------|
| June 17, 2020 | [Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection](https://github.com/tensorflow/models/tree/master/research/object_detection#june-17th-2020) released
| May 21, 2020 | [Unifying Deep Local and Global Features for Image Search (DELG)](https://github.com/tensorflow/models/tree/master/research/delf#delg) code released
| May 19, 2020 | [MobileDets: Searching for Object Detection Architectures for Mobile Accelerators](https://github.com/tensorflow/models/tree/master/research/object_detection#may-19th-2020) released
| May 7, 2020 | [MnasFPN with MobileNet-V2 backbone](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#mobile-models) released for object detection
| May 1, 2020 | [DELF: DEep Local Features](https://github.com/tensorflow/models/tree/master/research/delf) updated to support TensorFlow 2.1
| March 31, 2020 | [Introducing the Model Garden for TensorFlow 2](https://blog.tensorflow.org/2020/03/introducing-model-garden-for-tensorflow-2.html) ([Tweet](https://twitter.com/TensorFlow/status/1245029834633297921)) |
## [Milestones](https://github.com/tensorflow/models/milestones)
| Date | Milestone |
|------|-----------|
| July 8, 2020 | [![GitHub milestone](https://img.shields.io/github/milestones/progress/tensorflow/models/1)](https://github.com/tensorflow/models/milestone/1) |
## Contributions
[![help wanted:paper implementation](https://img.shields.io/github/issues/tensorflow/models/help%20wanted%3Apaper%20implementation)](https://github.com/tensorflow/models/labels/help%20wanted%3Apaper%20implementation)
If you want to contribute, please review the [contribution guidelines](../../wiki/How-to-contribute).
If you want to contribute, please review the [contribution guidelines](https://github.com/tensorflow/models/wiki/How-to-contribute).
## License
......
@@ -6,13 +6,12 @@ This repository provides a curated list of the GitHub repositories with machine
**Note**: Contributing companies or individuals are responsible for maintaining their repositories.
## Models / Implementations
## Computer Vision
### Computer Vision
### Image Recognition
#### Image Recognition
| Model | Reference (Paper) | Features | Maintainer |
|-------|-------------------|----------|------------|
| Model | Paper | Features | Maintainer |
|-------|-------|----------|------------|
| [DenseNet 169](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/densenet169) | [Densely Connected Convolutional Networks](https://arxiv.org/pdf/1608.06993) | • FP32 Inference | [Intel](https://github.com/IntelAI) |
| [Inception V3](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/inceptionv3) | [Rethinking the Inception Architecture<br/>for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [Inception V4](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/inceptionv4) | [Inception-v4, Inception-ResNet and the Impact<br/>of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
@@ -21,12 +20,13 @@ This repository provides a curated list of the GitHub repositories with machine
| [ResNet 50](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference | [Intel](https://github.com/IntelAI) |
| [ResNet 50v1.5](https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50v1_5) | [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385) | • Int8 Inference<br/>• FP32 Inference<br/>• FP32 Training | [Intel](https://github.com/IntelAI) |
#### Segmentation
| Model | Reference (Paper) | &nbsp; &nbsp; &nbsp; Features &nbsp; &nbsp; &nbsp; | Maintainer |
|-------|-------------------|----------|------------|
### Segmentation
| Model | Paper | Features | Maintainer |
|-------|-------|----------|------------|
| [Mask R-CNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) | • Automatic Mixed Precision<br/>• Multi-GPU training support with Horovod<br/>• TensorRT | [NVIDIA](https://github.com/NVIDIA) |
| [U-Net Medical Image Segmentation](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/UNet_Medical) | [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597) | • Automatic Mixed Precision<br/>• Multi-GPU training support with Horovod<br/>• TensorRT | [NVIDIA](https://github.com/NVIDIA) |
## Contributions
If you want to contribute, please review the [contribution guidelines](../../../wiki/How-to-contribute).
If you want to contribute, please review the [contribution guidelines](https://github.com/tensorflow/models/wiki/How-to-contribute).
@@ -19,9 +19,10 @@ In the near future, we will add:
* State-of-the-art language understanding models:
More members in Transformer family
* Start-of-the-art image classification models:
* State-of-the-art image classification models:
EfficientNet, MnasNet, and variants
* A set of excellent object detection models.
* State-of-the-art object detection and instance segmentation models:
RetinaNet, Mask R-CNN, SpineNet, and variants
## Table of Contents
@@ -43,6 +44,7 @@ In the near future, we will add:
|-------|-------------------|
| [MNIST](vision/image_classification) | A basic model to classify digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) |
| [ResNet](vision/image_classification) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
| [EfficientNet](vision/image_classification) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
#### Object Detection and Segmentation
@@ -50,6 +52,8 @@ In the near future, we will add:
|-------|-------------------|
| [RetinaNet](vision/detection) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
| [Mask R-CNN](vision/detection) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
| [ShapeMask](vision/detection) | [ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors](https://arxiv.org/abs/1904.03239) |
| [SpineNet](vision/detection) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
### Natural Language Processing
......
@@ -271,6 +271,23 @@ class RetinanetBenchmarkReal(RetinanetAccuracy):
FLAGS.strategy_type = 'tpu'
self._run_and_report_benchmark(params, do_eval=False, warmup=0)
@flagsaver.flagsaver
def benchmark_2x2_tpu_spinenet_coco(self):
"""Runs accuracy test for RetinaNet with SpineNet backbone on 4 TPUs."""
self._setup()
params = self._params()
params['architecture']['backbone'] = 'spinenet'
params['architecture']['multilevel_features'] = 'identity'
params['architecture']['use_bfloat16'] = False
params['train']['batch_size'] = 64
params['train']['total_steps'] = 1875 # One epoch.
params['train']['iterations_per_loop'] = 500
params['train']['checkpoint']['path'] = ''
FLAGS.model_dir = self._get_model_dir(
'real_benchmark_2x2_tpu_spinenet_coco')
FLAGS.strategy_type = 'tpu'
self._run_and_report_benchmark(params, do_eval=False, warmup=0)
if __name__ == '__main__':
tf.test.main()
@@ -32,9 +32,9 @@ from official.vision.segmentation import unet_model as unet_model_lib
UNET3D_MIN_ACCURACY = 0.90
UNET3D_MAX_ACCURACY = 0.98
UNET_TRAINING_FILES = 'unet_training_data_files'
UNET_EVAL_FILES = 'unet_eval_data_files'
UNET_MODEL_CONFIG_FILE = 'unet_model_config'
UNET_TRAINING_FILES = 'gs://mlcompass-data/unet3d/train_data/*'
UNET_EVAL_FILES = 'gs://mlcompass-data/unet3d/eval_data/*'
UNET_MODEL_CONFIG_FILE = 'gs://mlcompass-data/unet3d/config/unet_config.yaml'
FLAGS = flags.FLAGS
......
@@ -14,15 +14,18 @@
# limitations under the License.
# ==============================================================================
"""Defines the base task abstraction."""
import abc
import functools
from typing import Any, Callable, Optional
import six
import tensorflow as tf
from official.modeling.hyperparams import config_definitions as cfg
from official.utils import registry
@six.add_metaclass(abc.ABCMeta)
class Task(tf.Module):
"""A single-replica view of training procedure.
@@ -54,14 +57,13 @@ class Task(tf.Module):
"""
pass
@abc.abstractmethod
def build_model(self) -> tf.keras.Model:
"""Creates the model architecture.
Returns:
A model instance.
"""
# TODO(hongkuny): the base task should call network factory.
pass
def compile_model(self,
model: tf.keras.Model,
@@ -98,6 +100,7 @@ class Task(tf.Module):
model.test_step = functools.partial(validation_step, model=model)
return model
@abc.abstractmethod
def build_inputs(self,
params: cfg.DataConfig,
input_context: Optional[tf.distribute.InputContext] = None):
@@ -112,20 +115,19 @@ class Task(tf.Module):
Returns:
A nested structure of per-replica input functions.
"""
pass
def build_losses(self, features, model_outputs, aux_losses=None) -> tf.Tensor:
def build_losses(self, labels, model_outputs, aux_losses=None) -> tf.Tensor:
"""Standard interface to compute losses.
Args:
features: optional feature/labels tensors.
labels: optional label tensors.
model_outputs: a nested structure of output tensors.
aux_losses: auxiliary loss tensors, i.e. `losses` in keras.Model.
Returns:
The total loss tensor.
"""
del model_outputs, features
del model_outputs, labels
if aux_losses is None:
losses = [tf.constant(0.0, dtype=tf.float32)]
@@ -139,29 +141,29 @@ class Task(tf.Module):
del training
return []
def process_metrics(self, metrics, labels, outputs):
def process_metrics(self, metrics, labels, model_outputs):
"""Process and update metrics. Called when using custom training loop API.
Args:
metrics: a nested structure of metrics objects.
The return of function self.build_metrics.
labels: a tensor or a nested structure of tensors.
outputs: a tensor or a nested structure of tensors.
model_outputs: a tensor or a nested structure of tensors.
For example, output of the keras model built by self.build_model.
"""
for metric in metrics:
metric.update_state(labels, outputs)
metric.update_state(labels, model_outputs)
def process_compiled_metrics(self, compiled_metrics, labels, outputs):
def process_compiled_metrics(self, compiled_metrics, labels, model_outputs):
"""Process and update compiled_metrics. Called when using the compile/fit API.
Args:
compiled_metrics: the compiled metrics (model.compiled_metrics).
labels: a tensor or a nested structure of tensors.
outputs: a tensor or a nested structure of tensors.
model_outputs: a tensor or a nested structure of tensors.
For example, output of the keras model built by self.build_model.
"""
compiled_metrics.update_state(labels, outputs)
compiled_metrics.update_state(labels, model_outputs)
def train_step(self,
inputs,
@@ -187,7 +189,7 @@ class Task(tf.Module):
outputs = model(features, training=True)
# Computes per-replica loss.
loss = self.build_losses(
features=labels, model_outputs=outputs, aux_losses=model.losses)
labels=labels, model_outputs=outputs, aux_losses=model.losses)
# Scales loss as the default gradients allreduce performs sum inside the
# optimizer.
scaled_loss = loss / tf.distribute.get_strategy().num_replicas_in_sync
@@ -231,7 +233,7 @@ class Task(tf.Module):
features, labels = inputs, inputs
outputs = self.inference_step(features, model)
loss = self.build_losses(
features=labels, model_outputs=outputs, aux_losses=model.losses)
labels=labels, model_outputs=outputs, aux_losses=model.losses)
logs = {self.loss: loss}
if metrics:
self.process_metrics(metrics, labels, outputs)
@@ -245,16 +247,57 @@ class Task(tf.Module):
"""Performs the forward step."""
return model(inputs, training=False)
def aggregate_logs(self, state, step_logs):
"""Optional aggregation over logs returned from a validation step."""
pass
def reduce_aggregated_logs(self, aggregated_logs):
"""Optional reduce of aggregated logs over validation steps."""
return {}
_REGISTERED_TASK_CLS = {}
# TODO(b/158268740): Move these outside the base class file.
def register_task_cls(task_config: cfg.TaskConfig) -> Task:
"""Register ExperimentConfig factory method."""
return registry.register(_REGISTERED_TASK_CLS, task_config)
# TODO(b/158741360): Add type annotations once pytype checks across modules.
def register_task_cls(task_config_cls):
"""Decorates a factory of Tasks for lookup by a subclass of TaskConfig.
This decorator supports registration of tasks as follows:
```
@dataclasses.dataclass
class MyTaskConfig(TaskConfig):
# Add fields here.
pass
@register_task_cls(MyTaskConfig)
class MyTask(Task):
# Inherits def __init__(self, task_config).
pass
my_task_config = MyTaskConfig()
my_task = get_task(my_task_config) # Returns MyTask(my_task_config).
```
Besides a class itself, other callables that create a Task from a TaskConfig
can be decorated by the result of this function, as long as there is at most
one registration for each config class.
Args:
task_config_cls: a subclass of TaskConfig (*not* an instance of TaskConfig).
Each task_config_cls can only be used for a single registration.
Returns:
A callable for use as class decorator that registers the decorated class
for creation from an instance of task_config_cls.
"""
return registry.register(_REGISTERED_TASK_CLS, task_config_cls)
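The decorator contract documented above can be sketched without TensorFlow. The following is a simplified illustration, not the actual `official.utils.registry` implementation; the class names mirror the docstring's example.

```python
# Simplified registry: maps a config class to the task class (or factory)
# registered for it. The real lookup lives in official.utils.registry.
_REGISTERED_TASK_CLS = {}

class TaskConfig:
  """Stand-in for the real TaskConfig base class."""

class Task:
  """Stand-in for the real Task base class."""
  def __init__(self, task_config):
    self.task_config = task_config

def register_task_cls(task_config_cls):
  """Returns a class decorator keyed by task_config_cls."""
  def decorator(task_cls):
    # Each config class may only be used for a single registration.
    if task_config_cls in _REGISTERED_TASK_CLS:
      raise ValueError(f'{task_config_cls.__name__} is already registered.')
    _REGISTERED_TASK_CLS[task_config_cls] = task_cls
    return task_cls
  return decorator

def get_task_cls(task_config_cls):
  return _REGISTERED_TASK_CLS[task_config_cls]

class MyTaskConfig(TaskConfig):
  pass

@register_task_cls(MyTaskConfig)
class MyTask(Task):
  pass

my_task = get_task_cls(MyTaskConfig)(MyTaskConfig())
```

Keying the registry by the config *class* (not an instance) is what lets a single config object later select the right task factory.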
def get_task_cls(task_config: cfg.TaskConfig) -> Task:
task_cls = registry.lookup(_REGISTERED_TASK_CLS, task_config)
# The user-visible get_task() is defined after classes have been registered.
# TODO(b/158741360): Add type annotations once pytype checks across modules.
def get_task_cls(task_config_cls):
task_cls = registry.lookup(_REGISTERED_TASK_CLS, task_config_cls)
return task_cls
@@ -162,19 +162,38 @@ class CallbacksConfig(base_config.Config):
@dataclasses.dataclass
class TrainerConfig(base_config.Config):
"""Configuration for trainer.
Attributes:
optimizer_config: optimizer config, it includes optimizer, learning rate,
and warmup schedule configs.
train_tf_while_loop: whether or not to use tf while loop.
train_tf_function: whether or not to use tf_function for training loop.
eval_tf_function: whether or not to use tf_function for eval.
steps_per_loop: number of steps per loop.
summary_interval: number of steps between each summary.
checkpoint_interval: number of steps between checkpoints.
max_to_keep: maximum number of checkpoints to keep.
continuous_eval_timeout: maximum number of seconds to wait between
checkpoints; if set to None, continuous eval will wait indefinitely.
"""
optimizer_config: OptimizationConfig = OptimizationConfig()
train_tf_while_loop: bool = True
train_tf_function: bool = True
eval_tf_function: bool = True
train_steps: int = 0
validation_steps: Optional[int] = None
validation_interval: int = 100
steps_per_loop: int = 1000
summary_interval: int = 1000
checkpoint_interval: int = 1000
max_to_keep: int = 5
continuous_eval_timeout: Optional[int] = None
train_tf_while_loop: bool = True
train_tf_function: bool = True
eval_tf_function: bool = True
@dataclasses.dataclass
class TaskConfig(base_config.Config):
network: base_config.Config = None
model: base_config.Config = None
train_data: DataConfig = DataConfig()
validation_data: DataConfig = DataConfig()
@@ -182,13 +201,9 @@ class TaskConfig(base_config.Config):
@dataclasses.dataclass
class ExperimentConfig(base_config.Config):
"""Top-level configuration."""
mode: str = "train" # train, eval, train_and_eval.
task: TaskConfig = TaskConfig()
trainer: TrainerConfig = TrainerConfig()
runtime: RuntimeConfig = RuntimeConfig()
train_steps: int = 0
validation_steps: Optional[int] = None
validation_interval: int = 100
_REGISTERED_CONFIGS = {}
......
@@ -39,12 +39,14 @@ class OptimizerConfig(oneof.OneOfConfig):
adam: adam optimizer config.
adamw: adam with weight decay.
lamb: lamb optimizer.
rmsprop: rmsprop optimizer.
"""
type: Optional[str] = None
sgd: opt_cfg.SGDConfig = opt_cfg.SGDConfig()
adam: opt_cfg.AdamConfig = opt_cfg.AdamConfig()
adamw: opt_cfg.AdamWeightDecayConfig = opt_cfg.AdamWeightDecayConfig()
lamb: opt_cfg.LAMBConfig = opt_cfg.LAMBConfig()
rmsprop: opt_cfg.RMSPropConfig = opt_cfg.RMSPropConfig()
@dataclasses.dataclass
......
@@ -40,6 +40,29 @@ class SGDConfig(base_config.Config):
momentum: float = 0.0
@dataclasses.dataclass
class RMSPropConfig(base_config.Config):
"""Configuration for RMSProp optimizer.
The attributes of this class match the arguments of
tf.keras.optimizers.RMSprop.
Attributes:
name: name of the optimizer.
learning_rate: learning_rate for RMSprop optimizer.
rho: discounting factor for RMSprop optimizer.
momentum: momentum for RMSprop optimizer.
epsilon: epsilon value for RMSprop optimizer; helps with numerical stability.
centered: whether to normalize gradients by their estimated variance.
"""
name: str = "RMSprop"
learning_rate: float = 0.001
rho: float = 0.9
momentum: float = 0.0
epsilon: float = 1e-7
centered: bool = False
@dataclasses.dataclass
class AdamConfig(base_config.Config):
"""Configuration for Adam optimizer.
......
@@ -14,7 +14,6 @@
# limitations under the License.
# ==============================================================================
"""Optimizer factory class."""
from typing import Union
import tensorflow as tf
@@ -29,7 +28,8 @@ OPTIMIZERS_CLS = {
'sgd': tf.keras.optimizers.SGD,
'adam': tf.keras.optimizers.Adam,
'adamw': nlp_optimization.AdamWeightDecay,
'lamb': tfa_optimizers.LAMB
'lamb': tfa_optimizers.LAMB,
'rmsprop': tf.keras.optimizers.RMSprop
}
LR_CLS = {
......
@@ -15,84 +15,37 @@
# ==============================================================================
"""Tests for optimizer_factory.py."""
from absl.testing import parameterized
import tensorflow as tf
import tensorflow_addons.optimizers as tfa_optimizers
from official.modeling.optimization import optimizer_factory
from official.modeling.optimization.configs import optimization_config
from official.nlp import optimization as nlp_optimization
class OptimizerFactoryTest(tf.test.TestCase):
def test_sgd_optimizer(self):
params = {
'optimizer': {
'type': 'sgd',
'sgd': {'learning_rate': 0.1, 'momentum': 0.9}
}
}
expected_optimizer_config = {
'name': 'SGD',
'learning_rate': 0.1,
'decay': 0.0,
'momentum': 0.9,
'nesterov': False
}
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
self.assertIsInstance(optimizer, tf.keras.optimizers.SGD)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
def test_adam_optimizer(self):
# Define adam optimizer with default values.
params = {
'optimizer': {
'type': 'adam'
}
}
expected_optimizer_config = tf.keras.optimizers.Adam().get_config()
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
self.assertIsInstance(optimizer, tf.keras.optimizers.Adam)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
class OptimizerFactoryTest(tf.test.TestCase, parameterized.TestCase):
def test_adam_weight_decay_optimizer(self):
@parameterized.parameters(
('sgd'),
('rmsprop'),
('adam'),
('adamw'),
('lamb'))
def test_optimizers(self, optimizer_type):
params = {
'optimizer': {
'type': 'adamw'
'type': optimizer_type
}
}
expected_optimizer_config = nlp_optimization.AdamWeightDecay().get_config()
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
self.assertIsInstance(optimizer, nlp_optimization.AdamWeightDecay)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
optimizer_cls = optimizer_factory.OPTIMIZERS_CLS[optimizer_type]
expected_optimizer_config = optimizer_cls().get_config()
def test_lamb_optimizer(self):
params = {
'optimizer': {
'type': 'lamb'
}
}
expected_optimizer_config = tfa_optimizers.LAMB().get_config()
opt_config = optimization_config.OptimizationConfig(params)
opt_factory = optimizer_factory.OptimizerFactory(opt_config)
lr = opt_factory.build_learning_rate()
optimizer = opt_factory.build_optimizer(lr)
self.assertIsInstance(optimizer, tfa_optimizers.LAMB)
self.assertIsInstance(optimizer, optimizer_cls)
self.assertEqual(expected_optimizer_config, optimizer.get_config())
def test_stepwise_lr_schedule(self):
......
@@ -173,3 +173,18 @@ def assert_rank(tensor, expected_rank, name=None):
"For the tensor `%s`, the actual tensor rank `%d` (shape = %s) is not "
"equal to the expected tensor rank `%s`" %
(name, actual_rank, str(tensor.shape), str(expected_rank)))
def safe_mean(losses):
"""Computes a safe mean of the losses.
Args:
losses: `Tensor` whose elements contain individual loss measurements.
Returns:
A scalar representing the mean of `losses`. If `losses` is empty,
then zero is returned.
"""
total = tf.reduce_sum(losses)
num_elements = tf.cast(tf.size(losses), dtype=losses.dtype)
return tf.math.divide_no_nan(total, num_elements)
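For illustration, the `divide_no_nan` guard in `safe_mean` behaves like the following plain-Python sketch over a list of floats (an analogy for intuition, not the TensorFlow implementation):

```python
def safe_mean_py(losses):
  # Plain-Python analogue of safe_mean: tf.math.divide_no_nan returns
  # zero (rather than NaN/inf) when the denominator is zero, which here
  # corresponds to an empty `losses` input.
  total = sum(losses)
  count = len(losses)
  return total / count if count else 0.0
```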
@@ -25,7 +25,6 @@ import tensorflow_hub as hub
from official.modeling import tf_utils
from official.nlp.albert import configs as albert_configs
from official.nlp.bert import configs
from official.nlp.modeling import losses
from official.nlp.modeling import models
from official.nlp.modeling import networks
@@ -67,22 +66,27 @@ class BertPretrainLossAndMetricLayer(tf.keras.layers.Layer):
next_sentence_loss, name='next_sentence_loss', aggregation='mean')
def call(self,
lm_output,
sentence_output,
lm_output_logits,
sentence_output_logits,
lm_label_ids,
lm_label_weights,
sentence_labels=None):
"""Implements call() for the layer."""
lm_label_weights = tf.cast(lm_label_weights, tf.float32)
lm_output = tf.cast(lm_output, tf.float32)
lm_output_logits = tf.cast(lm_output_logits, tf.float32)
mask_label_loss = losses.weighted_sparse_categorical_crossentropy_loss(
labels=lm_label_ids, predictions=lm_output, weights=lm_label_weights)
lm_prediction_losses = tf.keras.losses.sparse_categorical_crossentropy(
lm_label_ids, lm_output_logits, from_logits=True)
lm_numerator_loss = tf.reduce_sum(lm_prediction_losses * lm_label_weights)
lm_denominator_loss = tf.reduce_sum(lm_label_weights)
mask_label_loss = tf.math.divide_no_nan(lm_numerator_loss,
lm_denominator_loss)
if sentence_labels is not None:
sentence_output = tf.cast(sentence_output, tf.float32)
sentence_loss = losses.weighted_sparse_categorical_crossentropy_loss(
labels=sentence_labels, predictions=sentence_output)
sentence_output_logits = tf.cast(sentence_output_logits, tf.float32)
sentence_loss = tf.keras.losses.sparse_categorical_crossentropy(
sentence_labels, sentence_output_logits, from_logits=True)
sentence_loss = tf.reduce_mean(sentence_loss)
loss = mask_label_loss + sentence_loss
else:
sentence_loss = None
@@ -92,8 +96,8 @@ class BertPretrainLossAndMetricLayer(tf.keras.layers.Layer):
# TODO(hongkuny): Avoids the hack and switches add_loss.
final_loss = tf.fill(batch_shape, loss)
self._add_metrics(lm_output, lm_label_ids, lm_label_weights,
mask_label_loss, sentence_output, sentence_labels,
self._add_metrics(lm_output_logits, lm_label_ids, lm_label_weights,
mask_label_loss, sentence_output_logits, sentence_labels,
sentence_loss)
return final_loss
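The masked-LM loss computed above is a weighted mean: per-token cross-entropy values are scaled by `lm_label_weights` (1.0 at real prediction positions, 0.0 at padding) and normalized by the total weight, with `divide_no_nan` guarding the all-padding case. A plain-Python sketch of that reduction (illustrative names, not the library API):

```python
def weighted_mean_loss(per_token_losses, label_weights):
  # Losses at padded positions carry weight 0.0 and drop out of both sums.
  numerator = sum(l * w for l, w in zip(per_token_losses, label_weights))
  denominator = sum(label_weights)
  # divide_no_nan analogue: an all-padding batch yields 0.0, not NaN.
  return numerator / denominator if denominator else 0.0
```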
@@ -228,11 +232,12 @@ def pretrain_model(bert_config,
activation=tf_utils.get_activation(bert_config.hidden_act),
num_token_predictions=max_predictions_per_seq,
initializer=initializer,
output='predictions')
output='logits')
lm_output, sentence_output = pretrainer_model(
outputs = pretrainer_model(
[input_word_ids, input_mask, input_type_ids, masked_lm_positions])
lm_output = outputs['masked_lm']
sentence_output = outputs['classification']
pretrain_loss_layer = BertPretrainLossAndMetricLayer(
vocab_size=bert_config.vocab_size)
output_loss = pretrain_loss_layer(lm_output, sentence_output, masked_lm_ids,
......
@@ -247,3 +247,39 @@ def create_squad_dataset(file_path,
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
return dataset
def create_retrieval_dataset(file_path,
seq_length,
batch_size,
input_pipeline_context=None):
"""Creates input dataset from (tf)records files for scoring."""
name_to_features = {
'input_ids': tf.io.FixedLenFeature([seq_length], tf.int64),
'input_mask': tf.io.FixedLenFeature([seq_length], tf.int64),
'segment_ids': tf.io.FixedLenFeature([seq_length], tf.int64),
'int_iden': tf.io.FixedLenFeature([1], tf.int64),
}
dataset = single_file_dataset(file_path, name_to_features)
# The dataset is always sharded by number of hosts.
# num_input_pipelines is the number of hosts rather than number of cores.
if input_pipeline_context and input_pipeline_context.num_input_pipelines > 1:
dataset = dataset.shard(input_pipeline_context.num_input_pipelines,
input_pipeline_context.input_pipeline_id)
def _select_data_from_record(record):
x = {
'input_word_ids': record['input_ids'],
'input_mask': record['input_mask'],
'input_type_ids': record['segment_ids']
}
y = record['int_iden']
return (x, y)
dataset = dataset.map(
_select_data_from_record,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(batch_size, drop_remainder=False)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
return dataset
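The input pipeline above shards by host, remaps record fields, and batches without dropping the remainder. The following plain-Python sketch mirrors those three steps on in-memory records; `shard`, `select_data_from_record`, and `batch` here are illustrative stand-ins for the `tf.data` calls:

```python
def shard(records, num_shards, shard_index):
  # tf.data.Dataset.shard analogue: keep every num_shards-th record,
  # starting at shard_index (the pipeline shards per host, not per core).
  return records[shard_index::num_shards]

def select_data_from_record(record):
  # Remap raw feature names to the model's expected input names,
  # splitting out `int_iden` as the label, as in the code above.
  x = {'input_word_ids': record['input_ids'],
       'input_mask': record['input_mask'],
       'input_type_ids': record['segment_ids']}
  y = record['int_iden']
  return x, y

def batch(examples, batch_size):
  # drop_remainder=False analogue: the final partial batch is kept.
  return [examples[i:i + batch_size]
          for i in range(0, len(examples), batch_size)]
```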
@@ -111,6 +111,7 @@ def run_customized_training_loop(
model_dir=None,
train_input_fn=None,
steps_per_epoch=None,
num_eval_per_epoch=1,
steps_per_loop=None,
epochs=1,
eval_input_fn=None,
@@ -144,6 +145,7 @@ def run_customized_training_loop(
steps_per_epoch: Number of steps to run per epoch. At the end of each
epoch, model checkpoint will be saved and evaluation will be conducted
if evaluation dataset is provided.
num_eval_per_epoch: Number of evaluations per epoch.
steps_per_loop: Number of steps per graph-mode loop. In order to reduce
communication in eager context, training logs are printed every
steps_per_loop.
@@ -158,16 +160,17 @@ def run_customized_training_loop(
init_checkpoint: Optional checkpoint to load to `sub_model` returned by
`model_fn`.
custom_callbacks: A list of Keras Callbacks objects to run during
training. More specifically, `on_batch_begin()`, `on_batch_end()`,
`on_epoch_begin()`, `on_epoch_end()` methods are invoked during
training. Note that some metrics may be missing from `logs`.
training. More specifically, `on_train_begin(), on_train_end(),
on_batch_begin()`, `on_batch_end()`, `on_epoch_begin()`,
`on_epoch_end()` methods are invoked during training.
Note that some metrics may be missing from `logs`.
run_eagerly: Whether to run model training in pure eager execution. This
should be disabled for TPUStrategy.
sub_model_export_name: If not None, will export `sub_model` returned by
`model_fn` into checkpoint files. The name of intermediate checkpoint
file is {sub_model_export_name}_step_{step}.ckpt and the last
checkpoint's name is {sub_model_export_name}.ckpt;
if None, `sub_model` will not be exported as checkpoint.
checkpoint's name is {sub_model_export_name}.ckpt; if None, `sub_model`
will not be exported as checkpoint.
explicit_allreduce: Whether to explicitly perform gradient allreduce,
instead of relying on implicit allreduce in optimizer.apply_gradients().
default is False. For now, if training using FP16 mixed precision,
@@ -177,10 +180,10 @@ def run_customized_training_loop(
pre_allreduce_callbacks: A list of callback functions that takes gradients
and model variables pairs as input, manipulate them, and returns a new
gradients and model variables pairs. The callback functions will be
invoked in the list order and before gradients are allreduced.
With mixed precision training, the pre_allreduce_callbacks will be
applied on scaled_gradients. Default is no callbacks.
Only used when explicit_allreduce=True.
invoked in the list order and before gradients are allreduced. With
mixed precision training, the pre_allreduce_callbacks will be applied on
scaled_gradients. Default is no callbacks. Only used when
explicit_allreduce=True.
post_allreduce_callbacks: A list of callback functions that take
gradient and model variable pairs as input, manipulate them, and
return new gradient and model variable pairs. The callback
......@@ -208,6 +211,8 @@ def run_customized_training_loop(
required_arguments = [
strategy, model_fn, loss_fn, model_dir, steps_per_epoch, train_input_fn
]
steps_between_evals = int(steps_per_epoch / num_eval_per_epoch)
if [arg for arg in required_arguments if arg is None]:
raise ValueError('`strategy`, `model_fn`, `loss_fn`, `model_dir`, '
'`steps_per_epoch` and `train_input_fn` are required '
......@@ -216,17 +221,17 @@ def run_customized_training_loop(
if tf.config.list_logical_devices('TPU'):
# One can't fully utilize a TPU with steps_per_loop=1, so in this case
# default users to a more useful value.
steps_per_loop = min(1000, steps_per_epoch)
steps_per_loop = min(1000, steps_between_evals)
else:
steps_per_loop = 1
logging.info('steps_per_loop not specified. Using steps_per_loop=%d',
steps_per_loop)
if steps_per_loop > steps_per_epoch:
if steps_per_loop > steps_between_evals:
logging.warning(
'steps_per_loop: %d is specified to be greater than '
' steps_per_epoch: %d, we will use steps_per_epoch as'
' steps_per_loop.', steps_per_loop, steps_per_epoch)
steps_per_loop = steps_per_epoch
' steps_between_evals: %d, we will use steps_between_evals as'
' steps_per_loop.', steps_per_loop, steps_between_evals)
steps_per_loop = steps_between_evals
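The defaulting and clamping of `steps_per_loop` above can be sketched in isolation. This is an illustrative stand-alone function, not part of the training loop; the TPU device check is replaced by an `on_tpu` flag for simplicity:

```python
def resolve_steps_per_loop(steps_per_loop, steps_per_epoch, num_eval_per_epoch,
                           on_tpu):
    """Illustrative sketch of the steps_per_loop defaulting/clamping above."""
    steps_between_evals = int(steps_per_epoch / num_eval_per_epoch)
    if steps_per_loop is None:
        # One can't fully utilize a TPU with steps_per_loop=1, so default to a
        # larger value there; on CPU/GPU fall back to single-step loops.
        steps_per_loop = min(1000, steps_between_evals) if on_tpu else 1
    # Never run past the next evaluation point inside one host loop.
    return min(steps_per_loop, steps_between_evals)
```

For example, with 1000 steps per epoch and 4 evaluations per epoch, an unspecified `steps_per_loop` on TPU resolves to 250, and any value larger than 250 is clamped down to it.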
assert tf.executing_eagerly()
if run_eagerly:
......@@ -242,12 +247,9 @@ def run_customized_training_loop(
raise ValueError(
'if `metric_fn` is specified, metric_fn must be a callable.')
callback_list = tf.keras.callbacks.CallbackList(custom_callbacks)
total_training_steps = steps_per_epoch * epochs
train_iterator = _get_input_iterator(train_input_fn, strategy)
eval_loss_metric = tf.keras.metrics.Mean(
'training_loss', dtype=tf.float32)
eval_loss_metric = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
with distribution_utils.get_strategy_scope(strategy):
# To correctly place the model weights on accelerators,
......@@ -260,6 +262,9 @@ def run_customized_training_loop(
raise ValueError('sub_model_export_name is specified as %s, but '
'sub_model is None.' % sub_model_export_name)
callback_list = tf.keras.callbacks.CallbackList(
callbacks=custom_callbacks, model=model)
optimizer = model.optimizer
if init_checkpoint:
......@@ -270,8 +275,7 @@ def run_customized_training_loop(
checkpoint.restore(init_checkpoint).assert_existing_objects_matched()
logging.info('Loading from checkpoint file completed')
train_loss_metric = tf.keras.metrics.Mean(
'training_loss', dtype=tf.float32)
train_loss_metric = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
eval_metrics = [metric_fn()] if metric_fn else []
# If evaluation is required, make a copy of metric as it will be used by
# both train and evaluation.
......@@ -440,18 +444,20 @@ def run_customized_training_loop(
latest_checkpoint_file = tf.train.latest_checkpoint(model_dir)
if latest_checkpoint_file:
logging.info(
'Checkpoint file %s found and restoring from '
'checkpoint', latest_checkpoint_file)
logging.info('Checkpoint file %s found and restoring from '
'checkpoint', latest_checkpoint_file)
checkpoint.restore(latest_checkpoint_file)
logging.info('Loading from checkpoint file completed')
current_step = optimizer.iterations.numpy()
checkpoint_name = 'ctl_step_{step}.ckpt'
while current_step < total_training_steps:
logs = {}
callback_list.on_train_begin()
while current_step < total_training_steps and not model.stop_training:
if current_step % steps_per_epoch == 0:
callback_list.on_epoch_begin(int(current_step / steps_per_epoch) + 1)
callback_list.on_epoch_begin(
int(current_step / steps_per_epoch) + 1)
# Training loss/metrics take the average over steps inside the micro
# training loop. We reset their values before each round.
......@@ -461,7 +467,7 @@ def run_customized_training_loop(
callback_list.on_batch_begin(current_step)
# Runs several steps in the host while loop.
steps = steps_to_run(current_step, steps_per_epoch, steps_per_loop)
steps = steps_to_run(current_step, steps_between_evals, steps_per_loop)
if tf.config.list_physical_devices('GPU'):
# TODO(zongweiz): merge with train_steps once tf.while_loop
......@@ -470,11 +476,9 @@ def run_customized_training_loop(
train_single_step(train_iterator)
else:
# Converts steps to a Tensor to avoid tf.function retracing.
train_steps(train_iterator,
tf.convert_to_tensor(steps, dtype=tf.int32))
train_steps(train_iterator, tf.convert_to_tensor(steps, dtype=tf.int32))
train_loss = _float_metric_value(train_loss_metric)
current_step += steps
callback_list.on_batch_end(current_step - 1, {'loss': train_loss})
# Updates training logging.
training_status = 'Train Step: %d/%d / loss = %s' % (
......@@ -492,8 +496,7 @@ def run_customized_training_loop(
'learning_rate',
optimizer.learning_rate(current_step),
step=current_step)
tf.summary.scalar(
train_loss_metric.name, train_loss, step=current_step)
tf.summary.scalar(train_loss_metric.name, train_loss, step=current_step)
for metric in train_metrics + model.metrics:
metric_value = _float_metric_value(metric)
training_status += ' %s = %f' % (metric.name, metric_value)
......@@ -501,7 +504,11 @@ def run_customized_training_loop(
summary_writer.flush()
logging.info(training_status)
if current_step % steps_per_epoch == 0:
# If no evaluation is needed, we only call on_batch_end with train_loss;
# this ensures we get granular global_step/sec on TensorBoard.
if current_step % steps_between_evals:
callback_list.on_batch_end(current_step - 1, {'loss': train_loss})
else:
# Save a submodel with the step in the file name after each epoch.
if sub_model_export_name:
_save_checkpoint(
......@@ -514,7 +521,6 @@ def run_customized_training_loop(
if current_step < total_training_steps:
_save_checkpoint(strategy, checkpoint, model_dir,
checkpoint_name.format(step=current_step))
logs = None
if eval_input_fn:
logging.info('Running evaluation after step: %s.', current_step)
logs = _run_evaluation(current_step,
......@@ -523,8 +529,15 @@ def run_customized_training_loop(
eval_loss_metric.reset_states()
for metric in eval_metrics + model.metrics:
metric.reset_states()
# We add train_loss here rather than call on_batch_end twice to make
# sure that no duplicated values are generated.
logs['loss'] = train_loss
callback_list.on_batch_end(current_step - 1, logs)
callback_list.on_epoch_end(int(current_step / steps_per_epoch), logs)
# Call on_epoch_end only after a real epoch ends to prevent miscalculation
# of training steps.
if current_step % steps_per_epoch == 0:
callback_list.on_epoch_end(int(current_step / steps_per_epoch), logs)
if sub_model_export_name:
_save_checkpoint(strategy, sub_model_checkpoint, model_dir,
......@@ -532,14 +545,11 @@ def run_customized_training_loop(
_save_checkpoint(strategy, checkpoint, model_dir,
checkpoint_name.format(step=current_step))
logs = None
if eval_input_fn:
logging.info('Running final evaluation after training is complete.')
logs = _run_evaluation(current_step,
_get_input_iterator(eval_input_fn, strategy))
callback_list.on_epoch_end(int(current_step / steps_per_epoch), logs)
training_summary = {
'total_training_steps': total_training_steps,
'train_loss': _float_metric_value(train_loss_metric),
......@@ -557,4 +567,6 @@ def run_customized_training_loop(
if not _should_export_summary(strategy):
tf.io.gfile.rmtree(summary_dir)
callback_list.on_train_end()
return model
......@@ -258,6 +258,7 @@ class ModelTrainingUtilsTest(tf.test.TestCase, parameterized.TestCase):
loss_fn=tf.keras.losses.categorical_crossentropy,
model_dir=model_dir,
steps_per_epoch=20,
num_eval_per_epoch=4,
steps_per_loop=10,
epochs=2,
train_input_fn=input_fn,
......@@ -269,14 +270,15 @@ class ModelTrainingUtilsTest(tf.test.TestCase, parameterized.TestCase):
run_eagerly=False)
self.assertEqual(callback.epoch_begin, [(1, {}), (2, {})])
epoch_ends, epoch_end_infos = zip(*callback.epoch_end)
self.assertEqual(list(epoch_ends), [1, 2])
self.assertEqual(list(epoch_ends), [1, 2, 2])
for info in epoch_end_infos:
self.assertIn('accuracy', info)
self.assertEqual(callback.batch_begin,
[(0, {}), (10, {}), (20, {}), (30, {})])
self.assertEqual(callback.batch_begin, [(0, {}), (5, {}), (10, {}),
(15, {}), (20, {}), (25, {}),
(30, {}), (35, {})])
batch_ends, batch_end_infos = zip(*callback.batch_end)
self.assertEqual(list(batch_ends), [9, 19, 29, 39])
self.assertEqual(list(batch_ends), [4, 9, 14, 19, 24, 29, 34, 39])
for info in batch_end_infos:
self.assertIn('loss', info)
......
......@@ -61,7 +61,11 @@ def define_common_squad_flags():
flags.DEFINE_integer('train_batch_size', 32, 'Total batch size for training.')
# Predict processing related.
flags.DEFINE_string('predict_file', None,
'Prediction data path with train tfrecords.')
'SQuAD prediction json file path. '
'`predict` mode supports multiple files: one can use '
'wildcard to specify multiple files and it can also be '
'multiple file patterns separated by comma. Note that '
'`eval` mode only supports a single predict file.')
flags.DEFINE_bool(
'do_lower_case', True,
'Whether to lower case the input text. Should be True for uncased '
......@@ -159,22 +163,9 @@ def get_dataset_fn(input_file_pattern, max_seq_length, global_batch_size,
return _dataset_fn
def predict_squad_customized(strategy,
input_meta_data,
bert_config,
checkpoint_path,
predict_tfrecord_path,
num_steps):
"""Makes predictions using a Bert-based squad model."""
predict_dataset_fn = get_dataset_fn(
predict_tfrecord_path,
input_meta_data['max_seq_length'],
FLAGS.predict_batch_size,
is_training=False)
predict_iterator = iter(
strategy.experimental_distribute_datasets_from_function(
predict_dataset_fn))
def get_squad_model_to_predict(strategy, bert_config, checkpoint_path,
input_meta_data):
"""Gets a squad model to make predictions."""
with strategy.scope():
# Prediction always uses float32, even if training uses mixed precision.
tf.keras.mixed_precision.experimental.set_policy('float32')
......@@ -188,6 +179,23 @@ def predict_squad_customized(strategy,
logging.info('Restoring checkpoints from %s', checkpoint_path)
checkpoint = tf.train.Checkpoint(model=squad_model)
checkpoint.restore(checkpoint_path).expect_partial()
return squad_model
def predict_squad_customized(strategy,
input_meta_data,
predict_tfrecord_path,
num_steps,
squad_model):
"""Makes predictions using a Bert-based squad model."""
predict_dataset_fn = get_dataset_fn(
predict_tfrecord_path,
input_meta_data['max_seq_length'],
FLAGS.predict_batch_size,
is_training=False)
predict_iterator = iter(
strategy.experimental_distribute_datasets_from_function(
predict_dataset_fn))
@tf.function
def predict_step(iterator):
......@@ -287,8 +295,8 @@ def train_squad(strategy,
post_allreduce_callbacks=[clip_by_global_norm_callback])
def prediction_output_squad(
strategy, input_meta_data, tokenizer, bert_config, squad_lib, checkpoint):
def prediction_output_squad(strategy, input_meta_data, tokenizer, squad_lib,
predict_file, squad_model):
"""Makes predictions for a squad dataset."""
doc_stride = input_meta_data['doc_stride']
max_query_length = input_meta_data['max_query_length']
......@@ -296,7 +304,7 @@ def prediction_output_squad(
version_2_with_negative = input_meta_data.get('version_2_with_negative',
False)
eval_examples = squad_lib.read_squad_examples(
input_file=FLAGS.predict_file,
input_file=predict_file,
is_training=False,
version_2_with_negative=version_2_with_negative)
......@@ -337,8 +345,7 @@ def prediction_output_squad(
num_steps = int(dataset_size / FLAGS.predict_batch_size)
all_results = predict_squad_customized(
strategy, input_meta_data, bert_config,
checkpoint, eval_writer.filename, num_steps)
strategy, input_meta_data, eval_writer.filename, num_steps, squad_model)
all_predictions, all_nbest_json, scores_diff_json = (
squad_lib.postprocess_output(
......@@ -356,11 +363,14 @@ def prediction_output_squad(
def dump_to_files(all_predictions, all_nbest_json, scores_diff_json,
squad_lib, version_2_with_negative):
squad_lib, version_2_with_negative, file_prefix=''):
"""Save output to json files."""
output_prediction_file = os.path.join(FLAGS.model_dir, 'predictions.json')
output_nbest_file = os.path.join(FLAGS.model_dir, 'nbest_predictions.json')
output_null_log_odds_file = os.path.join(FLAGS.model_dir, 'null_odds.json')
output_prediction_file = os.path.join(FLAGS.model_dir,
'%spredictions.json' % file_prefix)
output_nbest_file = os.path.join(FLAGS.model_dir,
'%snbest_predictions.json' % file_prefix)
output_null_log_odds_file = os.path.join(FLAGS.model_dir,
                                         '%snull_odds.json' % file_prefix)
logging.info('Writing predictions to: %s', (output_prediction_file))
logging.info('Writing nbest to: %s', (output_nbest_file))
......@@ -370,6 +380,22 @@ def dump_to_files(all_predictions, all_nbest_json, scores_diff_json,
squad_lib.write_to_json_files(scores_diff_json, output_null_log_odds_file)
def _get_matched_files(input_path):
"""Returns all files that match the input_path."""
input_patterns = input_path.strip().split(',')
all_matched_files = []
for input_pattern in input_patterns:
input_pattern = input_pattern.strip()
if not input_pattern:
continue
matched_files = tf.io.gfile.glob(input_pattern)
if not matched_files:
raise ValueError('%s does not match any files.' % input_pattern)
else:
all_matched_files.extend(matched_files)
return sorted(all_matched_files)
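`_get_matched_files` above accepts a comma-separated list of glob patterns. Its behavior can be mimicked with the standard-library `glob` module in place of `tf.io.gfile.glob` (a local-filesystem sketch, not the function used by the trainer):

```python
import glob
import os
import tempfile


def get_matched_files_local(input_path):
    """Local-filesystem sketch of _get_matched_files above."""
    all_matched_files = []
    for input_pattern in input_path.strip().split(','):
        input_pattern = input_pattern.strip()
        if not input_pattern:
            continue
        matched_files = glob.glob(input_pattern)
        if not matched_files:
            raise ValueError('%s does not match any files.' % input_pattern)
        all_matched_files.extend(matched_files)
    return sorted(all_matched_files)


# Usage: two patterns, comma separated, matching three hypothetical files.
with tempfile.TemporaryDirectory() as d:
    for name in ('xquad.ar.json', 'xquad.de.json', 'squad.json'):
        open(os.path.join(d, name), 'w').close()
    files = get_matched_files_local('%s/xquad.*.json, %s/squad.json' % (d, d))
    assert [os.path.basename(f) for f in files] == [
        'squad.json', 'xquad.ar.json', 'xquad.de.json']
```

Sorting the combined result makes the per-file output prefixes deterministic across runs regardless of filesystem enumeration order.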
def predict_squad(strategy,
input_meta_data,
tokenizer,
......@@ -379,11 +405,24 @@ def predict_squad(strategy,
"""Gets prediction results and writes them to hard drive."""
if init_checkpoint is None:
init_checkpoint = tf.train.latest_checkpoint(FLAGS.model_dir)
all_predictions, all_nbest_json, scores_diff_json = prediction_output_squad(
strategy, input_meta_data, tokenizer,
bert_config, squad_lib, init_checkpoint)
dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib,
input_meta_data.get('version_2_with_negative', False))
all_predict_files = _get_matched_files(FLAGS.predict_file)
squad_model = get_squad_model_to_predict(strategy, bert_config,
init_checkpoint, input_meta_data)
for idx, predict_file in enumerate(all_predict_files):
all_predictions, all_nbest_json, scores_diff_json = prediction_output_squad(
strategy, input_meta_data, tokenizer, squad_lib, predict_file,
squad_model)
if len(all_predict_files) == 1:
file_prefix = ''
else:
# e.g. if predict_file is /path/xquad.ar.json, the `file_prefix` will be
# "xquad.ar-".
file_prefix = '%s-' % os.path.splitext(
os.path.basename(all_predict_files[idx]))[0]
dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib,
input_meta_data.get('version_2_with_negative', False),
file_prefix)
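The `file_prefix` derivation above (basename minus the last extension, plus a trailing dash) can be checked in isolation. Note that `os.path.splitext` strips only the final extension, so a file like `xquad.ar.json` keeps its language tag in the prefix:

```python
import os


def prediction_file_prefix(predict_file):
    """Sketch of the file_prefix derivation in predict_squad above."""
    return '%s-' % os.path.splitext(os.path.basename(predict_file))[0]


assert prediction_file_prefix('/path/xquad.ar.json') == 'xquad.ar-'
assert prediction_file_prefix('squad.json') == 'squad-'
```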
def eval_squad(strategy,
......@@ -395,9 +434,17 @@ def eval_squad(strategy,
"""Get prediction results and evaluate them against ground truth."""
if init_checkpoint is None:
init_checkpoint = tf.train.latest_checkpoint(FLAGS.model_dir)
all_predict_files = _get_matched_files(FLAGS.predict_file)
if len(all_predict_files) != 1:
raise ValueError('`eval_squad` only supports one predict file, '
'but got %s' % all_predict_files)
squad_model = get_squad_model_to_predict(strategy, bert_config,
init_checkpoint, input_meta_data)
all_predictions, all_nbest_json, scores_diff_json = prediction_output_squad(
strategy, input_meta_data, tokenizer,
bert_config, squad_lib, init_checkpoint)
strategy, input_meta_data, tokenizer, squad_lib, all_predict_files[0],
squad_model)
dump_to_files(all_predictions, all_nbest_json, scores_diff_json, squad_lib,
input_meta_data.get('version_2_with_negative', False))
......
......@@ -13,7 +13,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A multi-head BERT encoder network for pretraining."""
"""Multi-head BERT encoder network with classification heads.
Includes configurations and instantiation methods.
"""
from typing import List, Optional, Text
import dataclasses
......@@ -24,7 +27,6 @@ from official.modeling.hyperparams import base_config
from official.modeling.hyperparams import config_definitions as cfg
from official.nlp.configs import encoders
from official.nlp.modeling import layers
from official.nlp.modeling import networks
from official.nlp.modeling.models import bert_pretrainer
......@@ -47,43 +49,34 @@ class BertPretrainerConfig(base_config.Config):
cls_heads: List[ClsHeadConfig] = dataclasses.field(default_factory=list)
def instantiate_from_cfg(
def instantiate_classification_heads_from_cfgs(
cls_head_configs: List[ClsHeadConfig]) -> List[layers.ClassificationHead]:
return [
layers.ClassificationHead(**cfg.as_dict()) for cfg in cls_head_configs
] if cls_head_configs else []
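The conditional list construction in `instantiate_classification_heads_from_cfgs` above can be illustrated with plain dataclasses; `HeadCfg` and `Head` below are hypothetical stand-ins for `ClsHeadConfig` and `layers.ClassificationHead`:

```python
import dataclasses


@dataclasses.dataclass
class HeadCfg:
    """Hypothetical stand-in for ClsHeadConfig."""
    inner_dim: int = 768
    num_classes: int = 2

    def as_dict(self):
        return dataclasses.asdict(self)


class Head:
    """Hypothetical stand-in for layers.ClassificationHead."""

    def __init__(self, inner_dim, num_classes):
        self.inner_dim = inner_dim
        self.num_classes = num_classes


def instantiate_heads(cls_head_configs):
    # A None or empty config list yields an empty head list, mirroring the
    # `... if cls_head_configs else []` pattern above.
    return [Head(**cfg.as_dict()) for cfg in cls_head_configs
            ] if cls_head_configs else []


assert instantiate_heads(None) == []
assert len(instantiate_heads([HeadCfg(), HeadCfg(num_classes=3)])) == 2
```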
def instantiate_bertpretrainer_from_cfg(
config: BertPretrainerConfig,
encoder_network: Optional[tf.keras.Model] = None):
encoder_network: Optional[tf.keras.Model] = None
) -> bert_pretrainer.BertPretrainerV2:
"""Instantiates a BertPretrainer from the config."""
encoder_cfg = config.encoder
if encoder_network is None:
encoder_network = networks.TransformerEncoder(
vocab_size=encoder_cfg.vocab_size,
hidden_size=encoder_cfg.hidden_size,
num_layers=encoder_cfg.num_layers,
num_attention_heads=encoder_cfg.num_attention_heads,
intermediate_size=encoder_cfg.intermediate_size,
activation=tf_utils.get_activation(encoder_cfg.hidden_activation),
dropout_rate=encoder_cfg.dropout_rate,
attention_dropout_rate=encoder_cfg.attention_dropout_rate,
max_sequence_length=encoder_cfg.max_position_embeddings,
type_vocab_size=encoder_cfg.type_vocab_size,
initializer=tf.keras.initializers.TruncatedNormal(
stddev=encoder_cfg.initializer_range))
if config.cls_heads:
classification_heads = [
layers.ClassificationHead(**cfg.as_dict()) for cfg in config.cls_heads
]
else:
classification_heads = []
encoder_network = encoders.instantiate_encoder_from_cfg(encoder_cfg)
return bert_pretrainer.BertPretrainerV2(
config.num_masked_tokens,
mlm_activation=tf_utils.get_activation(encoder_cfg.hidden_activation),
mlm_initializer=tf.keras.initializers.TruncatedNormal(
stddev=encoder_cfg.initializer_range),
encoder_network=encoder_network,
classification_heads=classification_heads)
classification_heads=instantiate_classification_heads_from_cfgs(
config.cls_heads))
@dataclasses.dataclass
class BertPretrainDataConfig(cfg.DataConfig):
"""Data config for BERT pretraining task."""
"""Data config for BERT pretraining task (tasks/masked_lm)."""
input_path: str = ""
global_batch_size: int = 512
is_training: bool = True
......@@ -95,15 +88,15 @@ class BertPretrainDataConfig(cfg.DataConfig):
@dataclasses.dataclass
class BertPretrainEvalDataConfig(BertPretrainDataConfig):
"""Data config for the eval set in BERT pretraining task."""
"""Data config for the eval set in BERT pretraining task (tasks/masked_lm)."""
input_path: str = ""
global_batch_size: int = 512
is_training: bool = False
@dataclasses.dataclass
class BertSentencePredictionDataConfig(cfg.DataConfig):
"""Data of sentence prediction dataset."""
class SentencePredictionDataConfig(cfg.DataConfig):
"""Data config for sentence prediction task (tasks/sentence_prediction)."""
input_path: str = ""
global_batch_size: int = 32
is_training: bool = True
......@@ -111,10 +104,55 @@ class BertSentencePredictionDataConfig(cfg.DataConfig):
@dataclasses.dataclass
class BertSentencePredictionDevDataConfig(cfg.DataConfig):
"""Dev data of MNLI sentence prediction dataset."""
class SentencePredictionDevDataConfig(cfg.DataConfig):
"""Dev Data config for sentence prediction (tasks/sentence_prediction)."""
input_path: str = ""
global_batch_size: int = 32
is_training: bool = False
seq_length: int = 128
drop_remainder: bool = False
@dataclasses.dataclass
class QADataConfig(cfg.DataConfig):
"""Data config for question answering task (tasks/question_answering)."""
input_path: str = ""
global_batch_size: int = 48
is_training: bool = True
seq_length: int = 384
@dataclasses.dataclass
class QADevDataConfig(cfg.DataConfig):
"""Dev Data config for question answering (tasks/question_answering)."""
input_path: str = ""
input_preprocessed_data_path: str = ""
version_2_with_negative: bool = False
doc_stride: int = 128
global_batch_size: int = 48
is_training: bool = False
seq_length: int = 384
query_length: int = 64
drop_remainder: bool = False
vocab_file: str = ""
tokenization: str = "WordPiece" # WordPiece or SentencePiece
do_lower_case: bool = True
@dataclasses.dataclass
class TaggingDataConfig(cfg.DataConfig):
"""Data config for tagging (tasks/tagging)."""
input_path: str = ""
global_batch_size: int = 48
is_training: bool = True
seq_length: int = 384
@dataclasses.dataclass
class TaggingDevDataConfig(cfg.DataConfig):
"""Dev Data config for tagging (tasks/tagging)."""
input_path: str = ""
global_batch_size: int = 48
is_training: bool = False
seq_length: int = 384
drop_remainder: bool = False
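The config dataclasses above all follow the same pattern: field defaults plus per-experiment keyword overrides. A minimal self-contained sketch using plain `dataclasses`, without the `cfg.DataConfig` base class (the input path is a hypothetical example):

```python
import dataclasses


@dataclasses.dataclass
class TaggingDataConfigSketch:
    """Stand-in for TaggingDataConfig above (no cfg.DataConfig base)."""
    input_path: str = ""
    global_batch_size: int = 48
    is_training: bool = True
    seq_length: int = 384


# Defaults apply unless overridden per experiment.
train_cfg = TaggingDataConfigSketch(input_path="/data/train*.tfrecord")
assert train_cfg.global_batch_size == 48 and train_cfg.is_training

# dataclasses.replace derives an eval config from the train config.
eval_cfg = dataclasses.replace(train_cfg, is_training=False)
assert not eval_cfg.is_training and eval_cfg.input_path == train_cfg.input_path
```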