"...bert-large_oneflow.git" did not exist on "5988d2cc317ac8cb8e21f84ec17dbd59e805df6c"
Commit 8cf8446b authored by Yukun Zhu, committed by aquariusjay

Adding panoptic evaluation tools and update internal changes. (#6320)

* Internal changes

PiperOrigin-RevId: 237183552

* update readme

PiperOrigin-RevId: 237184584
parent 05a79f5a
......@@ -64,6 +64,21 @@ works:
```
* Auto-DeepLab (also called hnasnet in core/nas_network.py):
```
@inproceedings{autodeeplab2019,
title={Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic
Image Segmentation},
author={Chenxi Liu and Liang-Chieh Chen and Florian Schroff and Hartwig Adam
and Wei Hua and Alan Yuille and Li Fei-Fei},
booktitle={CVPR},
year={2019}
}
```
In the current implementation, we support adopting the following network
backbones:
......@@ -72,6 +87,15 @@ backbones:
2. Xception [9, 10]: A powerful network structure intended for server-side
deployment.
3. ResNet-v1-{50,101} [14]: We provide both the original ResNet-v1 and its
'beta' variant where the 'stem' is modified for semantic segmentation.
4. PNASNet [15]: A powerful network structure found by neural architecture
search.
5. Auto-DeepLab (called HNASNet in the code): A segmentation-specific network
backbone found by neural architecture search.
This directory contains our TensorFlow [11] implementation. We provide codes
allowing users to train the model, evaluate results in terms of mIOU (mean
intersection-over-union), and visualize segmentation results. We use PASCAL VOC
......@@ -91,6 +115,8 @@ Some segmentation results on Flickr images:
* Yukun Zhu, github: [yknzhu](https://github.com/YknZhu)
* George Papandreou, github: [gpapan](https://github.com/gpapan)
* Hui Hui, github: [huihui-personal](https://github.com/huihui-personal)
* Maxwell D. Collins, github: [mcollinswisc](https://github.com/mcollinswisc)
* Ting Liu, github: [tingliu](https://github.com/tingliu)
## Table of Contents
......@@ -131,9 +157,17 @@ under tensorflow/models. Please refer to the LICENSE for details.
## Change Logs
### March 6, 2019
* Released the evaluation code (under the `evaluation` folder) for image
parsing, a.k.a. panoptic segmentation. In particular, the released code supports
evaluating the parsing results in terms of both the parsing covering and
panoptic quality metrics. **Contributors**: Maxwell Collins and Ting Liu.
### February 6, 2019
* Update decoder module to exploit multiple low-level features with different
* Updated decoder module to exploit multiple low-level features with different
output_strides.
### December 3, 2018
......@@ -241,3 +275,11 @@ and Cityscapes.
13. **The Cityscapes Dataset for Semantic Urban Scene Understanding**<br />
Cordts, Marius, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele. <br />
[[link]](https://www.cityscapes-dataset.com/). In CVPR, 2016.
14. **Deep Residual Learning for Image Recognition**<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. <br />
[[link]](https://arxiv.org/abs/1512.03385). In CVPR, 2016.
15. **Progressive Neural Architecture Search**<br />
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy. <br />
[[link]](https://arxiv.org/abs/1712.00559). In ECCV, 2018.
......@@ -175,3 +175,25 @@ class NASBaseCell(object):
h for h, is_used in zip(net, used_hiddenstates) if not is_used])
net = tf.concat(values=states_to_combine, axis=3)
return net
@tf.contrib.framework.add_arg_scope
def _apply_drop_path(self, net):
"""Apply drop_path regularization."""
drop_path_keep_prob = self._drop_path_keep_prob
if drop_path_keep_prob < 1.0:
# Scale keep prob by layer number.
assert self._cell_num != -1
layer_ratio = (self._cell_num + 1) / float(self._total_num_cells)
drop_path_keep_prob = 1 - layer_ratio * (1 - drop_path_keep_prob)
# Decrease keep prob over time.
current_step = tf.cast(tf.train.get_or_create_global_step(), tf.float32)
current_ratio = tf.minimum(1.0, current_step / self._total_training_steps)
drop_path_keep_prob = (1 - current_ratio * (1 - drop_path_keep_prob))
# Drop path.
noise_shape = [tf.shape(net)[0], 1, 1, 1]
random_tensor = drop_path_keep_prob
random_tensor += tf.random_uniform(noise_shape, dtype=tf.float32)
binary_tensor = tf.cast(tf.floor(random_tensor), net.dtype)
keep_prob_inv = tf.cast(1.0 / drop_path_keep_prob, net.dtype)
net = net * keep_prob_inv * binary_tensor
return net
......@@ -13,7 +13,21 @@
# limitations under the License.
# ==============================================================================
"""Network structure used by NAS."""
"""Network structure used by NAS.
Here we provide a few NAS backbones for semantic segmentation.
Currently, we have
1. pnasnet
"Progressive Neural Architecture Search", Chenxi Liu, Barret Zoph,
Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei,
Alan Yuille, Jonathan Huang, Kevin Murphy. In ECCV, 2018.
2. hnasnet (also called Auto-DeepLab)
"Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic
Image Segmentation", Chenxi Liu, Liang-Chieh Chen, Florian Schroff,
Hartwig Adam, Wei Hua, Alan Yuille, Li Fei-Fei. In CVPR, 2019.
"""
from __future__ import absolute_import
from __future__ import division
......
......@@ -19,7 +19,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import google3
import numpy as np
import tensorflow as tf
......
......@@ -59,6 +59,28 @@ def flip_dim(tensor_list, prob=0.5, dim=1):
return outputs
def _image_dimensions(image, rank):
"""Returns the dimensions of an image tensor.
Args:
image: A rank-D Tensor. For 3-D of shape: `[height, width, channels]`.
rank: The expected rank of the image.
Returns:
A list corresponding to the dimensions of the input image. Dimensions
that are statically known are Python integers; otherwise, they are
integer scalar tensors.
"""
if image.get_shape().is_fully_defined():
return image.get_shape().as_list()
else:
static_shape = image.get_shape().with_rank(rank).as_list()
dynamic_shape = tf.unstack(tf.shape(image), rank)
return [
s if s is not None else d for s, d in zip(static_shape, dynamic_shape)
]
def pad_to_bounding_box(image, offset_height, offset_width, target_height,
target_width, pad_value):
"""Pads the given image with the given pad_value.
......@@ -82,39 +104,61 @@ def pad_to_bounding_box(image, offset_height, offset_width, target_height,
ValueError: If the shape of image is incompatible with the offset_* or
target_* arguments.
"""
image_rank = tf.rank(image)
image_rank_assert = tf.Assert(
tf.equal(image_rank, 3),
['Wrong image tensor rank [Expected] [Actual]',
3, image_rank])
with tf.control_dependencies([image_rank_assert]):
image -= pad_value
image_shape = tf.shape(image)
height, width = image_shape[0], image_shape[1]
target_width_assert = tf.Assert(
tf.greater_equal(
target_width, width),
['target_width must be >= width'])
target_height_assert = tf.Assert(
tf.greater_equal(target_height, height),
['target_height must be >= height'])
with tf.control_dependencies([target_width_assert]):
after_padding_width = target_width - offset_width - width
with tf.control_dependencies([target_height_assert]):
after_padding_height = target_height - offset_height - height
offset_assert = tf.Assert(
tf.logical_and(
tf.greater_equal(after_padding_width, 0),
tf.greater_equal(after_padding_height, 0)),
['target size not possible with the given target offsets'])
height_params = tf.stack([offset_height, after_padding_height])
width_params = tf.stack([offset_width, after_padding_width])
channel_params = tf.stack([0, 0])
with tf.control_dependencies([offset_assert]):
paddings = tf.stack([height_params, width_params, channel_params])
padded = tf.pad(image, paddings)
return padded + pad_value
with tf.name_scope(None, 'pad_to_bounding_box', [image]):
image = tf.convert_to_tensor(image, name='image')
original_dtype = image.dtype
if original_dtype != tf.float32 and original_dtype != tf.float64:
# If image dtype is not float, we convert it to int32 to avoid overflow.
image = tf.cast(image, tf.int32)
image_rank_assert = tf.Assert(
tf.logical_or(
tf.equal(tf.rank(image), 3),
tf.equal(tf.rank(image), 4)),
['Wrong image tensor rank.'])
with tf.control_dependencies([image_rank_assert]):
image -= pad_value
image_shape = image.get_shape()
is_batch = True
if image_shape.ndims == 3:
is_batch = False
image = tf.expand_dims(image, 0)
elif image_shape.ndims is None:
is_batch = False
image = tf.expand_dims(image, 0)
image.set_shape([None] * 4)
elif image.get_shape().ndims != 4:
raise ValueError('Input image must have either 3 or 4 dimensions.')
_, height, width, _ = _image_dimensions(image, rank=4)
target_width_assert = tf.Assert(
tf.greater_equal(
target_width, width),
['target_width must be >= width'])
target_height_assert = tf.Assert(
tf.greater_equal(target_height, height),
['target_height must be >= height'])
with tf.control_dependencies([target_width_assert]):
after_padding_width = target_width - offset_width - width
with tf.control_dependencies([target_height_assert]):
after_padding_height = target_height - offset_height - height
offset_assert = tf.Assert(
tf.logical_and(
tf.greater_equal(after_padding_width, 0),
tf.greater_equal(after_padding_height, 0)),
['target size not possible with the given target offsets'])
batch_params = tf.stack([0, 0])
height_params = tf.stack([offset_height, after_padding_height])
width_params = tf.stack([offset_width, after_padding_width])
channel_params = tf.stack([0, 0])
with tf.control_dependencies([offset_assert]):
paddings = tf.stack([batch_params, height_params, width_params,
channel_params])
padded = tf.pad(image, paddings)
if not is_batch:
padded = tf.squeeze(padded, axis=[0])
outputs = padded + pad_value
if outputs.dtype != original_dtype:
outputs = tf.cast(outputs, original_dtype)
return outputs
def _crop(image, offset_height, offset_width, crop_height, crop_width):
......@@ -267,7 +311,7 @@ def get_random_scale(min_scale_factor, max_scale_factor, step_size):
raise ValueError('Unexpected value of min_scale_factor.')
if min_scale_factor == max_scale_factor:
return tf.to_float(min_scale_factor)
return tf.cast(min_scale_factor, tf.float32)
# When step_size = 0, we sample the value uniformly from [min, max).
if step_size == 0:
......@@ -297,7 +341,9 @@ def randomly_scale_image_and_label(image, label=None, scale=1.0):
if scale == 1.0:
return image, label
image_shape = tf.shape(image)
new_dim = tf.to_int32(tf.to_float([image_shape[0], image_shape[1]]) * scale)
new_dim = tf.cast(
tf.cast([image_shape[0], image_shape[1]], tf.float32) * scale,
tf.int32)
# Need squeeze and expand_dims because image interpolation takes
# 4D tensors as input.
......@@ -389,9 +435,9 @@ def resize_to_range(image,
"""
with tf.name_scope(scope, 'resize_to_range', [image]):
new_tensor_list = []
min_size = tf.to_float(min_size)
min_size = tf.cast(min_size, tf.float32)
if max_size is not None:
max_size = tf.to_float(max_size)
max_size = tf.cast(max_size, tf.float32)
# Modify the max_size to be a multiple of factor plus 1 and make sure the
# max dimension after resizing is no larger than max_size.
if factor is not None:
......@@ -399,8 +445,8 @@ def resize_to_range(image,
- factor)
[orig_height, orig_width, _] = resolve_shape(image, rank=3)
orig_height = tf.to_float(orig_height)
orig_width = tf.to_float(orig_width)
orig_height = tf.cast(orig_height, tf.float32)
orig_width = tf.cast(orig_width, tf.float32)
orig_min_size = tf.minimum(orig_height, orig_width)
# Calculate the larger of the possible sizes
......@@ -419,7 +465,7 @@ def resize_to_range(image,
small_width = tf.to_int32(tf.ceil(orig_width * small_scale_factor))
small_size = tf.stack([small_height, small_width])
new_size = tf.cond(
tf.to_float(tf.reduce_max(large_size)) > max_size,
tf.cast(tf.reduce_max(large_size), tf.float32) > max_size,
lambda: small_size,
lambda: large_size)
# Ensure that both output sides are multiples of factor plus one.
......
......@@ -252,25 +252,27 @@ class PreprocessUtilsTest(tf.test.TestCase):
[255, 3, 5, 255, 255],
[255, 255, 255, 255, 255]]]).astype(dtype)
with self.test_session():
image_placeholder = tf.placeholder(tf.float32)
with self.session() as sess:
padded_image = preprocess_utils.pad_to_bounding_box(
image_placeholder, 2, 1, 5, 5, 255)
self.assertAllClose(padded_image.eval(
feed_dict={image_placeholder: image}), expected_image)
image, 2, 1, 5, 5, 255)
padded_image = sess.run(padded_image)
self.assertAllClose(padded_image, expected_image)
# Add batch size = 1 to image.
padded_image = preprocess_utils.pad_to_bounding_box(
np.expand_dims(image, 0), 2, 1, 5, 5, 255)
padded_image = sess.run(padded_image)
self.assertAllClose(padded_image, np.expand_dims(expected_image, 0))
def testReturnOriginalImageWhenTargetSizeIsEqualToImageSize(self):
image = np.dstack([[[5, 6],
[9, 0]],
[[4, 3],
[3, 5]]])
with self.test_session():
image_placeholder = tf.placeholder(tf.float32)
with self.session() as sess:
padded_image = preprocess_utils.pad_to_bounding_box(
image_placeholder, 0, 0, 2, 2, 255)
self.assertAllClose(padded_image.eval(
feed_dict={image_placeholder: image}), image)
image, 0, 0, 2, 2, 255)
padded_image = sess.run(padded_image)
self.assertAllClose(padded_image, image)
def testDieOnTargetSizeGreaterThanImageSize(self):
image = np.dstack([[[5, 6],
......@@ -306,7 +308,7 @@ class PreprocessUtilsTest(tf.test.TestCase):
'target size not possible with the given target offsets'):
padded_image.eval(feed_dict={image_placeholder: image})
def testDieIfImageTensorRankIsNotThree(self):
def testDieIfImageTensorRankIsTwo(self):
image = np.vstack([[5, 6],
[9, 0]])
with self.test_session():
......
......@@ -17,7 +17,7 @@
from __future__ import print_function
import collections
import google3
import tensorflow as tf
from deeplab import common
......@@ -37,7 +37,7 @@ class DatasetTest(tf.test.TestCase):
dataset_name='pascal_voc_seg',
split_name='val',
dataset_dir=
'research/deeplab/testing/pascal_voc_seg',
'deeplab/testing/pascal_voc_seg',
batch_size=1,
crop_size=[3, 3], # Use small size for testing.
min_resize_value=3,
......
......@@ -72,7 +72,7 @@ def main(unused_argv):
'*.' + FLAGS.segmentation_format))
for annotation in annotations:
raw_annotation = _remove_colormap(annotation)
filename = os.path.splitext(os.path.basename(annotation))[0]
filename = os.path.basename(annotation)[:-4]
_save_annotation(raw_annotation,
os.path.join(
FLAGS.output_dir,
......
# Evaluation Metrics for Whole Image Parsing
Whole Image Parsing [1], also known as Panoptic Segmentation [2], generalizes
the tasks of semantic segmentation for "stuff" classes and instance
segmentation for "thing" classes, assigning both semantic and instance labels
to every pixel in an image.
Previous works evaluate the parsing result with separate metrics (e.g., one for
the semantic segmentation result and one for the object detection result).
Recently, Kirillov et al. proposed the unified instance-based Panoptic Quality
(PQ) metric [2], which has since been adopted by several benchmarks [3, 4].
However, we notice that the instance-based PQ metric often places
disproportionate emphasis on small instance parsing, as well as on "thing" over
"stuff" classes. To remedy these effects, we propose an alternative
region-based Parsing Covering (PC) metric [5], which adapts the Covering
metric [6], previously used for class-agnostic segmentation quality
evaluation, to the task of image parsing.
Here, we provide implementations of both PQ and PC for evaluating the parsing
results. We briefly explain both metrics below for reference.
## Panoptic Quality (PQ)
Given a groundtruth segmentation S and a predicted segmentation S', PQ is
defined as follows:
<p align="center">
<img src="g3doc/img/equation_pq.png" width=400>
</p>
where R and R' are groundtruth regions and predicted regions, respectively,
and |TP|, |FP|, and |FN| are the numbers of true positives, false positives,
and false negatives. The matching is determined by a threshold of 0.5
Intersection-Over-Union (IOU).
PQ treats all regions of the same "stuff" class as one instance, and the
size of instances is not considered. For example, instances with 10 × 10
pixels contribute equally to the metric as instances with 1000 × 1000 pixels.
Therefore, PQ is sensitive to false positives with small regions, and some
heuristics, such as removing those small regions, could improve the reported
performance (as also pointed out in the open-sourced evaluation code from [2]).
Thus, we argue that PQ is suitable for applications where one cares equally
about the parsing quality of instances irrespective of their sizes.
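For intuition, here is a minimal NumPy sketch (with made-up per-class counts) of how per-class PQ can be assembled from accumulated statistics, following the SQ x RQ decomposition used by `panoptic_quality.py` in this release; the array names are illustrative, not part of the library API.

```python
import numpy as np

# Hypothetical accumulated statistics for 3 categories: summed IoU over
# matched (IoU > 0.5) segment pairs, and true/false positive/negative counts.
iou_per_class = np.array([4.2, 0.0, 7.5])
tp_per_class = np.array([5., 0., 9.])
fp_per_class = np.array([1., 2., 3.])
fn_per_class = np.array([2., 1., 0.])

def safe_div(x, y):
  """Element-wise x / y, returning 0 where y == 0."""
  out = np.zeros_like(x)
  np.divide(x, y, out=out, where=(y != 0))
  return out

sq = safe_div(iou_per_class, tp_per_class)  # Segmentation quality.
rq = safe_div(tp_per_class,
              tp_per_class + 0.5 * fp_per_class + 0.5 * fn_per_class)
pq = sq * rq  # Per-class PQ.
# Average only over categories that actually have segments.
valid = (tp_per_class + fp_per_class + fn_per_class) > 0
print('PQ = %.4f' % pq[valid].mean())
```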
## Parsing Covering (PC)
We notice that there are applications where one pays more attention to large
objects, e.g., autonomous driving (where nearby objects are more important
than far away ones). Motivated by this, we propose to also evaluate the
quality of image parsing results by extending the existing Covering metric [6],
which accounts for instance sizes. Specifically, our proposed metric, Parsing
Covering (PC), is defined as follows:
<p align="center">
<img src="g3doc/img/equation_pc.png" width=400>
</p>
where S<sub>i</sub> and S<sub>i</sub>' are the groundtruth segmentation and
predicted segmentation for the i-th semantic class respectively, and
N<sub>i</sub> is the total number of pixels of groundtruth regions from
S<sub>i</sub>. The Covering for class i, Cov<sub>i</sub>, is computed in
the same way as the original Covering metric except that only groundtruth
regions from S<sub>i</sub> and predicted regions from S<sub>i</sub>' are
considered. PC is then obtained by computing the average of Cov<sub>i</sub>
over C semantic classes.
A notable difference between PQ and the proposed PC is that there is no
matching involved in PC and hence no matching threshold. In an attempt to
treat "thing" and "stuff" classes equally, the segmentation of "stuff" classes
still receives a partial PC score if it is only partially correct. For
example, if one out of three equally-sized trees is perfectly segmented, the
model gets the same partial PC score regardless of whether "tree" is
considered "stuff" or "thing".
## Tutorial
To evaluate the parsing results with PQ and PC, we provide two options:
1. Python off-line evaluation with results saved in the [COCO format](http://cocodataset.org/#format-results).
2. TensorFlow on-line evaluation.
Below, we explain each option in detail.
#### 1. Python off-line evaluation with results saved in COCO format
The [COCO result format](http://cocodataset.org/#format-results) has been
adopted by several benchmarks [3, 4]. Therefore, we provide a convenient
function, `eval_coco_format`, to evaluate the results saved in COCO format
in terms of PC and re-implemented PQ.
Before using the provided function, users need to download the official COCO
panoptic segmentation task API. Please see [installation](../g3doc/installation.md#add-libraries-to-pythonpath)
for reference.
Once the official COCO panoptic segmentation task API is downloaded, users
should be able to run `eval_coco_format.py` to evaluate the parsing results in
terms of both PC and the re-implemented PQ.
To be concrete, let's take a look at the function, `eval_coco_format` in
`eval_coco_format.py`:
```python
eval_coco_format(gt_json_file,
pred_json_file,
gt_folder=None,
pred_folder=None,
metric='pq',
num_categories=201,
ignored_label=0,
max_instances_per_category=256,
intersection_offset=None,
normalize_by_image_size=True,
num_workers=0,
print_digits=3):
```
where
1. `gt_json_file`: Path to a JSON file giving ground-truth annotations in COCO
format.
2. `pred_json_file`: Path to a JSON file for the predictions to evaluate.
3. `gt_folder`: Folder containing panoptic-format ID images to match
ground-truth annotations to image regions.
4. `pred_folder`: Path to a folder containing ID images for predictions.
5. `metric`: Name of a metric to compute. Set to `pc` or `pq` for evaluation in
PC or PQ, respectively.
6. `num_categories`: The number of segmentation categories (or "classes") in the
dataset.
7. `ignored_label`: A category id that is ignored in evaluation, e.g. the "void"
label in COCO panoptic segmentation dataset.
8. `max_instances_per_category`: The maximum number of instances for each
category to ensure unique instance labels.
9. `intersection_offset`: The maximum number of unique labels.
10. `normalize_by_image_size`: Whether to normalize groundtruth instance region
areas by image size when using PC.
11. `num_workers`: If set to a positive number, will spawn child processes to
compute parts of the metric in parallel by splitting the images between the
workers. If set to -1, will use the value of multiprocessing.cpu_count().
12. `print_digits`: Number of significant digits to print in summary of computed
metrics.
The input arguments have default values set for the COCO panoptic segmentation
dataset. Thus, users only need to provide the `gt_json_file` and the
`pred_json_file` (following the COCO format) to run the evaluation on COCO with
PQ. If users want to evaluate the results on other datasets, they may need
to change the default values.
As an example, interested users can take a look at the provided unit
test, `test_compare_pq_with_reference_eval`, in `eval_coco_format_test.py`.
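For example, with hypothetical file paths, the evaluation can also be invoked from Python roughly as follows (when `gt_folder`/`pred_folder` are omitted, they default to the JSON file names with the `.json` extension stripped):

```python
from deeplab.evaluation import eval_coco_format

# Hypothetical paths to COCO-format groundtruth and prediction files; all
# other arguments keep their COCO panoptic segmentation defaults.
results = eval_coco_format.eval_coco_format(
    gt_json_file='/path/to/panoptic_gt.json',
    pred_json_file='/path/to/panoptic_pred.json',
    metric='pc',      # or 'pq'
    num_workers=-1)   # -1 uses multiprocessing.cpu_count() workers
print(results['All'])  # e.g. {'pc': ..., 'n': ...}
```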
#### 2. TensorFlow on-line evaluation
Users may also want to run the TensorFlow on-line evaluation, similar to the
[tf.contrib.metrics.streaming_mean_iou](https://www.tensorflow.org/api_docs/python/tf/contrib/metrics/streaming_mean_iou).
Below, we provide a code snippet that shows how to use the provided
`streaming_panoptic_quality` and `streaming_parsing_covering`.
```python
metric_map = {}
metric_map['panoptic_quality'] = streaming_metrics.streaming_panoptic_quality(
category_label,
instance_label,
category_prediction,
instance_prediction,
num_classes=201,
max_instances_per_category=256,
ignored_label=0,
offset=256*256)
metric_map['parsing_covering'] = streaming_metrics.streaming_parsing_covering(
category_label,
instance_label,
category_prediction,
instance_prediction,
num_classes=201,
max_instances_per_category=256,
ignored_label=0,
offset=256*256,
normalize_by_image_size=True)
metrics_to_values, metrics_to_updates = slim.metrics.aggregate_metric_map(
metric_map)
```
where `metric_map` is a dictionary storing the streamed results of PQ and PC.
The `category_label` and the `instance_label` are the semantic segmentation and
instance segmentation groundtruth, respectively. That is, in the panoptic
segmentation format:
panoptic_label = category_label * max_instances_per_category + instance_label.
Similarly, the `category_prediction` and the `instance_prediction` are the
predicted semantic segmentation and instance segmentation, respectively.
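As a small illustration of this encoding (a sketch with made-up values, NumPy only), the semantic and instance maps can be recovered from a combined panoptic label map as follows:

```python
import numpy as np

max_instances_per_category = 256

# Made-up 2x2 panoptic label map, encoded as
# category_label * max_instances_per_category + instance_label.
panoptic_label = np.array([[7 * 256 + 1, 7 * 256 + 2],
                           [0 * 256 + 0, 3 * 256 + 1]], dtype=np.int32)

category_label = panoptic_label // max_instances_per_category  # [[7 7] [0 3]]
instance_label = panoptic_label % max_instances_per_category   # [[1 2] [0 1]]
```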
Below, we provide a code snippet showing how to summarize the results in the
context of tf.summary.
```python
summary_ops = []
for metric_name, metric_value in metrics_to_values.iteritems():
if metric_name == 'panoptic_quality':
[pq, sq, rq, total_tp, total_fn, total_fp] = tf.unstack(
metric_value, 6, axis=0)
panoptic_metrics = {
# Panoptic quality.
'pq': pq,
# Segmentation quality.
'sq': sq,
# Recognition quality.
'rq': rq,
# Total true positives.
'total_tp': total_tp,
# Total false negatives.
'total_fn': total_fn,
# Total false positives.
'total_fp': total_fp,
}
# Find the valid classes that will be used for evaluation. We will
# ignore the `ignore_label` class and other classes which have (tp + fn
# + fp) equal to 0.
valid_classes = tf.logical_and(
tf.not_equal(tf.range(0, num_classes), void_label),
tf.not_equal(total_tp + total_fn + total_fp, 0))
for target_metric, target_value in panoptic_metrics.iteritems():
output_metric_name = '{}_{}'.format(metric_name, target_metric)
op = tf.summary.scalar(
output_metric_name,
tf.reduce_mean(tf.boolean_mask(target_value, valid_classes)))
op = tf.Print(op, [target_value], output_metric_name + '_classwise: ',
summarize=num_classes)
op = tf.Print(
op,
[tf.reduce_mean(tf.boolean_mask(target_value, valid_classes))],
output_metric_name + '_mean: ',
summarize=1)
summary_ops.append(op)
elif metric_name == 'parsing_covering':
[per_class_covering,
total_per_class_weighted_ious,
total_per_class_gt_areas] = tf.unstack(metric_value, 3, axis=0)
# Find the valid classes that will be used for evaluation. We will
# ignore the `void_label` class and other classes which have
# total_per_class_weighted_ious + total_per_class_gt_areas equal to 0.
valid_classes = tf.logical_and(
tf.not_equal(tf.range(0, num_classes), void_label),
tf.not_equal(
total_per_class_weighted_ious + total_per_class_gt_areas, 0))
op = tf.summary.scalar(
metric_name,
tf.reduce_mean(tf.boolean_mask(per_class_covering, valid_classes)))
op = tf.Print(op, [per_class_covering], metric_name + '_classwise: ',
summarize=num_classes)
op = tf.Print(
op,
[tf.reduce_mean(
tf.boolean_mask(per_class_covering, valid_classes))],
metric_name + '_mean: ',
summarize=1)
summary_ops.append(op)
else:
raise ValueError('The metric_name "%s" is not supported.' % metric_name)
```
Afterwards, users can use the following code to run the evaluation in
TensorFlow. For reference, `eval.py` provides a simple example that runs the
streaming evaluation of mIOU for semantic segmentation.
```python
metric_values = slim.evaluation.evaluation_loop(
master=FLAGS.master,
checkpoint_dir=FLAGS.checkpoint_dir,
logdir=FLAGS.eval_logdir,
num_evals=num_batches,
eval_op=metrics_to_updates.values(),
final_op=metrics_to_values.values(),
summary_op=tf.summary.merge(summary_ops),
max_number_of_evaluations=FLAGS.max_number_of_evaluations,
eval_interval_secs=FLAGS.eval_interval_secs)
```
### References
1. **Image Parsing: Unifying Segmentation, Detection, and Recognition**<br />
Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu<br />
IJCV, 2005.
2. **Panoptic Segmentation**<br />
Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother and Piotr
Dollár<br />
arXiv:1801.00868, 2018.
3. **Microsoft COCO: Common Objects in Context**<br />
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross
Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick,
Piotr Dollar<br />
In the Proc. of ECCV, 2014.
4. **The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes**<br />
Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder<br />
In the Proc. of ICCV, 2017.
5. **DeeperLab: Single-Shot Image Parser**<br />
Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu,
Xiao Zhang, Vivienne Sze, George Papandreou, Liang-Chieh Chen<br />
arXiv: 1902.05093, 2019.
6. **Contour Detection and Hierarchical Image Segmentation**<br />
Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik<br />
PAMI, 2011.
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Defines the top-level interface for evaluating segmentations."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import abc
import numpy as np
import six
_EPSILON = 1e-10
def realdiv_maybe_zero(x, y):
"""Element-wise x / y where y may contain zeros, for those returns 0 too."""
return np.where(
np.less(np.abs(y), _EPSILON), np.zeros_like(x), np.divide(x, y))
@six.add_metaclass(abc.ABCMeta)
class SegmentationMetric(object):
"""Abstract base class for computers of segmentation metrics.
Subclasses will implement both:
1. Comparing the predicted segmentation for an image with the groundtruth.
2. Computing the final metric over a set of images.
These are often done as separate steps, due to the need to accumulate
intermediate values other than the metric itself across images, computing the
actual metric value only on these accumulations after all the images have been
compared.
A simple usage would be:
metric = MetricImplementation(...)
for <image>, <groundtruth> in evaluation_set:
<prediction> = run_segmentation(<image>)
metric.compare_and_accumulate(<prediction>, <groundtruth>)
print(metric.result())
"""
def __init__(self, num_categories, ignored_label, max_instances_per_category,
offset):
"""Base initialization for SegmentationMetric.
Args:
num_categories: The number of segmentation categories (or "classes") in the
dataset.
ignored_label: A category id that is ignored in evaluation, e.g. the void
label as defined in COCO panoptic segmentation dataset.
max_instances_per_category: The maximum number of instances for each
category. Used in ensuring unique instance labels.
offset: The maximum number of unique labels. This is used, by multiplying
the ground-truth labels, to generate unique ids for individual regions
of overlap between groundtruth and predicted segments.
"""
self.num_categories = num_categories
self.ignored_label = ignored_label
self.max_instances_per_category = max_instances_per_category
self.offset = offset
self.reset()
def _naively_combine_labels(self, category_array, instance_array):
"""Naively creates a combined label array from categories and instances."""
return (category_array.astype(np.uint32) * self.max_instances_per_category +
instance_array.astype(np.uint32))
@abc.abstractmethod
def compare_and_accumulate(
self, groundtruth_category_array, groundtruth_instance_array,
predicted_category_array, predicted_instance_array):
"""Compares predicted segmentation with groundtruth, accumulates its metric.
It is not assumed that instance ids are unique across different categories.
See for example combine_semantic_and_instance_predictions.py in official
PanopticAPI evaluation code for issues to consider when fusing category
and instance labels.
Instance ids of the ignored category have the meaning that id 0 is "void"
and the remaining ones are crowd instances.
Args:
groundtruth_category_array: A 2D numpy uint16 array of groundtruth
per-pixel category labels.
groundtruth_instance_array: A 2D numpy uint16 array of groundtruth
instance labels.
predicted_category_array: A 2D numpy uint16 array of predicted per-pixel
category labels.
predicted_instance_array: A 2D numpy uint16 array of predicted instance
labels.
Returns:
The value of the metric over all comparisons done so far, including this
one, as a float scalar.
"""
raise NotImplementedError('Must be implemented in subclasses.')
@abc.abstractmethod
def result(self):
"""Computes the metric over all comparisons done so far."""
raise NotImplementedError('Must be implemented in subclasses.')
@abc.abstractmethod
def detailed_results(self, is_thing=None):
"""Computes and returns the detailed final metric results.
Args:
is_thing: A boolean array of length `num_categories`. The entry
`is_thing[category_id]` is True iff that category is a "thing" category
instead of "stuff."
Returns:
A dictionary with a breakdown of metrics and/or metric factors by things,
stuff, and all categories.
"""
raise NotImplementedError('Not implemented in subclasses.')
@abc.abstractmethod
def result_per_category(self):
"""For supported metrics, return individual per-category metric values.
Returns:
A numpy array of shape `[self.num_categories]`, where index `i` is the
metric value over only that category.
"""
raise NotImplementedError('Not implemented in subclass.')
def print_detailed_results(self, is_thing=None, print_digits=3):
"""Prints out a detailed breakdown of metric results.
Args:
is_thing: A boolean array of length num_categories.
`is_thing[category_id]` will say whether that category is a "thing"
rather than "stuff."
print_digits: Number of significant digits to print in computed metrics.
"""
raise NotImplementedError('Not implemented in subclass.')
@abc.abstractmethod
def merge(self, other_instance):
"""Combines the accumulated results of another instance into self.
The following two cases should put `metric_a` into an equivalent state.
Case 1 (with merge):
metric_a = MetricsSubclass(...)
metric_a.compare_and_accumulate(<comparison 1>)
metric_a.compare_and_accumulate(<comparison 2>)
metric_b = MetricsSubclass(...)
metric_b.compare_and_accumulate(<comparison 3>)
metric_b.compare_and_accumulate(<comparison 4>)
metric_a.merge(metric_b)
Case 2 (without merge):
metric_a = MetricsSubclass(...)
metric_a.compare_and_accumulate(<comparison 1>)
metric_a.compare_and_accumulate(<comparison 2>)
metric_a.compare_and_accumulate(<comparison 3>)
metric_a.compare_and_accumulate(<comparison 4>)
Args:
other_instance: Another compatible instance of the same metric subclass.
"""
raise NotImplementedError('Not implemented in subclass.')
@abc.abstractmethod
def reset(self):
"""Resets the accumulation to the metric class's state at initialization.
Note that this function will be called in SegmentationMetric.__init__.
"""
raise NotImplementedError('Must be implemented in subclasses.')
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Computes evaluation metrics on groundtruth and predictions in COCO format.
The Common Objects in Context (COCO) dataset defines a format for specifying
combined semantic and instance segmentations as "panoptic" segmentations. This
is done with the combination of JSON and image files as specified at:
http://cocodataset.org/#format-results
where the JSON file specifies the overall structure of the result,
including the categories for each annotation, and the images specify the image
region for each annotation in that image by its ID.
This script computes additional metrics such as Parsing Covering on datasets and
predictions in this format. An implementation of Panoptic Quality is also
provided for convenience.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import json
import multiprocessing
import os
from absl import app
from absl import flags
from absl import logging
import numpy as np
from PIL import Image
import utils as panopticapi_utils
import six
from deeplab.evaluation import panoptic_quality
from deeplab.evaluation import parsing_covering
FLAGS = flags.FLAGS
flags.DEFINE_string(
'gt_json_file', None,
'Path to a JSON file giving ground-truth annotations in COCO format.')
flags.DEFINE_string('pred_json_file', None,
'Path to a JSON file for the predictions to evaluate.')
flags.DEFINE_string(
'gt_folder', None,
'Folder containing panoptic-format ID images to match ground-truth '
'annotations to image regions.')
flags.DEFINE_string('pred_folder', None,
'Folder containing ID images for predictions.')
flags.DEFINE_enum(
'metric', 'pq', ['pq', 'pc'], 'Shorthand name of a metric to compute. '
'Supported values are:\n'
'Panoptic Quality (pq)\n'
'Parsing Covering (pc)')
flags.DEFINE_integer(
'num_categories', 201,
'The number of segmentation categories (or "classes") in the dataset.')
flags.DEFINE_integer(
'ignored_label', 0,
'A category id that is ignored in evaluation, e.g. the void label as '
'defined in COCO panoptic segmentation dataset.')
flags.DEFINE_integer(
'max_instances_per_category', 256,
'The maximum number of instances for each category. Used in ensuring '
'unique instance labels.')
flags.DEFINE_integer('intersection_offset', None,
'The maximum number of unique labels.')
flags.DEFINE_bool(
'normalize_by_image_size', True,
'Whether to normalize groundtruth instance region areas by image size. If '
'True, groundtruth instance areas and weighted IoUs will be divided by the '
'size of the corresponding image before accumulated across the dataset. '
'Only used for Parsing Covering (pc) evaluation.')
flags.DEFINE_integer(
'num_workers', 0, 'If set to a positive number, will spawn child processes '
'to compute parts of the metric in parallel by splitting '
'the images between the workers. If set to -1, will use '
'the value of multiprocessing.cpu_count().')
flags.DEFINE_integer('print_digits', 3,
'Number of significant digits to print in metrics.')
def _build_metric(metric,
num_categories,
ignored_label,
max_instances_per_category,
intersection_offset=None,
normalize_by_image_size=True):
"""Creates a metric aggregator objet of the given name."""
if metric == 'pq':
logging.warning('One should check Panoptic Quality results against the '
'official COCO API code. Small numerical differences '
'(< 0.1%) can be magnified by rounding.')
return panoptic_quality.PanopticQuality(num_categories, ignored_label,
max_instances_per_category,
intersection_offset)
elif metric == 'pc':
return parsing_covering.ParsingCovering(
num_categories, ignored_label, max_instances_per_category,
intersection_offset, normalize_by_image_size)
else:
raise ValueError('No implementation for metric "%s"' % metric)
def _matched_annotations(gt_json, pred_json):
"""Yields a set of (groundtruth, prediction) image annotation pairs.."""
image_id_to_pred_ann = {
annotation['image_id']: annotation
for annotation in pred_json['annotations']
}
for gt_ann in gt_json['annotations']:
image_id = gt_ann['image_id']
pred_ann = image_id_to_pred_ann[image_id]
yield gt_ann, pred_ann
def _open_panoptic_id_image(image_path):
"""Loads a COCO-format panoptic ID image from file."""
return panopticapi_utils.rgb2id(
np.array(Image.open(image_path), dtype=np.uint32))
def _split_panoptic(ann_json, id_array, ignored_label, allow_crowds):
"""Given the COCO JSON and ID map, splits into categories and instances."""
category = np.zeros(id_array.shape, np.uint16)
instance = np.zeros(id_array.shape, np.uint16)
next_instance_id = collections.defaultdict(int)
# Skip instance label 0 for ignored label. That is reserved for void.
next_instance_id[ignored_label] = 1
for segment_info in ann_json['segments_info']:
if allow_crowds and segment_info['iscrowd']:
category_id = ignored_label
else:
category_id = segment_info['category_id']
mask = np.equal(id_array, segment_info['id'])
category[mask] = category_id
instance[mask] = next_instance_id[category_id]
next_instance_id[category_id] += 1
return category, instance
def _category_and_instance_from_annotation(ann_json, folder, ignored_label,
allow_crowds):
"""Given the COCO JSON annotations, finds maps of categories and instances."""
panoptic_id_image = _open_panoptic_id_image(
os.path.join(folder, ann_json['file_name']))
return _split_panoptic(ann_json, panoptic_id_image, ignored_label,
allow_crowds)
def _compute_metric(metric_aggregator, gt_folder, pred_folder,
annotation_pairs):
"""Iterates over matched annotation pairs and computes a metric over them."""
for gt_ann, pred_ann in annotation_pairs:
# We only expect "iscrowd" to appear in the ground-truth, and not in model
# output. In predicted JSON it is simply ignored, as done in official code.
gt_category, gt_instance = _category_and_instance_from_annotation(
gt_ann, gt_folder, metric_aggregator.ignored_label, True)
pred_category, pred_instance = _category_and_instance_from_annotation(
pred_ann, pred_folder, metric_aggregator.ignored_label, False)
metric_aggregator.compare_and_accumulate(gt_category, gt_instance,
pred_category, pred_instance)
return metric_aggregator
def _iterate_work_queue(work_queue):
"""Creates an iterable that retrieves items from a queue until one is None."""
task = work_queue.get(block=True)
while task is not None:
yield task
task = work_queue.get(block=True)
def _run_metrics_worker(metric_aggregator, gt_folder, pred_folder, work_queue,
result_queue):
result = _compute_metric(metric_aggregator, gt_folder, pred_folder,
_iterate_work_queue(work_queue))
result_queue.put(result, block=True)
def _is_thing_array(categories_json, ignored_label):
"""is_thing[category_id] is a bool on if category is "thing" or "stuff"."""
is_thing_dict = {}
for category_json in categories_json:
is_thing_dict[category_json['id']] = bool(category_json['isthing'])
# Check our assumption that the category ids are consecutive.
# Usually metrics should be able to handle this case, but adding a warning
# here.
max_category_id = max(six.iterkeys(is_thing_dict))
if len(is_thing_dict) != max_category_id + 1:
seen_ids = six.viewkeys(is_thing_dict)
all_ids = set(six.moves.range(max_category_id + 1))
unseen_ids = all_ids.difference(seen_ids)
if unseen_ids != {ignored_label}:
logging.warning(
'Nonconsecutive category ids or no category JSON specified for ids: '
'%s', unseen_ids)
is_thing_array = np.zeros(max_category_id + 1)
for category_id, is_thing in six.iteritems(is_thing_dict):
is_thing_array[category_id] = is_thing
return is_thing_array
def eval_coco_format(gt_json_file,
pred_json_file,
gt_folder=None,
pred_folder=None,
metric='pq',
num_categories=201,
ignored_label=0,
max_instances_per_category=256,
intersection_offset=None,
normalize_by_image_size=True,
num_workers=0,
print_digits=3):
"""Top-level code to compute metrics on a COCO-format result.
Note that the default values are set for COCO panoptic segmentation dataset,
and thus the users may want to change it for their own dataset evaluation.
Args:
gt_json_file: Path to a JSON file giving ground-truth annotations in COCO
format.
pred_json_file: Path to a JSON file for the predictions to evaluate.
gt_folder: Folder containing panoptic-format ID images to match ground-truth
annotations to image regions.
pred_folder: Folder containing ID images for predictions.
metric: Name of a metric to compute.
num_categories: The number of segmentation categories (or "classes") in the
dataset.
ignored_label: A category id that is ignored in evaluation, e.g. the "void"
label as defined in the COCO panoptic segmentation dataset.
max_instances_per_category: The maximum number of instances for each
category. Used in ensuring unique instance labels.
intersection_offset: The maximum number of unique labels.
normalize_by_image_size: Whether to normalize groundtruth instance region
areas by image size. If True, groundtruth instance areas and weighted IoUs
will be divided by the size of the corresponding image before accumulated
across the dataset. Only used for Parsing Covering (pc) evaluation.
num_workers: If set to a positive number, will spawn child processes to
compute parts of the metric in parallel by splitting the images between
the workers. If set to -1, will use the value of
multiprocessing.cpu_count().
print_digits: Number of significant digits to print in summary of computed
metrics.
Returns:
The computed result of the metric as a float scalar.
"""
with open(gt_json_file, 'r') as gt_json_fo:
gt_json = json.load(gt_json_fo)
with open(pred_json_file, 'r') as pred_json_fo:
pred_json = json.load(pred_json_fo)
if gt_folder is None:
gt_folder = gt_json_file.replace('.json', '')
if pred_folder is None:
pred_folder = pred_json_file.replace('.json', '')
if intersection_offset is None:
intersection_offset = (num_categories + 1) * max_instances_per_category
metric_aggregator = _build_metric(
metric, num_categories, ignored_label, max_instances_per_category,
intersection_offset, normalize_by_image_size)
if num_workers == -1:
logging.info('Attempting to get the CPU count to set # workers.')
num_workers = multiprocessing.cpu_count()
if num_workers > 0:
logging.info('Computing metric in parallel with %d workers.', num_workers)
work_queue = multiprocessing.Queue()
result_queue = multiprocessing.Queue()
workers = []
worker_args = (metric_aggregator, gt_folder, pred_folder, work_queue,
result_queue)
for _ in six.moves.range(num_workers):
workers.append(
multiprocessing.Process(target=_run_metrics_worker, args=worker_args))
for worker in workers:
worker.start()
for ann_pair in _matched_annotations(gt_json, pred_json):
work_queue.put(ann_pair, block=True)
# Will cause each worker to return a result and terminate upon receiving a
# None task.
for _ in six.moves.range(num_workers):
work_queue.put(None, block=True)
# Retrieve results.
for _ in six.moves.range(num_workers):
metric_aggregator.merge(result_queue.get(block=True))
for worker in workers:
worker.join()
else:
logging.info('Computing metric in a single process.')
annotation_pairs = _matched_annotations(gt_json, pred_json)
_compute_metric(metric_aggregator, gt_folder, pred_folder, annotation_pairs)
is_thing = _is_thing_array(gt_json['categories'], ignored_label)
metric_aggregator.print_detailed_results(
is_thing=is_thing, print_digits=print_digits)
return metric_aggregator.detailed_results(is_thing=is_thing)
def main(argv):
if len(argv) > 1:
raise app.UsageError('Too many command-line arguments.')
eval_coco_format(FLAGS.gt_json_file, FLAGS.pred_json_file, FLAGS.gt_folder,
FLAGS.pred_folder, FLAGS.metric, FLAGS.num_categories,
FLAGS.ignored_label, FLAGS.max_instances_per_category,
FLAGS.intersection_offset, FLAGS.normalize_by_image_size,
FLAGS.num_workers, FLAGS.print_digits)
if __name__ == '__main__':
flags.mark_flags_as_required(
['gt_json_file', 'gt_folder', 'pred_json_file', 'pred_folder'])
app.run(main)
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for eval_coco_format script."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
from absl import flags
from absl.testing import absltest
import evaluation as panopticapi_eval
from deeplab.evaluation import eval_coco_format
_TEST_DIR = 'deeplab/evaluation/testdata'
FLAGS = flags.FLAGS
class EvalCocoFormatTest(absltest.TestCase):
def test_compare_pq_with_reference_eval(self):
sample_data_dir = os.path.join(_TEST_DIR)
gt_json_file = os.path.join(sample_data_dir, 'coco_gt.json')
gt_folder = os.path.join(sample_data_dir, 'coco_gt')
pred_json_file = os.path.join(sample_data_dir, 'coco_pred.json')
pred_folder = os.path.join(sample_data_dir, 'coco_pred')
panopticapi_results = panopticapi_eval.pq_compute(
gt_json_file, pred_json_file, gt_folder, pred_folder)
deeplab_results = eval_coco_format.eval_coco_format(
gt_json_file,
pred_json_file,
gt_folder,
pred_folder,
metric='pq',
num_categories=7,
ignored_label=0,
max_instances_per_category=256,
intersection_offset=(256 * 256))
self.assertCountEqual(deeplab_results.keys(), ['All', 'Things', 'Stuff'])
for cat_group in ['All', 'Things', 'Stuff']:
self.assertCountEqual(deeplab_results[cat_group], ['pq', 'sq', 'rq', 'n'])
for metric in ['pq', 'sq', 'rq', 'n']:
self.assertAlmostEqual(deeplab_results[cat_group][metric],
panopticapi_results[cat_group][metric])
def test_compare_pc_with_golden_value(self):
sample_data_dir = os.path.join(_TEST_DIR)
gt_json_file = os.path.join(sample_data_dir, 'coco_gt.json')
gt_folder = os.path.join(sample_data_dir, 'coco_gt')
pred_json_file = os.path.join(sample_data_dir, 'coco_pred.json')
pred_folder = os.path.join(sample_data_dir, 'coco_pred')
deeplab_results = eval_coco_format.eval_coco_format(
gt_json_file,
pred_json_file,
gt_folder,
pred_folder,
metric='pc',
num_categories=7,
ignored_label=0,
max_instances_per_category=256,
intersection_offset=(256 * 256),
normalize_by_image_size=False)
self.assertCountEqual(deeplab_results.keys(), ['All', 'Things', 'Stuff'])
for cat_group in ['All', 'Things', 'Stuff']:
self.assertCountEqual(deeplab_results[cat_group], ['pc', 'n'])
self.assertAlmostEqual(deeplab_results['All']['pc'], 0.68210561)
self.assertEqual(deeplab_results['All']['n'], 6)
self.assertAlmostEqual(deeplab_results['Things']['pc'], 0.5890529)
self.assertEqual(deeplab_results['Things']['n'], 4)
self.assertAlmostEqual(deeplab_results['Stuff']['pc'], 0.86821097)
self.assertEqual(deeplab_results['Stuff']['n'], 2)
def test_compare_pc_with_golden_value_normalize_by_size(self):
sample_data_dir = os.path.join(_TEST_DIR)
gt_json_file = os.path.join(sample_data_dir, 'coco_gt.json')
gt_folder = os.path.join(sample_data_dir, 'coco_gt')
pred_json_file = os.path.join(sample_data_dir, 'coco_pred.json')
pred_folder = os.path.join(sample_data_dir, 'coco_pred')
deeplab_results = eval_coco_format.eval_coco_format(
gt_json_file,
pred_json_file,
gt_folder,
pred_folder,
metric='pc',
num_categories=7,
ignored_label=0,
max_instances_per_category=256,
intersection_offset=(256 * 256),
normalize_by_image_size=True)
self.assertCountEqual(deeplab_results.keys(), ['All', 'Things', 'Stuff'])
self.assertAlmostEqual(deeplab_results['All']['pc'], 0.68214908840)
def test_pc_with_multiple_workers(self):
sample_data_dir = os.path.join(_TEST_DIR)
gt_json_file = os.path.join(sample_data_dir, 'coco_gt.json')
gt_folder = os.path.join(sample_data_dir, 'coco_gt')
pred_json_file = os.path.join(sample_data_dir, 'coco_pred.json')
pred_folder = os.path.join(sample_data_dir, 'coco_pred')
deeplab_results = eval_coco_format.eval_coco_format(
gt_json_file,
pred_json_file,
gt_folder,
pred_folder,
metric='pc',
num_categories=7,
ignored_label=0,
max_instances_per_category=256,
intersection_offset=(256 * 256),
num_workers=3,
normalize_by_image_size=False)
self.assertCountEqual(deeplab_results.keys(), ['All', 'Things', 'Stuff'])
self.assertAlmostEqual(deeplab_results['All']['pc'], 0.68210561668)
if __name__ == '__main__':
absltest.main()
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of the Panoptic Quality metric.
Panoptic Quality is an instance-based metric for evaluating the task of
image parsing, aka panoptic segmentation.
Please see the paper for details:
"Panoptic Segmentation", Alexander Kirillov, Kaiming He, Ross Girshick,
Carsten Rother and Piotr Dollar. arXiv:1801.00868, 2018.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import numpy as np
import prettytable
import six
from deeplab.evaluation import base_metric
def _ids_to_counts(id_array):
"""Given a numpy array, a mapping from each unique entry to its count."""
ids, counts = np.unique(id_array, return_counts=True)
return dict(six.moves.zip(ids, counts))
class PanopticQuality(base_metric.SegmentationMetric):
"""Metric class for Panoptic Quality.
"Panoptic Segmentation" by Alexander Kirillov, Kaiming He, Ross Girshick,
Carsten Rother, Piotr Dollar.
https://arxiv.org/abs/1801.00868
"""
def compare_and_accumulate(
self, groundtruth_category_array, groundtruth_instance_array,
predicted_category_array, predicted_instance_array):
"""See base class."""
# First, combine the category and instance labels so that every unique
# value for (category, instance) is assigned a unique integer label.
pred_segment_id = self._naively_combine_labels(predicted_category_array,
predicted_instance_array)
gt_segment_id = self._naively_combine_labels(groundtruth_category_array,
groundtruth_instance_array)
# Pre-calculate areas for all groundtruth and predicted segments.
gt_segment_areas = _ids_to_counts(gt_segment_id)
pred_segment_areas = _ids_to_counts(pred_segment_id)
# We assume there is only one void segment and it has instance id = 0.
void_segment_id = self.ignored_label * self.max_instances_per_category
# There may be other ignored groundtruth segments with instance id > 0; find
# those ids using the unique segment ids extracted with the area computation
# above.
ignored_segment_ids = {
gt_segment_id for gt_segment_id in six.iterkeys(gt_segment_areas)
if (gt_segment_id //
self.max_instances_per_category) == self.ignored_label
}
# Next, combine the groundtruth and predicted labels. Dividing up the pixels
# based on which groundtruth segment and which predicted segment they belong
# to, this will assign a different 32-bit integer label to each choice
# of (groundtruth segment, predicted segment), encoded as
# gt_segment_id * offset + pred_segment_id.
intersection_id_array = (
gt_segment_id.astype(np.uint32) * self.offset +
pred_segment_id.astype(np.uint32))
# For every combination of (groundtruth segment, predicted segment) with a
# non-empty intersection, this counts the number of pixels in that
# intersection.
intersection_areas = _ids_to_counts(intersection_id_array)
# Helper function that computes the area of the overlap between a predicted
# segment and the ground-truth void/ignored segment.
def prediction_void_overlap(pred_segment_id):
void_intersection_id = void_segment_id * self.offset + pred_segment_id
return intersection_areas.get(void_intersection_id, 0)
# Compute overall ignored overlap.
def prediction_ignored_overlap(pred_segment_id):
total_ignored_overlap = 0
for ignored_segment_id in ignored_segment_ids:
intersection_id = ignored_segment_id * self.offset + pred_segment_id
total_ignored_overlap += intersection_areas.get(intersection_id, 0)
return total_ignored_overlap
# Sets populated with the groundtruth/predicted segments that have been
# matched with overlapping predicted/groundtruth segments, respectively.
gt_matched = set()
pred_matched = set()
# Calculate IoU per pair of intersecting segments of the same category.
for intersection_id, intersection_area in six.iteritems(intersection_areas):
gt_segment_id = intersection_id // self.offset
pred_segment_id = intersection_id % self.offset
gt_category = gt_segment_id // self.max_instances_per_category
pred_category = pred_segment_id // self.max_instances_per_category
if gt_category != pred_category:
continue
# Union between the groundtruth and predicted segments being compared does
# not include the portion of the predicted segment that consists of
# groundtruth "void" pixels.
union = (
gt_segment_areas[gt_segment_id] +
pred_segment_areas[pred_segment_id] - intersection_area -
prediction_void_overlap(pred_segment_id))
iou = intersection_area / union
if iou > 0.5:
self.tp_per_class[gt_category] += 1
self.iou_per_class[gt_category] += iou
gt_matched.add(gt_segment_id)
pred_matched.add(pred_segment_id)
# Count false negatives for each category.
for gt_segment_id in six.iterkeys(gt_segment_areas):
if gt_segment_id in gt_matched:
continue
category = gt_segment_id // self.max_instances_per_category
# Failing to detect a void segment is not a false negative.
if category == self.ignored_label:
continue
self.fn_per_class[category] += 1
# Count false positives for each category.
for pred_segment_id in six.iterkeys(pred_segment_areas):
if pred_segment_id in pred_matched:
continue
# A false positive is not penalized if it is mostly ignored in the
# groundtruth.
if (prediction_ignored_overlap(pred_segment_id) /
pred_segment_areas[pred_segment_id]) > 0.5:
continue
category = pred_segment_id // self.max_instances_per_category
self.fp_per_class[category] += 1
return self.result()
def _valid_categories(self):
"""Categories with a "valid" value for the metric, have > 0 instances.
We will ignore the `ignore_label` class and other classes which have
`tp + fn + fp = 0`.
Returns:
Boolean array of shape `[num_categories]`.
"""
valid_categories = np.not_equal(
self.tp_per_class + self.fn_per_class + self.fp_per_class, 0)
if self.ignored_label >= 0 and self.ignored_label < self.num_categories:
valid_categories[self.ignored_label] = False
return valid_categories
def detailed_results(self, is_thing=None):
"""See base class."""
valid_categories = self._valid_categories()
# If known, break down which categories are valid _and_ things/stuff.
category_sets = collections.OrderedDict()
category_sets['All'] = valid_categories
if is_thing is not None:
category_sets['Things'] = np.logical_and(valid_categories, is_thing)
category_sets['Stuff'] = np.logical_and(valid_categories,
np.logical_not(is_thing))
# Compute individual per-class metrics that constitute factors of PQ.
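    # These follow the standard panoptic quality decomposition
    # (Kirillov et al., CVPR 2019):
    #   SQ = (sum of IoUs over matched segments) / TP,
    #   RQ = TP / (TP + 0.5 * FN + 0.5 * FP),
    #   PQ = SQ * RQ.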
sq = base_metric.realdiv_maybe_zero(self.iou_per_class, self.tp_per_class)
rq = base_metric.realdiv_maybe_zero(
self.tp_per_class,
self.tp_per_class + 0.5 * self.fn_per_class + 0.5 * self.fp_per_class)
pq = np.multiply(sq, rq)
# Assemble detailed results dictionary.
results = {}
for category_set_name, in_category_set in six.iteritems(category_sets):
if np.any(in_category_set):
results[category_set_name] = {
'pq': np.mean(pq[in_category_set]),
'sq': np.mean(sq[in_category_set]),
'rq': np.mean(rq[in_category_set]),
# The number of categories in this subset.
'n': np.sum(in_category_set.astype(np.int32)),
}
else:
results[category_set_name] = {'pq': 0, 'sq': 0, 'rq': 0, 'n': 0}
return results
def result_per_category(self):
"""See base class."""
sq = base_metric.realdiv_maybe_zero(self.iou_per_class, self.tp_per_class)
rq = base_metric.realdiv_maybe_zero(
self.tp_per_class,
self.tp_per_class + 0.5 * self.fn_per_class + 0.5 * self.fp_per_class)
return np.multiply(sq, rq)
def print_detailed_results(self, is_thing=None, print_digits=3):
"""See base class."""
results = self.detailed_results(is_thing=is_thing)
tab = prettytable.PrettyTable()
tab.add_column('', [], align='l')
for fieldname in ['PQ', 'SQ', 'RQ', 'N']:
tab.add_column(fieldname, [], align='r')
for category_set, subset_results in six.iteritems(results):
data_cols = [
          round(subset_results[col_key] * 100, print_digits)
for col_key in ['pq', 'sq', 'rq']
]
data_cols += [subset_results['n']]
tab.add_row([category_set] + data_cols)
print(tab)
def result(self):
"""See base class."""
pq_per_class = self.result_per_category()
valid_categories = self._valid_categories()
if not np.any(valid_categories):
return 0.
return np.mean(pq_per_class[valid_categories])
def merge(self, other_instance):
"""See base class."""
self.iou_per_class += other_instance.iou_per_class
self.tp_per_class += other_instance.tp_per_class
self.fn_per_class += other_instance.fn_per_class
self.fp_per_class += other_instance.fp_per_class
def reset(self):
"""See base class."""
self.iou_per_class = np.zeros(self.num_categories, dtype=np.float64)
self.tp_per_class = np.zeros(self.num_categories, dtype=np.float64)
self.fn_per_class = np.zeros(self.num_categories, dtype=np.float64)
self.fp_per_class = np.zeros(self.num_categories, dtype=np.float64)
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for Panoptic Quality metric."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl.testing import absltest
import numpy as np
import six
from deeplab.evaluation import panoptic_quality
from deeplab.evaluation import test_utils
# See the definition of the color names at:
# https://en.wikipedia.org/wiki/Web_colors.
_CLASS_COLOR_MAP = {
(0, 0, 0): 0,
(0, 0, 255): 1, # Person (blue).
(255, 0, 0): 2, # Bear (red).
(0, 255, 0): 3, # Tree (lime).
(255, 0, 255): 4, # Bird (fuchsia).
(0, 255, 255): 5, # Sky (aqua).
(255, 255, 0): 6, # Cat (yellow).
}
class PanopticQualityTest(absltest.TestCase):
def test_perfect_match(self):
categories = np.zeros([6, 6], np.uint16)
instances = np.array([
[1, 1, 1, 1, 1, 1],
[1, 2, 2, 2, 2, 1],
[1, 2, 2, 2, 2, 1],
[1, 2, 2, 2, 2, 1],
[1, 2, 2, 1, 1, 1],
[1, 2, 1, 1, 1, 1],
],
dtype=np.uint16)
pq = panoptic_quality.PanopticQuality(
num_categories=1,
ignored_label=2,
max_instances_per_category=16,
offset=16)
pq.compare_and_accumulate(categories, instances, categories, instances)
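    # With identical groundtruth and prediction, both category-0 segments
    # (instances 1 and 2) match themselves with IoU = 1.0 each, so the
    # per-class IoU sum is 2.0 and TP is 2, with no FN or FP.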
np.testing.assert_array_equal(pq.iou_per_class, [2.0])
np.testing.assert_array_equal(pq.tp_per_class, [2])
np.testing.assert_array_equal(pq.fn_per_class, [0])
np.testing.assert_array_equal(pq.fp_per_class, [0])
np.testing.assert_array_equal(pq.result_per_category(), [1.0])
self.assertEqual(pq.result(), 1.0)
def test_totally_wrong(self):
det_categories = np.array([
[0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 0],
[0, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
],
dtype=np.uint16)
gt_categories = 1 - det_categories
instances = np.zeros([6, 6], np.uint16)
pq = panoptic_quality.PanopticQuality(
num_categories=2,
ignored_label=2,
max_instances_per_category=1,
offset=16)
pq.compare_and_accumulate(gt_categories, instances, det_categories,
instances)
np.testing.assert_array_equal(pq.iou_per_class, [0.0, 0.0])
np.testing.assert_array_equal(pq.tp_per_class, [0, 0])
np.testing.assert_array_equal(pq.fn_per_class, [1, 1])
np.testing.assert_array_equal(pq.fp_per_class, [1, 1])
np.testing.assert_array_equal(pq.result_per_category(), [0.0, 0.0])
self.assertEqual(pq.result(), 0.0)
def test_matches_by_iou(self):
good_det_labels = np.array(
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 2, 2, 2, 2, 1],
[1, 2, 2, 2, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
gt_labels = np.array(
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 2, 2, 2, 1],
[1, 2, 2, 2, 2, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
pq = panoptic_quality.PanopticQuality(
num_categories=1,
ignored_label=2,
max_instances_per_category=16,
offset=16)
pq.compare_and_accumulate(
np.zeros_like(gt_labels), gt_labels, np.zeros_like(good_det_labels),
good_det_labels)
# iou(1, 1) = 28/30
# iou(2, 2) = 6/8
np.testing.assert_array_almost_equal(pq.iou_per_class, [28 / 30 + 6 / 8])
np.testing.assert_array_equal(pq.tp_per_class, [2])
np.testing.assert_array_equal(pq.fn_per_class, [0])
np.testing.assert_array_equal(pq.fp_per_class, [0])
self.assertAlmostEqual(pq.result(), (28 / 30 + 6 / 8) / 2)
bad_det_labels = np.array(
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 2, 2, 1],
[1, 1, 1, 2, 2, 1],
[1, 1, 1, 2, 2, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
pq.reset()
pq.compare_and_accumulate(
np.zeros_like(gt_labels), gt_labels, np.zeros_like(bad_det_labels),
bad_det_labels)
# iou(1, 1) = 27/32
np.testing.assert_array_almost_equal(pq.iou_per_class, [27 / 32])
np.testing.assert_array_equal(pq.tp_per_class, [1])
np.testing.assert_array_equal(pq.fn_per_class, [1])
np.testing.assert_array_equal(pq.fp_per_class, [1])
self.assertAlmostEqual(pq.result(), (27 / 32) * (1 / 2))
def test_wrong_instances(self):
categories = np.array([
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 2, 2, 1, 2, 2],
[1, 2, 2, 1, 2, 2],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
predicted_instances = np.array([
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
],
dtype=np.uint16)
groundtruth_instances = np.zeros([6, 6], dtype=np.uint16)
pq = panoptic_quality.PanopticQuality(
num_categories=3,
ignored_label=0,
max_instances_per_category=10,
offset=100)
pq.compare_and_accumulate(categories, groundtruth_instances, categories,
predicted_instances)
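    # The category-1 segment matches perfectly (IoU = 1.0). The single
    # groundtruth category-2 segment spans both 2x2 blocks, so each predicted
    # category-2 instance reaches only IoU = 4/8 = 0.5, which does not exceed
    # the 0.5 matching threshold: the groundtruth segment becomes a false
    # negative and both predictions become false positives.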
np.testing.assert_array_equal(pq.iou_per_class, [0.0, 1.0, 0.0])
np.testing.assert_array_equal(pq.tp_per_class, [0, 1, 0])
np.testing.assert_array_equal(pq.fn_per_class, [0, 0, 1])
np.testing.assert_array_equal(pq.fp_per_class, [0, 0, 2])
np.testing.assert_array_equal(pq.result_per_category(), [0, 1, 0])
self.assertAlmostEqual(pq.result(), 0.5)
def test_instance_order_is_arbitrary(self):
categories = np.array([
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 2, 2, 1, 2, 2],
[1, 2, 2, 1, 2, 2],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
predicted_instances = np.array([
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
],
dtype=np.uint16)
groundtruth_instances = np.array([
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0],
[0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
],
dtype=np.uint16)
pq = panoptic_quality.PanopticQuality(
num_categories=3,
ignored_label=0,
max_instances_per_category=10,
offset=100)
pq.compare_and_accumulate(categories, groundtruth_instances, categories,
predicted_instances)
np.testing.assert_array_equal(pq.iou_per_class, [0.0, 1.0, 2.0])
np.testing.assert_array_equal(pq.tp_per_class, [0, 1, 2])
np.testing.assert_array_equal(pq.fn_per_class, [0, 0, 0])
np.testing.assert_array_equal(pq.fp_per_class, [0, 0, 0])
np.testing.assert_array_equal(pq.result_per_category(), [0, 1, 1])
self.assertAlmostEqual(pq.result(), 1.0)
def test_matches_expected(self):
pred_classes = test_utils.read_segmentation_with_rgb_color_map(
'team_pred_class.png', _CLASS_COLOR_MAP)
pred_instances = test_utils.read_test_image(
'team_pred_instance.png', mode='L')
instance_class_map = {
0: 0,
47: 1,
97: 1,
133: 1,
150: 1,
174: 1,
198: 2,
215: 1,
244: 1,
255: 1,
}
gt_instances, gt_classes = test_utils.panoptic_segmentation_with_class_map(
'team_gt_instance.png', instance_class_map)
pq = panoptic_quality.PanopticQuality(
num_categories=3,
ignored_label=0,
max_instances_per_category=256,
offset=256 * 256)
pq.compare_and_accumulate(gt_classes, gt_instances, pred_classes,
pred_instances)
np.testing.assert_array_almost_equal(
pq.iou_per_class, [2.06104, 5.26827, 0.54069], decimal=4)
np.testing.assert_array_equal(pq.tp_per_class, [1, 7, 1])
np.testing.assert_array_equal(pq.fn_per_class, [0, 1, 0])
np.testing.assert_array_equal(pq.fp_per_class, [0, 0, 0])
np.testing.assert_array_almost_equal(pq.result_per_category(),
[2.061038, 0.702436, 0.54069])
self.assertAlmostEqual(pq.result(), 0.62156287)
def test_merge_accumulates_all_across_instances(self):
categories = np.zeros([6, 6], np.uint16)
good_det_labels = np.array([
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 2, 2, 2, 2, 1],
[1, 2, 2, 2, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
gt_labels = np.array([
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 2, 2, 2, 1],
[1, 2, 2, 2, 2, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
good_pq = panoptic_quality.PanopticQuality(
num_categories=1,
ignored_label=2,
max_instances_per_category=16,
offset=16)
for _ in six.moves.range(2):
good_pq.compare_and_accumulate(categories, gt_labels, categories,
good_det_labels)
bad_det_labels = np.array([
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 2, 2, 1],
[1, 1, 1, 2, 2, 1],
[1, 1, 1, 2, 2, 1],
[1, 1, 1, 1, 1, 1],
],
dtype=np.uint16)
bad_pq = panoptic_quality.PanopticQuality(
num_categories=1,
ignored_label=2,
max_instances_per_category=16,
offset=16)
for _ in six.moves.range(2):
bad_pq.compare_and_accumulate(categories, gt_labels, categories,
bad_det_labels)
good_pq.merge(bad_pq)
np.testing.assert_array_almost_equal(
good_pq.iou_per_class, [2 * (28 / 30 + 6 / 8) + 2 * (27 / 32)])
np.testing.assert_array_equal(good_pq.tp_per_class, [2 * 2 + 2])
np.testing.assert_array_equal(good_pq.fn_per_class, [2])
np.testing.assert_array_equal(good_pq.fp_per_class, [2])
self.assertAlmostEqual(good_pq.result(), 0.63177083)
if __name__ == '__main__':
absltest.main()
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Implementation of the Parsing Covering metric.
Parsing Covering is a region-based metric for evaluating the task of
image parsing, aka panoptic segmentation.
Please see the paper for details:
"DeeperLab: Single-Shot Image Parser", Tien-Ju Yang, Maxwell D. Collins,
Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze,
George Papandreou, Liang-Chieh Chen. arXiv: 1902.05093, 2019.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import numpy as np
import prettytable
import six
from deeplab.evaluation import base_metric
class ParsingCovering(base_metric.SegmentationMetric):
r"""Metric class for Parsing Covering.
Computes segmentation covering metric introduced in (Arbelaez, et al., 2010)
with extension to handle multi-class semantic labels (a.k.a. parsing
covering). Specifically, segmentation covering (SC) is defined in Eq. (8) in
(Arbelaez et al., 2010) as:
SC(c) = \sum_{R\in S}(|R| * \max_{R'\in S'}O(R,R')) / \sum_{R\in S}|R|,
where S are the groundtruth instance regions and S' are the predicted
instance regions. The parsing covering is simply:
PC = \sum_{c=1}^{C}SC(c) / C,
where C is the number of classes.
"""
def __init__(self,
num_categories,
ignored_label,
max_instances_per_category,
offset,
normalize_by_image_size=True):
"""Initialization for ParsingCovering.
Args:
      num_categories: The number of segmentation categories (or "classes") in
        the dataset.
ignored_label: A category id that is ignored in evaluation, e.g. the void
label as defined in COCO panoptic segmentation dataset.
max_instances_per_category: The maximum number of instances for each
category. Used in ensuring unique instance labels.
offset: The maximum number of unique labels. This is used, by multiplying
the ground-truth labels, to generate unique ids for individual regions
of overlap between groundtruth and predicted segments.
normalize_by_image_size: Whether to normalize groundtruth instance region
areas by image size. If True, groundtruth instance areas and weighted
IoUs will be divided by the size of the corresponding image before
accumulated across the dataset.
"""
super(ParsingCovering, self).__init__(num_categories, ignored_label,
max_instances_per_category, offset)
self.normalize_by_image_size = normalize_by_image_size
def compare_and_accumulate(
self, groundtruth_category_array, groundtruth_instance_array,
predicted_category_array, predicted_instance_array):
"""See base class."""
# Allocate intermediate data structures.
max_ious = np.zeros([self.num_categories, self.max_instances_per_category],
dtype=np.float64)
gt_areas = np.zeros([self.num_categories, self.max_instances_per_category],
dtype=np.float64)
pred_areas = np.zeros(
[self.num_categories, self.max_instances_per_category],
dtype=np.float64)
# This is a dictionary in the format:
# {(category, gt_instance): [(pred_instance, intersection_area)]}.
intersections = collections.defaultdict(list)
# First, combine the category and instance labels so that every unique
# value for (category, instance) is assigned a unique integer label.
pred_segment_id = self._naively_combine_labels(predicted_category_array,
predicted_instance_array)
gt_segment_id = self._naively_combine_labels(groundtruth_category_array,
groundtruth_instance_array)
# Next, combine the groundtruth and predicted labels. Dividing up the pixels
# based on which groundtruth segment and which predicted segment they belong
# to, this will assign a different 32-bit integer label to each choice
# of (groundtruth segment, predicted segment), encoded as
# gt_segment_id * offset + pred_segment_id.
intersection_id_array = (
gt_segment_id.astype(np.uint32) * self.offset +
pred_segment_id.astype(np.uint32))
# For every combination of (groundtruth segment, predicted segment) with a
# non-empty intersection, this counts the number of pixels in that
# intersection.
intersection_ids, intersection_areas = np.unique(
intersection_id_array, return_counts=True)
# Find areas of all groundtruth and predicted instances, as well as of their
# intersections.
for intersection_id, intersection_area in six.moves.zip(
intersection_ids, intersection_areas):
gt_segment_id = intersection_id // self.offset
gt_category = gt_segment_id // self.max_instances_per_category
if gt_category == self.ignored_label:
continue
gt_instance = gt_segment_id % self.max_instances_per_category
gt_areas[gt_category, gt_instance] += intersection_area
pred_segment_id = intersection_id % self.offset
pred_category = pred_segment_id // self.max_instances_per_category
pred_instance = pred_segment_id % self.max_instances_per_category
pred_areas[pred_category, pred_instance] += intersection_area
if pred_category != gt_category:
continue
intersections[gt_category, gt_instance].append((pred_instance,
intersection_area))
# Find maximum IoU for every groundtruth instance.
for gt_label, instance_intersections in six.iteritems(intersections):
category, gt_instance = gt_label
gt_area = gt_areas[category, gt_instance]
ious = []
for pred_instance, intersection_area in instance_intersections:
pred_area = pred_areas[category, pred_instance]
union = gt_area + pred_area - intersection_area
ious.append(intersection_area / union)
max_ious[category, gt_instance] = max(ious)
# Normalize groundtruth instance areas by image size if necessary.
if self.normalize_by_image_size:
gt_areas /= groundtruth_category_array.size
# Compute per-class weighted IoUs and areas summed over all groundtruth
# instances.
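    # These two accumulators correspond to the numerator and denominator of
    # SC(c) in the class docstring: max_ious * gt_areas accumulates
    # |R| * max_{R' in S'} O(R, R') per groundtruth region, and gt_areas
    # accumulates |R|.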
self.weighted_iou_per_class += np.sum(max_ious * gt_areas, axis=-1)
self.gt_area_per_class += np.sum(gt_areas, axis=-1)
return self.result()
def result_per_category(self):
"""See base class."""
return base_metric.realdiv_maybe_zero(self.weighted_iou_per_class,
self.gt_area_per_class)
def _valid_categories(self):
"""Categories with a "valid" value for the metric, have > 0 instances.
We will ignore the `ignore_label` class and other classes which have
groundtruth area of 0.
Returns:
Boolean array of shape `[num_categories]`.
"""
valid_categories = np.not_equal(self.gt_area_per_class, 0)
if self.ignored_label >= 0 and self.ignored_label < self.num_categories:
valid_categories[self.ignored_label] = False
return valid_categories
def detailed_results(self, is_thing=None):
"""See base class."""
valid_categories = self._valid_categories()
# If known, break down which categories are valid _and_ things/stuff.
category_sets = collections.OrderedDict()
category_sets['All'] = valid_categories
if is_thing is not None:
category_sets['Things'] = np.logical_and(valid_categories, is_thing)
category_sets['Stuff'] = np.logical_and(valid_categories,
np.logical_not(is_thing))
covering_per_class = self.result_per_category()
results = {}
for category_set_name, in_category_set in six.iteritems(category_sets):
if np.any(in_category_set):
results[category_set_name] = {
'pc': np.mean(covering_per_class[in_category_set]),
# The number of valid categories in this subset.
'n': np.sum(in_category_set.astype(np.int32)),
}
else:
results[category_set_name] = {'pc': 0, 'n': 0}
return results
def print_detailed_results(self, is_thing=None, print_digits=3):
"""See base class."""
results = self.detailed_results(is_thing=is_thing)
tab = prettytable.PrettyTable()
tab.add_column('', [], align='l')
for fieldname in ['PC', 'N']:
tab.add_column(fieldname, [], align='r')
for category_set, subset_results in six.iteritems(results):
data_cols = [
          round(subset_results['pc'] * 100, print_digits), subset_results['n']
]
tab.add_row([category_set] + data_cols)
print(tab)
def result(self):
"""See base class."""
covering_per_class = self.result_per_category()
valid_categories = self._valid_categories()
if not np.any(valid_categories):
return 0.
return np.mean(covering_per_class[valid_categories])
def merge(self, other_instance):
"""See base class."""
self.weighted_iou_per_class += other_instance.weighted_iou_per_class
self.gt_area_per_class += other_instance.gt_area_per_class
def reset(self):
"""See base class."""
self.weighted_iou_per_class = np.zeros(
self.num_categories, dtype=np.float64)
self.gt_area_per_class = np.zeros(self.num_categories, dtype=np.float64)
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for Parsing Covering metric."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl.testing import absltest
import numpy as np
from deeplab.evaluation import parsing_covering
from deeplab.evaluation import test_utils
# See the definition of the color names at:
# https://en.wikipedia.org/wiki/Web_colors.
_CLASS_COLOR_MAP = {
(0, 0, 0): 0,
(0, 0, 255): 1, # Person (blue).
(255, 0, 0): 2, # Bear (red).
(0, 255, 0): 3, # Tree (lime).
(255, 0, 255): 4, # Bird (fuchsia).
(0, 255, 255): 5, # Sky (aqua).
(255, 255, 0): 6, # Cat (yellow).
}
class ParsingCoveringTest(absltest.TestCase):
def test_perfect_match(self):
categories = np.zeros([6, 6], np.uint16)
instances = np.array([
[2, 2, 2, 2, 2, 2],
[2, 4, 4, 4, 4, 2],
[2, 4, 4, 4, 4, 2],
[2, 4, 4, 4, 4, 2],
[2, 4, 4, 2, 2, 2],
[2, 4, 2, 2, 2, 2],
],
dtype=np.uint16)
pc = parsing_covering.ParsingCovering(
num_categories=3,
ignored_label=2,
max_instances_per_category=2,
offset=16,
normalize_by_image_size=False)
pc.compare_and_accumulate(categories, instances, categories, instances)
np.testing.assert_array_equal(pc.weighted_iou_per_class, [0.0, 21.0, 0.0])
np.testing.assert_array_equal(pc.gt_area_per_class, [0.0, 21.0, 0.0])
np.testing.assert_array_equal(pc.result_per_category(), [0.0, 1.0, 0.0])
self.assertEqual(pc.result(), 1.0)
def test_totally_wrong(self):
categories = np.zeros([6, 6], np.uint16)
gt_instances = np.array([
[0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 0],
[0, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
],
dtype=np.uint16)
pred_instances = 1 - gt_instances
pc = parsing_covering.ParsingCovering(
num_categories=2,
ignored_label=0,
max_instances_per_category=1,
offset=16,
normalize_by_image_size=False)
pc.compare_and_accumulate(categories, gt_instances, categories,
pred_instances)
np.testing.assert_array_equal(pc.weighted_iou_per_class, [0.0, 0.0])
np.testing.assert_array_equal(pc.gt_area_per_class, [0.0, 10.0])
np.testing.assert_array_equal(pc.result_per_category(), [0.0, 0.0])
self.assertEqual(pc.result(), 0.0)
def test_matches_expected(self):
pred_classes = test_utils.read_segmentation_with_rgb_color_map(
'team_pred_class.png', _CLASS_COLOR_MAP)
pred_instances = test_utils.read_test_image(
'team_pred_instance.png', mode='L')
instance_class_map = {
0: 0,
47: 1,
97: 1,
133: 1,
150: 1,
174: 1,
198: 2,
215: 1,
244: 1,
255: 1,
}
gt_instances, gt_classes = test_utils.panoptic_segmentation_with_class_map(
'team_gt_instance.png', instance_class_map)
pc = parsing_covering.ParsingCovering(
num_categories=3,
ignored_label=0,
max_instances_per_category=256,
offset=256 * 256,
normalize_by_image_size=False)
pc.compare_and_accumulate(gt_classes, gt_instances, pred_classes,
pred_instances)
np.testing.assert_array_almost_equal(
pc.weighted_iou_per_class, [0.0, 39864.14634, 3136], decimal=4)
np.testing.assert_array_equal(pc.gt_area_per_class, [0.0, 56870, 5800])
np.testing.assert_array_almost_equal(
pc.result_per_category(), [0.0, 0.70097, 0.54069], decimal=4)
self.assertAlmostEqual(pc.result(), 0.6208296732)
def test_matches_expected_normalize_by_size(self):
pred_classes = test_utils.read_segmentation_with_rgb_color_map(
'team_pred_class.png', _CLASS_COLOR_MAP)
pred_instances = test_utils.read_test_image(
'team_pred_instance.png', mode='L')
instance_class_map = {
0: 0,
47: 1,
97: 1,
133: 1,
150: 1,
174: 1,
198: 2,
215: 1,
244: 1,
255: 1,
}
gt_instances, gt_classes = test_utils.panoptic_segmentation_with_class_map(
'team_gt_instance.png', instance_class_map)
pc = parsing_covering.ParsingCovering(
num_categories=3,
ignored_label=0,
max_instances_per_category=256,
offset=256 * 256,
normalize_by_image_size=True)
pc.compare_and_accumulate(gt_classes, gt_instances, pred_classes,
pred_instances)
np.testing.assert_array_almost_equal(
pc.weighted_iou_per_class, [0.0, 0.5002088756, 0.03935002196],
decimal=4)
np.testing.assert_array_almost_equal(
pc.gt_area_per_class, [0.0, 0.7135955832, 0.07277746408], decimal=4)
# Note that the per-category and overall PCs are identical to those without
# normalization in the previous test, because we only have a single image.
np.testing.assert_array_almost_equal(
pc.result_per_category(), [0.0, 0.70097, 0.54069], decimal=4)
self.assertAlmostEqual(pc.result(), 0.6208296732)
if __name__ == '__main__':
absltest.main()
# Copyright 2019 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Code to compute segmentation in a "streaming" pattern in Tensorflow.
These aggregate the metric over examples of the evaluation set. Each example is
assumed to be fed in in a stream, and the metric implementation accumulates
across them.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from deeplab.evaluation import panoptic_quality
from deeplab.evaluation import parsing_covering
_EPSILON = 1e-10
def _realdiv_maybe_zero(x, y):
"""Support tf.realdiv(x, y) where y may contain zeros."""
return tf.where(tf.less(y, _EPSILON), tf.zeros_like(x), tf.realdiv(x, y))
def _running_total(value, shape, name=None):
"""Maintains a running total of tensor `value` between calls."""
with tf.variable_scope(name, 'running_total', [value]):
total_var = tf.get_variable(
'total',
shape,
value.dtype,
initializer=tf.zeros_initializer(),
trainable=False,
collections=[
tf.GraphKeys.LOCAL_VARIABLES, tf.GraphKeys.METRIC_VARIABLES
])
updated_total = tf.assign_add(total_var, value, use_locking=True)
return total_var, updated_total
def _panoptic_quality_helper(
groundtruth_category_array, groundtruth_instance_array,
predicted_category_array, predicted_instance_array, num_classes,
max_instances_per_category, ignored_label, offset):
"""Helper function to compute panoptic quality."""
pq = panoptic_quality.PanopticQuality(num_classes, ignored_label,
max_instances_per_category, offset)
pq.compare_and_accumulate(groundtruth_category_array,
groundtruth_instance_array,
predicted_category_array, predicted_instance_array)
return pq.iou_per_class, pq.tp_per_class, pq.fn_per_class, pq.fp_per_class
def streaming_panoptic_quality(groundtruth_categories,
groundtruth_instances,
predicted_categories,
predicted_instances,
num_classes,
max_instances_per_category,
ignored_label,
offset,
name=None):
"""Aggregates the panoptic metric across calls with different input tensors.
See tf.metrics.* functions for comparable functionality and usage.
Args:
groundtruth_categories: A 2D uint16 tensor of groundtruth category labels.
groundtruth_instances: A 2D uint16 tensor of groundtruth instance labels.
predicted_categories: A 2D uint16 tensor of predicted category labels.
predicted_instances: A 2D uint16 tensor of predicted instance labels.
num_classes: Number of classes in the dataset as an integer.
max_instances_per_category: The maximum number of instances for each class
as an integer or integer tensor.
ignored_label: The class id to be ignored in evaluation as an integer or
integer tensor.
offset: The maximum number of unique labels as an integer or integer tensor.
name: An optional variable_scope name.
Returns:
qualities: A tensor of shape `[6, num_classes]`, where (1) panoptic quality,
(2) segmentation quality, (3) recognition quality, (4) total_tp,
(5) total_fn and (6) total_fp are saved in the respective rows.
update_ops: List of operations that update the running overall panoptic
quality.
Raises:
RuntimeError: If eager execution is enabled.
"""
if tf.executing_eagerly():
raise RuntimeError('Cannot aggregate when eager execution is enabled.')
input_args = [
tf.convert_to_tensor(groundtruth_categories, tf.uint16),
tf.convert_to_tensor(groundtruth_instances, tf.uint16),
tf.convert_to_tensor(predicted_categories, tf.uint16),
tf.convert_to_tensor(predicted_instances, tf.uint16),
tf.convert_to_tensor(num_classes, tf.int32),
tf.convert_to_tensor(max_instances_per_category, tf.int32),
tf.convert_to_tensor(ignored_label, tf.int32),
tf.convert_to_tensor(offset, tf.int32),
]
return_types = [
tf.float64,
tf.float64,
tf.float64,
tf.float64,
]
with tf.variable_scope(name, 'streaming_panoptic_quality', input_args):
panoptic_results = tf.py_func(
_panoptic_quality_helper, input_args, return_types, stateful=False)
iou, tp, fn, fp = tuple(panoptic_results)
total_iou, updated_iou = _running_total(
iou, [num_classes], name='iou_total')
total_tp, updated_tp = _running_total(tp, [num_classes], name='tp_total')
total_fn, updated_fn = _running_total(fn, [num_classes], name='fn_total')
total_fp, updated_fp = _running_total(fp, [num_classes], name='fp_total')
update_ops = [updated_iou, updated_tp, updated_fn, updated_fp]
sq = _realdiv_maybe_zero(total_iou, total_tp)
rq = _realdiv_maybe_zero(total_tp,
total_tp + 0.5 * total_fn + 0.5 * total_fp)
pq = tf.multiply(sq, rq)
qualities = tf.stack([pq, sq, rq, total_tp, total_fn, total_fp], axis=0)
return qualities, update_ops
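# A minimal usage sketch for streaming_panoptic_quality, kept as a comment so
# that it does not alter the module. The placeholder shapes, the feed loop and
# the parameter values below are illustrative assumptions in the usual
# tf.metrics.* style, not requirements of this API:
#
#   gt_classes = tf.placeholder(tf.uint16, shape=[None, None])
#   gt_instances = tf.placeholder(tf.uint16, shape=[None, None])
#   pred_classes = tf.placeholder(tf.uint16, shape=[None, None])
#   pred_instances = tf.placeholder(tf.uint16, shape=[None, None])
#   qualities, update_ops = streaming_panoptic_quality(
#       gt_classes, gt_instances, pred_classes, pred_instances,
#       num_classes=3, max_instances_per_category=256, ignored_label=0,
#       offset=256 * 256)
#   with tf.Session() as sess:
#     sess.run(tf.local_variables_initializer())
#     for example in dataset:  # one set of 2D label maps per example
#       sess.run(update_ops, feed_dict={gt_classes: ..., gt_instances: ...,
#                                       pred_classes: ..., pred_instances: ...})
#     results = sess.run(qualities)  # row 0 of `results` is per-class PQ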
def _parsing_covering_helper(
groundtruth_category_array, groundtruth_instance_array,
predicted_category_array, predicted_instance_array, num_classes,
max_instances_per_category, ignored_label, offset, normalize_by_image_size):
"""Helper function to compute parsing covering."""
pc = parsing_covering.ParsingCovering(num_classes, ignored_label,
max_instances_per_category, offset,
normalize_by_image_size)
pc.compare_and_accumulate(groundtruth_category_array,
groundtruth_instance_array,
predicted_category_array, predicted_instance_array)
return pc.weighted_iou_per_class, pc.gt_area_per_class
def streaming_parsing_covering(groundtruth_categories,
groundtruth_instances,
predicted_categories,
predicted_instances,
num_classes,
max_instances_per_category,
ignored_label,
offset,
normalize_by_image_size=True,
name=None):
"""Aggregates the covering across calls with different input tensors.
See tf.metrics.* functions for comparable functionality and usage.
Args:
groundtruth_categories: A 2D uint16 tensor of groundtruth category labels.
groundtruth_instances: A 2D uint16 tensor of groundtruth instance labels.
predicted_categories: A 2D uint16 tensor of predicted category labels.
predicted_instances: A 2D uint16 tensor of predicted instance labels.
num_classes: Number of classes in the dataset as an integer.
max_instances_per_category: The maximum number of instances for each class
as an integer or integer tensor.
ignored_label: The class id to be ignored in evaluation as an integer or
integer tensor.
offset: The maximum number of unique labels as an integer or integer tensor.
normalize_by_image_size: Whether to normalize groundtruth region areas by
image size. If True, groundtruth instance areas and weighted IoUs will be
divided by the size of the corresponding image before accumulated across
the dataset.
name: An optional variable_scope name.
Returns:
coverings: A tensor of shape `[3, num_classes]`, where (1) per class
coverings, (2) per class sum of weighted IoUs, and (3) per class sum of
      groundtruth region areas are saved in the respective rows.
update_ops: List of operations that update the running overall parsing
covering.
Raises:
RuntimeError: If eager execution is enabled.
"""
if tf.executing_eagerly():
raise RuntimeError('Cannot aggregate when eager execution is enabled.')
input_args = [
tf.convert_to_tensor(groundtruth_categories, tf.uint16),
tf.convert_to_tensor(groundtruth_instances, tf.uint16),
tf.convert_to_tensor(predicted_categories, tf.uint16),
tf.convert_to_tensor(predicted_instances, tf.uint16),
tf.convert_to_tensor(num_classes, tf.int32),
tf.convert_to_tensor(max_instances_per_category, tf.int32),
tf.convert_to_tensor(ignored_label, tf.int32),
tf.convert_to_tensor(offset, tf.int32),
tf.convert_to_tensor(normalize_by_image_size, tf.bool),
]
return_types = [
tf.float64,
tf.float64,
]
with tf.variable_scope(name, 'streaming_parsing_covering', input_args):
covering_results = tf.py_func(
_parsing_covering_helper, input_args, return_types, stateful=False)
weighted_iou_per_class, gt_area_per_class = tuple(covering_results)
total_weighted_iou_per_class, updated_weighted_iou_per_class = (
_running_total(
weighted_iou_per_class, [num_classes],
name='weighted_iou_per_class_total'))
total_gt_area_per_class, updated_gt_area_per_class = _running_total(
gt_area_per_class, [num_classes], name='gt_area_per_class_total')
covering_per_class = _realdiv_maybe_zero(total_weighted_iou_per_class,
total_gt_area_per_class)
coverings = tf.stack([
covering_per_class,
total_weighted_iou_per_class,
total_gt_area_per_class,
],
axis=0)
update_ops = [updated_weighted_iou_per_class, updated_gt_area_per_class]
return coverings, update_ops
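# streaming_parsing_covering follows the same update_ops/session pattern as the
# streaming_panoptic_quality sketch above. A sketch of reducing its output to a
# single Parsing Covering number, mirroring ParsingCovering.result() and
# assuming `coverings` has been evaluated to a [3, num_classes] numpy array
# with ignored_label in [0, num_classes):
#
#   per_class_covering, weighted_iou, gt_area = coverings
#   valid = gt_area > 0
#   valid[ignored_label] = False
#   parsing_covering = per_class_covering[valid].mean()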