Commit e6ce8cdd authored by Arsalan Mousavian's avatar Arsalan Mousavian Committed by Taylor Robie
adding implementation of https://arxiv.org/abs/1805.06066 (#5265)

* adding implementation of https://arxiv.org/abs/1805.06066

* fixing typos and sentences in README
parent 3c373614
package(default_visibility = [":internal"])
licenses(["notice"]) # Apache 2.0
exports_files(["LICENSE"])
package_group(
name = "internal",
packages = [
"//cognitive_planning/...",
],
)
py_binary(
name = "train_supervised_active_vision",
srcs = [
"train_supervised_active_vision.py",
],
)
# cognitive_planning
**Visual Representation for Semantic Target Driven Navigation**
Arsalan Mousavian, Alexander Toshev, Marek Fiser, Jana Kosecka, James Davidson
This is the implementation of semantic target driven navigation training and evaluation on
the Active Vision Dataset, presented at the
ECCV Workshop on Visual Learning and Embodied Agents in Simulation Environments,
2018.
<div align="center">
<table style="width:100%" border="0">
<tr>
<td align="center"><img src='https://cs.gmu.edu/~amousavi/gifs/smaller_fridge_2.gif'></td>
<td align="center"><img src='https://cs.gmu.edu/~amousavi/gifs/smaller_tv_1.gif'></td>
</tr>
<tr>
<td align="center">Target: Fridge</td>
<td align="center">Target: Television</td>
</tr>
<tr>
<td align="center"><img src='https://cs.gmu.edu/~amousavi/gifs/smaller_microwave_1.gif'></td>
<td align="center"><img src='https://cs.gmu.edu/~amousavi/gifs/smaller_couch_1.gif'></td>
</tr>
<tr>
<td align="center">Target: Microwave</td>
<td align="center">Target: Couch</td>
</tr>
</table>
</div>
Paper: [https://arxiv.org/abs/1805.06066](https://arxiv.org/abs/1805.06066)
## 1. Installation
### Requirements
#### Python Packages
```shell
networkx
gin-config
```
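These packages can be installed with `pip`; for example:
```shell
pip install networkx gin-config
```
TensorFlow itself is also required and is assumed to be already installed in your environment.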
### Download cognitive_planning
```shell
git clone --depth 1 https://github.com/tensorflow/models.git
```
## 2. Datasets
### Download ActiveVision Dataset
We used the Active Vision Dataset (AVD), which can be downloaded from [here](http://cs.unc.edu/~ammirato/active_vision_dataset_website/). To make our code faster and reduce its memory footprint, we created the AVD Minimal dataset. AVD Minimal consists of low-resolution images from the original AVD dataset. In addition, we added annotations for target views, predicted object detections from an object detector pre-trained on the MS-COCO dataset, and predicted semantic segmentations from a model pre-trained on the NYU-v2 dataset. AVD Minimal can be downloaded from [here](https://storage.googleapis.com/active-vision-dataset/AVD_Minimal.zip). Set `$AVD_DIR` to the path of the downloaded AVD Minimal.
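For example, the following commands download and unpack AVD Minimal and point `$AVD_DIR` at it (the unpacked folder is assumed here to be named `AVD_Minimal`; adjust the path if your archive extracts to a different name):
```shell
wget https://storage.googleapis.com/active-vision-dataset/AVD_Minimal.zip
unzip AVD_Minimal.zip
export AVD_DIR=$PWD/AVD_Minimal
```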
### TODO: SUNCG Dataset
The current version of the code does not support the SUNCG dataset. Support can be added by
implementing the necessary functions of `envs/task_env.py` using the publicly
released code of SUNCG environments such as
[House3d](https://github.com/facebookresearch/House3D) and
[MINOS](https://github.com/minosworld/minos).
### ActiveVisionDataset Demo
If you wish to navigate the environment interactively and see what AVD looks like, you can use the following command:
```shell
python viz_active_vision_dataset_main.py \
--mode=human \
--gin_config=envs/configs/active_vision_config.gin \
--gin_params="ActiveVisionDatasetEnv.dataset_root='$AVD_DIR'"
```
## 3. Training
Right now, the released version only supports training and inference using real data from the Active Vision Dataset.
When the RGB image modality is used, the ResNet embeddings are initialized from a pre-trained checkpoint. Before starting the training, download the pre-trained ResNet50 checkpoint so that it ends up at ./resnet_v2_50_checkpoint/resnet_v2_50.ckpt in the working directory:
```shell
wget http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
```
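The downloaded archive then needs to be extracted so that the checkpoint ends up at the path above; a minimal sketch, assuming the tarball contains `resnet_v2_50.ckpt` at its top level (as the TF-Slim checkpoint archives usually do):
```shell
mkdir -p resnet_v2_50_checkpoint
tar -xzf resnet_v2_50_2017_04_14.tar.gz -C resnet_v2_50_checkpoint
ls resnet_v2_50_checkpoint  # should list resnet_v2_50.ckpt
```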
### Run training
Use the following command for training:
```shell
# Train
python train_supervised_active_vision.py \
--mode='train' \
--logdir=$CHECKPOINT_DIR \
--modality_types='det' \
--batch_size=8 \
--train_iters=200000 \
--lstm_cell_size=2048 \
--policy_fc_size=2048 \
--sequence_length=20 \
--max_eval_episode_length=100 \
--test_iters=194 \
--gin_config=envs/configs/active_vision_config.gin \
--gin_params="ActiveVisionDatasetEnv.dataset_root='$AVD_DIR'" \
--logtostderr
```
The training can be run for different modalities and modality combinations, including semantic segmentation, object detections, RGB images, and depth images. Low-resolution images, the outputs of detectors pre-trained on the COCO dataset, and semantic segmentations from a model pre-trained on the NYU dataset are provided as part of this distribution and can be found in the Meta directory of AVD_Minimal.
Additional details are described in the comments of the code and in the paper.
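For example, to train on the semantic segmentation modality instead of object detections, only the `--modality_types` flag changes. This is a sketch; see the flag definitions in `train_supervised_active_vision.py` for the exact accepted values and for how to combine multiple modalities:
```shell
python train_supervised_active_vision.py \
--mode='train' \
--logdir=$CHECKPOINT_DIR \
--modality_types='sseg' \
--batch_size=8 \
--train_iters=200000 \
--lstm_cell_size=2048 \
--policy_fc_size=2048 \
--sequence_length=20 \
--max_eval_episode_length=100 \
--test_iters=194 \
--gin_config=envs/configs/active_vision_config.gin \
--gin_params="ActiveVisionDatasetEnv.dataset_root='$AVD_DIR'" \
--logtostderr
```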
### Run Evaluation
Use the following command to unroll the policy on the eval environments. The inference code periodically checks the checkpoint folder for new checkpoints and uses them to unroll the policy on the eval environments. After each evaluation, it creates a folder $CHECKPOINT_DIR/evals/$ITER, where $ITER is the iteration number at which the checkpoint was stored.
```shell
# Eval
python train_supervised_active_vision.py \
--mode='eval' \
--logdir=$CHECKPOINT_DIR \
--modality_types='det' \
--batch_size=8 \
--train_iters=200000 \
--lstm_cell_size=2048 \
--policy_fc_size=2048 \
--sequence_length=20 \
--max_eval_episode_length=100 \
--test_iters=194 \
--gin_config=envs/configs/active_vision_config.gin \
--gin_params="ActiveVisionDatasetEnv.dataset_root='$AVD_DIR'" \
--logtostderr
```
At any point, you can run the following command to compute statistics, such as the success rate, over all the evaluations so far. It also generates GIF images of the unrollings of the best policy.
```shell
# Visualize and Compute Stats
python viz_active_vision_dataset_main.py \
--mode=eval \
--eval_folder=$CHECKPOINT_DIR/evals/ \
--output_folder=$OUTPUT_GIFS_FOLDER \
--gin_config=envs/configs/active_vision_config.gin \
--gin_params="ActiveVisionDatasetEnv.dataset_root='$AVD_DIR'"
```
## Contact
To ask questions or report issues, please open an issue on the tensorflow/models
[issues tracker](https://github.com/tensorflow/models/issues).
Please assign issues to @arsalan-mousavian.
## Reference
The details of the training and experiments can be found in the following paper. If you find our work useful in your research, please consider citing our paper:
```
@inproceedings{MousavianECCVW18,
author = {A. Mousavian and A. Toshev and M. Fiser and J. Kosecka and J. Davidson},
title = {Visual Representations for Semantic Target Driven Navigation},
booktitle = {ECCV Workshop on Visual Learning and Embodied Agents in Simulation Environments},
year = {2018},
}
```
python train_supervised_active_vision \
--mode='train' \
--logdir=/usr/local/google/home/kosecka/checkin_log_det/ \
--modality_types='det' \
--batch_size=8 \
--train_iters=200000 \
--lstm_cell_size=2048 \
--policy_fc_size=2048 \
--sequence_length=20 \
--max_eval_episode_length=100 \
--test_iters=194 \
--gin_config=robotics/cognitive_planning/envs/configs/active_vision_config.gin \
--gin_params='ActiveVisionDatasetEnv.dataset_root="/usr/local/google/home/kosecka/AVD_minimal/"' \
--logtostderr
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Interface for different embedders for modalities."""
import abc
import numpy as np
import tensorflow as tf
import preprocessing
from tensorflow.contrib.slim.nets import resnet_v2
slim = tf.contrib.slim
class Embedder(object):
"""Represents the embedder for different modalities.
Modalities can be semantic segmentation, the depth channel, object detection,
and so on, each of which requires a specific embedder.
"""
__metaclass__ = abc.ABCMeta
@abc.abstractmethod
def build(self, observation):
"""Builds the model to embed the observation modality.
Args:
observation: tensor that contains the raw observation from modality.
Returns:
Embedding tensor for the given observation tensor.
"""
raise NotImplementedError(
'Needs to be implemented as part of Embedder Interface')
class DetectionBoxEmbedder(Embedder):
"""Represents the model that encodes the detection boxes from images."""
def __init__(self, rnn_state_size, scope=None):
self._rnn_state_size = rnn_state_size
self._scope = scope
def build(self, observations):
"""Builds the model to embed object detection observations.
Args:
observations: a tuple of (dets, det_num).
dets is a tensor of BxTxLxE that has the detection boxes in all the
images of the batch. B is the batch size, T is the maximum length of
episode, L is the maximum number of detections per image in the batch
and E is the size of each detection embedding.
det_num is a tensor of BxT that contains the number of detected boxes
in each image of each sequence in the batch.
Returns:
For each image in the batch, returns the accumulative embedding of all the
detection boxes in that image.
"""
with tf.variable_scope(self._scope, default_name=''):
shape = observations[0].shape
dets = tf.reshape(observations[0], [-1, shape[-2], shape[-1]])
det_num = tf.reshape(observations[1], [-1])
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(self._rnn_state_size)
batch_size = tf.shape(dets)[0]
lstm_outputs, _ = tf.nn.dynamic_rnn(
cell=lstm_cell,
inputs=dets,
sequence_length=det_num,
initial_state=lstm_cell.zero_state(batch_size, dtype=tf.float32),
dtype=tf.float32)
# Gathering the last state of each sequence in the batch.
batch_range = tf.range(batch_size)
indices = tf.stack([batch_range, det_num - 1], axis=1)
last_lstm_outputs = tf.gather_nd(lstm_outputs, indices)
last_lstm_outputs = tf.reshape(last_lstm_outputs,
[-1, shape[1], self._rnn_state_size])
return last_lstm_outputs
class ResNet(Embedder):
"""Residual net embedder for image data."""
def __init__(self, params, *args, **kwargs):
super(ResNet, self).__init__(*args, **kwargs)
self._params = params
self._extra_train_ops = []
def build(self, images):
shape = images.get_shape().as_list()
if len(shape) == 5:
images = tf.reshape(images,
[shape[0] * shape[1], shape[2], shape[3], shape[4]])
embedding = self._build_model(images)
if len(shape) == 5:
embedding = tf.reshape(embedding, [shape[0], shape[1], -1])
return embedding
@property
def extra_train_ops(self):
return self._extra_train_ops
def _build_model(self, images):
"""Builds the model."""
# Convert images to floats and normalize them.
images = tf.to_float(images)
bs = images.get_shape().as_list()[0]
images = [
tf.image.per_image_standardization(tf.squeeze(i))
for i in tf.split(images, bs)
]
images = tf.concat([tf.expand_dims(i, axis=0) for i in images], axis=0)
with tf.variable_scope('init'):
x = self._conv('init_conv', images, 3, 3, 16, self._stride_arr(1))
strides = [1, 2, 2]
activate_before_residual = [True, False, False]
if self._params.use_bottleneck:
res_func = self._bottleneck_residual
filters = [16, 64, 128, 256]
else:
res_func = self._residual
filters = [16, 16, 32, 128]
with tf.variable_scope('unit_1_0'):
x = res_func(x, filters[0], filters[1], self._stride_arr(strides[0]),
activate_before_residual[0])
for i in xrange(1, self._params.num_residual_units):
with tf.variable_scope('unit_1_%d' % i):
x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
with tf.variable_scope('unit_2_0'):
x = res_func(x, filters[1], filters[2], self._stride_arr(strides[1]),
activate_before_residual[1])
for i in xrange(1, self._params.num_residual_units):
with tf.variable_scope('unit_2_%d' % i):
x = res_func(x, filters[2], filters[2], self._stride_arr(1), False)
with tf.variable_scope('unit_3_0'):
x = res_func(x, filters[2], filters[3], self._stride_arr(strides[2]),
activate_before_residual[2])
for i in xrange(1, self._params.num_residual_units):
with tf.variable_scope('unit_3_%d' % i):
x = res_func(x, filters[3], filters[3], self._stride_arr(1), False)
with tf.variable_scope('unit_last'):
x = self._batch_norm('final_bn', x)
x = self._relu(x, self._params.relu_leakiness)
with tf.variable_scope('pool_logit'):
x = self._global_avg_pooling(x)
return x
def _stride_arr(self, stride):
return [1, stride, stride, 1]
def _batch_norm(self, name, x):
"""batch norm implementation."""
with tf.variable_scope(name):
params_shape = [x.shape[-1]]
beta = tf.get_variable(
'beta',
params_shape,
tf.float32,
initializer=tf.constant_initializer(0.0, tf.float32))
gamma = tf.get_variable(
'gamma',
params_shape,
tf.float32,
initializer=tf.constant_initializer(1.0, tf.float32))
if self._params.is_train:
mean, variance = tf.nn.moments(x, [0, 1, 2], name='moments')
moving_mean = tf.get_variable(
'moving_mean',
params_shape,
tf.float32,
initializer=tf.constant_initializer(0.0, tf.float32),
trainable=False)
moving_variance = tf.get_variable(
'moving_variance',
params_shape,
tf.float32,
initializer=tf.constant_initializer(1.0, tf.float32),
trainable=False)
self._extra_train_ops.append(
tf.assign_moving_average(moving_mean, mean, 0.9))
self._extra_train_ops.append(
tf.assign_moving_average(moving_variance, variance, 0.9))
else:
mean = tf.get_variable(
'moving_mean',
params_shape,
tf.float32,
initializer=tf.constant_initializer(0.0, tf.float32),
trainable=False)
variance = tf.get_variable(
'moving_variance',
params_shape,
tf.float32,
initializer=tf.constant_initializer(1.0, tf.float32),
trainable=False)
tf.summary.histogram(mean.op.name, mean)
tf.summary.histogram(variance.op.name, variance)
# epsilon used to be 1e-5. Maybe 0.001 solves the NaN problem in deeper nets.
y = tf.nn.batch_normalization(x, mean, variance, beta, gamma, 0.001)
y.set_shape(x.shape)
return y
def _residual(self,
x,
in_filter,
out_filter,
stride,
activate_before_residual=False):
"""Residual unit with 2 sub layers."""
if activate_before_residual:
with tf.variable_scope('shared_activation'):
x = self._batch_norm('init_bn', x)
x = self._relu(x, self._params.relu_leakiness)
orig_x = x
else:
with tf.variable_scope('residual_only_activation'):
orig_x = x
x = self._batch_norm('init_bn', x)
x = self._relu(x, self._params.relu_leakiness)
with tf.variable_scope('sub1'):
x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
with tf.variable_scope('sub2'):
x = self._batch_norm('bn2', x)
x = self._relu(x, self._params.relu_leakiness)
x = self._conv('conv2', x, 3, out_filter, out_filter, [1, 1, 1, 1])
with tf.variable_scope('sub_add'):
if in_filter != out_filter:
orig_x = tf.nn.avg_pool(orig_x, stride, stride, 'VALID')
orig_x = tf.pad(
orig_x, [[0, 0], [0, 0], [0, 0], [(out_filter - in_filter) // 2,
(out_filter - in_filter) // 2]])
x += orig_x
return x
def _bottleneck_residual(self,
x,
in_filter,
out_filter,
stride,
activate_before_residual=False):
"""A residual convolutional layer with a bottleneck.
The layer is a composite of three convolutional layers with a ReLU non-
linearity and batch normalization after each linear convolution. The depth
of the first two convolutions is out_filter / 4 (hence it is a bottleneck).
Args:
x: a float 4 rank Tensor representing the input to the layer.
in_filter: a python integer representing depth of the input.
out_filter: a python integer representing depth of the output.
stride: a python integer denoting the stride of the layer applied before
the first convolution.
activate_before_residual: a python boolean. If True, then a ReLU is
applied as a first operation on the input x before everything else.
Returns:
A 4 rank Tensor with batch_size = batch size of input, width and height =
width / stride and height / stride of the input and depth = out_filter.
"""
if activate_before_residual:
with tf.variable_scope('common_bn_relu'):
x = self._batch_norm('init_bn', x)
x = self._relu(x, self._params.relu_leakiness)
orig_x = x
else:
with tf.variable_scope('residual_bn_relu'):
orig_x = x
x = self._batch_norm('init_bn', x)
x = self._relu(x, self._params.relu_leakiness)
with tf.variable_scope('sub1'):
x = self._conv('conv1', x, 1, in_filter, out_filter / 4, stride)
with tf.variable_scope('sub2'):
x = self._batch_norm('bn2', x)
x = self._relu(x, self._params.relu_leakiness)
x = self._conv('conv2', x, 3, out_filter / 4, out_filter / 4,
[1, 1, 1, 1])
with tf.variable_scope('sub3'):
x = self._batch_norm('bn3', x)
x = self._relu(x, self._params.relu_leakiness)
x = self._conv('conv3', x, 1, out_filter / 4, out_filter, [1, 1, 1, 1])
with tf.variable_scope('sub_add'):
if in_filter != out_filter:
orig_x = self._conv('project', orig_x, 1, in_filter, out_filter, stride)
x += orig_x
return x
def _decay(self):
costs = []
for var in tf.trainable_variables():
if var.op.name.find(r'DW') > 0:
costs.append(tf.nn.l2_loss(var))
return tf.multiply(self._params.weight_decay_rate, tf.add_n(costs))
def _conv(self, name, x, filter_size, in_filters, out_filters, strides):
"""Convolution."""
with tf.variable_scope(name):
n = filter_size * filter_size * out_filters
kernel = tf.get_variable(
'DW', [filter_size, filter_size, in_filters, out_filters],
tf.float32,
initializer=tf.random_normal_initializer(stddev=np.sqrt(2.0 / n)))
return tf.nn.conv2d(x, kernel, strides, padding='SAME')
def _relu(self, x, leakiness=0.0):
return tf.where(tf.less(x, 0.0), leakiness * x, x, name='leaky_relu')
def _fully_connected(self, x, out_dim):
x = tf.reshape(x, [self._params.batch_size, -1])
w = tf.get_variable(
'DW', [x.get_shape()[1], out_dim],
initializer=tf.uniform_unit_scaling_initializer(factor=1.0))
b = tf.get_variable(
'biases', [out_dim], initializer=tf.constant_initializer())
return tf.nn.xw_plus_b(x, w, b)
def _global_avg_pooling(self, x):
assert x.get_shape().ndims == 4
return tf.reduce_mean(x, [1, 2])
class MLPEmbedder(Embedder):
"""Embedder of vectorial data.
The net is a multi-layer perceptron, with ReLU nonlinearities in all layers
except the last one.
"""
def __init__(self, layers, *args, **kwargs):
"""Constructs MLPEmbedder.
Args:
layers: a list of python integers representing layer sizes.
*args: arguments for super constructor.
**kwargs: keyed arguments for super constructor.
"""
super(MLPEmbedder, self).__init__(*args, **kwargs)
self._layers = layers
def build(self, features):
shape = features.get_shape().as_list()
if len(shape) == 3:
features = tf.reshape(features, [shape[0] * shape[1], shape[2]])
x = features
for i, dim in enumerate(self._layers):
with tf.variable_scope('layer_%i' % i):
x = self._fully_connected(x, dim)
if i < len(self._layers) - 1:
x = self._relu(x)
if len(shape) == 3:
x = tf.reshape(x, shape[:-1] + [self._layers[-1]])
return x
def _fully_connected(self, x, out_dim):
w = tf.get_variable(
'DW', [x.get_shape()[1], out_dim],
initializer=tf.variance_scaling_initializer(distribution='uniform'))
b = tf.get_variable(
'biases', [out_dim], initializer=tf.constant_initializer())
return tf.nn.xw_plus_b(x, w, b)
def _relu(self, x, leakiness=0.0):
return tf.where(tf.less(x, 0.0), leakiness * x, x, name='leaky_relu')
class SmallNetworkEmbedder(Embedder):
"""Embedder for image like observations.
The network is comprised of multiple conv layers and a fully connected layer
at the end. The number of conv layers and the parameters are configured from
params.
"""
def __init__(self, params, *args, **kwargs):
"""Constructs the small network.
Args:
params: params should be tf.hparams type. params need to have a list of
conv_sizes, conv_strides, conv_channels. The length of these lists
should be equal to each other and to the number of conv layers in the
network. Plus, it also needs to have a boolean variable named to_one_hot
which indicates whether the input should be converted to one-hot or not.
The size of the fully connected layer is specified by
params.embedding_size.
*args: The rest of the parameters.
**kwargs: the rest of the parameters.
Raises:
ValueError: If the length of params.conv_strides, params.conv_sizes, and
params.conv_channels are not equal.
"""
super(SmallNetworkEmbedder, self).__init__(*args, **kwargs)
self._params = params
if len(self._params.conv_sizes) != len(self._params.conv_strides):
raise ValueError(
'Conv sizes and strides should have the same length: {} != {}'.format(
len(self._params.conv_sizes), len(self._params.conv_strides)))
if len(self._params.conv_sizes) != len(self._params.conv_channels):
raise ValueError(
'Conv sizes and channels should have the same length: {} != {}'.
format(len(self._params.conv_sizes), len(self._params.conv_channels)))
def build(self, images):
"""Builds the embedder with the given speicifcation.
Args:
images: a tensor that contains the input images which has the shape of
NxTxHxWxC where N is the batch size, T is the maximum length of the
sequence, H and W are the height and width of the images and C is the
number of channels.
Returns:
A tensor that is the embedding of the images.
"""
shape = images.get_shape().as_list()
images = tf.reshape(images,
[shape[0] * shape[1], shape[2], shape[3], shape[4]])
with slim.arg_scope(
[slim.conv2d, slim.fully_connected],
activation_fn=tf.nn.relu,
weights_regularizer=slim.l2_regularizer(self._params.weight_decay_rate),
biases_initializer=tf.zeros_initializer()):
with slim.arg_scope([slim.conv2d], padding='SAME'):
# convert the image to one hot if needed.
if self._params.to_one_hot:
net = tf.one_hot(
tf.squeeze(tf.to_int32(images), axis=[-1]),
self._params.one_hot_length)
else:
net = images
p = self._params
# Adding conv layers with the specified configurations.
for conv_id, kernel_stride_channel in enumerate(
zip(p.conv_sizes, p.conv_strides, p.conv_channels)):
kernel_size, stride, channels = kernel_stride_channel
net = slim.conv2d(
net,
channels, [kernel_size, kernel_size],
stride,
scope='conv_{}'.format(conv_id + 1))
net = slim.flatten(net)
net = slim.fully_connected(net, self._params.embedding_size, scope='fc')
output = tf.reshape(net, [shape[0], shape[1], -1])
return output
class ResNet50Embedder(Embedder):
"""Uses ResNet50 to embed input images."""
def build(self, images):
"""Builds a ResNet50 embedder for the input images.
It assumes that the range of the pixel values in the images tensor is
[0,255] and should be castable to tf.uint8.
Args:
images: a tensor that contains the input images which has the shape of
NxTxHxWx3 where N is the batch size, T is the maximum length of the
sequence, H and W are the height and width of the images, and 3 is the
number of color channels.
Returns:
The embedding of the input image with the shape of NxTxL where L is the
embedding size of the output.
Raises:
ValueError: if the shape of the input does not agree with the expected
shape explained in the Args section.
"""
shape = images.get_shape().as_list()
if len(shape) != 5:
raise ValueError(
'The tensor shape should have 5 elements, {} is provided'.format(
len(shape)))
if shape[4] != 3:
raise ValueError('Three channels are expected for the input image')
images = tf.cast(images, tf.uint8)
images = tf.reshape(images,
[shape[0] * shape[1], shape[2], shape[3], shape[4]])
with slim.arg_scope(resnet_v2.resnet_arg_scope()):
def preprocess_fn(x):
x = tf.expand_dims(x, 0)
x = tf.image.resize_bilinear(x, [299, 299],
align_corners=False)
return(tf.squeeze(x, [0]))
images = tf.map_fn(preprocess_fn, images, dtype=tf.float32)
net, _ = resnet_v2.resnet_v2_50(
images, is_training=False, global_pool=True)
output = tf.reshape(net, [shape[0], shape[1], -1])
return output
class IdentityEmbedder(Embedder):
"""This embedder just returns the input as the output.
Used for modalities whose embedding is the same as the modality itself.
For example, it can be used for the one-hot goal.
"""
def build(self, images):
return images
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Gym environment for the ActiveVision Dataset.
The dataset is captured with a robot moving around and taking pictures in
multiple directions. The actions are moving in four directions and rotating
clockwise or counterclockwise. The observations are the output of vision
pipelines such as object detectors. The goal is to find objects of interest
in each environment. For more details, refer:
http://cs.unc.edu/~ammirato/active_vision_dataset_website/.
"""
import tensorflow as tf
import collections
import copy
import json
import os
from StringIO import StringIO
import time
import gym
from gym.envs.registration import register
import gym.spaces
import networkx as nx
import numpy as np
import scipy.io as sio
from absl import logging
import gin
import cv2
import label_map_util
import visualization_utils as vis_util
from envs import task_env
register(
id='active-vision-env-v0',
entry_point=
'cognitive_planning.envs.active_vision_dataset_env:ActiveVisionDatasetEnv', # pylint: disable=line-too-long
)
_MAX_DEPTH_VALUE = 12102
SUPPORTED_ACTIONS = [
'right', 'rotate_cw', 'rotate_ccw', 'forward', 'left', 'backward', 'stop'
]
SUPPORTED_MODALITIES = [
task_env.ModalityTypes.SEMANTIC_SEGMENTATION,
task_env.ModalityTypes.DEPTH,
task_env.ModalityTypes.OBJECT_DETECTION,
task_env.ModalityTypes.IMAGE,
task_env.ModalityTypes.GOAL,
task_env.ModalityTypes.PREV_ACTION,
task_env.ModalityTypes.DISTANCE,
]
# Data structure for storing the information related to the graph of the world.
_Graph = collections.namedtuple('_Graph', [
'graph', 'id_to_index', 'index_to_id', 'target_indexes', 'distance_to_goal'
])
def _init_category_index(label_map_path):
"""Creates category index from class indexes to name of the classes.
Args:
label_map_path: path to the mapping.
Returns:
A map for mapping int keys to string categories.
"""
label_map = label_map_util.load_labelmap(label_map_path)
num_classes = np.max([x.id for x in label_map.item])
categories = label_map_util.convert_label_map_to_categories(
label_map, max_num_classes=num_classes, use_display_name=True)
category_index = label_map_util.create_category_index(categories)
return category_index
def _draw_detections(image_np, detections, category_index):
"""Draws detections on to the image.
Args:
image_np: Image in the form of uint8 numpy array.
detections: a dictionary that contains the detection outputs.
category_index: contains the mapping between indexes and the category names.
Returns:
Does not return anything; the detection boxes are drawn on image_np in place.
"""
vis_util.visualize_boxes_and_labels_on_image_array(
image_np,
detections['detection_boxes'],
detections['detection_classes'],
detections['detection_scores'],
category_index,
use_normalized_coordinates=True,
max_boxes_to_draw=1000,
min_score_thresh=.0,
agnostic_mode=False)
def generate_detection_image(detections,
image_size,
category_map,
num_classes,
is_binary=True):
"""Generates one_hot vector of the image using the detection boxes.
Args:
detections: 2D object detections from the image. It's a dictionary that
contains detection_boxes, detection_classes, and detection_scores with
dimensions of nx4, nx1, nx1 where n is the number of detections.
image_size: The resolution of the output image.
category_map: dictionary that maps label names to index.
num_classes: Number of classes.
is_binary: If true, it sets the corresponding channels to 0 and 1.
Otherwise, sets the score in the corresponding channel.
Returns:
Returns image_size x image_size x num_classes image for the detection boxes.
"""
res = np.zeros((image_size, image_size, num_classes), dtype=np.float32)
boxes = detections['detection_boxes']
labels = detections['detection_classes']
scores = detections['detection_scores']
for box, label, score in zip(boxes, labels, scores):
transformed_boxes = [int(round(t)) for t in box * image_size]
y1, x1, y2, x2 = transformed_boxes
# Detector returns fixed number of detections. Boxes with area of zero
# are equivalent of boxes that don't correspond to any detection box.
# So, we need to skip the boxes with area 0.
if (y2 - y1) * (x2 - x1) == 0:
continue
assert category_map[label] < num_classes, 'label = {}'.format(label)
value = score
if is_binary:
value = 1
res[y1:y2, x1:x2, category_map[label]] = value
return res
def _get_detection_path(root, detection_folder_name, world):
return os.path.join(root, 'Meta', detection_folder_name, world + '.npy')
def _get_image_folder(root, world):
return os.path.join(root, world, 'jpg_rgb')
def _get_json_path(root, world):
return os.path.join(root, world, 'annotations.json')
def _get_image_path(root, world, image_id):
return os.path.join(_get_image_folder(root, world), image_id + '.jpg')
def _get_image_list(path, worlds):
"""Builds a dictionary for all the worlds.
Args:
path: the path to the root of the dataset.
worlds: list of the worlds.
Returns:
dictionary where the keys are the world names and the values
are the image_ids of that world.
"""
world_id_dict = {}
for loc in worlds:
files = [t[:-4] for t in tf.gfile.ListDirectory(_get_image_folder(path, loc))]
world_id_dict[loc] = files
return world_id_dict
def read_all_poses(dataset_root, world):
"""Reads all the poses for each world.
Args:
dataset_root: the path to the root of the dataset.
world: string, name of the world.
Returns:
Dictionary of poses for all the images in each world. The key is the image
id of each view and the values are tuples of (x, z, R, scale), where x and z
are the first and third coordinates of the translation. R is the 3x3 rotation
matrix and scale is a float scalar that indicates the scale that needs to
be multiplied by x and z in order to get the real world coordinates.
Raises:
ValueError: if the number of images does not match the number of poses read.
"""
path = os.path.join(dataset_root, world, 'image_structs.mat')
with tf.gfile.Open(path) as f:
data = sio.loadmat(f)
xyz = data['image_structs']['world_pos']
image_names = data['image_structs']['image_name'][0]
rot = data['image_structs']['R'][0]
scale = data['scale'][0][0]
n = xyz.shape[1]
x = [xyz[0][i][0][0] for i in range(n)]
z = [xyz[0][i][2][0] for i in range(n)]
names = [name[0][:-4] for name in image_names]
if len(names) != len(x):
raise ValueError('number of image names are not equal to the number of '
'poses {} != {}'.format(len(names), len(x)))
output = {}
for i in range(n):
if rot[i].shape[0] != 0:
assert rot[i].shape[0] == 3
assert rot[i].shape[1] == 3
output[names[i]] = (x[i], z[i], rot[i], scale)
else:
output[names[i]] = (x[i], z[i], None, scale)
return output
def read_cached_data(should_load_images, dataset_root, segmentation_file_name,
targets_file_name, output_size):
"""Reads all the necessary cached data.
Args:
should_load_images: whether to load the images or not.
dataset_root: path to the root of the dataset.
segmentation_file_name: The name of the file that contains semantic
segmentation annotations.
targets_file_name: The name of the file that contains targets annotated for
each world.
output_size: Size of the output images. This is used for pre-processing the
loaded images.
Returns:
Dictionary of all the cached data.
"""
load_start = time.time()
result_data = {}
annotated_target_path = os.path.join(dataset_root, 'Meta',
targets_file_name + '.npy')
logging.info('loading targets: %s', annotated_target_path)
with tf.gfile.Open(annotated_target_path) as f:
result_data['targets'] = np.load(f).item()
depth_image_path = os.path.join(dataset_root, 'Meta/depth_imgs.npy')
logging.info('loading depth: %s', depth_image_path)
with tf.gfile.Open(depth_image_path) as f:
depth_data = np.load(f).item()
logging.info('processing depth')
for home_id in depth_data:
images = depth_data[home_id]
for image_id in images:
depth = images[image_id]
depth = cv2.resize(
depth / _MAX_DEPTH_VALUE, (output_size, output_size),
interpolation=cv2.INTER_NEAREST)
depth_mask = (depth > 0).astype(np.float32)
depth = np.dstack((depth, depth_mask))
images[image_id] = depth
result_data[task_env.ModalityTypes.DEPTH] = depth_data
sseg_path = os.path.join(dataset_root, 'Meta',
segmentation_file_name + '.npy')
logging.info('loading sseg: %s', sseg_path)
with tf.gfile.Open(sseg_path) as f:
sseg_data = np.load(f).item()
logging.info('processing sseg')
for home_id in sseg_data:
images = sseg_data[home_id]
for image_id in images:
sseg = images[image_id]
sseg = cv2.resize(
sseg, (output_size, output_size), interpolation=cv2.INTER_NEAREST)
images[image_id] = np.expand_dims(sseg, axis=-1).astype(np.float32)
result_data[task_env.ModalityTypes.SEMANTIC_SEGMENTATION] = sseg_data
if should_load_images:
image_path = os.path.join(dataset_root, 'Meta/imgs.npy')
logging.info('loading imgs: %s', image_path)
with tf.gfile.Open(image_path) as f:
image_data = np.load(f).item()
result_data[task_env.ModalityTypes.IMAGE] = image_data
with tf.gfile.Open(os.path.join(dataset_root, 'Meta/world_id_dict.npy')) as f:
result_data['world_id_dict'] = np.load(f).item()
logging.info('loading done in %f seconds', time.time() - load_start)
return result_data
@gin.configurable
def get_spec_dtype_map():
return {gym.spaces.Box: np.float32}
@gin.configurable
class ActiveVisionDatasetEnv(task_env.TaskEnv):
"""Simulates the environment from ActiveVisionDataset."""
cached_data = None
def __init__(
self,
episode_length,
modality_types,
confidence_threshold,
output_size,
worlds,
targets,
compute_distance,
should_draw_detections,
dataset_root,
labelmap_path,
reward_collision,
reward_goal_range,
num_detection_classes,
segmentation_file_name,
detection_folder_name,
actions,
targets_file_name,
eval_init_points_file_name=None,
shaped_reward=False,
):
"""Instantiates the environment for ActiveVision Dataset.
Args:
episode_length: the length of each episode.
modality_types: a list of the strings where each entry indicates the name
of the modalities to be loaded. Valid entries are "sseg", "det",
"depth", "image", "distance", and "prev_action". "distance" should be
used for computing metrics in tf agents.
confidence_threshold: Consider detections more than confidence_threshold
for potential targets.
output_size: Resolution of the output image.
worlds: List of the name of the worlds.
targets: List of the target names. Each entry is a string label of the
target category (e.g. 'fridge', 'microwave', so on).
compute_distance: If True, outputs the distance of the view to the goal.
should_draw_detections (bool): If True, the image returned for the
observation will contain the bounding boxes.
dataset_root: the path to the root folder of the dataset.
labelmap_path: path to the dictionary that converts label strings to
indexes.
reward_collision: the reward the agent gets after hitting an obstacle.
It should be a non-positive number.
reward_goal_range: the number of steps from the goal within which the agent
is considered to have reached the goal. If the agent's distance is less
than the specified goal range, the episode also finishes by setting
done = True.
num_detection_classes: number of classes that detector outputs.
segmentation_file_name: the name of the file that contains the semantic
information. The file should be in the dataset_root/Meta/ folder.
detection_folder_name: Name of the folder that contains the detections
for each world. The folder should be under dataset_root/Meta/ folder.
actions: The list of the action names. Valid entries are listed in
SUPPORTED_ACTIONS.
targets_file_name: the name of the file that contains the annotated
targets. The file should be in the dataset_root/Meta/ folder.
eval_init_points_file_name: The name of the file that contains the initial
points for evaluating the performance of the agent. If set to None,
episodes start at random locations. Should be only set for evaluation.
shaped_reward: Whether to add delta goal distance to the reward each step.
Raises:
ValueError: If one of the targets is not available in the annotated
targets or the modality names are not from the domain specified above.
ValueError: If one of the actions is not in SUPPORTED_ACTIONS.
ValueError: If the reward_collision is a positive number.
ValueError: If there is no action other than stop provided.
"""
if reward_collision > 0:
raise ValueError('"reward" for collision should be non positive')
if reward_goal_range < 0:
logging.warning('environment does not terminate the episode even when '
'the agent gets close to the goal')
if not modality_types:
raise ValueError('modality names can not be empty')
for name in modality_types:
if name not in SUPPORTED_MODALITIES:
raise ValueError('invalid modality type: {}'.format(name))
actions_other_than_stop_found = False
for a in actions:
if a != 'stop':
actions_other_than_stop_found = True
if a not in SUPPORTED_ACTIONS:
raise ValueError('invalid action %s', a)
if not actions_other_than_stop_found:
raise ValueError('environment needs to have actions other than stop.')
super(ActiveVisionDatasetEnv, self).__init__()
self._episode_length = episode_length
self._modality_types = set(modality_types)
self._confidence_threshold = confidence_threshold
self._output_size = output_size
self._dataset_root = dataset_root
self._worlds = worlds
self._targets = targets
self._all_graph = {}
for world in self._worlds:
with tf.gfile.Open(_get_json_path(self._dataset_root, world), 'r') as f:
file_content = f.read()
file_content = file_content.replace('.jpg', '')
io = StringIO(file_content)
self._all_graph[world] = json.load(io)
self._cur_world = ''
self._cur_image_id = ''
self._cur_graph = None # Loaded by _update_graph
self._steps_taken = 0
self._last_action_success = True
self._category_index = _init_category_index(labelmap_path)
self._category_map = dict(
[(c, i) for i, c in enumerate(self._category_index)])
self._detection_cache = {}
if not ActiveVisionDatasetEnv.cached_data:
ActiveVisionDatasetEnv.cached_data = read_cached_data(
True, self._dataset_root, segmentation_file_name, targets_file_name,
self._output_size)
cached_data = ActiveVisionDatasetEnv.cached_data
self._world_id_dict = cached_data['world_id_dict']
self._depth_images = cached_data[task_env.ModalityTypes.DEPTH]
self._semantic_segmentations = cached_data[
task_env.ModalityTypes.SEMANTIC_SEGMENTATION]
self._annotated_targets = cached_data['targets']
self._cached_imgs = cached_data[task_env.ModalityTypes.IMAGE]
self._graph_cache = {}
self._compute_distance = compute_distance
self._should_draw_detections = should_draw_detections
self._reward_collision = reward_collision
self._reward_goal_range = reward_goal_range
self._num_detection_classes = num_detection_classes
self._actions = actions
self._detection_folder_name = detection_folder_name
self._shaped_reward = shaped_reward
self._eval_init_points = None
if eval_init_points_file_name is not None:
self._eval_init_index = 0
init_points_path = os.path.join(self._dataset_root, 'Meta',
eval_init_points_file_name + '.npy')
with tf.gfile.Open(init_points_path) as points_file:
data = np.load(points_file).item()
self._eval_init_points = []
for world in self._worlds:
for goal in self._targets:
if world in self._annotated_targets[goal]:
for image_id in data[world]:
self._eval_init_points.append((world, image_id[0], goal))
logging.info('loaded %d eval init points', len(self._eval_init_points))
self.action_space = gym.spaces.Discrete(len(self._actions))
obs_shapes = {}
if task_env.ModalityTypes.SEMANTIC_SEGMENTATION in self._modality_types:
obs_shapes[task_env.ModalityTypes.SEMANTIC_SEGMENTATION] = gym.spaces.Box(
low=0, high=255, shape=(self._output_size, self._output_size, 1))
if task_env.ModalityTypes.OBJECT_DETECTION in self._modality_types:
obs_shapes[task_env.ModalityTypes.OBJECT_DETECTION] = gym.spaces.Box(
low=0,
high=255,
shape=(self._output_size, self._output_size,
self._num_detection_classes))
if task_env.ModalityTypes.DEPTH in self._modality_types:
obs_shapes[task_env.ModalityTypes.DEPTH] = gym.spaces.Box(
low=0,
high=_MAX_DEPTH_VALUE,
shape=(self._output_size, self._output_size, 2))
if task_env.ModalityTypes.IMAGE in self._modality_types:
obs_shapes[task_env.ModalityTypes.IMAGE] = gym.spaces.Box(
low=0, high=255, shape=(self._output_size, self._output_size, 3))
if task_env.ModalityTypes.GOAL in self._modality_types:
obs_shapes[task_env.ModalityTypes.GOAL] = gym.spaces.Box(
low=0, high=1., shape=(len(self._targets),))
if task_env.ModalityTypes.PREV_ACTION in self._modality_types:
obs_shapes[task_env.ModalityTypes.PREV_ACTION] = gym.spaces.Box(
low=0, high=1., shape=(len(self._actions) + 1,))
if task_env.ModalityTypes.DISTANCE in self._modality_types:
obs_shapes[task_env.ModalityTypes.DISTANCE] = gym.spaces.Box(
low=0, high=255, shape=(1,))
self.observation_space = gym.spaces.Dict(obs_shapes)
self._prev_action = np.zeros((len(self._actions) + 1), dtype=np.float32)
# Loading all the poses.
all_poses = {}
for world in self._worlds:
all_poses[world] = read_all_poses(self._dataset_root, world)
self._cached_poses = all_poses
self._vertex_to_pose = {}
self._pose_to_vertex = {}
@property
def actions(self):
"""Returns list of actions for the env."""
return self._actions
def _next_image(self, image_id, action):
"""Given the action, returns the name of the image that agent ends up in.
Args:
image_id: The image id of the current view.
action: valid actions are ['right', 'rotate_cw', 'rotate_ccw',
'forward', 'left']. Each rotation is 30 degrees.
Returns:
The image name for the next location of the agent. If the action results
in collision or it is not possible for the agent to execute that action,
returns empty string.
"""
assert action in self._actions, 'invalid action : {}'.format(action)
assert self._cur_world in self._all_graph, 'invalid world {}'.format(
self._cur_world)
assert image_id in self._all_graph[
self._cur_world], 'image_id {} is not in {}'.format(
image_id, self._cur_world)
return self._all_graph[self._cur_world][image_id][action]
def _largest_detection_for_image(self, image_id, detections_dict):
"""Assigns area of the largest box for the view with given image id.
Args:
image_id: Image id of the view.
detections_dict: Detections for the view.
"""
for cls, box, score in zip(detections_dict['detection_classes'],
detections_dict['detection_boxes'],
detections_dict['detection_scores']):
if cls not in self._targets:
continue
if score < self._confidence_threshold:
continue
ymin, xmin, ymax, xmax = box
area = (ymax - ymin) * (xmax - xmin)
if abs(area) < 1e-5:
continue
if image_id not in self._detection_area:
self._detection_area[image_id] = area
else:
self._detection_area[image_id] = max(self._detection_area[image_id],
area)
def _compute_goal_indexes(self):
"""Computes the goal indexes for the environment.
Returns:
The indexes of the goals that are closest to target categories. A vertex
is a goal vertex if the desired objects are detected in the image and the
target categories are not seen by moving forward from that vertex.
"""
for image_id in self._world_id_dict[self._cur_world]:
detections_dict = self._detection_table[image_id]
self._largest_detection_for_image(image_id, detections_dict)
goal_indexes = []
for image_id in self._world_id_dict[self._cur_world]:
if image_id not in self._detection_area:
continue
# Detection box is large enough.
if self._detection_area[image_id] < 0.01:
continue
ok = True
next_image_id = self._next_image(image_id, 'forward')
if next_image_id:
if next_image_id in self._detection_area:
ok = False
if ok:
goal_indexes.append(self._cur_graph.id_to_index[image_id])
return goal_indexes
def to_image_id(self, vid):
"""Converts vertex id to the image id.
Args:
vid: vertex id of the view.
Returns:
image id of the input vertex id.
"""
return self._cur_graph.index_to_id[vid]
def to_vertex(self, image_id):
return self._cur_graph.id_to_index[image_id]
def observation(self, view_pose):
"""Returns the observation at the given the vertex.
Args:
view_pose: pose of the view of interest.
Returns:
Observation at the given view point.
Raises:
ValueError: if the given view pose is not similar to any of the poses in
the current world.
"""
vertex = self.pose_to_vertex(view_pose)
if vertex is None:
raise ValueError('The given pose is not close enough to any of the poses'
' in the environment.')
image_id = self._cur_graph.index_to_id[vertex]
output = collections.OrderedDict()
if task_env.ModalityTypes.SEMANTIC_SEGMENTATION in self._modality_types:
output[task_env.ModalityTypes.
SEMANTIC_SEGMENTATION] = self._semantic_segmentations[
self._cur_world][image_id]
detection = None
need_det = (
task_env.ModalityTypes.OBJECT_DETECTION in self._modality_types or
(task_env.ModalityTypes.IMAGE in self._modality_types and
self._should_draw_detections))
if need_det:
detection = self._detection_table[image_id]
detection_image = generate_detection_image(
detection,
self._output_size,
self._category_map,
num_classes=self._num_detection_classes)
if task_env.ModalityTypes.OBJECT_DETECTION in self._modality_types:
output[task_env.ModalityTypes.OBJECT_DETECTION] = detection_image
if task_env.ModalityTypes.DEPTH in self._modality_types:
output[task_env.ModalityTypes.DEPTH] = self._depth_images[
self._cur_world][image_id]
if task_env.ModalityTypes.IMAGE in self._modality_types:
output_img = self._cached_imgs[self._cur_world][image_id]
if self._should_draw_detections:
output_img = output_img.copy()
_draw_detections(output_img, detection, self._category_index)
output[task_env.ModalityTypes.IMAGE] = output_img
if task_env.ModalityTypes.GOAL in self._modality_types:
goal = np.zeros((len(self._targets),), dtype=np.float32)
goal[self._targets.index(self._cur_goal)] = 1.
output[task_env.ModalityTypes.GOAL] = goal
if task_env.ModalityTypes.PREV_ACTION in self._modality_types:
output[task_env.ModalityTypes.PREV_ACTION] = self._prev_action
if task_env.ModalityTypes.DISTANCE in self._modality_types:
output[task_env.ModalityTypes.DISTANCE] = np.asarray(
[self.gt_value(self._cur_goal, vertex)], dtype=np.float32)
return output
def _step_no_reward(self, action):
"""Performs a step in the environment with given action.
Args:
action: Action that is used to step in the environment. Action can be
string or integer. If the type is integer then it uses the ith element
from self._actions list. Otherwise, uses the string value as the action.
Returns:
observation, done, info
observation: dictionary that contains all the observations specified in
modality_types.
observation[task_env.ModalityTypes.OBJECT_DETECTION]: contains the
detection of the current view.
observation[task_env.ModalityTypes.IMAGE]: contains the
image of the current view. Note that if using the images for training,
should_load_images should be set to false.
observation[task_env.ModalityTypes.SEMANTIC_SEGMENTATION]: contains the
semantic segmentation of the current view.
observation[task_env.ModalityTypes.DEPTH]: If selected, returns the
depth map for the current view.
observation[task_env.ModalityTypes.PREV_ACTION]: If selected, returns
a numpy of (action_size + 1,). The first action_size elements indicate
the action and the last element indicates whether the previous action
was successful or not.
done: True after episode_length steps have been taken, False otherwise.
info: Empty dictionary.
Raises:
ValueError: for invalid actions.
"""
# Primarily used for gym interface.
if not isinstance(action, str):
if not self.action_space.contains(action):
raise ValueError('Not a valid action: %d', action)
action = self._actions[action]
if action not in self._actions:
raise ValueError('Not a valid action: %s', action)
action_index = self._actions.index(action)
if action == 'stop':
next_image_id = self._cur_image_id
done = True
success = True
else:
next_image_id = self._next_image(self._cur_image_id, action)
self._steps_taken += 1
done = False
success = True
if not next_image_id:
success = False
else:
self._cur_image_id = next_image_id
if self._steps_taken >= self._episode_length:
done = True
cur_vertex = self._cur_graph.id_to_index[self._cur_image_id]
observation = self.observation(self.vertex_to_pose(cur_vertex))
# Concatenation of one-hot prev action + a binary number for success of
# previous actions.
self._prev_action = np.zeros((len(self._actions) + 1,), dtype=np.float32)
self._prev_action[action_index] = 1.
self._prev_action[-1] = float(success)
distance_to_goal = self.gt_value(self._cur_goal, cur_vertex)
if success:
if distance_to_goal <= self._reward_goal_range:
done = True
return observation, done, {'success': success}
@property
def graph(self):
return self._cur_graph.graph
def state(self):
return self.vertex_to_pose(self.to_vertex(self._cur_image_id))
def gt_value(self, goal, v):
"""Computes the distance to the goal from vertex v.
Args:
goal: name of the goal.
v: vertex id.
Returns:
Minimum number of steps to the given goal.
"""
assert goal in self._cur_graph.distance_to_goal, 'goal: {}'.format(goal)
assert v in self._cur_graph.distance_to_goal[goal]
res = self._cur_graph.distance_to_goal[goal][v]
return res
def _update_graph(self):
"""Creates the graph for each environment and updates the _cur_graph."""
if self._cur_world not in self._graph_cache:
graph = nx.DiGraph()
id_to_index = {}
index_to_id = {}
image_list = self._world_id_dict[self._cur_world]
for i, image_id in enumerate(image_list):
id_to_index[image_id] = i
index_to_id[i] = image_id
graph.add_node(i)
for image_id in image_list:
for action in self._actions:
if action == 'stop':
continue
next_image = self._all_graph[self._cur_world][image_id][action]
if next_image:
graph.add_edge(
id_to_index[image_id], id_to_index[next_image], action=action)
target_indexes = {}
number_of_nodes_without_targets = graph.number_of_nodes()
distance_to_goal = {}
for goal in self._targets:
if self._cur_world not in self._annotated_targets[goal]:
continue
goal_indexes = [
id_to_index[i]
for i in self._annotated_targets[goal][self._cur_world]
if i
]
super_source_index = graph.number_of_nodes()
target_indexes[goal] = super_source_index
graph.add_node(super_source_index)
index_to_id[super_source_index] = goal
id_to_index[goal] = super_source_index
for v in goal_indexes:
graph.add_edge(v, super_source_index, action='stop')
graph.add_edge(super_source_index, v, action='stop')
distance_to_goal[goal] = {}
for v in range(number_of_nodes_without_targets):
distance_to_goal[goal][v] = len(
nx.shortest_path(graph, v, super_source_index)) - 2
self._graph_cache[self._cur_world] = _Graph(
graph, id_to_index, index_to_id, target_indexes, distance_to_goal)
self._cur_graph = self._graph_cache[self._cur_world]
def reset_for_eval(self, new_world, new_goal, new_image_id):
"""Resets to the given goal and image_id."""
return self._reset_env(new_world=new_world, new_goal=new_goal, new_image_id=new_image_id)
def get_init_config(self, path):
"""Exposes the initial state of the agent for the given path.
Args:
path: sequences of the vertexes that the agent moves.
Returns:
image_id of the first view, world, and the goal.
"""
return self._cur_graph.index_to_id[path[0]], self._cur_world, self._cur_goal
def _reset_env(
self,
new_world=None,
new_goal=None,
new_image_id=None,
):
"""Resets the agent in a random world and random id.
Args:
new_world: If not None, sets the new world to new_world.
new_goal: If not None, sets the new goal to new_goal.
new_image_id: If not None, sets the first image id to new_image_id.
Returns:
observation: dictionary of the observations. Content of the observation
is similar to that of the step function.
Raises:
ValueError: if it can't find a world and annotated goal.
"""
self._steps_taken = 0
# The first prev_action is special all zero vector + success=1.
self._prev_action = np.zeros((len(self._actions) + 1,), dtype=np.float32)
self._prev_action[len(self._actions)] = 1.
if self._eval_init_points is not None:
if self._eval_init_index >= len(self._eval_init_points):
self._eval_init_index = 0
a = self._eval_init_points[self._eval_init_index]
self._cur_world, self._cur_image_id, self._cur_goal = a
self._eval_init_index += 1
elif not new_world:
attempts = 100
found = False
while attempts >= 0:
attempts -= 1
self._cur_goal = np.random.choice(self._targets)
available_worlds = list(
set(self._annotated_targets[self._cur_goal].keys()).intersection(
set(self._worlds)))
if available_worlds:
found = True
break
if not found:
raise ValueError('could not find a world that has a target annotated')
self._cur_world = np.random.choice(available_worlds)
else:
self._cur_world = new_world
self._cur_goal = new_goal
if new_world not in self._annotated_targets[new_goal]:
return None
self._cur_goal_index = self._targets.index(self._cur_goal)
if new_image_id:
self._cur_image_id = new_image_id
else:
self._cur_image_id = np.random.choice(
self._world_id_dict[self._cur_world])
if self._cur_world not in self._detection_cache:
with tf.gfile.Open(
_get_detection_path(self._dataset_root, self._detection_folder_name,
self._cur_world)) as f:
# Each file contains a dictionary with image ids as keys and detection
# dicts as values.
self._detection_cache[self._cur_world] = np.load(f).item()
self._detection_table = self._detection_cache[self._cur_world]
self._detection_area = {}
self._update_graph()
if self._cur_world not in self._vertex_to_pose:
# Adding fake poses for the super nodes of each target category.
self._vertex_to_pose[self._cur_world] = {
index: (-index,) for index in self._cur_graph.target_indexes.values()
}
# Calling vertex_to_pose for each vertex results in filling out the
# dictionaries that contain pose related data.
for image_id in self._world_id_dict[self._cur_world]:
self.vertex_to_pose(self.to_vertex(image_id))
# Filling out pose_to_vertex from vertex_to_pose.
self._pose_to_vertex[self._cur_world] = {
tuple(v): k
for k, v in self._vertex_to_pose[self._cur_world].iteritems()
}
cur_vertex = self._cur_graph.id_to_index[self._cur_image_id]
observation = self.observation(self.vertex_to_pose(cur_vertex))
return observation
def cur_vertex(self):
return self._cur_graph.id_to_index[self._cur_image_id]
def cur_image_id(self):
return self._cur_image_id
def path_to_goal(self, image_id=None):
"""Returns the path from image_id to the self._cur_goal.
Args:
image_id: If set to None, computes the path from the current view.
Otherwise, sets the current view to the given image_id.
Returns:
The path to the goal.
Raises:
Exception if there's no path from the view to the goal.
"""
if image_id is None:
image_id = self._cur_image_id
super_source = self._cur_graph.target_indexes[self._cur_goal]
try:
path = nx.shortest_path(self._cur_graph.graph,
self._cur_graph.id_to_index[image_id],
super_source)
except:
print 'path not found, image_id = ', self._cur_world, self._cur_image_id
raise
return path[:-1]
def targets(self):
return [self.vertex_to_pose(self._cur_graph.target_indexes[self._cur_goal])]
def vertex_to_pose(self, v):
"""Returns pose of the view for a given vertex.
Args:
v: integer, vertex index.
Returns:
(x, z, dir_x, dir_z) where x and z are the translation and dir_x, dir_z are
a vector giving the direction of the view.
"""
if v in self._vertex_to_pose[self._cur_world]:
return np.copy(self._vertex_to_pose[self._cur_world][v])
x, z, rot, scale = self._cached_poses[self._cur_world][self.to_image_id(
v)]
if rot is None: # if rotation is not provided for the given vertex.
self._vertex_to_pose[self._cur_world][v] = np.asarray(
[x * scale, z * scale, v])
return np.copy(self._vertex_to_pose[self._cur_world][v])
# Multiply rotation matrix by [0,0,1] to get a vector of length 1 in the
# direction of the ray.
direction = np.zeros((3, 1), dtype=np.float32)
direction[2][0] = 1
direction = np.matmul(np.transpose(rot), direction)
direction = [direction[0][0], direction[2][0]]
self._vertex_to_pose[self._cur_world][v] = np.asarray(
[x * scale, z * scale, direction[0], direction[1]])
return np.copy(self._vertex_to_pose[self._cur_world][v])
def pose_to_vertex(self, pose):
"""Returns the vertex id for the given pose."""
if tuple(pose) not in self._pose_to_vertex[self._cur_world]:
raise ValueError(
'The given pose is not present in the dictionary: {}'.format(
tuple(pose)))
return self._pose_to_vertex[self._cur_world][tuple(pose)]
def check_scene_graph(self, world, goal):
"""Checks the connectivity of the scene graph.
Goes over all the views and computes the shortest path to the goal. If it
crashes, it means that the graph is not connected. Otherwise, the env graph is fine.
Args:
world: the string name of the world.
goal: the string label for the goal.
Returns:
Nothing.
"""
obs = self._reset_env(new_world=world, new_goal=goal)
if not obs:
print '{} is not available in {}'.format(goal, world)
return True
for image_id in self._world_id_dict[self._cur_world]:
      print('check image_id = {}'.format(image_id))
self._cur_image_id = image_id
path = self.path_to_goal()
actions = []
for i in range(len(path) - 2):
actions.append(self.action(path[i], path[i + 1]))
actions.append('stop')
@property
def goal_one_hot(self):
res = np.zeros((len(self._targets),), dtype=np.float32)
res[self._cur_goal_index] = 1.
return res
@property
def goal_index(self):
return self._cur_goal_index
@property
def goal_string(self):
return self._cur_goal
@property
def worlds(self):
return self._worlds
@property
def possible_targets(self):
return self._targets
def action(self, from_pose, to_pose):
"""Returns the action that takes source vertex to destination vertex.
Args:
from_pose: pose of the source.
to_pose: pose of the destination.
Returns:
      The index of the action.
    Raises:
      ValueError: if it is not possible to go from the first vertex to the
        second vertex with a single action.
"""
from_index = self.pose_to_vertex(from_pose)
to_index = self.pose_to_vertex(to_pose)
if to_index not in self.graph[from_index]:
from_image_id = self.to_image_id(from_index)
to_image_id = self.to_image_id(to_index)
raise ValueError('{},{} is not connected to {},{}'.format(
from_index, from_image_id, to_index, to_image_id))
return self._actions.index(self.graph[from_index][to_index]['action'])
def random_step_sequence(self, min_len=None, max_len=None):
"""Generates random step sequence that takes agent to the goal.
Args:
min_len: integer, minimum length of a step sequence. Not yet implemented.
      max_len: integer, the maximum number of steps, and therefore the maximum
        length of the returned path and observations. Must be set.
Returns:
Tuple of (path, actions, states, step_outputs).
path: a random path from a random starting point and random environment.
actions: actions of the returned path.
states: viewpoints of all the states in between.
step_outputs: list of step() return tuples.
Raises:
      ValueError: if max_len is None or less than 1, or if min_len is not
        None.
"""
if max_len is None:
      raise ValueError('max_len cannot be None.')
if max_len < 1:
      raise ValueError('max_len must be greater than or equal to 1.')
if min_len is not None:
raise ValueError('min_len is not yet implemented.')
path = []
actions = []
states = []
step_outputs = []
obs = self.reset()
last_obs_tuple = [obs, 0, False, {}]
for _ in xrange(max_len):
action = np.random.choice(self._actions)
# We don't want to sample stop action because stop does not add new
# information.
while action == 'stop':
action = np.random.choice(self._actions)
path.append(self.to_vertex(self._cur_image_id))
onehot = np.zeros((len(self._actions),), dtype=np.float32)
onehot[self._actions.index(action)] = 1.
actions.append(onehot)
states.append(self.vertex_to_pose(path[-1]))
step_outputs.append(copy.deepcopy(last_obs_tuple))
last_obs_tuple = self.step(action)
return path, actions, states, step_outputs
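# A usage sketch (assumed, not part of the original class): draw a short random
# rollout, e.g. for collecting supervised training sequences.
#
#   path, actions, states, step_outputs = env.random_step_sequence(max_len=20)
#   # path[i] is a graph vertex, actions[i] a one-hot action vector, states[i]
#   # the corresponding pose, and step_outputs[i] the (obs, reward, done, info)
#   # tuple observed before taking actions[i].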
#-*-Python-*-
ActiveVisionDatasetEnv.episode_length = 200
ActiveVisionDatasetEnv.actions = [
'right', 'rotate_cw', 'rotate_ccw', 'forward', 'left', 'backward', 'stop'
]
ActiveVisionDatasetEnv.confidence_threshold = 0.5
ActiveVisionDatasetEnv.output_size = 64
ActiveVisionDatasetEnv.worlds = [
'Home_001_1', 'Home_001_2', 'Home_002_1', 'Home_003_1', 'Home_003_2',
'Home_004_1', 'Home_004_2', 'Home_005_1', 'Home_005_2', 'Home_006_1',
'Home_007_1', 'Home_010_1', 'Home_011_1', 'Home_013_1', 'Home_014_1',
'Home_014_2', 'Home_015_1', 'Home_016_1'
]
ActiveVisionDatasetEnv.targets = [
'tv', 'dining_table', 'fridge', 'microwave', 'couch'
]
ActiveVisionDatasetEnv.compute_distance = False
ActiveVisionDatasetEnv.should_draw_detections = False
ActiveVisionDatasetEnv.dataset_root = '/usr/local/google/home/kosecka/AVD_Minimal/'
ActiveVisionDatasetEnv.labelmap_path = 'label_map.txt'
ActiveVisionDatasetEnv.reward_collision = 0
ActiveVisionDatasetEnv.reward_goal_range = 2
ActiveVisionDatasetEnv.num_detection_classes = 90
ActiveVisionDatasetEnv.segmentation_file_name='sseg_crf'
ActiveVisionDatasetEnv.detection_folder_name='Detections'
ActiveVisionDatasetEnv.targets_file_name='annotated_targets'
ActiveVisionDatasetEnv.shaped_reward=False
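# A usage sketch (assumed, not part of the original config): these bindings are
# typically consumed by parsing this file before the environment is built, e.g.
#   gin.parse_config_file('envs/configs/active_vision_config.gin')
#   env = ActiveVisionDatasetEnv()
# Note that dataset_root above is a placeholder and should point to the local
# AVD_Minimal download ($AVD_DIR).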
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""An interface representing the topology of an environment.
Allows for high level planning and high level instruction generation for
navigation tasks.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import abc
import enum
import gym
import gin
@gin.config.constants_from_enum
class ModalityTypes(enum.Enum):
"""Types of the modalities that can be used."""
IMAGE = 0
SEMANTIC_SEGMENTATION = 1
OBJECT_DETECTION = 2
DEPTH = 3
GOAL = 4
PREV_ACTION = 5
PREV_SUCCESS = 6
STATE = 7
DISTANCE = 8
CAN_STEP = 9
def __lt__(self, other):
if self.__class__ is other.__class__:
return self.value < other.value
return NotImplemented
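# The ordering defined by __lt__ above makes modality keys sortable, which
# gives a deterministic concatenation order across modalities. A small sketch
# of assumed usage:
#   sorted([ModalityTypes.PREV_ACTION, ModalityTypes.GOAL, ModalityTypes.IMAGE])
#   # -> [IMAGE, GOAL, PREV_ACTION], i.e. ordered by enum value.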
class TaskEnvInterface(object):
"""Interface for an environment topology.
An environment can implement this interface if there is a topological graph
underlying this environment. All paths below are defined as paths in this
graph. Using path_to_actions function one can translate a topological path
to a geometric path in the environment.
"""
__metaclass__ = abc.ABCMeta
@abc.abstractmethod
def random_step_sequence(self, min_len=None, max_len=None):
"""Generates a random sequence of actions and executes them.
Args:
min_len: integer, minimum length of a step sequence.
      max_len: integer, if set to a non-None value, the method returns only the
        first max_len steps of a random sequence. If the environment is
        computationally heavy, this argument should be set to speed up training
        and avoid unnecessary computation by the environment.
Returns:
A path, defined as a list of vertex indices, a list of actions, a list of
states, and a list of step() return tuples.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@abc.abstractmethod
def targets(self):
"""A list of targets in the environment.
Returns:
A list of target locations.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@abc.abstractproperty
def state(self):
"""Returns the position for the current location of agent."""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@abc.abstractproperty
def graph(self):
"""Returns a graph representing the environment topology.
Returns:
nx.Graph object.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@abc.abstractmethod
def vertex_to_pose(self, vertex_index):
"""Maps a vertex index to a pose in the environment.
Pose of the camera can be represented by (x,y,theta) or (x,y,z,theta).
Args:
vertex_index: index of a vertex in the topology graph.
Returns:
A np.array of floats of size 3 or 4 representing the pose of the vertex.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@abc.abstractmethod
def pose_to_vertex(self, pose):
"""Maps a coordinate in the maze to the closest vertex in topology graph.
Args:
      pose: np.array of floats containing the pose of the view.
Returns:
index of a vertex.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@abc.abstractmethod
def observation(self, state):
"""Returns observation at location xy and orientation theta.
Args:
state: a np.array of floats containing coordinates of a location and
orientation.
Returns:
Dictionary of observations in the case of multiple observations.
The keys are the modality names and the values are the np.array of float
of observations for corresponding modality.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
def action(self, init_state, final_state):
"""Computes the transition action from state1 to state2.
If the environment is discrete and the views are not adjacent in the
environment. i.e. it is not possible to move from the first view to the
second view with one action it should return None. In the continuous case,
it will be the continuous difference of first view and second view.
Args:
init_state: numpy array, the initial view of the agent.
final_state: numpy array, the final view of the agent.
"""
raise NotImplementedError(
'Needs implementation as part of EnvTopology interface.')
@gin.configurable
class TaskEnv(gym.Env, TaskEnvInterface):
"""An environment which uses a Task to compute reward.
  The environment implements a gym interface, as well as EnvTopology. The
  former makes sure it can be used within RL training, while the latter makes
  sure it can be used by a Task.
This environment requires _step_no_reward to be implemented, which steps
through it but does not return reward. Instead, the reward calculation is
delegated to the Task object, which in return can access needed properties
of the environment. These properties are exposed via the EnvTopology
interface.
"""
def __init__(self, task=None):
self._task = task
def set_task(self, task):
self._task = task
@abc.abstractmethod
def _step_no_reward(self, action):
"""Same as _step without returning reward.
Args:
action: see _step.
Returns:
state, done, info as defined in _step.
"""
raise NotImplementedError('Implement step.')
@abc.abstractmethod
def _reset_env(self):
"""Resets the environment. Returns initial observation."""
raise NotImplementedError('Implement _reset. Must call super!')
def step(self, action):
obs, done, info = self._step_no_reward(action)
reward = 0.0
if self._task is not None:
obs, reward, done, info = self._task.reward(obs, done, info)
return obs, reward, done, info
def reset(self):
"""Resets the environment. Gym API."""
obs = self._reset_env()
if self._task is not None:
self._task.reset(obs)
return obs
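# A minimal sketch (assumed, not part of the original module) of how a concrete
# environment plugs into this interface: the subclass only advances the
# simulation, while the reward comes from the Task passed via set_task(). The
# names MyDiscreteEnv, _simulate and _initial_observation are hypothetical.
#
#   class MyDiscreteEnv(TaskEnv):
#
#     def _step_no_reward(self, action):
#       obs = self._simulate(action)
#       return obs, False, {}
#
#     def _reset_env(self):
#       return self._initial_observation()
#
#   env = MyDiscreteEnv(task=my_task)
#   obs, reward, done, info = env.step(0)  # reward is computed by my_task.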
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A module with utility functions.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
def trajectory_to_deltas(trajectory, state):
"""Computes a sequence of deltas of a state to traverse a trajectory in 2D.
The initial state of the agent contains its pose -- location in 2D and
orientation. When the computed deltas are incrementally added to it, it
traverses the specified trajectory while keeping its orientation parallel to
the trajectory.
Args:
trajectory: a np.array of floats of shape n x 2. The n-th row contains the
n-th point.
state: a 3 element np.array of floats containing agent's location and
orientation in radians.
Returns:
A np.array of floats of size n x 3.
"""
state = np.reshape(state, [-1])
init_xy = state[0:2]
init_theta = state[2]
delta_xy = trajectory - np.concatenate(
[np.reshape(init_xy, [1, 2]), trajectory[:-1, :]], axis=0)
thetas = np.reshape(np.arctan2(delta_xy[:, 1], delta_xy[:, 0]), [-1, 1])
thetas = np.concatenate([np.reshape(init_theta, [1, 1]), thetas], axis=0)
delta_thetas = thetas[1:] - thetas[:-1]
deltas = np.concatenate([delta_xy, delta_thetas], axis=1)
return deltas
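# A minimal numeric check (a sketch, not part of the original module): an agent
# at the origin facing along +x traverses two waypoints; the returned deltas
# move it one unit forward, then one unit left with a 90 degree turn.
if __name__ == '__main__':
  example_trajectory = np.array([[1.0, 0.0], [1.0, 1.0]], dtype=np.float32)
  example_state = np.array([0.0, 0.0, 0.0], dtype=np.float32)
  example_deltas = trajectory_to_deltas(example_trajectory, example_state)
  # Expected (up to float precision): [[1., 0., 0.], [0., 1., np.pi / 2]].
  print(example_deltas)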
item {
name: "/m/01g317"
id: 1
display_name: "person"
}
item {
name: "/m/0199g"
id: 2
display_name: "bicycle"
}
item {
name: "/m/0k4j"
id: 3
display_name: "car"
}
item {
name: "/m/04_sv"
id: 4
display_name: "motorcycle"
}
item {
name: "/m/05czz6l"
id: 5
display_name: "airplane"
}
item {
name: "/m/01bjv"
id: 6
display_name: "bus"
}
item {
name: "/m/07jdr"
id: 7
display_name: "train"
}
item {
name: "/m/07r04"
id: 8
display_name: "truck"
}
item {
name: "/m/019jd"
id: 9
display_name: "boat"
}
item {
name: "/m/015qff"
id: 10
display_name: "traffic light"
}
item {
name: "/m/01pns0"
id: 11
display_name: "fire hydrant"
}
item {
name: "/m/02pv19"
id: 13
display_name: "stop sign"
}
item {
name: "/m/015qbp"
id: 14
display_name: "parking meter"
}
item {
name: "/m/0cvnqh"
id: 15
display_name: "bench"
}
item {
name: "/m/015p6"
id: 16
display_name: "bird"
}
item {
name: "/m/01yrx"
id: 17
display_name: "cat"
}
item {
name: "/m/0bt9lr"
id: 18
display_name: "dog"
}
item {
name: "/m/03k3r"
id: 19
display_name: "horse"
}
item {
name: "/m/07bgp"
id: 20
display_name: "sheep"
}
item {
name: "/m/01xq0k1"
id: 21
display_name: "cow"
}
item {
name: "/m/0bwd_0j"
id: 22
display_name: "elephant"
}
item {
name: "/m/01dws"
id: 23
display_name: "bear"
}
item {
name: "/m/0898b"
id: 24
display_name: "zebra"
}
item {
name: "/m/03bk1"
id: 25
display_name: "giraffe"
}
item {
name: "/m/01940j"
id: 27
display_name: "backpack"
}
item {
name: "/m/0hnnb"
id: 28
display_name: "umbrella"
}
item {
name: "/m/080hkjn"
id: 31
display_name: "handbag"
}
item {
name: "/m/01rkbr"
id: 32
display_name: "tie"
}
item {
name: "/m/01s55n"
id: 33
display_name: "suitcase"
}
item {
name: "/m/02wmf"
id: 34
display_name: "frisbee"
}
item {
name: "/m/071p9"
id: 35
display_name: "skis"
}
item {
name: "/m/06__v"
id: 36
display_name: "snowboard"
}
item {
name: "/m/018xm"
id: 37
display_name: "sports ball"
}
item {
name: "/m/02zt3"
id: 38
display_name: "kite"
}
item {
name: "/m/03g8mr"
id: 39
display_name: "baseball bat"
}
item {
name: "/m/03grzl"
id: 40
display_name: "baseball glove"
}
item {
name: "/m/06_fw"
id: 41
display_name: "skateboard"
}
item {
name: "/m/019w40"
id: 42
display_name: "surfboard"
}
item {
name: "/m/0dv9c"
id: 43
display_name: "tennis racket"
}
item {
name: "/m/04dr76w"
id: 44
display_name: "bottle"
}
item {
name: "/m/09tvcd"
id: 46
display_name: "wine glass"
}
item {
name: "/m/08gqpm"
id: 47
display_name: "cup"
}
item {
name: "/m/0dt3t"
id: 48
display_name: "fork"
}
item {
name: "/m/04ctx"
id: 49
display_name: "knife"
}
item {
name: "/m/0cmx8"
id: 50
display_name: "spoon"
}
item {
name: "/m/04kkgm"
id: 51
display_name: "bowl"
}
item {
name: "/m/09qck"
id: 52
display_name: "banana"
}
item {
name: "/m/014j1m"
id: 53
display_name: "apple"
}
item {
name: "/m/0l515"
id: 54
display_name: "sandwich"
}
item {
name: "/m/0cyhj_"
id: 55
display_name: "orange"
}
item {
name: "/m/0hkxq"
id: 56
display_name: "broccoli"
}
item {
name: "/m/0fj52s"
id: 57
display_name: "carrot"
}
item {
name: "/m/01b9xk"
id: 58
display_name: "hot dog"
}
item {
name: "/m/0663v"
id: 59
display_name: "pizza"
}
item {
name: "/m/0jy4k"
id: 60
display_name: "donut"
}
item {
name: "/m/0fszt"
id: 61
display_name: "cake"
}
item {
name: "/m/01mzpv"
id: 62
display_name: "chair"
}
item {
name: "/m/02crq1"
id: 63
display_name: "couch"
}
item {
name: "/m/03fp41"
id: 64
display_name: "potted plant"
}
item {
name: "/m/03ssj5"
id: 65
display_name: "bed"
}
item {
name: "/m/04bcr3"
id: 67
display_name: "dining table"
}
item {
name: "/m/09g1w"
id: 70
display_name: "toilet"
}
item {
name: "/m/07c52"
id: 72
display_name: "tv"
}
item {
name: "/m/01c648"
id: 73
display_name: "laptop"
}
item {
name: "/m/020lf"
id: 74
display_name: "mouse"
}
item {
name: "/m/0qjjc"
id: 75
display_name: "remote"
}
item {
name: "/m/01m2v"
id: 76
display_name: "keyboard"
}
item {
name: "/m/050k8"
id: 77
display_name: "cell phone"
}
item {
name: "/m/0fx9l"
id: 78
display_name: "microwave"
}
item {
name: "/m/029bxz"
id: 79
display_name: "oven"
}
item {
name: "/m/01k6s3"
id: 80
display_name: "toaster"
}
item {
name: "/m/0130jx"
id: 81
display_name: "sink"
}
item {
name: "/m/040b_t"
id: 82
display_name: "refrigerator"
}
item {
name: "/m/0bt_c3"
id: 84
display_name: "book"
}
item {
name: "/m/01x3z"
id: 85
display_name: "clock"
}
item {
name: "/m/02s195"
id: 86
display_name: "vase"
}
item {
name: "/m/01lsmm"
id: 87
display_name: "scissors"
}
item {
name: "/m/0kmg4"
id: 88
display_name: "teddy bear"
}
item {
name: "/m/03wvsk"
id: 89
display_name: "hair drier"
}
item {
name: "/m/012xff"
id: 90
display_name: "toothbrush"
}
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Label map utility functions."""
import logging
import tensorflow as tf
from google.protobuf import text_format
import string_int_label_map_pb2
def _validate_label_map(label_map):
"""Checks if a label map is valid.
Args:
label_map: StringIntLabelMap to validate.
Raises:
ValueError: if label map is invalid.
"""
for item in label_map.item:
if item.id < 0:
raise ValueError('Label map ids should be >= 0.')
if (item.id == 0 and item.name != 'background' and
item.display_name != 'background'):
raise ValueError('Label map id 0 is reserved for the background label')
def create_category_index(categories):
"""Creates dictionary of COCO compatible categories keyed by category id.
Args:
categories: a list of dicts, each of which has the following keys:
'id': (required) an integer id uniquely identifying this category.
'name': (required) string representing category name
e.g., 'cat', 'dog', 'pizza'.
Returns:
category_index: a dict containing the same entries as categories, but keyed
by the 'id' field of each category.
"""
category_index = {}
for cat in categories:
category_index[cat['id']] = cat
return category_index
def get_max_label_map_index(label_map):
"""Get maximum index in label map.
Args:
label_map: a StringIntLabelMapProto
Returns:
an integer
"""
return max([item.id for item in label_map.item])
def convert_label_map_to_categories(label_map,
max_num_classes,
use_display_name=True):
"""Loads label map proto and returns categories list compatible with eval.
This function loads a label map and returns a list of dicts, each of which
has the following keys:
'id': (required) an integer id uniquely identifying this category.
'name': (required) string representing category name
e.g., 'cat', 'dog', 'pizza'.
  We only allow a class into the list if its id-label_id_offset is
between 0 (inclusive) and max_num_classes (exclusive).
If there are several items mapping to the same id in the label map,
we will only keep the first one in the categories list.
Args:
label_map: a StringIntLabelMapProto or None. If None, a default categories
list is created with max_num_classes categories.
max_num_classes: maximum number of (consecutive) label indices to include.
use_display_name: (boolean) choose whether to load 'display_name' field
as category name. If False or if the display_name field does not exist,
uses 'name' field as category names instead.
Returns:
categories: a list of dictionaries representing all possible categories.
"""
categories = []
list_of_ids_already_added = []
if not label_map:
label_id_offset = 1
for class_id in range(max_num_classes):
categories.append({
'id': class_id + label_id_offset,
'name': 'category_{}'.format(class_id + label_id_offset)
})
return categories
for item in label_map.item:
if not 0 < item.id <= max_num_classes:
logging.info('Ignore item %d since it falls outside of requested '
'label range.', item.id)
continue
if use_display_name and item.HasField('display_name'):
name = item.display_name
else:
name = item.name
if item.id not in list_of_ids_already_added:
list_of_ids_already_added.append(item.id)
categories.append({'id': item.id, 'name': name})
return categories
def load_labelmap(path):
"""Loads label map proto.
Args:
path: path to StringIntLabelMap proto text file.
Returns:
a StringIntLabelMapProto
"""
with tf.gfile.GFile(path, 'r') as fid:
label_map_string = fid.read()
label_map = string_int_label_map_pb2.StringIntLabelMap()
try:
text_format.Merge(label_map_string, label_map)
except text_format.ParseError:
label_map.ParseFromString(label_map_string)
_validate_label_map(label_map)
return label_map
def get_label_map_dict(label_map_path, use_display_name=False):
"""Reads a label map and returns a dictionary of label names to id.
Args:
label_map_path: path to label_map.
use_display_name: whether to use the label map items' display names as keys.
Returns:
A dictionary mapping label names to id.
"""
label_map = load_labelmap(label_map_path)
label_map_dict = {}
for item in label_map.item:
if use_display_name:
label_map_dict[item.display_name] = item.id
else:
label_map_dict[item.name] = item.id
return label_map_dict
def create_category_index_from_labelmap(label_map_path):
"""Reads a label map and returns a category index.
Args:
label_map_path: Path to `StringIntLabelMap` proto text file.
Returns:
A category index, which is a dictionary that maps integer ids to dicts
containing categories, e.g.
{1: {'id': 1, 'name': 'dog'}, 2: {'id': 2, 'name': 'cat'}, ...}
"""
label_map = load_labelmap(label_map_path)
max_num_classes = max(item.id for item in label_map.item)
categories = convert_label_map_to_categories(label_map, max_num_classes)
return create_category_index(categories)
def create_class_agnostic_category_index():
"""Creates a category index with a single `object` class."""
return {1: {'id': 1, 'name': 'object'}}
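# A usage sketch (assumed, not part of the original module): map the COCO
# display names in label_map.txt (the file shown in this package and referenced
# by the gin config) to their integer ids. Assumes label_map.txt is in the
# current working directory.
if __name__ == '__main__':
  coco_name_to_id = get_label_map_dict('label_map.txt', use_display_name=True)
  print('tv -> {}, refrigerator -> {}'.format(
      coco_name_to_id['tv'], coco_name_to_id['refrigerator']))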
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Interface for the policy of the agents use for navigation."""
import abc
import tensorflow as tf
from absl import logging
import embedders
from envs import task_env
slim = tf.contrib.slim
def _print_debug_ios(history, goal, output):
"""Prints sizes of history, goal and outputs."""
if history is not None:
shape = history.get_shape().as_list()
# logging.info('history embedding shape ')
# logging.info(shape)
if len(shape) != 3:
raise ValueError('history Tensor must have rank=3')
if goal is not None:
logging.info('goal embedding shape ')
logging.info(goal.get_shape().as_list())
if output is not None:
logging.info('targets shape ')
logging.info(output.get_shape().as_list())
class Policy(object):
"""Represents the policy of the agent for navigation tasks.
Instantiates a policy that takes embedders for each modality and builds a
model to infer the actions.
"""
__metaclass__ = abc.ABCMeta
def __init__(self, embedders_dict, action_size):
"""Instantiates the policy.
Args:
embedders_dict: Dictionary of embedders for different modalities. Keys
should be identical to keys of observation modality.
action_size: Number of possible actions.
"""
self._embedders = embedders_dict
self._action_size = action_size
@abc.abstractmethod
def build(self, observations, prev_state):
"""Builds the model that represents the policy of the agent.
Args:
observations: Dictionary of observations from different modalities. Keys
are the name of the modalities.
prev_state: The tensor of the previous state of the model. Should be set
        to None if the policy is stateless.
Returns:
Tuple of (action, state) where action is the action logits and state is
the state of the model after taking new observation.
"""
raise NotImplementedError(
'Needs implementation as part of Policy interface')
class LSTMPolicy(Policy):
"""Represents the implementation of the LSTM based policy.
  The architecture of the model is as follows. It embeds all the observations
  using the embedders and concatenates the embeddings of all the modalities.
  The concatenated features are fed through two fully connected layers. The
  LSTM takes the output of the fully connected layers together with the
  previous action and whether it was successful. The value for each action is
  predicted afterwards.
Although the class name has the word LSTM in it, it also supports a mode that
builds the network without LSTM just for comparison purposes.
"""
def __init__(self,
modality_names,
embedders_dict,
action_size,
params,
max_episode_length,
feedforward_mode=False):
"""Instantiates the LSTM policy.
Args:
modality_names: List of modality names. Makes sure the ordering in
concatenation remains the same as modality_names list. Each modality
needs to be in the embedders_dict.
embedders_dict: Dictionary of embedders for different modalities. Keys
should be identical to keys of observation modality. Values should be
instance of Embedder class. All the observations except PREV_ACTION
requires embedder.
action_size: Number of possible actions.
      params: an instance of tf.HParams containing the hyperparameters for the
        policy network.
max_episode_length: integer, specifying the maximum length of each
episode.
feedforward_mode: If True, it does not add LSTM to the model. It should
only be set True for comparison between LSTM and feedforward models.
"""
super(LSTMPolicy, self).__init__(embedders_dict, action_size)
self._modality_names = modality_names
self._lstm_state_size = params.lstm_state_size
self._fc_channels = params.fc_channels
self._weight_decay = params.weight_decay
self._target_embedding_size = params.target_embedding_size
self._max_episode_length = max_episode_length
self._feedforward_mode = feedforward_mode
def _build_lstm(self, encoded_inputs, prev_state, episode_length,
prev_action=None):
"""Builds an LSTM on top of the encoded inputs.
If prev_action is not None then it concatenates them to the input of LSTM.
Args:
encoded_inputs: The embedding of the observations and goal.
prev_state: previous state of LSTM.
episode_length: The tensor that contains the length of the sequence for
each element of the batch.
      prev_action: tensor of the previously chosen action plus an additional
        bit indicating whether the previous action was successful.
Returns:
a tuple of (lstm output, lstm state).
"""
# Adding prev action and success in addition to the embeddings of the
# modalities.
if prev_action is not None:
encoded_inputs = tf.concat([encoded_inputs, prev_action], axis=-1)
with tf.variable_scope('LSTM'):
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(self._lstm_state_size)
if prev_state is None:
# If prev state is set to None, a state of all zeros will be
# passed as a previous value for the cell. Should be used for the
# first step of each episode.
tf_prev_state = lstm_cell.zero_state(
encoded_inputs.get_shape().as_list()[0], dtype=tf.float32)
else:
tf_prev_state = tf.nn.rnn_cell.LSTMStateTuple(prev_state[0],
prev_state[1])
lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
cell=lstm_cell,
inputs=encoded_inputs,
sequence_length=episode_length,
initial_state=tf_prev_state,
dtype=tf.float32,
)
lstm_outputs = tf.reshape(lstm_outputs, [-1, lstm_cell.output_size])
return lstm_outputs, lstm_state
def build(
self,
observations,
prev_state,
):
"""Builds the model that represents the policy of the agent.
Args:
observations: Dictionary of observations from different modalities. Keys
are the name of the modalities. Observation should have the following
key-values.
observations['goal']: One-hot tensor that indicates the semantic
category of the goal. The shape should be
(batch_size x max_sequence_length x goals).
observations[task_env.ModalityTypes.PREV_ACTION]: has action_size + 1
elements where the first action_size numbers are the one hot vector
of the previous action and the last element indicates whether the
previous action was successful or not. If
task_env.ModalityTypes.PREV_ACTION is not in the observation, it
will not be used in the policy.
prev_state: Previous state of the model. It should be a tuple of (c,h)
where c and h are the previous cell value and hidden state of the lstm.
Each element of tuple has shape of (batch_size x lstm_cell_size).
If it is set to None, then it initializes the state of the lstm with all
zeros.
Returns:
Tuple of (action, state) where action is the action logits and state is
the state of the model after taking new observation.
Raises:
ValueError: If any of the modality names is not in observations or
embedders_dict.
ValueError: If 'goal' is not in the observations.
"""
for modality_name in self._modality_names:
if modality_name not in observations:
raise ValueError('modality name does not exist in observations: {} not '
'in {}'.format(modality_name, observations.keys()))
if modality_name not in self._embedders:
if modality_name == task_env.ModalityTypes.PREV_ACTION:
continue
raise ValueError('modality name does not have corresponding embedder'
' {} not in {}'.format(modality_name,
self._embedders.keys()))
if task_env.ModalityTypes.GOAL not in observations:
raise ValueError('goal should be provided in the observations')
goal = observations[task_env.ModalityTypes.GOAL]
prev_action = None
if task_env.ModalityTypes.PREV_ACTION in observations:
prev_action = observations[task_env.ModalityTypes.PREV_ACTION]
with tf.variable_scope('policy'):
with slim.arg_scope(
[slim.fully_connected],
activation_fn=tf.nn.relu,
weights_initializer=tf.truncated_normal_initializer(stddev=0.01),
weights_regularizer=slim.l2_regularizer(self._weight_decay)):
all_inputs = []
# Concatenating the embedding of each modality by applying the embedders
# to corresponding observations.
def embed(name):
with tf.variable_scope('embed_{}'.format(name)):
# logging.info('Policy uses embedding %s', name)
return self._embedders[name].build(observations[name])
all_inputs = map(embed, [
x for x in self._modality_names
if x != task_env.ModalityTypes.PREV_ACTION
])
# Computing goal embedding.
shape = goal.get_shape().as_list()
with tf.variable_scope('embed_goal'):
encoded_goal = tf.reshape(goal, [shape[0] * shape[1], -1])
encoded_goal = slim.fully_connected(encoded_goal,
self._target_embedding_size)
encoded_goal = tf.reshape(encoded_goal, [shape[0], shape[1], -1])
all_inputs.append(encoded_goal)
# Concatenating all the modalities and goal.
all_inputs = tf.concat(all_inputs, axis=-1, name='concat_embeddings')
shape = all_inputs.get_shape().as_list()
all_inputs = tf.reshape(all_inputs, [shape[0] * shape[1], shape[2]])
# Applying fully connected layers.
encoded_inputs = slim.fully_connected(all_inputs, self._fc_channels)
encoded_inputs = slim.fully_connected(encoded_inputs, self._fc_channels)
if not self._feedforward_mode:
encoded_inputs = tf.reshape(encoded_inputs,
[shape[0], shape[1], self._fc_channels])
lstm_outputs, lstm_state = self._build_lstm(
encoded_inputs=encoded_inputs,
prev_state=prev_state,
episode_length=tf.ones((shape[0],), dtype=tf.float32) *
self._max_episode_length,
prev_action=prev_action,
)
else:
          # If feedforward_mode=True, bypass the LSTM computation entirely.
lstm_outputs = encoded_inputs
lstm_outputs = slim.fully_connected(lstm_outputs, self._fc_channels)
action_values = slim.fully_connected(
lstm_outputs, self._action_size, activation_fn=None)
action_values = tf.reshape(action_values, [shape[0], shape[1], -1])
if not self._feedforward_mode:
return action_values, lstm_state
else:
return action_values, None
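# A construction sketch (assumed, not part of the original module): the policy
# is typically built with one embedder per observation modality and an HParams
# bundle matching the attributes read in __init__ above. The names
# image_embedder and observations, as well as the hyperparameter values, are
# placeholders.
#
#   hparams = tf.contrib.training.HParams(
#       lstm_state_size=512, fc_channels=512, weight_decay=0.0002,
#       target_embedding_size=32)
#   policy = LSTMPolicy(
#       modality_names=[task_env.ModalityTypes.IMAGE,
#                       task_env.ModalityTypes.PREV_ACTION],
#       embedders_dict={task_env.ModalityTypes.IMAGE: image_embedder},
#       action_size=7, params=hparams, max_episode_length=20)
#   action_logits, lstm_state = policy.build(observations, prev_state=None)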
class TaskPolicy(Policy):
"""A covenience abstract class providing functionality to deal with Tasks."""
def __init__(self,
task_config,
model_hparams=None,
embedder_hparams=None,
train_hparams=None):
"""Constructs a policy which knows how to work with tasks (see tasks.py).
    It allows reading task history, goal and outputs consistently with the
    task config.
Args:
task_config: an object of type tasks.TaskIOConfig (see tasks.py)
      model_hparams: a tf.HParams object containing parameters pertaining to
        the model (these are implementation specific)
      embedder_hparams: a tf.HParams object containing parameters pertaining to
        the history and goal embedders (these are implementation specific)
      train_hparams: a tf.HParams object containing parameters pertaining to
        training (these are implementation specific)
"""
super(TaskPolicy, self).__init__(None, None)
self._model_hparams = model_hparams
self._embedder_hparams = embedder_hparams
self._train_hparams = train_hparams
self._task_config = task_config
self._extra_train_ops = []
@property
def extra_train_ops(self):
"""Training ops in addition to the loss, e.g. batch norm updates.
Returns:
A list of tf ops.
"""
return self._extra_train_ops
def _embed_task_ios(self, streams):
"""Embeds a list of heterogenous streams.
These streams correspond to task history, goal and output. The number of
streams is equal to the total number of history, plus one for the goal if
present, plus one for the output. If the number of history is k, then the
first k streams are the history.
The used embedders depend on the input (or goal) types. If an input is an
image, then a ResNet embedder is used, otherwise
MLPEmbedder (see embedders.py).
Args:
streams: a list of Tensors.
Returns:
      Three float Tensors history, goal, output. If there is no history, or no
      goal, then the corresponding returned values are None. The shape of the
      embedded history is batch_size x sequence_length x the sum of the
      embedding dimensions of all history inputs. The shape of the goal is the
      embedding dimension.
"""
# EMBED history.
index = 0
inps = []
scopes = []
for c in self._task_config.inputs:
if c == task_env.ModalityTypes.IMAGE:
scope_name = 'image_embedder/image'
reuse = scope_name in scopes
scopes.append(scope_name)
with tf.variable_scope(scope_name, reuse=reuse):
resnet_embedder = embedders.ResNet(self._embedder_hparams.image)
image_embeddings = resnet_embedder.build(streams[index])
# Uncover batch norm ops.
if self._embedder_hparams.image.is_train:
self._extra_train_ops += resnet_embedder.extra_train_ops
inps.append(image_embeddings)
index += 1
else:
scope_name = 'input_embedder/vector'
reuse = scope_name in scopes
scopes.append(scope_name)
with tf.variable_scope(scope_name, reuse=reuse):
input_vector_embedder = embedders.MLPEmbedder(
layers=self._embedder_hparams.vector)
vector_embedder = input_vector_embedder.build(streams[index])
inps.append(vector_embedder)
index += 1
history = tf.concat(inps, axis=2) if inps else None
# EMBED goal.
goal = None
if self._task_config.query is not None:
scope_name = 'image_embedder/query'
reuse = scope_name in scopes
scopes.append(scope_name)
with tf.variable_scope(scope_name, reuse=reuse):
resnet_goal_embedder = embedders.ResNet(self._embedder_hparams.goal)
goal = resnet_goal_embedder.build(streams[index])
if self._embedder_hparams.goal.is_train:
self._extra_train_ops += resnet_goal_embedder.extra_train_ops
index += 1
# Embed true targets if needed (tbd).
true_target = streams[index]
return history, goal, true_target
@abc.abstractmethod
def build(self, feeds, prev_state):
pass
class ReactivePolicy(TaskPolicy):
"""A policy which ignores history.
It processes only the current observation (last element in history) and the
goal to output a prediction.
"""
def __init__(self, *args, **kwargs):
super(ReactivePolicy, self).__init__(*args, **kwargs)
# The current implementation ignores the prev_state as it is purely reactive.
# It returns None for the current state.
def build(self, feeds, prev_state):
history, goal, _ = self._embed_task_ios(feeds)
_print_debug_ios(history, goal, None)
with tf.variable_scope('output_decoder'):
# Concatenate the embeddings of the current observation and the goal.
reactive_input = tf.concat([tf.squeeze(history[:, -1, :]), goal], axis=1)
oconfig = self._task_config.output.shape
assert len(oconfig) == 1
decoder = embedders.MLPEmbedder(
layers=self._embedder_hparams.predictions.layer_sizes + oconfig)
predictions = decoder.build(reactive_input)
return predictions, None
class RNNPolicy(TaskPolicy):
"""A policy which takes into account the full history via RNN.
  The implementation may change.
The history, together with the goal, is processed using a stacked LSTM. The
output of the last LSTM step is used to produce a prediction. Currently, only
a single step output is supported.
"""
def __init__(self, lstm_hparams, *args, **kwargs):
super(RNNPolicy, self).__init__(*args, **kwargs)
self._lstm_hparams = lstm_hparams
# The prev_state is ignored as for now the full history is specified as first
# element of the feeds. It might turn out to be beneficial to keep the state
# as part of the policy object.
def build(self, feeds, state):
history, goal, _ = self._embed_task_ios(feeds)
_print_debug_ios(history, goal, None)
params = self._lstm_hparams
cell = lambda: tf.contrib.rnn.BasicLSTMCell(params.cell_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell(
[cell() for _ in range(params.num_layers)])
# history is of shape batch_size x seq_len x embedding_dimension
batch_size, seq_len, _ = tuple(history.get_shape().as_list())
if state is None:
state = stacked_lstm.zero_state(batch_size, tf.float32)
for t in range(seq_len):
if params.concat_goal_everywhere:
lstm_input = tf.concat([tf.squeeze(history[:, t, :]), goal], axis=1)
else:
lstm_input = tf.squeeze(history[:, t, :])
output, state = stacked_lstm(lstm_input, state)
with tf.variable_scope('output_decoder'):
oconfig = self._task_config.output.shape
assert len(oconfig) == 1
features = tf.concat([output, goal], axis=1)
assert len(output.get_shape().as_list()) == 2
assert len(goal.get_shape().as_list()) == 2
decoder = embedders.MLPEmbedder(
layers=self._embedder_hparams.predictions.layer_sizes + oconfig)
# Prediction is done off the last step lstm output and the goal.
predictions = decoder.build(features)
return predictions, state
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Provides utilities to preprocess images in CIFAR-10.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
_PADDING = 4
slim = tf.contrib.slim
def preprocess_for_train(image,
output_height,
output_width,
padding=_PADDING,
add_image_summaries=True):
"""Preprocesses the given image for training.
  The image is padded, randomly cropped to output_height x output_width,
  randomly flipped, color-distorted, and standardized.
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
    padding: The amount of padding before and after each dimension of the
      image.
add_image_summaries: Enable image summaries.
Returns:
A preprocessed image.
"""
if add_image_summaries:
tf.summary.image('image', tf.expand_dims(image, 0))
# Transform the image to floats.
image = tf.to_float(image)
if padding > 0:
image = tf.pad(image, [[padding, padding], [padding, padding], [0, 0]])
# Randomly crop a [height, width] section of the image.
distorted_image = tf.random_crop(image,
[output_height, output_width, 3])
# Randomly flip the image horizontally.
distorted_image = tf.image.random_flip_left_right(distorted_image)
if add_image_summaries:
tf.summary.image('distorted_image', tf.expand_dims(distorted_image, 0))
  # Because these operations are not commutative, consider randomizing
  # the order of their operations.
distorted_image = tf.image.random_brightness(distorted_image,
max_delta=63)
distorted_image = tf.image.random_contrast(distorted_image,
lower=0.2, upper=1.8)
# Subtract off the mean and divide by the variance of the pixels.
return tf.image.per_image_standardization(distorted_image)
def preprocess_for_eval(image, output_height, output_width,
add_image_summaries=True):
"""Preprocesses the given image for evaluation.
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
add_image_summaries: Enable image summaries.
Returns:
A preprocessed image.
"""
if add_image_summaries:
tf.summary.image('image', tf.expand_dims(image, 0))
# Transform the image to floats.
image = tf.to_float(image)
# Resize and crop if needed.
resized_image = tf.image.resize_image_with_crop_or_pad(image,
output_width,
output_height)
if add_image_summaries:
tf.summary.image('resized_image', tf.expand_dims(resized_image, 0))
# Subtract off the mean and divide by the variance of the pixels.
return tf.image.per_image_standardization(resized_image)
def preprocess_image(image, output_height, output_width, is_training=False,
add_image_summaries=True):
"""Preprocesses the given image.
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
is_training: `True` if we're preprocessing the image for training and
`False` otherwise.
add_image_summaries: Enable image summaries.
Returns:
A preprocessed image.
"""
if is_training:
return preprocess_for_train(
image, output_height, output_width,
add_image_summaries=add_image_summaries)
else:
return preprocess_for_eval(
image, output_height, output_width,
add_image_summaries=add_image_summaries)
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Provides utilities to preprocess images for the Inception networks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from tensorflow.python.ops import control_flow_ops
def apply_with_random_selector(x, func, num_cases):
"""Computes func(x, sel), with sel sampled from [0...num_cases-1].
Args:
x: input Tensor.
func: Python function to apply.
num_cases: Python int32, number of cases to sample sel from.
Returns:
The result of func(x, sel), where func receives the value of the
selector as a python integer, but sel is sampled dynamically.
"""
sel = tf.random_uniform([], maxval=num_cases, dtype=tf.int32)
# Pass the real x only to one of the func calls.
return control_flow_ops.merge([
func(control_flow_ops.switch(x, tf.equal(sel, case))[1], case)
for case in range(num_cases)])[0]
def distort_color(image, color_ordering=0, fast_mode=True, scope=None):
"""Distort the color of a Tensor image.
Each color distortion is non-commutative and thus ordering of the color ops
matters. Ideally we would randomly permute the ordering of the color ops.
  Rather than adding that level of complication, we select a distinct ordering
of color ops for each preprocessing thread.
Args:
image: 3-D Tensor containing single image in [0, 1].
color_ordering: Python int, a type of distortion (valid values: 0-3).
fast_mode: Avoids slower ops (random_hue and random_contrast)
scope: Optional scope for name_scope.
Returns:
3-D Tensor color-distorted image on range [0, 1]
Raises:
ValueError: if color_ordering not in [0, 3]
"""
with tf.name_scope(scope, 'distort_color', [image]):
if fast_mode:
if color_ordering == 0:
image = tf.image.random_brightness(image, max_delta=32. / 255.)
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
else:
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
image = tf.image.random_brightness(image, max_delta=32. / 255.)
else:
if color_ordering == 0:
image = tf.image.random_brightness(image, max_delta=32. / 255.)
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
image = tf.image.random_hue(image, max_delta=0.2)
image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
elif color_ordering == 1:
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
image = tf.image.random_brightness(image, max_delta=32. / 255.)
image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
image = tf.image.random_hue(image, max_delta=0.2)
elif color_ordering == 2:
image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
image = tf.image.random_hue(image, max_delta=0.2)
image = tf.image.random_brightness(image, max_delta=32. / 255.)
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
elif color_ordering == 3:
image = tf.image.random_hue(image, max_delta=0.2)
image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
image = tf.image.random_brightness(image, max_delta=32. / 255.)
else:
raise ValueError('color_ordering must be in [0, 3]')
# The random_* ops do not necessarily clamp.
return tf.clip_by_value(image, 0.0, 1.0)
def distorted_bounding_box_crop(image,
bbox,
min_object_covered=0.1,
aspect_ratio_range=(0.75, 1.33),
area_range=(0.05, 1.0),
max_attempts=100,
scope=None):
"""Generates cropped_image using a one of the bboxes randomly distorted.
See `tf.image.sample_distorted_bounding_box` for more documentation.
Args:
image: 3-D Tensor of image (it will be converted to floats in [0, 1]).
bbox: 3-D float Tensor of bounding boxes arranged [1, num_boxes, coords]
where each coordinate is [0, 1) and the coordinates are arranged
as [ymin, xmin, ymax, xmax]. If num_boxes is 0 then it would use the whole
image.
min_object_covered: An optional `float`. Defaults to `0.1`. The cropped
area of the image must contain at least this fraction of any bounding box
supplied.
aspect_ratio_range: An optional list of `floats`. The cropped area of the
image must have an aspect ratio = width / height within this range.
area_range: An optional list of `floats`. The cropped area of the image
      must contain a fraction of the supplied image within this range.
max_attempts: An optional `int`. Number of attempts at generating a cropped
region of the image of the specified constraints. After `max_attempts`
failures, return the entire image.
scope: Optional scope for name_scope.
Returns:
A tuple, a 3-D Tensor cropped_image and the distorted bbox
"""
with tf.name_scope(scope, 'distorted_bounding_box_crop', [image, bbox]):
# Each bounding box has shape [1, num_boxes, box coords] and
# the coordinates are ordered [ymin, xmin, ymax, xmax].
# A large fraction of image datasets contain a human-annotated bounding
# box delineating the region of the image containing the object of interest.
# We choose to create a new bounding box for the object which is a randomly
# distorted version of the human-annotated bounding box that obeys an
# allowed range of aspect ratios, sizes and overlap with the human-annotated
# bounding box. If no box is supplied, then we assume the bounding box is
# the entire image.
sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(
tf.shape(image),
bounding_boxes=bbox,
min_object_covered=min_object_covered,
aspect_ratio_range=aspect_ratio_range,
area_range=area_range,
max_attempts=max_attempts,
use_image_if_no_bounding_boxes=True)
bbox_begin, bbox_size, distort_bbox = sample_distorted_bounding_box
# Crop the image to the specified bounding box.
cropped_image = tf.slice(image, bbox_begin, bbox_size)
return cropped_image, distort_bbox
def preprocess_for_train(image, height, width, bbox,
fast_mode=True,
scope=None,
add_image_summaries=True):
"""Distort one image for training a network.
Distorting images provides a useful technique for augmenting the data
set during training in order to make the network invariant to aspects
  of the image that do not affect the label.
Additionally it would create image_summaries to display the different
transformations applied to the image.
Args:
image: 3-D Tensor of image. If dtype is tf.float32 then the range should be
      [0, 1], otherwise it is converted to tf.float32 assuming that the range
is [0, MAX], where MAX is largest positive representable number for
int(8/16/32) data type (see `tf.image.convert_image_dtype` for details).
height: integer
width: integer
bbox: 3-D float Tensor of bounding boxes arranged [1, num_boxes, coords]
where each coordinate is [0, 1) and the coordinates are arranged
as [ymin, xmin, ymax, xmax].
fast_mode: Optional boolean, if True avoids slower transformations (i.e.
bi-cubic resizing, random_hue or random_contrast).
scope: Optional scope for name_scope.
add_image_summaries: Enable image summaries.
Returns:
3-D float Tensor of distorted image used for training with range [-1, 1].
"""
with tf.name_scope(scope, 'distort_image', [image, height, width, bbox]):
if bbox is None:
bbox = tf.constant([0.0, 0.0, 1.0, 1.0],
dtype=tf.float32,
shape=[1, 1, 4])
if image.dtype != tf.float32:
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
# Each bounding box has shape [1, num_boxes, box coords] and
# the coordinates are ordered [ymin, xmin, ymax, xmax].
image_with_box = tf.image.draw_bounding_boxes(tf.expand_dims(image, 0),
bbox)
if add_image_summaries:
tf.summary.image('image_with_bounding_boxes', image_with_box)
distorted_image, distorted_bbox = distorted_bounding_box_crop(image, bbox)
# Restore the shape since the dynamic slice based upon the bbox_size loses
# the third dimension.
distorted_image.set_shape([None, None, 3])
image_with_distorted_box = tf.image.draw_bounding_boxes(
tf.expand_dims(image, 0), distorted_bbox)
if add_image_summaries:
tf.summary.image('images_with_distorted_bounding_box',
image_with_distorted_box)
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
distorted_image,
lambda x, method: tf.image.resize_images(x, [height, width], method),
num_cases=num_resize_cases)
if add_image_summaries:
tf.summary.image('cropped_resized_image',
tf.expand_dims(distorted_image, 0))
# Randomly flip the image horizontally.
distorted_image = tf.image.random_flip_left_right(distorted_image)
# Randomly distort the colors. There are 1 or 4 ways to do it.
num_distort_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
distorted_image,
lambda x, ordering: distort_color(x, ordering, fast_mode),
num_cases=num_distort_cases)
if add_image_summaries:
tf.summary.image('final_distorted_image',
tf.expand_dims(distorted_image, 0))
distorted_image = tf.subtract(distorted_image, 0.5)
distorted_image = tf.multiply(distorted_image, 2.0)
return distorted_image
def preprocess_for_eval(image, height, width,
central_fraction=0.875, scope=None):
"""Prepare one image for evaluation.
If height and width are specified it would output an image with that size by
applying resize_bilinear.
If central_fraction is specified it would crop the central fraction of the
input image.
Args:
image: 3-D Tensor of image. If dtype is tf.float32 then the range should be
      [0, 1], otherwise it is converted to tf.float32 assuming that the range
is [0, MAX], where MAX is largest positive representable number for
int(8/16/32) data type (see `tf.image.convert_image_dtype` for details).
height: integer
width: integer
central_fraction: Optional Float, fraction of the image to crop.
scope: Optional scope for name_scope.
Returns:
3-D float Tensor of prepared image.
"""
with tf.name_scope(scope, 'eval_image', [image, height, width]):
if image.dtype != tf.float32:
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
# Crop the central region of the image with an area containing 87.5% of
# the original image.
if central_fraction:
image = tf.image.central_crop(image, central_fraction=central_fraction)
if height and width:
# Resize the image to the specified height and width.
image = tf.expand_dims(image, 0)
image = tf.image.resize_bilinear(image, [height, width],
align_corners=False)
image = tf.squeeze(image, [0])
image = tf.subtract(image, 0.5)
image = tf.multiply(image, 2.0)
return image
def preprocess_image(image, height, width,
is_training=False,
bbox=None,
fast_mode=True,
add_image_summaries=True):
"""Pre-process one image for training or evaluation.
Args:
image: 3-D Tensor [height, width, channels] with the image. If dtype is
      tf.float32 then the range should be [0, 1], otherwise it is converted
to tf.float32 assuming that the range is [0, MAX], where MAX is largest
positive representable number for int(8/16/32) data type (see
`tf.image.convert_image_dtype` for details).
height: integer, image expected height.
width: integer, image expected width.
is_training: Boolean. If true it would transform an image for train,
otherwise it would transform it for evaluation.
bbox: 3-D float Tensor of bounding boxes arranged [1, num_boxes, coords]
where each coordinate is [0, 1) and the coordinates are arranged as
[ymin, xmin, ymax, xmax].
fast_mode: Optional boolean, if True avoids slower transformations.
add_image_summaries: Enable image summaries.
Returns:
3-D float Tensor containing an appropriately scaled image
Raises:
ValueError: if user does not provide bounding box
"""
if is_training:
return preprocess_for_train(image, height, width, bbox, fast_mode,
add_image_summaries=add_image_summaries)
else:
return preprocess_for_eval(image, height, width)
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Provides utilities for preprocessing."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
slim = tf.contrib.slim
def preprocess_image(image, output_height, output_width, is_training):
"""Preprocesses the given image.
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
is_training: `True` if we're preprocessing the image for training and
`False` otherwise.
Returns:
A preprocessed image.
"""
image = tf.to_float(image)
image = tf.image.resize_image_with_crop_or_pad(
image, output_width, output_height)
image = tf.subtract(image, 128.0)
image = tf.div(image, 128.0)
return image
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Contains a factory for building various models."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from preprocessing import cifarnet_preprocessing
from preprocessing import inception_preprocessing
from preprocessing import lenet_preprocessing
from preprocessing import vgg_preprocessing
slim = tf.contrib.slim
def get_preprocessing(name, is_training=False):
"""Returns preprocessing_fn(image, height, width, **kwargs).
Args:
name: The name of the preprocessing function.
is_training: `True` if the model is being used for training and `False`
otherwise.
Returns:
    preprocessing_fn: A function that preprocesses a single image (pre-batch).
It has the following signature:
image = preprocessing_fn(image, output_height, output_width, ...).
Raises:
ValueError: If Preprocessing `name` is not recognized.
"""
preprocessing_fn_map = {
'cifarnet': cifarnet_preprocessing,
'inception': inception_preprocessing,
'inception_v1': inception_preprocessing,
'inception_v2': inception_preprocessing,
'inception_v3': inception_preprocessing,
'inception_v4': inception_preprocessing,
'inception_resnet_v2': inception_preprocessing,
'lenet': lenet_preprocessing,
'mobilenet_v1': inception_preprocessing,
'nasnet_mobile': inception_preprocessing,
'nasnet_large': inception_preprocessing,
'pnasnet_large': inception_preprocessing,
'resnet_v1_50': vgg_preprocessing,
'resnet_v1_101': vgg_preprocessing,
'resnet_v1_152': vgg_preprocessing,
'resnet_v1_200': vgg_preprocessing,
'resnet_v2_50': vgg_preprocessing,
'resnet_v2_101': vgg_preprocessing,
'resnet_v2_152': vgg_preprocessing,
'resnet_v2_200': vgg_preprocessing,
'vgg': vgg_preprocessing,
'vgg_a': vgg_preprocessing,
'vgg_16': vgg_preprocessing,
'vgg_19': vgg_preprocessing,
}
if name not in preprocessing_fn_map:
raise ValueError('Preprocessing name [%s] was not recognized' % name)
def preprocessing_fn(image, output_height, output_width, **kwargs):
return preprocessing_fn_map[name].preprocess_image(
image, output_height, output_width, is_training=is_training, **kwargs)
return preprocessing_fn
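def _example_get_preprocessing():
  """Illustrative usage sketch; not part of the original module.

  Shows how a caller might obtain and apply the preprocessing function for
  the resnet_v2_50 checkpoint used in this project. The dummy input shape and
  the 224x224 output size are assumptions made for this example only.
  """
  image = tf.zeros([480, 640, 3], dtype=tf.uint8)
  preprocessing_fn = get_preprocessing('resnet_v2_50', is_training=False)
  return preprocessing_fn(image, 224, 224)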
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Provides utilities to preprocess images.
The preprocessing steps for VGG were introduced in the following technical
report:
Very Deep Convolutional Networks For Large-Scale Image Recognition
Karen Simonyan and Andrew Zisserman
arXiv technical report, 2015
PDF: http://arxiv.org/pdf/1409.1556.pdf
ILSVRC 2014 Slides: http://www.robots.ox.ac.uk/~karen/pdf/ILSVRC_2014.pdf
CC-BY-4.0
More information can be obtained from the VGG website:
www.robots.ox.ac.uk/~vgg/research/very_deep/
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
slim = tf.contrib.slim
_R_MEAN = 123.68
_G_MEAN = 116.78
_B_MEAN = 103.94
_RESIZE_SIDE_MIN = 256
_RESIZE_SIDE_MAX = 512
def _crop(image, offset_height, offset_width, crop_height, crop_width):
"""Crops the given image using the provided offsets and sizes.
Note that the method doesn't assume we know the input image size but it does
assume we know the input image rank.
Args:
image: an image of shape [height, width, channels].
offset_height: a scalar tensor indicating the height offset.
offset_width: a scalar tensor indicating the width offset.
crop_height: the height of the cropped image.
crop_width: the width of the cropped image.
Returns:
the cropped (and resized) image.
Raises:
InvalidArgumentError: if the rank is not 3 or if the image dimensions are
less than the crop size.
"""
original_shape = tf.shape(image)
rank_assertion = tf.Assert(
tf.equal(tf.rank(image), 3),
['Rank of image must be equal to 3.'])
with tf.control_dependencies([rank_assertion]):
cropped_shape = tf.stack([crop_height, crop_width, original_shape[2]])
size_assertion = tf.Assert(
tf.logical_and(
tf.greater_equal(original_shape[0], crop_height),
tf.greater_equal(original_shape[1], crop_width)),
['Crop size greater than the image size.'])
offsets = tf.to_int32(tf.stack([offset_height, offset_width, 0]))
# Use tf.slice instead of crop_to_bounding box as it accepts tensors to
# define the crop size.
with tf.control_dependencies([size_assertion]):
image = tf.slice(image, offsets, cropped_shape)
return tf.reshape(image, cropped_shape)
def _random_crop(image_list, crop_height, crop_width):
"""Crops the given list of images.
The function applies the same crop to each image in the list. This can be
effectively applied when there are multiple image inputs of the same
dimension such as:
image, depths, normals = _random_crop([image, depths, normals], 120, 150)
Args:
image_list: a list of image tensors of the same dimension but possibly
varying channel.
crop_height: the new height.
crop_width: the new width.
Returns:
the image_list with cropped images.
Raises:
ValueError: if there are multiple image inputs provided with different size
or the images are smaller than the crop dimensions.
"""
if not image_list:
raise ValueError('Empty image_list.')
# Compute the rank assertions.
rank_assertions = []
for i in range(len(image_list)):
image_rank = tf.rank(image_list[i])
rank_assert = tf.Assert(
tf.equal(image_rank, 3),
['Wrong rank for tensor %s [expected] [actual]',
image_list[i].name, 3, image_rank])
rank_assertions.append(rank_assert)
with tf.control_dependencies([rank_assertions[0]]):
image_shape = tf.shape(image_list[0])
image_height = image_shape[0]
image_width = image_shape[1]
crop_size_assert = tf.Assert(
tf.logical_and(
tf.greater_equal(image_height, crop_height),
tf.greater_equal(image_width, crop_width)),
['Crop size greater than the image size.'])
asserts = [rank_assertions[0], crop_size_assert]
for i in range(1, len(image_list)):
image = image_list[i]
asserts.append(rank_assertions[i])
with tf.control_dependencies([rank_assertions[i]]):
shape = tf.shape(image)
height = shape[0]
width = shape[1]
height_assert = tf.Assert(
tf.equal(height, image_height),
['Wrong height for tensor %s [expected][actual]',
image.name, height, image_height])
width_assert = tf.Assert(
tf.equal(width, image_width),
['Wrong width for tensor %s [expected][actual]',
image.name, width, image_width])
asserts.extend([height_assert, width_assert])
# Create a random bounding box.
#
# Use tf.random_uniform and not numpy.random.rand as doing the former would
# generate random numbers at graph eval time, unlike the latter which
# generates random numbers at graph definition time.
with tf.control_dependencies(asserts):
max_offset_height = tf.reshape(image_height - crop_height + 1, [])
with tf.control_dependencies(asserts):
max_offset_width = tf.reshape(image_width - crop_width + 1, [])
offset_height = tf.random_uniform(
[], maxval=max_offset_height, dtype=tf.int32)
offset_width = tf.random_uniform(
[], maxval=max_offset_width, dtype=tf.int32)
return [_crop(image, offset_height, offset_width,
crop_height, crop_width) for image in image_list]
def _central_crop(image_list, crop_height, crop_width):
"""Performs central crops of the given image list.
Args:
image_list: a list of image tensors of the same dimension but possibly
varying channel.
crop_height: the height of the image following the crop.
crop_width: the width of the image following the crop.
Returns:
the list of cropped images.
"""
outputs = []
for image in image_list:
image_height = tf.shape(image)[0]
image_width = tf.shape(image)[1]
offset_height = (image_height - crop_height) / 2
offset_width = (image_width - crop_width) / 2
outputs.append(_crop(image, offset_height, offset_width,
crop_height, crop_width))
return outputs
def _mean_image_subtraction(image, means):
"""Subtracts the given means from each image channel.
For example:
means = [123.68, 116.779, 103.939]
image = _mean_image_subtraction(image, means)
Note that the rank of `image` must be known.
Args:
image: a tensor of size [height, width, C].
means: a C-vector of values to subtract from each channel.
Returns:
the centered image.
Raises:
ValueError: If the rank of `image` is unknown, if `image` has a rank other
than three or if the number of channels in `image` doesn't match the
number of values in `means`.
"""
if image.get_shape().ndims != 3:
raise ValueError('Input must be of size [height, width, C>0]')
num_channels = image.get_shape().as_list()[-1]
if len(means) != num_channels:
raise ValueError('len(means) must match the number of channels')
channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image)
for i in range(num_channels):
channels[i] -= means[i]
return tf.concat(axis=2, values=channels)
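# A quick worked example of _mean_image_subtraction (added for illustration,
# not part of the original module): with means [_R_MEAN, _G_MEAN, _B_MEAN] =
# [123.68, 116.78, 103.94], a pure-red pixel [255.0, 0.0, 0.0] becomes
# [131.32, -116.78, -103.94].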
def _smallest_size_at_least(height, width, smallest_side):
"""Computes new shape with the smallest side equal to `smallest_side`.
Computes new shape with the smallest side equal to `smallest_side` while
preserving the original aspect ratio.
Args:
height: an int32 scalar tensor indicating the current height.
width: an int32 scalar tensor indicating the current width.
smallest_side: A python integer or scalar `Tensor` indicating the size of
the smallest side after resize.
Returns:
new_height: an int32 scalar tensor indicating the new height.
    new_width: an int32 scalar tensor indicating the new width.
"""
smallest_side = tf.convert_to_tensor(smallest_side, dtype=tf.int32)
height = tf.to_float(height)
width = tf.to_float(width)
smallest_side = tf.to_float(smallest_side)
scale = tf.cond(tf.greater(height, width),
lambda: smallest_side / width,
lambda: smallest_side / height)
new_height = tf.to_int32(tf.rint(height * scale))
new_width = tf.to_int32(tf.rint(width * scale))
return new_height, new_width
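# A quick worked example of _smallest_size_at_least (added for illustration,
# not part of the original module): for a 480x640 (height x width) image and
# smallest_side=256, scale = 256 / 480 ~= 0.533, so the new size is 256x341.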
def _aspect_preserving_resize(image, smallest_side):
"""Resize images preserving the original aspect ratio.
Args:
image: A 3-D image `Tensor`.
smallest_side: A python integer or scalar `Tensor` indicating the size of
the smallest side after resize.
Returns:
resized_image: A 3-D tensor containing the resized image.
"""
smallest_side = tf.convert_to_tensor(smallest_side, dtype=tf.int32)
shape = tf.shape(image)
height = shape[0]
width = shape[1]
new_height, new_width = _smallest_size_at_least(height, width, smallest_side)
image = tf.expand_dims(image, 0)
resized_image = tf.image.resize_bilinear(image, [new_height, new_width],
align_corners=False)
resized_image = tf.squeeze(resized_image)
resized_image.set_shape([None, None, 3])
return resized_image
def preprocess_for_train(image,
output_height,
output_width,
resize_side_min=_RESIZE_SIDE_MIN,
resize_side_max=_RESIZE_SIDE_MAX):
"""Preprocesses the given image for training.
  Note that the actual resizing scale is sampled from
  [`resize_side_min`, `resize_side_max`].
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
resize_side_min: The lower bound for the smallest side of the image for
aspect-preserving resizing.
resize_side_max: The upper bound for the smallest side of the image for
aspect-preserving resizing.
Returns:
A preprocessed image.
"""
resize_side = tf.random_uniform(
[], minval=resize_side_min, maxval=resize_side_max+1, dtype=tf.int32)
image = _aspect_preserving_resize(image, resize_side)
image = _random_crop([image], output_height, output_width)[0]
image.set_shape([output_height, output_width, 3])
image = tf.to_float(image)
image = tf.image.random_flip_left_right(image)
return _mean_image_subtraction(image, [_R_MEAN, _G_MEAN, _B_MEAN])
def preprocess_for_eval(image, output_height, output_width, resize_side):
"""Preprocesses the given image for evaluation.
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
resize_side: The smallest side of the image for aspect-preserving resizing.
Returns:
A preprocessed image.
"""
image = _aspect_preserving_resize(image, resize_side)
image = _central_crop([image], output_height, output_width)[0]
image.set_shape([output_height, output_width, 3])
image = tf.to_float(image)
return _mean_image_subtraction(image, [_R_MEAN, _G_MEAN, _B_MEAN])
def preprocess_image(image, output_height, output_width, is_training=False,
resize_side_min=_RESIZE_SIDE_MIN,
resize_side_max=_RESIZE_SIDE_MAX):
"""Preprocesses the given image.
Args:
image: A `Tensor` representing an image of arbitrary size.
output_height: The height of the image after preprocessing.
output_width: The width of the image after preprocessing.
is_training: `True` if we're preprocessing the image for training and
`False` otherwise.
resize_side_min: The lower bound for the smallest side of the image for
aspect-preserving resizing. If `is_training` is `False`, then this value
is used for rescaling.
resize_side_max: The upper bound for the smallest side of the image for
aspect-preserving resizing. If `is_training` is `False`, this value is
      ignored. Otherwise, the resize side is sampled from
      [resize_side_min, resize_side_max].
Returns:
A preprocessed image.
"""
if is_training:
return preprocess_for_train(image, output_height, output_width,
resize_side_min, resize_side_max)
else:
return preprocess_for_eval(image, output_height, output_width,
resize_side_min)
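def _example_vgg_preprocess():
  """Illustrative usage sketch; not part of the original module.

  Builds both the training and evaluation preprocessing graphs for a single
  decoded RGB image. The placeholder shape and the 224x224 output size are
  assumptions made for this example only.
  """
  raw_image = tf.placeholder(tf.uint8, shape=[None, None, 3])
  train_image = preprocess_image(raw_image, 224, 224, is_training=True)
  eval_image = preprocess_image(raw_image, 224, 224, is_training=False)
  return train_image, eval_image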
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Contains classes specifying naming conventions used for object detection.
Specifies:
InputDataFields: standard fields used by reader/preprocessor/batcher.
DetectionResultFields: standard fields returned by object detector.
BoxListFields: standard field used by BoxList
TfExampleFields: standard fields for tf-example data format (go/tf-example).
"""
class InputDataFields(object):
"""Names for the input tensors.
  Holds the standard data field names to use for identifying input tensors.
  These names should be used by the decoder to identify keys for the returned
  tensor_dict containing input tensors, and by the model to identify the
  tensors it needs.
Attributes:
image: image.
image_additional_channels: additional channels.
original_image: image in the original input size.
key: unique key corresponding to image.
source_id: source of the original image.
filename: original filename of the dataset (without common path).
groundtruth_image_classes: image-level class labels.
groundtruth_boxes: coordinates of the ground truth boxes in the image.
groundtruth_classes: box-level class labels.
groundtruth_label_types: box-level label types (e.g. explicit negative).
groundtruth_is_crowd: [DEPRECATED, use groundtruth_group_of instead]
is the groundtruth a single object or a crowd.
groundtruth_area: area of a groundtruth segment.
groundtruth_difficult: is a `difficult` object
groundtruth_group_of: is a `group_of` objects, e.g. multiple objects of the
same class, forming a connected group, where instances are heavily
occluding each other.
proposal_boxes: coordinates of object proposal boxes.
proposal_objectness: objectness score of each proposal.
groundtruth_instance_masks: ground truth instance masks.
groundtruth_instance_boundaries: ground truth instance boundaries.
groundtruth_instance_classes: instance mask-level class labels.
groundtruth_keypoints: ground truth keypoints.
groundtruth_keypoint_visibilities: ground truth keypoint visibilities.
groundtruth_label_scores: groundtruth label scores.
groundtruth_weights: groundtruth weight factor for bounding boxes.
num_groundtruth_boxes: number of groundtruth boxes.
    true_image_shape: the true shape of the image within the resized image,
      since resized images can be padded with zeros.
multiclass_scores: the label score per class for each box.
"""
image = 'image'
image_additional_channels = 'image_additional_channels'
original_image = 'original_image'
key = 'key'
source_id = 'source_id'
filename = 'filename'
groundtruth_image_classes = 'groundtruth_image_classes'
groundtruth_boxes = 'groundtruth_boxes'
groundtruth_classes = 'groundtruth_classes'
groundtruth_label_types = 'groundtruth_label_types'
groundtruth_is_crowd = 'groundtruth_is_crowd'
groundtruth_area = 'groundtruth_area'
groundtruth_difficult = 'groundtruth_difficult'
groundtruth_group_of = 'groundtruth_group_of'
proposal_boxes = 'proposal_boxes'
proposal_objectness = 'proposal_objectness'
groundtruth_instance_masks = 'groundtruth_instance_masks'
groundtruth_instance_boundaries = 'groundtruth_instance_boundaries'
groundtruth_instance_classes = 'groundtruth_instance_classes'
groundtruth_keypoints = 'groundtruth_keypoints'
groundtruth_keypoint_visibilities = 'groundtruth_keypoint_visibilities'
groundtruth_label_scores = 'groundtruth_label_scores'
groundtruth_weights = 'groundtruth_weights'
num_groundtruth_boxes = 'num_groundtruth_boxes'
true_image_shape = 'true_image_shape'
multiclass_scores = 'multiclass_scores'
class DetectionResultFields(object):
"""Naming conventions for storing the output of the detector.
Attributes:
source_id: source of the original image.
key: unique key corresponding to image.
detection_boxes: coordinates of the detection boxes in the image.
detection_scores: detection scores for the detection boxes in the image.
detection_classes: detection-level class labels.
detection_masks: contains a segmentation mask for each detection box.
detection_boundaries: contains an object boundary for each detection box.
detection_keypoints: contains detection keypoints for each detection box.
num_detections: number of detections in the batch.
"""
source_id = 'source_id'
key = 'key'
detection_boxes = 'detection_boxes'
detection_scores = 'detection_scores'
detection_classes = 'detection_classes'
detection_masks = 'detection_masks'
detection_boundaries = 'detection_boundaries'
detection_keypoints = 'detection_keypoints'
num_detections = 'num_detections'
class BoxListFields(object):
"""Naming conventions for BoxLists.
Attributes:
boxes: bounding box coordinates.
classes: classes per bounding box.
scores: scores per bounding box.
weights: sample weights per bounding box.
objectness: objectness score per bounding box.
masks: masks per bounding box.
boundaries: boundaries per bounding box.
keypoints: keypoints per bounding box.
keypoint_heatmaps: keypoint heatmaps per bounding box.
is_crowd: is_crowd annotation per bounding box.
"""
boxes = 'boxes'
classes = 'classes'
scores = 'scores'
weights = 'weights'
objectness = 'objectness'
masks = 'masks'
boundaries = 'boundaries'
keypoints = 'keypoints'
keypoint_heatmaps = 'keypoint_heatmaps'
is_crowd = 'is_crowd'
class TfExampleFields(object):
"""TF-example proto feature names for object detection.
Holds the standard feature names to load from an Example proto for object
detection.
Attributes:
image_encoded: JPEG encoded string
image_format: image format, e.g. "JPEG"
filename: filename
channels: number of channels of image
colorspace: colorspace, e.g. "RGB"
height: height of image in pixels, e.g. 462
width: width of image in pixels, e.g. 581
source_id: original source of the image
image_class_text: image-level label in text format
image_class_label: image-level label in numerical format
object_class_text: labels in text format, e.g. ["person", "cat"]
object_class_label: labels in numbers, e.g. [16, 8]
object_bbox_xmin: xmin coordinates of groundtruth box, e.g. 10, 30
object_bbox_xmax: xmax coordinates of groundtruth box, e.g. 50, 40
object_bbox_ymin: ymin coordinates of groundtruth box, e.g. 40, 50
object_bbox_ymax: ymax coordinates of groundtruth box, e.g. 80, 70
object_view: viewpoint of object, e.g. ["frontal", "left"]
object_truncated: is object truncated, e.g. [true, false]
object_occluded: is object occluded, e.g. [true, false]
object_difficult: is object difficult, e.g. [true, false]
object_group_of: is object a single object or a group of objects
object_depiction: is object a depiction
object_is_crowd: [DEPRECATED, use object_group_of instead]
is the object a single object or a crowd
object_segment_area: the area of the segment.
object_weight: a weight factor for the object's bounding box.
instance_masks: instance segmentation masks.
instance_boundaries: instance boundaries.
instance_classes: Classes for each instance segmentation mask.
detection_class_label: class label in numbers.
detection_bbox_ymin: ymin coordinates of a detection box.
detection_bbox_xmin: xmin coordinates of a detection box.
detection_bbox_ymax: ymax coordinates of a detection box.
detection_bbox_xmax: xmax coordinates of a detection box.
detection_score: detection score for the class label and box.
"""
image_encoded = 'image/encoded'
image_format = 'image/format' # format is reserved keyword
filename = 'image/filename'
channels = 'image/channels'
colorspace = 'image/colorspace'
height = 'image/height'
width = 'image/width'
source_id = 'image/source_id'
image_class_text = 'image/class/text'
image_class_label = 'image/class/label'
object_class_text = 'image/object/class/text'
object_class_label = 'image/object/class/label'
object_bbox_ymin = 'image/object/bbox/ymin'
object_bbox_xmin = 'image/object/bbox/xmin'
object_bbox_ymax = 'image/object/bbox/ymax'
object_bbox_xmax = 'image/object/bbox/xmax'
object_view = 'image/object/view'
object_truncated = 'image/object/truncated'
object_occluded = 'image/object/occluded'
object_difficult = 'image/object/difficult'
object_group_of = 'image/object/group_of'
object_depiction = 'image/object/depiction'
object_is_crowd = 'image/object/is_crowd'
object_segment_area = 'image/object/segment/area'
object_weight = 'image/object/weight'
instance_masks = 'image/segmentation/object'
instance_boundaries = 'image/boundaries/object'
instance_classes = 'image/segmentation/object/class'
detection_class_label = 'image/detection/label'
detection_bbox_ymin = 'image/detection/bbox/ymin'
detection_bbox_xmin = 'image/detection/bbox/xmin'
detection_bbox_ymax = 'image/detection/bbox/ymax'
detection_bbox_xmax = 'image/detection/bbox/xmax'
detection_score = 'image/detection/score'
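def _example_detection_result():
  """Illustrative sketch; not part of the original module.

  Shows how the DetectionResultFields constants might be used to key a
  detector's output dictionary. The numeric values below are made up for the
  example only.
  """
  return {
      DetectionResultFields.detection_boxes: [[0.1, 0.2, 0.5, 0.6]],
      DetectionResultFields.detection_scores: [0.87],
      DetectionResultFields.detection_classes: [62],
      DetectionResultFields.num_detections: 1,
  }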