Open-source FEELVOS model, which was developed by Paul Voigtlaender during his...

Open-source FEELVOS model, which was developed by Paul Voigtlaender during his 2018 summer internship at Google. The work has been accepted to CVPR 2019. (#6274)

Unverified Commit e1ae37c4 authored Feb 27, 2019 by aquariusjay Committed by GitHub Feb 27, 2019
20 changed files
--- a/research/feelvos/CONTRIBUTING.md
+++ b/research/feelvos/CONTRIBUTING.md
+# How to Contribute
+We'd love to accept your patches and contributions to this project. There are
+just a few small guidelines you need to follow.
+## Contributor License Agreement
+Contributions to this project must be accompanied by a Contributor License
+Agreement. You (or your employer) retain the copyright to your contribution;
+this simply gives us permission to use and redistribute your contributions as
+part of the project. Head over to <https://cla.developers.google.com/> to see
+your current agreements on file or to sign a new one.
+You generally only need to submit a CLA once, so if you've already submitted one
+(even if it was for a different project), you probably don't need to do it
+again.
+## Code reviews
+All submissions, including submissions by project members, require review. We
+use GitHub pull requests for this purpose. Consult
+[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
+information on using pull requests.
+## Community Guidelines
+This project follows [Google's Open Source Community
+Guidelines](https://opensource.google.com/conduct/).
--- a/research/feelvos/LICENSE
+++ b/research/feelvos/LICENSE
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/research/feelvos/README.md
+++ b/research/feelvos/README.md
+# FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation
+FEELVOS is a fast model for video object segmentation which does not rely on fine-tuning on the
+first frame.
+For details, please refer to our paper. If you find the code useful, please
+also consider citing it.
+* FEELVOS:
+```
+@inproceedings{feelvos2019,
+    title={FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation},
+    author={Paul Voigtlaender and Yuning Chai and Florian Schroff and Hartwig Adam and Bastian Leibe and Liang-Chieh Chen},
+    booktitle={CVPR},
+    year={2019}
+}
+```
+## Dependencies
+FEELVOS requires a GPU with around 12 GB of memory and depends on the following libraries:
+* TensorFlow
+* Pillow
+* Numpy
+* Scipy
+* scikit-image
+* tf Slim (which is included in the "tensorflow/models/research/" checkout)
+* DeepLab (which is included in the "tensorflow/models/research/" checkout)
+* correlation_cost (optional, see below)
+For detailed steps to install TensorFlow, follow the [TensorFlow installation
+instructions](https://www.tensorflow.org/install/). A typical user can install
+TensorFlow using the following command:
+```bash
+pip install tensorflow-gpu
+```
+The remaining libraries can also be installed with pip using:
+```bash
+pip install pillow scipy scikit-image
+```
+## Dependency on correlation_cost
+For fast cross-correlation, we use correlation_cost as an external dependency. By default, FEELVOS
+uses a slow and memory-hungry fallback implementation that does not need correlation_cost. If you
+care about performance, set up correlation_cost by following the instructions in
+correlation_cost/README.md and afterwards set ```USE_CORRELATION_COST = True``` in
+utils/embedding_utils.py.
+## Pre-trained Models
+We provide two pre-trained FEELVOS models, both based on Xception-65:
+* [Trained on DAVIS 2017](http://download.tensorflow.org/models/feelvos_davis17_trained.tar.gz)
+* [Trained on DAVIS 2017 and YouTube-VOS](http://download.tensorflow.org/models/feelvos_davis17_and_youtubevos_trained.tar.gz)
+Additionally, we provide a [DeepLab checkpoint for Xception-65 pre-trained on ImageNet and COCO](http://download.tensorflow.org/models/xception_65_coco_pretrained_2018_10_02.tar.gz),
+which can be used as an initialization for training FEELVOS.
+## Pre-computed Segmentation Masks
+We provide [pre-computed segmentation masks](http://download.tensorflow.org/models/feelvos_precomputed_masks.zip)
+from FEELVOS models trained with and without YouTube-VOS data, for the following datasets:
+* DAVIS 2017 validation set
+* DAVIS 2017 test-dev set
+* YouTube-Objects dataset
+## Local Inference
+For a demo of local inference on DAVIS 2017 run
+```bash
+# From tensorflow/models/research/feelvos
+sh eval.sh
+```
+## Local Training
+For a demo of local training on DAVIS 2017 run
+```bash
+# From tensorflow/models/research/feelvos
+sh train.sh
+```
+## Contacts (Maintainers)
+*   Paul Voigtlaender, github: [pvoigtlaender](https://github.com/pvoigtlaender)
+*   Yuning Chai, github: [yuningchai](https://github.com/yuningchai)
+*   Liang-Chieh Chen, github: [aquariusjay](https://github.com/aquariusjay)
+## License
+All the code in the feelvos folder is covered by the [LICENSE](https://github.com/tensorflow/models/blob/master/LICENSE)
+under tensorflow/models. Please refer to the LICENSE for details.
--- a/research/feelvos/__init__.py
+++ b/research/feelvos/__init__.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
--- a/research/feelvos/common.py
+++ b/research/feelvos/common.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Provides flags that are common to scripts.
+Common flags from train/vis_video.py are collected in this script.
+"""
+import tensorflow as tf
+from deeplab import common
+flags = tf.app.flags
+flags.DEFINE_enum(
+    'classification_loss', 'softmax_with_attention',
+    ['softmax', 'triplet', 'softmax_with_attention'],
+    'Type of loss function used for classifying pixels, can be either softmax, '
+    'softmax_with_attention, or triplet.')
+flags.DEFINE_integer('k_nearest_neighbors', 1,
+                     'The number of nearest neighbors to use.')
+flags.DEFINE_integer('embedding_dimension', 100, 'The dimension used for the '
+                                                 'learned embedding')
+flags.DEFINE_boolean('use_softmax_feedback', True,
+                     'Whether to give the softmax predictions of the last '
+                     'frame as additional input to the segmentation head.')
+flags.DEFINE_boolean('sample_adjacent_and_consistent_query_frames', True,
+                     'If true, the query frames (all but the first frame '
+                     'which is the reference frame) will be sampled such '
+                     'that they are adjacent video frames and have the same '
+                     'crop coordinates and flip augmentation. Note that if '
+                     'use_softmax_feedback is True, this option will '
+                     'automatically be activated.')
+flags.DEFINE_integer('embedding_seg_feature_dimension', 256,
+                     'The dimensionality used in the segmentation head layers.')
+flags.DEFINE_integer('embedding_seg_n_layers', 4, 'The number of layers in the '
+                                                  'segmentation head.')
+flags.DEFINE_integer('embedding_seg_kernel_size', 7, 'The kernel size used in '
+                                                     'the segmentation head.')
+flags.DEFINE_multi_integer('embedding_seg_atrous_rates', [],
+                           'The atrous rates to use for the segmentation head.')
+flags.DEFINE_boolean('normalize_nearest_neighbor_distances', True,
+                     'Whether to normalize the nearest neighbor distances '
+                     'to [0,1] using sigmoid, scale and shift.')
+flags.DEFINE_boolean('also_attend_to_previous_frame', True, 'Whether to also '
+                     'use nearest neighbor attention with respect to the '
+                     'previous frame.')
+flags.DEFINE_bool('use_local_previous_frame_attention', True,
+                  'Whether to restrict the previous frame attention to a local '
+                  'search window. Only has an effect, if '
+                  'also_attend_to_previous_frame is True.')
+flags.DEFINE_integer('previous_frame_attention_window_size', 15,
+                     'The window size used for local previous frame attention,'
+                     ' if use_local_previous_frame_attention is True.')
+flags.DEFINE_boolean('use_first_frame_matching', True, 'Whether to extract '
+                     'features by matching to the reference frame. This should '
+                     'always be true except for ablation experiments.')
+FLAGS = flags.FLAGS
+# Constants
+# Perform semantic segmentation predictions.
+OUTPUT_TYPE = common.OUTPUT_TYPE
+# Semantic segmentation item names.
+LABELS_CLASS = common.LABELS_CLASS
+IMAGE = common.IMAGE
+HEIGHT = common.HEIGHT
+WIDTH = common.WIDTH
+IMAGE_NAME = common.IMAGE_NAME
+SOURCE_ID = 'source_id'
+VIDEO_ID = 'video_id'
+LABEL = common.LABEL
+ORIGINAL_IMAGE = common.ORIGINAL_IMAGE
+PRECEDING_FRAME_LABEL = 'preceding_frame_label'
+# Test set name.
+TEST_SET = common.TEST_SET
+# Internal constants.
+OBJECT_LABEL = 'object_label'
+class VideoModelOptions(common.ModelOptions):
+  """Internal version of immutable class to hold model options."""
+  def __new__(cls,
+              outputs_to_num_classes,
+              crop_size=None,
+              atrous_rates=None,
+              output_stride=8):
+    """Constructor to set default values.
+    Args:
+      outputs_to_num_classes: A dictionary from output type to the number of
+        classes. For example, for the task of semantic segmentation with 21
+        semantic classes, we would have outputs_to_num_classes['semantic'] = 21.
+      crop_size: A tuple [crop_height, crop_width].
+      atrous_rates: A list of atrous convolution rates for ASPP.
+      output_stride: The ratio of input to output spatial resolution.
+    Returns:
+      A new VideoModelOptions instance.
+    """
+    self = super(VideoModelOptions, cls).__new__(
+        cls,
+        outputs_to_num_classes,
+        crop_size,
+        atrous_rates,
+        output_stride)
+    # Add internal flags.
+    self.classification_loss = FLAGS.classification_loss
+    return self
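The pattern used by `VideoModelOptions` (a namedtuple subclass that sets defaults in `__new__` and then attaches one extra attribute) can be tried without TensorFlow. The sketch below is a minimal stand-in, not DeepLab's code: `BaseOptions` is a hypothetical replacement for `deeplab.common.ModelOptions` that has only the four fields named in the docstring above, the loss name is passed as an argument instead of being read from `FLAGS`, and the crop size is an arbitrary example value.

```python
import collections


class BaseOptions(
    collections.namedtuple('BaseOptions', [
        'outputs_to_num_classes', 'crop_size', 'atrous_rates', 'output_stride'
    ])):
  """Hypothetical stand-in for deeplab.common.ModelOptions."""

  __slots__ = ()

  def __new__(cls, outputs_to_num_classes, crop_size=None, atrous_rates=None,
              output_stride=8):
    return super(BaseOptions, cls).__new__(
        cls, outputs_to_num_classes, crop_size, atrous_rates, output_stride)


class VideoOptions(BaseOptions):
  """Adds one ordinary attribute on top of the immutable tuple fields."""

  def __new__(cls, outputs_to_num_classes, crop_size=None, atrous_rates=None,
              output_stride=8, classification_loss='softmax_with_attention'):
    self = super(VideoOptions, cls).__new__(
        cls, outputs_to_num_classes, crop_size, atrous_rates, output_stride)
    # This works because the subclass does not declare __slots__, so each
    # instance has a __dict__ that can hold extra attributes.
    self.classification_loss = classification_loss
    return self


# Example values are arbitrary.
options = VideoOptions({'semantic': 2}, crop_size=(465, 465))
```

The extra attribute lives outside the tuple, so it does not appear in `tuple(options)`; only the four declared fields do.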
--- a/research/feelvos/correlation_cost/README.md
+++ b/research/feelvos/correlation_cost/README.md
+# correlation_cost
+FEELVOS uses correlation_cost as an optional dependency to speed up cross-correlation and reduce
+its memory consumption.
+## Installation
+Unfortunately we cannot provide the code for correlation_cost directly, so you
+will have to copy some files from this pull request
+https://github.com/tensorflow/tensorflow/pull/21392/. For your convenience we
+prepared scripts to download and adjust the code automatically.
+In the best case, all you need to do is run build.sh with the path to your
+CUDA installation (tested only with CUDA 9).
+Note that the path should be to a folder containing the cuda folder, not to the
+cuda folder itself, e.g. if your cuda is in /usr/local/cuda-9.0, you can create
+a symlink /usr/local/cuda pointing to /usr/local/cuda-9.0 and then run
+```bash
+sh build.sh /usr/local/
+```
+This will
+* Download the code via ```sh get_code.sh ```
+* Apply minor adjustments to the code via ```sh fix_code.sh```
+* Clone the dependencies cub and thrust from github via ```sh clone_dependencies.sh```
+* Compile a shared library correlation_cost.so for correlation_cost via
+```sh compile.sh "${CUDA_DIR}"```
+Please review the licenses of correlation_cost, cub, and thrust.
+## Enabling correlation_cost
+If you managed to create the correlation_cost.so file, then set
+```USE_CORRELATION_COST = True``` in feelvos/utils/embedding_utils.py and try to run
+```sh eval.sh```.
--- a/research/feelvos/correlation_cost/build.sh
+++ b/research/feelvos/correlation_cost/build.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# This script is used to download and build the code for correlation_cost.
+#
+# Usage:
+#   sh ./build.sh cuda_dir
+# Where cuda_dir points to a directory containing the cuda folder (not the cuda folder itself).
+#
+#
+if [ "$#" -ne 1 ]; then
+  echo "Illegal number of parameters, usage: ./build.sh cuda_dir"
+  echo "Where cuda_dir points to a directory containing the cuda folder (not the cuda folder itself)"
+  exit 1
+fi
+set -e
+set -x
+sh ./get_code.sh
+sh ./fix_code.sh
+sh ./clone_dependencies.sh
+# compile.sh uses bash arrays, so it is invoked with bash rather than sh.
+bash ./compile.sh "$1"
--- a/research/feelvos/correlation_cost/clone_dependencies.sh
+++ b/research/feelvos/correlation_cost/clone_dependencies.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# This script is used to clone the dependencies, i.e. cub and thrust, of correlation_cost from github.
+#
+# Usage:
+#   sh ./clone_dependencies.sh
+#
+#
+# Clone cub.
+if [ ! -d cub ] ; then
+  git clone https://github.com/dmlc/cub.git
+fi
+# Clone thrust.
+if [ ! -d thrust ] ; then
+  git clone https://github.com/thrust/thrust.git
+fi
--- a/research/feelvos/correlation_cost/compile.sh
+++ b/research/feelvos/correlation_cost/compile.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# This script is used to compile the code for correlation_cost and create correlation_cost.so.
+#
+#  Usage:
+#    sh ./compile.sh cuda_dir
+#  Where cuda_dir points to a directory containing the cuda folder (not the cuda folder itself).
+#
+#
+if [ "$#" -ne 1 ]; then
+  echo "Illegal number of parameters, usage: ./compile.sh cuda_dir"
+  exit 1
+fi
+CUDA_DIR=$1
+if [ ! -d "${CUDA_DIR}/cuda" ]; then
+  echo "cuda_dir must point to a directory containing the cuda folder, not to the cuda folder itself"
+  exit 1
+fi
+TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
+TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
+CUB_DIR=cub
+THRUST_DIR=thrust
+# Depending on the versions of your nvcc and gcc, the flag --expt-relaxed-constexpr might be required or should be removed.
+# If nvcc complains that your gcc version is too new, you can point it to another gcc
+# version by using something like nvcc -ccbin /path/to/your/gcc6
+nvcc -std=c++11 --expt-relaxed-constexpr -I ./ -I ${CUB_DIR}/../ -I ${THRUST_DIR} -I ${CUDA_DIR}/ -c -o correlation_cost_op_gpu.o kernels/correlation_cost_op_gpu.cu.cc ${TF_CFLAGS[@]} -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC
+g++ -std=c++11 -I ./ -L ${CUDA_DIR}/cuda/lib64 -shared -o correlation_cost.so ops/correlation_cost_op.cc kernels/correlation_cost_op.cc correlation_cost_op_gpu.o ${TF_CFLAGS[@]} -fPIC -lcudart ${TF_LFLAGS[@]} -D GOOGLE_CUDA=1
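The `cuda_dir` rule enforced at the top of compile.sh (the argument must be the folder that contains `cuda`, not the `cuda` folder itself) is easy to get wrong. The following is a small Python sketch of the same check, exercised on a temporary directory tree; the function name `is_valid_cuda_parent` is made up for illustration and is not part of the repository.

```python
import os
import tempfile


def is_valid_cuda_parent(cuda_dir):
  """Returns True if cuda_dir contains a 'cuda' folder (or a symlink to one)."""
  return os.path.isdir(os.path.join(cuda_dir, 'cuda'))


# Demonstration on a throwaway directory tree instead of a real CUDA install.
with tempfile.TemporaryDirectory() as parent:
  os.mkdir(os.path.join(parent, 'cuda'))
  # The parent folder is what compile.sh expects.
  parent_ok = is_valid_cuda_parent(parent)
  # Passing the cuda folder itself is the mistake the script guards against.
  child_ok = is_valid_cuda_parent(os.path.join(parent, 'cuda'))
```

Because `os.path.isdir` follows symbolic links, a `/usr/local/cuda` symlink pointing at a versioned install passes this check in the same way a real folder would.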
--- a/research/feelvos/correlation_cost/fix_code.sh
+++ b/research/feelvos/correlation_cost/fix_code.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# This script is used to modify the downloaded code.
+#
+#  Usage:
+#    sh ./fix_code.sh
+#
+#
+sed -i "s/tensorflow\/contrib\/correlation_cost\///g" kernels/correlation_cost_op_gpu.cu.cc
+sed -i "s/tensorflow\/contrib\/correlation_cost\///g" kernels/correlation_cost_op.cc
+sed -i "s/external\/cub_archive\//cub\//g" kernels/correlation_cost_op_gpu.cu.cc
+sed -i "s/from tensorflow.contrib.util import loader/import tensorflow as tf/g" python/ops/correlation_cost_op.py
+grep -v "from tensorflow" python/ops/correlation_cost_op.py | grep -v resource_loader.get_path_to_datafile > correlation_cost_op.py.tmp && mv correlation_cost_op.py.tmp python/ops/correlation_cost_op.py
+sed -i "s/array_ops/tf/g" python/ops/correlation_cost_op.py
+sed -i "s/ops/tf/g" python/ops/correlation_cost_op.py
+sed -i "s/loader.load_op_library(/tf.load_op_library('feelvos\/correlation_cost\/correlation_cost.so')/g" python/ops/correlation_cost_op.py
+sed -i "s/gen_correlation_cost_op/_correlation_cost_op_so/g" python/ops/correlation_cost_op.py
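To make the first three `sed` commands in fix_code.sh easier to follow, here is a Python sketch of the two path substitutions they perform on the C++ sources. The input lines are invented examples of what such include directives could look like; they are not copied from the downloaded files, and `rewrite_includes` is a hypothetical helper.

```python
def rewrite_includes(source_text):
  """Applies the two include-path substitutions from fix_code.sh to a string."""
  # Drop the TensorFlow-internal prefix so paths are relative to this folder.
  source_text = source_text.replace('tensorflow/contrib/correlation_cost/', '')
  # Point the cub include at the locally cloned cub checkout.
  source_text = source_text.replace('external/cub_archive/', 'cub/')
  return source_text


# Invented sample input, for illustration only.
sample = (
    '#include "tensorflow/contrib/correlation_cost/kernels/correlation_cost_op.h"\n'
    '#include "external/cub_archive/cub/device/device_reduce.cuh"\n')
rewritten = rewrite_includes(sample)
```

Plain string replacement is enough here because the patterns in those `sed` commands contain no regular-expression metacharacters apart from the escaped slashes.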
--- a/research/feelvos/correlation_cost/get_code.sh
+++ b/research/feelvos/correlation_cost/get_code.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# This script is used to download the code for correlation_cost.
+#
+#  Usage:
+#    sh ./get_code.sh
+#
+#
+mkdir -p kernels ops python/ops
+touch __init__.py
+touch python/__init__.py
+touch python/ops/__init__.py
+wget https://raw.githubusercontent.com/tensorflow/tensorflow/91b163b9bd8dd0f8c2631b4245a67dfd387536a6/tensorflow/contrib/correlation_cost/ops/correlation_cost_op.cc -O ops/correlation_cost_op.cc
+wget https://raw.githubusercontent.com/tensorflow/tensorflow/91b163b9bd8dd0f8c2631b4245a67dfd387536a6/tensorflow/contrib/correlation_cost/python/ops/correlation_cost_op.py -O python/ops/correlation_cost_op.py
+wget https://raw.githubusercontent.com/tensorflow/tensorflow/91b163b9bd8dd0f8c2631b4245a67dfd387536a6/tensorflow/contrib/correlation_cost/kernels/correlation_cost_op.cc -O kernels/correlation_cost_op.cc
+wget https://raw.githubusercontent.com/tensorflow/tensorflow/91b163b9bd8dd0f8c2631b4245a67dfd387536a6/tensorflow/contrib/correlation_cost/kernels/correlation_cost_op.h -O kernels/correlation_cost_op.h
+wget https://raw.githubusercontent.com/tensorflow/tensorflow/91b163b9bd8dd0f8c2631b4245a67dfd387536a6/tensorflow/contrib/correlation_cost/kernels/correlation_cost_op_gpu.cu.cc -O kernels/correlation_cost_op_gpu.cu.cc
--- a/research/feelvos/datasets/__init__.py
+++ b/research/feelvos/datasets/__init__.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
--- a/research/feelvos/datasets/build_davis2017_data.py
+++ b/research/feelvos/datasets/build_davis2017_data.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Converts DAVIS 2017 data to TFRecord file format with SequenceExample protos.
+"""
+import io
+import math
+import os
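+try:
+  from StringIO import StringIO  # Python 2.
+except ImportError:
+  from io import BytesIO as StringIO  # Python 3: encoded PNGs are bytes.
+import numpy as np
+import PIL.Image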
+import tensorflow as tf
+FLAGS = tf.app.flags.FLAGS
+tf.app.flags.DEFINE_string('data_folder', 'DAVIS2017/',
+                           'Folder containing the DAVIS 2017 data')
+tf.app.flags.DEFINE_string('imageset', 'val',
+                           'Which subset to use, either train or val')
+tf.app.flags.DEFINE_string(
+    'output_dir', './tfrecord',
+    'Path to save converted TFRecords of TensorFlow examples.')
+_NUM_SHARDS_TRAIN = 10
+_NUM_SHARDS_VAL = 1
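+def read_image(path):
+  """Reads an image file; returns its encoded bytes and (height, width)."""
+  # Open in binary mode: the file holds encoded image bytes, not text.
+  with open(path, 'rb') as fid:
+    image_str = fid.read()
+    image = PIL.Image.open(io.BytesIO(image_str))
+    w, h = image.size
+  return image_str, (h, w)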
+def read_annotation(path):
+  """Reads a single image annotation from a png image.
+  Args:
+    path: Path to the png image.
+  Returns:
+    png_string: The png encoded as string.
+    size: Tuple of (height, width).
+  """
+  with open(path, 'rb') as fid:
+    x = np.array(PIL.Image.open(fid))
+    h, w = x.shape
+    im = PIL.Image.fromarray(x)
+  output = StringIO()
+  im.save(output, format='png')
+  png_string = output.getvalue()
+  output.close()
+  return png_string, (h, w)
+def process_video(key, input_dir, anno_dir):
+  """Creates a SequenceExample for the video.
+  Args:
+    key: Name of the video.
+    input_dir: Directory which contains the image files.
+    anno_dir: Directory which contains the annotation files.
+  Returns:
+    The created SequenceExample.
+  """
+  frame_names = sorted(tf.gfile.ListDirectory(input_dir))
+  anno_files = sorted(tf.gfile.ListDirectory(anno_dir))
+  assert len(frame_names) == len(anno_files)
+  sequence = tf.train.SequenceExample()
+  context = sequence.context.feature
+  features = sequence.feature_lists.feature_list
+  for i, name in enumerate(frame_names):
+    image_str, image_shape = read_image(
+        os.path.join(input_dir, name))
+    anno_str, anno_shape = read_annotation(
+        os.path.join(anno_dir, name[:-4] + '.png'))
+    image_encoded = features['image/encoded'].feature.add()
+    image_encoded.bytes_list.value.append(image_str)
+    segmentation_encoded = features['segmentation/object/encoded'].feature.add()
+    segmentation_encoded.bytes_list.value.append(anno_str)
+    np.testing.assert_array_equal(np.array(image_shape), np.array(anno_shape))
+    if i == 0:
+      first_shape = np.array(image_shape)
+    else:
+      np.testing.assert_array_equal(np.array(image_shape), first_shape)
+  context['video_id'].bytes_list.value.append(key.encode('ascii'))
+  context['clip/frames'].int64_list.value.append(len(frame_names))
+  context['image/format'].bytes_list.value.append(b'JPEG')
+  context['image/channels'].int64_list.value.append(3)
+  context['image/height'].int64_list.value.append(first_shape[0])
+  context['image/width'].int64_list.value.append(first_shape[1])
+  context['segmentation/object/format'].bytes_list.value.append(b'PNG')
+  context['segmentation/object/height'].int64_list.value.append(first_shape[0])
+  context['segmentation/object/width'].int64_list.value.append(first_shape[1])
+  return sequence
+def convert(data_folder, imageset, output_dir, num_shards):
+  """Converts the specified subset of DAVIS 2017 to TFRecord format.
+  Args:
+    data_folder: The path to the DAVIS 2017 data.
+    imageset: The subset to use, either train or val.
+    output_dir: Where to store the TFRecords.
+    num_shards: The number of shards used for storing the data.
+  """
+  sets_file = os.path.join(data_folder, 'ImageSets', '2017', imageset + '.txt')
+  vids = [x.strip() for x in open(sets_file).readlines()]
+  num_vids = len(vids)
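+  # Take the ceiling of the quotient so that no video is dropped when num_vids
+  # is not divisible by num_shards.
+  num_vids_per_shard = int(math.ceil(num_vids / float(num_shards)))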
+  for shard_id in range(num_shards):
+    output_filename = os.path.join(
+        output_dir,
+        '%s-%05d-of-%05d.tfrecord' % (imageset, shard_id, num_shards))
+    with tf.python_io.TFRecordWriter(output_filename) as tfrecord_writer:
+      start_idx = shard_id * num_vids_per_shard
+      end_idx = min((shard_id + 1) * num_vids_per_shard, num_vids)
+      for i in range(start_idx, end_idx):
+        print('Converting video %d/%d shard %d video %s' % (
+            i + 1, num_vids, shard_id, vids[i]))
+        img_dir = os.path.join(data_folder, 'JPEGImages', '480p', vids[i])
+        anno_dir = os.path.join(data_folder, 'Annotations', '480p', vids[i])
+        example = process_video(vids[i], img_dir, anno_dir)
+        tfrecord_writer.write(example.SerializeToString())
+def main(unused_argv):
+  imageset = FLAGS.imageset
+  assert imageset in ('train', 'val')
+  if imageset == 'train':
+    num_shards = _NUM_SHARDS_TRAIN
+  else:
+    num_shards = _NUM_SHARDS_VAL
+  convert(FLAGS.data_folder, FLAGS.imageset, FLAGS.output_dir, num_shards)
+if __name__ == '__main__':
+  tf.app.run()
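convert() splits the video list into contiguous shards and names each output file after its shard index. Below is a minimal standalone sketch of that index arithmetic, taking the ceiling of the quotient so that every video lands in some shard even when the count is not divisible by the number of shards. `shard_ranges` and `shard_filename` are hypothetical helper names, not functions of the script.

```python
import math


def shard_ranges(num_items, num_shards):
  """Returns a list of (start, end) index ranges, one per shard."""
  # Ceiling of the quotient, so the last shard absorbs the remainder.
  per_shard = int(math.ceil(num_items / float(num_shards)))
  ranges = []
  for shard_id in range(num_shards):
    start = min(shard_id * per_shard, num_items)
    end = min((shard_id + 1) * per_shard, num_items)
    ranges.append((start, end))
  return ranges


def shard_filename(imageset, shard_id, num_shards):
  """Formats a shard file name the same way convert() does."""
  return '%s-%05d-of-%05d.tfrecord' % (imageset, shard_id, num_shards)


for shard_id, (start, end) in enumerate(shard_ranges(7, 3)):
  print(shard_filename('train', shard_id, 3), list(range(start, end)))
```

With 60 training videos and 10 shards, each shard holds 6 videos. The uneven case (7 items, 3 shards) shows the last shard receiving the single leftover item.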
--- a/research/feelvos/datasets/download_and_convert_davis17.sh
+++ b/research/feelvos/datasets/download_and_convert_davis17.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# Script to download and preprocess the DAVIS 2017 dataset.
+#
+# Usage:
+#   bash ./download_and_convert_davis17.sh
+# Exit immediately if a command exits with a non-zero status.
+set -e
+CURRENT_DIR=$(pwd)
+WORK_DIR="./davis17"
+mkdir -p "${WORK_DIR}"
+cd "${WORK_DIR}"
+# Helper function to download and unpack the DAVIS 2017 dataset.
+download_and_uncompress() {
+  local BASE_URL=${1}
+  local FILENAME=${2}
+  if [ ! -f "${FILENAME}" ]; then
+    echo "Downloading ${FILENAME} to ${WORK_DIR}"
+    wget -nd -c "${BASE_URL}/${FILENAME}"
+    echo "Uncompressing ${FILENAME}"
+    unzip "${FILENAME}"
+  fi
+}
+BASE_URL="https://data.vision.ee.ethz.ch/csergi/share/davis"
+FILENAME="DAVIS-2017-trainval-480p.zip"
+download_and_uncompress "${BASE_URL}" "${FILENAME}"
+cd "${CURRENT_DIR}"
+# Root path for DAVIS 2017 dataset.
+DAVIS_ROOT="${WORK_DIR}/DAVIS"
+# Build TFRecords of the dataset.
+# First, create output directory for storing TFRecords.
+OUTPUT_DIR="${WORK_DIR}/tfrecord"
+mkdir -p "${OUTPUT_DIR}"
+IMAGE_FOLDER="${DAVIS_ROOT}/JPEGImages"
+LIST_FOLDER="${DAVIS_ROOT}/ImageSets/Segmentation"
+# Convert validation set.
+if [ ! -f "${OUTPUT_DIR}/val-00000-of-00001.tfrecord" ]; then
+  echo "Converting DAVIS 2017 dataset (val)..."
+  python ./build_davis2017_data.py \
+    --data_folder="${DAVIS_ROOT}" \
+    --imageset=val \
+    --output_dir="${OUTPUT_DIR}"
+fi
+# Convert training set.
+if [ ! -f "${OUTPUT_DIR}/train-00009-of-00010.tfrecord" ]; then
+  echo "Converting DAVIS 2017 dataset (train)..."
+  python ./build_davis2017_data.py \
+    --data_folder="${DAVIS_ROOT}" \
+    --imageset=train \
+    --output_dir="${OUTPUT_DIR}"
+fi
--- a/research/feelvos/datasets/tfsequence_example_decoder.py
+++ b/research/feelvos/datasets/tfsequence_example_decoder.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Contains the TFSequenceExampleDecoder.
+The TFSequenceExampleDecoder is a DataDecoder used to decode TensorFlow
+SequenceExample protos. In order to do so, each requested item must be paired
+with one or more context or sequence features that are parsed to produce the
+Tensor-based manifestation of the item.
+"""
+import tensorflow as tf
+slim = tf.contrib.slim
+data_decoder = slim.data_decoder
+class TFSequenceExampleDecoder(data_decoder.DataDecoder):
+  """A decoder for TensorFlow SequenceExamples.
+  Decoding SequenceExample proto buffers consists of two stages:
+  (1) Example parsing and (2) tensor manipulation.
+  In the first stage, the tf.parse_single_sequence_example function is called
+  with a dictionary of context features (FixedLenFeature or VarLenFeature) and
+  a dictionary of sequence features (FixedLenSequenceFeature or VarLenFeature).
+  These instances tell TF how to parse the example. The output of this stage is
+  a set of tensors.
+  In the second stage, the resulting tensors are manipulated to provide the
+  requested 'item' tensors.
+  To perform this decoding operation, a TFSequenceExampleDecoder is given a
+  dictionary of ItemHandlers. Each ItemHandler indicates the set of features
+  for stage 1 and contains the instructions for post-processing its tensors
+  for stage 2.
+  """
+  def __init__(self, keys_to_context_features, keys_to_sequence_features,
+               items_to_handlers):
+    """Constructs the decoder.
+    Args:
+      keys_to_context_features: a dictionary from TF-SequenceExample context
+        keys to either tf.VarLenFeature or tf.FixedLenFeature instances.
+        See tensorflow's parsing_ops.py.
+      keys_to_sequence_features: a dictionary from TF-SequenceExample sequence
+        keys to either tf.VarLenFeature or tf.FixedLenSequenceFeature instances.
+        See tensorflow's parsing_ops.py.
+      items_to_handlers: a dictionary from items (strings) to ItemHandler
+        instances. Note that the ItemHandler's are provided the keys that they
+        use to return the final item Tensors.
+    Raises:
+      ValueError: if the same key is present for context features and sequence
+        features.
+    """
+    unique_keys = set()
+    unique_keys.update(keys_to_context_features)
+    unique_keys.update(keys_to_sequence_features)
+    if len(unique_keys) != (
+        len(keys_to_context_features) + len(keys_to_sequence_features)):
+      # This situation is ambiguous in the decoder's keys_to_tensors variable.
+      raise ValueError('Context and sequence keys are not unique. \n'
+                       ' Context keys: %s \n Sequence keys: %s' %
+                       (list(keys_to_context_features.keys()),
+                        list(keys_to_sequence_features.keys())))
+    self._keys_to_context_features = keys_to_context_features
+    self._keys_to_sequence_features = keys_to_sequence_features
+    self._items_to_handlers = items_to_handlers
+  def list_items(self):
+    """See base class."""
+    return self._items_to_handlers.keys()
+  def decode(self, serialized_example, items=None):
+    """Decodes the given serialized TF-SequenceExample.
+    Args:
+      serialized_example: a serialized TF-SequenceExample tensor.
+      items: the list of items to decode. These must be a subset of the item
+        keys in self._items_to_handlers. If `items` is left as None, then all
+        of the items in self._items_to_handlers are decoded.
+    Returns:
+      the decoded items, a list of tensors.
+    """
+    context, feature_list = tf.parse_single_sequence_example(
+        serialized_example, self._keys_to_context_features,
+        self._keys_to_sequence_features)
+    # Reshape non-sparse elements just once:
+    for k in self._keys_to_context_features:
+      v = self._keys_to_context_features[k]
+      if isinstance(v, tf.FixedLenFeature):
+        context[k] = tf.reshape(context[k], v.shape)
+    if not items:
+      items = self._items_to_handlers.keys()
+    outputs = []
+    for item in items:
+      handler = self._items_to_handlers[item]
+      keys_to_tensors = {
+          key: context[key] if key in context else feature_list[key]
+          for key in handler.keys
+      }
+      outputs.append(handler.tensors_to_item(keys_to_tensors))
+    return outputs
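The decoder relies on two small pieces of dictionary logic: context and sequence keys must not overlap, and each handler key is looked up in the parsed context first and in the sequence feature list otherwise. Below is a minimal sketch of both without TensorFlow. Plain Python values stand in for parsed tensors, and `check_disjoint` and `gather_handler_inputs` are hypothetical names, not part of the class.

```python
def check_disjoint(context_keys, sequence_keys):
  """Raises ValueError if a key appears in both collections."""
  overlap = set(context_keys) & set(sequence_keys)
  if overlap:
    raise ValueError(
        'Keys present in both context and sequence features: %s' %
        sorted(overlap))


def gather_handler_inputs(handler_keys, context, feature_list):
  """Picks each key from context if present, else from feature_list."""
  return {
      key: context[key] if key in context else feature_list[key]
      for key in handler_keys
  }


# Plain values stand in for the tensors returned by the parsing stage.
context = {'image/height': 480, 'video_id': 'bear'}
feature_list = {'image/encoded': ['frame0', 'frame1']}

check_disjoint(context, feature_list)
inputs = gather_handler_inputs(['image/height', 'image/encoded'],
                               context, feature_list)
print(inputs)
```

Rejecting overlapping keys up front is what makes the context-first lookup unambiguous: a key can only ever resolve to one of the two dictionaries.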
--- a/research/feelvos/datasets/video_dataset.py
+++ b/research/feelvos/datasets/video_dataset.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Provides data from video object segmentation datasets.
+This file provides both images and annotations (instance segmentations) for
+TensorFlow. Currently, we support the following datasets:
+1. DAVIS 2017 (https://davischallenge.org/davis2017/code.html).
+2. DAVIS 2016 (https://davischallenge.org/davis2016/code.html).
+3. YouTube-VOS (https://youtube-vos.org/dataset/download).
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import collections
+import os.path
+import tensorflow as tf
+from feelvos.datasets import tfsequence_example_decoder
+slim = tf.contrib.slim
+dataset = slim.dataset
+tfexample_decoder = slim.tfexample_decoder
+_ITEMS_TO_DESCRIPTIONS = {
+    'image': 'A color image of varying height and width.',
+    'labels_class': ('A semantic segmentation label whose size matches image. '
+                     'Its values range from 0 (background) to num_classes.'),
+}
+# Named tuple to describe the dataset properties.
+DatasetDescriptor = collections.namedtuple(
+    'DatasetDescriptor',
+    ['splits_to_sizes',   # Splits of the dataset into training, val, and test.
+     'num_classes',   # Number of semantic classes.
+     'ignore_label',  # Ignore label value.
+    ]
+)
+_DAVIS_2016_INFORMATION = DatasetDescriptor(
+    splits_to_sizes={'train': [30, 1830],
+                     'val': [20, 1376]},
+    num_classes=2,
+    ignore_label=255,
+)
+_DAVIS_2017_INFORMATION = DatasetDescriptor(
+    splits_to_sizes={'train': [60, 4219],
+                     'val': [30, 2023],
+                     'test-dev': [30, 2037]},
+    num_classes=None,  # Number of instances per video differs.
+    ignore_label=255,
+)
+_YOUTUBE_VOS_2018_INFORMATION = DatasetDescriptor(
+    # Leave these sizes as None to allow for different splits into
+    # training and validation sets.
+    splits_to_sizes={'train': [None, None],
+                     'val': [None, None]},
+    num_classes=None,  # Number of instances per video differs.
+    ignore_label=255,
+)
+_DATASETS_INFORMATION = {
+    'davis_2016': _DAVIS_2016_INFORMATION,
+    'davis_2017': _DAVIS_2017_INFORMATION,
+    'youtube_vos_2018': _YOUTUBE_VOS_2018_INFORMATION,
+}
+# Default file pattern of the TFRecord files. Note we include '-' to avoid
+# confusion between the `train-` and `trainval-` sets.
+_FILE_PATTERN = '%s-*'
+def get_dataset(dataset_name,
+                split_name,
+                dataset_dir,
+                file_pattern=None,
+                data_type='tf_sequence_example',
+                decode_video_frames=False):
+  """Gets an instance of slim Dataset.
+  Args:
+    dataset_name: String, dataset name.
+    split_name: String, the train/val split name.
+    dataset_dir: String, the directory of the dataset sources.
+    file_pattern: String, file pattern of the TFRecord files.
+    data_type: String, data type. Currently only 'tf_sequence_example' is
+      supported.
+    decode_video_frames: Boolean, whether to decode the images. Not decoding
+      them here is useful if the frames are subsampled later.
+  Returns:
+    An instance of slim Dataset.
+  Raises:
+    ValueError: If the dataset_name or split_name is not recognized, or if
+      the data_type is not supported.
+  """
+  if dataset_name not in _DATASETS_INFORMATION:
+    raise ValueError('The specified dataset is not supported yet.')
+  splits_to_sizes = _DATASETS_INFORMATION[dataset_name].splits_to_sizes
+  if split_name not in splits_to_sizes:
+    raise ValueError('data split name %s not recognized' % split_name)
+  # Prepare the variables for different datasets.
+  num_classes = _DATASETS_INFORMATION[dataset_name].num_classes
+  ignore_label = _DATASETS_INFORMATION[dataset_name].ignore_label
+  if file_pattern is None:
+    file_pattern = _FILE_PATTERN
+  file_pattern = os.path.join(dataset_dir, file_pattern % split_name)
+  if data_type == 'tf_sequence_example':
+    keys_to_context_features = {
+        'image/format': tf.FixedLenFeature((), tf.string, default_value='jpeg'),
+        'image/height': tf.FixedLenFeature((), tf.int64, default_value=0),
+        'image/width': tf.FixedLenFeature((), tf.int64, default_value=0),
+        'segmentation/object/format': tf.FixedLenFeature(
+            (), tf.string, default_value='png'),
+        'video_id': tf.FixedLenFeature((), tf.string, default_value='unknown')
+    }
+    label_name = 'class' if dataset_name == 'davis_2016' else 'object'
+    keys_to_sequence_features = {
+        'image/encoded': tf.FixedLenSequenceFeature((), dtype=tf.string),
+        'segmentation/{}/encoded'.format(label_name):
+            tf.FixedLenSequenceFeature((), tf.string),
+    }
+    items_to_handlers = {
+        'height': tfexample_decoder.Tensor('image/height'),
+        'width': tfexample_decoder.Tensor('image/width'),
+        'video_id': tfexample_decoder.Tensor('video_id')
+    }
+    if decode_video_frames:
+      decode_image_handler = tfexample_decoder.Image(
+          image_key='image/encoded',
+          format_key='image/format',
+          channels=3,
+          repeated=True)
+      items_to_handlers['image'] = decode_image_handler
+      decode_label_handler = tfexample_decoder.Image(
+          image_key='segmentation/{}/encoded'.format(label_name),
+          format_key='segmentation/{}/format'.format(label_name),
+          channels=1,
+          repeated=True)
+      items_to_handlers['labels_class'] = decode_label_handler
+    else:
+      items_to_handlers['image/encoded'] = tfexample_decoder.Tensor(
+          'image/encoded')
+      items_to_handlers[
+          'segmentation/object/encoded'] = tfexample_decoder.Tensor(
+              'segmentation/{}/encoded'.format(label_name))
+    decoder = tfsequence_example_decoder.TFSequenceExampleDecoder(
+        keys_to_context_features, keys_to_sequence_features, items_to_handlers)
+  else:
+    raise ValueError('Unknown data type.')
+  size = splits_to_sizes[split_name]
+  if isinstance(size, (list, tuple)):
+    num_videos = size[0]
+    num_samples = size[1]
+  else:
+    num_videos = 0
+    num_samples = size
+  return dataset.Dataset(
+      data_sources=file_pattern,
+      reader=tf.TFRecordReader,
+      decoder=decoder,
+      num_samples=num_samples,
+      num_videos=num_videos,
+      items_to_descriptions=_ITEMS_TO_DESCRIPTIONS,
+      ignore_label=ignore_label,
+      num_classes=num_classes,
+      name=dataset_name,
+      multi_label=True)
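get_dataset() turns a split name into a file glob and unpacks the split size into a video count and a sample count. Below is a minimal sketch of that bookkeeping without TensorFlow. `resolve_split` is a hypothetical helper, the sizes are copied from the DAVIS 2017 descriptor, and the `trainval` file name is invented purely to show why the pattern ends in `-*`.

```python
import fnmatch
import os.path

_FILE_PATTERN = '%s-*'

# [num_videos, num_samples] per split, as in the DAVIS 2017 descriptor.
SPLITS_TO_SIZES = {'train': [60, 4219], 'val': [30, 2023]}


def resolve_split(dataset_dir, split_name, file_pattern=None):
  """Returns (glob pattern, num_videos, num_samples) for a split."""
  if split_name not in SPLITS_TO_SIZES:
    raise ValueError('data split name %s not recognized' % split_name)
  if file_pattern is None:
    file_pattern = _FILE_PATTERN
  pattern = os.path.join(dataset_dir, file_pattern % split_name)
  num_videos, num_samples = SPLITS_TO_SIZES[split_name]
  return pattern, num_videos, num_samples


pattern, num_videos, num_samples = resolve_split('davis17/tfrecord', 'train')
print(pattern, num_videos, num_samples)

# The '-' keeps the 'train' pattern from matching a hypothetical 'trainval'
# shard.
print(fnmatch.fnmatchcase('train-00000-of-00010.tfrecord', 'train-*'))
print(fnmatch.fnmatchcase('trainval-00000-of-00001.tfrecord', 'train-*'))
```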
--- a/research/feelvos/eval.sh
+++ b/research/feelvos/eval.sh
+#!/bin/bash
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+#
+# This script is used to locally run inference on DAVIS 2017. Users can also
+# adapt this script to their own use case. See train.sh for an example of
+# local training.
+#
+# Usage:
+#   # From the tensorflow/models/research/feelvos directory.
+#   sh ./eval.sh
+#
+#
+# Exit immediately if a command exits with a non-zero status.
+set -e
+# Move one-level up to tensorflow/models/research directory.
+cd ..
+# Update PYTHONPATH.
+export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim:`pwd`/feelvos
+# Set up the working environment.
+CURRENT_DIR=$(pwd)
+WORK_DIR="${CURRENT_DIR}/feelvos"
+# Run embedding_utils_test first to make sure the PYTHONPATH is correctly set.
+python "${WORK_DIR}"/utils/embedding_utils_test.py -v
+# Go to datasets folder and download and convert the DAVIS 2017 dataset.
+DATASET_DIR="datasets"
+cd "${WORK_DIR}/${DATASET_DIR}"
+sh download_and_convert_davis17.sh
+# Go to models folder and download and unpack the DAVIS 2017 trained model.
+MODELS_DIR="models"
+mkdir -p "${WORK_DIR}/${MODELS_DIR}"
+cd "${WORK_DIR}/${MODELS_DIR}"
+if [ ! -d "feelvos_davis17_trained" ]; then
+  wget http://download.tensorflow.org/models/feelvos_davis17_trained.tar.gz
+  tar -xvf feelvos_davis17_trained.tar.gz
+  echo "model_checkpoint_path: \"model.ckpt-200004\"" > feelvos_davis17_trained/checkpoint
+  rm feelvos_davis17_trained.tar.gz
+fi
+CHECKPOINT_DIR="${WORK_DIR}/${MODELS_DIR}/feelvos_davis17_trained/"
+# Go back to original directory.
+cd "${CURRENT_DIR}"
+# Set up the working directories.
+DAVIS_FOLDER="davis17"
+EXP_FOLDER="exp/eval_on_val_set"
+VIS_LOGDIR="${WORK_DIR}/${DATASET_DIR}/${DAVIS_FOLDER}/${EXP_FOLDER}/eval"
+mkdir -p "${VIS_LOGDIR}"
+DAVIS_DATASET="${WORK_DIR}/${DATASET_DIR}/${DAVIS_FOLDER}/tfrecord"
+python "${WORK_DIR}"/vis_video.py \
+  --dataset=davis_2017 \
+  --dataset_dir="${DAVIS_DATASET}" \
+  --vis_logdir="${VIS_LOGDIR}" \
+  --checkpoint_dir="${CHECKPOINT_DIR}" \
+  --logtostderr \
+  --atrous_rates=12 \
+  --atrous_rates=24 \
+  --atrous_rates=36 \
+  --decoder_output_stride=4 \
+  --model_variant=xception_65 \
+  --multi_grid=1 \
+  --multi_grid=1 \
+  --multi_grid=1 \
+  --output_stride=8 \
+  --save_segmentations
--- a/research/feelvos/input_preprocess.py
+++ b/research/feelvos/input_preprocess.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Prepare the data used for FEELVOS training/evaluation."""
+import tensorflow as tf
+from deeplab.core import feature_extractor
+from deeplab.core import preprocess_utils
+# The probability of flipping the images and labels
+# left-right during training
+_PROB_OF_FLIP = 0.5
+get_random_scale = preprocess_utils.get_random_scale
+randomly_scale_image_and_label = (
+    preprocess_utils.randomly_scale_image_and_label)
+def preprocess_image_and_label(image,
+                               label,
+                               crop_height,
+                               crop_width,
+                               min_resize_value=None,
+                               max_resize_value=None,
+                               resize_factor=None,
+                               min_scale_factor=1.,
+                               max_scale_factor=1.,
+                               scale_factor_step_size=0,
+                               ignore_label=255,
+                               is_training=True,
+                               model_variant=None):
+  """Preprocesses the image and label.
+  Args:
+    image: Input image.
+    label: Ground truth annotation label.
+    crop_height: The height value used to crop the image and label.
+    crop_width: The width value used to crop the image and label.
+    min_resize_value: Desired size of the smaller image side.
+    max_resize_value: Maximum allowed size of the larger image side.
+    resize_factor: Resized dimensions are multiple of factor plus one.
+    min_scale_factor: Minimum scale factor value.
+    max_scale_factor: Maximum scale factor value.
+    scale_factor_step_size: The step size from min scale factor to max scale
+      factor. The input is randomly scaled based on the value of
+      (min_scale_factor, max_scale_factor, scale_factor_step_size).
+    ignore_label: The label value which will be ignored for training and
+      evaluation.
+    is_training: If the preprocessing is used for training or not.
+    model_variant: Model variant (string) for choosing how to mean-subtract the
+      images. See feature_extractor.network_map for supported model variants.
+  Returns:
+    original_image: Original image (could be resized).
+    processed_image: Preprocessed image.
+    label: Preprocessed ground truth segmentation label.
+  Raises:
+    ValueError: Ground truth label not provided during training.
+  """
+  if is_training and label is None:
+    raise ValueError('During training, label must be provided.')
+  if model_variant is None:
+    tf.logging.warning('Default mean-subtraction is performed. Please specify '
+                       'a model_variant. See feature_extractor.network_map for '
+                       'supported model variants.')
+  # Keep reference to original image.
+  original_image = image
+  processed_image = tf.cast(image, tf.float32)
+  if label is not None:
+    label = tf.cast(label, tf.int32)
+  # Resize image and label to the desired range.
+  if min_resize_value is not None or max_resize_value is not None:
+    [processed_image, label] = (
+        preprocess_utils.resize_to_range(
+            image=processed_image,
+            label=label,
+            min_size=min_resize_value,
+            max_size=max_resize_value,
+            factor=resize_factor,
+            align_corners=True))
+    # The `original_image` becomes the resized image.
+    original_image = tf.identity(processed_image)
+  # Data augmentation by randomly scaling the inputs.
+  scale = get_random_scale(
+      min_scale_factor, max_scale_factor, scale_factor_step_size)
+  processed_image, label = randomly_scale_image_and_label(
+      processed_image, label, scale)
+  processed_image.set_shape([None, None, 3])
+  if crop_height is not None and crop_width is not None:
+    # Pad image and label to have dimensions >= [crop_height, crop_width].
+    image_shape = tf.shape(processed_image)
+    image_height = image_shape[0]
+    image_width = image_shape[1]
+    target_height = image_height + tf.maximum(crop_height - image_height, 0)
+    target_width = image_width + tf.maximum(crop_width - image_width, 0)
+    # Pad image with mean pixel value.
+    mean_pixel = tf.reshape(
+        feature_extractor.mean_pixel(model_variant), [1, 1, 3])
+    processed_image = preprocess_utils.pad_to_bounding_box(
+        processed_image, 0, 0, target_height, target_width, mean_pixel)
+    if label is not None:
+      label = preprocess_utils.pad_to_bounding_box(
+          label, 0, 0, target_height, target_width, ignore_label)
+    # Randomly crop the image and label.
+    if is_training and label is not None:
+      processed_image, label = preprocess_utils.random_crop(
+          [processed_image, label], crop_height, crop_width)
+    processed_image.set_shape([crop_height, crop_width, 3])
+    if label is not None:
+      label.set_shape([crop_height, crop_width, 1])
+  if is_training:
+    # Randomly left-right flip the image and label.
+    processed_image, label, _ = preprocess_utils.flip_dim(
+        [processed_image, label], _PROB_OF_FLIP, dim=1)
+  return original_image, processed_image, label
+def preprocess_images_and_labels_consistently(images,
+                                              labels,
+                                              crop_height,
+                                              crop_width,
+                                              min_resize_value=None,
+                                              max_resize_value=None,
+                                              resize_factor=None,
+                                              min_scale_factor=1.,
+                                              max_scale_factor=1.,
+                                              scale_factor_step_size=0,
+                                              ignore_label=255,
+                                              is_training=True,
+                                              model_variant=None):
+  """Preprocesses images and labels in a consistent way.
+  Similar to preprocess_image_and_label, but works on a list of images
+  and a list of labels and uses the same crop coordinates and either flips
+  all images and labels or none of them.
+  Args:
+    images: List of input images.
+    labels: List of ground truth annotation labels.
+    crop_height: The height value used to crop the image and label.
+    crop_width: The width value used to crop the image and label.
+    min_resize_value: Desired size of the smaller image side.
+    max_resize_value: Maximum allowed size of the larger image side.
+    resize_factor: Resized dimensions are a multiple of factor plus one.
+    min_scale_factor: Minimum scale factor value.
+    max_scale_factor: Maximum scale factor value.
+    scale_factor_step_size: The step size from min scale factor to max scale
+      factor. The input is randomly scaled based on the value of
+      (min_scale_factor, max_scale_factor, scale_factor_step_size).
+    ignore_label: The label value which will be ignored for training and
+      evaluation.
+    is_training: If the preprocessing is used for training or not.
+    model_variant: Model variant (string) for choosing how to mean-subtract the
+      images. See feature_extractor.network_map for supported model variants.
+  Returns:
+    original_images: Original images (could be resized).
+    processed_images: Preprocessed images.
+    labels: Preprocessed ground truth segmentation labels.
+  Raises:
+    ValueError: Ground truth label not provided during training.
+  """
+  if is_training and labels is None:
+    raise ValueError('During training, labels must be provided.')
+  if model_variant is None:
+    tf.logging.warning('Default mean-subtraction is performed. Please specify '
+                       'a model_variant. See feature_extractor.network_map for '
+                       'supported model variants.')
+  if labels is not None:
+    assert len(images) == len(labels)
+  num_imgs = len(images)
+  # Keep reference to original images.
+  original_images = images
+  processed_images = [tf.cast(image, tf.float32) for image in images]
+  if labels is not None:
+    labels = [tf.cast(label, tf.int32) for label in labels]
+  # Resize images and labels to the desired range.
+  if min_resize_value is not None or max_resize_value is not None:
+    processed_images, labels = zip(*[
+        preprocess_utils.resize_to_range(
+            image=processed_image,
+            label=label,
+            min_size=min_resize_value,
+            max_size=max_resize_value,
+            factor=resize_factor,
+            align_corners=True) for processed_image, label
+        in zip(processed_images, labels)])
+    # The `original_images` becomes the resized images.
+    original_images = [tf.identity(processed_image)
+                       for processed_image in processed_images]
+  # Data augmentation by randomly scaling the inputs.
+  scale = get_random_scale(
+      min_scale_factor, max_scale_factor, scale_factor_step_size)
+  processed_images, labels = zip(
+      *[randomly_scale_image_and_label(processed_image, label, scale)
+        for processed_image, label in zip(processed_images, labels)])
+  for processed_image in processed_images:
+    processed_image.set_shape([None, None, 3])
+  if crop_height is not None and crop_width is not None:
+    # Pad image and label to have dimensions >= [crop_height, crop_width].
+    image_shape = tf.shape(processed_images[0])
+    image_height = image_shape[0]
+    image_width = image_shape[1]
+    target_height = image_height + tf.maximum(crop_height - image_height, 0)
+    target_width = image_width + tf.maximum(crop_width - image_width, 0)
+    # Pad image with mean pixel value.
+    mean_pixel = tf.reshape(
+        feature_extractor.mean_pixel(model_variant), [1, 1, 3])
+    processed_images = [preprocess_utils.pad_to_bounding_box(
+        processed_image, 0, 0, target_height, target_width, mean_pixel)
+                        for processed_image in processed_images]
+    if labels is not None:
+      labels = [preprocess_utils.pad_to_bounding_box(
+          label, 0, 0, target_height, target_width, ignore_label)
+                for label in labels]
+    # Randomly crop the images and labels.
+    if is_training and labels is not None:
+      cropped = preprocess_utils.random_crop(
+          processed_images + labels, crop_height, crop_width)
+      assert len(cropped) == 2 * num_imgs
+      processed_images = cropped[:num_imgs]
+      labels = cropped[num_imgs:]
+    for processed_image in processed_images:
+      processed_image.set_shape([crop_height, crop_width, 3])
+    if labels is not None:
+      for label in labels:
+        label.set_shape([crop_height, crop_width, 1])
+  if is_training:
+    # Randomly left-right flip all images and labels together.
+    res = preprocess_utils.flip_dim(
+        list(processed_images + labels), _PROB_OF_FLIP, dim=1)
+    maybe_flipped = res[:-1]
+    assert len(maybe_flipped) == 2 * num_imgs
+    processed_images = maybe_flipped[:num_imgs]
+    labels = maybe_flipped[num_imgs:]
+  return original_images, processed_images, labels
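The docstring of `preprocess_images_and_labels_consistently` states that all images and labels share the same crop coordinates and a single flip decision. The following is a minimal pure-Python sketch of that idea, not the TensorFlow implementation in the diff. The name `consistent_crop_and_flip` and its arguments are hypothetical, and frames are plain nested lists rather than tensors.

```python
import random


def consistent_crop_and_flip(frames, crop_height, crop_width, flip_prob=0.5,
                             rng=random):
  """Applies one shared random crop and one shared flip decision to all frames.

  Args:
    frames: List of 2-D lists (rows of values), all with the same shape.
    crop_height: Height of the crop window.
    crop_width: Width of the crop window.
    flip_prob: Probability of flipping every frame left-right.
    rng: Object providing randint() and random(), e.g. random.Random(seed).

  Returns:
    A tuple (results, (top, left, do_flip)).
  """
  height, width = len(frames[0]), len(frames[0][0])
  if any(len(f) != height or len(f[0]) != width for f in frames):
    raise ValueError('All frames must share the same height and width.')
  if crop_height > height or crop_width > width:
    raise ValueError('Crop must not exceed the frame size; pad first.')
  # Draw the crop offsets and the flip decision once, then reuse them.
  top = rng.randint(0, height - crop_height)
  left = rng.randint(0, width - crop_width)
  do_flip = rng.random() < flip_prob
  results = []
  for frame in frames:
    cropped = [row[left:left + crop_width]
               for row in frame[top:top + crop_height]]
    if do_flip:
      # Left-right flip: reverse every row.
      cropped = [row[::-1] for row in cropped]
    results.append(cropped)
  return results, (top, left, do_flip)


# Usage: an "image" and its "label" stay pixel-aligned after augmentation.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
label = [[value % 2 for value in row] for row in image]
(aug_image, aug_label), params = consistent_crop_and_flip(
    [image, label], crop_height=2, crop_width=3, rng=random.Random(0))
```

Drawing the random parameters once and applying them in a loop is what keeps each label aligned with its image and keeps the frames of one video aligned with each other.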
--- a/research/feelvos/model.py
+++ b/research/feelvos/model.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+r"""Provides DeepLab model definition and helper functions.
+DeepLab is a deep learning system for semantic image segmentation with
+the following features:
+(1) Atrous convolution to explicitly control the resolution at which
+feature responses are computed within Deep Convolutional Neural Networks.
+(2) Atrous spatial pyramid pooling (ASPP) to robustly segment objects at
+multiple scales with filters at multiple sampling rates and effective
+fields-of-views.
+(3) ASPP module augmented with image-level feature and batch normalization.
+(4) A simple yet effective decoder module to recover the object boundaries.
+See the following papers for more details:
+"Encoder-Decoder with Atrous Separable Convolution for Semantic Image
+Segmentation"
+Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam.
+(https://arxiv.org/abs/1802.02611)
+"Rethinking Atrous Convolution for Semantic Image Segmentation,"
+Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam
+(https://arxiv.org/abs/1706.05587)
+"DeepLab: Semantic Image Segmentation with Deep Convolutional Nets,
+Atrous Convolution, and Fully Connected CRFs",
+Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy,
+Alan L Yuille (* equal contribution)
+(https://arxiv.org/abs/1606.00915)
+"Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected
+CRFs"
+Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy,
+Alan L. Yuille (* equal contribution)
+(https://arxiv.org/abs/1412.7062)
+"""
+import collections
+import tensorflow as tf
+from deeplab import model
+from feelvos import common
+from feelvos.utils import embedding_utils
+from feelvos.utils import train_utils
+slim = tf.contrib.slim
+get_branch_logits = model.get_branch_logits
+get_extra_layer_scopes = model.get_extra_layer_scopes
+multi_scale_logits_v2 = model.multi_scale_logits
+refine_by_decoder = model.refine_by_decoder
+scale_dimension = model.scale_dimension
+split_separable_conv2d = model.split_separable_conv2d
+MERGED_LOGITS_SCOPE = model.MERGED_LOGITS_SCOPE
+IMAGE_POOLING_SCOPE = model.IMAGE_POOLING_SCOPE
+ASPP_SCOPE = model.ASPP_SCOPE
+CONCAT_PROJECTION_SCOPE = model.CONCAT_PROJECTION_SCOPE
+def predict_labels(images,
+                   model_options,
+                   image_pyramid=None,
+                   reference_labels=None,
+                   k_nearest_neighbors=1,
+                   embedding_dimension=None,
+                   use_softmax_feedback=False,
+                   initial_softmax_feedback=None,
+                   embedding_seg_feature_dimension=256,
+                   embedding_seg_n_layers=4,
+                   embedding_seg_kernel_size=7,
+                   embedding_seg_atrous_rates=None,
+                   also_return_softmax_probabilities=False,
+                   num_frames_per_video=None,
+                   normalize_nearest_neighbor_distances=False,
+                   also_attend_to_previous_frame=False,
+                   use_local_previous_frame_attention=False,
+                   previous_frame_attention_window_size=9,
+                   use_first_frame_matching=True,
+                   also_return_embeddings=False,
+                   ref_embeddings=None):
+  """Predicts segmentation labels.
+  Args:
+    images: A tensor of size [batch, height, width, channels].
+    model_options: An InternalModelOptions instance to configure models.
+    image_pyramid: Input image scales for multi-scale feature extraction.
+    reference_labels: A tensor of size [batch, height, width, 1] holding the
+      ground truth labels used to perform a nearest neighbor query.
+    k_nearest_neighbors: Integer, the number of neighbors to use for nearest
+      neighbor queries.
+    embedding_dimension: Integer, the dimension used for the learned embedding.
+    use_softmax_feedback: Boolean, whether to give the softmax predictions of
+      the last frame as additional input to the segmentation head.
+    initial_softmax_feedback: Float32 tensor, or None. Can be used to
+      initialize the softmax predictions used for the feedback loop.
+      Typically only useful for inference. Only has an effect if
+      use_softmax_feedback is True.
+    embedding_seg_feature_dimension: Integer, the dimensionality used in the
+      segmentation head layers.
+    embedding_seg_n_layers: Integer, the number of layers in the segmentation
+      head.
+    embedding_seg_kernel_size: Integer, the kernel size used in the
+      segmentation head.
+    embedding_seg_atrous_rates: List of integers of length
+      embedding_seg_n_layers, the atrous rates to use for the segmentation head.
+    also_return_softmax_probabilities: Boolean, if true, additionally return
+      the softmax probabilities as second return value.
+    num_frames_per_video: Integer, the number of frames per video.
+    normalize_nearest_neighbor_distances: Boolean, whether to normalize the
+      nearest neighbor distances to [0,1] using sigmoid, scale and shift.
+    also_attend_to_previous_frame: Boolean, whether to also use nearest
+      neighbor attention with respect to the previous frame.
+    use_local_previous_frame_attention: Boolean, whether to restrict the
+      previous frame attention to a local search window.
+      Only has an effect if also_attend_to_previous_frame is True.
+    previous_frame_attention_window_size: Integer, the window size used for
+      local previous frame attention, if use_local_previous_frame_attention
+      is True.
+    use_first_frame_matching: Boolean, whether to extract features by matching
+      to the reference frame. This should always be true except for ablation
+      experiments.
+    also_return_embeddings: Boolean, whether to return the embeddings as well.
+    ref_embeddings: Tuple of
+      (first_frame_embeddings, previous_frame_embeddings),
+      each of shape [batch, height, width, embedding_dimension], or None.
+  Returns:
+    A dictionary with keys specifying the output_type (e.g., semantic
+      prediction) and values storing Tensors representing predictions (argmax
+      over channels). Each prediction has size [batch, height, width].
+    If also_return_softmax_probabilities is True, the second return value is
+      the softmax probabilities.
+    If also_return_embeddings is True, it will also return an embeddings
+      tensor of shape [batch, height, width, embedding_dimension].
+  Raises:
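+    ValueError: If classification_loss is not one of softmax,
+      softmax_with_attention, or triplet, or if a required argument
+      (reference_labels, embedding_dimension) is missing for the chosen loss.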
+  """
+  if (model_options.classification_loss == 'triplet' and
+      reference_labels is None):
+    raise ValueError('Need reference_labels for triplet loss')
+  if model_options.classification_loss == 'softmax_with_attention':
+    if embedding_dimension is None:
+      raise ValueError('Need embedding_dimension for softmax_with_attention '
+                       'loss')
+    if reference_labels is None:
+      raise ValueError('Need reference_labels for softmax_with_attention loss')
+    res = (
+        multi_scale_logits_with_nearest_neighbor_matching(
+            images,
+            model_options=model_options,
+            image_pyramid=image_pyramid,
+            is_training=False,
+            reference_labels=reference_labels,
+            clone_batch_size=1,
+            num_frames_per_video=num_frames_per_video,
+            embedding_dimension=embedding_dimension,
+            max_neighbors_per_object=0,
+            k_nearest_neighbors=k_nearest_neighbors,
+            use_softmax_feedback=use_softmax_feedback,
+            initial_softmax_feedback=initial_softmax_feedback,
+            embedding_seg_feature_dimension=embedding_seg_feature_dimension,
+            embedding_seg_n_layers=embedding_seg_n_layers,
+            embedding_seg_kernel_size=embedding_seg_kernel_size,
+            embedding_seg_atrous_rates=embedding_seg_atrous_rates,
+            normalize_nearest_neighbor_distances=
+            normalize_nearest_neighbor_distances,
+            also_attend_to_previous_frame=also_attend_to_previous_frame,
+            use_local_previous_frame_attention=
+            use_local_previous_frame_attention,
+            previous_frame_attention_window_size=
+            previous_frame_attention_window_size,
+            use_first_frame_matching=use_first_frame_matching,
+            also_return_embeddings=also_return_embeddings,
+            ref_embeddings=ref_embeddings
+        ))
+    if also_return_embeddings:
+      outputs_to_scales_to_logits, embeddings = res
+    else:
+      outputs_to_scales_to_logits = res
+      embeddings = None
+  else:
+    outputs_to_scales_to_logits = multi_scale_logits_v2(
+        images,
+        model_options=model_options,
+        image_pyramid=image_pyramid,
+        is_training=False,
+        fine_tune_batch_norm=False)
+    # No embeddings are computed on this path (triplet sets them below).
+    embeddings = None
+  predictions = {}
+  for output in sorted(outputs_to_scales_to_logits):
+    scales_to_logits = outputs_to_scales_to_logits[output]
+    original_logits = scales_to_logits[MERGED_LOGITS_SCOPE]
+    if isinstance(original_logits, list):
+      assert len(original_logits) == 1
+      original_logits = original_logits[0]
+    logits = tf.image.resize_bilinear(original_logits, tf.shape(images)[1:3],
+                                      align_corners=True)
+    if model_options.classification_loss in ('softmax',
+                                             'softmax_with_attention'):
+      predictions[output] = tf.argmax(logits, 3)
+    elif model_options.classification_loss == 'triplet':
+      # To keep this fast, we do the nearest neighbor assignment at the
+      # resolution at which the embedding is extracted and scale the result up
+      # afterwards.
+      embeddings = original_logits
+      reference_labels_logits_size = tf.squeeze(
+          tf.image.resize_nearest_neighbor(
+              reference_labels[tf.newaxis],
+              train_utils.resolve_shape(embeddings)[1:3],
+              align_corners=True), axis=0)
+      nn_labels = embedding_utils.assign_labels_by_nearest_neighbors(
+          embeddings[0], embeddings[1:], reference_labels_logits_size,
+          k_nearest_neighbors)
+      predictions[common.OUTPUT_TYPE] = tf.image.resize_nearest_neighbor(
+          nn_labels, tf.shape(images)[1:3], align_corners=True)
+    else:
+      raise ValueError(
+          'Only support softmax, triplet, or softmax_with_attention for '
+          'classification_loss.')
+  if also_return_embeddings:
+    assert also_return_softmax_probabilities
+    return predictions, tf.nn.softmax(original_logits, axis=-1), embeddings
+  elif also_return_softmax_probabilities:
+    return predictions, tf.nn.softmax(original_logits, axis=-1)
+  else:
+    return predictions
+def multi_scale_logits_with_nearest_neighbor_matching(
+    images,
+    model_options,
+    image_pyramid,
+    clone_batch_size,
+    reference_labels,
+    num_frames_per_video,
+    embedding_dimension,
+    max_neighbors_per_object,
+    weight_decay=0.0001,
+    is_training=False,
+    fine_tune_batch_norm=False,
+    k_nearest_neighbors=1,
+    use_softmax_feedback=False,
+    initial_softmax_feedback=None,
+    embedding_seg_feature_dimension=256,
+    embedding_seg_n_layers=4,
+    embedding_seg_kernel_size=7,
+    embedding_seg_atrous_rates=None,
+    normalize_nearest_neighbor_distances=False,
+    also_attend_to_previous_frame=False,
+    damage_initial_previous_frame_mask=False,
+    use_local_previous_frame_attention=False,
+    previous_frame_attention_window_size=9,
+    use_first_frame_matching=True,
+    also_return_embeddings=False,
+    ref_embeddings=None):
+  """Gets the logits for multi-scale inputs using nearest neighbor attention.
+  Adjusted version of multi_scale_logits_v2 to support nearest neighbor
+  attention and a variable number of classes for each element of the batch.
+  The returned logits are all downsampled (due to max-pooling layers)
+  for both training and evaluation.
+  Args:
+    images: A tensor of size [batch, height, width, channels].
+    model_options: A ModelOptions instance to configure models.
+    image_pyramid: Input image scales for multi-scale feature extraction.
+    clone_batch_size: Integer, the number of videos in a batch.
+    reference_labels: The segmentation labels of the reference frame on which
+      attention is applied.
+    num_frames_per_video: Integer, the number of frames per video.
+    embedding_dimension: Integer, the dimension of the embedding.
+    max_neighbors_per_object: Integer, the maximum number of candidates
+      for the nearest neighbor query per object after subsampling.
+      Can be 0 for no subsampling.
+    weight_decay: The weight decay for model variables.
+    is_training: Is training or not.
+    fine_tune_batch_norm: Fine-tune the batch norm parameters or not.
+    k_nearest_neighbors: Integer, the number of nearest neighbors to use.
+    use_softmax_feedback: Boolean, whether to give the softmax predictions of
+      the last frame as additional input to the segmentation head.
+    initial_softmax_feedback: List of Float32 tensors, or None.
+      Can be used to initialize the softmax predictions used for the feedback
+      loop. Only has an effect if use_softmax_feedback is True.
+    embedding_seg_feature_dimension: Integer, the dimensionality used in the
+      segmentation head layers.
+    embedding_seg_n_layers: Integer, the number of layers in the segmentation
+      head.
+    embedding_seg_kernel_size: Integer, the kernel size used in the
+      segmentation head.
+    embedding_seg_atrous_rates: List of integers of length
+      embedding_seg_n_layers, the atrous rates to use for the segmentation head.
+    normalize_nearest_neighbor_distances: Boolean, whether to normalize the
+      nearest neighbor distances to [0,1] using sigmoid, scale and shift.
+    also_attend_to_previous_frame: Boolean, whether to also use nearest
+      neighbor attention with respect to the previous frame.
+    damage_initial_previous_frame_mask: Boolean, whether to artificially damage
+      the initial previous frame mask. Only has an effect if
+      also_attend_to_previous_frame is True.
+    use_local_previous_frame_attention: Boolean, whether to restrict the
+      previous frame attention to a local search window.
+      Only has an effect if also_attend_to_previous_frame is True.
+    previous_frame_attention_window_size: Integer, the window size used for
+      local previous frame attention, if use_local_previous_frame_attention
+      is True.
+    use_first_frame_matching: Boolean, whether to extract features by matching
+      to the reference frame. This should always be true except for ablation
+      experiments.
+    also_return_embeddings: Boolean, whether to return the embeddings as well.
+    ref_embeddings: Tuple of
+      (first_frame_embeddings, previous_frame_embeddings),
+      each of shape [batch, height, width, embedding_dimension], or None.
+  Returns:
+    outputs_to_scales_to_logits: A map of maps from output_type (e.g.,
+      semantic prediction) to a dictionary of multi-scale logits names to
+      logits. For each output_type, the dictionary has keys which
+      correspond to the scales and values which correspond to the logits.
+      For example, if `scales` equals [1.0, 1.5], then the keys would
+      include 'merged_logits', 'logits_1.00' and 'logits_1.50'.
+    If also_return_embeddings is True, it will also return an embeddings
+      tensor of shape [batch, height, width, embedding_dimension].
+  Raises:
+    ValueError: If model_options doesn't specify crop_size and its
+      add_image_level_feature = True, since add_image_level_feature requires
+      crop_size information.
+  """
+  # Setup default values.
+  if not image_pyramid:
+    image_pyramid = [1.0]
+  crop_height = (
+      model_options.crop_size[0]
+      if model_options.crop_size else tf.shape(images)[1])
+  crop_width = (
+      model_options.crop_size[1]
+      if model_options.crop_size else tf.shape(images)[2])
+  # Compute the height, width for the output logits.
+  logits_output_stride = (
+      model_options.decoder_output_stride or model_options.output_stride)
+  logits_height = scale_dimension(
+      crop_height,
+      max(1.0, max(image_pyramid)) / logits_output_stride)
+  logits_width = scale_dimension(
+      crop_width,
+      max(1.0, max(image_pyramid)) / logits_output_stride)
+  # Compute the logits for each scale in the image pyramid.
+  outputs_to_scales_to_logits = {
+      k: {}
+      for k in model_options.outputs_to_num_classes
+  }
+  for image_scale in image_pyramid:
+    if image_scale != 1.0:
+      scaled_height = scale_dimension(crop_height, image_scale)
+      scaled_width = scale_dimension(crop_width, image_scale)
+      scaled_crop_size = [scaled_height, scaled_width]
+      scaled_images = tf.image.resize_bilinear(
+          images, scaled_crop_size, align_corners=True)
+      scaled_reference_labels = tf.image.resize_nearest_neighbor(
+          reference_labels, scaled_crop_size, align_corners=True
+      )
+      if model_options.crop_size is None:
+        scaled_crop_size = None
+      if model_options.crop_size:
+        scaled_images.set_shape([None, scaled_height, scaled_width, 3])
+    else:
+      scaled_crop_size = model_options.crop_size
+      scaled_images = images
+      scaled_reference_labels = reference_labels
+    updated_options = model_options._replace(crop_size=scaled_crop_size)
+    res = embedding_utils.get_logits_with_matching(
+        scaled_images,
+        updated_options,
+        weight_decay=weight_decay,
+        reuse=tf.AUTO_REUSE,
+        is_training=is_training,
+        fine_tune_batch_norm=fine_tune_batch_norm,
+        reference_labels=scaled_reference_labels,
+        batch_size=clone_batch_size,
+        num_frames_per_video=num_frames_per_video,
+        embedding_dimension=embedding_dimension,
+        max_neighbors_per_object=max_neighbors_per_object,
+        k_nearest_neighbors=k_nearest_neighbors,
+        use_softmax_feedback=use_softmax_feedback,
+        initial_softmax_feedback=initial_softmax_feedback,
+        embedding_seg_feature_dimension=embedding_seg_feature_dimension,
+        embedding_seg_n_layers=embedding_seg_n_layers,
+        embedding_seg_kernel_size=embedding_seg_kernel_size,
+        embedding_seg_atrous_rates=embedding_seg_atrous_rates,
+        normalize_nearest_neighbor_distances=
+        normalize_nearest_neighbor_distances,
+        also_attend_to_previous_frame=also_attend_to_previous_frame,
+        damage_initial_previous_frame_mask=damage_initial_previous_frame_mask,
+        use_local_previous_frame_attention=use_local_previous_frame_attention,
+        previous_frame_attention_window_size=
+        previous_frame_attention_window_size,
+        use_first_frame_matching=use_first_frame_matching,
+        also_return_embeddings=also_return_embeddings,
+        ref_embeddings=ref_embeddings
+    )
+    if also_return_embeddings:
+      outputs_to_logits, embeddings = res
+    else:
+      outputs_to_logits = res
+      embeddings = None
+    # Resize the logits to have the same dimension before merging.
+    for output in sorted(outputs_to_logits):
+      if isinstance(outputs_to_logits[output], collections.Sequence):
+        outputs_to_logits[output] = [tf.image.resize_bilinear(
+            x, [logits_height, logits_width], align_corners=True)
+                                     for x in outputs_to_logits[output]]
+      else:
+        outputs_to_logits[output] = tf.image.resize_bilinear(
+            outputs_to_logits[output], [logits_height, logits_width],
+            align_corners=True)
+    # Return when only one input scale.
+    if len(image_pyramid) == 1:
+      for output in sorted(model_options.outputs_to_num_classes):
+        outputs_to_scales_to_logits[output][
+            MERGED_LOGITS_SCOPE] = outputs_to_logits[output]
+      if also_return_embeddings:
+        return outputs_to_scales_to_logits, embeddings
+      else:
+        return outputs_to_scales_to_logits
+    # Save logits to the output map.
+    for output in sorted(model_options.outputs_to_num_classes):
+      outputs_to_scales_to_logits[output][
+          'logits_%.2f' % image_scale] = outputs_to_logits[output]
+  # Merge the logits from all the multi-scale inputs.
+  for output in sorted(model_options.outputs_to_num_classes):
+    # Concatenate the multi-scale logits for each output type.
+    all_logits = [
+        [tf.expand_dims(l, axis=4)]
+        for logits in outputs_to_scales_to_logits[output].values()
+        for l in logits
+    ]
+    transposed = map(list, zip(*all_logits))
+    all_logits = [tf.concat(t, 4) for t in transposed]
+    merge_fn = (
+        tf.reduce_max
+        if model_options.merge_method == 'max' else tf.reduce_mean)
+    outputs_to_scales_to_logits[output][MERGED_LOGITS_SCOPE] = [merge_fn(
+        l, axis=4) for l in all_logits]
+  if also_return_embeddings:
+    return outputs_to_scales_to_logits, embeddings
+  else:
+    return outputs_to_scales_to_logits
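`multi_scale_logits_with_nearest_neighbor_matching` first derives the logit resolution from the crop size and output stride, then merges the per-scale logits with either a max or a mean, depending on `merge_method`. The sketch below illustrates both steps in plain Python. The body of `scale_dimension` is not shown in this diff (it is imported from `deeplab.model`), so the `(dim - 1) * scale + 1` rule used here is an assumption chosen to be consistent with `align_corners=True` resizing. `merge_logits` is a hypothetical helper that works on flat lists instead of tensors.

```python
def scale_dimension(dim, scale):
  """Assumed rule for static sizes: int((dim - 1) * scale + 1)."""
  return int((float(dim) - 1.0) * scale + 1.0)


def merge_logits(logits_per_scale, merge_method='max'):
  """Merges equally sized flat logit lists element-wise across scales.

  Mirrors the choice in the diff: 'max' selects the maximum, any other
  value falls back to the mean.
  """
  if not logits_per_scale:
    raise ValueError('Need logits for at least one scale.')
  length = len(logits_per_scale[0])
  if any(len(logits) != length for logits in logits_per_scale):
    raise ValueError('Logits must be resized to a common size before merging.')
  # Group the values that belong to the same position across all scales.
  columns = list(zip(*logits_per_scale))
  if merge_method == 'max':
    return [max(column) for column in columns]
  return [sum(column) / float(len(column)) for column in columns]


# Usage: a 465-pixel crop with a single scale of 1.0 and output stride 8.
image_pyramid = [1.0]
logits_height = scale_dimension(465, max(1.0, max(image_pyramid)) / 8)

# Two scales, already resized to the same (flattened) size.
merged_max = merge_logits([[1.0, 4.0], [3.0, 2.0]], merge_method='max')
merged_mean = merge_logits([[1.0, 4.0], [3.0, 2.0]], merge_method='mean')
```

Under the assumed rule, crop sizes of the form `k * stride + 1` (such as the default 465 with stride 8) map to whole-number logit sizes, which is presumably why such crop sizes are used.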
--- a/research/feelvos/train.py
+++ b/research/feelvos/train.py
+# Copyright 2018 The TensorFlow Authors All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Training script for the FEELVOS model.
+See model.py for more details and usage.
+"""
+import six
+import tensorflow as tf
+from feelvos import common
+from feelvos import model
+from feelvos.datasets import video_dataset
+from feelvos.utils import embedding_utils
+from feelvos.utils import train_utils
+from feelvos.utils import video_input_generator
+from deployment import model_deploy
+slim = tf.contrib.slim
+prefetch_queue = slim.prefetch_queue
+flags = tf.app.flags
+FLAGS = flags.FLAGS
+# Settings for multi-GPUs/multi-replicas training.
+flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy.')
+flags.DEFINE_boolean('clone_on_cpu', False, 'Use CPUs to deploy clones.')
+flags.DEFINE_integer('num_replicas', 1, 'Number of worker replicas.')
+flags.DEFINE_integer('startup_delay_steps', 15,
+                     'Number of training steps between replicas startup.')
+flags.DEFINE_integer('num_ps_tasks', 0,
+                     'The number of parameter servers. If the value is 0, then '
+                     'the parameters are handled locally by the worker.')
+flags.DEFINE_string('master', '', 'BNS name of the tensorflow server')
+flags.DEFINE_integer('task', 0, 'The task ID.')
+# Settings for logging.
+flags.DEFINE_string('train_logdir', None,
+                    'Where the checkpoint and logs are stored.')
+flags.DEFINE_integer('log_steps', 10,
+                     'Display logging information at every log_steps.')
+flags.DEFINE_integer('save_interval_secs', 1200,
+                     'How often, in seconds, we save the model to disk.')
+flags.DEFINE_integer('save_summaries_secs', 600,
+                     'How often, in seconds, we compute the summaries.')
+# Settings for training strategy.
+flags.DEFINE_enum('learning_policy', 'poly', ['poly', 'step'],
+                  'Learning rate policy for training.')
+flags.DEFINE_float('base_learning_rate', 0.0007,
+                   'The base learning rate for model training.')
+flags.DEFINE_float('learning_rate_decay_factor', 0.1,
+                   'The rate to decay the base learning rate.')
+flags.DEFINE_integer('learning_rate_decay_step', 2000,
+                     'Decay the base learning rate at a fixed step.')
+flags.DEFINE_float('learning_power', 0.9,
+                   'The power value used in the poly learning policy.')
+flags.DEFINE_integer('training_number_of_steps', 200000,
+                     'The number of steps used for training')
+flags.DEFINE_float('momentum', 0.9, 'The momentum value to use')
+flags.DEFINE_integer('train_batch_size', 6,
+                     'The number of images in each batch during training.')
+flags.DEFINE_integer('train_num_frames_per_video', 3,
+                     'The number of frames used per video during training')
+flags.DEFINE_float('weight_decay', 0.00004,
+                   'The value of the weight decay for training.')
+flags.DEFINE_multi_integer('train_crop_size', [465, 465],
+                           'Image crop size [height, width] during training.')
+flags.DEFINE_float('last_layer_gradient_multiplier', 1.0,
+                   'The gradient multiplier for last layers, which is used to '
+                   'boost the gradient of last layers if the value > 1.')
+flags.DEFINE_boolean('upsample_logits', True,
+                     'Upsample logits during training.')
+flags.DEFINE_integer('batch_capacity_factor', 16, 'Batch capacity factor.')
+flags.DEFINE_integer('num_readers', 1, 'Number of readers for data provider.')
+flags.DEFINE_integer('batch_num_threads', 1, 'Batch number of threads.')
+flags.DEFINE_integer('prefetch_queue_capacity_factor', 32,
+                     'Prefetch queue capacity factor.')
+flags.DEFINE_integer('prefetch_queue_num_threads', 1,
+                     'Prefetch queue number of threads.')
+flags.DEFINE_integer('train_max_neighbors_per_object', 1024,
+                     'The maximum number of candidates for the nearest '
+                     'neighbor query per object after subsampling')
+# Settings for fine-tuning the network.
+flags.DEFINE_string('tf_initial_checkpoint', None,
+                    'The initial checkpoint in tensorflow format.')
+flags.DEFINE_boolean('initialize_last_layer', False,
+                     'Initialize the last layer.')
+flags.DEFINE_boolean('last_layers_contain_logits_only', False,
+                     'Only consider logits as last layers or not.')
+flags.DEFINE_integer('slow_start_step', 0,
+                     'Training model with small learning rate for few steps.')
+flags.DEFINE_float('slow_start_learning_rate', 1e-4,
+                   'Learning rate employed during slow start.')
+flags.DEFINE_boolean('fine_tune_batch_norm', False,
+                     'Fine tune the batch norm parameters or not.')
+flags.DEFINE_float('min_scale_factor', 1.,
+                   'Minimum scale factor for data augmentation.')
+flags.DEFINE_float('max_scale_factor', 1.3,
+                   'Maximum scale factor for data augmentation.')
+flags.DEFINE_float('scale_factor_step_size', 0,
+                   'Scale factor step size for data augmentation.')
+flags.DEFINE_multi_integer('atrous_rates', None,
+                           'Atrous rates for atrous spatial pyramid pooling.')
+flags.DEFINE_integer('output_stride', 8,
+                     'The ratio of input to output spatial resolution.')
+flags.DEFINE_boolean('sample_only_first_frame_for_finetuning', False,
+                     'Whether to only sample the first frame during '
+                     'fine-tuning. This should be False when using lucid data, '
+                     'but True when fine-tuning on the first frame only. Only '
+                     'has an effect if first_frame_finetuning is True.')
+flags.DEFINE_multi_integer('first_frame_finetuning', [0],
+                           'Whether each dataset is used for first frame '
+                           'fine-tuning. Give one value per dataset, or a '
+                           'single value that is applied to all datasets.')
+# Dataset settings.
+flags.DEFINE_multi_string('dataset', [], 'Name of the segmentation datasets.')
+flags.DEFINE_multi_float('dataset_sampling_probabilities', [],
+                         'A list of probabilities to sample each of the '
+                         'datasets.')
+flags.DEFINE_string('train_split', 'train',
+                    'Which split of the dataset to be used for training')
+flags.DEFINE_multi_string('dataset_dir', [], 'Where the datasets reside.')
+flags.DEFINE_multi_integer('three_frame_dataset', [0],
+                           'Whether the dataset has exactly three frames per '
+                           'video of which the first is to be used as reference'
+                           ' and the two others are consecutive frames to be '
+                           'used as query frames. '
+                           'Set true for pascal lucid data.')
+flags.DEFINE_boolean('damage_initial_previous_frame_mask', False,
+                     'Whether to artificially damage the initial previous '
+                     'frame mask. Only has an effect if '
+                     'also_attend_to_previous_frame is True.')
+flags.DEFINE_float('top_k_percent_pixels', 0.15, 'Float in [0.0, 1.0]. '
+                   'When its value < 1.0, only compute the loss for the top k '
+                   'percent pixels (e.g., the top 20% pixels). This is useful '
+                   'for hard pixel mining.')
+flags.DEFINE_integer('hard_example_mining_step', 100000,
+                     'The training step in which the hard example mining '
+                     'kicks off. Note that we gradually reduce the mining '
+                     'percent to the top_k_percent_pixels. For example, if '
+                     'hard_example_mining_step=100K and '
+                     'top_k_percent_pixels=0.25, then mining percent will '
+                     'gradually reduce from 100% to 25% until 100K steps '
+                     'after which we only mine top 25% pixels. Only has an '
+                     'effect if top_k_percent_pixels < 1.0')
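The help text for `top_k_percent_pixels` and `hard_example_mining_step` describes a schedule in which the mined fraction shrinks from 100% to the target value. The sketch below illustrates that schedule in plain Python. It assumes the reduction is linear in the training step (the help text only says "gradually"), and `mining_fraction` is a hypothetical helper, not a function from this commit.

```python
def mining_fraction(step, hard_example_mining_step, top_k_percent_pixels):
    """Returns the fraction of pixels kept for the loss at `step`.

    Assumption: the fraction is interpolated linearly from 1.0 at step 0
    down to `top_k_percent_pixels` at `hard_example_mining_step`, and stays
    at `top_k_percent_pixels` afterwards.
    """
    if hard_example_mining_step <= 0:
        # No ramp requested: mine the target fraction from the start.
        return top_k_percent_pixels
    ratio = min(1.0, step / float(hard_example_mining_step))
    return ratio * top_k_percent_pixels + (1.0 - ratio)


if __name__ == '__main__':
    for step in (0, 50000, 100000, 150000):
        print(step, mining_fraction(step, 100000, 0.25))
```

With the values from the help text (`hard_example_mining_step=100K`, `top_k_percent_pixels=0.25`), the fraction starts at 1.0, passes through 0.625 halfway, and is held at 0.25 from step 100K onward.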
+def _build_deeplab(inputs_queue_or_samples, outputs_to_num_classes,
+                   ignore_label):
+  """Builds a clone of DeepLab.
+  Args:
+    inputs_queue_or_samples: A prefetch queue for images and labels, or
+      directly a dict of the samples.
+    outputs_to_num_classes: A map from output type to the number of classes.
+      For example, for the task of semantic segmentation with 21 semantic
+      classes, we would have outputs_to_num_classes['semantic'] = 21.
+    ignore_label: Ignore label.
+  Returns:
+    A map of maps from output_type (e.g., semantic prediction) to a
+      dictionary of multi-scale logits names to logits. For each output_type,
+      the dictionary has keys which correspond to the scales and values which
+      correspond to the logits. For example, if `scales` equals [1.0, 1.5],
+      then the keys would include 'merged_logits', 'logits_1.00' and
+      'logits_1.50'.
+  Raises:
+    ValueError: If classification_loss is not softmax, softmax_with_attention,
+      or triplet.
+  """
+  if hasattr(inputs_queue_or_samples, 'dequeue'):
+    samples = inputs_queue_or_samples.dequeue()
+  else:
+    samples = inputs_queue_or_samples
+  train_crop_size = (None if 0 in FLAGS.train_crop_size else
+                     FLAGS.train_crop_size)
+  model_options = common.VideoModelOptions(
+      outputs_to_num_classes=outputs_to_num_classes,
+      crop_size=train_crop_size,
+      atrous_rates=FLAGS.atrous_rates,
+      output_stride=FLAGS.output_stride)
+  if model_options.classification_loss == 'softmax_with_attention':
+    clone_batch_size = FLAGS.train_batch_size // FLAGS.num_clones
+    # Create summaries of ground truth labels.
+    for n in range(clone_batch_size):
+      tf.summary.image(
+          'gt_label_%d' % n,
+          tf.cast(samples[common.LABEL][
+              n * FLAGS.train_num_frames_per_video:
+              (n + 1) * FLAGS.train_num_frames_per_video],
+                  tf.uint8) * 32, max_outputs=FLAGS.train_num_frames_per_video)
+    if common.PRECEDING_FRAME_LABEL in samples:
+      preceding_frame_label = samples[common.PRECEDING_FRAME_LABEL]
+      init_softmax = []
+      for n in range(clone_batch_size):
+        init_softmax_n = embedding_utils.create_initial_softmax_from_labels(
+            preceding_frame_label[n, tf.newaxis],
+            samples[common.LABEL][n * FLAGS.train_num_frames_per_video,
+                                  tf.newaxis],
+            FLAGS.decoder_output_stride,
+            reduce_labels=True)
+        init_softmax_n = tf.squeeze(init_softmax_n, axis=0)
+        init_softmax.append(init_softmax_n)
+        tf.summary.image('preceding_frame_label',
+                         tf.cast(preceding_frame_label[n, tf.newaxis],
+                                 tf.uint8) * 32)
+    else:
+      init_softmax = None
+    outputs_to_scales_to_logits = (
+        model.multi_scale_logits_with_nearest_neighbor_matching(
+            samples[common.IMAGE],
+            model_options=model_options,
+            image_pyramid=FLAGS.image_pyramid,
+            weight_decay=FLAGS.weight_decay,
+            is_training=True,
+            fine_tune_batch_norm=FLAGS.fine_tune_batch_norm,
+            reference_labels=samples[common.LABEL],
+            clone_batch_size=FLAGS.train_batch_size // FLAGS.num_clones,
+            num_frames_per_video=FLAGS.train_num_frames_per_video,
+            embedding_dimension=FLAGS.embedding_dimension,
+            max_neighbors_per_object=FLAGS.train_max_neighbors_per_object,
+            k_nearest_neighbors=FLAGS.k_nearest_neighbors,
+            use_softmax_feedback=FLAGS.use_softmax_feedback,
+            initial_softmax_feedback=init_softmax,
+            embedding_seg_feature_dimension=
+            FLAGS.embedding_seg_feature_dimension,
+            embedding_seg_n_layers=FLAGS.embedding_seg_n_layers,
+            embedding_seg_kernel_size=FLAGS.embedding_seg_kernel_size,
+            embedding_seg_atrous_rates=FLAGS.embedding_seg_atrous_rates,
+            normalize_nearest_neighbor_distances=
+            FLAGS.normalize_nearest_neighbor_distances,
+            also_attend_to_previous_frame=FLAGS.also_attend_to_previous_frame,
+            damage_initial_previous_frame_mask=
+            FLAGS.damage_initial_previous_frame_mask,
+            use_local_previous_frame_attention=
+            FLAGS.use_local_previous_frame_attention,
+            previous_frame_attention_window_size=
+            FLAGS.previous_frame_attention_window_size,
+            use_first_frame_matching=FLAGS.use_first_frame_matching
+        ))
+  else:
+    outputs_to_scales_to_logits = model.multi_scale_logits_v2(
+        samples[common.IMAGE],
+        model_options=model_options,
+        image_pyramid=FLAGS.image_pyramid,
+        weight_decay=FLAGS.weight_decay,
+        is_training=True,
+        fine_tune_batch_norm=FLAGS.fine_tune_batch_norm)
+  if model_options.classification_loss == 'softmax':
+    for output, num_classes in six.iteritems(outputs_to_num_classes):
+      train_utils.add_softmax_cross_entropy_loss_for_each_scale(
+          outputs_to_scales_to_logits[output],
+          samples[common.LABEL],
+          num_classes,
+          ignore_label,
+          loss_weight=1.0,
+          upsample_logits=FLAGS.upsample_logits,
+          scope=output)
+  elif model_options.classification_loss == 'triplet':
+    for output, _ in six.iteritems(outputs_to_num_classes):
+      train_utils.add_triplet_loss_for_each_scale(
+          FLAGS.train_batch_size // FLAGS.num_clones,
+          FLAGS.train_num_frames_per_video,
+          FLAGS.embedding_dimension, outputs_to_scales_to_logits[output],
+          samples[common.LABEL], scope=output)
+  elif model_options.classification_loss == 'softmax_with_attention':
+    labels = samples[common.LABEL]
+    batch_size = FLAGS.train_batch_size // FLAGS.num_clones
+    num_frames_per_video = FLAGS.train_num_frames_per_video
+    h, w = train_utils.resolve_shape(labels)[1:3]
+    labels = tf.reshape(labels, tf.stack(
+        [batch_size, num_frames_per_video, h, w, 1]))
+    # Strip the reference labels off.
+    if FLAGS.also_attend_to_previous_frame or FLAGS.use_softmax_feedback:
+      n_ref_frames = 2
+    else:
+      n_ref_frames = 1
+    labels = labels[:, n_ref_frames:]
+    # Merge batch and time dimensions.
+    labels = tf.reshape(labels, tf.stack(
+        [batch_size * (num_frames_per_video - n_ref_frames), h, w, 1]))
+    for output, num_classes in six.iteritems(outputs_to_num_classes):
+      train_utils.add_dynamic_softmax_cross_entropy_loss_for_each_scale(
+          outputs_to_scales_to_logits[output],
+          labels,
+          ignore_label,
+          loss_weight=1.0,
+          upsample_logits=FLAGS.upsample_logits,
+          scope=output,
+          top_k_percent_pixels=FLAGS.top_k_percent_pixels,
+          hard_example_mining_step=FLAGS.hard_example_mining_step)
+  else:
+    raise ValueError('Only support softmax, softmax_with_attention'
+                     ' or triplet for classification_loss.')
+  return outputs_to_scales_to_logits
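In the `softmax_with_attention` branch above, the labels are reshaped to `[batch, frames, h, w, 1]`, the leading reference frames are sliced off, and batch and time are merged again. The following sketch shows which positions of the flat batch survive that step. It uses plain Python indices instead of tensors, and `query_frame_indices` is a hypothetical name introduced only for illustration.

```python
def query_frame_indices(batch_size, num_frames_per_video, n_ref_frames):
    """Lists the flat frame indices that remain after dropping references.

    The flat batch is laid out video by video, so frame t of video v sits at
    index v * num_frames_per_video + t. The first `n_ref_frames` frames of
    every video are reference frames and carry no loss.
    """
    kept = []
    for video in range(batch_size):
        start = video * num_frames_per_video
        for t in range(n_ref_frames, num_frames_per_video):
            kept.append(start + t)
    return kept


if __name__ == '__main__':
    # One reference frame per video (first-frame matching only).
    print(query_frame_indices(2, 3, 1))
    # Two reference frames (also attending to the previous frame).
    print(query_frame_indices(2, 3, 2))
```

The number of kept indices equals `batch_size * (num_frames_per_video - n_ref_frames)`, which matches the leading dimension of the second `tf.reshape` in the branch.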
+def main(unused_argv):
+  # Set up deployment (i.e., multi-GPUs and/or multi-replicas).
+  config = model_deploy.DeploymentConfig(
+      num_clones=FLAGS.num_clones,
+      clone_on_cpu=FLAGS.clone_on_cpu,
+      replica_id=FLAGS.task,
+      num_replicas=FLAGS.num_replicas,
+      num_ps_tasks=FLAGS.num_ps_tasks)
+  with tf.Graph().as_default():
+    with tf.device(config.inputs_device()):
+      train_crop_size = (None if 0 in FLAGS.train_crop_size else
+                         FLAGS.train_crop_size)
+      assert FLAGS.dataset
+      assert len(FLAGS.dataset) == len(FLAGS.dataset_dir)
+      if len(FLAGS.first_frame_finetuning) == 1:
+        first_frame_finetuning = (list(FLAGS.first_frame_finetuning)
+                                  * len(FLAGS.dataset))
+      else:
+        first_frame_finetuning = FLAGS.first_frame_finetuning
+      if len(FLAGS.three_frame_dataset) == 1:
+        three_frame_dataset = (list(FLAGS.three_frame_dataset)
+                               * len(FLAGS.dataset))
+      else:
+        three_frame_dataset = FLAGS.three_frame_dataset
+      assert len(FLAGS.dataset) == len(first_frame_finetuning)
+      assert len(FLAGS.dataset) == len(three_frame_dataset)
+      datasets, samples_list = zip(
+          *[_get_dataset_and_samples(config, train_crop_size, dataset,
+                                     dataset_dir, bool(first_frame_finetuning_),
+                                     bool(three_frame_dataset_))
+            for dataset, dataset_dir, first_frame_finetuning_,
+            three_frame_dataset_ in zip(FLAGS.dataset, FLAGS.dataset_dir,
+                                        first_frame_finetuning,
+                                        three_frame_dataset)])
+      # Note that this way of doing things is wasteful since it will evaluate
+      # all branches but just use one of them. But let's do it anyway for now,
+      # since it's easy and will probably be fast enough.
+      dataset = datasets[0]
+      if len(samples_list) == 1:
+        samples = samples_list[0]
+      else:
+        probabilities = FLAGS.dataset_sampling_probabilities
+        if probabilities:
+          assert len(probabilities) == len(samples_list)
+        else:
+          # Default to uniform probabilities.
+          probabilities = [1.0 / len(samples_list) for _ in samples_list]
+        probabilities = tf.constant(probabilities)
+        logits = tf.log(probabilities[tf.newaxis])
+        rand_idx = tf.squeeze(tf.multinomial(logits, 1, output_dtype=tf.int32),
+                              axis=[0, 1])
+        def wrap(x):
+          def f():
+            return x
+          return f
+        samples = tf.case({tf.equal(rand_idx, idx): wrap(s)
+                           for idx, s in enumerate(samples_list)},
+                          exclusive=True)
+      # Prefetch_queue requires the shape to be known at graph creation time.
+      # So we only use it if we crop to a fixed size.
+      if train_crop_size is None:
+        inputs_queue = samples
+      else:
+        inputs_queue = prefetch_queue.prefetch_queue(
+            samples,
+            capacity=FLAGS.prefetch_queue_capacity_factor*config.num_clones,
+            num_threads=FLAGS.prefetch_queue_num_threads)
+    # Create the global step on the device storing the variables.
+    with tf.device(config.variables_device()):
+      global_step = tf.train.get_or_create_global_step()
+      # Define the model and create clones.
+      model_fn = _build_deeplab
+      if FLAGS.classification_loss == 'triplet':
+        embedding_dim = FLAGS.embedding_dimension
+        output_type_to_dim = {'embedding': embedding_dim}
+      else:
+        output_type_to_dim = {common.OUTPUT_TYPE: dataset.num_classes}
+      model_args = (inputs_queue, output_type_to_dim, dataset.ignore_label)
+      clones = model_deploy.create_clones(config, model_fn, args=model_args)
+      # Gather update_ops from the first clone. These contain, for example,
+      # the updates for the batch_norm variables created by model_fn.
+      first_clone_scope = config.clone_scope(0)
+      update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, first_clone_scope)
+    # Gather initial summaries.
+    summaries = set(tf.get_collection(tf.GraphKeys.SUMMARIES))
+    # Add summaries for model variables.
+    for model_var in tf.contrib.framework.get_model_variables():
+      summaries.add(tf.summary.histogram(model_var.op.name, model_var))
+    # Add summaries for losses.
+    for loss in tf.get_collection(tf.GraphKeys.LOSSES, first_clone_scope):
+      summaries.add(tf.summary.scalar('losses/%s' % loss.op.name, loss))
+    # Build the optimizer based on the device specification.
+    with tf.device(config.optimizer_device()):
+      learning_rate = train_utils.get_model_learning_rate(
+          FLAGS.learning_policy,
+          FLAGS.base_learning_rate,
+          FLAGS.learning_rate_decay_step,
+          FLAGS.learning_rate_decay_factor,
+          FLAGS.training_number_of_steps,
+          FLAGS.learning_power,
+          FLAGS.slow_start_step,
+          FLAGS.slow_start_learning_rate)
+      optimizer = tf.train.MomentumOptimizer(learning_rate, FLAGS.momentum)
+      summaries.add(tf.summary.scalar('learning_rate', learning_rate))
+    startup_delay_steps = FLAGS.task * FLAGS.startup_delay_steps
+    with tf.device(config.variables_device()):
+      total_loss, grads_and_vars = model_deploy.optimize_clones(
+          clones, optimizer)
+      total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
+      summaries.add(tf.summary.scalar('total_loss', total_loss))
+      # Modify the gradients for biases and last layer variables.
+      last_layers = model.get_extra_layer_scopes(
+          FLAGS.last_layers_contain_logits_only)
+      grad_mult = train_utils.get_model_gradient_multipliers(
+          last_layers, FLAGS.last_layer_gradient_multiplier)
+      if grad_mult:
+        grads_and_vars = slim.learning.multiply_gradients(grads_and_vars,
+                                                          grad_mult)
+      with tf.name_scope('grad_clipping'):
+        grads_and_vars = slim.learning.clip_gradient_norms(grads_and_vars, 5.0)
+      # Create histogram summaries for the gradients.
+      # We have too many summaries for mldash, so disable this one for now.
+      # for grad, var in grads_and_vars:
+      #   summaries.add(tf.summary.histogram(
+      #       var.name.replace(':0', '_0') + '/gradient', grad))
+      # Create gradient update op.
+      grad_updates = optimizer.apply_gradients(grads_and_vars,
+                                               global_step=global_step)
+      update_ops.append(grad_updates)
+      update_op = tf.group(*update_ops)
+      with tf.control_dependencies([update_op]):
+        train_tensor = tf.identity(total_loss, name='train_op')
+    # Add the summaries from the first clone. These contain the summaries
+    # created by model_fn and either optimize_clones() or _gather_clone_loss().
+    summaries |= set(tf.get_collection(tf.GraphKeys.SUMMARIES,
+                                       first_clone_scope))
+    # Merge all summaries together.
+    summary_op = tf.summary.merge(list(summaries))
+    # Soft placement allows placing on CPU ops without GPU implementation.
+    session_config = tf.ConfigProto(allow_soft_placement=True,
+                                    log_device_placement=False)
+    # Start the training.
+    slim.learning.train(
+        train_tensor,
+        logdir=FLAGS.train_logdir,
+        log_every_n_steps=FLAGS.log_steps,
+        master=FLAGS.master,
+        number_of_steps=FLAGS.training_number_of_steps,
+        is_chief=(FLAGS.task == 0),
+        session_config=session_config,
+        startup_delay_steps=startup_delay_steps,
+        init_fn=train_utils.get_model_init_fn(FLAGS.train_logdir,
+                                              FLAGS.tf_initial_checkpoint,
+                                              FLAGS.initialize_last_layer,
+                                              last_layers,
+                                              ignore_missing_vars=True),
+        summary_op=summary_op,
+        save_summaries_secs=FLAGS.save_summaries_secs,
+        save_interval_secs=FLAGS.save_interval_secs)
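When several datasets are given, `main` converts the sampling probabilities to log space and draws one index from them as unnormalized logits. The sketch below reproduces that idea with the standard library only, to show that sampling from `log(p)` treated as logits recovers the original probabilities. The helper names are hypothetical, and the inverse-CDF loop is a stand-in for `tf.multinomial`, not its actual implementation.

```python
import math
import random


def sample_index_from_logits(logits, rng):
    """Draws one index with probability proportional to exp(logit)."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    weights = [math.exp(l - largest) for l in logits]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for idx, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return idx
    return len(weights) - 1  # Guards against floating point round-off.


def pick_dataset(probabilities, rng):
    """Picks a dataset index; assumes all probabilities are positive."""
    logits = [math.log(p) for p in probabilities]
    return sample_index_from_logits(logits, rng)


if __name__ == '__main__':
    rng = random.Random(0)
    counts = [0, 0]
    for _ in range(10000):
        counts[pick_dataset([0.9, 0.1], rng)] += 1
    print(counts)
```

Because the logits are log-probabilities, the softmax that the sampler applies implicitly returns the probabilities that were passed in, so an unnormalized list such as `[9, 1]` would behave the same as `[0.9, 0.1]`.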
+def _get_dataset_and_samples(config, train_crop_size, dataset_name,
+                             dataset_dir, first_frame_finetuning,
+                             three_frame_dataset):
+  """Creates the dataset object and a dict of sample tensors.
+  Args:
+    config: A DeploymentConfig.
+    train_crop_size: List of two integers [height, width], the crop size used
+      for training, or None if no fixed crop size is used.
+    dataset_name: String, the name of the dataset.
+    dataset_dir: String, the directory of the dataset.
+    first_frame_finetuning: Boolean, whether the used dataset is a dataset
+      for first frame fine-tuning.
+    three_frame_dataset: Boolean, whether the dataset has exactly three frames
+      per video of which the first is to be used as reference and the two
+      others are consecutive frames to be used as query frames.
+  Returns:
+    dataset: An instance of slim Dataset.
+    samples: A dictionary of tensors for semantic segmentation.
+  """
+  # Split the batch across GPUs.
+  assert FLAGS.train_batch_size % config.num_clones == 0, (
+      'Training batch size not divisible by number of clones (GPUs).')
+  # Use integer division so the per-clone batch size stays an int.
+  clone_batch_size = FLAGS.train_batch_size // config.num_clones
+  if first_frame_finetuning:
+    train_split = 'val'
+  else:
+    train_split = FLAGS.train_split
+  data_type = 'tf_sequence_example'
+  # Get dataset-dependent information.
+  dataset = video_dataset.get_dataset(
+      dataset_name,
+      train_split,
+      dataset_dir=dataset_dir,
+      data_type=data_type)
+  tf.gfile.MakeDirs(FLAGS.train_logdir)
+  tf.logging.info('Training on %s set', train_split)
+  samples = video_input_generator.get(
+      dataset,
+      FLAGS.train_num_frames_per_video,
+      train_crop_size,
+      clone_batch_size,
+      num_readers=FLAGS.num_readers,
+      num_threads=FLAGS.batch_num_threads,
+      min_resize_value=FLAGS.min_resize_value,
+      max_resize_value=FLAGS.max_resize_value,
+      resize_factor=FLAGS.resize_factor,
+      min_scale_factor=FLAGS.min_scale_factor,
+      max_scale_factor=FLAGS.max_scale_factor,
+      scale_factor_step_size=FLAGS.scale_factor_step_size,
+      dataset_split=FLAGS.train_split,
+      is_training=True,
+      model_variant=FLAGS.model_variant,
+      batch_capacity_factor=FLAGS.batch_capacity_factor,
+      decoder_output_stride=FLAGS.decoder_output_stride,
+      first_frame_finetuning=first_frame_finetuning,
+      sample_only_first_frame_for_finetuning=
+      FLAGS.sample_only_first_frame_for_finetuning,
+      sample_adjacent_and_consistent_query_frames=
+      FLAGS.sample_adjacent_and_consistent_query_frames or
+      FLAGS.use_softmax_feedback,
+      remap_labels_to_reference_frame=True,
+      three_frame_dataset=three_frame_dataset,
+      add_prev_frame_label=not FLAGS.also_attend_to_previous_frame
+  )
+  return dataset, samples
+if __name__ == '__main__':
+  flags.mark_flag_as_required('train_logdir')
+  tf.logging.set_verbosity(tf.logging.INFO)
+  tf.app.run()