Commit ac8b4413 authored by Neal Wu, committed by GitHub

Merge pull request #5783 from aneliaangelova/mew

Added struct2depth model
parents 7cc688ae 811fa209
@@ -48,6 +48,7 @@
/research/slim/ @sguada @nathansilberman
/research/steve/ @buckman-google
/research/street/ @theraysmith
/research/struct2depth/ @aneliaangelova
/research/swivel/ @waterson
/research/syntaxnet/ @calberti @andorardo @bogatyy @markomernick
/research/tcn/ @coreylynch @sermanet
@@ -74,6 +74,7 @@ request.
- [slim](slim): image classification models in TF-Slim.
- [street](street): identify the name of a street (in France) from an image
  using a Deep RNN.
- [struct2depth](struct2depth): unsupervised learning of depth and ego-motion.
- [swivel](swivel): the Swivel algorithm for generating word embeddings.
- [syntaxnet](syntaxnet): neural models of natural language syntax.
- [tcn](tcn): Self-supervised representation learning from multi-view video.
package(default_visibility = ["//visibility:public"])
# struct2depth
This is a method for unsupervised learning of depth and ego-motion from monocular video. It achieves new state-of-the-art results on both tasks by explicitly modeling 3D object motion, performing on-line refinement, and improving quality for moving objects through novel loss formulations. The method is described in the following paper:
**V. Casser, S. Pirk, R. Mahjourian, A. Angelova, Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos, AAAI Conference on Artificial Intelligence, 2019**
https://arxiv.org/pdf/1811.06152.pdf
This code is implemented and supported by Vincent Casser (git username: VincentCa) and Anelia Angelova (git username: AneliaAngelova). Please contact anelia@google.com for questions.
Project website: https://sites.google.com/view/struct2depth.
## Quick start: Running training
Before running training, run the gen_data_* script for the respective dataset in order to generate the data in the appropriate format for KITTI or Cityscapes. It is assumed that motion masks have already been generated and stored as images.
Models are trained starting from an ImageNet-pretrained checkpoint.
```shell
ckpt_dir="your/checkpoint/folder"
data_dir="KITTI_SEQ2_LR/" # Set for KITTI
data_dir="CITYSCAPES_SEQ2_LR/" # Set for Cityscapes
imagenet_ckpt="resnet_pretrained/model.ckpt"
python train.py \
--logtostderr \
--checkpoint_dir $ckpt_dir \
--data_dir $data_dir \
--architecture resnet \
--imagenet_ckpt $imagenet_ckpt \
--imagenet_norm true \
--joint_encoder false
```
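For reference, the gen_data_* scripts write each training sample as a horizontally stacked triplet of 416 x 128 frames (`<id>.png`) together with a `<id>_cam.txt` file holding the 3x3 camera intrinsics as nine comma-separated values (the Cityscapes script additionally writes an aligned segmentation triplet `<id>-fseg.png`). Below is a minimal sketch of reading one such sample back; the paths are only illustrative:

```python
import cv2
import numpy as np

# Illustrative sample path inside the generated output directory.
sample = 'CITYSCAPES_Processed/some_location_1/0000000001'

# Three 416 x 128 frames are stacked horizontally into a single image.
triplet = cv2.imread(sample + '.png')
frames = [triplet[:, i * 416:(i + 1) * 416] for i in range(3)]

# The matching *_cam.txt holds the 3x3 intrinsics as nine comma-separated values.
with open(sample + '_cam.txt') as f:
  intrinsics = np.array([float(v) for v in f.read().split(',')]).reshape(3, 3)

print([frame.shape for frame in frames], intrinsics)
```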
## Running depth/egomotion inference on an image folder
KITTI models are trained on the raw image data (resized to 416 x 128), with inputs standardized before being fed to the network; Cityscapes images are additionally cropped using the cropping parameters (192, 1856, 256, 768). If a different crop is used, additional training is likely necessary. Therefore, please follow the inference example shown below when using one of the models. The right choice of model can depend on a variety of factors. For example, if a checkpoint is to be used for odometry, note that for motion models, using segmentation masks (setting *use_masks=true* at inference) can improve odometry results. On the other hand, all models can be used for single-frame depth estimation without any additional information.
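As a rough illustration of the crop, here is a minimal sketch, assuming the parameters are ordered as (x_start, x_end, y_start, y_end) on a full-resolution 2048 x 1024 Cityscapes frame (the resulting 1664 x 512 region has the same aspect ratio as the 416 x 128 network input):

```python
import cv2

def crop_cityscapes_like(image, x_start=192, x_end=1856, y_start=256, y_end=768):
  # Crop the 1664 x 512 region and resize it to the 416 x 128 network input.
  cropped = image[y_start:y_end, x_start:x_end]
  return cv2.resize(cropped, (416, 128))
```

In practice, passing *--inference_crop cityscapes* to inference.py applies the built-in Cityscapes crop, so the sketch above only illustrates what the parameters refer to.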
```shell
input_dir="your/image/folder"
output_dir="your/output/folder"
model_checkpoint="your/model/checkpoint"
python inference.py \
--logtostderr \
--file_extension png \
--depth \
--egomotion true \
--input_dir $input_dir \
--output_dir $output_dir \
--model_ckpt $model_checkpoint
```
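For each input image, inference.py writes the raw depth prediction as a NumPy array (`<name>.npy`, with a `_flip` suffix when the *flip* flag is set) next to a color visualization (`<name>.png`) in the output directory. A minimal sketch for loading a raw prediction; the file name is only illustrative:

```python
import numpy as np

# Illustrative path inside the chosen output directory.
depth = np.load('your/output/folder/0000000001.npy')
print(depth.shape, depth.min(), depth.max())
```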
Note that egomotion prediction expects the files in the input directory to form a consecutive sequence, and that sorting the filenames alphabetically puts them in the right order.
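Egomotion estimates are written to an egomotion.txt file in the corresponding output folder. Each line starts with the frame index, followed by one comma-separated 6-DoF transform per adjacent frame pair in the sequence. A small parsing sketch; the path is only illustrative:

```python
import numpy as np

with open('your/output/folder/egomotion.txt') as f:
  for line in f:
    fields = line.split()
    frame_index = int(fields[0])
    # One 6-DoF vector (translation and rotation parameters) per frame pair.
    transforms = [np.array([float(v) for v in t.split(',')]) for t in fields[1:]]
    print(frame_index, [t.shape for t in transforms])
```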
One can also run inference on KITTI by providing
```shell
--input_list_file ~/kitti-raw-uncompressed/test_files_eigen.txt
```
and on Cityscapes by passing
```shell
--input_list_file CITYSCAPES_FULL/test_files_cityscapes.txt
```
instead of *input_dir*.
Alternatively, inference can also be run on pre-processed images.
## Running on-line refinement
On-line refinement is executed on top of an existing inference folder, so make sure to run regular inference first. Then you can run the on-line fusion procedure as follows:
```shell
prediction_dir="some/prediction/dir"
model_ckpt="checkpoints/checkpoints_baseline/model-199160"
handle_motion="false"
size_constraint_weight="0" # This must be zero when not handling motion.
# If running on KITTI, set as follows:
data_dir="KITTI_SEQ2_LR_EIGEN/"
triplet_list_file="$data_dir/test_files_eigen_triplets.txt"
triplet_list_file_remains="$data_dir/test_files_eigen_triplets_remains.txt"
ft_name="kitti"
# If running on Cityscapes, set as follows:
data_dir="CITYSCAPES_SEQ2_LR_TEST/" # Set for Cityscapes
triplet_list_file="/CITYSCAPES_SEQ2_LR_TEST/test_files_cityscapes_triplets.txt"
triplet_list_file_remains="CITYSCAPES_SEQ2_LR_TEST/test_files_cityscapes_triplets_remains.txt"
ft_name="cityscapes"
python optimize.py \
--logtostderr \
--output_dir $prediction_dir \
--data_dir $data_dir \
--triplet_list_file $triplet_list_file \
--triplet_list_file_remains $triplet_list_file_remains \
--ft_name $ft_name \
--model_ckpt $model_ckpt \
--file_extension png \
--handle_motion $handle_motion \
--size_constraint_weight $size_constraint_weight
```
## Running evaluation
```shell
prediction_dir="some/prediction/dir"
# Use these settings for KITTI:
eval_list_file="KITTI_FULL/kitti-raw-uncompressed/test_files_eigen.txt"
eval_crop="garg"
eval_mode="kitti"
# Use these settings for Cityscapes:
eval_list_file="CITYSCAPES_FULL/test_files_cityscapes.txt"
eval_crop="none"
eval_mode="cityscapes"
python evaluate.py \
--logtostderr \
--prediction_dir $prediction_dir \
--eval_list_file $eval_list_file \
--eval_crop $eval_crop \
--eval_mode $eval_mode
```
## Credits
This code is implemented and supported by Vincent Casser and Anelia Angelova and can be found at
https://sites.google.com/view/struct2depth.
The core implementation is derived from [vid2depth](https://github.com/tensorflow/models/tree/master/research/vid2depth)
by Reza Mahjourian (rezama@google.com), which in turn is based on
[SfMLearner](https://github.com/tinghuiz/SfMLearner) by [Tinghui Zhou](https://github.com/tinghuiz).
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Common utilities for data pre-processing, e.g. matching moving object across frames."""
import numpy as np
def compute_overlap(mask1, mask2):
# Use IoU here.
return np.sum(mask1 & mask2)/np.sum(mask1 | mask2)
def align(seg_img1, seg_img2, seg_img3, threshold_same=0.3):
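"""Aligns object ids across three consecutive segmentation maps.
An id from seg_img1 is kept only if a mask with IoU above threshold_same is
found in seg_img2 and, transitively, in seg_img3. Matched masks in all three
frames are re-labeled with the id from the first frame; unmatched regions
remain zero in the returned maps.
"""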
res_img1 = np.zeros_like(seg_img1)
res_img2 = np.zeros_like(seg_img2)
res_img3 = np.zeros_like(seg_img3)
remaining_objects2 = list(np.unique(seg_img2.flatten()))
remaining_objects3 = list(np.unique(seg_img3.flatten()))
for seg_id in np.unique(seg_img1):
# See if we can find correspondences to seg_id in seg_img2.
max_overlap2 = float('-inf')
max_segid2 = -1
for seg_id2 in remaining_objects2:
overlap = compute_overlap(seg_img1==seg_id, seg_img2==seg_id2)
if overlap>max_overlap2:
max_overlap2 = overlap
max_segid2 = seg_id2
if max_overlap2 > threshold_same:
max_overlap3 = float('-inf')
max_segid3 = -1
for seg_id3 in remaining_objects3:
overlap = compute_overlap(seg_img2==max_segid2, seg_img3==seg_id3)
if overlap>max_overlap3:
max_overlap3 = overlap
max_segid3 = seg_id3
if max_overlap3 > threshold_same:
res_img1[seg_img1==seg_id] = seg_id
res_img2[seg_img2==max_segid2] = seg_id
res_img3[seg_img3==max_segid3] = seg_id
remaining_objects2.remove(max_segid2)
remaining_objects3.remove(max_segid3)
return res_img1, res_img2, res_img3
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
""" Offline data generation for the Cityscapes dataset."""
import os
from absl import app
from absl import flags
from absl import logging
import numpy as np
import cv2
import glob
import alignment
from alignment import compute_overlap
from alignment import align
SKIP = 2
WIDTH = 416
HEIGHT = 128
SUB_FOLDER = 'train'
INPUT_DIR = '/usr/local/google/home/anelia/struct2depth/CITYSCAPES_FULL/'
OUTPUT_DIR = '/usr/local/google/home/anelia/struct2depth/CITYSCAPES_Processed/'
def crop(img, segimg, fx, fy, cx, cy):
# Perform center cropping, preserving 50% vertically.
middle_perc = 0.50
left = 1 - middle_perc
half = left / 2
a = img[int(img.shape[0]*(half)):int(img.shape[0]*(1-half)), :]
aseg = segimg[int(segimg.shape[0]*(half)):int(segimg.shape[0]*(1-half)), :]
cy /= (1 / middle_perc)
# Resize to match target height while preserving aspect ratio.
wdt = int((float(HEIGHT)*a.shape[1]/a.shape[0]))
x_scaling = float(wdt)/a.shape[1]
y_scaling = float(HEIGHT)/a.shape[0]
b = cv2.resize(a, (wdt, HEIGHT))
bseg = cv2.resize(aseg, (wdt, HEIGHT))
# Adjust intrinsics.
fx*=x_scaling
fy*=y_scaling
cx*=x_scaling
cy*=y_scaling
# Perform center cropping horizontally.
remain = b.shape[1] - WIDTH
cx /= (b.shape[1] / WIDTH)
c = b[:, int(remain/2):b.shape[1]-int(remain/2)]
cseg = bseg[:, int(remain/2):b.shape[1]-int(remain/2)]
return c, cseg, fx, fy, cx, cy
def run_all():
dir_name=INPUT_DIR + '/leftImg8bit_sequence/' + SUB_FOLDER + '/*'
print('Processing directory', dir_name)
for location in glob.glob(INPUT_DIR + '/leftImg8bit_sequence/' + SUB_FOLDER + '/*'):
location_name = os.path.basename(location)
print('Processing location', location_name)
files = sorted(glob.glob(location + '/*.png'))
files = [file for file in files if '-seg.png' not in file]
# Break down into sequences
sequences = {}
seq_nr = 0
last_seq = ''
last_imgnr = -1
for i in range(len(files)):
seq = os.path.basename(files[i]).split('_')[1]
nr = int(os.path.basename(files[i]).split('_')[2])
if seq!=last_seq or last_imgnr+1!=nr:
seq_nr+=1
last_imgnr = nr
last_seq = seq
if not seq_nr in sequences:
sequences[seq_nr] = []
sequences[seq_nr].append(files[i])
for (k,v) in sequences.items():
print('Processing sequence', k, 'with', len(v), 'elements...')
output_dir = OUTPUT_DIR + '/' + location_name + '_' + str(k)
if not os.path.isdir(output_dir):
os.mkdir(output_dir)
files = sorted(v)
triplet = []
seg_triplet = []
ct = 1
# Find applicable intrinsics.
for j in range(len(files)):
osegname = os.path.basename(files[j]).split('_')[1]
oimgnr = os.path.basename(files[j]).split('_')[2]
applicable_intrinsics = INPUT_DIR + '/camera/' + SUB_FOLDER + '/' + location_name + '/' + location_name + '_' + osegname + '_' + oimgnr + '_camera.json'
# Get the intrinsics for one of the file of the sequence.
if os.path.isfile(applicable_intrinsics):
f = open(applicable_intrinsics, 'r')
lines = f.readlines()
f.close()
lines = [line.rstrip() for line in lines]
fx = float(lines[11].split(': ')[1].replace(',', ''))
fy = float(lines[12].split(': ')[1].replace(',', ''))
cx = float(lines[13].split(': ')[1].replace(',', ''))
cy = float(lines[14].split(': ')[1].replace(',', ''))
for j in range(0, len(files), SKIP):
img = cv2.imread(files[j])
segimg = cv2.imread(files[j].replace('.png', '-seg.png'))
smallimg, segimg, fx_this, fy_this, cx_this, cy_this = crop(img, segimg, fx, fy, cx, cy)
triplet.append(smallimg)
seg_triplet.append(segimg)
if len(triplet)==3:
cmb = np.hstack(triplet)
align1, align2, align3 = align(seg_triplet[0], seg_triplet[1], seg_triplet[2])
cmb_seg = np.hstack([align1, align2, align3])
cv2.imwrite(os.path.join(output_dir, str(ct).zfill(10) + '.png'), cmb)
cv2.imwrite(os.path.join(output_dir, str(ct).zfill(10) + '-fseg.png'), cmb_seg)
f = open(os.path.join(output_dir, str(ct).zfill(10) + '_cam.txt'), 'w')
f.write(str(fx_this) + ',0.0,' + str(cx_this) + ',0.0,' + str(fy_this) + ',' + str(cy_this) + ',0.0,0.0,1.0')
f.close()
del triplet[0]
del seg_triplet[0]
ct+=1
# Create file list for training. Be careful as it collects and includes all files recursively.
fn = open(OUTPUT_DIR + '/' + SUB_FOLDER + '.txt', 'w')
for f in glob.glob(OUTPUT_DIR + '/*/*.png'):
if '-seg.png' in f or '-fseg.png' in f:
continue
folder_name = f.split('/')[-2]
img_name = f.split('/')[-1].replace('.png', '')
fn.write(folder_name + ' ' + img_name + '\n')
fn.close()
def main(_):
run_all()
if __name__ == '__main__':
app.run(main)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
""" Offline data generation for the KITTI dataset."""
import os
from absl import app
from absl import flags
from absl import logging
import numpy as np
import cv2
import glob
import alignment
from alignment import compute_overlap
from alignment import align
SEQ_LENGTH = 3
WIDTH = 416
HEIGHT = 128
STEPSIZE = 1
INPUT_DIR = '/usr/local/google/home/anelia/struct2depth/KITTI_FULL/kitti-raw-uncompressed'
OUTPUT_DIR = '/usr/local/google/home/anelia/struct2depth/KITTI_processed/'
def get_line(file, start):
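"""Parses the calibration entry `start` (e.g. 'P_rect_02') from `file`.
Returns the left 3x3 block of the corresponding 3x4 projection matrix as a
numpy array, or None if the entry is not found.
"""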
file = open(file, 'r')
lines = file.readlines()
lines = [line.rstrip() for line in lines]
ret = None
for line in lines:
nline = line.split(': ')
if nline[0]==start:
ret = nline[1].split(' ')
ret = np.array([float(r) for r in ret], dtype=float)
ret = ret.reshape((3,4))[0:3, 0:3]
break
file.close()
return ret
def crop(img, segimg, fx, fy, cx, cy):
# Perform center cropping, preserving 50% vertically.
middle_perc = 0.50
left = 1-middle_perc
half = left/2
a = img[int(img.shape[0]*(half)):int(img.shape[0]*(1-half)), :]
aseg = segimg[int(segimg.shape[0]*(half)):int(segimg.shape[0]*(1-half)), :]
cy /= (1/middle_perc)
# Resize to match target height while preserving aspect ratio.
wdt = int((128*a.shape[1]/a.shape[0]))
x_scaling = float(wdt)/a.shape[1]
y_scaling = 128.0/a.shape[0]
b = cv2.resize(a, (wdt, 128))
bseg = cv2.resize(aseg, (wdt, 128))
# Adjust intrinsics.
fx*=x_scaling
fy*=y_scaling
cx*=x_scaling
cy*=y_scaling
# Perform center cropping horizontally.
remain = b.shape[1] - 416
cx /= (b.shape[1]/416)
c = b[:, int(remain/2):b.shape[1]-int(remain/2)]
cseg = bseg[:, int(remain/2):b.shape[1]-int(remain/2)]
return c, cseg, fx, fy, cx, cy
def run_all():
global OUTPUT_DIR  # OUTPUT_DIR is re-assigned below, so declare it global.
ct = 0
if not OUTPUT_DIR.endswith('/'):
OUTPUT_DIR = OUTPUT_DIR + '/'
for d in glob.glob(INPUT_DIR + '/*/'):
date = d.split('/')[-2]
file_calibration = d + 'calib_cam_to_cam.txt'
calib_raw = [get_line(file_calibration, 'P_rect_02'), get_line(file_calibration, 'P_rect_03')]
for d2 in glob.glob(d + '*/'):
seqname = d2.split('/')[-2]
print('Processing sequence', seqname)
for subfolder in ['image_02/data', 'image_03/data']:
ct = 1
seqname = d2.split('/')[-2] + subfolder.replace('image', '').replace('/data', '')
if not os.path.exists(OUTPUT_DIR + seqname):
os.mkdir(OUTPUT_DIR + seqname)
calib_camera = calib_raw[0] if subfolder=='image_02/data' else calib_raw[1]
folder = d2 + subfolder
files = glob.glob(folder + '/*.png')
files = [file for file in files if not 'disp' in file and not 'flip' in file and not 'seg' in file]
files = sorted(files)
for i in range(SEQ_LENGTH, len(files)+1, STEPSIZE):
imgnum = str(ct).zfill(10)
if os.path.exists(OUTPUT_DIR + seqname + '/' + imgnum + '.png'):
ct+=1
continue
big_img = np.zeros(shape=(HEIGHT, WIDTH*SEQ_LENGTH, 3))
wct = 0
for j in range(i-SEQ_LENGTH, i): # Collect frames for this sample.
img = cv2.imread(files[j])
ORIGINAL_HEIGHT, ORIGINAL_WIDTH, _ = img.shape
zoom_x = WIDTH/ORIGINAL_WIDTH
zoom_y = HEIGHT/ORIGINAL_HEIGHT
# Adjust intrinsics.
calib_current = calib_camera.copy()
calib_current[0, 0] *= zoom_x
calib_current[0, 2] *= zoom_x
calib_current[1, 1] *= zoom_y
calib_current[1, 2] *= zoom_y
calib_representation = ','.join([str(c) for c in calib_current.flatten()])
img = cv2.resize(img, (WIDTH, HEIGHT))
big_img[:,wct*WIDTH:(wct+1)*WIDTH] = img
wct+=1
cv2.imwrite(OUTPUT_DIR + seqname + '/' + imgnum + '.png', big_img)
f = open(OUTPUT_DIR + seqname + '/' + imgnum + '_cam.txt', 'w')
f.write(calib_representation)
f.close()
ct+=1
def main(_):
run_all()
if __name__ == '__main__':
app.run(main)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Runs struct2depth at inference. Produces depth estimates, ego-motion and object motion."""
# Example usage:
#
# python inference.py \
# --input_dir ~/struct2depth/kitti-raw-uncompressed/ \
# --output_dir ~/struct2depth/output \
# --model_ckpt ~/struct2depth/model/model-199160 \
# --file_extension png \
# --depth \
# --egomotion true
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from functools import reduce  # reduce is not a builtin under Python 3.
import os
from absl import app
from absl import flags
from absl import logging
#import matplotlib.pyplot as plt
import model
import numpy as np
import fnmatch
import tensorflow as tf
import nets
import util
gfile = tf.gfile
# CMAP = 'plasma'
INFERENCE_MODE_SINGLE = 'single' # Take plain single-frame input.
INFERENCE_MODE_TRIPLETS = 'triplets' # Take image triplets as input.
# For KITTI, we just resize input images and do not perform cropping. For
# Cityscapes, the car hood and more image content has been cropped in order
# to fit aspect ratio, and remove static content from the images. This has to be
# kept at inference time.
INFERENCE_CROP_NONE = 'none'
INFERENCE_CROP_CITYSCAPES = 'cityscapes'
flags.DEFINE_string('output_dir', None, 'Directory to store predictions.')
flags.DEFINE_string('file_extension', 'png', 'Image data file extension of '
'files provided with input_dir. Also determines the output '
'file format of depth prediction images.')
flags.DEFINE_bool('depth', True, 'Determines if the depth prediction network '
'should be executed and its predictions be saved.')
flags.DEFINE_bool('egomotion', False, 'Determines if the egomotion prediction '
'network should be executed and its predictions be saved. If '
'inference is run in single inference mode, it is assumed '
'that files in the same directory belong in the same '
'sequence, and sorting them alphabetically establishes the '
'right temporal order.')
flags.DEFINE_string('model_ckpt', None, 'Model checkpoint to evaluate.')
flags.DEFINE_string('input_dir', None, 'Directory containing image files to '
'evaluate. This crawls recursively for images in the '
'directory, mirroring relative subdirectory structures '
'into the output directory.')
flags.DEFINE_string('input_list_file', None, 'Text file containing paths to '
'image files to process. Paths should be relative with '
'respect to the list file location. Relative path '
'structures will be mirrored in the output directory.')
flags.DEFINE_integer('batch_size', 1, 'The size of a sample batch')
flags.DEFINE_integer('img_height', 128, 'Input frame height.')
flags.DEFINE_integer('img_width', 416, 'Input frame width.')
flags.DEFINE_integer('seq_length', 3, 'Number of frames in sequence.')
flags.DEFINE_enum('architecture', nets.RESNET, nets.ARCHITECTURES,
'Defines the architecture to use for the depth prediction '
'network. Defaults to ResNet-based encoder and accompanying '
'decoder.')
flags.DEFINE_boolean('imagenet_norm', True, 'Whether to normalize the input '
'images channel-wise so that they match the distribution '
'most ImageNet-models were trained on.')
flags.DEFINE_bool('use_skip', True, 'Whether to use skip connections in the '
'encoder-decoder architecture.')
flags.DEFINE_bool('joint_encoder', False, 'Whether to share parameters '
'between the depth and egomotion networks by using a joint '
'encoder architecture. The egomotion network is then '
'operating only on the hidden representation provided by the '
'joint encoder.')
flags.DEFINE_bool('shuffle', False, 'Whether to shuffle the order in which '
'images are processed.')
flags.DEFINE_bool('flip', False, 'Whether images should be flipped as well as '
'resulting predictions (for test-time augmentation). This '
'currently applies to the depth network only.')
flags.DEFINE_enum('inference_mode', INFERENCE_MODE_SINGLE,
[INFERENCE_MODE_SINGLE,
INFERENCE_MODE_TRIPLETS],
'Whether to use triplet mode for inference, which accepts '
'triplets instead of single frames.')
flags.DEFINE_enum('inference_crop', INFERENCE_CROP_NONE,
[INFERENCE_CROP_NONE,
INFERENCE_CROP_CITYSCAPES],
'Whether to apply a Cityscapes-specific crop on the input '
'images first before running inference.')
flags.DEFINE_bool('use_masks', False, 'Whether to mask out potentially '
'moving objects when feeding image input to the egomotion '
'network. This might improve odometry results when using '
'a motion model. For this, pre-computed segmentation '
'masks have to be available for every image, with the '
'background being zero.')
FLAGS = flags.FLAGS
flags.mark_flag_as_required('output_dir')
flags.mark_flag_as_required('model_ckpt')
def _run_inference(output_dir=None,
file_extension='png',
depth=True,
egomotion=False,
model_ckpt=None,
input_dir=None,
input_list_file=None,
batch_size=1,
img_height=128,
img_width=416,
seq_length=3,
architecture=nets.RESNET,
imagenet_norm=True,
use_skip=True,
joint_encoder=True,
shuffle=False,
flip_for_depth=False,
inference_mode=INFERENCE_MODE_SINGLE,
inference_crop=INFERENCE_CROP_NONE,
use_masks=False):
"""Runs inference. Refer to flags in inference.py for details."""
inference_model = model.Model(is_training=False,
batch_size=batch_size,
img_height=img_height,
img_width=img_width,
seq_length=seq_length,
architecture=architecture,
imagenet_norm=imagenet_norm,
use_skip=use_skip,
joint_encoder=joint_encoder)
vars_to_restore = util.get_vars_to_save_and_restore(model_ckpt)
saver = tf.train.Saver(vars_to_restore)
sv = tf.train.Supervisor(logdir='/tmp/', saver=None)
with sv.managed_session() as sess:
saver.restore(sess, model_ckpt)
if not gfile.Exists(output_dir):
gfile.MakeDirs(output_dir)
logging.info('Predictions will be saved in %s.', output_dir)
# Collect all images to run inference on.
im_files, basepath_in = collect_input_images(input_dir, input_list_file,
file_extension)
if shuffle:
logging.info('Shuffling data...')
np.random.shuffle(im_files)
logging.info('Running inference on %d files.', len(im_files))
# Create missing output folders and pre-compute target directories.
output_dirs = create_output_dirs(im_files, basepath_in, output_dir)
# Run depth prediction network.
if depth:
im_batch = []
for i in range(len(im_files)):
if i % 100 == 0:
logging.info('%s of %s files processed.', i, len(im_files))
# Read image and run inference.
if inference_mode == INFERENCE_MODE_SINGLE:
if inference_crop == INFERENCE_CROP_NONE:
im = util.load_image(im_files[i], resize=(img_width, img_height))
elif inference_crop == INFERENCE_CROP_CITYSCAPES:
im = util.crop_cityscapes(util.load_image(im_files[i]),
resize=(img_width, img_height))
elif inference_mode == INFERENCE_MODE_TRIPLETS:
im = util.load_image(im_files[i], resize=(img_width * 3, img_height))
im = im[:, img_width:img_width*2]
if flip_for_depth:
im = np.flip(im, axis=1)
im_batch.append(im)
if len(im_batch) == batch_size or i == len(im_files) - 1:
# Call inference on batch.
for _ in range(batch_size - len(im_batch)): # Fill up batch.
im_batch.append(np.zeros(shape=(img_height, img_width, 3),
dtype=np.float32))
im_batch = np.stack(im_batch, axis=0)
est_depth = inference_model.inference_depth(im_batch, sess)
if flip_for_depth:
est_depth = np.flip(est_depth, axis=2)
im_batch = np.flip(im_batch, axis=2)
for j in range(len(im_batch)):
color_map = util.normalize_depth_for_display(
np.squeeze(est_depth[j]))
visualization = np.concatenate((im_batch[j], color_map), axis=0)
# Save raw prediction and color visualization. Extract filename
# without extension from full path: e.g. path/to/input_dir/folder1/
# file1.png -> file1
k = i - len(im_batch) + 1 + j
filename_root = os.path.splitext(os.path.basename(im_files[k]))[0]
pref = '_flip' if flip_for_depth else ''
output_raw = os.path.join(
output_dirs[k], filename_root + pref + '.npy')
output_vis = os.path.join(
output_dirs[k], filename_root + pref + '.png')
with gfile.Open(output_raw, 'wb') as f:
np.save(f, est_depth[j])
util.save_image(output_vis, visualization, file_extension)
im_batch = []
# Run egomotion network.
if egomotion:
if inference_mode == INFERENCE_MODE_SINGLE:
# Run regular egomotion inference loop.
input_image_seq = []
input_seg_seq = []
current_sequence_dir = None
current_output_handle = None
for i in range(len(im_files)):
sequence_dir = os.path.dirname(im_files[i])
if sequence_dir != current_sequence_dir:
# Assume start of a new sequence, since this image lies in a
# different directory than the previous ones.
# Clear egomotion input buffer.
output_filepath = os.path.join(output_dirs[i], 'egomotion.txt')
if current_output_handle is not None:
current_output_handle.close()
current_sequence_dir = sequence_dir
logging.info('Writing egomotion sequence to %s.', output_filepath)
current_output_handle = gfile.Open(output_filepath, 'w')
input_image_seq = []
im = util.load_image(im_files[i], resize=(img_width, img_height))
input_image_seq.append(im)
if use_masks:
im_seg_path = im_files[i].replace('.%s' % file_extension,
'-seg.%s' % file_extension)
if not gfile.Exists(im_seg_path):
raise ValueError('No segmentation mask %s has been found for '
'image %s. If none are available, disable '
'use_masks.' % (im_seg_path, im_files[i]))
input_seg_seq.append(util.load_image(im_seg_path,
resize=(img_width, img_height),
interpolation='nn'))
if len(input_image_seq) < seq_length: # Buffer not filled yet.
continue
if len(input_image_seq) > seq_length: # Remove oldest entry.
del input_image_seq[0]
if use_masks:
del input_seg_seq[0]
input_image_stack = np.concatenate(input_image_seq, axis=2)
input_image_stack = np.expand_dims(input_image_stack, axis=0)
if use_masks:
input_image_stack = mask_image_stack(input_image_stack,
input_seg_seq)
est_egomotion = np.squeeze(inference_model.inference_egomotion(
input_image_stack, sess))
egomotion_str = []
for j in range(seq_length - 1):
egomotion_str.append(','.join([str(d) for d in est_egomotion[j]]))
current_output_handle.write(
str(i) + ' ' + ' '.join(egomotion_str) + '\n')
if current_output_handle is not None:
current_output_handle.close()
elif inference_mode == INFERENCE_MODE_TRIPLETS:
written_before = []
for i in range(len(im_files)):
im = util.load_image(im_files[i], resize=(img_width * 3, img_height))
input_image_stack = np.concatenate(
[im[:, :img_width], im[:, img_width:img_width*2],
im[:, img_width*2:]], axis=2)
input_image_stack = np.expand_dims(input_image_stack, axis=0)
if use_masks:
im_seg_path = im_files[i].replace('.%s' % file_extension,
'-seg.%s' % file_extension)
if not gfile.Exists(im_seg_path):
raise ValueError('No segmentation mask %s has been found for '
'image %s. If none are available, disable '
'use_masks.' % (im_seg_path, im_files[i]))
seg = util.load_image(im_seg_path,
resize=(img_width * 3, img_height),
interpolation='nn')
input_seg_seq = [seg[:, :img_width], seg[:, img_width:img_width*2],
seg[:, img_width*2:]]
input_image_stack = mask_image_stack(input_image_stack,
input_seg_seq)
est_egomotion = inference_model.inference_egomotion(
input_image_stack, sess)
est_egomotion = np.squeeze(est_egomotion)
egomotion_1_2 = ','.join([str(d) for d in est_egomotion[0]])
egomotion_2_3 = ','.join([str(d) for d in est_egomotion[1]])
output_filepath = os.path.join(output_dirs[i], 'egomotion.txt')
file_mode = 'w' if output_filepath not in written_before else 'a'
with gfile.Open(output_filepath, file_mode) as current_output_handle:
current_output_handle.write(str(i) + ' ' + egomotion_1_2 + ' ' +
egomotion_2_3 + '\n')
written_before.append(output_filepath)
logging.info('Done.')
def mask_image_stack(input_image_stack, input_seg_seq):
"""Masks out moving image contents by using the segmentation masks provided.
This can lead to better odometry accuracy for motion models, but is optional
to use. Is only called if use_masks is enabled.
Args:
input_image_stack: The input image stack of shape (1, H, W, seq_length).
input_seg_seq: List of segmentation masks with seq_length elements of shape
(H, W, C) for some number of channels C.
Returns:
Input image stack with detections provided by segmentation mask removed.
"""
background = [mask == 0 for mask in input_seg_seq]
background = reduce(lambda m1, m2: m1 & m2, background)
# If masks are RGB, assume all channels to be the same. Reduce to the first.
if background.ndim == 3 and background.shape[2] > 1:
background = np.expand_dims(background[:, :, 0], axis=2)
elif background.ndim == 2: # Expand.
background = np.expand_dims(background, axis=2)
# background is now of shape (H, W, 1).
background_stack = np.tile(background, [1, 1, input_image_stack.shape[3]])
return np.multiply(input_image_stack, background_stack)
def collect_input_images(input_dir, input_list_file, file_extension):
"""Collects all input images that are to be processed."""
if input_dir is not None:
im_files = _recursive_glob(input_dir, '*.' + file_extension)
basepath_in = os.path.normpath(input_dir)
elif input_list_file is not None:
im_files = util.read_text_lines(input_list_file)
basepath_in = os.path.dirname(input_list_file)
im_files = [os.path.join(basepath_in, f) for f in im_files]
im_files = [f for f in im_files if 'disp' not in f and '-seg' not in f and
'-fseg' not in f and '-flip' not in f]
return sorted(im_files), basepath_in
def create_output_dirs(im_files, basepath_in, output_dir):
"""Creates required directories, and returns output dir for each file."""
output_dirs = []
for i in range(len(im_files)):
relative_folder_in = os.path.relpath(
os.path.dirname(im_files[i]), basepath_in)
absolute_folder_out = os.path.join(output_dir, relative_folder_in)
if not gfile.IsDirectory(absolute_folder_out):
gfile.MakeDirs(absolute_folder_out)
output_dirs.append(absolute_folder_out)
return output_dirs
def _recursive_glob(treeroot, pattern):
results = []
for base, _, files in os.walk(treeroot):
files = fnmatch.filter(files, pattern)
results.extend(os.path.join(base, f) for f in files)
return results
def main(_):
#if (flags.input_dir is None) == (flags.input_list_file is None):
# raise ValueError('Exactly one of either input_dir or input_list_file has '
# 'to be provided.')
#if not flags.depth and not flags.egomotion:
# raise ValueError('At least one of the depth and egomotion network has to '
# 'be called for inference.')
#if (flags.inference_mode == inference_lib.INFERENCE_MODE_TRIPLETS and
# flags.seq_length != 3):
# raise ValueError('For sequence lengths other than three, single inference '
# 'mode has to be used.')
_run_inference(output_dir=FLAGS.output_dir,
file_extension=FLAGS.file_extension,
depth=FLAGS.depth,
egomotion=FLAGS.egomotion,
model_ckpt=FLAGS.model_ckpt,
input_dir=FLAGS.input_dir,
input_list_file=FLAGS.input_list_file,
batch_size=FLAGS.batch_size,
img_height=FLAGS.img_height,
img_width=FLAGS.img_width,
seq_length=FLAGS.seq_length,
architecture=FLAGS.architecture,
imagenet_norm=FLAGS.imagenet_norm,
use_skip=FLAGS.use_skip,
joint_encoder=FLAGS.joint_encoder,
shuffle=FLAGS.shuffle,
flip_for_depth=FLAGS.flip,
inference_mode=FLAGS.inference_mode,
inference_crop=FLAGS.inference_crop,
use_masks=FLAGS.use_masks)
if __name__ == '__main__':
app.run(main)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Build model for inference or training."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl import logging
import numpy as np
import tensorflow as tf
import nets
import project
import reader
import util
gfile = tf.gfile
slim = tf.contrib.slim
NUM_SCALES = 4
class Model(object):
"""Model code based on SfMLearner."""
def __init__(self,
data_dir=None,
file_extension='png',
is_training=True,
learning_rate=0.0002,
beta1=0.9,
reconstr_weight=0.85,
smooth_weight=0.05,
ssim_weight=0.15,
icp_weight=0.0,
batch_size=4,
img_height=128,
img_width=416,
seq_length=3,
architecture=nets.RESNET,
imagenet_norm=True,
weight_reg=0.05,
exhaustive_mode=False,
random_scale_crop=False,
flipping_mode=reader.FLIP_RANDOM,
random_color=True,
depth_upsampling=True,
depth_normalization=True,
compute_minimum_loss=True,
use_skip=True,
joint_encoder=True,
build_sum=True,
shuffle=True,
input_file='train',
handle_motion=False,
equal_weighting=False,
size_constraint_weight=0.0,
train_global_scale_var=True):
self.data_dir = data_dir
self.file_extension = file_extension
self.is_training = is_training
self.learning_rate = learning_rate
self.reconstr_weight = reconstr_weight
self.smooth_weight = smooth_weight
self.ssim_weight = ssim_weight
self.icp_weight = icp_weight
self.beta1 = beta1
self.batch_size = batch_size
self.img_height = img_height
self.img_width = img_width
self.seq_length = seq_length
self.architecture = architecture
self.imagenet_norm = imagenet_norm
self.weight_reg = weight_reg
self.exhaustive_mode = exhaustive_mode
self.random_scale_crop = random_scale_crop
self.flipping_mode = flipping_mode
self.random_color = random_color
self.depth_upsampling = depth_upsampling
self.depth_normalization = depth_normalization
self.compute_minimum_loss = compute_minimum_loss
self.use_skip = use_skip
self.joint_encoder = joint_encoder
self.build_sum = build_sum
self.shuffle = shuffle
self.input_file = input_file
self.handle_motion = handle_motion
self.equal_weighting = equal_weighting
self.size_constraint_weight = size_constraint_weight
self.train_global_scale_var = train_global_scale_var
logging.info('data_dir: %s', data_dir)
logging.info('file_extension: %s', file_extension)
logging.info('is_training: %s', is_training)
logging.info('learning_rate: %s', learning_rate)
logging.info('reconstr_weight: %s', reconstr_weight)
logging.info('smooth_weight: %s', smooth_weight)
logging.info('ssim_weight: %s', ssim_weight)
logging.info('icp_weight: %s', icp_weight)
logging.info('size_constraint_weight: %s', size_constraint_weight)
logging.info('beta1: %s', beta1)
logging.info('batch_size: %s', batch_size)
logging.info('img_height: %s', img_height)
logging.info('img_width: %s', img_width)
logging.info('seq_length: %s', seq_length)
logging.info('architecture: %s', architecture)
logging.info('imagenet_norm: %s', imagenet_norm)
logging.info('weight_reg: %s', weight_reg)
logging.info('exhaustive_mode: %s', exhaustive_mode)
logging.info('random_scale_crop: %s', random_scale_crop)
logging.info('flipping_mode: %s', flipping_mode)
logging.info('random_color: %s', random_color)
logging.info('depth_upsampling: %s', depth_upsampling)
logging.info('depth_normalization: %s', depth_normalization)
logging.info('compute_minimum_loss: %s', compute_minimum_loss)
logging.info('use_skip: %s', use_skip)
logging.info('joint_encoder: %s', joint_encoder)
logging.info('build_sum: %s', build_sum)
logging.info('shuffle: %s', shuffle)
logging.info('input_file: %s', input_file)
logging.info('handle_motion: %s', handle_motion)
logging.info('equal_weighting: %s', equal_weighting)
logging.info('train_global_scale_var: %s', train_global_scale_var)
if self.size_constraint_weight > 0 or not is_training:
self.global_scale_var = tf.Variable(
0.1, name='global_scale_var',
trainable=self.is_training and train_global_scale_var,
dtype=tf.float32,
constraint=lambda x: tf.clip_by_value(x, 0, np.infty))
if self.is_training:
self.reader = reader.DataReader(self.data_dir, self.batch_size,
self.img_height, self.img_width,
self.seq_length, NUM_SCALES,
self.file_extension,
self.random_scale_crop,
self.flipping_mode,
self.random_color,
self.imagenet_norm,
self.shuffle,
self.input_file)
self.build_train_graph()
else:
self.build_depth_test_graph()
self.build_egomotion_test_graph()
if self.handle_motion:
self.build_objectmotion_test_graph()
# At this point, the model is ready. Print some info on model params.
util.count_parameters()
def build_train_graph(self):
self.build_inference_for_training()
self.build_loss()
self.build_train_op()
if self.build_sum:
self.build_summaries()
def build_inference_for_training(self):
"""Invokes depth and ego-motion networks and computes clouds if needed."""
(self.image_stack, self.image_stack_norm, self.seg_stack,
self.intrinsic_mat, self.intrinsic_mat_inv) = self.reader.read_data()
with tf.variable_scope('depth_prediction'):
# Organized by ...[i][scale]. Note that the order is flipped in
# variables in build_loss() below.
self.disp = {}
self.depth = {}
self.depth_upsampled = {}
self.inf_loss = 0.0
# Organized by [i].
disp_bottlenecks = [None] * self.seq_length
if self.icp_weight > 0:
self.cloud = {}
for i in range(self.seq_length):
image = self.image_stack_norm[:, :, :, 3 * i:3 * (i + 1)]
multiscale_disps_i, disp_bottlenecks[i] = nets.disp_net(
self.architecture, image, self.use_skip,
self.weight_reg, True)
multiscale_depths_i = [1.0 / d for d in multiscale_disps_i]
self.disp[i] = multiscale_disps_i
self.depth[i] = multiscale_depths_i
if self.depth_upsampling:
self.depth_upsampled[i] = []
# Upsample low-resolution depth maps using differentiable bilinear
# interpolation.
for s in range(len(multiscale_depths_i)):
self.depth_upsampled[i].append(tf.image.resize_bilinear(
multiscale_depths_i[s], [self.img_height, self.img_width],
align_corners=True))
if self.icp_weight > 0:
multiscale_clouds_i = [
project.get_cloud(d,
self.intrinsic_mat_inv[:, s, :, :],
name='cloud%d_%d' % (s, i))
for (s, d) in enumerate(multiscale_depths_i)
]
self.cloud[i] = multiscale_clouds_i
# Reuse the same depth graph for all images.
tf.get_variable_scope().reuse_variables()
if self.handle_motion:
# Define egomotion network. This network can see the whole scene except
# for any moving objects as indicated by the provided segmentation masks.
# To avoid the network getting clues of motion by tracking those masks, we
# define the segmentation masks as the union temporally.
with tf.variable_scope('egomotion_prediction'):
base_input = self.image_stack_norm # (B, H, W, 9)
seg_input = self.seg_stack # (B, H, W, 9)
ref_zero = tf.constant(0, dtype=tf.uint8)
# Motion model is currently defined for three-frame sequences.
object_mask1 = tf.equal(seg_input[:, :, :, 0], ref_zero)
object_mask2 = tf.equal(seg_input[:, :, :, 3], ref_zero)
object_mask3 = tf.equal(seg_input[:, :, :, 6], ref_zero)
mask_complete = tf.expand_dims(tf.logical_and( # (B, H, W, 1)
tf.logical_and(object_mask1, object_mask2), object_mask3), axis=3)
mask_complete = tf.tile(mask_complete, (1, 1, 1, 9)) # (B, H, W, 9)
# Now mask out base_input.
self.mask_complete = tf.to_float(mask_complete)
self.base_input_masked = base_input * self.mask_complete
self.egomotion = nets.egomotion_net(
image_stack=self.base_input_masked,
disp_bottleneck_stack=None,
joint_encoder=False,
seq_length=self.seq_length,
weight_reg=self.weight_reg)
# Define object motion network for refinement. This network only sees
# one object at a time over the whole sequence, and tries to estimate its
# motion. The sequence of images are the respective warped frames.
# For each scale, contains batch_size elements of shape (N, 2, 6).
self.object_transforms = {}
# For each scale, contains batch_size elements of shape (N, H, W, 9).
self.object_masks = {}
self.object_masks_warped = {}
# For each scale, contains batch_size elements of size N.
self.object_ids = {}
self.egomotions_seq = {}
self.warped_seq = {}
self.inputs_objectmotion_net = {}
with tf.variable_scope('objectmotion_prediction'):
# First, warp raw images according to overall egomotion.
for s in range(NUM_SCALES):
self.warped_seq[s] = []
self.egomotions_seq[s] = []
for source_index in range(self.seq_length):
egomotion_mat_i_1 = project.get_transform_mat(
self.egomotion, source_index, 1)
warped_image_i_1, _ = (
project.inverse_warp(
self.image_stack[
:, :, :, source_index*3:(source_index+1)*3],
self.depth_upsampled[1][s],
egomotion_mat_i_1,
self.intrinsic_mat[:, 0, :, :],
self.intrinsic_mat_inv[:, 0, :, :]))
self.warped_seq[s].append(warped_image_i_1)
self.egomotions_seq[s].append(egomotion_mat_i_1)
# Second, for every object in the segmentation mask, take its mask and
# warp it according to the egomotion estimate. Then put a threshold to
# binarize the warped result. Use this mask to mask out background and
# other objects, and pass the filtered image to the object motion
# network.
self.object_transforms[s] = []
self.object_masks[s] = []
self.object_ids[s] = []
self.object_masks_warped[s] = []
self.inputs_objectmotion_net[s] = {}
for i in range(self.batch_size):
seg_sequence = self.seg_stack[i] # (H, W, 9=3*3)
object_ids = tf.unique(tf.reshape(seg_sequence, [-1]))[0]
self.object_ids[s].append(object_ids)
color_stack = []
mask_stack = []
mask_stack_warped = []
for j in range(self.seq_length):
current_image = self.warped_seq[s][j][i] # (H, W, 3)
current_seg = seg_sequence[:, :, j * 3:(j+1) * 3] # (H, W, 3)
def process_obj_mask_warp(obj_id):
"""Performs warping of the individual object masks."""
obj_mask = tf.to_float(tf.equal(current_seg, obj_id))
# Warp obj_mask according to overall egomotion.
obj_mask_warped, _ = (
project.inverse_warp(
tf.expand_dims(obj_mask, axis=0),
# Middle frame, highest scale, batch element i:
tf.expand_dims(self.depth_upsampled[1][s][i], axis=0),
# Matrix for warping j into middle frame, batch elem. i:
tf.expand_dims(self.egomotions_seq[s][j][i], axis=0),
tf.expand_dims(self.intrinsic_mat[i, 0, :, :], axis=0),
tf.expand_dims(self.intrinsic_mat_inv[i, 0, :, :],
axis=0)))
obj_mask_warped = tf.squeeze(obj_mask_warped)
obj_mask_binarized = tf.greater( # Threshold to binarize mask.
obj_mask_warped, tf.constant(0.5))
return tf.to_float(obj_mask_binarized)
def process_obj_mask(obj_id):
"""Returns the individual object masks separately."""
return tf.to_float(tf.equal(current_seg, obj_id))
object_masks = tf.map_fn( # (N, H, W, 3)
process_obj_mask, object_ids, dtype=tf.float32)
if self.size_constraint_weight > 0:
# The object segmentation masks are all in object_masks.
# We need to measure the height of every of them, and get the
# approximate distance.
# self.depth_upsampled of shape (seq_length, scale, B, H, W).
depth_pred = self.depth_upsampled[j][s][i] # (H, W)
def get_losses(obj_mask):
"""Get motion constraint loss."""
# Find height of segment.
coords = tf.where(tf.greater( # Shape (num_true, 2=yx)
obj_mask[:, :, 0], tf.constant(0.5, dtype=tf.float32)))
y_max = tf.reduce_max(coords[:, 0])
y_min = tf.reduce_min(coords[:, 0])
seg_height = y_max - y_min
f_y = self.intrinsic_mat[i, 0, 1, 1]
approx_depth = ((f_y * self.global_scale_var) /
tf.to_float(seg_height))
reference_pred = tf.boolean_mask(
depth_pred, tf.greater(
tf.reshape(obj_mask[:, :, 0],
(self.img_height, self.img_width, 1)),
tf.constant(0.5, dtype=tf.float32)))
# Establish loss on approx_depth, a scalar, and
# reference_pred, our dense prediction. Normalize both to
# prevent degenerative depth shrinking.
global_mean_depth_pred = tf.reduce_mean(depth_pred)
reference_pred /= global_mean_depth_pred
approx_depth /= global_mean_depth_pred
spatial_err = tf.abs(reference_pred - approx_depth)
mean_spatial_err = tf.reduce_mean(spatial_err)
return mean_spatial_err
losses = tf.map_fn(
get_losses, object_masks, dtype=tf.float32)
self.inf_loss += tf.reduce_mean(losses)
object_masks_warped = tf.map_fn( # (N, H, W, 3)
process_obj_mask_warp, object_ids, dtype=tf.float32)
filtered_images = tf.map_fn(
lambda mask: current_image * mask, object_masks_warped,
dtype=tf.float32) # (N, H, W, 3)
color_stack.append(filtered_images)
mask_stack.append(object_masks)
mask_stack_warped.append(object_masks_warped)
# For this batch-element, if there are N moving objects,
# color_stack, mask_stack and mask_stack_warped contain both
# seq_length elements of shape (N, H, W, 3).
# We can now concatenate them on the last axis, creating a tensor of
# (N, H, W, 3*3 = 9), and, assuming N does not get too large so that
# we have enough memory, pass them in a single batch to the object
# motion network.
mask_stack = tf.concat(mask_stack, axis=3) # (N, H, W, 9)
mask_stack_warped = tf.concat(mask_stack_warped, axis=3)
color_stack = tf.concat(color_stack, axis=3) # (N, H, W, 9)
all_transforms = nets.objectmotion_net(
# We cut the gradient flow here as the object motion gradient
# should have no saying in how the egomotion network behaves.
# One could try just stopping the gradient for egomotion, but
# not for the depth prediction network.
image_stack=tf.stop_gradient(color_stack),
disp_bottleneck_stack=None,
joint_encoder=False, # Joint encoder not supported.
seq_length=self.seq_length,
weight_reg=self.weight_reg)
# all_transforms of shape (N, 2, 6).
self.object_transforms[s].append(all_transforms)
self.object_masks[s].append(mask_stack)
self.object_masks_warped[s].append(mask_stack_warped)
self.inputs_objectmotion_net[s][i] = color_stack
tf.get_variable_scope().reuse_variables()
else:
# Don't handle motion, classic model formulation.
with tf.name_scope('egomotion_prediction'):
if self.joint_encoder:
# Re-arrange disp_bottleneck_stack to be of shape
# [B, h_hid, w_hid, c_hid * seq_length]. Currently, it is a list with
# seq_length elements, each of dimension [B, h_hid, w_hid, c_hid].
disp_bottleneck_stack = tf.concat(disp_bottlenecks, axis=3)
else:
disp_bottleneck_stack = None
self.egomotion = nets.egomotion_net(
image_stack=self.image_stack_norm,
disp_bottleneck_stack=disp_bottleneck_stack,
joint_encoder=self.joint_encoder,
seq_length=self.seq_length,
weight_reg=self.weight_reg)
def build_loss(self):
"""Adds ops for computing loss."""
with tf.name_scope('compute_loss'):
self.reconstr_loss = 0
self.smooth_loss = 0
self.ssim_loss = 0
self.icp_transform_loss = 0
self.icp_residual_loss = 0
# self.images is organized by ...[scale][B, h, w, seq_len * 3].
self.images = [None for _ in range(NUM_SCALES)]
# Following nested lists are organized by ...[scale][source-target].
self.warped_image = [{} for _ in range(NUM_SCALES)]
self.warp_mask = [{} for _ in range(NUM_SCALES)]
self.warp_error = [{} for _ in range(NUM_SCALES)]
self.ssim_error = [{} for _ in range(NUM_SCALES)]
self.icp_transform = [{} for _ in range(NUM_SCALES)]
self.icp_residual = [{} for _ in range(NUM_SCALES)]
self.middle_frame_index = util.get_seq_middle(self.seq_length)
# Compute losses at each scale.
for s in range(NUM_SCALES):
# Scale image stack.
if s == 0: # Just as a precaution. TF often has interpolation bugs.
self.images[s] = self.image_stack
else:
height_s = int(self.img_height / (2**s))
width_s = int(self.img_width / (2**s))
self.images[s] = tf.image.resize_bilinear(
self.image_stack, [height_s, width_s], align_corners=True)
# Smoothness.
if self.smooth_weight > 0:
for i in range(self.seq_length):
# When computing minimum loss, use the depth map from the middle
# frame only.
if not self.compute_minimum_loss or i == self.middle_frame_index:
disp_smoothing = self.disp[i][s]
if self.depth_normalization:
# Perform depth normalization, dividing by the mean.
mean_disp = tf.reduce_mean(disp_smoothing, axis=[1, 2, 3],
keep_dims=True)
disp_input = disp_smoothing / mean_disp
else:
disp_input = disp_smoothing
scaling_f = (1.0 if self.equal_weighting else 1.0 / (2**s))
self.smooth_loss += scaling_f * self.depth_smoothness(
disp_input, self.images[s][:, :, :, 3 * i:3 * (i + 1)])
self.debug_all_warped_image_batches = []
for i in range(self.seq_length):
for j in range(self.seq_length):
if i == j:
continue
# When computing minimum loss, only consider the middle frame as
# target.
if self.compute_minimum_loss and j != self.middle_frame_index:
continue
# We only consider adjacent frames, unless either
# compute_minimum_loss is on (where the middle frame is matched with
# all other frames) or exhaustive_mode is on (where all frames are
# matched with each other).
if (not self.compute_minimum_loss and not self.exhaustive_mode and
abs(i - j) != 1):
continue
selected_scale = 0 if self.depth_upsampling else s
source = self.images[selected_scale][:, :, :, 3 * i:3 * (i + 1)]
target = self.images[selected_scale][:, :, :, 3 * j:3 * (j + 1)]
if self.depth_upsampling:
target_depth = self.depth_upsampled[j][s]
else:
target_depth = self.depth[j][s]
key = '%d-%d' % (i, j)
if self.handle_motion:
# self.seg_stack of shape (B, H, W, 9).
# target_depth corresponds to middle frame, of shape (B, H, W, 1).
# Now incorporate the other warping results, performed according
# to the object motion network's predictions.
# self.object_masks batch_size elements of (N, H, W, 9).
# self.object_masks_warped batch_size elements of (N, H, W, 9).
# self.object_transforms batch_size elements of (N, 2, 6).
self.all_batches = []
for batch_s in range(self.batch_size):
# To warp i into j, first take the base warping (this is the
# full image i warped into j using only the egomotion estimate).
base_warping = self.warped_seq[s][i][batch_s]
transform_matrices_thisbatch = tf.map_fn(
lambda transform: project.get_transform_mat(
tf.expand_dims(transform, axis=0), i, j)[0],
self.object_transforms[0][batch_s])
def inverse_warp_wrapper(matrix):
"""Wrapper for inverse warping method."""
warp_image, _ = (
project.inverse_warp(
tf.expand_dims(base_warping, axis=0),
tf.expand_dims(target_depth[batch_s], axis=0),
tf.expand_dims(matrix, axis=0),
tf.expand_dims(self.intrinsic_mat[
batch_s, selected_scale, :, :], axis=0),
tf.expand_dims(self.intrinsic_mat_inv[
batch_s, selected_scale, :, :], axis=0)))
return warp_image
warped_images_thisbatch = tf.map_fn(
inverse_warp_wrapper, transform_matrices_thisbatch,
dtype=tf.float32)
warped_images_thisbatch = warped_images_thisbatch[:, 0, :, :, :]
# warped_images_thisbatch is now of shape (N, H, W, 9).
# Combine warped frames into a single one, using the object
# masks. Result should be (1, 128, 416, 3).
# Essentially, we here want to sum them all up, filtered by the
# respective object masks.
mask_base_valid_source = tf.equal(
self.seg_stack[batch_s, :, :, i*3:(i+1)*3],
tf.constant(0, dtype=tf.uint8))
mask_base_valid_target = tf.equal(
self.seg_stack[batch_s, :, :, j*3:(j+1)*3],
tf.constant(0, dtype=tf.uint8))
mask_valid = tf.logical_and(
mask_base_valid_source, mask_base_valid_target)
self.base_warping = base_warping * tf.to_float(mask_valid)
background = tf.expand_dims(self.base_warping, axis=0)
def construct_const_filter_tensor(obj_id):
return tf.fill(
dims=[self.img_height, self.img_width, 3],
value=tf.sign(obj_id)) * tf.to_float(
tf.equal(self.seg_stack[batch_s, :, :, 3:6],
tf.cast(obj_id, dtype=tf.uint8)))
filter_tensor = tf.map_fn(
construct_const_filter_tensor,
tf.to_float(self.object_ids[s][batch_s]))
filter_tensor = tf.stack(filter_tensor, axis=0)
objects_to_add = tf.reduce_sum(
tf.multiply(warped_images_thisbatch, filter_tensor),
axis=0, keepdims=True)
combined = background + objects_to_add
self.all_batches.append(combined)
# Now of shape (B, 128, 416, 3).
self.warped_image[s][key] = tf.concat(self.all_batches, axis=0)
else:
# Don't handle motion, classic model formulation.
egomotion_mat_i_j = project.get_transform_mat(
self.egomotion, i, j)
# Inverse warp the source image to the target image frame for
# photometric consistency loss.
self.warped_image[s][key], self.warp_mask[s][key] = (
project.inverse_warp(
source,
target_depth,
egomotion_mat_i_j,
self.intrinsic_mat[:, selected_scale, :, :],
self.intrinsic_mat_inv[:, selected_scale, :, :]))
# Reconstruction loss.
self.warp_error[s][key] = tf.abs(self.warped_image[s][key] - target)
if not self.compute_minimum_loss:
self.reconstr_loss += tf.reduce_mean(
self.warp_error[s][key] * self.warp_mask[s][key])
# SSIM.
if self.ssim_weight > 0:
self.ssim_error[s][key] = self.ssim(self.warped_image[s][key],
target)
# TODO(rezama): This should be min_pool2d().
if not self.compute_minimum_loss:
ssim_mask = slim.avg_pool2d(self.warp_mask[s][key], 3, 1,
'VALID')
self.ssim_loss += tf.reduce_mean(
self.ssim_error[s][key] * ssim_mask)
# If the minimum loss should be computed, the loss calculation has been
# postponed until here.
if self.compute_minimum_loss:
for frame_index in range(self.middle_frame_index):
key1 = '%d-%d' % (frame_index, self.middle_frame_index)
key2 = '%d-%d' % (self.seq_length - frame_index - 1,
self.middle_frame_index)
logging.info('computing min error between %s and %s', key1, key2)
min_error = tf.minimum(self.warp_error[s][key1],
self.warp_error[s][key2])
self.reconstr_loss += tf.reduce_mean(min_error)
if self.ssim_weight > 0: # Also compute the minimum SSIM loss.
min_error_ssim = tf.minimum(self.ssim_error[s][key1],
self.ssim_error[s][key2])
self.ssim_loss += tf.reduce_mean(min_error_ssim)
# Build the total loss as composed of L1 reconstruction, SSIM, smoothing
# and object size constraint loss as appropriate.
self.reconstr_loss *= self.reconstr_weight
self.total_loss = self.reconstr_loss
if self.smooth_weight > 0:
self.smooth_loss *= self.smooth_weight
self.total_loss += self.smooth_loss
if self.ssim_weight > 0:
self.ssim_loss *= self.ssim_weight
self.total_loss += self.ssim_loss
if self.size_constraint_weight > 0:
self.inf_loss *= self.size_constraint_weight
self.total_loss += self.inf_loss
def gradient_x(self, img):
return img[:, :, :-1, :] - img[:, :, 1:, :]
def gradient_y(self, img):
return img[:, :-1, :, :] - img[:, 1:, :, :]
def depth_smoothness(self, depth, img):
"""Computes image-aware depth smoothness loss."""
depth_dx = self.gradient_x(depth)
depth_dy = self.gradient_y(depth)
image_dx = self.gradient_x(img)
image_dy = self.gradient_y(img)
weights_x = tf.exp(-tf.reduce_mean(tf.abs(image_dx), 3, keepdims=True))
weights_y = tf.exp(-tf.reduce_mean(tf.abs(image_dy), 3, keepdims=True))
smoothness_x = depth_dx * weights_x
smoothness_y = depth_dy * weights_y
return tf.reduce_mean(abs(smoothness_x)) + tf.reduce_mean(abs(smoothness_y))
def ssim(self, x, y):
"""Computes a differentiable structured image similarity measure."""
c1 = 0.01**2 # As defined in SSIM to stabilize div. by small denominator.
c2 = 0.03**2
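    # These are the standard SSIM constants c1 = (k1 * L)^2 and
    # c2 = (k2 * L)^2 with k1 = 0.01, k2 = 0.03 and dynamic range L = 1.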
mu_x = slim.avg_pool2d(x, 3, 1, 'VALID')
mu_y = slim.avg_pool2d(y, 3, 1, 'VALID')
sigma_x = slim.avg_pool2d(x**2, 3, 1, 'VALID') - mu_x**2
sigma_y = slim.avg_pool2d(y**2, 3, 1, 'VALID') - mu_y**2
sigma_xy = slim.avg_pool2d(x * y, 3, 1, 'VALID') - mu_x * mu_y
ssim_n = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
ssim_d = (mu_x**2 + mu_y**2 + c1) * (sigma_x + sigma_y + c2)
ssim = ssim_n / ssim_d
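    # Map the similarity from [-1, 1] to a dissimilarity loss in [0, 1].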
return tf.clip_by_value((1 - ssim) / 2, 0, 1)
def build_train_op(self):
with tf.name_scope('train_op'):
optim = tf.train.AdamOptimizer(self.learning_rate, self.beta1)
self.train_op = slim.learning.create_train_op(self.total_loss, optim)
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.incr_global_step = tf.assign(
self.global_step, self.global_step + 1)
def build_summaries(self):
"""Adds scalar and image summaries for TensorBoard."""
tf.summary.scalar('total_loss', self.total_loss)
tf.summary.scalar('reconstr_loss', self.reconstr_loss)
if self.smooth_weight > 0:
tf.summary.scalar('smooth_loss', self.smooth_loss)
if self.ssim_weight > 0:
tf.summary.scalar('ssim_loss', self.ssim_loss)
if self.icp_weight > 0:
tf.summary.scalar('icp_transform_loss', self.icp_transform_loss)
tf.summary.scalar('icp_residual_loss', self.icp_residual_loss)
if self.size_constraint_weight > 0:
tf.summary.scalar('inf_loss', self.inf_loss)
tf.summary.histogram('global_scale_var', self.global_scale_var)
if self.handle_motion:
for s in range(NUM_SCALES):
for batch_s in range(self.batch_size):
whole_strip = tf.concat([self.warped_seq[s][0][batch_s],
self.warped_seq[s][1][batch_s],
self.warped_seq[s][2][batch_s]], axis=1)
tf.summary.image('base_warp_batch%s_scale%s' % (batch_s, s),
tf.expand_dims(whole_strip, axis=0))
whole_strip_input = tf.concat(
[self.inputs_objectmotion_net[s][batch_s][:, :, :, 0:3],
self.inputs_objectmotion_net[s][batch_s][:, :, :, 3:6],
self.inputs_objectmotion_net[s][batch_s][:, :, :, 6:9]], axis=2)
tf.summary.image('input_objectmotion_batch%s_scale%s' % (batch_s, s),
whole_strip_input) # (B, H, 3*W, 3)
for batch_s in range(self.batch_size):
whole_strip = tf.concat([self.base_input_masked[batch_s, :, :, 0:3],
self.base_input_masked[batch_s, :, :, 3:6],
self.base_input_masked[batch_s, :, :, 6:9]],
axis=1)
tf.summary.image('input_egomotion_batch%s' % batch_s,
tf.expand_dims(whole_strip, axis=0))
# Show transform predictions (of all objects).
for batch_s in range(self.batch_size):
for i in range(self.seq_length - 1):
# self.object_transforms contains batch_size elements of (N, 2, 6).
tf.summary.histogram('batch%d_tx%d' % (batch_s, i),
self.object_transforms[0][batch_s][:, i, 0])
tf.summary.histogram('batch%d_ty%d' % (batch_s, i),
self.object_transforms[0][batch_s][:, i, 1])
tf.summary.histogram('batch%d_tz%d' % (batch_s, i),
self.object_transforms[0][batch_s][:, i, 2])
tf.summary.histogram('batch%d_rx%d' % (batch_s, i),
self.object_transforms[0][batch_s][:, i, 3])
tf.summary.histogram('batch%d_ry%d' % (batch_s, i),
self.object_transforms[0][batch_s][:, i, 4])
tf.summary.histogram('batch%d_rz%d' % (batch_s, i),
self.object_transforms[0][batch_s][:, i, 5])
for i in range(self.seq_length - 1):
tf.summary.histogram('tx%d' % i, self.egomotion[:, i, 0])
tf.summary.histogram('ty%d' % i, self.egomotion[:, i, 1])
tf.summary.histogram('tz%d' % i, self.egomotion[:, i, 2])
tf.summary.histogram('rx%d' % i, self.egomotion[:, i, 3])
tf.summary.histogram('ry%d' % i, self.egomotion[:, i, 4])
tf.summary.histogram('rz%d' % i, self.egomotion[:, i, 5])
for s in range(NUM_SCALES):
for i in range(self.seq_length):
tf.summary.image('scale%d_image%d' % (s, i),
self.images[s][:, :, :, 3 * i:3 * (i + 1)])
if i in self.depth:
tf.summary.histogram('scale%d_depth%d' % (s, i), self.depth[i][s])
tf.summary.histogram('scale%d_disp%d' % (s, i), self.disp[i][s])
tf.summary.image('scale%d_disparity%d' % (s, i), self.disp[i][s])
for key in self.warped_image[s]:
tf.summary.image('scale%d_warped_image%s' % (s, key),
self.warped_image[s][key])
tf.summary.image('scale%d_warp_error%s' % (s, key),
self.warp_error[s][key])
if self.ssim_weight > 0:
tf.summary.image('scale%d_ssim_error%s' % (s, key),
self.ssim_error[s][key])
if self.icp_weight > 0:
tf.summary.image('scale%d_icp_residual%s' % (s, key),
self.icp_residual[s][key])
transform = self.icp_transform[s][key]
tf.summary.histogram('scale%d_icp_tx%s' % (s, key), transform[:, 0])
tf.summary.histogram('scale%d_icp_ty%s' % (s, key), transform[:, 1])
tf.summary.histogram('scale%d_icp_tz%s' % (s, key), transform[:, 2])
tf.summary.histogram('scale%d_icp_rx%s' % (s, key), transform[:, 3])
tf.summary.histogram('scale%d_icp_ry%s' % (s, key), transform[:, 4])
tf.summary.histogram('scale%d_icp_rz%s' % (s, key), transform[:, 5])
def build_depth_test_graph(self):
"""Builds depth model reading from placeholders."""
with tf.variable_scope('depth_prediction'):
input_image = tf.placeholder(
tf.float32, [self.batch_size, self.img_height, self.img_width, 3],
name='raw_input')
if self.imagenet_norm:
input_image = (input_image - reader.IMAGENET_MEAN) / reader.IMAGENET_SD
est_disp, _ = nets.disp_net(architecture=self.architecture,
image=input_image,
use_skip=self.use_skip,
weight_reg=self.weight_reg,
is_training=True)
est_depth = 1.0 / est_disp[0]
self.input_image = input_image
self.est_depth = est_depth
def build_egomotion_test_graph(self):
"""Builds egomotion model reading from placeholders."""
input_image_stack = tf.placeholder(
tf.float32,
[1, self.img_height, self.img_width, self.seq_length * 3],
name='raw_input')
input_bottleneck_stack = None
if self.imagenet_norm:
im_mean = tf.tile(
tf.constant(reader.IMAGENET_MEAN), multiples=[self.seq_length])
im_sd = tf.tile(
tf.constant(reader.IMAGENET_SD), multiples=[self.seq_length])
input_image_stack = (input_image_stack - im_mean) / im_sd
if self.joint_encoder:
# Pre-compute embeddings here.
with tf.variable_scope('depth_prediction', reuse=True):
input_bottleneck_stack = []
encoder_selected = nets.encoder(self.architecture)
for i in range(self.seq_length):
input_image = input_image_stack[:, :, :, i * 3:(i + 1) * 3]
tf.get_variable_scope().reuse_variables()
embedding, _ = encoder_selected(
target_image=input_image,
weight_reg=self.weight_reg,
is_training=True)
input_bottleneck_stack.append(embedding)
input_bottleneck_stack = tf.concat(input_bottleneck_stack, axis=3)
with tf.variable_scope('egomotion_prediction'):
est_egomotion = nets.egomotion_net(
image_stack=input_image_stack,
disp_bottleneck_stack=input_bottleneck_stack,
joint_encoder=self.joint_encoder,
seq_length=self.seq_length,
weight_reg=self.weight_reg)
self.input_image_stack = input_image_stack
self.est_egomotion = est_egomotion
def build_objectmotion_test_graph(self):
"""Builds egomotion model reading from placeholders."""
input_image_stack_om = tf.placeholder(
tf.float32,
[1, self.img_height, self.img_width, self.seq_length * 3],
name='raw_input')
if self.imagenet_norm:
im_mean = tf.tile(
tf.constant(reader.IMAGENET_MEAN), multiples=[self.seq_length])
im_sd = tf.tile(
tf.constant(reader.IMAGENET_SD), multiples=[self.seq_length])
input_image_stack_om = (input_image_stack_om - im_mean) / im_sd
with tf.variable_scope('objectmotion_prediction'):
est_objectmotion = nets.objectmotion_net(
image_stack=input_image_stack_om,
disp_bottleneck_stack=None,
joint_encoder=self.joint_encoder,
seq_length=self.seq_length,
weight_reg=self.weight_reg)
self.input_image_stack_om = input_image_stack_om
self.est_objectmotion = est_objectmotion
def inference_depth(self, inputs, sess):
return sess.run(self.est_depth, feed_dict={self.input_image: inputs})
def inference_egomotion(self, inputs, sess):
return sess.run(
self.est_egomotion, feed_dict={self.input_image_stack: inputs})
def inference_objectmotion(self, inputs, sess):
return sess.run(
self.est_objectmotion, feed_dict={self.input_image_stack_om: inputs})
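  # Illustrative usage note (not part of the pipeline itself): given a Model
  # instance `m` built for inference and a tf.Session `sess` with restored
  # weights, depth and egomotion predictions can be obtained via
  # `m.inference_depth(images, sess)` and
  # `m.inference_egomotion(image_stack, sess)`, respectively.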
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Depth and Ego-Motion networks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import tensorflow as tf
slim = tf.contrib.slim
SIMPLE = 'simple'
RESNET = 'resnet'
ARCHITECTURES = [SIMPLE, RESNET]
SCALE_TRANSLATION = 0.001
SCALE_ROTATION = 0.01
# Disparity (inverse depth) values range from 0.01 to 10. Note that effectively,
# this is undone if depth normalization is used, which scales the values to
# have a mean of 1.
DISP_SCALING = 10
MIN_DISP = 0.01
WEIGHT_DECAY_KEY = 'WEIGHT_DECAY'
EGOMOTION_VEC_SIZE = 6
def egomotion_net(image_stack, disp_bottleneck_stack, joint_encoder, seq_length,
weight_reg):
"""Predict ego-motion vectors from a stack of frames or embeddings.
Args:
image_stack: Input tensor with shape [B, h, w, seq_length * 3] in order.
disp_bottleneck_stack: Input tensor with shape [B, h_hidden, w_hidden,
seq_length * c_hidden] in order.
joint_encoder: Determines if the same encoder is used for computing the
bottleneck layer of both the egomotion and the depth prediction
network. If enabled, disp_bottleneck_stack is used as input, and the
encoding steps are skipped. If disabled, a separate encoder is defined
on image_stack.
seq_length: The sequence length used.
weight_reg: The amount of weight regularization.
Returns:
Egomotion vectors with shape [B, seq_length - 1, 6].
"""
num_egomotion_vecs = seq_length - 1
with tf.variable_scope('pose_exp_net') as sc:
end_points_collection = sc.original_name_scope + '_end_points'
with slim.arg_scope([slim.conv2d, slim.conv2d_transpose],
normalizer_fn=None,
weights_regularizer=slim.l2_regularizer(weight_reg),
normalizer_params=None,
activation_fn=tf.nn.relu,
outputs_collections=end_points_collection):
if not joint_encoder:
# Define separate encoder. If sharing, we can skip the encoding step,
# as the bottleneck layer will already be passed as input.
cnv1 = slim.conv2d(image_stack, 16, [7, 7], stride=2, scope='cnv1')
cnv2 = slim.conv2d(cnv1, 32, [5, 5], stride=2, scope='cnv2')
cnv3 = slim.conv2d(cnv2, 64, [3, 3], stride=2, scope='cnv3')
cnv4 = slim.conv2d(cnv3, 128, [3, 3], stride=2, scope='cnv4')
cnv5 = slim.conv2d(cnv4, 256, [3, 3], stride=2, scope='cnv5')
with tf.variable_scope('pose'):
inputs = disp_bottleneck_stack if joint_encoder else cnv5
cnv6 = slim.conv2d(inputs, 256, [3, 3], stride=2, scope='cnv6')
cnv7 = slim.conv2d(cnv6, 256, [3, 3], stride=2, scope='cnv7')
pred_channels = EGOMOTION_VEC_SIZE * num_egomotion_vecs
egomotion_pred = slim.conv2d(cnv7, pred_channels, [1, 1], scope='pred',
stride=1, normalizer_fn=None,
activation_fn=None)
egomotion_avg = tf.reduce_mean(egomotion_pred, [1, 2])
egomotion_res = tf.reshape(
egomotion_avg, [-1, num_egomotion_vecs, EGOMOTION_VEC_SIZE])
# Tinghui found that scaling by a small constant facilitates training.
egomotion_scaled = tf.concat([egomotion_res[:, 0:3] * SCALE_TRANSLATION,
egomotion_res[:, 3:6] * SCALE_ROTATION],
axis=1)
return egomotion_scaled
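# NOTE: objectmotion_net below mirrors egomotion_net architecturally. It is
# kept as a separate function so that callers can place it under a different
# variable scope (e.g. 'objectmotion_prediction' instead of
# 'egomotion_prediction') and learn object-motion weights independently of
# the egomotion weights.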
def objectmotion_net(image_stack, disp_bottleneck_stack, joint_encoder,
seq_length, weight_reg):
"""Predict object-motion vectors from a stack of frames or embeddings.
Args:
image_stack: Input tensor with shape [B, h, w, seq_length * 3] in order.
disp_bottleneck_stack: Input tensor with shape [B, h_hidden, w_hidden,
seq_length * c_hidden] in order.
joint_encoder: Determines if the same encoder is used for computing the
bottleneck layer of both the egomotion and the depth prediction
network. If enabled, disp_bottleneck_stack is used as input, and the
encoding steps are skipped. If disabled, a separate encoder is defined
on image_stack.
seq_length: The sequence length used.
weight_reg: The amount of weight regularization.
Returns:
    Object-motion vectors with shape [B, seq_length - 1, 6].
"""
num_egomotion_vecs = seq_length - 1
with tf.variable_scope('pose_exp_net') as sc:
end_points_collection = sc.original_name_scope + '_end_points'
with slim.arg_scope([slim.conv2d, slim.conv2d_transpose],
normalizer_fn=None,
weights_regularizer=slim.l2_regularizer(weight_reg),
normalizer_params=None,
activation_fn=tf.nn.relu,
outputs_collections=end_points_collection):
if not joint_encoder:
# Define separate encoder. If sharing, we can skip the encoding step,
# as the bottleneck layer will already be passed as input.
cnv1 = slim.conv2d(image_stack, 16, [7, 7], stride=2, scope='cnv1')
cnv2 = slim.conv2d(cnv1, 32, [5, 5], stride=2, scope='cnv2')
cnv3 = slim.conv2d(cnv2, 64, [3, 3], stride=2, scope='cnv3')
cnv4 = slim.conv2d(cnv3, 128, [3, 3], stride=2, scope='cnv4')
cnv5 = slim.conv2d(cnv4, 256, [3, 3], stride=2, scope='cnv5')
with tf.variable_scope('pose'):
inputs = disp_bottleneck_stack if joint_encoder else cnv5
cnv6 = slim.conv2d(inputs, 256, [3, 3], stride=2, scope='cnv6')
cnv7 = slim.conv2d(cnv6, 256, [3, 3], stride=2, scope='cnv7')
pred_channels = EGOMOTION_VEC_SIZE * num_egomotion_vecs
egomotion_pred = slim.conv2d(cnv7, pred_channels, [1, 1], scope='pred',
stride=1, normalizer_fn=None,
activation_fn=None)
egomotion_avg = tf.reduce_mean(egomotion_pred, [1, 2])
egomotion_res = tf.reshape(
egomotion_avg, [-1, num_egomotion_vecs, EGOMOTION_VEC_SIZE])
# Tinghui found that scaling by a small constant facilitates training.
egomotion_scaled = tf.concat([egomotion_res[:, 0:3] * SCALE_TRANSLATION,
egomotion_res[:, 3:6] * SCALE_ROTATION],
axis=1)
return egomotion_scaled
def disp_net(architecture, image, use_skip, weight_reg, is_training):
"""Defines an encoder-decoder architecture for depth prediction."""
if architecture not in ARCHITECTURES:
raise ValueError('Unknown architecture.')
encoder_selected = encoder(architecture)
decoder_selected = decoder(architecture)
# Encode image.
bottleneck, skip_connections = encoder_selected(image, weight_reg,
is_training)
# Decode to depth.
multiscale_disps_i = decoder_selected(target_image=image,
bottleneck=bottleneck,
weight_reg=weight_reg,
use_skip=use_skip,
skip_connections=skip_connections)
return multiscale_disps_i, bottleneck
def encoder(architecture):
return encoder_resnet if architecture == RESNET else encoder_simple
def decoder(architecture):
return decoder_resnet if architecture == RESNET else decoder_simple
def encoder_simple(target_image, weight_reg, is_training):
"""Defines the old encoding architecture."""
del is_training
with slim.arg_scope([slim.conv2d],
normalizer_fn=None,
normalizer_params=None,
weights_regularizer=slim.l2_regularizer(weight_reg),
activation_fn=tf.nn.relu):
# Define (joint) encoder.
cnv1 = slim.conv2d(target_image, 32, [7, 7], stride=2, scope='cnv1')
cnv1b = slim.conv2d(cnv1, 32, [7, 7], stride=1, scope='cnv1b')
cnv2 = slim.conv2d(cnv1b, 64, [5, 5], stride=2, scope='cnv2')
cnv2b = slim.conv2d(cnv2, 64, [5, 5], stride=1, scope='cnv2b')
cnv3 = slim.conv2d(cnv2b, 128, [3, 3], stride=2, scope='cnv3')
cnv3b = slim.conv2d(cnv3, 128, [3, 3], stride=1, scope='cnv3b')
cnv4 = slim.conv2d(cnv3b, 256, [3, 3], stride=2, scope='cnv4')
cnv4b = slim.conv2d(cnv4, 256, [3, 3], stride=1, scope='cnv4b')
cnv5 = slim.conv2d(cnv4b, 512, [3, 3], stride=2, scope='cnv5')
cnv5b = slim.conv2d(cnv5, 512, [3, 3], stride=1, scope='cnv5b')
cnv6 = slim.conv2d(cnv5b, 512, [3, 3], stride=2, scope='cnv6')
cnv6b = slim.conv2d(cnv6, 512, [3, 3], stride=1, scope='cnv6b')
cnv7 = slim.conv2d(cnv6b, 512, [3, 3], stride=2, scope='cnv7')
cnv7b = slim.conv2d(cnv7, 512, [3, 3], stride=1, scope='cnv7b')
return cnv7b, (cnv6b, cnv5b, cnv4b, cnv3b, cnv2b, cnv1b)
def decoder_simple(target_image, bottleneck, weight_reg, use_skip,
skip_connections):
"""Defines the old depth decoder architecture."""
h = target_image.get_shape()[1].value
w = target_image.get_shape()[2].value
(cnv6b, cnv5b, cnv4b, cnv3b, cnv2b, cnv1b) = skip_connections
with slim.arg_scope([slim.conv2d, slim.conv2d_transpose],
normalizer_fn=None,
normalizer_params=None,
weights_regularizer=slim.l2_regularizer(weight_reg),
activation_fn=tf.nn.relu):
up7 = slim.conv2d_transpose(bottleneck, 512, [3, 3], stride=2,
scope='upcnv7')
up7 = _resize_like(up7, cnv6b)
if use_skip:
i7_in = tf.concat([up7, cnv6b], axis=3)
else:
i7_in = up7
icnv7 = slim.conv2d(i7_in, 512, [3, 3], stride=1, scope='icnv7')
up6 = slim.conv2d_transpose(icnv7, 512, [3, 3], stride=2, scope='upcnv6')
up6 = _resize_like(up6, cnv5b)
if use_skip:
i6_in = tf.concat([up6, cnv5b], axis=3)
else:
i6_in = up6
icnv6 = slim.conv2d(i6_in, 512, [3, 3], stride=1, scope='icnv6')
up5 = slim.conv2d_transpose(icnv6, 256, [3, 3], stride=2, scope='upcnv5')
up5 = _resize_like(up5, cnv4b)
if use_skip:
i5_in = tf.concat([up5, cnv4b], axis=3)
else:
i5_in = up5
icnv5 = slim.conv2d(i5_in, 256, [3, 3], stride=1, scope='icnv5')
up4 = slim.conv2d_transpose(icnv5, 128, [3, 3], stride=2, scope='upcnv4')
up4 = _resize_like(up4, cnv3b)
if use_skip:
i4_in = tf.concat([up4, cnv3b], axis=3)
else:
i4_in = up4
icnv4 = slim.conv2d(i4_in, 128, [3, 3], stride=1, scope='icnv4')
disp4 = (slim.conv2d(icnv4, 1, [3, 3], stride=1, activation_fn=tf.sigmoid,
normalizer_fn=None, scope='disp4')
* DISP_SCALING + MIN_DISP)
disp4_up = tf.image.resize_bilinear(disp4, [np.int(h / 4), np.int(w / 4)],
align_corners=True)
up3 = slim.conv2d_transpose(icnv4, 64, [3, 3], stride=2, scope='upcnv3')
up3 = _resize_like(up3, cnv2b)
if use_skip:
i3_in = tf.concat([up3, cnv2b, disp4_up], axis=3)
else:
      i3_in = tf.concat([up3, disp4_up], axis=3)
icnv3 = slim.conv2d(i3_in, 64, [3, 3], stride=1, scope='icnv3')
disp3 = (slim.conv2d(icnv3, 1, [3, 3], stride=1, activation_fn=tf.sigmoid,
normalizer_fn=None, scope='disp3')
* DISP_SCALING + MIN_DISP)
disp3_up = tf.image.resize_bilinear(disp3, [np.int(h / 2), np.int(w / 2)],
align_corners=True)
up2 = slim.conv2d_transpose(icnv3, 32, [3, 3], stride=2, scope='upcnv2')
up2 = _resize_like(up2, cnv1b)
if use_skip:
i2_in = tf.concat([up2, cnv1b, disp3_up], axis=3)
else:
      i2_in = tf.concat([up2, disp3_up], axis=3)
icnv2 = slim.conv2d(i2_in, 32, [3, 3], stride=1, scope='icnv2')
disp2 = (slim.conv2d(icnv2, 1, [3, 3], stride=1, activation_fn=tf.sigmoid,
normalizer_fn=None, scope='disp2')
* DISP_SCALING + MIN_DISP)
disp2_up = tf.image.resize_bilinear(disp2, [h, w], align_corners=True)
up1 = slim.conv2d_transpose(icnv2, 16, [3, 3], stride=2, scope='upcnv1')
i1_in = tf.concat([up1, disp2_up], axis=3)
icnv1 = slim.conv2d(i1_in, 16, [3, 3], stride=1, scope='icnv1')
disp1 = (slim.conv2d(icnv1, 1, [3, 3], stride=1, activation_fn=tf.sigmoid,
normalizer_fn=None, scope='disp1')
* DISP_SCALING + MIN_DISP)
return [disp1, disp2, disp3, disp4]
def encoder_resnet(target_image, weight_reg, is_training):
"""Defines a ResNet18-based encoding architecture.
This implementation follows Juyong Kim's implementation of ResNet18 on GitHub:
https://github.com/dalgu90/resnet-18-tensorflow
Args:
target_image: Input tensor with shape [B, h, w, 3] to encode.
weight_reg: Parameter ignored.
is_training: Whether the model is being trained or not.
Returns:
Tuple of tensors, with the first being the bottleneck layer as tensor of
size [B, h_hid, w_hid, c_hid], and others being intermediate layers
for building skip-connections.
"""
del weight_reg
encoder_filters = [64, 64, 128, 256, 512]
stride = 2
# conv1
with tf.variable_scope('conv1'):
x = _conv(target_image, 7, encoder_filters[0], stride)
x = _bn(x, is_train=is_training)
econv1 = _relu(x)
x = tf.nn.max_pool(econv1, [1, 3, 3, 1], [1, 2, 2, 1], 'SAME')
# conv2_x
x = _residual_block(x, is_training, name='conv2_1')
econv2 = _residual_block(x, is_training, name='conv2_2')
# conv3_x
x = _residual_block_first(econv2, is_training, encoder_filters[2], stride,
name='conv3_1')
econv3 = _residual_block(x, is_training, name='conv3_2')
# conv4_x
x = _residual_block_first(econv3, is_training, encoder_filters[3], stride,
name='conv4_1')
econv4 = _residual_block(x, is_training, name='conv4_2')
# conv5_x
x = _residual_block_first(econv4, is_training, encoder_filters[4], stride,
name='conv5_1')
econv5 = _residual_block(x, is_training, name='conv5_2')
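  # econv5 is at 1/32 of the input resolution; the skip connections econv4,
  # econv3, econv2 and econv1 are at 1/16, 1/8, 1/4 and 1/2 of the input
  # resolution, respectively.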
return econv5, (econv4, econv3, econv2, econv1)
def decoder_resnet(target_image, bottleneck, weight_reg, use_skip,
skip_connections):
"""Defines the depth decoder architecture.
Args:
target_image: The original encoder input tensor with shape [B, h, w, 3].
Just the shape information is used here.
bottleneck: Bottleneck layer to be decoded.
weight_reg: The amount of weight regularization.
use_skip: Whether the passed skip connections econv1, econv2, econv3 and
econv4 should be used.
skip_connections: Tensors for building skip-connections.
Returns:
Disparities at 4 different scales.
"""
(econv4, econv3, econv2, econv1) = skip_connections
decoder_filters = [16, 32, 64, 128, 256]
default_pad = tf.constant([[0, 0], [1, 1], [1, 1], [0, 0]])
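  # The 3x3 'VALID' convolutions below are preceded by a one-pixel reflection
  # pad (default_pad); compared to zero padding, reflection padding is
  # commonly used to reduce border artifacts in the predicted disparities.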
reg = slim.l2_regularizer(weight_reg) if weight_reg > 0.0 else None
with slim.arg_scope([slim.conv2d, slim.conv2d_transpose],
normalizer_fn=None,
normalizer_params=None,
activation_fn=tf.nn.relu,
weights_regularizer=reg):
upconv5 = slim.conv2d_transpose(bottleneck, decoder_filters[4], [3, 3],
stride=2, scope='upconv5')
upconv5 = _resize_like(upconv5, econv4)
if use_skip:
i5_in = tf.concat([upconv5, econv4], axis=3)
else:
i5_in = upconv5
i5_in = tf.pad(i5_in, default_pad, mode='REFLECT')
iconv5 = slim.conv2d(i5_in, decoder_filters[4], [3, 3], stride=1,
scope='iconv5', padding='VALID')
upconv4 = slim.conv2d_transpose(iconv5, decoder_filters[3], [3, 3],
stride=2, scope='upconv4')
upconv4 = _resize_like(upconv4, econv3)
if use_skip:
i4_in = tf.concat([upconv4, econv3], axis=3)
else:
i4_in = upconv4
i4_in = tf.pad(i4_in, default_pad, mode='REFLECT')
iconv4 = slim.conv2d(i4_in, decoder_filters[3], [3, 3], stride=1,
scope='iconv4', padding='VALID')
disp4_input = tf.pad(iconv4, default_pad, mode='REFLECT')
disp4 = (slim.conv2d(disp4_input, 1, [3, 3], stride=1,
activation_fn=tf.sigmoid, normalizer_fn=None,
scope='disp4', padding='VALID')
* DISP_SCALING + MIN_DISP)
upconv3 = slim.conv2d_transpose(iconv4, decoder_filters[2], [3, 3],
stride=2, scope='upconv3')
upconv3 = _resize_like(upconv3, econv2)
if use_skip:
i3_in = tf.concat([upconv3, econv2], axis=3)
else:
i3_in = upconv3
i3_in = tf.pad(i3_in, default_pad, mode='REFLECT')
iconv3 = slim.conv2d(i3_in, decoder_filters[2], [3, 3], stride=1,
scope='iconv3', padding='VALID')
disp3_input = tf.pad(iconv3, default_pad, mode='REFLECT')
disp3 = (slim.conv2d(disp3_input, 1, [3, 3], stride=1,
activation_fn=tf.sigmoid, normalizer_fn=None,
scope='disp3', padding='VALID')
* DISP_SCALING + MIN_DISP)
upconv2 = slim.conv2d_transpose(iconv3, decoder_filters[1], [3, 3],
stride=2, scope='upconv2')
upconv2 = _resize_like(upconv2, econv1)
if use_skip:
i2_in = tf.concat([upconv2, econv1], axis=3)
else:
i2_in = upconv2
i2_in = tf.pad(i2_in, default_pad, mode='REFLECT')
iconv2 = slim.conv2d(i2_in, decoder_filters[1], [3, 3], stride=1,
scope='iconv2', padding='VALID')
disp2_input = tf.pad(iconv2, default_pad, mode='REFLECT')
disp2 = (slim.conv2d(disp2_input, 1, [3, 3], stride=1,
activation_fn=tf.sigmoid, normalizer_fn=None,
scope='disp2', padding='VALID')
* DISP_SCALING + MIN_DISP)
upconv1 = slim.conv2d_transpose(iconv2, decoder_filters[0], [3, 3],
stride=2, scope='upconv1')
upconv1 = _resize_like(upconv1, target_image)
upconv1 = tf.pad(upconv1, default_pad, mode='REFLECT')
iconv1 = slim.conv2d(upconv1, decoder_filters[0], [3, 3], stride=1,
scope='iconv1', padding='VALID')
disp1_input = tf.pad(iconv1, default_pad, mode='REFLECT')
disp1 = (slim.conv2d(disp1_input, 1, [3, 3], stride=1,
activation_fn=tf.sigmoid, normalizer_fn=None,
scope='disp1', padding='VALID')
* DISP_SCALING + MIN_DISP)
return [disp1, disp2, disp3, disp4]
def _residual_block_first(x, is_training, out_channel, strides, name='unit'):
"""Helper function for defining ResNet architecture."""
in_channel = x.get_shape().as_list()[-1]
with tf.variable_scope(name):
# Shortcut connection
if in_channel == out_channel:
if strides == 1:
shortcut = tf.identity(x)
else:
shortcut = tf.nn.max_pool(x, [1, strides, strides, 1],
[1, strides, strides, 1], 'VALID')
else:
shortcut = _conv(x, 1, out_channel, strides, name='shortcut')
# Residual
x = _conv(x, 3, out_channel, strides, name='conv_1')
x = _bn(x, is_train=is_training, name='bn_1')
x = _relu(x, name='relu_1')
x = _conv(x, 3, out_channel, 1, name='conv_2')
x = _bn(x, is_train=is_training, name='bn_2')
# Merge
x = x + shortcut
x = _relu(x, name='relu_2')
return x
def _residual_block(x, is_training, input_q=None, output_q=None, name='unit'):
"""Helper function for defining ResNet architecture."""
num_channel = x.get_shape().as_list()[-1]
with tf.variable_scope(name):
shortcut = x # Shortcut connection
# Residual
x = _conv(x, 3, num_channel, 1, input_q=input_q, output_q=output_q,
name='conv_1')
x = _bn(x, is_train=is_training, name='bn_1')
x = _relu(x, name='relu_1')
x = _conv(x, 3, num_channel, 1, input_q=output_q, output_q=output_q,
name='conv_2')
x = _bn(x, is_train=is_training, name='bn_2')
# Merge
x = x + shortcut
x = _relu(x, name='relu_2')
return x
def _conv(x, filter_size, out_channel, stride, pad='SAME', input_q=None,
output_q=None, name='conv'):
"""Helper function for defining ResNet architecture."""
if (input_q is None) ^ (output_q is None):
raise ValueError('Input/Output splits are not correctly given.')
in_shape = x.get_shape()
with tf.variable_scope(name):
# Main operation: conv2d
with tf.device('/CPU:0'):
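      # He-style (fan-out) initialization: the stddev is derived from the
      # kernel area and the number of output channels.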
kernel = tf.get_variable(
'kernel', [filter_size, filter_size, in_shape[3], out_channel],
tf.float32, initializer=tf.random_normal_initializer(
stddev=np.sqrt(2.0/filter_size/filter_size/out_channel)))
if kernel not in tf.get_collection(WEIGHT_DECAY_KEY):
tf.add_to_collection(WEIGHT_DECAY_KEY, kernel)
conv = tf.nn.conv2d(x, kernel, [1, stride, stride, 1], pad)
return conv
def _bn(x, is_train, name='bn'):
"""Helper function for defining ResNet architecture."""
bn = tf.layers.batch_normalization(x, training=is_train, name=name)
return bn
def _relu(x, name=None, leakness=0.0):
"""Helper function for defining ResNet architecture."""
  if leakness > 0.0:
    name = 'lrelu' if name is None else name
    return tf.maximum(x, x*leakness, name=name)
  else:
    name = 'relu' if name is None else name
    return tf.nn.relu(x, name=name)
def _resize_like(inputs, ref):
i_h, i_w = inputs.get_shape()[1], inputs.get_shape()[2]
r_h, r_w = ref.get_shape()[1], ref.get_shape()[2]
if i_h == r_h and i_w == r_w:
return inputs
else:
# TODO(casser): Other interpolation methods could be explored here.
return tf.image.resize_bilinear(inputs, [r_h.value, r_w.value],
align_corners=True)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Applies online refinement while running inference.
Instructions: Run static inference first before calling this script. Make sure
to point output_dir to the same folder where static inference results were
saved previously.
For example usage, please refer to the README.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import datetime
import os
import random
from absl import app
from absl import flags
from absl import logging
import numpy as np
import tensorflow as tf
import model
import nets
import reader
import util
gfile = tf.gfile
SAVE_EVERY = 1 # Defines the interval that predictions should be saved at.
SAVE_PREVIEWS = True  # If set, will save image previews of depth predictions.
FIXED_SEED = 8964 # Fixed seed for repeatability.
flags.DEFINE_string('output_dir', None, 'Directory to store predictions. '
'Assumes that regular inference has been executed before '
'and results were stored in this folder.')
flags.DEFINE_string('data_dir', None, 'Folder pointing to preprocessed '
'triplets to fine-tune on.')
flags.DEFINE_string('triplet_list_file', None, 'Text file containing paths to '
'image files to process. Paths should be relative with '
'respect to the list file location. Every line should be '
'of the form [input_folder_name] [input_frame_num] '
'[output_path], where [output_path] is optional to specify '
'a different path to store the prediction.')
flags.DEFINE_string('triplet_list_file_remains', None, 'Optional text file '
'containing relative paths to image files which should not '
'be fine-tuned, e.g. because of missing adjacent frames. '
'For all files listed, the static prediction will be '
'copied instead. File can be empty. If not, every line '
'should be of the form [input_folder_name] '
'[input_frame_num] [output_path], where [output_path] is '
'optional to specify a different path to take and store '
'the unrefined prediction from/to.')
flags.DEFINE_string('model_ckpt', None, 'Model checkpoint to optimize.')
flags.DEFINE_string('ft_name', '', 'Optional prefix for temporary files.')
flags.DEFINE_string('file_extension', 'png', 'Image data file extension.')
flags.DEFINE_float('learning_rate', 0.0001, 'Adam learning rate.')
flags.DEFINE_float('beta1', 0.9, 'Adam momentum.')
flags.DEFINE_float('reconstr_weight', 0.85, 'Frame reconstruction loss weight.')
flags.DEFINE_float('ssim_weight', 0.15, 'SSIM loss weight.')
flags.DEFINE_float('smooth_weight', 0.01, 'Smoothness loss weight.')
flags.DEFINE_float('icp_weight', 0.0, 'ICP loss weight.')
flags.DEFINE_float('size_constraint_weight', 0.0005, 'Weight of the object '
'size constraint loss. Use only with motion handling.')
flags.DEFINE_integer('batch_size', 1, 'The size of a sample batch')
flags.DEFINE_integer('img_height', 128, 'Input frame height.')
flags.DEFINE_integer('img_width', 416, 'Input frame width.')
flags.DEFINE_integer('seq_length', 3, 'Number of frames in sequence.')
flags.DEFINE_enum('architecture', nets.RESNET, nets.ARCHITECTURES,
'Defines the architecture to use for the depth prediction '
'network. Defaults to ResNet-based encoder and accompanying '
'decoder.')
flags.DEFINE_boolean('imagenet_norm', True, 'Whether to normalize the input '
'images channel-wise so that they match the distribution '
'most ImageNet-models were trained on.')
flags.DEFINE_float('weight_reg', 0.05, 'The amount of weight regularization to '
'apply. This has no effect on the ResNet-based encoder '
'architecture.')
flags.DEFINE_boolean('exhaustive_mode', False, 'Whether to exhaustively warp '
'from any frame to any other instead of just considering '
'adjacent frames. Where necessary, multiple egomotion '
'estimates will be applied. Does not have an effect if '
'compute_minimum_loss is enabled.')
flags.DEFINE_boolean('random_scale_crop', False, 'Whether to apply random '
'image scaling and center cropping during training.')
flags.DEFINE_bool('depth_upsampling', True, 'Whether to apply depth '
'upsampling of lower-scale representations before warping to '
'compute reconstruction loss on full-resolution image.')
flags.DEFINE_bool('depth_normalization', True, 'Whether to apply depth '
'normalization, that is, normalizing inverse depth '
'prediction maps by their mean to avoid degeneration towards '
'small values.')
flags.DEFINE_bool('compute_minimum_loss', True, 'Whether to take the '
'element-wise minimum of the reconstruction/SSIM error in '
'order to avoid overly penalizing dis-occlusion effects.')
flags.DEFINE_bool('use_skip', True, 'Whether to use skip connections in the '
'encoder-decoder architecture.')
flags.DEFINE_bool('joint_encoder', False, 'Whether to share parameters '
'between the depth and egomotion networks by using a joint '
'encoder architecture. The egomotion network is then '
'operating only on the hidden representation provided by the '
'joint encoder.')
flags.DEFINE_float('egomotion_threshold', 0.01, 'Minimum egomotion magnitude '
'to apply finetuning. If lower, just forwards the ordinary '
'prediction.')
flags.DEFINE_integer('num_steps', 20, 'Number of optimization steps to run.')
flags.DEFINE_boolean('handle_motion', True, 'Whether the checkpoint was '
'trained with motion handling.')
flags.DEFINE_bool('flip', False, 'Whether images should be flipped as well as '
'resulting predictions (for test-time augmentation). This '
'currently applies to the depth network only.')
FLAGS = flags.FLAGS
flags.mark_flag_as_required('output_dir')
flags.mark_flag_as_required('data_dir')
flags.mark_flag_as_required('model_ckpt')
flags.mark_flag_as_required('triplet_list_file')
def main(_):
"""Runs fine-tuning and inference.
There are three categories of images.
1) Images where we have previous and next frame, and that are not filtered
out by the heuristic. For them, we will use the fine-tuned predictions.
2) Images where we have previous and next frame, but that were filtered out
by our heuristic. For them, we will use the ordinary prediction instead.
3) Images where we have at least one missing adjacent frame. For them, we will
use the ordinary prediction as indicated by triplet_list_file_remains (if
provided). They will also not be part of the generated inference list in
the first place.
Raises:
ValueError: Invalid parameters have been passed.
"""
if FLAGS.handle_motion and FLAGS.joint_encoder:
raise ValueError('Using a joint encoder is currently not supported when '
'modeling object motion.')
if FLAGS.handle_motion and FLAGS.seq_length != 3:
raise ValueError('The current motion model implementation only supports '
'using a sequence length of three.')
if FLAGS.handle_motion and not FLAGS.compute_minimum_loss:
raise ValueError('Computing the minimum photometric loss is required when '
'enabling object motion handling.')
if FLAGS.size_constraint_weight > 0 and not FLAGS.handle_motion:
raise ValueError('To enforce object size constraints, enable motion '
'handling.')
if FLAGS.icp_weight > 0.0:
raise ValueError('ICP is currently not supported.')
if FLAGS.compute_minimum_loss and FLAGS.seq_length % 2 != 1:
raise ValueError('Compute minimum loss requires using an odd number of '
'images in a sequence.')
if FLAGS.compute_minimum_loss and FLAGS.exhaustive_mode:
raise ValueError('Exhaustive mode has no effect when compute_minimum_loss '
'is enabled.')
if FLAGS.img_width % (2 ** 5) != 0 or FLAGS.img_height % (2 ** 5) != 0:
    logging.warn('Image size is not divisible by 2^5. For the architecture '
                 'employed, this could cause artefacts due to resizing at '
                 'lower resolutions.')
if FLAGS.output_dir.endswith('/'):
FLAGS.output_dir = FLAGS.output_dir[:-1]
# Create file lists to prepare fine-tuning, save it to unique_file.
unique_file_name = (str(datetime.datetime.now().date()) + '_' +
str(datetime.datetime.now().time()).replace(':', '_'))
unique_file = os.path.join(FLAGS.data_dir, unique_file_name + '.txt')
with gfile.FastGFile(FLAGS.triplet_list_file, 'r') as f:
files_to_process = f.readlines()
files_to_process = [line.rstrip() for line in files_to_process]
files_to_process = [line for line in files_to_process if len(line)]
logging.info('Creating unique file list %s with %s entries.', unique_file,
len(files_to_process))
with gfile.FastGFile(unique_file, 'w') as f_out:
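    # Each sess.run that touches the input queue consumes one example per
    # batch element: one fetch per fine-tuning step, plus three extra
    # evaluations (depth, input images, egomotion) per saving step. The file
    # list therefore repeats every image often enough to cover all fetches.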
fetches_network = FLAGS.num_steps * FLAGS.batch_size
fetches_saves = FLAGS.batch_size * int(np.floor(FLAGS.num_steps/SAVE_EVERY))
repetitions = fetches_network + 3 * fetches_saves
for i in range(len(files_to_process)):
for _ in range(repetitions):
f_out.write(files_to_process[i] + '\n')
# Read remaining files.
remaining = []
if gfile.Exists(FLAGS.triplet_list_file_remains):
with gfile.FastGFile(FLAGS.triplet_list_file_remains, 'r') as f:
remaining = f.readlines()
remaining = [line.rstrip() for line in remaining]
remaining = [line for line in remaining if len(line)]
logging.info('Running fine-tuning on %s files, %s files are remaining.',
len(files_to_process), len(remaining))
# Run fine-tuning process and save predictions in id-folders.
tf.set_random_seed(FIXED_SEED)
np.random.seed(FIXED_SEED)
random.seed(FIXED_SEED)
flipping_mode = reader.FLIP_ALWAYS if FLAGS.flip else reader.FLIP_NONE
train_model = model.Model(data_dir=FLAGS.data_dir,
file_extension=FLAGS.file_extension,
is_training=True,
learning_rate=FLAGS.learning_rate,
beta1=FLAGS.beta1,
reconstr_weight=FLAGS.reconstr_weight,
smooth_weight=FLAGS.smooth_weight,
ssim_weight=FLAGS.ssim_weight,
icp_weight=FLAGS.icp_weight,
batch_size=FLAGS.batch_size,
img_height=FLAGS.img_height,
img_width=FLAGS.img_width,
seq_length=FLAGS.seq_length,
architecture=FLAGS.architecture,
imagenet_norm=FLAGS.imagenet_norm,
weight_reg=FLAGS.weight_reg,
exhaustive_mode=FLAGS.exhaustive_mode,
random_scale_crop=FLAGS.random_scale_crop,
flipping_mode=flipping_mode,
random_color=False,
depth_upsampling=FLAGS.depth_upsampling,
depth_normalization=FLAGS.depth_normalization,
compute_minimum_loss=FLAGS.compute_minimum_loss,
use_skip=FLAGS.use_skip,
joint_encoder=FLAGS.joint_encoder,
build_sum=False,
shuffle=False,
input_file=unique_file_name,
handle_motion=FLAGS.handle_motion,
size_constraint_weight=FLAGS.size_constraint_weight,
train_global_scale_var=False)
failed_heuristic_ids = finetune_inference(train_model, FLAGS.model_ckpt,
FLAGS.output_dir + '_ft')
logging.info('Fine-tuning completed, %s files were filtered out by '
'heuristic.', len(failed_heuristic_ids))
for failed_id in failed_heuristic_ids:
failed_entry = files_to_process[failed_id]
remaining.append(failed_entry)
logging.info('In total, %s images were fine-tuned, while %s were not.',
len(files_to_process)-len(failed_heuristic_ids), len(remaining))
# Copy all results to have the same structural output as running ordinary
# inference.
for i in range(len(files_to_process)):
if files_to_process[i] not in remaining: # Use fine-tuned result.
elements = files_to_process[i].split(' ')
source_file = os.path.join(FLAGS.output_dir + '_ft', FLAGS.ft_name +
'id_' + str(i),
str(FLAGS.num_steps).zfill(10) +
('_flip' if FLAGS.flip else ''))
if len(elements) == 2: # No differing mapping defined.
target_dir = os.path.join(FLAGS.output_dir + '_ft', elements[0])
target_file = os.path.join(
target_dir, elements[1] + ('_flip' if FLAGS.flip else ''))
else: # Other mapping for file defined, copy to this location instead.
target_dir = os.path.join(
FLAGS.output_dir + '_ft', os.path.dirname(elements[2]))
target_file = os.path.join(
target_dir,
os.path.basename(elements[2]) + ('_flip' if FLAGS.flip else ''))
if not gfile.Exists(target_dir):
gfile.MakeDirs(target_dir)
logging.info('Copy refined result %s to %s.', source_file, target_file)
gfile.Copy(source_file + '.npy', target_file + '.npy', overwrite=True)
gfile.Copy(source_file + '.txt', target_file + '.txt', overwrite=True)
gfile.Copy(source_file + '.%s' % FLAGS.file_extension,
target_file + '.%s' % FLAGS.file_extension, overwrite=True)
for j in range(len(remaining)):
elements = remaining[j].split(' ')
if len(elements) == 2: # No differing mapping defined.
target_dir = os.path.join(FLAGS.output_dir + '_ft', elements[0])
target_file = os.path.join(
target_dir, elements[1] + ('_flip' if FLAGS.flip else ''))
else: # Other mapping for file defined, copy to this location instead.
target_dir = os.path.join(
FLAGS.output_dir + '_ft', os.path.dirname(elements[2]))
target_file = os.path.join(
target_dir,
os.path.basename(elements[2]) + ('_flip' if FLAGS.flip else ''))
if not gfile.Exists(target_dir):
gfile.MakeDirs(target_dir)
source_file = target_file.replace('_ft', '')
logging.info('Copy unrefined result %s to %s.', source_file, target_file)
gfile.Copy(source_file + '.npy', target_file + '.npy', overwrite=True)
gfile.Copy(source_file + '.%s' % FLAGS.file_extension,
target_file + '.%s' % FLAGS.file_extension, overwrite=True)
logging.info('Done, predictions saved in %s.', FLAGS.output_dir + '_ft')
def finetune_inference(train_model, model_ckpt, output_dir):
"""Train model."""
vars_to_restore = None
if model_ckpt is not None:
vars_to_restore = util.get_vars_to_save_and_restore(model_ckpt)
ckpt_path = model_ckpt
pretrain_restorer = tf.train.Saver(vars_to_restore)
sv = tf.train.Supervisor(logdir=None, save_summaries_secs=0, saver=None,
summary_op=None)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
img_nr = 0
failed_heuristic = []
with sv.managed_session(config=config) as sess:
# TODO(casser): Caching the weights would be better to avoid I/O bottleneck.
while True: # Loop terminates when all examples have been processed.
if model_ckpt is not None:
logging.info('Restored weights from %s', ckpt_path)
pretrain_restorer.restore(sess, ckpt_path)
logging.info('Running fine-tuning, image %s...', img_nr)
img_pred_folder = os.path.join(
output_dir, FLAGS.ft_name + 'id_' + str(img_nr))
if not gfile.Exists(img_pred_folder):
gfile.MakeDirs(img_pred_folder)
step = 1
# Run fine-tuning.
while step <= FLAGS.num_steps:
logging.info('Running step %s of %s.', step, FLAGS.num_steps)
fetches = {
'train': train_model.train_op,
'global_step': train_model.global_step,
'incr_global_step': train_model.incr_global_step
}
_ = sess.run(fetches)
if step % SAVE_EVERY == 0:
# Get latest prediction for middle frame, highest scale.
pred = train_model.depth[1][0].eval(session=sess)
if FLAGS.flip:
pred = np.flip(pred, axis=2)
input_img = train_model.image_stack.eval(session=sess)
input_img_prev = input_img[0, :, :, 0:3]
input_img_center = input_img[0, :, :, 3:6]
input_img_next = input_img[0, :, :, 6:]
img_pred_file = os.path.join(
img_pred_folder,
str(step).zfill(10) + ('_flip' if FLAGS.flip else '') + '.npy')
motion = np.squeeze(train_model.egomotion.eval(session=sess))
# motion of shape (seq_length - 1, 6).
motion = np.mean(motion, axis=0) # Average egomotion across frames.
if SAVE_PREVIEWS or step == FLAGS.num_steps:
# Also save preview of depth map.
color_map = util.normalize_depth_for_display(
np.squeeze(pred[0, :, :]))
visualization = np.concatenate(
(input_img_prev, input_img_center, input_img_next, color_map))
motion_s = [str(m) for m in motion]
s_rep = ','.join(motion_s)
with gfile.Open(img_pred_file.replace('.npy', '.txt'), 'w') as f:
f.write(s_rep)
util.save_image(
img_pred_file.replace('.npy', '.%s' % FLAGS.file_extension),
visualization, FLAGS.file_extension)
with gfile.Open(img_pred_file, 'wb') as f:
np.save(f, pred)
# Apply heuristic to not finetune if egomotion magnitude is too low.
ego_magnitude = np.linalg.norm(motion[:3], ord=2)
heuristic = ego_magnitude >= FLAGS.egomotion_threshold
if not heuristic and step == FLAGS.num_steps:
failed_heuristic.append(img_nr)
step += 1
img_nr += 1
return failed_heuristic
if __name__ == '__main__':
app.run(main)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Geometry utilities for projecting frames based on depth and motion.
Modified from Spatial Transformer Networks:
https://github.com/tensorflow/models/blob/master/transformer/spatial_transformer.py
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from absl import logging
import numpy as np
import tensorflow as tf
def inverse_warp(img, depth, egomotion_mat, intrinsic_mat,
intrinsic_mat_inv):
"""Inverse warp a source image to the target image plane.
Args:
img: The source image (to sample pixels from) -- [B, H, W, 3].
depth: Depth map of the target image -- [B, H, W].
egomotion_mat: Matrix defining egomotion transform -- [B, 4, 4].
intrinsic_mat: Camera intrinsic matrix -- [B, 3, 3].
intrinsic_mat_inv: Inverse of the intrinsic matrix -- [B, 3, 3].
Returns:
Projected source image
"""
dims = tf.shape(img)
batch_size, img_height, img_width = dims[0], dims[1], dims[2]
depth = tf.reshape(depth, [batch_size, 1, img_height * img_width])
grid = _meshgrid_abs(img_height, img_width)
grid = tf.tile(tf.expand_dims(grid, 0), [batch_size, 1, 1])
cam_coords = _pixel2cam(depth, grid, intrinsic_mat_inv)
ones = tf.ones([batch_size, 1, img_height * img_width])
cam_coords_hom = tf.concat([cam_coords, ones], axis=1)
# Get projection matrix for target camera frame to source pixel frame
hom_filler = tf.constant([0.0, 0.0, 0.0, 1.0], shape=[1, 1, 4])
hom_filler = tf.tile(hom_filler, [batch_size, 1, 1])
intrinsic_mat_hom = tf.concat(
[intrinsic_mat, tf.zeros([batch_size, 3, 1])], axis=2)
intrinsic_mat_hom = tf.concat([intrinsic_mat_hom, hom_filler], axis=1)
proj_target_cam_to_source_pixel = tf.matmul(intrinsic_mat_hom, egomotion_mat)
source_pixel_coords = _cam2pixel(cam_coords_hom,
proj_target_cam_to_source_pixel)
source_pixel_coords = tf.reshape(source_pixel_coords,
[batch_size, 2, img_height, img_width])
source_pixel_coords = tf.transpose(source_pixel_coords, perm=[0, 2, 3, 1])
projected_img, mask = _spatial_transformer(img, source_pixel_coords)
return projected_img, mask
def get_transform_mat(egomotion_vecs, i, j):
"""Returns a transform matrix defining the transform from frame i to j."""
egomotion_transforms = []
batchsize = tf.shape(egomotion_vecs)[0]
if i == j:
return tf.tile(tf.expand_dims(tf.eye(4, 4), axis=0), [batchsize, 1, 1])
for k in range(min(i, j), max(i, j)):
transform_matrix = _egomotion_vec2mat(egomotion_vecs[:, k, :], batchsize)
if i > j: # Going back in sequence, need to invert egomotion.
egomotion_transforms.insert(0, tf.linalg.inv(transform_matrix))
else: # Going forward in sequence
egomotion_transforms.append(transform_matrix)
# Multiply all matrices.
egomotion_mat = egomotion_transforms[0]
for i in range(1, len(egomotion_transforms)):
egomotion_mat = tf.matmul(egomotion_mat, egomotion_transforms[i])
return egomotion_mat
def _pixel2cam(depth, pixel_coords, intrinsic_mat_inv):
"""Transform coordinates in the pixel frame to the camera frame."""
cam_coords = tf.matmul(intrinsic_mat_inv, pixel_coords) * depth
return cam_coords
def _cam2pixel(cam_coords, proj_c2p):
"""Transform coordinates in the camera frame to the pixel frame."""
pcoords = tf.matmul(proj_c2p, cam_coords)
x = tf.slice(pcoords, [0, 0, 0], [-1, 1, -1])
y = tf.slice(pcoords, [0, 1, 0], [-1, 1, -1])
z = tf.slice(pcoords, [0, 2, 0], [-1, 1, -1])
# Not tested if adding a small number is necessary
x_norm = x / (z + 1e-10)
y_norm = y / (z + 1e-10)
pixel_coords = tf.concat([x_norm, y_norm], axis=1)
return pixel_coords
def _meshgrid_abs(height, width):
"""Meshgrid in the absolute coordinates."""
x_t = tf.matmul(
tf.ones(shape=tf.stack([height, 1])),
tf.transpose(tf.expand_dims(tf.linspace(-1.0, 1.0, width), 1), [1, 0]))
y_t = tf.matmul(
tf.expand_dims(tf.linspace(-1.0, 1.0, height), 1),
tf.ones(shape=tf.stack([1, width])))
x_t = (x_t + 1.0) * 0.5 * tf.cast(width - 1, tf.float32)
y_t = (y_t + 1.0) * 0.5 * tf.cast(height - 1, tf.float32)
x_t_flat = tf.reshape(x_t, (1, -1))
y_t_flat = tf.reshape(y_t, (1, -1))
ones = tf.ones_like(x_t_flat)
grid = tf.concat([x_t_flat, y_t_flat, ones], axis=0)
return grid
def _euler2mat(z, y, x):
"""Converts euler angles to rotation matrix.
From:
https://github.com/pulkitag/pycaffe-utils/blob/master/rot_utils.py#L174
TODO: Remove the dimension for 'N' (deprecated for converting all source
poses altogether).
Args:
z: rotation angle along z axis (in radians) -- size = [B, n]
y: rotation angle along y axis (in radians) -- size = [B, n]
x: rotation angle along x axis (in radians) -- size = [B, n]
Returns:
Rotation matrix corresponding to the euler angles, with shape [B, n, 3, 3].
"""
batch_size = tf.shape(z)[0]
n = 1
z = tf.clip_by_value(z, -np.pi, np.pi)
y = tf.clip_by_value(y, -np.pi, np.pi)
x = tf.clip_by_value(x, -np.pi, np.pi)
# Expand to B x N x 1 x 1
z = tf.expand_dims(tf.expand_dims(z, -1), -1)
y = tf.expand_dims(tf.expand_dims(y, -1), -1)
x = tf.expand_dims(tf.expand_dims(x, -1), -1)
zeros = tf.zeros([batch_size, n, 1, 1])
ones = tf.ones([batch_size, n, 1, 1])
cosz = tf.cos(z)
sinz = tf.sin(z)
rotz_1 = tf.concat([cosz, -sinz, zeros], axis=3)
rotz_2 = tf.concat([sinz, cosz, zeros], axis=3)
rotz_3 = tf.concat([zeros, zeros, ones], axis=3)
zmat = tf.concat([rotz_1, rotz_2, rotz_3], axis=2)
cosy = tf.cos(y)
siny = tf.sin(y)
roty_1 = tf.concat([cosy, zeros, siny], axis=3)
roty_2 = tf.concat([zeros, ones, zeros], axis=3)
roty_3 = tf.concat([-siny, zeros, cosy], axis=3)
ymat = tf.concat([roty_1, roty_2, roty_3], axis=2)
cosx = tf.cos(x)
sinx = tf.sin(x)
rotx_1 = tf.concat([ones, zeros, zeros], axis=3)
rotx_2 = tf.concat([zeros, cosx, -sinx], axis=3)
rotx_3 = tf.concat([zeros, sinx, cosx], axis=3)
xmat = tf.concat([rotx_1, rotx_2, rotx_3], axis=2)
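  # The rotations are composed as R = Rx * Ry * Rz.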
return tf.matmul(tf.matmul(xmat, ymat), zmat)
def _egomotion_vec2mat(vec, batch_size):
"""Converts 6DoF transform vector to transformation matrix.
Args:
vec: 6DoF parameters [tx, ty, tz, rx, ry, rz] -- [B, 6].
batch_size: Batch size.
Returns:
A transformation matrix -- [B, 4, 4].
"""
translation = tf.slice(vec, [0, 0], [-1, 3])
translation = tf.expand_dims(translation, -1)
rx = tf.slice(vec, [0, 3], [-1, 1])
ry = tf.slice(vec, [0, 4], [-1, 1])
rz = tf.slice(vec, [0, 5], [-1, 1])
rot_mat = _euler2mat(rz, ry, rx)
rot_mat = tf.squeeze(rot_mat, squeeze_dims=[1])
filler = tf.constant([0.0, 0.0, 0.0, 1.0], shape=[1, 1, 4])
filler = tf.tile(filler, [batch_size, 1, 1])
transform_mat = tf.concat([rot_mat, translation], axis=2)
transform_mat = tf.concat([transform_mat, filler], axis=1)
return transform_mat
def _bilinear_sampler(im, x, y, name='bilinear_sampler'):
"""Perform bilinear sampling on im given list of x, y coordinates.
Implements the differentiable sampling mechanism with bilinear kernel
in https://arxiv.org/abs/1506.02025.
x,y are tensors specifying normalized coordinates [-1, 1] to be sampled on im.
For example, (-1, -1) in (x, y) corresponds to pixel location (0, 0) in im,
and (1, 1) in (x, y) corresponds to the bottom right pixel in im.
Args:
im: Batch of images with shape [B, h, w, channels].
x: Tensor of normalized x coordinates in [-1, 1], with shape [B, h, w, 1].
y: Tensor of normalized y coordinates in [-1, 1], with shape [B, h, w, 1].
name: Name scope for ops.
Returns:
Sampled image with shape [B, h, w, channels].
Principled mask with shape [B, h, w, 1], dtype:float32. A value of 1.0
in the mask indicates that the corresponding coordinate in the sampled
image is valid.
"""
with tf.variable_scope(name):
x = tf.reshape(x, [-1])
y = tf.reshape(y, [-1])
# Constants.
batch_size = tf.shape(im)[0]
_, height, width, channels = im.get_shape().as_list()
x = tf.to_float(x)
y = tf.to_float(y)
height_f = tf.cast(height, 'float32')
width_f = tf.cast(width, 'float32')
zero = tf.constant(0, dtype=tf.int32)
max_y = tf.cast(tf.shape(im)[1] - 1, 'int32')
max_x = tf.cast(tf.shape(im)[2] - 1, 'int32')
# Scale indices from [-1, 1] to [0, width - 1] or [0, height - 1].
x = (x + 1.0) * (width_f - 1.0) / 2.0
y = (y + 1.0) * (height_f - 1.0) / 2.0
# Compute the coordinates of the 4 pixels to sample from.
x0 = tf.cast(tf.floor(x), 'int32')
x1 = x0 + 1
y0 = tf.cast(tf.floor(y), 'int32')
y1 = y0 + 1
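    # A sampling location is valid only if all four neighboring pixels lie
    # inside the image. Out-of-range locations are masked out here; the
    # coordinates are clipped below only so that the gathers stay in bounds.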
mask = tf.logical_and(
tf.logical_and(x0 >= zero, x1 <= max_x),
tf.logical_and(y0 >= zero, y1 <= max_y))
mask = tf.to_float(mask)
x0 = tf.clip_by_value(x0, zero, max_x)
x1 = tf.clip_by_value(x1, zero, max_x)
y0 = tf.clip_by_value(y0, zero, max_y)
y1 = tf.clip_by_value(y1, zero, max_y)
dim2 = width
dim1 = width * height
# Create base index.
base = tf.range(batch_size) * dim1
base = tf.reshape(base, [-1, 1])
base = tf.tile(base, [1, height * width])
base = tf.reshape(base, [-1])
base_y0 = base + y0 * dim2
base_y1 = base + y1 * dim2
idx_a = base_y0 + x0
idx_b = base_y1 + x0
idx_c = base_y0 + x1
idx_d = base_y1 + x1
# Use indices to lookup pixels in the flat image and restore channels dim.
im_flat = tf.reshape(im, tf.stack([-1, channels]))
im_flat = tf.to_float(im_flat)
pixel_a = tf.gather(im_flat, idx_a)
pixel_b = tf.gather(im_flat, idx_b)
pixel_c = tf.gather(im_flat, idx_c)
pixel_d = tf.gather(im_flat, idx_d)
x1_f = tf.to_float(x1)
y1_f = tf.to_float(y1)
    # Bilinear interpolation: each corner pixel is weighted by the area of the
    # rectangle between the sample point and the opposite corner, so the four
    # weights sum to one.
wa = tf.expand_dims(((x1_f - x) * (y1_f - y)), 1)
wb = tf.expand_dims((x1_f - x) * (1.0 - (y1_f - y)), 1)
wc = tf.expand_dims(((1.0 - (x1_f - x)) * (y1_f - y)), 1)
wd = tf.expand_dims(((1.0 - (x1_f - x)) * (1.0 - (y1_f - y))), 1)
output = tf.add_n([wa * pixel_a, wb * pixel_b, wc * pixel_c, wd * pixel_d])
output = tf.reshape(output, tf.stack([batch_size, height, width, channels]))
mask = tf.reshape(mask, tf.stack([batch_size, height, width, 1]))
return output, mask
def _spatial_transformer(img, coords):
"""A wrapper over binlinear_sampler(), taking absolute coords as input."""
img_height = tf.cast(tf.shape(img)[1], tf.float32)
img_width = tf.cast(tf.shape(img)[2], tf.float32)
px = coords[:, :, :, :1]
py = coords[:, :, :, 1:]
# Normalize coordinates to [-1, 1] to send to _bilinear_sampler.
px = px / (img_width - 1) * 2.0 - 1.0
py = py / (img_height - 1) * 2.0 - 1.0
output_img, mask = _bilinear_sampler(img, px, py)
return output_img, mask
def get_cloud(depth, intrinsics_inv, name=None):
"""Convert depth map to 3D point cloud."""
with tf.name_scope(name):
dims = depth.shape.as_list()
batch_size, img_height, img_width = dims[0], dims[1], dims[2]
depth = tf.reshape(depth, [batch_size, 1, img_height * img_width])
grid = _meshgrid_abs(img_height, img_width)
grid = tf.tile(tf.expand_dims(grid, 0), [batch_size, 1, 1])
cam_coords = _pixel2cam(depth, grid, intrinsics_inv)
cam_coords = tf.transpose(cam_coords, [0, 2, 1])
cam_coords = tf.reshape(cam_coords, [batch_size, img_height, img_width, 3])
logging.info('depth -> cloud: %s', cam_coords)
return cam_coords
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Reads data that is produced by dataset/gen_data.py."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import random
from absl import logging
import tensorflow as tf
import util
gfile = tf.gfile
QUEUE_SIZE = 2000
QUEUE_BUFFER = 3
# See nets.encoder_resnet as reference for below input-normalizing constants.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_SD = (0.229, 0.224, 0.225)
FLIP_RANDOM = 'random'  # Flip images randomly.
FLIP_ALWAYS = 'always'  # Always flip images (used for test-time augmentation).
FLIP_NONE = 'none'  # Never flip images.
class DataReader(object):
"""Reads stored sequences which are produced by dataset/gen_data.py."""
def __init__(self, data_dir, batch_size, img_height, img_width, seq_length,
num_scales, file_extension, random_scale_crop, flipping_mode,
random_color, imagenet_norm, shuffle, input_file='train'):
self.data_dir = data_dir
self.batch_size = batch_size
self.img_height = img_height
self.img_width = img_width
self.seq_length = seq_length
self.num_scales = num_scales
self.file_extension = file_extension
self.random_scale_crop = random_scale_crop
self.flipping_mode = flipping_mode
self.random_color = random_color
self.imagenet_norm = imagenet_norm
self.shuffle = shuffle
self.input_file = input_file
def read_data(self):
"""Provides images and camera intrinsics."""
with tf.name_scope('data_loading'):
with tf.name_scope('enqueue_paths'):
seed = random.randint(0, 2**31 - 1)
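# Sharing the same seed across the three queues keeps the image, segmentation
# and camera files aligned when shuffling.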
self.file_lists = self.compile_file_list(self.data_dir, self.input_file)
image_paths_queue = tf.train.string_input_producer(
self.file_lists['image_file_list'], seed=seed,
shuffle=self.shuffle,
num_epochs=(1 if not self.shuffle else None)
)
seg_paths_queue = tf.train.string_input_producer(
self.file_lists['segment_file_list'], seed=seed,
shuffle=self.shuffle,
num_epochs=(1 if not self.shuffle else None))
cam_paths_queue = tf.train.string_input_producer(
self.file_lists['cam_file_list'], seed=seed,
shuffle=self.shuffle,
num_epochs=(1 if not self.shuffle else None))
img_reader = tf.WholeFileReader()
_, image_contents = img_reader.read(image_paths_queue)
seg_reader = tf.WholeFileReader()
_, seg_contents = seg_reader.read(seg_paths_queue)
if self.file_extension == 'jpg':
image_seq = tf.image.decode_jpeg(image_contents)
seg_seq = tf.image.decode_jpeg(seg_contents, channels=3)
elif self.file_extension == 'png':
image_seq = tf.image.decode_png(image_contents, channels=3)
seg_seq = tf.image.decode_png(seg_contents, channels=3)
with tf.name_scope('load_intrinsics'):
cam_reader = tf.TextLineReader()
_, raw_cam_contents = cam_reader.read(cam_paths_queue)
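# Each camera file holds a single CSV line with 9 values: the row-major
# 3x3 intrinsics matrix.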
rec_def = []
for _ in range(9):
rec_def.append([1.0])
raw_cam_vec = tf.decode_csv(raw_cam_contents, record_defaults=rec_def)
raw_cam_vec = tf.stack(raw_cam_vec)
intrinsics = tf.reshape(raw_cam_vec, [3, 3])
with tf.name_scope('convert_image'):
image_seq = self.preprocess_image(image_seq) # Converts to float.
if self.random_color:
with tf.name_scope('image_augmentation'):
image_seq = self.augment_image_colorspace(image_seq)
image_stack = self.unpack_images(image_seq)
seg_stack = self.unpack_images(seg_seq)
if self.flipping_mode != FLIP_NONE:
random_flipping = (self.flipping_mode == FLIP_RANDOM)
with tf.name_scope('image_augmentation_flip'):
image_stack, seg_stack, intrinsics = self.augment_images_flip(
image_stack, seg_stack, intrinsics,
randomized=random_flipping)
if self.random_scale_crop:
with tf.name_scope('image_augmentation_scale_crop'):
image_stack, seg_stack, intrinsics = self.augment_images_scale_crop(
image_stack, seg_stack, intrinsics, self.img_height,
self.img_width)
with tf.name_scope('multi_scale_intrinsics'):
intrinsic_mat = self.get_multi_scale_intrinsics(intrinsics,
self.num_scales)
intrinsic_mat.set_shape([self.num_scales, 3, 3])
intrinsic_mat_inv = tf.matrix_inverse(intrinsic_mat)
intrinsic_mat_inv.set_shape([self.num_scales, 3, 3])
if self.imagenet_norm:
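# Tile the per-channel ImageNet statistics across the seq_length concatenated
# frames (3 * seq_length channels) so normalization broadcasts correctly.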
im_mean = tf.tile(
tf.constant(IMAGENET_MEAN), multiples=[self.seq_length])
im_sd = tf.tile(
tf.constant(IMAGENET_SD), multiples=[self.seq_length])
image_stack_norm = (image_stack - im_mean) / im_sd
else:
image_stack_norm = image_stack
with tf.name_scope('batching'):
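# Shuffle batches when self.shuffle is set; otherwise batch deterministically
# with a single thread.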
if self.shuffle:
(image_stack, image_stack_norm, seg_stack, intrinsic_mat,
intrinsic_mat_inv) = tf.train.shuffle_batch(
[image_stack, image_stack_norm, seg_stack, intrinsic_mat,
intrinsic_mat_inv],
batch_size=self.batch_size,
capacity=QUEUE_SIZE + QUEUE_BUFFER * self.batch_size,
min_after_dequeue=QUEUE_SIZE)
else:
(image_stack, image_stack_norm, seg_stack, intrinsic_mat,
intrinsic_mat_inv) = tf.train.batch(
[image_stack, image_stack_norm, seg_stack, intrinsic_mat,
intrinsic_mat_inv],
batch_size=self.batch_size,
num_threads=1,
capacity=QUEUE_SIZE + QUEUE_BUFFER * self.batch_size)
logging.info('image_stack: %s', util.info(image_stack))
return (image_stack, image_stack_norm, seg_stack, intrinsic_mat,
intrinsic_mat_inv)
def unpack_images(self, image_seq):
"""[h, w * seq_length, 3] -> [h, w, 3 * seq_length]."""
with tf.name_scope('unpack_images'):
image_list = [
image_seq[:, i * self.img_width:(i + 1) * self.img_width, :]
for i in range(self.seq_length)
]
image_stack = tf.concat(image_list, axis=2)
image_stack.set_shape(
[self.img_height, self.img_width, self.seq_length * 3])
return image_stack
@classmethod
def preprocess_image(cls, image):
# Convert from uint8 to float.
return tf.image.convert_image_dtype(image, dtype=tf.float32)
@classmethod
def augment_image_colorspace(cls, image_stack):
"""Apply data augmentation to inputs."""
image_stack_aug = image_stack
# Randomly shift brightness.
apply_brightness = tf.less(tf.random_uniform(
shape=[], minval=0.0, maxval=1.0, dtype=tf.float32), 0.5)
image_stack_aug = tf.cond(
apply_brightness,
lambda: tf.image.random_brightness(image_stack_aug, max_delta=0.1),
lambda: image_stack_aug)
# Randomly shift contrast.
apply_contrast = tf.less(tf.random_uniform(
shape=[], minval=0.0, maxval=1.0, dtype=tf.float32), 0.5)
image_stack_aug = tf.cond(
apply_contrast,
lambda: tf.image.random_contrast(image_stack_aug, 0.85, 1.15),
lambda: image_stack_aug)
# Randomly change saturation.
apply_saturation = tf.less(tf.random_uniform(
shape=[], minval=0.0, maxval=1.0, dtype=tf.float32), 0.5)
image_stack_aug = tf.cond(
apply_saturation,
lambda: tf.image.random_saturation(image_stack_aug, 0.85, 1.15),
lambda: image_stack_aug)
# Randomly change hue.
apply_hue = tf.less(tf.random_uniform(
shape=[], minval=0.0, maxval=1.0, dtype=tf.float32), 0.5)
image_stack_aug = tf.cond(
apply_hue,
lambda: tf.image.random_hue(image_stack_aug, max_delta=0.1),
lambda: image_stack_aug)
image_stack_aug = tf.clip_by_value(image_stack_aug, 0, 1)
return image_stack_aug
@classmethod
def augment_images_flip(cls, image_stack, seg_stack, intrinsics,
randomized=True):
"""Randomly flips the image horizontally."""
def flip(cls, image_stack, seg_stack, intrinsics):
_, in_w, _ = image_stack.get_shape().as_list()
fx = intrinsics[0, 0]
fy = intrinsics[1, 1]
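# A horizontal flip mirrors the principal point: cx' = image_width - cx.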
cx = in_w - intrinsics[0, 2]
cy = intrinsics[1, 2]
intrinsics = cls.make_intrinsics_matrix(fx, fy, cx, cy)
return (tf.image.flip_left_right(image_stack),
tf.image.flip_left_right(seg_stack), intrinsics)
if randomized:
prob = tf.random_uniform(shape=[], minval=0.0, maxval=1.0,
dtype=tf.float32)
predicate = tf.less(prob, 0.5)
return tf.cond(predicate,
lambda: flip(cls, image_stack, seg_stack, intrinsics),
lambda: (image_stack, seg_stack, intrinsics))
else:
return flip(cls, image_stack, seg_stack, intrinsics)
@classmethod
def augment_images_scale_crop(cls, im, seg, intrinsics, out_h, out_w):
"""Randomly scales and crops image."""
def scale_randomly(im, seg, intrinsics):
"""Scales image and adjust intrinsics accordingly."""
in_h, in_w, _ = im.get_shape().as_list()
scaling = tf.random_uniform([2], 1, 1.15)
x_scaling = scaling[0]
y_scaling = scaling[1]
out_h = tf.cast(in_h * y_scaling, dtype=tf.int32)
out_w = tf.cast(in_w * x_scaling, dtype=tf.int32)
# Add batch.
im = tf.expand_dims(im, 0)
im = tf.image.resize_area(im, [out_h, out_w])
im = im[0]
seg = tf.expand_dims(seg, 0)
seg = tf.image.resize_area(seg, [out_h, out_w])
seg = seg[0]
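# Scaling the image scales the focal lengths and the principal point by the
# same factors.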
fx = intrinsics[0, 0] * x_scaling
fy = intrinsics[1, 1] * y_scaling
cx = intrinsics[0, 2] * x_scaling
cy = intrinsics[1, 2] * y_scaling
intrinsics = cls.make_intrinsics_matrix(fx, fy, cx, cy)
return im, seg, intrinsics
# Random cropping
def crop_randomly(im, seg, intrinsics, out_h, out_w):
"""Crops image and adjust intrinsics accordingly."""
# batch_size, in_h, in_w, _ = im.get_shape().as_list()
in_h, in_w, _ = tf.unstack(tf.shape(im))
offset_y = tf.random_uniform([1], 0, in_h - out_h + 1, dtype=tf.int32)[0]
offset_x = tf.random_uniform([1], 0, in_w - out_w + 1, dtype=tf.int32)[0]
im = tf.image.crop_to_bounding_box(im, offset_y, offset_x, out_h, out_w)
seg = tf.image.crop_to_bounding_box(seg, offset_y, offset_x, out_h, out_w)
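# Cropping shifts the principal point by the crop offsets; the focal lengths
# are unchanged.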
fx = intrinsics[0, 0]
fy = intrinsics[1, 1]
cx = intrinsics[0, 2] - tf.cast(offset_x, dtype=tf.float32)
cy = intrinsics[1, 2] - tf.cast(offset_y, dtype=tf.float32)
intrinsics = cls.make_intrinsics_matrix(fx, fy, cx, cy)
return im, seg, intrinsics
im, seg, intrinsics = scale_randomly(im, seg, intrinsics)
im, seg, intrinsics = crop_randomly(im, seg, intrinsics, out_h, out_w)
return im, seg, intrinsics
def compile_file_list(self, data_dir, split, load_pose=False):
"""Creates a list of input files."""
logging.info('data_dir: %s', data_dir)
with gfile.Open(os.path.join(data_dir, '%s.txt' % split), 'r') as f:
frames = f.readlines()
frames = [k.rstrip() for k in frames]
subfolders = [x.split(' ')[0] for x in frames]
frame_ids = [x.split(' ')[1] for x in frames]
image_file_list = [
os.path.join(data_dir, subfolders[i], frame_ids[i] + '.' +
self.file_extension)
for i in range(len(frames))
]
segment_file_list = [
os.path.join(data_dir, subfolders[i], frame_ids[i] + '-fseg.' +
self.file_extension)
for i in range(len(frames))
]
cam_file_list = [
os.path.join(data_dir, subfolders[i], frame_ids[i] + '_cam.txt')
for i in range(len(frames))
]
file_lists = {}
file_lists['image_file_list'] = image_file_list
file_lists['segment_file_list'] = segment_file_list
file_lists['cam_file_list'] = cam_file_list
if load_pose:
pose_file_list = [
os.path.join(data_dir, subfolders[i], frame_ids[i] + '_pose.txt')
for i in range(len(frames))
]
file_lists['pose_file_list'] = pose_file_list
self.steps_per_epoch = len(image_file_list) // self.batch_size
return file_lists
@classmethod
def make_intrinsics_matrix(cls, fx, fy, cx, cy):
r1 = tf.stack([fx, 0, cx])
r2 = tf.stack([0, fy, cy])
r3 = tf.constant([0., 0., 1.])
intrinsics = tf.stack([r1, r2, r3])
return intrinsics
@classmethod
def get_multi_scale_intrinsics(cls, intrinsics, num_scales):
"""Returns multiple intrinsic matrices for different scales."""
intrinsics_multi_scale = []
# Scale the intrinsics accordingly for each scale
for s in range(num_scales):
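# Scale s corresponds to a downsampling factor of 2**s, so all intrinsics
# entries are divided accordingly.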
fx = intrinsics[0, 0] / (2**s)
fy = intrinsics[1, 1] / (2**s)
cx = intrinsics[0, 2] / (2**s)
cy = intrinsics[1, 2] / (2**s)
intrinsics_multi_scale.append(cls.make_intrinsics_matrix(fx, fy, cx, cy))
intrinsics_multi_scale = tf.stack(intrinsics_multi_scale)
return intrinsics_multi_scale
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Train the model. Please refer to README for example usage."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import os
import random
import time
from absl import app
from absl import flags
from absl import logging
import numpy as np
import tensorflow as tf
import model
import nets
import reader
import util
gfile = tf.gfile
MAX_TO_KEEP = 1000000 # Maximum number of checkpoints to keep.
flags.DEFINE_string('data_dir', None, 'Preprocessed data.')
flags.DEFINE_string('file_extension', 'png', 'Image data file extension.')
flags.DEFINE_float('learning_rate', 0.0002, 'Adam learning rate.')
flags.DEFINE_float('beta1', 0.9, 'Adam momentum.')
flags.DEFINE_float('reconstr_weight', 0.85, 'Frame reconstruction loss weight.')
flags.DEFINE_float('ssim_weight', 0.15, 'SSIM loss weight.')
flags.DEFINE_float('smooth_weight', 0.04, 'Smoothness loss weight.')
flags.DEFINE_float('icp_weight', 0.0, 'ICP loss weight.')
flags.DEFINE_float('size_constraint_weight', 0.0005, 'Weight of the object '
'size constraint loss. Use only when motion handling is '
'enabled.')
flags.DEFINE_integer('batch_size', 4, 'The size of a sample batch.')
flags.DEFINE_integer('img_height', 128, 'Input frame height.')
flags.DEFINE_integer('img_width', 416, 'Input frame width.')
flags.DEFINE_integer('seq_length', 3, 'Number of frames in sequence.')
flags.DEFINE_enum('architecture', nets.RESNET, nets.ARCHITECTURES,
'Defines the architecture to use for the depth prediction '
'network. Defaults to ResNet-based encoder and accompanying '
'decoder.')
flags.DEFINE_boolean('imagenet_norm', True, 'Whether to normalize the input '
'images channel-wise so that they match the distribution '
'most ImageNet-models were trained on.')
flags.DEFINE_float('weight_reg', 0.05, 'The amount of weight regularization to '
'apply. This has no effect on the ResNet-based encoder '
'architecture.')
flags.DEFINE_boolean('exhaustive_mode', False, 'Whether to exhaustively warp '
'from any frame to any other instead of just considering '
'adjacent frames. Where necessary, multiple egomotion '
'estimates will be applied. Does not have an effect if '
'compute_minimum_loss is enabled.')
flags.DEFINE_boolean('random_scale_crop', False, 'Whether to apply random '
'image scaling and center cropping during training.')
flags.DEFINE_enum('flipping_mode', reader.FLIP_RANDOM,
[reader.FLIP_RANDOM, reader.FLIP_ALWAYS, reader.FLIP_NONE],
'Determines the image flipping mode: if random, performs '
'on-the-fly augmentation. Otherwise, flips the input images '
'always or never, respectively.')
flags.DEFINE_string('pretrained_ckpt', None, 'Path to checkpoint with '
'pretrained weights. Do not include .data* extension.')
flags.DEFINE_string('imagenet_ckpt', None, 'Initialize the weights according '
'to an ImageNet-pretrained checkpoint. Requires '
'architecture to be ResNet-18.')
flags.DEFINE_string('checkpoint_dir', None, 'Directory to save model '
'checkpoints.')
flags.DEFINE_integer('train_steps', 10000000, 'Number of training steps.')
flags.DEFINE_integer('summary_freq', 100, 'Save summaries every N steps.')
flags.DEFINE_bool('depth_upsampling', True, 'Whether to apply depth '
'upsampling of lower-scale representations before warping to '
'compute reconstruction loss on full-resolution image.')
flags.DEFINE_bool('depth_normalization', True, 'Whether to apply depth '
'normalization, that is, normalizing inverse depth '
'prediction maps by their mean to avoid degeneration towards '
'small values.')
flags.DEFINE_bool('compute_minimum_loss', True, 'Whether to take the '
'element-wise minimum of the reconstruction/SSIM error in '
'order to avoid overly penalizing dis-occlusion effects.')
flags.DEFINE_bool('use_skip', True, 'Whether to use skip connections in the '
'encoder-decoder architecture.')
flags.DEFINE_bool('equal_weighting', False, 'Whether to use equal weighting '
'of the smoothing loss term, regardless of resolution.')
flags.DEFINE_bool('joint_encoder', False, 'Whether to share parameters '
'between the depth and egomotion networks by using a joint '
'encoder architecture. The egomotion network is then '
'operating only on the hidden representation provided by the '
'joint encoder.')
flags.DEFINE_bool('handle_motion', True, 'Whether to try to handle motion by '
'using the provided segmentation masks.')
flags.DEFINE_string('master', 'local', 'Location of the session.')
FLAGS = flags.FLAGS
flags.mark_flag_as_required('data_dir')
flags.mark_flag_as_required('checkpoint_dir')
def main(_):
# Fixed seed for repeatability
seed = 8964
tf.set_random_seed(seed)
np.random.seed(seed)
random.seed(seed)
if FLAGS.handle_motion and FLAGS.joint_encoder:
raise ValueError('Using a joint encoder is currently not supported when '
'modeling object motion.')
if FLAGS.handle_motion and FLAGS.seq_length != 3:
raise ValueError('The current motion model implementation only supports '
'using a sequence length of three.')
if FLAGS.handle_motion and not FLAGS.compute_minimum_loss:
raise ValueError('Computing the minimum photometric loss is required when '
'enabling object motion handling.')
if FLAGS.size_constraint_weight > 0 and not FLAGS.handle_motion:
raise ValueError('To enforce object size constraints, enable motion '
'handling.')
if FLAGS.imagenet_ckpt and not FLAGS.imagenet_norm:
logging.warn('When initializing with an ImageNet-pretrained model, it is '
'recommended to normalize the image inputs accordingly using '
'imagenet_norm.')
if FLAGS.compute_minimum_loss and FLAGS.seq_length % 2 != 1:
raise ValueError('Compute minimum loss requires using an odd number of '
'images in a sequence.')
if FLAGS.architecture != nets.RESNET and FLAGS.imagenet_ckpt:
raise ValueError('Can only load weights from pre-trained ImageNet model '
'when using ResNet-architecture.')
if FLAGS.compute_minimum_loss and FLAGS.exhaustive_mode:
raise ValueError('Exhaustive mode has no effect when compute_minimum_loss '
'is enabled.')
if FLAGS.img_width % (2 ** 5) != 0 or FLAGS.img_height % (2 ** 5) != 0:
logging.warn('Image size is not divisible by 2^5. For the architecture '
'employed, this could cause artefacts due to resizing at '
'lower resolutions.')
if FLAGS.icp_weight > 0.0:
# TODO(casser): Change ICP interface to take matrix instead of vector.
raise ValueError('ICP is currently not supported.')
if not gfile.Exists(FLAGS.checkpoint_dir):
gfile.MakeDirs(FLAGS.checkpoint_dir)
train_model = model.Model(data_dir=FLAGS.data_dir,
file_extension=FLAGS.file_extension,
is_training=True,
learning_rate=FLAGS.learning_rate,
beta1=FLAGS.beta1,
reconstr_weight=FLAGS.reconstr_weight,
smooth_weight=FLAGS.smooth_weight,
ssim_weight=FLAGS.ssim_weight,
icp_weight=FLAGS.icp_weight,
batch_size=FLAGS.batch_size,
img_height=FLAGS.img_height,
img_width=FLAGS.img_width,
seq_length=FLAGS.seq_length,
architecture=FLAGS.architecture,
imagenet_norm=FLAGS.imagenet_norm,
weight_reg=FLAGS.weight_reg,
exhaustive_mode=FLAGS.exhaustive_mode,
random_scale_crop=FLAGS.random_scale_crop,
flipping_mode=FLAGS.flipping_mode,
depth_upsampling=FLAGS.depth_upsampling,
depth_normalization=FLAGS.depth_normalization,
compute_minimum_loss=FLAGS.compute_minimum_loss,
use_skip=FLAGS.use_skip,
joint_encoder=FLAGS.joint_encoder,
handle_motion=FLAGS.handle_motion,
equal_weighting=FLAGS.equal_weighting,
size_constraint_weight=FLAGS.size_constraint_weight)
train(train_model, FLAGS.pretrained_ckpt, FLAGS.imagenet_ckpt,
FLAGS.checkpoint_dir, FLAGS.train_steps, FLAGS.summary_freq)
def train(train_model, pretrained_ckpt, imagenet_ckpt, checkpoint_dir,
train_steps, summary_freq):
"""Train model."""
vars_to_restore = None
if pretrained_ckpt is not None:
vars_to_restore = util.get_vars_to_save_and_restore(pretrained_ckpt)
ckpt_path = pretrained_ckpt
elif imagenet_ckpt:
vars_to_restore = util.get_imagenet_vars_to_restore(imagenet_ckpt)
ckpt_path = imagenet_ckpt
pretrain_restorer = tf.train.Saver(vars_to_restore)
vars_to_save = util.get_vars_to_save_and_restore()
vars_to_save[train_model.global_step.op.name] = train_model.global_step
saver = tf.train.Saver(vars_to_save, max_to_keep=MAX_TO_KEEP)
sv = tf.train.Supervisor(logdir=checkpoint_dir, save_summaries_secs=0,
saver=None)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with sv.managed_session(config=config) as sess:
if pretrained_ckpt is not None or imagenet_ckpt:
logging.info('Restoring pretrained weights from %s', ckpt_path)
pretrain_restorer.restore(sess, ckpt_path)
logging.info('Attempting to resume training from %s...', checkpoint_dir)
checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
logging.info('Last checkpoint found: %s', checkpoint)
if checkpoint:
saver.restore(sess, checkpoint)
logging.info('Training...')
start_time = time.time()
last_summary_time = time.time()
steps_per_epoch = train_model.reader.steps_per_epoch
step = 1
while step <= train_steps:
fetches = {
'train': train_model.train_op,
'global_step': train_model.global_step,
'incr_global_step': train_model.incr_global_step
}
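# Fetching incr_global_step advances the global step once per training
# iteration (the op itself is defined in model.py).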
if step % summary_freq == 0:
fetches['loss'] = train_model.total_loss
fetches['summary'] = sv.summary_op
results = sess.run(fetches)
global_step = results['global_step']
if step % summary_freq == 0:
sv.summary_writer.add_summary(results['summary'], global_step)
train_epoch = math.ceil(global_step / steps_per_epoch)
train_step = global_step - (train_epoch - 1) * steps_per_epoch
this_cycle = time.time() - last_summary_time
last_summary_time += this_cycle
logging.info(
'Epoch: [%2d] [%5d/%5d] time: %4.2fs (%ds total) loss: %.3f',
train_epoch, train_step, steps_per_epoch, this_cycle,
time.time() - start_time, results['loss'])
if step % steps_per_epoch == 0:
logging.info('[*] Saving checkpoint to %s...', checkpoint_dir)
saver.save(sess, os.path.join(checkpoint_dir, 'model'),
global_step=global_step)
# Setting step to global_step allows for training for a total of
# train_steps even if the program is restarted during training.
step = global_step + 1
if __name__ == '__main__':
app.run(main)
# Copyright 2018 The TensorFlow Authors All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Contains common utilities and functions."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import locale
import os
import re
from absl import logging
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import cv2
gfile = tf.gfile
CMAP_DEFAULT = 'plasma'
# Defines the cropping that is applied to the Cityscapes dataset with respect to
# the original raw input resolution.
CITYSCAPES_CROP = [256, 768, 192, 1856]
def crop_cityscapes(im, resize=None):
ymin, ymax, xmin, xmax = CITYSCAPES_CROP
im = im[ymin:ymax, xmin:xmax]
if resize is not None:
im = cv2.resize(im, resize)
return im
def gray2rgb(im, cmap=CMAP_DEFAULT):
cmap = plt.get_cmap(cmap)
result_img = cmap(im.astype(np.float32))
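# Matplotlib colormaps return RGBA; drop the alpha channel to keep RGB.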
if result_img.shape[2] > 3:
result_img = np.delete(result_img, 3, 2)
return result_img
def load_image(img_file, resize=None, interpolation='linear'):
"""Load image from disk. Output value range: [0,1]."""
im_data = np.fromstring(gfile.Open(img_file).read(), np.uint8)
im = cv2.imdecode(im_data, cv2.IMREAD_COLOR)
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
if resize and resize != im.shape[:2]:
ip = cv2.INTER_LINEAR if interpolation == 'linear' else cv2.INTER_NEAREST
im = cv2.resize(im, resize, interpolation=ip)
return np.array(im, dtype=np.float32) / 255.0
def save_image(img_file, im, file_extension):
"""Save image from disk. Expected input value range: [0,1]."""
im = (im * 255.0).astype(np.uint8)
with gfile.Open(img_file, 'w') as f:
im = cv2.cvtColor(im, cv2.COLOR_RGB2BGR)
_, im_data = cv2.imencode('.%s' % file_extension, im)
f.write(im_data.tostring())
def normalize_depth_for_display(depth, pc=95, crop_percent=0, normalizer=None,
cmap=CMAP_DEFAULT):
"""Converts a depth map to an RGB image."""
# Convert to disparity.
disp = 1.0 / (depth + 1e-6)
if normalizer is not None:
disp /= normalizer
else:
disp /= (np.percentile(disp, pc) + 1e-6)
disp = np.clip(disp, 0, 1)
disp = gray2rgb(disp, cmap=cmap)
keep_h = int(disp.shape[0] * (1 - crop_percent))
disp = disp[:keep_h]
return disp
def get_seq_start_end(target_index, seq_length, sample_every=1):
"""Returns absolute seq start and end indices for a given target frame."""
half_offset = int((seq_length - 1) / 2) * sample_every
end_index = target_index + half_offset
start_index = end_index - (seq_length - 1) * sample_every
return start_index, end_index
def get_seq_middle(seq_length):
"""Returns relative index for the middle frame in sequence."""
half_offset = int((seq_length - 1) / 2)
return seq_length - 1 - half_offset
def info(obj):
"""Return info on shape and dtype of a numpy array or TensorFlow tensor."""
if obj is None:
return 'None.'
elif isinstance(obj, list):
if obj:
return 'List of %d... %s' % (len(obj), info(obj[0]))
else:
return 'Empty list.'
elif isinstance(obj, tuple):
if obj:
return 'Tuple of %d... %s' % (len(obj), info(obj[0]))
else:
return 'Empty tuple.'
else:
if is_a_numpy_array(obj):
return 'Array with shape: %s, dtype: %s' % (obj.shape, obj.dtype)
else:
return str(obj)
def is_a_numpy_array(obj):
"""Returns true if obj is a numpy array."""
return type(obj).__module__ == np.__name__
def count_parameters(also_print=True):
"""Cound the number of parameters in the model.
Args:
also_print: Boolean. If True also print the numbers.
Returns:
The total number of parameters.
"""
total = 0
if also_print:
logging.info('Model Parameters:')
for (_, v) in get_vars_to_save_and_restore().items():
shape = v.get_shape()
if also_print:
logging.info('%s %s: %s', v.op.name, shape,
format_number(shape.num_elements()))
total += shape.num_elements()
if also_print:
logging.info('Total: %s', format_number(total))
return total
def get_vars_to_save_and_restore(ckpt=None):
"""Returns list of variables that should be saved/restored.
Args:
ckpt: Path to existing checkpoint. If present, returns only the subset of
variables that exist in given checkpoint.
Returns:
Dictionary mapping variable names to the variables that need to be
saved/restored.
"""
model_vars = tf.trainable_variables()
# Add batchnorm variables.
bn_vars = [v for v in tf.global_variables()
if 'moving_mean' in v.op.name or 'moving_variance' in v.op.name or
'mu' in v.op.name or 'sigma' in v.op.name or
'global_scale_var' in v.op.name]
model_vars.extend(bn_vars)
model_vars = sorted(model_vars, key=lambda x: x.op.name)
mapping = {}
if ckpt is not None:
ckpt_var = tf.contrib.framework.list_variables(ckpt)
ckpt_var_names = [name for (name, unused_shape) in ckpt_var]
ckpt_var_shapes = [shape for (unused_name, shape) in ckpt_var]
not_loaded = list(ckpt_var_names)
for v in model_vars:
if v.op.name not in ckpt_var_names:
# For backward compatibility, try additional matching.
v_additional_name = v.op.name.replace('egomotion_prediction/', '')
if v_additional_name in ckpt_var_names:
# Check if shapes match.
ind = ckpt_var_names.index(v_additional_name)
if ckpt_var_shapes[ind] == v.get_shape():
mapping[v_additional_name] = v
not_loaded.remove(v_additional_name)
continue
else:
logging.warn('Shape mismatch, will not restore %s.', v.op.name)
logging.warn('Did not find var %s in checkpoint: %s', v.op.name,
os.path.basename(ckpt))
else:
# Check if shapes match.
ind = ckpt_var_names.index(v.op.name)
if ckpt_var_shapes[ind] == v.get_shape():
mapping[v.op.name] = v
not_loaded.remove(v.op.name)
else:
logging.warn('Shape mismatch, will not restore %s.', v.op.name)
if not_loaded:
logging.warn('The following variables in the checkpoint were not loaded:')
for varname_not_loaded in not_loaded:
logging.info('%s', varname_not_loaded)
else: # just get model vars.
for v in model_vars:
mapping[v.op.name] = v
return mapping
def get_imagenet_vars_to_restore(imagenet_ckpt):
"""Returns dict of variables to restore from ImageNet-checkpoint."""
vars_to_restore_imagenet = {}
ckpt_var_names = tf.contrib.framework.list_variables(imagenet_ckpt)
ckpt_var_names = [name for (name, unused_shape) in ckpt_var_names]
model_vars = tf.global_variables()
for v in model_vars:
if 'global_step' in v.op.name: continue
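# Match model variable names to the checkpoint's naming scheme: strip the
# 'depth_prediction/' scope prefix and rename batch-norm statistics to
# 'mu'/'sigma'.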
mvname_noprefix = v.op.name.replace('depth_prediction/', '')
mvname_noprefix = mvname_noprefix.replace('moving_mean', 'mu')
mvname_noprefix = mvname_noprefix.replace('moving_variance', 'sigma')
if mvname_noprefix in ckpt_var_names:
vars_to_restore_imagenet[mvname_noprefix] = v
else:
logging.info('The following variable will not be restored from '
'pretrained ImageNet-checkpoint: %s', mvname_noprefix)
return vars_to_restore_imagenet
def format_number(n):
"""Formats number with thousands commas."""
locale.setlocale(locale.LC_ALL, 'en_US')
return locale.format('%d', n, grouping=True)
def atoi(text):
return int(text) if text.isdigit() else text
def natural_keys(text):
return [atoi(c) for c in re.split(r'(\d+)', text)]
def read_text_lines(filepath):
with tf.gfile.Open(filepath, 'r') as f:
lines = f.readlines()
lines = [l.rstrip() for l in lines]
return lines