This is a method for unsupervised learning of depth and egomotion from monocular video. It achieves state-of-the-art results on both tasks by explicitly modeling 3D object motion, performing on-line refinement, and introducing novel loss formulations that improve quality for moving objects. It accompanies the following paper:
**V. Casser, S. Pirk, R. Mahjourian, A. Angelova, Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos, AAAI Conference on Artificial Intelligence, 2019**
https://arxiv.org/pdf/1811.06152.pdf
This code is implemented and supported by Vincent Casser (git username: VincentCa) and Anelia Angelova (git username: AneliaAngelova). Please contact anelia@google.com for questions.
Before running training, run the gen_data_* script for the respective dataset to generate the data in the appropriate format for KITTI or Cityscapes. It is assumed that motion masks have already been generated and stored as images.
Models are trained starting from an ImageNet-pretrained model.
```shell
ckpt_dir="your/checkpoint/folder"
data_dir="KITTI_SEQ2_LR/"  # Set for KITTI
data_dir="CITYSCAPES_SEQ2_LR/"  # Set for Cityscapes
imagenet_ckpt="resnet_pretrained/model.ckpt"
python train.py \
  --logtostderr \
  --checkpoint_dir $ckpt_dir \
  --data_dir $data_dir \
  --architecture resnet \
  --imagenet_ckpt $imagenet_ckpt \
  --imagenet_norm true \
  --joint_encoder false
```
## Running depth/egomotion inference on an image folder
KITTI models are trained on the raw image data (resized to 416 x 128), with inputs standardized before being fed to the network. Cityscapes images are cropped using the cropping parameters (192, 1856, 256, 768). If a different crop is used, additional training is likely necessary, so please follow the inference example shown below when using one of the models. The right model choice depends on a variety of factors. For example, if a checkpoint is to be used for odometry, be aware that motion models can produce improved odometry when segmentation masks are supplied (set *use_masks=true* for inference). On the other hand, all models can be used for single-frame depth estimation without any additional information.
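The crop parameters are not explained above; one consistent reading, which is our assumption rather than something stated here, is (x_min, x_max, y_min, y_max) on a full-size 2048 x 1024 Cityscapes frame. That reading yields a 1664 x 512 crop with the same 3.25:1 aspect ratio as the 416 x 128 network input, so the subsequent resize is distortion-free. A minimal sketch:

```python
import numpy as np

# Assumed interpretation of the crop parameters (192, 1856, 256, 768):
# x in [192, 1856) and y in [256, 768) on a 2048 x 1024 Cityscapes frame.
# Note 1664 / 512 == 416 / 128 == 3.25, matching the network input.
X_MIN, X_MAX, Y_MIN, Y_MAX = 192, 1856, 256, 768

def crop_cityscapes(img: np.ndarray) -> np.ndarray:
    """Crop a (H, W, C) Cityscapes frame to the assumed training region."""
    return img[Y_MIN:Y_MAX, X_MIN:X_MAX]

frame = np.zeros((1024, 2048, 3), dtype=np.uint8)  # stand-in for a real frame
print(crop_cityscapes(frame).shape)  # (512, 1664, 3)
```

After this crop, the image would be resized to 416 x 128 before inference, matching the KITTI input resolution.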
```shell
input_dir="your/image/folder"
output_dir="your/output/folder"
model_checkpoint="your/model/checkpoint"
python inference.py \
  --logtostderr \
  --file_extension png \
  --depth \
  --egomotion true \
  --input_dir $input_dir \
  --output_dir $output_dir \
  --model_ckpt $model_checkpoint
```
Note that egomotion prediction expects the files in the input directory to form a consecutive sequence, and that sorting the filenames alphabetically must put them in the correct order.
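Because the ordering is purely alphabetical, unpadded frame numbers can silently reorder the sequence; zero-padding the filenames avoids this. A small illustration (the filenames are hypothetical):

```python
# Unpadded frame numbers sort alphabetically, not numerically:
unpadded = sorted(["frame10.png", "frame2.png", "frame1.png"])
print(unpadded)  # ['frame1.png', 'frame10.png', 'frame2.png'] -- wrong order

# Zero-padded names sort alphabetically and numerically alike:
padded = sorted(["frame0010.png", "frame0002.png", "frame0001.png"])
print(padded)  # ['frame0001.png', 'frame0002.png', 'frame0010.png']
```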
Alternatively, inference can also be run on pre-processed images.
## Running on-line refinement
On-line refinement is executed on top of an existing inference folder, so make sure to run regular inference first. Then you can run the on-line fusion procedure as follows:
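The exact refinement command is not shown here. The sketch below assumes a driver script named optimize.py with flags analogous to those of train.py and inference.py above; the script name and every flag are assumptions to be checked against the codebase before use:

```shell
# All names below are assumptions -- verify the actual refinement script
# and its flags in the repository before running.
prediction_dir="your/inference/output/folder"  # from the inference step above
model_ckpt="your/model/checkpoint"
output_dir="your/refinement/output/folder"
python optimize.py \
  --logtostderr \
  --output_dir $output_dir \
  --data_dir $prediction_dir \
  --model_ckpt $model_ckpt
```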
The project page can be found at https://sites.google.com/view/struct2depth.
The core implementation is derived from [vid2depth](https://github.com/tensorflow/models/tree/master/research/vid2depth) by [Reza Mahjourian](mailto:rezama@google.com), which in turn is based on [SfMLearner](https://github.com/tinghuiz/SfMLearner).