Commit 85dd5fa4 authored by Zhichao Lu, committed by pkulzc

Merged commit includes the following changes:

204489224  by Zhichao Lu:

    Modify ssd mobilenet v1 fpn config to be a bit more tolerant to OOM failure by bumping down the batch size to 64 and doubling the number of iterations to 25k. It now converges in 2.5 hours.

--
204488942  by Zhichao Lu:

    Internal change

204480631  by Zhichao Lu:

    This CL makes sure that the num_steps parameter is not updated to 0 if the
    num_steps field is not mentioned in the config.

    The default behavior for the number of training steps is to train
    indefinitely (train forever). The default value of num_steps in train.proto
    is 0 (meaning train indefinitely), but the estimator/training function
    expects num_steps to be None in order to train indefinitely.
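    A minimal sketch of the intended precedence (hypothetical helper for
    illustration only; the actual change is in create_estimator_and_inputs,
    included further down in this commit):

        def resolve_train_steps(flag_value, config_num_steps):
          # A command-line flag wins when provided; num_steps == 0 in
          # train.proto means "unset", which the Estimator API expresses as
          # None (train indefinitely).
          if flag_value is not None:
            return flag_value
          return config_num_steps if config_num_steps != 0 else None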

--
204437217  by Zhichao Lu:

    Create a Docker image to support TensorFlow Lite / Object Detection blog post.

--
204317570  by Zhichao Lu:

    Internal change

PiperOrigin-RevId: 204489224
parent 11070af9
......@@ -77,6 +77,10 @@ Extras:
Run an instance segmentation model</a><br>
* <a href='g3doc/challenge_evaluation.md'>
Run the evaluation for the Open Images Challenge 2018</a><br>
* <a href='g3doc/tpu_compatibility.md'>
TPU compatible detection pipelines</a><br>
* <a href='g3doc/running_on_mobile_tensorflowlite.md'>
Running object detection on mobile devices with TensorFlow Lite</a><br>
## Getting Help
......@@ -95,6 +99,26 @@ reporting an issue.
## Release information
### July 13, 2018
There are many new updates in this release, extending the functionality and
capability of the API:
* Moving from slim-based training to [Estimator](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator)-based
training.
* Support for [RetinaNet](https://arxiv.org/abs/1708.02002), and a [MobileNet](https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html)
adaptation of RetinaNet.
* A novel SSD-based architecture called the [Pooling Pyramid Network](https://arxiv.org/abs/1807.03284) (PPN).
* Releasing several [TPU](https://cloud.google.com/tpu/)-compatible models.
These can be found in the `samples/configs/` directory with a comment in the
pipeline configuration files indicating TPU compatibility.
* Support for quantized training.
* Updated documentation for new binaries, Cloud training, and [Tensorflow Lite](https://www.tensorflow.org/mobile/tflite/).
<b>Thanks to contributors</b>: Sara Robinson, Aakanksha Chowdhery, Derek Chow,
Pengchong Jin, Jonathan Huang, Vivek Rathod, Zhichao Lu, Ronny Votel
### June 25, 2018
Additional evaluation tools for the [Open Images Challenge 2018](https://storage.googleapis.com/openimages/web/challenge.html) are out.
......
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
FROM tensorflow/tensorflow:nightly-devel
# Get the tensorflow models research directory, and move it into tensorflow
# source folder to match recommendation of installation
RUN git clone --depth 1 https://github.com/tensorflow/models.git && \
mv models /tensorflow/models
# Install gcloud and gsutil commands
# https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu
RUN export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - && \
apt-get update -y && apt-get install google-cloud-sdk -y
# Install the Tensorflow Object Detection API from here
# https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md
# Install object detection api dependencies
RUN apt-get install -y protobuf-compiler python-pil python-lxml python-tk && \
pip install Cython && \
pip install contextlib2 && \
pip install jupyter && \
pip install matplotlib
# Install pycocoapi
RUN git clone --depth 1 https://github.com/cocodataset/cocoapi.git && \
cd cocoapi/PythonAPI && \
make -j8 && \
cp -r pycocotools /tensorflow/models/research && \
cd ../../ && \
rm -rf cocoapi
# Get protoc 3.0.0, rather than the old version already in the container
RUN curl -OL "https://github.com/google/protobuf/releases/download/v3.0.0/protoc-3.0.0-linux-x86_64.zip" && \
unzip protoc-3.0.0-linux-x86_64.zip -d proto3 && \
mv proto3/bin/* /usr/local/bin && \
mv proto3/include/* /usr/local/include && \
rm -rf proto3 protoc-3.0.0-linux-x86_64.zip
# Run protoc on the object detection repo
RUN cd /tensorflow/models/research && \
protoc object_detection/protos/*.proto --python_out=.
# Set the PYTHONPATH to finish installing the API
ENV PYTHONPATH $PYTHONPATH:/tensorflow/models/research:/tensorflow/models/research/slim
# Install wget (to make life easier below) and editors (to allow people to edit
# the files inside the container)
RUN apt-get install -y wget vim emacs nano
# Grab various data files which are used throughout the demo: dataset,
# pretrained model, and pretrained TensorFlow Lite model. Install these all in
# the same directories as recommended by the blog post.
# Pets example dataset
RUN mkdir -p /tmp/pet_faces_tfrecord/ && \
cd /tmp/pet_faces_tfrecord && \
curl "http://download.tensorflow.org/models/object_detection/pet_faces_tfrecord.tar.gz" | tar xzf -
# Pretrained model
# This one doesn't need its own directory, since it comes in a folder.
RUN cd /tmp && \
curl -O "http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03.tar.gz" && \
tar xzf ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03.tar.gz && \
rm ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03.tar.gz
# Trained TensorFlow Lite model. This should get replaced by one generated from
# export_tflite_ssd_graph.py when that command is called.
RUN cd /tmp && \
curl -L -o tflite.zip \
https://storage.googleapis.com/download.tensorflow.org/models/tflite/frozengraphs_ssd_mobilenet_v1_0.75_quant_pets_2018_06_29.zip && \
unzip tflite.zip -d tflite && \
rm tflite.zip
# Install Android development tools
# Inspired by the following sources:
# https://github.com/bitrise-docker/android/blob/master/Dockerfile
# https://github.com/reddit/docker-android-build/blob/master/Dockerfile
# Set environment variables
ENV ANDROID_HOME /opt/android-sdk-linux
ENV ANDROID_NDK_HOME /opt/android-ndk-r14b
ENV PATH ${PATH}:${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools
# Install SDK tools
RUN cd /opt && \
curl -OL https://dl.google.com/android/repository/sdk-tools-linux-4333796.zip && \
unzip sdk-tools-linux-4333796.zip -d ${ANDROID_HOME} && \
rm sdk-tools-linux-4333796.zip
# Accept licenses before installing components, no need to echo y for each component
# License is valid for all the standard components in versions installed from this file
# Non-standard components: MIPS system images, preview versions, GDK (Google Glass) and Android Google TV require separate licenses, not accepted there
RUN yes | sdkmanager --licenses
# Install platform tools, SDK platform, and other build tools
RUN yes | sdkmanager \
"tools" \
"platform-tools" \
"platforms;android-27" \
"platforms;android-23" \
"build-tools;27.0.3" \
"build-tools;23.0.3"
# Install Android NDK (r14b)
RUN cd /opt && \
curl -L -o android-ndk.zip http://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip && \
unzip -q android-ndk.zip && \
rm -f android-ndk.zip
# Configure the build to use the things we just downloaded
RUN cd /tensorflow && \
printf '\n\nn\ny\nn\nn\nn\ny\nn\nn\nn\nn\nn\nn\n\ny\n%s\n\n\n' ${ANDROID_HOME}|./configure
WORKDIR /tensorflow
# Dockerfile for the TPU and TensorFlow Lite Object Detection tutorial
This Docker image automates the setup involved with training
object detection models on Google Cloud and building the Android TensorFlow Lite
demo app. We recommend using this container if you decide to work through our
tutorial on ["Training and serving a real-time mobile object detector in
30 minutes with Cloud TPUs"](https://medium.com/tensorflow/training-and-serving-a-realtime-mobile-object-detector-in-30-minutes-with-cloud-tpus-b78971cf1193), though of course it may be useful even if you would
like to use the Object Detection API outside the context of the tutorial.
A couple of words of warning:
1. Docker containers do not have persistent storage. This means that any changes
you make to files inside the container will not persist if you restart
the container. When running through the tutorial,
**do not close the container**.
2. To be able to deploy the [Android app](
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/examples/android/app)
(which you will build at the end of the tutorial),
you will need to kill any instances of `adb` running on the host machine. You
can accomplish this by closing all instances of Android Studio, and then
running `adb kill-server`.
You can install Docker by following the [instructions here](
https://docs.docker.com/install/).
## Running The Container
From this directory, build the Dockerfile as follows (this takes a while):
```
docker build --tag detect-tf .
```
Run the container:
```
docker run --rm -it --privileged -p 6006:6006 detect-tf
```
When running the container, you will find yourself inside the `/tensorflow`
directory, which is the path to the TensorFlow [source
tree](https://github.com/tensorflow/tensorflow).
## Text Editing
The tutorial also
requires you to occasionally edit files inside the source tree.
This Docker image comes with `vim`, `nano`, and `emacs` preinstalled for your
convenience.
## What's In This Container
This container is derived from the nightly build of TensorFlow, and contains the
sources for TensorFlow at `/tensorflow`, as well as the
[TensorFlow Models](https://github.com/tensorflow/models) repository, which is available at
`/tensorflow/models` (and contains the Object Detection API as a subdirectory
at `/tensorflow/models/research/object_detection`).
The Oxford-IIIT Pets dataset, the COCO pre-trained SSD + MobileNet (v1)
checkpoint, and an example
trained model are all available in `/tmp` in their respective folders.
This container also has the `gsutil` and `gcloud` utilities, the `bazel` build
tool, and all dependencies necessary to use the Object Detection API, and
compile and install the TensorFlow Lite Android demo app.
At various points throughout the tutorial, you may see references to the
*research directory*. This refers to the `research` folder within the
models repository, located at
`/tensorflow/models/research`.
# Tensorflow detection model zoo
We provide a collection of detection models pre-trained on the [COCO
dataset](http://mscoco.org), the [Kitti dataset](http://www.cvlibs.net/datasets/kitti/), and the
[Open Images dataset](https://github.com/openimages/dataset). These models can
dataset](http://mscoco.org), the [Kitti dataset](http://www.cvlibs.net/datasets/kitti/),
the [Open Images dataset](https://github.com/openimages/dataset) and the
[AVA v2.1 dataset](https://research.google.com/ava/). These models can
be useful for
out-of-the-box inference if you are interested in categories already in COCO
(e.g., humans, cars, etc) or in Open Images (e.g.,
......@@ -57,19 +58,26 @@ Some remarks on frozen inference graphs:
a detector (and discarding the part past that point), which negatively impacts
standard mAP metrics.
* Our frozen inference graphs are generated using the
[v1.5.0](https://github.com/tensorflow/tensorflow/tree/v1.5.0)
[v1.8.0](https://github.com/tensorflow/tensorflow/tree/v1.8.0)
release version of Tensorflow and we do not guarantee that these will work
with other versions; this being said, each frozen inference graph can be
regenerated using your current version of Tensorflow by re-running the
[exporter](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md),
pointing it at the model directory as well as the config file inside of it.
pointing it at the model directory as well as the corresponding config file in
[samples/configs](https://github.com/tensorflow/models/tree/master/research/object_detection/samples/configs).
## COCO-trained models {#coco-models}
## COCO-trained models
| Model name | Speed (ms) | COCO mAP[^1] | Outputs |
| ------------ | :--------------: | :--------------: | :-------------: |
| [ssd_mobilenet_v1_coco](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz) | 30 | 21 | Boxes |
| [ssd_mobilenet_v1_0.75_depth_coco ☆](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03.tar.gz) | 26 | 18 | Boxes |
| [ssd_mobilenet_v1_quantized_coco ☆](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_03.tar.gz) | 29 | 18 | Boxes |
| [ssd_mobilenet_v1_0.75_depth_quantized_coco ☆](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_03.tar.gz) | 29 | 16 | Boxes |
| [ssd_mobilenet_v1_ppn_coco ☆](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_ppn_shared_box_predictor_300x300_coco14_sync_2018_07_03.tar.gz) | 26 | 20 | Boxes |
| [ssd_mobilenet_v1_fpn_coco ☆](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz) | 56 | 32 | Boxes |
| [ssd_resnet_50_fpn_coco ☆](http://download.tensorflow.org/models/object_detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz) | 76 | 35 | Boxes |
| [ssd_mobilenet_v2_coco](http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz) | 31 | 22 | Boxes |
| [ssdlite_mobilenet_v2_coco](http://download.tensorflow.org/models/object_detection/ssdlite_mobilenet_v2_coco_2018_05_09.tar.gz) | 27 | 22 | Boxes |
| [ssd_inception_v2_coco](http://download.tensorflow.org/models/object_detection/ssd_inception_v2_coco_2018_01_28.tar.gz) | 42 | 24 | Boxes |
......@@ -88,15 +96,15 @@ Some remarks on frozen inference graphs:
| [mask_rcnn_resnet101_atrous_coco](http://download.tensorflow.org/models/object_detection/mask_rcnn_resnet101_atrous_coco_2018_01_28.tar.gz) | 470 | 33 | Masks |
| [mask_rcnn_resnet50_atrous_coco](http://download.tensorflow.org/models/object_detection/mask_rcnn_resnet50_atrous_coco_2018_01_28.tar.gz) | 343 | 29 | Masks |
Note: The star (☆) at the end of a model name indicates that the model supports TPU training.
## Kitti-trained models {#kitti-models}
## Kitti-trained models
Model name | Speed (ms) | Pascal mAP@0.5 | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
[faster_rcnn_resnet101_kitti](http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_kitti_2018_01_28.tar.gz) | 79 | 87 | Boxes
## Open Images-trained models {#open-images-models}
## Open Images-trained models
Model name | Speed (ms) | Open Images mAP@0.5[^2] | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
......@@ -104,7 +112,7 @@ Model name
[faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid](http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz) | 347 | | Boxes
## AVA v2.1 trained models {#ava-models}
## AVA v2.1 trained models
Model name | Speed (ms) | Pascal mAP@0.5 | Outputs
----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---: | :-------------: | :-----:
......@@ -112,5 +120,6 @@ Model name
[^1]: See [MSCOCO evaluation protocol](http://cocodataset.org/#detections-eval).
[^2]: This is PASCAL mAP with a slightly different way of true positives computation: see [Open Images evaluation protocol](evaluation_protocols.md#open-images).
......@@ -34,37 +34,22 @@ A local training job can be run with the following command:
```bash
# From the tensorflow/models/research/ directory
python object_detection/train.py \
--logtostderr \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--train_dir=${PATH_TO_TRAIN_DIR}
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
NUM_TRAIN_STEPS=50000
NUM_EVAL_STEPS=2000
python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--num_eval_steps=${NUM_EVAL_STEPS} \
--alsologtostderr
```
where `${PATH_TO_YOUR_PIPELINE_CONFIG}` points to the pipeline config and
`${PATH_TO_TRAIN_DIR}` points to the directory in which training checkpoints
and events will be written. By default, the training job will
run indefinitely until the user kills it.
## Running the Evaluation Job
Evaluation is run as a separate job. The eval job will periodically poll the
train directory for new checkpoints and evaluate them on a test dataset. The
job can be run using the following command:
```bash
# From the tensorflow/models/research/ directory
python object_detection/eval.py \
--logtostderr \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--checkpoint_dir=${PATH_TO_TRAIN_DIR} \
--eval_dir=${PATH_TO_EVAL_DIR}
```
where `${PATH_TO_YOUR_PIPELINE_CONFIG}` points to the pipeline config,
`${PATH_TO_TRAIN_DIR}` points to the directory in which training checkpoints
were saved (same as the training job) and `${PATH_TO_EVAL_DIR}` points to the
directory in which evaluation events will be saved. As with the training job,
the eval job runs until terminated by default.
where `${PIPELINE_CONFIG_PATH}` points to the pipeline config and
`${MODEL_DIR}` points to the directory in which training checkpoints
and events will be written. Note that this binary will interleave both
training and evaluation.
## Running Tensorboard
......@@ -73,9 +58,9 @@ using the recommended directory structure, Tensorboard can be run using the
following command:
```bash
tensorboard --logdir=${PATH_TO_MODEL_DIRECTORY}
tensorboard --logdir=${MODEL_DIR}
```
where `${PATH_TO_MODEL_DIRECTORY}` points to the directory that contains the
where `${MODEL_DIR}` points to the directory that contains the
train and eval directories. Please note it may take Tensorboard a couple minutes
to populate with data.
# Running on Google Cloud Platform
# Running on Google Cloud ML Engine
The Tensorflow Object Detection API supports distributed training on Google
Cloud ML Engine. This section documents instructions on how to train and
......@@ -23,26 +23,28 @@ evaluation jobs for a few iterations
## Packaging
In order to run the Tensorflow Object Detection API on Cloud ML, it must be
packaged (along with its TF-Slim dependency). The required packages can be
created with the following command
packaged (along with its TF-Slim dependency and the
[pycocotools](https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools)
library). The required packages can be created with the following command:
``` bash
# From tensorflow/models/research/
bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
python setup.py sdist
(cd slim && python setup.py sdist)
```
This will create python packages in dist/object_detection-0.1.tar.gz and
slim/dist/slim-0.1.tar.gz.
This will create python packages dist/object_detection-0.1.tar.gz,
slim/dist/slim-0.1.tar.gz, and /tmp/pycocotools/pycocotools-2.0.tar.gz.
## Running a Multiworker Training Job
## Running a Multiworker (GPU) Training Job on CMLE
Google Cloud ML requires a YAML configuration file for a multiworker training
job using GPUs. A sample YAML file is given below:
```
trainingInput:
runtimeVersion: "1.2"
runtimeVersion: "1.8"
scaleTier: CUSTOM
masterType: standard_gpu
workerCount: 9
......@@ -68,22 +70,22 @@ The YAML file should be saved on the local machine (not on GCP). Once it has
been written, a user can start a training job on Cloud ML Engine using the
following command:
``` bash
```bash
# From tensorflow/models/research/
gcloud ml-engine jobs submit training object_detection_`date +%s` \
--runtime-version 1.2 \
--job-dir=gs://${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
gcloud ml-engine jobs submit training object_detection_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 1.8 \
--job-dir=gs://${MODEL_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-central1 \
--config ${PATH_TO_LOCAL_YAML_FILE} \
-- \
--train_dir=gs://${TRAIN_DIR} \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
```
Where `${PATH_TO_LOCAL_YAML_FILE}` is the local path to the YAML configuration,
`gs://${TRAIN_DIR}` specifies the directory on Google Cloud Storage where the
`gs://${MODEL_DIR}` specifies the directory on Google Cloud Storage where the
training checkpoints and events will be written, and
`gs://${PIPELINE_CONFIG_PATH}` points to the pipeline configuration stored on
Google Cloud Storage.
......@@ -91,34 +93,69 @@ Google Cloud Storage.
Users can monitor the progress of their training job on the [ML Engine
Dashboard](https://console.cloud.google.com/mlengine/jobs).
Note: This sample is supported for use with 1.2 runtime version.
Note: This sample is supported for use with 1.8 runtime version.
## Running a TPU Training Job on CMLE
Launching a training job with a TPU compatible pipeline config requires using a
similar command:
```bash
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%m_%d_%Y_%H_%M_%S` \
--job-dir=gs://${MODEL_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_tpu_main \
--runtime-version 1.8 \
--scale-tier BASIC_TPU \
--region us-central1 \
-- \
--tpu_zone us-central1 \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
```
In contrast with the GPU training command, there is no need to specify a YAML
file and we point to the *object_detection.model_tpu_main* binary instead of
*object_detection.model_main*. We must also now set `scale-tier` to be
`BASIC_TPU` and provide a `tpu_zone`. Finally, as before, `pipeline_config_path`
points to the pipeline configuration stored on Google Cloud Storage
(but it must now describe a TPU compatible model).
## Running an Evaluation Job on CMLE
## Running an Evaluation Job on Cloud
Note: You only need to run a separate evaluation job when training on TPU,
since TPU training does not interleave evaluation the way multiworker GPU
training does.
Evaluation jobs run on a single machine, so it is not necessary to write a YAML
configuration for evaluation. Run the following command to start the evaluation
job:
``` bash
gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
--runtime-version 1.2 \
--job-dir=gs://${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
```bash
gcloud ml-engine jobs submit training object_detection_eval_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 1.8 \
--job-dir=gs://${MODEL_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=gs://${TRAIN_DIR} \
--eval_dir=gs://${EVAL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} \
--checkpoint_dir=gs://${MODEL_DIR}
```
Where `gs://${TRAIN_DIR}` points to the directory on Google Cloud Storage where
training checkpoints are saved (same as the training job), `gs://${EVAL_DIR}`
points to where evaluation events will be saved on Google Cloud Storage and
Where `gs://${MODEL_DIR}` points to the directory on Google Cloud Storage where
training checkpoints are saved (same as the training job), where evaluation
events will also be saved, and
`gs://${PIPELINE_CONFIG_PATH}` points to where the pipeline configuration is
stored on Google Cloud Storage.
Typically one starts an evaluation job concurrently with the training job.
Note that we do not support running evaluation on TPU, so the above command
line for launching evaluation jobs is the same whether you are training
on GPU or TPU.
## Running Tensorboard
You can run Tensorboard locally on your own machine to view progress of your
......@@ -130,3 +167,4 @@ tensorboard --logdir=gs://${YOUR_CLOUD_BUCKET}
```
Note it may take Tensorboard a few minutes to populate with results.
# Running on mobile with TensorFlow Lite
In this section, we will show you how to use [TensorFlow
Lite](https://www.tensorflow.org/mobile/tflite/) to get a smaller model and
take advantage of ops that have been optimized for mobile devices.
TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded
devices. It enables on-device machine learning inference with low latency and a
small binary size. TensorFlow Lite uses many techniques for this such as
quantized kernels that allow smaller and faster (fixed-point math) models.
For this section, you will need to build [TensorFlow from
source](https://www.tensorflow.org/install/install_sources) to get the
TensorFlow Lite support for the SSD model. You will also need to install the
[bazel build
tool](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android#bazel).
To make these commands easier to run, let’s set up some environment variables:
```shell
export CONFIG_FILE=PATH_TO_BE_CONFIGURED/pipeline.config
export CHECKPOINT_PATH=PATH_TO_BE_CONFIGURED/model.ckpt
export OUTPUT_DIR=/tmp/tflite
```
We start with a checkpoint and get a TensorFlow frozen graph with compatible ops
that we can use with TensorFlow Lite. First, you’ll need to install these
[python
libraries](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md).
Then to get the frozen graph, run the export_tflite_ssd_graph.py script from the
`models/research` directory with this command:
```shell
python object_detection/export_tflite_ssd_graph.py \
--pipeline_config_path=$CONFIG_FILE \
--trained_checkpoint_prefix=$CHECKPOINT_PATH \
--output_directory=$OUTPUT_DIR \
--add_postprocessing_op=true
```
In the /tmp/tflite directory, you should now see two files: tflite_graph.pb and
tflite_graph.pbtxt. Note that the add_postprocessing flag enables the model to
take advantage of a custom optimized detection post-processing operation which
can be thought of as a replacement for
[tf.image.non_max_suppression](https://www.tensorflow.org/api_docs/python/tf/image/non_max_suppression).
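As an optional sanity check, you can list the op types present in the exported
tflite_graph.pb with a few lines of TF 1.x Python; when the graph was exported
with `--add_postprocessing_op=true`, the custom `TFLite_Detection_PostProcess`
op should show up in the list (a minimal sketch, assuming the default output
path from above):

```python
import tensorflow as tf  # TF 1.x API

graph_def = tf.GraphDef()
with tf.gfile.GFile('/tmp/tflite/tflite_graph.pb', 'rb') as f:
  graph_def.ParseFromString(f.read())

# Distinct op types in the exported graph; look for the custom
# detection post-processing op among them.
print(sorted({node.op for node in graph_def.node}))
```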
Make sure not to confuse export_tflite_ssd_graph with export_inference_graph in
the same directory. Both scripts output frozen graphs: export_tflite_ssd_graph
will output the frozen graph that we can input to TensorFlow Lite directly and
is the one we’ll be using.
Next we’ll use TensorFlow Lite to get the optimized model by using
[TOCO](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/toco),
the TensorFlow Lite Optimizing Converter. This will convert the resulting frozen
graph (tflite_graph.pb) to the TensorFlow Lite flatbuffer format (detect.tflite)
via the following command. For a quantized model, run this from the tensorflow/
directory:
```shell
bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
--input_file=$OUTPUT_DIR/tflite_graph.pb \
--output_file=$OUTPUT_DIR/detect.tflite \
--input_shapes=1,300,300,3 \
--input_arrays=normalized_input_image_tensor \
--output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
--inference_type=QUANTIZED_UINT8 \
--mean_values=128 \
--std_values=128 \
--change_concat_input_ranges=false \
--allow_custom_ops
```
This command takes the input tensor normalized_input_image_tensor after resizing
each camera image frame to 300x300 pixels. The outputs of the quantized model
are named 'TFLite_Detection_PostProcess', 'TFLite_Detection_PostProcess:1',
'TFLite_Detection_PostProcess:2', and 'TFLite_Detection_PostProcess:3' and
represent four arrays: detection_boxes, detection_classes, detection_scores, and
num_detections. The documentation for other flags used in this command is
[here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/toco/g3doc/cmdline_reference.md).
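To make the meaning of these four arrays concrete, the following small sketch
(with made-up values rather than real model output) shows how they are
typically consumed together once you read them back from the interpreter:

```python
import numpy as np

# Hypothetical outputs for a single image (batch size 1) with two candidates.
detection_boxes = np.array([[[0.10, 0.20, 0.50, 0.60],
                             [0.00, 0.00, 1.00, 1.00]]])  # [1, N, 4], normalized [ymin, xmin, ymax, xmax]
detection_classes = np.array([[0., 16.]])                 # [1, N], indices into the label map
detection_scores = np.array([[0.83, 0.21]])               # [1, N]
num_detections = np.array([2.])                           # [1]

score_threshold = 0.5
for i in range(int(num_detections[0])):
  if detection_scores[0, i] >= score_threshold:
    print('class %d scored %.2f at box %s' %
          (int(detection_classes[0, i]), detection_scores[0, i], detection_boxes[0, i]))
```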
If things ran successfully, you should now see a third file in the /tmp/tflite
directory called detect.tflite. This file contains the graph and all model
parameters and can be run via the TensorFlow Lite interpreter on the Android
device. For a floating point model, run this from the tensorflow/ directory:
```shell
bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
--input_file=$OUTPUT_DIR/tflite_graph.pb \
--output_file=$OUTPUT_DIR/detect.tflite \
--input_shapes=1,300,300,3 \
--input_arrays=normalized_input_image_tensor \
--output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
--inference_type=FLOAT \
--allow_custom_ops
```
# Running our model on Android
To run our TensorFlow Lite model on device, we will need to install the Android
NDK and SDK. The current recommended Android NDK version is 14b and can be found
on the [NDK
Archives](https://developer.android.com/ndk/downloads/older_releases.html#ndk-14b-downloads)
page. Android SDK and build tools can be [downloaded
separately](https://developer.android.com/tools/revisions/build-tools.html) or
used as part of [Android
Studio](https://developer.android.com/studio/index.html). To build the
TensorFlow Lite Android demo, build tools require API >= 23 (but it will run on
devices with API >= 21). Additional details are available on the [TensorFlow
Lite Android App
page](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/java/demo/README.md).
Next we need to point the app to our new detect.tflite file and give it the
names of our new labels. Specifically, we will copy our TensorFlow Lite
flatbuffer to the app assets directory with the following command:
```shell
cp /tmp/tflite/detect.tflite \
//tensorflow/contrib/lite/examples/android/app/src/main/assets
```
You will also need to copy your new labelmap labels_list.txt to the assets
directory.
We will now edit the BUILD file to point to this new model. First, open the
BUILD file tensorflow/contrib/lite/examples/android/BUILD. Then find the assets
section, and replace the line “@tflite_mobilenet_ssd_quant//:detect.tflite”
(which by default points to a COCO pretrained model) with the path to your new
TFLite model
“//tensorflow/contrib/lite/examples/android/app/src/main/assets:detect.tflite”.
Finally, change the last line in the assets section to use the new label map as
well.
We will also need to tell our app to use the new label map. In order to do this,
open up the
tensorflow/contrib/lite/examples/android/app/src/main/java/org/tensorflow/demo/DetectorActivity.java
file in a text editor and find the definition of TF_OD_API_LABELS_FILE. Update
this path to point to your new label map file:
"file:///android_asset/labels_list.txt". Note that if your model is quantized,
the flag TF_OD_API_IS_QUANTIZED is set to true, and if your model is floating
point, the flag TF_OD_API_IS_QUANTIZED is set to false. This new section of
DetectorActivity.java should now look as follows for a quantized model:
```java
private static final boolean TF_OD_API_IS_QUANTIZED = true;
private static final String TF_OD_API_MODEL_FILE = "detect.tflite";
private static final String TF_OD_API_LABELS_FILE = "file:///android_asset/labels_list.txt";
```
Once you’ve copied the TensorFlow Lite file and edited your BUILD and
DetectorActivity.java files, you can build the demo app by running this bazel
command from the tensorflow directory:
```shell
bazel build -c opt --config=android_arm{,64} --cxxopt='--std=c++11' \
  "//tensorflow/contrib/lite/examples/android:tflite_demo"
```
Now install the demo on a
[debug-enabled](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android#install)
Android phone via [Android Debug
Bridge](https://developer.android.com/studio/command-line/adb) (adb):
```shell
adb install bazel-bin/tensorflow/contrib/lite/examples/android/tflite_demo.apk
```
......@@ -93,17 +93,18 @@ python object_detection/dataset_tools/create_pet_tf_record.py \
Note: It is normal to see some warnings when running this script. You may ignore
them.
Two TFRecord files named `pet_train.record` and `pet_val.record` should be
generated in the `tensorflow/models/research/` directory.
Two 10-sharded TFRecord files named `pet_faces_train.record-*` and
`pet_faces_val.record-*` should be generated in the
`tensorflow/models/research/` directory.
Now that the data has been generated, we'll need to upload it to Google Cloud
Storage so the data can be accessed by ML Engine. Run the following command to
copy the files into your GCS bucket (substituting `${YOUR_GCS_BUCKET}`):
``` bash
```bash
# From tensorflow/models/research/
gsutil cp pet_train.record gs://${YOUR_GCS_BUCKET}/data/pet_train.record
gsutil cp pet_val.record gs://${YOUR_GCS_BUCKET}/data/pet_val.record
gsutil cp pet_faces_train.record-* gs://${YOUR_GCS_BUCKET}/data/
gsutil cp pet_faces_val.record-* gs://${YOUR_GCS_BUCKET}/data/
gsutil cp object_detection/data/pet_label_map.pbtxt gs://${YOUR_GCS_BUCKET}/data/pet_label_map.pbtxt
```
......@@ -176,8 +177,8 @@ the following:
- model.ckpt.meta
- model.ckpt.data-00000-of-00001
- pet_label_map.pbtxt
- pet_train.record
- pet_val.record
- pet_faces_train.record-*
- pet_faces_val.record-*
```
You can inspect your bucket using the [Google Cloud Storage
......@@ -193,59 +194,39 @@ Before we can start a job on Google Cloud ML Engine, we must:
To package the Tensorflow Object Detection code, run the following commands from
the `tensorflow/models/research/` directory:
``` bash
```bash
# From tensorflow/models/research/
bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
python setup.py sdist
(cd slim && python setup.py sdist)
```
You should see two tar.gz files created at `dist/object_detection-0.1.tar.gz`
and `slim/dist/slim-0.1.tar.gz`.
This will create python packages dist/object_detection-0.1.tar.gz,
slim/dist/slim-0.1.tar.gz, and /tmp/pycocotools/pycocotools-2.0.tar.gz.
For running the training Cloud ML job, we'll configure the cluster to use 10
training jobs (1 master + 9 workers) and three parameter servers. The
configuration file can be found at `object_detection/samples/cloud/cloud.yml`.
Note: This sample is supported for use with 1.2 runtime version.
Note: This sample is supported for use with 1.8 runtime version.
To start training, execute the following command from the
To start training and evaluation, execute the following command from the
`tensorflow/models/research/` directory:
``` bash
```bash
# From tensorflow/models/research/
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.2 \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
gcloud ml-engine jobs submit training `whoami`_object_detection_pets_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 1.8 \
--job-dir=gs://${YOUR_GCS_BUCKET}/model_dir \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
-- \
--train_dir=gs://${YOUR_GCS_BUCKET}/train \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/faster_rcnn_resnet101_pets.config
```
Once training has started, we can run an evaluation concurrently:
``` bash
# From tensorflow/models/research/
gcloud ml-engine jobs submit training `whoami`_object_detection_eval_`date +%s` \
--runtime-version 1.2 \
--job-dir=gs://${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=gs://${YOUR_GCS_BUCKET}/train \
--eval_dir=gs://${YOUR_GCS_BUCKET}/eval \
--model_dir=gs://${YOUR_GCS_BUCKET}/model_dir \
--pipeline_config_path=gs://${YOUR_GCS_BUCKET}/data/faster_rcnn_resnet101_pets.config
```
Note: Even though we're running an evaluation job, the `gcloud ml-engine jobs
submit training` command is correct. ML Engine does not distinguish between
training and evaluation jobs.
Users can monitor and stop training and evaluation jobs on the [ML Engine
Dashboard](https://console.cloud.google.com/mlengine/jobs).
......@@ -254,12 +235,12 @@ Dashboard](https://console.cloud.google.com/mlengine/jobs).
You can monitor progress of the training and eval jobs by running Tensorboard on
your local machine:
``` bash
```bash
# This command needs to be run once to allow your local machine to access your
# GCS bucket.
gcloud auth application-default login
tensorboard --logdir=gs://${YOUR_GCS_BUCKET}
tensorboard --logdir=gs://${YOUR_GCS_BUCKET}/model_dir
```
Once Tensorboard is running, navigate to `localhost:6006` from your favourite
......@@ -284,12 +265,12 @@ that they've converged.
## Exporting the Tensorflow Graph
After your model has been trained, you should export it to a Tensorflow
graph proto. First, you need to identify a candidate checkpoint to export. You
can search your bucket using the [Google Cloud Storage
After your model has been trained, you should export it to a Tensorflow graph
proto. First, you need to identify a candidate checkpoint to export. You can
search your bucket using the [Google Cloud Storage
Browser](https://console.cloud.google.com/storage/browser). The file should be
stored under `${YOUR_GCS_BUCKET}/train`. The checkpoint will typically consist of
three files:
stored under `${YOUR_GCS_BUCKET}/model_dir`. The checkpoint will typically
consist of three files:
* `model.ckpt-${CHECKPOINT_NUMBER}.data-00000-of-00001`
* `model.ckpt-${CHECKPOINT_NUMBER}.index`
......@@ -298,9 +279,9 @@ three files:
After you've identified a candidate checkpoint to export, run the following
command from `tensorflow/models/research/`:
``` bash
```bash
# From tensorflow/models/research/
gsutil cp gs://${YOUR_GCS_BUCKET}/train/model.ckpt-${CHECKPOINT_NUMBER}.* .
gsutil cp gs://${YOUR_GCS_BUCKET}/model_dir/model.ckpt-${CHECKPOINT_NUMBER}.* .
python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path object_detection/samples/configs/faster_rcnn_resnet101_pets.config \
......
# TPU compatible detection pipelines
[TOC]
The Tensorflow Object Detection API supports TPU training for some models. To
make models TPU compatible you need to make a few tweaks to the model config as
mentioned below. We also provide several sample configs that you can use as a
template.
## TPU compatibility
### Static shaped tensors
TPU training currently requires all tensors in the Tensorflow Graph to have
static shapes. However, most of the sample configs in the Object Detection API
have a few dynamically shaped tensors. Fortunately, we provide simple
alternatives in the model configuration that modify these tensors to have
static shapes:
* **Image tensors with static shape** - This can be achieved either by using a
`fixed_shape_resizer` that resizes images to a fixed spatial shape or by
setting `pad_to_max_dimension: true` in `keep_aspect_ratio_resizer` which
pads the resized images with zeros to the bottom and right. Padded image
tensors are correctly handled internally within the model.
```
image_resizer {
fixed_shape_resizer {
height: 640
width: 640
}
}
```
or
```
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 640
max_dimension: 640
pad_to_max_dimension: true
}
}
```
* **Groundtruth tensors with static shape** - Images in a typical detection
dataset have variable number of groundtruth boxes and associated classes.
Setting `max_number_of_boxes` to a large enough number in the
`train_input_reader` and `eval_input_reader` pads the groundtruth tensors
with zeros to a static shape. Padded groundtruth tensors are correctly
handled internally within the model.
```
train_input_reader: {
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record-?????-of-00100"
}
label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
max_number_of_boxes: 200
}
eval_input_reader: {
tf_record_input_reader {
input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record-?????-of-0010"
}
label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
max_number_of_boxes: 200
}
```
### TPU friendly ops
Although TPU supports a vast number of tensorflow ops, a few used in the
Tensorflow Object Detection API are unsupported. We list such ops below and
recommend compatible substitutes.
* **Anchor sampling** - Typically we use hard example mining in standard SSD
pipelines to balance positive and negative anchors that contribute to the
loss. Hard example mining uses non max suppression as a subroutine, and since
non max suppression is not currently supported on TPUs we cannot use hard
example mining. Fortunately, we provide an implementation of focal loss that
can be used instead of hard example mining. Remove `hard_example_miner` from
the config and substitute `weighted_sigmoid` classification loss with
`weighted_sigmoid_focal` loss.
```
loss {
classification_loss {
weighted_sigmoid_focal {
alpha: 0.25
gamma: 2.0
}
}
localization_loss {
weighted_smooth_l1 {
}
}
classification_weight: 1.0
localization_weight: 1.0
}
```
* **Target Matching** - Object detection API provides two choices for matcher
used in target assignment: `argmax_matcher` and `bipartite_matcher`.
Bipartite matcher is not currently supported on TPU; therefore, we must
modify the configs to use `argmax_matcher`. Additionally, set
`use_matmul_gather: true` for efficiency on TPU.
```
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
use_matmul_gather: true
}
}
```
### TPU training hyperparameters
Object Detection training on TPU uses synchronous SGD. On a typical cloud TPU
with 8 cores we recommend batch sizes that are 8x larger when compared to a GPU
config that uses asynchronous SGD. We also use fewer training steps (~1/100x)
due to the large batch size. This necessitates careful tuning of some other
training parameters, as listed below.
* **Batch size** - Use the largest batch size that can fit on cloud TPU.
```
train_config {
batch_size: 1024
}
```
* **Training steps** - Typically only tens of thousands.
```
train_config {
num_steps: 25000
}
```
* **Batch norm decay** - Use smaller decay constants (0.97 or 0.997) since we
take fewer training steps.
```
batch_norm {
scale: true,
decay: 0.97,
epsilon: 0.001,
}
```
* **Learning rate** - Use large learning rate with warmup. Scale learning rate
linearly with batch size. See `cosine_decay_learning_rate` or
`manual_step_learning_rate` for examples.
```
learning_rate: {
cosine_decay_learning_rate {
learning_rate_base: .04
total_steps: 25000
warmup_learning_rate: .013333
warmup_steps: 2000
}
}
```
or
```
learning_rate: {
manual_step_learning_rate {
warmup: true
initial_learning_rate: .01333
schedule {
step: 2000
learning_rate: 0.04
}
schedule {
step: 15000
learning_rate: 0.004
}
}
}
```
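Putting the batch size, step count and learning rate heuristics above
together, the arithmetic looks roughly like the following sketch (a
hypothetical helper, not part of the API); the numbers in the comment match
the ssd_mobilenet_v1_fpn config change included later in this commit:

```python
def rescale_for_batch_size(base_lr, base_steps, base_batch_size, new_batch_size):
  """Rough heuristic: scale LR linearly and step count inversely with batch size."""
  scale = float(new_batch_size) / base_batch_size
  return base_lr * scale, int(base_steps / scale)

# e.g. halving the batch size from 128 to 64:
# rescale_for_batch_size(0.08, 12500, 128, 64) -> (0.04, 25000)
```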
## Example TPU compatible configs
We provide example config files that you can use to train your own models on TPU:
* <a href='https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v1_300x300_coco14_sync.config'>ssd_mobilenet_v1_300x300</a> <br>
* <a href='https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v1_ppn_shared_box_predictor_300x300_coco14_sync.config'>ssd_mobilenet_v1_ppn_300x300</a> <br>
* <a href='https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync.config'>ssd_mobilenet_v1_fpn_640x640
(mobilenet based retinanet)</a> <br>
* <a href='https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config'>ssd_resnet50_v1_fpn_640x640
(retinanet)</a> <br>
## Supported Meta architectures
Currently, `SSDMetaArch` models are supported on TPUs. `FasterRCNNMetaArch` is
going to be supported soon.
......@@ -500,11 +500,13 @@ def create_estimator_and_inputs(run_config,
  eval_config = configs['eval_config']
  eval_input_config = configs['eval_input_config']
  if train_steps is None:
    train_steps = configs['train_config'].num_steps
  # update train_steps from config but only when non-zero value is provided
  if train_steps is None and train_config.num_steps != 0:
    train_steps = train_config.num_steps
  if eval_steps is None:
    eval_steps = configs['eval_config'].num_examples
  # update eval_steps from config but only when non-zero value is provided
  if eval_steps is None and eval_config.num_examples != 0:
    eval_steps = eval_config.num_examples
  detection_model_fn = functools.partial(
      model_builder.build, model_config=model_config)
......
......@@ -3,8 +3,7 @@
# See Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint
# Achieves 29.6 mAP on COCO14 minival dataset. Doubling the number of training
# steps to 25k gets 31.5 mAP
# Achieves 29.7 mAP on COCO14 minival dataset.
# This config is TPU compatible
......@@ -133,11 +132,11 @@ model {
train_config: {
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
batch_size: 128
batch_size: 64
sync_replicas: true
startup_delay_steps: 0
replicas_to_aggregate: 8
num_steps: 12500
num_steps: 25000
data_augmentation_options {
random_horizontal_flip {
}
......@@ -156,10 +155,10 @@ train_config: {
momentum_optimizer: {
learning_rate: {
cosine_decay_learning_rate {
learning_rate_base: .08
total_steps: 12500
warmup_learning_rate: .026666
warmup_steps: 1000
learning_rate_base: .04
total_steps: 25000
warmup_learning_rate: .013333
warmup_steps: 2000
}
}
momentum_optimizer_value: 0.9
......