Unverified Commit 41613994 authored by Dan Anghel, committed by GitHub

DELF Training Documentation Update (#8663)



* First version of working script to download the GLDv2 dataset

* First version of the DELF package installation script

* First working version of the DELF package installation script

* Fixed feedback from PR review

* Pushed to GitHub the changes to the TFRecord data generation script for DELF.

* Merged commit includes the following changes:
315363544  by Andre Araujo:

    Added the generation of TRAIN and VALIDATE splits from the train dataset.

--
314676530  by Andre Araujo:

    Updated script to download GLDv2 images for DELF training.

--
314101235  by Andre Araujo:

    Added newly created module 'utils' to the copybara script.

--
313677085  by Andre Araujo:

    Code migration from TF1 to TF2 for:
    - logging (replaced usage of tf.compat.v1.logging.info)
    - testing directories (replaced usage of tf.compat.v1.test.get_temp_dir())
    - feature/object extraction scripts (replaced usage of tf.compat.v1.train.string_input_producer and tf.compat.v1.train.start_queue_runners with PIL)

--
312770828  by Andre Araujo:

    Internal change.

--

PiperOrigin-RevId: 315363544

* First version of the updated README of the DELF training instructions

* Added to the README the section describing the generation of the training data

* Added warning about the TFRecord generation time

* Updated the launch of the training

* Minor README update

* Integrated review feedback
Co-authored-by: Andre Araujo <andrearaujo@google.com>
parent 8722f59f
# DELF Training Instructions

This README documents the end-to-end process for training a landmark detection and retrieval
model using the DELF library on the [Google Landmarks Dataset v2](https://github.com/cvdfoundation/google-landmark) (GLDv2). This can be achieved by following these steps:

1. Install the DELF Python library.
2. Download the raw images of the GLDv2 dataset.
3. Prepare the training data.
4. Run the training.
The next sections will cover each of these steps in greater detail.
## Prerequisites

Clone the [TensorFlow Model Garden](https://github.com/tensorflow/models) repository and move
into the `models/research/delf/delf/python/training` folder.

```
git clone https://github.com/tensorflow/models.git
cd models/research/delf/delf/python/training
```
## Install the DELF Library
The DELF Python library can be installed by running the [`install_delf.sh`](./install_delf.sh)
script using the command:
```
bash install_delf.sh
```
The script installs both the DELF library and its dependencies in the following sequence:
* Install TensorFlow 2.2, including its GPU variant.
* Install the [TF-Slim](https://github.com/google-research/tf-slim) library from source.
* Download [protoc](https://github.com/protocolbuffers/protobuf) and compile the DELF Protocol
Buffers.
* Install the matplotlib, numpy, scikit-image, scipy and python3-tk Python libraries.
* Install the [TensorFlow Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection) from the cloned TensorFlow Model Garden repository.
* Install the DELF package.
*Please note that the current installation only works on 64-bit Linux architectures due to the
`protoc` binary downloaded by the installation script. If you wish to install the DELF library on
other architectures, please update the [`install_delf.sh`](./install_delf.sh) script by referencing
the desired `protoc` [binary release](https://github.com/protocolbuffers/protobuf/releases).*
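After the script finishes, a quick sanity check can confirm that the library was installed
correctly. The snippet below is a minimal sketch, assuming the installation put TensorFlow and
the `delf` package on your Python path:

```
# Minimal installation sanity check (a sketch; adjust to your environment).
import tensorflow as tf
import delf  # raises ImportError if the DELF package or its protos are missing

print("TensorFlow:", tf.__version__)
```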
## Download the GLDv2 Training Data
The [GLDv2](https://github.com/cvdfoundation/google-landmark) images are grouped in 3 datasets: TRAIN, INDEX and TEST. Images in each dataset are bundled into `*.tar` files and individually
referenced in `*.csv` files containing training metadata and licensing information. The number of
`*.tar` files per dataset is as follows:
* TRAIN: 500 files.
* INDEX: 100 files.
* TEST: 20 files.
To download the GLDv2 images, run the [`download_dataset.sh`](./download_dataset.sh) script, as in
the following example:

```
bash download_dataset.sh 500 100 20
```
The script takes the following parameters, in order:
* The number of image files from the TRAIN dataset to download (maximum 500).
* The number of image files from the INDEX dataset to download (maximum 100).
* The number of image files from the TEST dataset to download (maximum 20).
The script downloads the GLDv2 images under the following directory structure:

* gldv2_dataset/
  * train/ - Contains raw images from the TRAIN dataset.
  * index/ - Contains raw images from the INDEX dataset.
  * test/ - Contains raw images from the TEST dataset.
Each of the three folders `gldv2_dataset/train/`, `gldv2_dataset/index/` and `gldv2_dataset/test/`
contains the following:
* The downloaded `*.tar` files.
* The corresponding MD5 checksum files, `*.txt`.
* The unpacked content of the downloaded files. (*Images are organized in folders and subfolders
based on the first, second and third character in their file name; see the sketch after this
list.*)
* The CSV files containing training and licensing metadata of the downloaded images.
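To make the layout concrete, the sketch below maps an image id to its on-disk path following the
three-character scheme above. The `image_path` helper is hypothetical, not part of the DELF
codebase:

```
import os

def image_path(root, image_id):
    """Return the path of an image under the hashed GLDv2 layout."""
    # Images are nested by the first, second and third character of their
    # file name, e.g. id '0123456789abcdef' is stored at
    # root/0/1/2/0123456789abcdef.jpg.
    return os.path.join(root, image_id[0], image_id[1], image_id[2],
                        image_id + ".jpg")

print(image_path("gldv2_dataset/train", "0123456789abcdef"))
```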
*Please note that due to the large size of the GLDv2 dataset, the download can take up to 12
hours and use up to 1 TB of disk space. To save bandwidth and disk space, you may want to start
by downloading only the TRAIN dataset, the only one required for training, thus saving
approximately 95 GB, the equivalent of the INDEX and TEST datasets. To further save disk space,
the `*.tar` files can be deleted after downloading and unpacking them.*
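Before deleting the `*.tar` files, you may want to verify them against their MD5 checksum files.
The following is a minimal sketch assuming the GLDv2 naming convention (`images_000.tar` paired
with `md5.images_000.txt`, whose first whitespace-separated token is the hex digest); adjust the
paths to whatever the download script actually produced:

```
import hashlib

tar_path = "gldv2_dataset/train/images_000.tar"
md5_path = "gldv2_dataset/train/md5.images_000.txt"

md5 = hashlib.md5()
with open(tar_path, "rb") as f:
    # Hash in 1 MB chunks so large archives do not exhaust memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

with open(md5_path) as f:
    expected = f.read().split()[0]

print("OK" if md5.hexdigest() == expected else "MISMATCH", tar_path)
```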
## Prepare the Data for Training
Preparing the data for training consists of creating [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)
files from the raw GLDv2 images grouped into TRAIN and VALIDATION splits. The training set
produced contains only the *clean* subset of the GLDv2 dataset. The [CVPR'20 paper](https://arxiv.org/abs/2004.01804)
introducing the GLDv2 dataset contains a detailed description of the *clean* subset.
Generating the TFRecord files containing the TRAIN and VALIDATION splits of the *clean* GLDv2
subset can be achieved by running the [`build_image_dataset.py`](./build_image_dataset.py)
script. Assuming that the GLDv2 images have been downloaded to the `gldv2_dataset` folder, the
script can be run as follows:
```
python3 build_image_dataset.py \
--train_csv_path=gldv2_dataset/train/train.csv \
--train_clean_csv_path=gldv2_dataset/train/train_clean.csv \
--train_directory=gldv2_dataset/train/*/*/*/ \
--output_directory=gldv2_dataset/tfrecord/ \
--num_shards=128 \
--generate_train_validation_splits \
--validation_split_size=0.2
```
*Please refer to the source code of the [`build_image_dataset.py`](./build_image_dataset.py) script for a detailed description of its parameters.*
The TFRecord files written in the `OUTPUT_DIRECTORY` will be prefixed as follows:
* TRAIN split: `train-*`
* VALIDATION split: `validation-*`
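To spot-check the generated shards, a record can be parsed directly with TensorFlow. The sketch
below assumes the examples store the encoded JPEG bytes under the `image/encoded` key and the
landmark label under `image/class/label`; consult the source of
[`build_image_dataset.py`](./build_image_dataset.py) for the authoritative feature keys:

```
import tensorflow as tf

# Read a single example from the TRAIN split and decode its image.
files = tf.io.gfile.glob("gldv2_dataset/tfrecord/train-*")
features = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}
for record in tf.data.TFRecordDataset(files).take(1):
    example = tf.io.parse_single_example(record, features)
    image = tf.io.decode_jpeg(example["image/encoded"])
    print("label:", int(example["image/class/label"]), "shape:", image.shape)
```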
The same script can be used to generate TFRecord files for the TEST split for post-training
evaluation purposes. This can be achieved by adding the parameters:
```
--test_csv_path=gldv2_dataset/test/test.csv \
--test_directory=gldv2_dataset/test/*/*/*/ \
```
In this scenario, the TFRecord files of the TEST split written in the `OUTPUT_DIRECTORY` will be
named according to the pattern `test-*`.
*Please note that due to the large size of the GLDv2 dataset, the generation of the TFRecord
files can take up to 12 hours and up to 500 GB of disk space.*
## Running the Training
Assuming the TFRecord files were generated in the `gldv2_dataset/tfrecord/` directory, running
the following command should start training a model:
```
python3 train.py \
--train_file_pattern=gldv2_dataset/tfrecord/train* \
--validation_file_pattern=gldv2_dataset/tfrecord/validation*
```
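Because a full training run takes a long time, it can be worth confirming beforehand that both
file patterns resolve to non-empty splits. A minimal sketch using the same patterns as the
command above:

```
import tensorflow as tf

# Count shards and records per split before committing to a long run.
for name, pattern in [("train", "gldv2_dataset/tfrecord/train*"),
                      ("validation", "gldv2_dataset/tfrecord/validation*")]:
    files = tf.io.gfile.glob(pattern)
    count = sum(1 for _ in tf.data.TFRecordDataset(files))
    print(f"{name}: {len(files)} shards, {count} records")
```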