Unverified Commit 41613994 authored by Dan Anghel, committed by GitHub

DELF Training Documentation Update (#8663)



* First version of working script to download the GLDv2 dataset

* First version of the DELF package installation script

* First working version of the DELF package installation script

* Fixed feedback from PR review

* Pushed to GitHub the changes to the TFRecord data generation script for DELF.

* Merged commit includes the following changes:
315363544  by Andre Araujo:

    Added the generation of TRAIN and VALIDATE splits from the train dataset.

--
314676530  by Andre Araujo:

    Updated script to download GLDv2 images for DELF training.

--
314101235  by Andre Araujo:

    Added newly created module 'utils' to the copybara script.

--
313677085  by Andre Araujo:

    Code migration from TF1 to TF2 for:
    - logging (replaced usage of tf.compat.v1.logging.info)
    - testing directories (replaced usage of tf.compat.v1.test.get_temp_dir())
    - feature/object extraction scripts (replaced usage of tf.compat.v1.train.string_input_producer and tf.compat.v1.train.start_queue_runners with PIL)

--
312770828  by Andre Araujo:

    Internal change.

--

PiperOrigin-RevId: 315363544

* First version of the updated README of the DELF training instructions

* Added to the README the section describing the generation of the training data

* Added warning about the TFRecord generation time

* Updated the launch of the training

* Minor README update

* Integrated review feedback
Co-authored-by: Andre Araujo <andrearaujo@google.com>
parent 8722f59f
# DELF Training Instructions

This README documents the end-to-end process for training a landmark detection and retrieval
model using the DELF library on the [Google Landmarks Dataset v2](https://github.com/cvdfoundation/google-landmark) (GLDv2). This can be achieved by following these steps:

1. Install the DELF Python library.
2. Download the raw images of the GLDv2 dataset.
3. Prepare the training data.
4. Run the training.
The next sections will cover each of these steps in greater detail.
## Prerequisites

Clone the [TensorFlow Model Garden](https://github.com/tensorflow/models) repository and move
into the `models/research/delf/delf/python/training` folder.

```
git clone https://github.com/tensorflow/models.git
cd models/research/delf/delf/python/training
```
## Install the DELF Library
The DELF Python library can be installed by running the [`install_delf.sh`](./install_delf.sh)
script using the command:
```
bash install_delf.sh
```
The script installs both the DELF library and its dependencies in the following sequence:
* Install TensorFlow 2.2, including its GPU variant.
* Install the [TF-Slim](https://github.com/google-research/tf-slim) library from source.
* Download [protoc](https://github.com/protocolbuffers/protobuf) and compile the DELF Protocol
Buffers.
* Install the matplotlib, numpy, scikit-image, scipy and python3-tk Python libraries.
* Install the [TensorFlow Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection) from the cloned TensorFlow Model Garden repository.
* Install the DELF package.
*Please note that the current installation only works on 64-bit Linux architectures due to the
`protoc` binary downloaded by the installation script. If you wish to install the DELF library on
other architectures, please update the [`install_delf.sh`](./install_delf.sh) script by referencing
the desired `protoc` [binary release](https://github.com/protocolbuffers/protobuf/releases).*
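After the script finishes, a quick sanity check can confirm that the library was installed
correctly. The snippet below is a minimal sketch, assuming the installation put TensorFlow and
the `delf` package on your Python path:

```
# Minimal installation sanity check (a sketch; adjust to your environment).
import tensorflow as tf
import delf  # raises ImportError if the DELF package or its protos are missing

print("TensorFlow:", tf.__version__)
```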
## Download the GLDv2 Training Data
The [GLDv2](https://github.com/cvdfoundation/google-landmark) images are grouped in 3 datasets: TRAIN, INDEX and TEST. Images in each dataset are bundled into `*.tar` files and individually
referenced in `*.csv` files containing training metadata and licensing information. The number of
`*.tar` files per dataset is as follows:
* TRAIN: 500 files.
* INDEX: 100 files.
* TEST: 20 files.
To download the GLDv2 images, run the [`download_dataset.sh`](./download_dataset.sh) script, as in
the following example:

```
bash download_dataset.sh 500 100 20
```
The script takes the following parameters, in order:
* The number of image files from the TRAIN dataset to download (maximum 500).
* The number of image files from the INDEX dataset to download (maximum 100).
* The number of image files from the TEST dataset to download (maximum 20).
The script downloads the GLDv2 images under the following directory structure:

* gldv2_dataset/
  * train/ - Contains raw images from the TRAIN dataset.
  * index/ - Contains raw images from the INDEX dataset.
  * test/ - Contains raw images from the TEST dataset.
Each of the three folders `gldv2_dataset/train/`, `gldv2_dataset/index/` and `gldv2_dataset/test/`
contains the following:
* The downloaded `*.tar` files.
* The corresponding MD5 checksum files, `*.txt`.
* The unpacked content of the downloaded files. (*Images are organized in folders and subfolders
based on the first, second and third character in their file name; see the sketch after this
list.*)
* The CSV files containing training and licensing metadata of the downloaded images.
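To make the layout concrete, the sketch below maps an image id to its on-disk path following the
three-character scheme above. The `image_path` helper is hypothetical, not part of the DELF
codebase:

```
import os

def image_path(root, image_id):
    """Return the path of an image under the hashed GLDv2 layout."""
    # Images are nested by the first, second and third character of their
    # file name, e.g. id '0123456789abcdef' is stored at
    # root/0/1/2/0123456789abcdef.jpg.
    return os.path.join(root, image_id[0], image_id[1], image_id[2],
                        image_id + ".jpg")

print(image_path("gldv2_dataset/train", "0123456789abcdef"))
```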
*Please note that due to the large size of the GLDv2 dataset, the download can take up to 12
hours and use up to 1 TB of disk space. To save bandwidth and disk space, you may want to start
by downloading only the TRAIN dataset, the only one required for training, thus saving
approximately 95 GB, the equivalent of the INDEX and TEST datasets. To further save disk space,
the `*.tar` files can be deleted after downloading and unpacking them.*
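Before deleting the `*.tar` files, you may want to verify them against their MD5 checksum files.
The following is a minimal sketch assuming the GLDv2 naming convention (`images_000.tar` paired
with `md5.images_000.txt`, whose first whitespace-separated token is the hex digest); adjust the
paths to whatever the download script actually produced:

```
import hashlib

tar_path = "gldv2_dataset/train/images_000.tar"
md5_path = "gldv2_dataset/train/md5.images_000.txt"

md5 = hashlib.md5()
with open(tar_path, "rb") as f:
    # Hash in 1 MB chunks so large archives do not exhaust memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

with open(md5_path) as f:
    expected = f.read().split()[0]

print("OK" if md5.hexdigest() == expected else "MISMATCH", tar_path)
```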
## Prepare the Data for Training
Preparing the data for training consists of creating [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)
files from the raw GLDv2 images grouped into TRAIN and VALIDATION splits. The training set
produced contains only the *clean* subset of the GLDv2 dataset. The [CVPR'20 paper](https://arxiv.org/abs/2004.01804)
introducing the GLDv2 dataset contains a detailed description of the *clean* subset.
Generating the TFRecord files containing the TRAIN and VALIDATION splits of the *clean* GLDv2
subset can be achieved by running the [`build_image_dataset.py`](./build_image_dataset.py)
script. Assuming that the GLDv2 images have been downloaded to the `gldv2_dataset` folder, the
script can be run as follows:
```
python3 build_image_dataset.py \
--train_csv_path=gldv2_dataset/train/train.csv \
--train_clean_csv_path=gldv2_dataset/train/train_clean.csv \
--train_directory=gldv2_dataset/train/*/*/*/ \
--output_directory=gldv2_dataset/tfrecord/ \
--num_shards=128 \
--generate_train_validation_splits \
--validation_split_size=0.2
```
*Please refer to the source code of the [`build_image_dataset.py`](./build_image_dataset.py) script for a detailed description of its parameters.*
The TFRecord files written in the `OUTPUT_DIRECTORY` will be prefixed as follows:
* TRAIN split: `train-*`
* VALIDATION split: `validation-*`
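To spot-check the generated shards, a record can be parsed directly with TensorFlow. The sketch
below assumes the examples store the encoded JPEG bytes under the `image/encoded` key and the
landmark label under `image/class/label`; consult the source of
[`build_image_dataset.py`](./build_image_dataset.py) for the authoritative feature keys:

```
import tensorflow as tf

# Read a single example from the TRAIN split and decode its image.
files = tf.io.gfile.glob("gldv2_dataset/tfrecord/train-*")
features = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/class/label": tf.io.FixedLenFeature([], tf.int64),
}
for record in tf.data.TFRecordDataset(files).take(1):
    example = tf.io.parse_single_example(record, features)
    image = tf.io.decode_jpeg(example["image/encoded"])
    print("label:", int(example["image/class/label"]), "shape:", image.shape)
```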
The same script can be used to generate TFRecord files for the TEST split for post-training
evaluation purposes. This can be achieved by adding the parameters:
```
--test_csv_path=gldv2_dataset/test/test.csv \
--test_directory=gldv2_dataset/test/*/*/*/ \
```
In this scenario, the TFRecord files of the TEST split written in the `OUTPUT_DIRECTORY` will be
named according to the pattern `test-*`.
*Please note that due to the large size of the GLDv2 dataset, the generation of the TFRecord
files can take up to 12 hours and up to 500 GB of disk space.*
## Running the Training
Assuming the TFRecord files were generated in the `gldv2_dataset/tfrecord/` directory, running
the following command should start training a model:
```
python3 train.py \
--train_file_pattern=gldv2_dataset/tfrecord/train* \
--validation_file_pattern=gldv2_dataset/tfrecord/validation*
```
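Because a full training run takes a long time, it can be worth confirming beforehand that both
file patterns resolve to non-empty splits. A minimal sketch using the same patterns as the
command above:

```
import tensorflow as tf

# Count shards and records per split before committing to a long run.
for name, pattern in [("train", "gldv2_dataset/tfrecord/train*"),
                      ("validation", "gldv2_dataset/tfrecord/validation*")]:
    files = tf.io.gfile.glob(pattern)
    count = sum(1 for _ in tf.data.TFRecordDataset(files))
    print(f"{name}: {len(files)} shards, {count} records")
```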