# YouTube-8M Tensorflow Starter Code

This repo contains starter code, written in TensorFlow 2.x, for training and
evaluating machine learning models over the [YouTube-8M][1] dataset.
It is the TensorFlow 2 port of the original
[YouTube-8M Tensorflow Starter Code][2], which was tested on TensorFlow 1.14.
The code gives an end-to-end working example for reading the dataset, training
a TensorFlow model, and evaluating the performance of the model. Functionality
is preserved; only the migrations needed to run in a TensorFlow 2 environment
were made.

### Requirements

The starter code requires TensorFlow. If you haven't installed it yet, follow
the instructions on [tensorflow.org][3].
This code has been tested with TensorFlow 2.4.0. Going forward,
we will continue to target the latest released version of TensorFlow.

Please verify that you have Python 3.6+ and TensorFlow 2.4.0 or higher
installed by running the following commands:

```sh
python --version
python -c 'import tensorflow as tf; print(tf.__version__)'
```
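
The same check can be scripted. The sketch below is a minimal example, assuming
version strings of the form `major.minor[.patch]`, possibly followed by a suffix
such as `-rc1`. The helper names `parse_version`, `meets_minimum`, and
`check_environment` are illustrative and not part of the starter code.

```python
import re
import sys


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version, minimum):
    """Return True if `version` is at least `minimum`."""
    return parse_version(version) >= parse_version(minimum)


def check_environment(min_python=(3, 6), min_tf="2.4.0"):
    """Raise RuntimeError if Python or TensorFlow is older than required."""
    if sys.version_info[:2] < min_python:
        raise RuntimeError(f"Python {min_python[0]}.{min_python[1]}+ is required")
    # Imported lazily so the helpers above can be used without TensorFlow.
    import tensorflow as tf
    if not meets_minimum(tf.__version__, min_tf):
        raise RuntimeError(
            f"TensorFlow {min_tf}+ is required, found {tf.__version__}")
    return tf.__version__
```

Calling `check_environment()` at the top of a training script makes it fail
early with a readable message instead of a later import or API error.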

Refer to the [instructions here][4]
for using the model in this repo. Make sure to add the models folder to your
Python path.
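
If you prefer to do this from inside Python, a minimal sketch looks like the
following. The location `~/models` is a hypothetical checkout of the
`tensorflow/models` repository, and `add_to_python_path` is an illustrative
helper, not part of the starter code.

```python
import os
import sys


def add_to_python_path(models_dir):
    """Prepend models_dir to sys.path unless it is already present."""
    models_dir = os.path.abspath(os.path.expanduser(models_dir))
    if models_dir not in sys.path:
        sys.path.insert(0, models_dir)
    return models_dir


# Hypothetical checkout location; replace it with your own clone.
MODELS_DIR = add_to_python_path("~/models")
```

This only affects the current Python process. Setting the `PYTHONPATH`
environment variable in your shell has the same effect for every process
started from that shell.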

[1]: https://research.google.com/youtube8m/
[2]: https://github.com/google/youtube-8m
[3]: https://www.tensorflow.org/install/
[4]:
https://github.com/tensorflow/models/tree/master/official#running-the-models

#### Using GPUs

If your TensorFlow installation has GPU support
(which `pip install tensorflow` should provide for any version
above TensorFlow 1.15), this code will make use of all of your compatible GPUs.
You can verify your installation by running

```python
import tensorflow as tf

# Lists the GPUs that TensorFlow can see; an empty list means no usable GPU.
print(tf.config.list_physical_devices('GPU'))
```

In addition to the returned device list, TensorFlow logs something like the
following for each of your compatible GPUs.

```
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720]
Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB
deviceMemoryBandwidth: 681.88GiB/s
...
```

### Training and inference

Train a video-level model on frame-level features and run inference at the
segment level.

#### Train using the config file

Create a YAML or JSON file specifying the parameters to be overridden.
Working examples can be found in the `yt8m/experiments` directory.

```yaml
task:
  model:
    cluster_size: 2048
    hidden_size: 2048
    add_batch_norm: true
    sample_random_frames: true
    is_training: true
    activation: "relu6"
    pooling_method: "average"
    yt8m_agg_classifier_model: "MoeModel"
  train_data:
    segment_labels: false
    temporal_stride: 1
    num_devices: 1
    input_path: 'gs://youtube8m-ml/2/frame/train/train*.tfrecord'
    num_examples: 3888919
 ...
```
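
Since the override file may also be JSON, it can be generated from a script.
The sketch below writes only the keys visible in the fragment above; the elided
part is left out rather than guessed. The helper `write_overrides` and the
output file name `yt8m_overrides.json` are illustrative assumptions.

```python
import json
import os
import tempfile

# Only the keys shown in the YAML fragment; add further overrides as needed.
OVERRIDES = {
    "task": {
        "model": {
            "cluster_size": 2048,
            "hidden_size": 2048,
            "add_batch_norm": True,
            "sample_random_frames": True,
            "is_training": True,
            "activation": "relu6",
            "pooling_method": "average",
            "yt8m_agg_classifier_model": "MoeModel",
        },
        "train_data": {
            "segment_labels": False,
            "temporal_stride": 1,
            "num_devices": 1,
            "input_path": "gs://youtube8m-ml/2/frame/train/train*.tfrecord",
            "num_examples": 3888919,
        },
    }
}


def write_overrides(path, overrides):
    """Serialize the override dictionary to a JSON file and return its path."""
    with open(path, "w") as handle:
        json.dump(overrides, handle, indent=2)
    return path


# Hypothetical output location; any writable path works.
CONFIG_FILE = write_overrides(
    os.path.join(tempfile.gettempdir(), "yt8m_overrides.json"), OVERRIDES)
print(CONFIG_FILE)
```

The printed path is what you would pass as `--config_file`.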

The code can be run in three modes: `train`, `train_and_eval`, and `eval`.
Run `yt8m_train.py` and specify which mode you wish to execute.
Training uses frame-level features with video-level labels,
while inference can be done at the segment level.
Setting `segment_labels=True` in your configuration forces
segment-level labels to be used in the evaluation/validation phase.
If it is set to `False`, video-level labels are used for inference.

The following command trains a model on Google Cloud over frame-level
features.

```bash
python3 yt8m_train.py --mode='train' \
    --experiment='yt8m_experiment' \
    --model_dir=$MODEL_DIR \
    --config_file=$CONFIG_FILE
```

To run evaluation after each training epoch, set the mode to `train_and_eval`.
By default, the train and validation datasets are read from these
Google Cloud paths:

- train: `input_path=gs://youtube8m-ml/2/frame/train/train*.tfrecord`
- validation: `input_path=gs://youtube8m-ml/3/frame/validate/validate*.tfrecord`

```bash
python3 yt8m_train.py --mode='train_and_eval' \
     --experiment='yt8m_experiment' \
     --model_dir=$MODEL_DIR \
     --config_file=$CONFIG_FILE
```

Running in `eval` mode loads the saved checkpoint from the specified path
and runs the inference task.

```bash
python3 yt8m_train.py --mode='eval' \
     --experiment='yt8m_experiment' \
     --model_dir=$MODEL_DIR \
     --config_file=$CONFIG_FILE
```
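
The invocations above differ only in `--mode`, so a launcher script can build
them from one helper. This is a minimal sketch: `build_command` is an
illustrative name, and the model directory and config file paths are
hypothetical placeholders. Only the flags shown in the commands above are used.

```python
import shlex

VALID_MODES = ("train", "train_and_eval", "eval")


def build_command(mode, model_dir, config_file, experiment="yt8m_experiment"):
    """Return the yt8m_train.py invocation for `mode` as an argument list."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [
        "python3",
        "yt8m_train.py",
        f"--mode={mode}",
        f"--experiment={experiment}",
        f"--model_dir={model_dir}",
        f"--config_file={config_file}",
    ]


# Hypothetical paths; substitute your own $MODEL_DIR and $CONFIG_FILE.
command = build_command(
    "train_and_eval", "/tmp/yt8m_model", "/tmp/yt8m_overrides.json")
print(" ".join(shlex.quote(arg) for arg in command))
# To launch the job: subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems when a path
contains spaces.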


Once these jobs start executing, you will see output similar to the following:
```
train | step:  15190 | training until step 22785...
train | step:  22785 | steps/sec:    0.4 | output:
    {'learning_rate': 0.0049961056,
     'model_loss': 0.0012011167,
     'total_loss': 0.0013538885,
     'training_loss': 0.0013538885}

```

and the following for evaluation:

```
eval | step:  22785 | running 2172 steps of evaluation...
eval | step:  22785 | eval time:  1663.4 | output:
    {'avg_hit_at_one': 0.5572835238737471,
     'avg_perr': 0.557277077999072,
     'gap': 0.768825760186494,
     'map': 0.19354554465020685,
     'model_loss': 0.0005052475,
     'total_loss': 0.0006564412,
     'validation_loss': 0.0006564412}
```
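
To track these metrics over time, the log text can be parsed. The sketch below
assumes the layout of the samples above (a `train | step: ... | output:` or
`eval | step: ... | output:` line followed by a flat Python dict literal); the
real log format may differ between versions, and `parse_metrics` is an
illustrative helper, not part of the starter code.

```python
import ast
import re

# Header line ending in "| output:" followed by a dict without nested braces.
_RECORD = re.compile(
    r"^(train|eval) \| step:\s*(\d+) \|[^\n]*\| output:\s*(\{[^{}]*\})",
    re.MULTILINE,
)


def parse_metrics(log_text):
    """Return a list of (phase, step, metrics) tuples found in log_text."""
    records = []
    for phase, step, payload in _RECORD.findall(log_text):
        # literal_eval safely evaluates the dict literal without running code.
        records.append((phase, int(step), ast.literal_eval(payload)))
    return records


sample = """\
train | step:  15190 | training until step 22785...
train | step:  22785 | steps/sec:    0.4 | output:
    {'learning_rate': 0.0049961056,
     'model_loss': 0.0012011167,
     'total_loss': 0.0013538885,
     'training_loss': 0.0013538885}
"""

records = parse_metrics(sample)
for phase, step, metrics in records:
    print(phase, step, metrics["total_loss"])
```

Progress lines without an `output:` payload, such as the first line of the
sample, are skipped.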