{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "3E96e1UKQ8uR" }, "source": [ "# MoViNet Tutorial\n", "\n", "This notebook provides basic example code to create, build, and run [MoViNets (Mobile Video Networks)](https://arxiv.org/pdf/2103.11511.pdf). Models use TF Keras and support inference in TF 1 and TF 2. Pretrained models are provided by [TensorFlow Hub](https://tfhub.dev/google/collections/movinet/), trained on [Kinetics 600](https://deepmind.com/research/open-source/kinetics) for video action classification." ] }, { "cell_type": "markdown", "metadata": { "id": "8_oLnvJy7kz5" }, "source": [ "## Setup\n", "\n", "It is recommended to run the models using GPUs or TPUs.\n", "\n", "To select a GPU/TPU in Colab, select `Runtime \u003e Change runtime type \u003e Hardware accelerator` dropdown in the top menu.\n", "\n", "### Install the TensorFlow Model Garden pip package\n", "\n", "- tf-models-official is the stable Model Garden package. Note that it may not include the latest changes in the tensorflow_models github repo.\n", "- To include latest changes, you may install tf-models-nightly, which is the nightly Model Garden package created daily automatically.\n", "pip will install all models and dependencies automatically.\n", "\n", "Install the [mediapy](https://github.com/google/mediapy) package for visualizing images/videos." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s3khsunT7kWa" }, "outputs": [], "source": [ "!pip install -q tf-models-nightly tfds-nightly\n", "\n", "!command -v ffmpeg \u003e/dev/null || (apt update \u0026\u0026 apt install -y ffmpeg)\n", "!pip install -q mediapy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dI_1csl6Q-gH" }, "outputs": [], "source": [ "import os\n", "from six.moves import urllib\n", "\n", "import matplotlib.pyplot as plt\n", "import mediapy as media\n", "import numpy as np\n", "from PIL import Image\n", "import tensorflow as tf\n", "import tensorflow_datasets as tfds\n", "import tensorflow_hub as hub\n", "\n", "from official.vision.beta.configs import video_classification\n", "from official.projects.movinet.configs import movinet as movinet_configs\n", "from official.projects.movinet.modeling import movinet\n", "from official.projects.movinet.modeling import movinet_layers\n", "from official.projects.movinet.modeling import movinet_model" ] }, { "cell_type": "markdown", "metadata": { "id": "6g0tuFvf71S9" }, "source": [ "## Example Usage with TensorFlow Hub\n", "\n", "Load MoViNet-A2-Base from TensorFlow Hub, as part of the [MoViNet collection](https://tfhub.dev/google/collections/movinet/).\n", "\n", "The following code will:\n", "\n", "- Load a MoViNet KerasLayer from [tfhub.dev](https://tfhub.dev).\n", "- Wrap the layer in a [Keras Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model).\n", "- Load an example image, and reshape it to a single frame video.\n", "- Classify the video" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nTUdhlRJzl2o" }, "outputs": [], "source": [ "movinet_a2_hub_url = 'https://tfhub.dev/tensorflow/movinet/a2/base/kinetics-600/classification/1'\n", "\n", "inputs = tf.keras.layers.Input(\n", " shape=[None, None, None, 3],\n", " dtype=tf.float32)\n", "\n", "encoder = hub.KerasLayer(movinet_a2_hub_url, trainable=True)\n", "\n", "# Important: To use tf.nn.conv3d on CPU, we must compile with tf.function.\n", "encoder.call = tf.function(encoder.call, experimental_compile=True)\n", "\n", "# [batch_size, 600]\n", "outputs = 
encoder(dict(image=inputs))\n", "\n", "model = tf.keras.Model(inputs, outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "7kU1_pL10l0B" }, "source": [ "To provide a simple example video for classification, we can load a static image and reshape it into a video with a single frame." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Iy0rKRrT723_" }, "outputs": [], "source": [ "image_url = 'https://upload.wikimedia.org/wikipedia/commons/8/84/Ski_Famille_-_Family_Ski_Holidays.jpg'\n", "image_height = 224\n", "image_width = 224\n", "\n", "with urllib.request.urlopen(image_url) as f:\n", " # PIL's resize() expects (width, height).\n", " image = Image.open(f).resize((image_width, image_height))\n", "video = tf.reshape(np.array(image), [1, 1, image_height, image_width, 3])\n", "video = tf.cast(video, tf.float32) / 255.\n", "\n", "image" ] }, { "cell_type": "markdown", "metadata": { "id": "Yf6EefHuWfxC" }, "source": [ "Run the model and print the predicted label index. The expected output is one of the skiing labels (indices 464-467), e.g., 465 = \"skiing crosscountry\".\n", "\n", "See [here](https://gist.github.com/willprice/f19da185c9c5f32847134b87c1960769#file-kinetics_600_labels-csv) for a full list of all labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OOpEKuqH8sH7" }, "outputs": [], "source": [ "output = model(video)\n", "output_label_index = tf.argmax(output, -1)[0].numpy()\n", "\n", "print(output_label_index)" ] }, { "cell_type": "markdown", "metadata": { "id": "_s-7bEoa3f8g" }, "source": [ "## Example Usage with the TensorFlow Model Garden\n", "\n", "Fine-tune MoViNet-A0-Base on [UCF-101](https://www.crcv.ucf.edu/research/data-sets/ucf101/).\n", "\n", "The following code will:\n", "\n", "- Load the UCF-101 dataset with [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/ucf101).\n", "- Create a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) pipeline for training and evaluation.\n", "- Display some example videos from the dataset.\n", "- Build a MoViNet model and load pretrained weights.\n", "- Fine-tune the final classifier layers on UCF-101." ] }, { "cell_type": "markdown", "metadata": { "id": "o7unW4WVr580" }, "source": [ "### Load the UCF-101 Dataset with TensorFlow Datasets\n", "\n", "Calling `download_and_prepare()` will automatically download the dataset. After downloading, the next cells print summary information about the dataset."
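, "\n", "\n", "Note that the UCF-101 download is about 6.5 GiB, so downloading and preparing the dataset can take a while. If you want TFDS to store the data in a specific location, you can pass a `data_dir` when creating the builder (a minimal sketch; the path below is illustrative):\n", "\n", "```python\n", "# Hypothetical location; any writable directory works, and later runs\n", "# pointed at the same data_dir will skip the download.\n", "builder = tfds.builder('ucf101', data_dir='/path/to/tensorflow_datasets')\n", "```"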
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FxM1vNYp_YAM" }, "outputs": [], "source": [ "dataset_name = 'ucf101'\n", "\n", "builder = tfds.builder(dataset_name)\n", "\n", "config = tfds.download.DownloadConfig(verify_ssl=False)\n", "builder.download_and_prepare(download_config=config)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "executionInfo": { "elapsed": 2957, "status": "ok", "timestamp": 1619748263684, "user": { "displayName": "", "photoUrl": "", "userId": "" }, "user_tz": 360 }, "id": "boQHbcfDhXpJ", "outputId": "eabc3307-d6bf-4f29-cc5a-c8dc6360701b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of classes: 101\n", "Number of examples for train: 9537\n", "Number of examples for test: 3783\n", "\n" ] }, { "data": { "text/plain": [ "tfds.core.DatasetInfo(\n", " name='ucf101',\n", " full_name='ucf101/ucf101_1_256/2.0.0',\n", " description=\"\"\"\n", " A 101-label video classification dataset.\n", " \"\"\",\n", " config_description=\"\"\"\n", " 256x256 UCF with the first action recognition split.\n", " \"\"\",\n", " homepage='https://www.crcv.ucf.edu/data-sets/ucf101/',\n", " data_path='/readahead/128M/placer/prod/home/tensorflow-datasets-cns-storage-owner/datasets/ucf101/ucf101_1_256/2.0.0',\n", " download_size=6.48 GiB,\n", " dataset_size=Unknown size,\n", " features=FeaturesDict({\n", " 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=101),\n", " 'video': Video(Image(shape=(256, 256, 3), dtype=tf.uint8)),\n", " }),\n", " supervised_keys=None,\n", " splits={\n", " 'test': \u003cSplitInfo num_examples=3783, num_shards=32\u003e,\n", " 'train': \u003cSplitInfo num_examples=9537, num_shards=64\u003e,\n", " },\n", " citation=\"\"\"@article{DBLP:journals/corr/abs-1212-0402,\n", " author = {Khurram Soomro and\n", " Amir Roshan Zamir and\n", " Mubarak Shah},\n", " title = {{UCF101:} {A} Dataset of 101 Human Actions Classes From Videos in\n", " The Wild},\n", " journal = {CoRR},\n", " volume = {abs/1212.0402},\n", " year = {2012},\n", " url = {http://arxiv.org/abs/1212.0402},\n", " archivePrefix = {arXiv},\n", " eprint = {1212.0402},\n", " timestamp = {Mon, 13 Aug 2018 16:47:45 +0200},\n", " biburl = {https://dblp.org/rec/bib/journals/corr/abs-1212-0402},\n", " bibsource = {dblp computer science bibliography, https://dblp.org}\n", " }\"\"\",\n", ")" ] }, "execution_count": 0, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "num_classes = builder.info.features['label'].num_classes\n", "num_examples = {\n", " name: split.num_examples\n", " for name, split in builder.info.splits.items()\n", "}\n", "\n", "print('Number of classes:', num_classes)\n", "print('Number of examples for train:', num_examples['train'])\n", "print('Number of examples for test:', num_examples['test'])\n", "print()\n", "\n", "builder.info" ] }, { "cell_type": "markdown", "metadata": { "id": "BsJJgnBBqDKZ" }, "source": [ "Build the training and evaluation datasets." 
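, "\n", "\n", "The cell below keeps every `frame_stride`-th frame and takes the first `num_frames` of those, so each example covers about `num_frames * frame_stride = 80` source frames (roughly 3 seconds at UCF-101's ~25 fps), resizes the frames to 172x172, and scales pixel values to [0, 1]. After running it, a quick shape check is a useful sanity test (a minimal sketch):\n", "\n", "```python\n", "videos, labels = next(iter(train_dataset))\n", "print(videos.shape)  # Expected: (8, 8, 172, 172, 3) = (batch, frames, height, width, RGB)\n", "print(labels.shape)  # Expected: (8, 101) one-hot labels\n", "```"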
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9cO_BCu9le3r" }, "outputs": [], "source": [ "batch_size = 8\n", "num_frames = 8\n", "frame_stride = 10\n", "resolution = 172\n", "\n", "def format_features(features):\n", " # Keep every frame_stride-th frame, then take the first num_frames.\n", " video = features['video']\n", " video = video[:, ::frame_stride]\n", " video = video[:, :num_frames]\n", "\n", " # Merge the batch and time axes so tf.image.resize can process all frames.\n", " video = tf.reshape(video, [-1, video.shape[2], video.shape[3], 3])\n", " video = tf.image.resize(video, (resolution, resolution))\n", " video = tf.reshape(video, [-1, num_frames, resolution, resolution, 3])\n", " video = tf.cast(video, tf.float32) / 255.\n", "\n", " label = tf.one_hot(features['label'], num_classes)\n", " return (video, label)\n", "\n", "train_dataset = builder.as_dataset(\n", " split='train',\n", " batch_size=batch_size,\n", " shuffle_files=True)\n", "train_dataset = train_dataset.map(\n", " format_features,\n", " num_parallel_calls=tf.data.AUTOTUNE)\n", "train_dataset = train_dataset.repeat()\n", "train_dataset = train_dataset.prefetch(2)\n", "\n", "test_dataset = builder.as_dataset(\n", " split='test',\n", " batch_size=batch_size)\n", "test_dataset = test_dataset.map(\n", " format_features,\n", " num_parallel_calls=tf.data.AUTOTUNE,\n", " deterministic=True)\n", "test_dataset = test_dataset.prefetch(2)" ] }, { "cell_type": "markdown", "metadata": { "id": "rToX7_Ymgh57" }, "source": [ "Display some example videos from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KG8Z7rUj06of" }, "outputs": [], "source": [ "videos, labels = next(iter(train_dataset))\n", "media.show_videos(videos.numpy(), codec='gif', fps=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "R3RHeuHdsd_3" }, "source": [ "### Build MoViNet-A0-Base and Load Pretrained Weights" ] }, { "cell_type": "markdown", "metadata": { "id": "JXVQOP9Rqk0I" }, "source": [ "Here we create a MoViNet model using the open-source code provided in [tensorflow/models](https://github.com/tensorflow/models) and load the pretrained weights. We freeze all layers except the final classifier head to speed up fine-tuning."
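, "\n", "\n", "Note that the pretrained TF Hub model has a 600-class Kinetics-600 head, so the cell below first builds the classifier with `num_classes=600` to make the variable names and shapes match, copies the pretrained weights over by name, and then wraps the same backbone with a fresh 101-class head for UCF-101. After running it, you can verify that the freeze took effect by counting trainable parameters (a minimal sketch):\n", "\n", "```python\n", "# Only the final classifier head should remain trainable.\n", "trainable = sum(int(tf.size(w)) for w in model.trainable_weights)\n", "total = sum(int(tf.size(w)) for w in model.weights)\n", "print(f'Trainable parameters: {trainable:,} of {total:,}')\n", "```"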
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JpfxpeGSsbzJ" }, "outputs": [], "source": [ "model_id = 'a0'\n", "\n", "tf.keras.backend.clear_session()\n", "\n", "backbone = movinet.Movinet(\n", " model_id=model_id)\n", "model = movinet_model.MovinetClassifier(\n", " backbone=backbone,\n", " num_classes=600)\n", "model.build([batch_size, num_frames, resolution, resolution, 3])\n", "\n", "# Load pretrained weights from TF Hub, matching variables by name\n", "movinet_hub_url = f'https://tfhub.dev/tensorflow/movinet/{model_id}/base/kinetics-600/classification/1'\n", "movinet_hub_model = hub.KerasLayer(movinet_hub_url, trainable=True)\n", "pretrained_weights = {w.name: w for w in movinet_hub_model.weights}\n", "model_weights = {w.name: w for w in model.weights}\n", "for name in pretrained_weights:\n", " model_weights[name].assign(pretrained_weights[name])\n", "\n", "# Wrap the backbone with a new classifier head with num_classes outputs\n", "model = movinet_model.MovinetClassifier(\n", " backbone=backbone,\n", " num_classes=num_classes)\n", "model.build([batch_size, num_frames, resolution, resolution, 3])\n", "\n", "# Freeze all layers except for the final classifier head\n", "for layer in model.layers[:-1]:\n", " layer.trainable = False\n", "model.layers[-1].trainable = True" ] }, { "cell_type": "markdown", "metadata": { "id": "ucntdu2xqgXB" }, "source": [ "Configure fine-tuning: training/evaluation steps, the loss, metrics, learning rate schedule, optimizer, and callbacks.\n", "\n", "Here we train for 3 epochs. Training for more epochs should improve accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WUYTw48BouTu" }, "outputs": [], "source": [ "num_epochs = 3\n", "\n", "train_steps = num_examples['train'] // batch_size\n", "total_train_steps = train_steps * num_epochs\n", "test_steps = num_examples['test'] // batch_size\n", "\n", "loss_obj = tf.keras.losses.CategoricalCrossentropy(\n", " from_logits=True,\n", " label_smoothing=0.1)\n", "\n", "metrics = [\n", " tf.keras.metrics.TopKCategoricalAccuracy(\n", " k=1, name='top_1', dtype=tf.float32),\n", " tf.keras.metrics.TopKCategoricalAccuracy(\n", " k=5, name='top_5', dtype=tf.float32),\n", "]\n", "\n", "initial_learning_rate = 0.01\n", "learning_rate = tf.keras.optimizers.schedules.CosineDecay(\n", " initial_learning_rate, decay_steps=total_train_steps,\n", ")\n", "optimizer = tf.keras.optimizers.RMSprop(\n", " learning_rate, rho=0.9, momentum=0.9, epsilon=1.0, clipnorm=1.0)\n", "\n", "model.compile(loss=loss_obj, optimizer=optimizer, metrics=metrics)\n", "\n", "callbacks = [\n", " tf.keras.callbacks.TensorBoard(),\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "0IyAOOlcpHna" }, "source": [ "Run fine-tuning with Keras `fit()`. After fine-tuning, the model should reach \u003e70% top-1 accuracy on the test set."
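, "\n", "\n", "Once training finishes, you can also run a standalone evaluation on the test split (a minimal sketch; `evaluate` returns the loss followed by the compiled metrics):\n", "\n", "```python\n", "loss, top_1, top_5 = model.evaluate(test_dataset, steps=test_steps)\n", "print(f'top-1 accuracy: {top_1:.3f}, top-5 accuracy: {top_5:.3f}')\n", "```"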
] }, { "cell_type": "code", "execution_count": null, "metadata": { "executionInfo": { "elapsed": 982253, "status": "ok", "timestamp": 1619750139919, "user": { "displayName": "", "photoUrl": "", "userId": "" }, "user_tz": 360 }, "id": "Zecc_K3lga8I", "outputId": "e4c5c61e-aa08-47db-c04c-42dea3efb545" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/3\n", "1192/1192 [==============================] - 348s 286ms/step - loss: 3.4914 - top_1: 0.3639 - top_5: 0.6294 - val_loss: 2.5153 - val_top_1: 0.5975 - val_top_5: 0.8565\n", "Epoch 2/3\n", "1192/1192 [==============================] - 286s 240ms/step - loss: 2.1397 - top_1: 0.6794 - top_5: 0.9231 - val_loss: 2.0695 - val_top_1: 0.6838 - val_top_5: 0.9070\n", "Epoch 3/3\n", "1192/1192 [==============================] - 348s 292ms/step - loss: 1.8925 - top_1: 0.7660 - top_5: 0.9454 - val_loss: 1.9848 - val_top_1: 0.7116 - val_top_5: 0.9227\n" ] } ], "source": [ "results = model.fit(\n", " train_dataset,\n", " validation_data=test_dataset,\n", " epochs=num_epochs,\n", " steps_per_epoch=train_steps,\n", " validation_steps=test_steps,\n", " callbacks=callbacks,\n", " validation_freq=1,\n", " verbose=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "XuH8XflmpU9d" }, "source": [ "We can also view the training and evaluation progress in TensorBoard." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9fZhzhRJRd2J" }, "outputs": [], "source": [ "%reload_ext tensorboard\n", "%tensorboard --logdir logs --port 0" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "movinet_tutorial.ipynb", "provenance": [ { "file_id": "11msGCxFjxwioBOBJavP9alfTclUQCJf-", "timestamp": 1617043059980 } ] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }