{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "3E96e1UKQ8uR" }, "source": [ "# MoViNet Tutorial\n", "\n", "This notebook provides basic example code to create, build, and run [MoViNets (Mobile Video Networks)](https://arxiv.org/pdf/2103.11511.pdf). Models use TF Keras and support inference in TF 1 and TF 2. Pretrained models are provided by [TensorFlow Hub](https://tfhub.dev/google/collections/movinet/), trained on [Kinetics 600](https://deepmind.com/research/open-source/kinetics) for video action classification." ] }, { "cell_type": "markdown", "metadata": { "id": "8_oLnvJy7kz5" }, "source": [ "## Setup\n", "\n", "It is recommended to run the models using GPUs or TPUs.\n", "\n", "To select a GPU/TPU in Colab, select `Runtime \u003e Change runtime type \u003e Hardware accelerator` dropdown in the top menu.\n", "\n", "### Install the TensorFlow Model Garden pip package\n", "\n", "- tf-models-official is the stable Model Garden package. Note that it may not include the latest changes in the tensorflow_models github repo.\n", "- To include latest changes, you may install tf-models-nightly, which is the nightly Model Garden package created daily automatically.\n", "pip will install all models and dependencies automatically.\n", "\n", "Install the [mediapy](https://github.com/google/mediapy) package for visualizing images/videos." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s3khsunT7kWa" }, "outputs": [], "source": [ "!pip install -q tf-models-nightly tfds-nightly\n", "\n", "!command -v ffmpeg \u003e/dev/null || (apt update \u0026\u0026 apt install -y ffmpeg)\n", "!pip install -q mediapy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dI_1csl6Q-gH" }, "outputs": [], "source": [ "import os\n", "from six.moves import urllib\n", "\n", "import matplotlib.pyplot as plt\n", "import mediapy as media\n", "import numpy as np\n", "from PIL import Image\n", "import tensorflow as tf\n", "import tensorflow_datasets as tfds\n", "import tensorflow_hub as hub\n", "\n", "from official.vision.beta.configs import video_classification\n", "from official.projects.movinet.configs import movinet as movinet_configs\n", "from official.projects.movinet.modeling import movinet\n", "from official.projects.movinet.modeling import movinet_layers\n", "from official.projects.movinet.modeling import movinet_model" ] }, { "cell_type": "markdown", "metadata": { "id": "6g0tuFvf71S9" }, "source": [ "## Example Usage with TensorFlow Hub\n", "\n", "Load MoViNet-A2-Base from TensorFlow Hub, as part of the [MoViNet collection](https://tfhub.dev/google/collections/movinet/).\n", "\n", "The following code will:\n", "\n", "- Load a MoViNet KerasLayer from [tfhub.dev](https://tfhub.dev).\n", "- Wrap the layer in a [Keras Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model).\n", "- Load an example image, and reshape it to a single frame video.\n", "- Classify the video" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nTUdhlRJzl2o" }, "outputs": [], "source": [ "movinet_a2_hub_url = 'https://tfhub.dev/tensorflow/movinet/a2/base/kinetics-600/classification/1'\n", "\n", "inputs = tf.keras.layers.Input(\n", " shape=[None, None, None, 3],\n", " dtype=tf.float32)\n", "\n", "encoder = hub.KerasLayer(movinet_a2_hub_url, trainable=True)\n", "\n", "# Important: To use tf.nn.conv3d on CPU, we must compile with tf.function.\n", "encoder.call = tf.function(encoder.call, experimental_compile=True)\n", "\n", "# [batch_size, 600]\n", "outputs = 
encoder(dict(image=inputs))\n", "\n", "model = tf.keras.Model(inputs, outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "7kU1_pL10l0B" }, "source": [ "To provide a simple example video for classification, we can load a static image and reshape it into a video with a single frame." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Iy0rKRrT723_" }, "outputs": [], "source": [ "image_url = 'https://upload.wikimedia.org/wikipedia/commons/8/84/Ski_Famille_-_Family_Ski_Holidays.jpg'\n", "image_height = 224\n", "image_width = 224\n", "\n", "with urllib.request.urlopen(image_url) as f:\n", " # PIL's resize() expects (width, height).\n", " image = Image.open(f).resize((image_width, image_height))\n", "video = tf.reshape(np.array(image), [1, 1, image_height, image_width, 3])\n", "video = tf.cast(video, tf.float32) / 255.\n", "\n", "image" ] }, { "cell_type": "markdown", "metadata": { "id": "Yf6EefHuWfxC" }, "source": [ "Run the model and print the predicted label index. The expected output is one of the skiing labels (indices 464-467), e.g., 465 = \"skiing crosscountry\".\n", "\n", "See [here](https://gist.github.com/willprice/f19da185c9c5f32847134b87c1960769#file-kinetics_600_labels-csv) for a full list of all labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OOpEKuqH8sH7" }, "outputs": [], "source": [ "output = model(video)\n", "output_label_index = tf.argmax(output, -1)[0].numpy()\n", "\n", "print(output_label_index)" ] }, { "cell_type": "markdown", "metadata": { "id": "_s-7bEoa3f8g" }, "source": [ "## Example Usage with the TensorFlow Model Garden\n", "\n", "Fine-tune MoViNet-A0-Base on [UCF-101](https://www.crcv.ucf.edu/research/data-sets/ucf101/).\n", "\n", "The following code will:\n", "\n", "- Load the UCF-101 dataset with [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/ucf101).\n", "- Create a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) pipeline for training and evaluation.\n", "- Display some example videos from the dataset.\n", "- Build a MoViNet model and load pretrained weights.\n", "- Fine-tune the final classifier layers on UCF-101." ] }, { "cell_type": "markdown", "metadata": { "id": "o7unW4WVr580" }, "source": [ "### Load the UCF-101 Dataset with TensorFlow Datasets\n", "\n", "Calling `download_and_prepare()` will automatically download the dataset. After downloading, the next cells print summary information about the dataset."
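, "\n", "\n", "Note that the UCF-101 download is about 6.5 GiB, so downloading and preparing the dataset can take a while. If you want TFDS to store the data in a specific location, you can pass a `data_dir` when creating the builder (a minimal sketch; the path below is illustrative):\n", "\n", "```python\n", "# Hypothetical location; any writable directory works, and later runs\n", "# pointed at the same data_dir will skip the download.\n", "builder = tfds.builder('ucf101', data_dir='/path/to/tensorflow_datasets')\n", "```"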
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FxM1vNYp_YAM" }, "outputs": [], "source": [ "dataset_name = 'ucf101'\n", "\n", "builder = tfds.builder(dataset_name)\n", "\n", "config = tfds.download.DownloadConfig(verify_ssl=False)\n", "builder.download_and_prepare(download_config=config)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "executionInfo": { "elapsed": 2957, "status": "ok", "timestamp": 1619748263684, "user": { "displayName": "", "photoUrl": "", "userId": "" }, "user_tz": 360 }, "id": "boQHbcfDhXpJ", "outputId": "eabc3307-d6bf-4f29-cc5a-c8dc6360701b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of classes: 101\n", "Number of examples for train: 9537\n", "Number of examples for test: 3783\n", "\n" ] }, { "data": { "text/plain": [ "tfds.core.DatasetInfo(\n", " name='ucf101',\n", " full_name='ucf101/ucf101_1_256/2.0.0',\n", " description=\"\"\"\n", " A 101-label video classification dataset.\n", " \"\"\",\n", " config_description=\"\"\"\n", " 256x256 UCF with the first action recognition split.\n", " \"\"\",\n", " homepage='https://www.crcv.ucf.edu/data-sets/ucf101/',\n", " data_path='/readahead/128M/placer/prod/home/tensorflow-datasets-cns-storage-owner/datasets/ucf101/ucf101_1_256/2.0.0',\n", " download_size=6.48 GiB,\n", " dataset_size=Unknown size,\n", " features=FeaturesDict({\n", " 'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=101),\n", " 'video': Video(Image(shape=(256, 256, 3), dtype=tf.uint8)),\n", " }),\n", " supervised_keys=None,\n", " splits={\n", " 'test': \u003cSplitInfo num_examples=3783, num_shards=32\u003e,\n", " 'train': \u003cSplitInfo num_examples=9537, num_shards=64\u003e,\n", " },\n", " citation=\"\"\"@article{DBLP:journals/corr/abs-1212-0402,\n", " author = {Khurram Soomro and\n", " Amir Roshan Zamir and\n", " Mubarak Shah},\n", " title = {{UCF101:} {A} Dataset of 101 Human Actions Classes From Videos in\n", " The Wild},\n", " journal = {CoRR},\n", " volume = {abs/1212.0402},\n", " year = {2012},\n", " url = {http://arxiv.org/abs/1212.0402},\n", " archivePrefix = {arXiv},\n", " eprint = {1212.0402},\n", " timestamp = {Mon, 13 Aug 2018 16:47:45 +0200},\n", " biburl = {https://dblp.org/rec/bib/journals/corr/abs-1212-0402},\n", " bibsource = {dblp computer science bibliography, https://dblp.org}\n", " }\"\"\",\n", ")" ] }, "execution_count": 0, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "num_classes = builder.info.features['label'].num_classes\n", "num_examples = {\n", " name: split.num_examples\n", " for name, split in builder.info.splits.items()\n", "}\n", "\n", "print('Number of classes:', num_classes)\n", "print('Number of examples for train:', num_examples['train'])\n", "print('Number of examples for test:', num_examples['test'])\n", "print()\n", "\n", "builder.info" ] }, { "cell_type": "markdown", "metadata": { "id": "BsJJgnBBqDKZ" }, "source": [ "Build the training and evaluation datasets." 
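, "\n", "\n", "The cell below keeps every `frame_stride`-th frame and takes the first `num_frames` of those, so each example covers about `num_frames * frame_stride = 80` source frames (roughly 3 seconds at UCF-101's ~25 fps), resizes the frames to 172x172, and scales pixel values to [0, 1]. After running it, a quick shape check is a useful sanity test (a minimal sketch):\n", "\n", "```python\n", "videos, labels = next(iter(train_dataset))\n", "print(videos.shape)  # Expected: (8, 8, 172, 172, 3) = (batch, frames, height, width, RGB)\n", "print(labels.shape)  # Expected: (8, 101) one-hot labels\n", "```"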
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9cO_BCu9le3r" }, "outputs": [], "source": [ "batch_size = 8\n", "num_frames = 8\n", "frame_stride = 10\n", "resolution = 172\n", "\n", "def format_features(features):\n", " # Keep every frame_stride-th frame, then take the first num_frames.\n", " video = features['video']\n", " video = video[:, ::frame_stride]\n", " video = video[:, :num_frames]\n", "\n", " # Merge the batch and time axes so tf.image.resize can process all frames.\n", " video = tf.reshape(video, [-1, video.shape[2], video.shape[3], 3])\n", " video = tf.image.resize(video, (resolution, resolution))\n", " video = tf.reshape(video, [-1, num_frames, resolution, resolution, 3])\n", " video = tf.cast(video, tf.float32) / 255.\n", "\n", " label = tf.one_hot(features['label'], num_classes)\n", " return (video, label)\n", "\n", "train_dataset = builder.as_dataset(\n", " split='train',\n", " batch_size=batch_size,\n", " shuffle_files=True)\n", "train_dataset = train_dataset.map(\n", " format_features,\n", " num_parallel_calls=tf.data.AUTOTUNE)\n", "train_dataset = train_dataset.repeat()\n", "train_dataset = train_dataset.prefetch(2)\n", "\n", "test_dataset = builder.as_dataset(\n", " split='test',\n", " batch_size=batch_size)\n", "test_dataset = test_dataset.map(\n", " format_features,\n", " num_parallel_calls=tf.data.AUTOTUNE,\n", " deterministic=True)\n", "test_dataset = test_dataset.prefetch(2)" ] }, { "cell_type": "markdown", "metadata": { "id": "rToX7_Ymgh57" }, "source": [ "Display some example videos from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KG8Z7rUj06of" }, "outputs": [], "source": [ "videos, labels = next(iter(train_dataset))\n", "media.show_videos(videos.numpy(), codec='gif', fps=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "R3RHeuHdsd_3" }, "source": [ "### Build MoViNet-A0-Base and Load Pretrained Weights" ] }, { "cell_type": "markdown", "metadata": { "id": "JXVQOP9Rqk0I" }, "source": [ "Here we create a MoViNet model using the open-source code provided in [tensorflow/models](https://github.com/tensorflow/models) and load the pretrained weights. We freeze all layers except the final classifier head to speed up fine-tuning."
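, "\n", "\n", "Note that the pretrained TF Hub model has a 600-class Kinetics-600 head, so the cell below first builds the classifier with `num_classes=600` to make the variable names and shapes match, copies the pretrained weights over by name, and then wraps the same backbone with a fresh 101-class head for UCF-101. After running it, you can verify that the freeze took effect by counting trainable parameters (a minimal sketch):\n", "\n", "```python\n", "# Only the final classifier head should remain trainable.\n", "trainable = sum(int(tf.size(w)) for w in model.trainable_weights)\n", "total = sum(int(tf.size(w)) for w in model.weights)\n", "print(f'Trainable parameters: {trainable:,} of {total:,}')\n", "```"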
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JpfxpeGSsbzJ" }, "outputs": [], "source": [ "model_id = 'a0'\n", "\n", "tf.keras.backend.clear_session()\n", "\n", "backbone = movinet.Movinet(\n", " model_id=model_id)\n", "model = movinet_model.MovinetClassifier(\n", " backbone=backbone,\n", " num_classes=600)\n", "model.build([batch_size, num_frames, resolution, resolution, 3])\n", "\n", "# Load pretrained weights from TF Hub, matching variables by name\n", "movinet_hub_url = f'https://tfhub.dev/tensorflow/movinet/{model_id}/base/kinetics-600/classification/1'\n", "movinet_hub_model = hub.KerasLayer(movinet_hub_url, trainable=True)\n", "pretrained_weights = {w.name: w for w in movinet_hub_model.weights}\n", "model_weights = {w.name: w for w in model.weights}\n", "for name in pretrained_weights:\n", " model_weights[name].assign(pretrained_weights[name])\n", "\n", "# Wrap the backbone with a new classifier head with num_classes outputs\n", "model = movinet_model.MovinetClassifier(\n", " backbone=backbone,\n", " num_classes=num_classes)\n", "model.build([batch_size, num_frames, resolution, resolution, 3])\n", "\n", "# Freeze all layers except for the final classifier head\n", "for layer in model.layers[:-1]:\n", " layer.trainable = False\n", "model.layers[-1].trainable = True" ] }, { "cell_type": "markdown", "metadata": { "id": "ucntdu2xqgXB" }, "source": [ "Configure fine-tuning: training/evaluation steps, the loss, metrics, learning rate schedule, optimizer, and callbacks.\n", "\n", "Here we train for 3 epochs. Training for more epochs should improve accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WUYTw48BouTu" }, "outputs": [], "source": [ "num_epochs = 3\n", "\n", "train_steps = num_examples['train'] // batch_size\n", "total_train_steps = train_steps * num_epochs\n", "test_steps = num_examples['test'] // batch_size\n", "\n", "loss_obj = tf.keras.losses.CategoricalCrossentropy(\n", " from_logits=True,\n", " label_smoothing=0.1)\n", "\n", "metrics = [\n", " tf.keras.metrics.TopKCategoricalAccuracy(\n", " k=1, name='top_1', dtype=tf.float32),\n", " tf.keras.metrics.TopKCategoricalAccuracy(\n", " k=5, name='top_5', dtype=tf.float32),\n", "]\n", "\n", "initial_learning_rate = 0.01\n", "learning_rate = tf.keras.optimizers.schedules.CosineDecay(\n", " initial_learning_rate, decay_steps=total_train_steps,\n", ")\n", "optimizer = tf.keras.optimizers.RMSprop(\n", " learning_rate, rho=0.9, momentum=0.9, epsilon=1.0, clipnorm=1.0)\n", "\n", "model.compile(loss=loss_obj, optimizer=optimizer, metrics=metrics)\n", "\n", "callbacks = [\n", " tf.keras.callbacks.TensorBoard(),\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "0IyAOOlcpHna" }, "source": [ "Run fine-tuning with Keras `fit()`. After fine-tuning, the model should reach \u003e70% top-1 accuracy on the test set."
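, "\n", "\n", "Once training finishes, you can also run a standalone evaluation on the test split (a minimal sketch; `evaluate` returns the loss followed by the compiled metrics):\n", "\n", "```python\n", "loss, top_1, top_5 = model.evaluate(test_dataset, steps=test_steps)\n", "print(f'top-1 accuracy: {top_1:.3f}, top-5 accuracy: {top_5:.3f}')\n", "```"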
] }, { "cell_type": "code", "execution_count": null, "metadata": { "executionInfo": { "elapsed": 982253, "status": "ok", "timestamp": 1619750139919, "user": { "displayName": "", "photoUrl": "", "userId": "" }, "user_tz": 360 }, "id": "Zecc_K3lga8I", "outputId": "e4c5c61e-aa08-47db-c04c-42dea3efb545" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/3\n", "1192/1192 [==============================] - 348s 286ms/step - loss: 3.4914 - top_1: 0.3639 - top_5: 0.6294 - val_loss: 2.5153 - val_top_1: 0.5975 - val_top_5: 0.8565\n", "Epoch 2/3\n", "1192/1192 [==============================] - 286s 240ms/step - loss: 2.1397 - top_1: 0.6794 - top_5: 0.9231 - val_loss: 2.0695 - val_top_1: 0.6838 - val_top_5: 0.9070\n", "Epoch 3/3\n", "1192/1192 [==============================] - 348s 292ms/step - loss: 1.8925 - top_1: 0.7660 - top_5: 0.9454 - val_loss: 1.9848 - val_top_1: 0.7116 - val_top_5: 0.9227\n" ] } ], "source": [ "results = model.fit(\n", " train_dataset,\n", " validation_data=test_dataset,\n", " epochs=num_epochs,\n", " steps_per_epoch=train_steps,\n", " validation_steps=test_steps,\n", " callbacks=callbacks,\n", " validation_freq=1,\n", " verbose=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "XuH8XflmpU9d" }, "source": [ "We can also view the training and evaluation progress in TensorBoard." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9fZhzhRJRd2J" }, "outputs": [], "source": [ "%reload_ext tensorboard\n", "%tensorboard --logdir logs --port 0" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "movinet_tutorial.ipynb", "provenance": [ { "file_id": "11msGCxFjxwioBOBJavP9alfTclUQCJf-", "timestamp": 1617043059980 } ] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }