video_api.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Welcome to torchvision's new video API\n",
    "\n",
    "Here, we're going to examine the capabilities of the new video API, together with the examples on how to build datasets and more. \n",
    "\n",
    "### Table of contents\n",
    "1. Introduction: building a new video object and examining the properties\n",
    "2. Building a sample `read_video` function\n",
    "3. Building an example dataset (can be applied to e.g. kinetics400)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('1.7.0a0+f5c95d5', '0.8.0a0+a2f405d')"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch, torchvision\n",
    "torch.__version__, torchvision.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true to ./WUzgd7C1pWA.mp4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100.4%"
     ]
    }
   ],
   "source": [
    "# download the sample video\n",
    "from torchvision.datasets.utils import download_url\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true\", \".\", \"WUzgd7C1pWA.mp4\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction: building a new video object and examining the properties\n",
    "\n",
    "First we select a video to test the object out. For the sake of argument we're using one from Kinetics400 dataset. To create it, we need to define the path and the stream we want to use. See inline comments for description.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch, torchvision\n",
    "\"\"\"\n",
    "chosen video statistics:\n",
    "WUzgd7C1pWA.mp4\n",
    "  - source: kinetics-400\n",
    "  - video: H-264 - MPEG-4 AVC (part 10) (avc1)\n",
    "    - fps: 29.97\n",
    "  - audio: MPEG AAC audio (mp4a)\n",
    "    - sample rate: 48K Hz\n",
    "\"\"\"\n",
    "video_path = \"./WUzgd7C1pWA.mp4\"\n",
    "\n",
    "\"\"\"\n",
    "streams are defined in a similar fashion as torch devices. We encode them as strings in a form\n",
    "of `stream_type:stream_id` where stream_type is a string and stream_id a long int. \n",
    "\n",
    "The constructor accepts passing a stream_type only, in which case the stream is auto-discovered.\n",
    "\"\"\"\n",
    "stream = \"video\"\n",
    "\n",
    "\n",
    "\n",
    "video = torchvision.io.VideoReader(video_path, stream)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's get the metadata for our particular video:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'video': {'duration': [10.9109], 'fps': [29.97002997002997]},\n",
       " 'audio': {'duration': [10.9], 'framerate': [48000.0]}}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "video.get_metadata()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that video has two streams - a video and an audio stream. \n",
    "\n",
    "Let's read all the frames from the video stream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  327\n",
      "We can expect approx:  327.0\n",
      "Tensor size:  torch.Size([3, 256, 340])\n"
     ]
    }
   ],
   "source": [
    "# first we select the video stream \n",
    "metadata = video.get_metadata()\n",
    "video.set_current_stream(\"video:0\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "for frame, pts in video:\n",
    "    frames.append(frame)\n",
    "    \n",
    "print(\"Total number of frames: \", len(frames))\n",
    "approx_nf = metadata['video']['duration'][0] * metadata['video']['fps'][0]\n",
    "print(\"We can expect approx: \", approx_nf)\n",
    "print(\"Tensor size: \", frames[0].size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that selecting zero video stream is equivalent to selecting video stream automatically. I.e. `video:0` and `video` will end up with same results in this case. \n",
    "\n",
    "Let's try this for audio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  511\n",
      "Approx total number of datapoints we can expect:  523200.0\n",
      "Read data size:  523264\n"
     ]
    }
   ],
   "source": [
    "metadata = video.get_metadata()\n",
    "video.set_current_stream(\"audio\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "for frame, pts in video:\n",
    "    frames.append(frame)\n",
    "    \n",
    "print(\"Total number of frames: \", len(frames))\n",
    "approx_nf = metadata['audio']['duration'][0] * metadata['audio']['framerate'][0]\n",
    "print(\"Approx total number of datapoints we can expect: \", approx_nf)\n",
    "print(\"Read data size: \", frames[0].size(0) * len(frames))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But what if we only want to read certain time segment of the video?\n",
    "\n",
    "That can be done easily using the combination of our seek function, and the fact that each call to next returns the presentation timestamp of the returned frame in seconds. Given that our implementation relies on python iterators, we can leverage `itertools` to simplify the process and make it more pythonic. \n",
    "\n",
    "For example, if we wanted to read ten frames from second second:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  10\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "video.set_current_stream(\"video\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "\n",
    "# we seek into a second second of the video\n",
    "# and use islice to get 10 frames since\n",
    "for frame, pts in itertools.islice(video.seek(2), 10):\n",
    "    frames.append(frame)\n",
    "    \n",
    "print(\"Total number of frames: \", len(frames))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or if we wanted to read from 2nd to 5th second:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  90\n",
      "We can expect approx:  89.91008991008991\n",
      "Tensor size:  torch.Size([3, 256, 340])\n"
     ]
    }
   ],
   "source": [
    "video.set_current_stream(\"video\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "\n",
    "# we seek into a second second of the video\n",
    "video = video.seek(2)\n",
    "# then we utilize the itertools takewhile to get the \n",
    "# correct number of frames\n",
    "for frame, pts in itertools.takewhile(lambda x: x[1] <= 5, video):\n",
    "    frames.append(frame)\n",
    "\n",
    "print(\"Total number of frames: \", len(frames))\n",
    "approx_nf = (5-2) * video.get_metadata()['video']['fps'][0]\n",
    "print(\"We can expect approx: \", approx_nf)\n",
    "print(\"Tensor size: \", frames[0].size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Building a sample `read_video` function\n",
    "\n",
    "We can utilize the methods above to build the read video function that follows the same API to the existing `read_video` function "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def example_read_video(video_object, start=0, end=None, read_video=True, read_audio=True):\n",
    "\n",
    "    if end is None:\n",
    "        end = float(\"inf\")\n",
    "    if end < start:\n",
    "        raise ValueError(\n",
    "            \"end time should be larger than start time, got \"\n",
    "            \"start time={} and end time={}\".format(s, e)\n",
    "        )\n",
    "    \n",
    "    video_frames = torch.empty(0)\n",
    "    video_pts = []\n",
    "    if read_video:\n",
    "        video_object.set_current_stream(\"video\")\n",
    "        frames = []\n",
    "        for t, pts in itertools.takewhile(lambda x: x[1] <= end, video_object.seek(start)):\n",
    "            frames.append(t)\n",
    "            video_pts.append(pts)\n",
    "        if len(frames) > 0:\n",
    "            video_frames = torch.stack(frames, 0)\n",
    "\n",
    "    audio_frames = torch.empty(0)\n",
    "    audio_pts = []\n",
    "    if read_audio:\n",
    "        video_object.set_current_stream(\"audio\")\n",
    "        frames = []\n",
    "        for t, pts in itertools.takewhile(lambda x: x[1] <= end, video_object.seek(start)):\n",
    "            frames.append(t)\n",
    "            video_pts.append(pts)\n",
    "        if len(frames) > 0:\n",
    "            audio_frames = torch.cat(frames, 0)\n",
    "\n",
    "    return video_frames, audio_frames, (video_pts, audio_pts), video_object.get_metadata()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([327, 3, 256, 340]) torch.Size([523264, 1])\n"
     ]
    }
   ],
   "source": [
    "vf, af, info, meta = example_read_video(video)\n",
    "# total number of frames should be 327 for video and 523264 datapoints for audio\n",
    "print(vf.size(), af.size())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([523264, 1])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# you can also get the sequence of audio frames as well\n",
    "af.size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Building an example randomly sampled dataset (can be applied to training dataest of kinetics400)\n",
    "\n",
    "Cool, so now we can use the same principle to make the sample dataset. We suggest trying out iterable dataset for this purpose. \n",
    "\n",
    "Here, we are going to build\n",
    "\n",
    "a. an example dataset that reads randomly selected 10 frames of video"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make sample dataest\n",
    "import os\n",
    "os.makedirs(\"./dataset\", exist_ok=True)\n",
    "os.makedirs(\"./dataset/1\", exist_ok=True)\n",
    "os.makedirs(\"./dataset/2\", exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true to ./dataset/1/WUzgd7C1pWA.mp4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100.4%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi?raw=true to ./dataset/1/RATRACE_wave_f_nm_np1_fr_goo_37.avi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "102.5%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/SOX5yA1l24A.mp4?raw=true to ./dataset/2/SOX5yA1l24A.mp4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100.9%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g23_c01.avi?raw=true to ./dataset/2/v_SoccerJuggling_g23_c01.avi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "101.5%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g24_c01.avi?raw=true to ./dataset/2/v_SoccerJuggling_g24_c01.avi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "101.3%"
     ]
    }
   ],
   "source": [
    "# download the videos \n",
    "from torchvision.datasets.utils import download_url\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true\", \"./dataset/1\", \"WUzgd7C1pWA.mp4\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi?raw=true\", \"./dataset/1\", \"RATRACE_wave_f_nm_np1_fr_goo_37.avi\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/SOX5yA1l24A.mp4?raw=true\", \"./dataset/2\", \"SOX5yA1l24A.mp4\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g23_c01.avi?raw=true\", \"./dataset/2\", \"v_SoccerJuggling_g23_c01.avi\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g24_c01.avi?raw=true\", \"./dataset/2\", \"v_SoccerJuggling_g24_c01.avi\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# housekeeping and utilities\n",
    "import os\n",
    "import random\n",
    "\n",
    "import torch\n",
    "from torchvision.datasets.folder import make_dataset\n",
    "from torchvision import transforms as t\n",
    "\n",
    "def _find_classes(dir):\n",
    "    classes = [d.name for d in os.scandir(dir) if d.is_dir()]\n",
    "    classes.sort()\n",
    "    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}\n",
    "    return classes, class_to_idx\n",
    "\n",
    "def get_samples(root, extensions=(\".mp4\", \".avi\")):\n",
    "    _, class_to_idx = _find_classes(root)\n",
    "    return make_dataset(root, class_to_idx, extensions=extensions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to define the dataset and some basic arguments. We asume the structure of the FolderDataset, and add the following parameters:\n",
    "    \n",
    "1. frame transform: with this API, we can chose to apply transforms on every frame of the video\n",
    "2. videotransform: equally, we can also apply transform to a 4D tensor\n",
    "3. length of the clip: do we want a single or multiple frames?\n",
    "\n",
    "Note that we actually add `epoch size` as using `IterableDataset` class allows us to naturally oversample clips or images from each video if needed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RandomDataset(torch.utils.data.IterableDataset):\n",
    "    def __init__(self, root, epoch_size=None, frame_transform=None, video_transform=None, clip_len=16):\n",
    "        super(RandomDataset).__init__()\n",
    "        \n",
    "        self.samples = get_samples(root)\n",
    "         \n",
    "        # allow for temporal jittering\n",
    "        if epoch_size is None:\n",
    "            epoch_size = len(self.samples)\n",
    "        self.epoch_size = epoch_size\n",
    "        \n",
    "        self.clip_len = clip_len  # length of a clip in frames\n",
    "        self.frame_transform = frame_transform  # transform for every frame individually\n",
    "        self.video_transform = video_transform # transform on a video sequence\n",
    "\n",
    "    def __iter__(self):\n",
    "        for i in range(self.epoch_size):\n",
    "            # get random sample\n",
    "            path, target = random.choice(self.samples)\n",
    "            # get video object\n",
    "            vid = torchvision.io.VideoReader(path, \"video\")\n",
    "            metadata = vid.get_metadata()\n",
    "            video_frames = [] # video frame buffer \n",
    "            # seek and return frames\n",
    "            \n",
    "            max_seek = metadata[\"video\"]['duration'][0] - (self.clip_len / metadata[\"video\"]['fps'][0])\n",
    "            start = random.uniform(0., max_seek)\n",
    "            for frame, current_pts in itertools.islice(vid.seek(start), self.clip_len):\n",
    "                video_frames.append(self.frame_transform(frame))\n",
    "            # stack it into a tensor\n",
    "            video = torch.stack(video_frames, 0)\n",
    "            if self.video_transform:\n",
    "                video = self.video_transform(video)\n",
    "            output = {\n",
    "                'path': path,\n",
    "                'video': video,\n",
    "                'target': target,\n",
    "                'start': start,\n",
    "                'end': current_pts}\n",
    "            yield output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a path of videos in a folder structure, i.e:\n",
    "```\n",
    "dataset:\n",
    "    -class 1:\n",
    "        file 0\n",
    "        file 1\n",
    "        ...\n",
    "    - class 2:\n",
    "        file 0\n",
    "        file 1\n",
    "        ...\n",
    "    - ...\n",
    "```\n",
    "We can generate a dataloader and test the dataset. \n",
    "            "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchvision import transforms as t\n",
    "transforms = [t.Resize((112, 112))]\n",
    "frame_transform = t.Compose(transforms)\n",
    "\n",
    "ds = RandomDataset(\"./dataset\", epoch_size=None, frame_transform=frame_transform)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "loader = DataLoader(ds, batch_size=12)\n",
    "d = {\"video\":[], 'start':[], 'end':[], 'tensorsize':[]}\n",
    "for b in loader:\n",
    "    for i in range(len(b['path'])):\n",
    "        d['video'].append(b['path'][i])\n",
    "        d['start'].append(b['start'][i].item())\n",
    "        d['end'].append(b['end'][i].item())\n",
    "        d['tensorsize'].append(b['video'][i].size())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'video': ['./dataset/1/WUzgd7C1pWA.mp4',\n",
       "  './dataset/1/WUzgd7C1pWA.mp4',\n",
       "  './dataset/2/v_SoccerJuggling_g23_c01.avi',\n",
       "  './dataset/2/v_SoccerJuggling_g23_c01.avi',\n",
       "  './dataset/1/RATRACE_wave_f_nm_np1_fr_goo_37.avi'],\n",
       " 'start': [8.97932147319667,\n",
       "  9.421856461438313,\n",
       "  2.1301381796579437,\n",
       "  5.514273689529127,\n",
       "  0.31979853297913124],\n",
       " 'end': [9.5095, 9.943266999999999, 2.635967, 6.0393669999999995, 0.833333],\n",
       " 'tensorsize': [torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112])]}"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Cleanup\n",
    "import os, shutil\n",
    "os.remove(\"./WUzgd7C1pWA.mp4\")\n",
    "shutil.rmtree(\"./dataset\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}