{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Welcome to torchvision's new video API\n",
    "\n",
    "Here, we examine the capabilities of the new video API, along with examples of how to build datasets on top of it.\n",
    "\n",
    "### Table of contents\n",
    "1. Introduction: building a new video object and examining the properties\n",
    "2. Building a sample `read_video` function\n",
    "3. Building an example dataset (can be applied to e.g. kinetics400)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('1.7.0a0+f5c95d5', '0.8.0a0+a2f405d')"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch, torchvision\n",
    "torch.__version__, torchvision.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true to ./WUzgd7C1pWA.mp4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100.4%"
     ]
    }
   ],
   "source": [
    "# download the sample video\n",
    "from torchvision.datasets.utils import download_url\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true\", \".\", \"WUzgd7C1pWA.mp4\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction: building a new video object and examining the properties\n",
    "\n",
    "First we select a video to test the object out. For the sake of argument, we're using one from the Kinetics400 dataset. To create the video object, we need to define the path and the stream we want to use. See the inline comments for a description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch, torchvision\n",
    "\"\"\"\n",
    "chosen video statistics:\n",
    "WUzgd7C1pWA.mp4\n",
    "  - source: kinetics-400\n",
    "  - video: H-264 - MPEG-4 AVC (part 10) (avc1)\n",
    "    - fps: 29.97\n",
    "  - audio: MPEG AAC audio (mp4a)\n",
    "    - sample rate: 48K Hz\n",
    "\"\"\"\n",
    "video_path = \"./WUzgd7C1pWA.mp4\"\n",
    "\n",
    "\"\"\"\n",
    "Streams are defined in a similar fashion to torch devices. We encode them as strings of the form\n",
    "`stream_type:stream_id`, where stream_type is a string and stream_id is a long int.\n",
    "\n",
    "The constructor also accepts a stream_type alone, in which case the stream is auto-discovered.\n",
    "\"\"\"\n",
    "stream = \"video\"\n",
    "\n",
    "\n",
    "\n",
    "video = torchvision.io.VideoReader(video_path, stream)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's get the metadata for our particular video:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'video': {'duration': [10.9109], 'fps': [29.97002997002997]},\n",
       " 'audio': {'duration': [10.9], 'framerate': [48000.0]}}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "video.get_metadata()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that the video has two streams: a video stream and an audio stream.\n",
    "\n",
    "Let's read all the frames from the video stream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  327\n",
      "We can expect approx:  327.0\n",
      "Tensor size:  torch.Size([3, 256, 340])\n"
     ]
    }
   ],
   "source": [
    "# first we select the video stream \n",
    "metadata = video.get_metadata()\n",
    "video.set_current_stream(\"video:0\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "for frame, pts in video:\n",
    "    frames.append(frame)\n",
    "    \n",
    "print(\"Total number of frames: \", len(frames))\n",
    "approx_nf = metadata['video']['duration'][0] * metadata['video']['fps'][0]\n",
    "print(\"We can expect approx: \", approx_nf)\n",
    "print(\"Tensor size: \", frames[0].size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that selecting video stream zero is equivalent to letting the video stream be selected automatically, i.e. `video:0` and `video` give the same results in this case.\n",
    "\n",
    "Let's do the same for the audio stream:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  511\n",
      "Approx total number of datapoints we can expect:  523200.0\n",
      "Read data size:  523264\n"
     ]
    }
   ],
   "source": [
    "metadata = video.get_metadata()\n",
    "video.set_current_stream(\"audio\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "for frame, pts in video:\n",
    "    frames.append(frame)\n",
    "    \n",
    "print(\"Total number of frames: \", len(frames))\n",
    "approx_nf = metadata['audio']['duration'][0] * metadata['audio']['framerate'][0]\n",
    "print(\"Approx total number of datapoints we can expect: \", approx_nf)\n",
    "print(\"Read data size: \", frames[0].size(0) * len(frames))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But what if we only want to read a certain time segment of the video?\n",
    "\n",
    "That can be done easily by combining our `seek` function with the fact that each call to `next` returns the presentation timestamp of the returned frame in seconds. Because our implementation relies on Python iterators, we can leverage `itertools` to simplify the process and make it more pythonic.\n",
    "\n",
    "For example, if we wanted to read ten frames starting from the 2nd second:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  10\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "video.set_current_stream(\"video\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "\n",
    "# we seek to the 2nd second of the video\n",
    "# and use islice to get the next 10 frames\n",
    "for frame, pts in itertools.islice(video.seek(2), 10):\n",
    "    frames.append(frame)\n",
    "    \n",
    "print(\"Total number of frames: \", len(frames))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or if we wanted to read from the 2nd to the 5th second:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of frames:  90\n",
      "We can expect approx:  89.91008991008991\n",
      "Tensor size:  torch.Size([3, 256, 340])\n"
     ]
    }
   ],
   "source": [
    "video.set_current_stream(\"video\")\n",
    "\n",
    "frames = []  # we are going to save the frames here.\n",
    "\n",
    "# we seek to the 2nd second of the video\n",
    "video = video.seek(2)\n",
    "# then we use itertools.takewhile to keep reading frames\n",
    "# until the presentation timestamp passes the 5th second\n",
    "for frame, pts in itertools.takewhile(lambda x: x[1] <= 5, video):\n",
    "    frames.append(frame)\n",
    "\n",
    "print(\"Total number of frames: \", len(frames))\n",
    "approx_nf = (5-2) * video.get_metadata()['video']['fps'][0]\n",
    "print(\"We can expect approx: \", approx_nf)\n",
    "print(\"Tensor size: \", frames[0].size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Building a sample `read_video` function\n",
    "\n",
    "We can use the methods above to build a function that follows the same API as the existing `read_video` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def example_read_video(video_object, start=0, end=None, read_video=True, read_audio=True):\n",
    "\n",
    "    if end is None:\n",
    "        end = float(\"inf\")\n",
    "    if end < start:\n",
    "        raise ValueError(\n",
    "            \"end time should be larger than start time, got \"\n",
    "            \"start time={} and end time={}\".format(start, end)\n",
    "        )\n",
    "    \n",
    "    video_frames = torch.empty(0)\n",
    "    video_pts = []\n",
    "    if read_video:\n",
    "        video_object.set_current_stream(\"video\")\n",
    "        frames = []\n",
    "        for t, pts in itertools.takewhile(lambda x: x[1] <= end, video_object.seek(start)):\n",
    "            frames.append(t)\n",
    "            video_pts.append(pts)\n",
    "        if len(frames) > 0:\n",
    "            video_frames = torch.stack(frames, 0)\n",
    "\n",
    "    audio_frames = torch.empty(0)\n",
    "    audio_pts = []\n",
    "    if read_audio:\n",
    "        video_object.set_current_stream(\"audio\")\n",
    "        frames = []\n",
    "        for t, pts in itertools.takewhile(lambda x: x[1] <= end, video_object.seek(start)):\n",
    "            frames.append(t)\n",
    "            audio_pts.append(pts)\n",
    "        if len(frames) > 0:\n",
    "            audio_frames = torch.cat(frames, 0)\n",
    "\n",
    "    return video_frames, audio_frames, (video_pts, audio_pts), video_object.get_metadata()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([327, 3, 256, 340]) torch.Size([523264, 1])\n"
     ]
    }
   ],
   "source": [
    "vf, af, info, meta = example_read_video(video)\n",
    "# total number of frames should be 327 for video and 523264 datapoints for audio\n",
    "print(vf.size(), af.size())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "torch.Size([523264, 1])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# you can also inspect the audio frames on their own\n",
    "af.size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Building an example randomly sampled dataset (can be applied to the training dataset of kinetics400)\n",
    "\n",
    "Now we can use the same principle to build a sample dataset. We suggest using an iterable dataset for this purpose.\n",
    "\n",
    "Here, we are going to build\n",
    "\n",
    "a. an example dataset that reads a randomly selected clip of `clip_len` consecutive frames (16 by default) from each sampled video"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make sample dataset\n",
    "import os\n",
    "os.makedirs(\"./dataset\", exist_ok=True)\n",
    "os.makedirs(\"./dataset/1\", exist_ok=True)\n",
    "os.makedirs(\"./dataset/2\", exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true to ./dataset/1/WUzgd7C1pWA.mp4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100.4%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi?raw=true to ./dataset/1/RATRACE_wave_f_nm_np1_fr_goo_37.avi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "102.5%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/SOX5yA1l24A.mp4?raw=true to ./dataset/2/SOX5yA1l24A.mp4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100.9%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g23_c01.avi?raw=true to ./dataset/2/v_SoccerJuggling_g23_c01.avi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "101.5%"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g24_c01.avi?raw=true to ./dataset/2/v_SoccerJuggling_g24_c01.avi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "101.3%"
     ]
    }
   ],
   "source": [
    "# download the videos \n",
    "from torchvision.datasets.utils import download_url\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/WUzgd7C1pWA.mp4?raw=true\", \"./dataset/1\", \"WUzgd7C1pWA.mp4\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi?raw=true\", \"./dataset/1\", \"RATRACE_wave_f_nm_np1_fr_goo_37.avi\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/SOX5yA1l24A.mp4?raw=true\", \"./dataset/2\", \"SOX5yA1l24A.mp4\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g23_c01.avi?raw=true\", \"./dataset/2\", \"v_SoccerJuggling_g23_c01.avi\")\n",
    "download_url(\"https://github.com/pytorch/vision/blob/master/test/assets/videos/v_SoccerJuggling_g24_c01.avi?raw=true\", \"./dataset/2\", \"v_SoccerJuggling_g24_c01.avi\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# housekeeping and utilities\n",
    "import os\n",
    "import random\n",
    "\n",
    "import torch\n",
    "from torchvision.datasets.folder import make_dataset\n",
    "from torchvision import transforms as t\n",
    "\n",
    "def _find_classes(dir):\n",
    "    classes = [d.name for d in os.scandir(dir) if d.is_dir()]\n",
    "    classes.sort()\n",
    "    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}\n",
    "    return classes, class_to_idx\n",
    "\n",
    "def get_samples(root, extensions=(\".mp4\", \".avi\")):\n",
    "    _, class_to_idx = _find_classes(root)\n",
    "    return make_dataset(root, class_to_idx, extensions=extensions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to define the dataset and some basic arguments. We assume the structure of the FolderDataset, and add the following parameters:\n",
    "    \n",
    "1. frame transform: with this API, we can choose to apply transforms to every frame of the video\n",
    "2. video transform: equally, we can apply a transform to the whole clip as a 4D tensor\n",
    "3. length of the clip: do we want a single frame or multiple frames?\n",
    "\n",
    "Note that we also add `epoch_size`, since using the `IterableDataset` class allows us to naturally oversample clips or images from each video if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RandomDataset(torch.utils.data.IterableDataset):\n",
    "    def __init__(self, root, epoch_size=None, frame_transform=None, video_transform=None, clip_len=16):\n",
    "        super().__init__()\n",
    "        \n",
    "        self.samples = get_samples(root)\n",
    "         \n",
    "        # allow for temporal jittering\n",
    "        if epoch_size is None:\n",
    "            epoch_size = len(self.samples)\n",
    "        self.epoch_size = epoch_size\n",
    "        \n",
    "        self.clip_len = clip_len  # length of a clip in frames\n",
    "        self.frame_transform = frame_transform  # transform for every frame individually\n",
    "        self.video_transform = video_transform # transform on a video sequence\n",
    "\n",
    "    def __iter__(self):\n",
    "        for i in range(self.epoch_size):\n",
    "            # get random sample\n",
    "            path, target = random.choice(self.samples)\n",
    "            # get video object\n",
    "            vid = torchvision.io.VideoReader(path, \"video\")\n",
    "            metadata = vid.get_metadata()\n",
    "            video_frames = [] # video frame buffer \n",
    "            # seek and return frames\n",
    "            \n",
    "            max_seek = metadata[\"video\"]['duration'][0] - (self.clip_len / metadata[\"video\"]['fps'][0])\n",
    "            start = random.uniform(0., max_seek)\n",
    "            for frame, current_pts in itertools.islice(vid.seek(start), self.clip_len):\n",
    "                video_frames.append(self.frame_transform(frame) if self.frame_transform else frame)\n",
    "            # stack it into a tensor\n",
    "            video = torch.stack(video_frames, 0)\n",
    "            if self.video_transform:\n",
    "                video = self.video_transform(video)\n",
    "            output = {\n",
    "                'path': path,\n",
    "                'video': video,\n",
    "                'target': target,\n",
    "                'start': start,\n",
    "                'end': current_pts}\n",
    "            yield output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a root directory of videos in a folder structure, i.e.:\n",
    "```\n",
    "dataset:\n",
    "    - class 1:\n",
    "        file 0\n",
    "        file 1\n",
    "        ...\n",
    "    - class 2:\n",
    "        file 0\n",
    "        file 1\n",
    "        ...\n",
    "    - ...\n",
    "```\n",
    "We can generate a dataloader and test the dataset. \n",
    "            "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchvision import transforms as t\n",
    "transforms = [t.Resize((112, 112))]\n",
    "frame_transform = t.Compose(transforms)\n",
    "\n",
    "ds = RandomDataset(\"./dataset\", epoch_size=None, frame_transform=frame_transform)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "loader = DataLoader(ds, batch_size=12)\n",
    "d = {\"video\":[], 'start':[], 'end':[], 'tensorsize':[]}\n",
    "for b in loader:\n",
    "    for i in range(len(b['path'])):\n",
    "        d['video'].append(b['path'][i])\n",
    "        d['start'].append(b['start'][i].item())\n",
    "        d['end'].append(b['end'][i].item())\n",
    "        d['tensorsize'].append(b['video'][i].size())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'video': ['./dataset/1/WUzgd7C1pWA.mp4',\n",
       "  './dataset/1/WUzgd7C1pWA.mp4',\n",
       "  './dataset/2/v_SoccerJuggling_g23_c01.avi',\n",
       "  './dataset/2/v_SoccerJuggling_g23_c01.avi',\n",
       "  './dataset/1/RATRACE_wave_f_nm_np1_fr_goo_37.avi'],\n",
       " 'start': [8.97932147319667,\n",
       "  9.421856461438313,\n",
       "  2.1301381796579437,\n",
       "  5.514273689529127,\n",
       "  0.31979853297913124],\n",
       " 'end': [9.5095, 9.943266999999999, 2.635967, 6.0393669999999995, 0.833333],\n",
       " 'tensorsize': [torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112]),\n",
       "  torch.Size([16, 3, 112, 112])]}"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Cleanup\n",
    "import os, shutil\n",
    "os.remove(\"./WUzgd7C1pWA.mp4\")\n",
    "shutil.rmtree(\"./dataset\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}