Unverified Commit 32e16805 authored by Francisco Massa, committed by GitHub

Update video reader to use new decoder (#1978)

* Base decoder for video. (#1747)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1747

Pull Request resolved: https://github.com/pytorch/vision/pull/1746

Added the implementation of an FFmpeg-based decoder with functionality that can be used in VUE and TorchVision.

Reviewed By: fmassa

Differential Revision: D19358914

fbshipit-source-id: abb672f89bfaca6351dda2354f0d35cf8e47fa0f

* Integrated base decoder into VideoReader class and video_utils.py (#1766)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1766

Replaced FfmpegDecoder (incompatible with VUE) with the base decoder (compatible with VUE).
Modified the Python utilities in video_utils.py for internal simplification. The public interface is preserved.

Reviewed By: fmassa

Differential Revision: D19415903

fbshipit-source-id: 4d7a0158bd77bac0a18732fe4183fdd9a57f6402

* Optimizing base decoder performance. (#1852)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1852

Changed the base decoder internals for faster clip processing.

Reviewed By: stephenyan1231

Differential Revision: D19748379

fbshipit-source-id: 58a435f0a0b25545e7bd1a3edb0b1d558176a806

* Minor fix and decoder class member access.

Summary:
Found and fixed a bug in the cropping algorithm (a simple typo).
Derived classes also need access to some decoder class members, such as the initialization parameters, so those members are now protected.

Reviewed By: stephenyan1231, fmassa

Differential Revision: D19895076

fbshipit-source-id: 691336c8e18526b085ae5792ac3546bc387a6db9

* Added missing header for fewer dependencies. (#1898)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1898

The streams/samplers includes shouldn't depend on the decoder headers. Add the dependencies directly where they are required.

Reviewed By: stephenyan1231

Differential Revision: D19911404

fbshipit-source-id: ef322a053708405c02cee4562b456b1602fb12fc

* Implemented VUE Asynchronous Decoder

Summary: For Mothership we have found that an asynchronous decoder provides better performance.

Differential Revision: D20026194

fbshipit-source-id: 627b91844b4e3f917002031dd32cb19c239f4ba8

* Fix a bug in the API read_video_from_memory (#1942)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1942

D18720474 introduced a bug in the `read_video_from_memory` API. Thanks to weiyaowang for reporting it.

Reviewed By: weiyaowang

Differential Revision: D20270179

fbshipit-source-id: 66348c99a5ad1f9129b90e934524ddfaad59de03

* Extend decoder to support new video_max_dimension argument (#1924)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1924

Extend the `video_reader` decoder Python API in TorchVision to support a new argument, `video_max_dimension`. This enables new video decoding use cases. When setting `video_width=0`, `video_height=0`, `video_min_dimension != 0`, and `video_max_dimension != 0`, we rescale the video clips so that their spatial resolution (height, width) becomes
 - (video_min_dimension, video_max_dimension) if original height < original width
 - (video_max_dimension, video_min_dimension) if original height >= original width

This is useful at the video model testing stage, where we perform fully convolutional evaluation and take entire video frames, without cropping, as input. Previously we could only set, for instance, `video_width=0`, `video_height=0`, `video_min_dimension = 128`, which preserves the aspect ratio. In production datasets there are a small number of videos whose aspect ratio is either extremely large or small, and when the shorter edge is rescaled to 128 the longer edge is still large. This easily causes GPU OOM when we sample multiple video clips and put them in a single minibatch.

Now we can set (for instance) `video_width=0`, `video_height=0`, `video_min_dimension = 128`, and `video_max_dimension = 171` so that the rescaled resolution is either (128, 171) or (171, 128), depending on whether the original height is smaller than the original width. Thus we are less likely to hit GPU OOM, because the spatial size of the video clips is fixed.
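
As a sketch (a hypothetical helper, not part of this diff), the rule above amounts to the following; it mirrors case #8 of `Util::setFormatDimensions` in this diff:

```cpp
#include <cstddef>
#include <utility>

// Returns the rescaled (height, width) for the min/max-dimension case
// (video_width == 0, video_height == 0, both min and max non-zero):
// the output size is fixed regardless of the source aspect ratio.
std::pair<size_t, size_t> rescaledSize(
    size_t srcW, size_t srcH, size_t minDimension, size_t maxDimension) {
  if (srcH < srcW) {
    return {minDimension, maxDimension}; // landscape: short edge is height
  }
  return {maxDimension, minDimension}; // portrait or square
}
```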

Reviewed By: putivsky

Differential Revision: D20182529

fbshipit-source-id: f9c40afb7590e7c45e6908946597141efa35f57c

* Fixing samplers initialization (#1967)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1967



No-op for the torchvision diff; fixes sampler initialization.

Differential Revision: D20397218

fbshipit-source-id: 6dc4d04364f305fbda7ca4f67a25ceecd73d0f20

* Exclude C++ test files
Co-authored-by: Yuri Putivsky <yuri@fb.com>
Co-authored-by: Zhicheng Yan <zyan3@fb.com>
parent 8b9859d3
#pragma once
#include "stream.h"
#include "subtitle_sampler.h"
namespace ffmpeg {
/**
* Class uses FFMPEG library to decode one subtitle stream.
*/
struct AVSubtitleKeeper : AVSubtitle {
int64_t release{0};
};
class SubtitleStream : public Stream {
public:
SubtitleStream(
AVFormatContext* inputCtx,
int index,
bool convertPtsToWallTime,
const SubtitleFormat& format);
~SubtitleStream() override;
protected:
void setFramePts(DecoderHeader* header, bool flush) override;
private:
int initFormat() override;
int analyzePacket(const AVPacket* packet, bool* gotFrame) override;
int copyFrameBytes(ByteStorage* out, bool flush) override;
void releaseSubtitle();
private:
SubtitleSampler sampler_;
AVSubtitleKeeper sub_;
};
} // namespace ffmpeg
#include "sync_decoder.h"
#include <c10/util/Logging.h>
namespace ffmpeg {
SyncDecoder::AVByteStorage::AVByteStorage(size_t n) {
ensure(n);
}
SyncDecoder::AVByteStorage::~AVByteStorage() {
av_free(buffer_);
}
void SyncDecoder::AVByteStorage::ensure(size_t n) {
if (tail() < n) {
capacity_ = offset_ + length_ + n;
buffer_ = static_cast<uint8_t*>(av_realloc(buffer_, capacity_));
}
}
uint8_t* SyncDecoder::AVByteStorage::writableTail() {
CHECK_LE(offset_ + length_, capacity_);
return buffer_ + offset_ + length_;
}
void SyncDecoder::AVByteStorage::append(size_t n) {
CHECK_LE(n, tail());
length_ += n;
}
void SyncDecoder::AVByteStorage::trim(size_t n) {
CHECK_LE(n, length_);
offset_ += n;
length_ -= n;
}
const uint8_t* SyncDecoder::AVByteStorage::data() const {
return buffer_ + offset_;
}
size_t SyncDecoder::AVByteStorage::length() const {
return length_;
}
size_t SyncDecoder::AVByteStorage::tail() const {
CHECK_LE(offset_ + length_, capacity_);
return capacity_ - offset_ - length_;
}
void SyncDecoder::AVByteStorage::clear() {
offset_ = 0;
length_ = 0;
}
std::unique_ptr<ByteStorage> SyncDecoder::createByteStorage(size_t n) {
return std::make_unique<AVByteStorage>(n);
}
void SyncDecoder::onInit() {
eof_ = false;
queue_.clear();
}
int SyncDecoder::decode(DecoderOutputMessage* out, uint64_t timeoutMs) {
if (eof_ && queue_.empty()) {
return ENODATA;
}
if (queue_.empty()) {
int result = getFrame(timeoutMs);
// assign EOF
eof_ = result == ENODATA;
// check unrecoverable error, any error but ENODATA
if (result && result != ENODATA) {
return result;
}
// still empty
if (queue_.empty()) {
if (eof_) {
return ENODATA;
} else {
LOG(INFO) << "Queue is empty";
return ETIMEDOUT;
}
}
}
*out = std::move(queue_.front());
queue_.pop_front();
return 0;
}
void SyncDecoder::push(DecoderOutputMessage&& buffer) {
queue_.push_back(std::move(buffer));
}
} // namespace ffmpeg
#pragma once
#include <list>
#include "decoder.h"
namespace ffmpeg {
/**
* Class uses the FFMPEG library to decode media streams.
* Media bytes can be explicitly provided through a read callback
* or fetched internally by the FFMPEG library.
*/
class SyncDecoder : public Decoder {
public:
// Allocation of memory must be done with a proper alignment.
class AVByteStorage : public ByteStorage {
public:
explicit AVByteStorage(size_t n);
~AVByteStorage() override;
void ensure(size_t n) override;
uint8_t* writableTail() override;
void append(size_t n) override;
void trim(size_t n) override;
const uint8_t* data() const override;
size_t length() const override;
size_t tail() const override;
void clear() override;
private:
size_t offset_{0};
size_t length_{0};
size_t capacity_{0};
uint8_t* buffer_{nullptr};
};
public:
int decode(DecoderOutputMessage* out, uint64_t timeoutMs) override;
private:
void push(DecoderOutputMessage&& buffer) override;
void onInit() override;
std::unique_ptr<ByteStorage> createByteStorage(size_t n) override;
private:
std::list<DecoderOutputMessage> queue_;
bool eof_{false};
};
} // namespace ffmpeg
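// ---------------------------------------------------------------------------
// A minimal usage sketch of SyncDecoder (not part of the diff): init with a
// file uri, drain frames, shut down. It mirrors the tests below; the asset
// path and the requested formats are taken from those tests.
// ---------------------------------------------------------------------------
// #include "sync_decoder.h"
// using namespace ffmpeg;
//
// int countFrames() {
//   SyncDecoder decoder;
//   DecoderParameters params;
//   params.timeoutMs = 10000;
//   params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
//   params.uri = "pytorch/vision/test/assets/videos/R6llTwEh07w.mp4";
//   if (!decoder.init(params, nullptr, nullptr)) {
//     return -1;
//   }
//   int frames = 0;
//   DecoderOutputMessage out;
//   // decode() returns 0 per message, ENODATA at end of stream and
//   // ETIMEDOUT when no frame arrived within timeoutMs.
//   while (decoder.decode(&out, params.timeoutMs) == 0) {
//     ++frames;
//   }
//   decoder.shutdown();
//   return frames;
// }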
#include <c10/util/Logging.h>
#include <dirent.h>
#include <gtest/gtest.h>
#include "memory_buffer.h"
#include "sync_decoder.h"
#include "util.h"
using namespace ffmpeg;
namespace {
struct VideoFileStats {
std::string name;
size_t durationPts{0};
int num{0};
int den{0};
int fps{0};
};
void gotAllTestFiles(
const std::string& folder,
std::vector<VideoFileStats>* stats) {
DIR* d = opendir(folder.c_str());
CHECK(d);
struct dirent* dir;
while ((dir = readdir(d))) {
if (dir->d_type != DT_DIR && 0 != strcmp(dir->d_name, "README")) {
VideoFileStats item;
item.name = folder + '/' + dir->d_name;
LOG(INFO) << "Found video file: " << item.name;
stats->push_back(std::move(item));
}
}
closedir(d);
}
void gotFilesStats(std::vector<VideoFileStats>& stats) {
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.seekAccuracy = 100000;
params.formats = {MediaFormat(0)};
params.headerOnly = true;
params.preventStaleness = false;
size_t avgProvUs = 0;
const size_t rounds = 100;
for (auto& item : stats) {
LOG(INFO) << "Decoding video file in memory: " << item.name;
FILE* f = fopen(item.name.c_str(), "rb");
CHECK(f != nullptr);
fseek(f, 0, SEEK_END);
std::vector<uint8_t> buffer(ftell(f));
rewind(f);
CHECK_EQ(buffer.size(), fread(buffer.data(), 1, buffer.size(), f));
fclose(f);
for (size_t i = 0; i < rounds; ++i) {
SyncDecoder decoder;
std::vector<DecoderMetadata> metadata;
const auto now = std::chrono::steady_clock::now();
CHECK(decoder.init(
params,
MemoryBuffer::getCallback(buffer.data(), buffer.size()),
&metadata));
const auto then = std::chrono::steady_clock::now();
decoder.shutdown();
avgProvUs +=
std::chrono::duration_cast<std::chrono::microseconds>(then - now)
.count();
CHECK_EQ(metadata.size(), 1);
item.num = metadata[0].num;
item.den = metadata[0].den;
item.fps = metadata[0].fps;
item.durationPts =
av_rescale_q(metadata[0].duration, AV_TIME_BASE_Q, {1, item.fps});
}
}
LOG(INFO) << "Probing (us) " << avgProvUs / stats.size() / rounds;
}
size_t measurePerformanceUs(
const std::vector<VideoFileStats>& stats,
size_t rounds,
size_t num,
size_t stride) {
size_t avgClipDecodingUs = 0;
std::srand(time(nullptr));
for (const auto& item : stats) {
FILE* f = fopen(item.name.c_str(), "rb");
CHECK(f != nullptr);
fseek(f, 0, SEEK_END);
std::vector<uint8_t> buffer(ftell(f));
rewind(f);
CHECK_EQ(buffer.size(), fread(buffer.data(), 1, buffer.size(), f));
fclose(f);
for (size_t i = 0; i < rounds; ++i) {
// randomly select a clip
size_t rOffset = std::rand();
size_t fOffset = rOffset % item.durationPts;
size_t clipFrames = num + (num - 1) * stride;
if (fOffset + clipFrames > item.durationPts) {
fOffset = item.durationPts - clipFrames;
}
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.seekAccuracy = 100000;
params.preventStaleness = false;
for (size_t n = 0; n < num; ++n) {
std::list<DecoderOutputMessage> msgs;
params.startOffset =
av_rescale_q(fOffset, {1, item.fps}, AV_TIME_BASE_Q);
params.endOffset = params.startOffset + 100;
auto now = std::chrono::steady_clock::now();
SyncDecoder decoder;
CHECK(decoder.init(
params,
MemoryBuffer::getCallback(buffer.data(), buffer.size()),
nullptr));
DecoderOutputMessage out;
while (0 == decoder.decode(&out, params.timeoutMs)) {
msgs.push_back(std::move(out));
}
decoder.shutdown();
const auto then = std::chrono::steady_clock::now();
fOffset += 1 + stride;
avgClipDecodingUs +=
std::chrono::duration_cast<std::chrono::microseconds>(then - now)
.count();
}
}
}
return avgClipDecodingUs / rounds / num / stats.size();
}
void runDecoder(SyncDecoder& decoder) {
DecoderOutputMessage out;
size_t audioFrames = 0, videoFrames = 0, totalBytes = 0;
while (0 == decoder.decode(&out, 10000)) {
if (out.header.format.type == TYPE_AUDIO) {
++audioFrames;
} else if (out.header.format.type == TYPE_VIDEO) {
++videoFrames;
} else if (out.header.format.type == TYPE_SUBTITLE && out.payload) {
// deserialize
LOG(INFO) << "Deserializing subtitle";
AVSubtitle sub;
memset(&sub, 0, sizeof(sub));
EXPECT_TRUE(Util::deserialize(*out.payload, &sub));
LOG(INFO) << "Found subtitles"
<< ", num rects: " << sub.num_rects;
for (int i = 0; i < sub.num_rects; ++i) {
std::string text = "picture";
if (sub.rects[i]->type == SUBTITLE_TEXT) {
text = sub.rects[i]->text;
} else if (sub.rects[i]->type == SUBTITLE_ASS) {
text = sub.rects[i]->ass;
}
LOG(INFO) << "Rect num: " << i << ", type:" << sub.rects[i]->type
<< ", text: " << text;
}
avsubtitle_free(&sub);
}
if (out.payload) {
totalBytes += out.payload->length();
}
}
LOG(INFO) << "Decoded audio frames: " << audioFrames
<< ", video frames: " << videoFrames
<< ", total bytes: " << totalBytes;
}
} // namespace
TEST(SyncDecoder, TestSyncDecoderPerformance) {
// Measure the average decoding time per clip
// 1. list the videos in the test directory
// 2. for each video get the number of frames with timestamps
// 3. randomly select a frame offset
// 4. adjust the offset for the number of frames and the stride,
//    if it falls beyond the upper boundary
// 5. repeat multiple times, measuring and accumulating the decoding time
//    per clip.
/*
1) 4 x 2
2) 8 x 8
3) 16 x 8
4) 32 x 4
*/
const std::string kFolder = "pytorch/vision/test/assets/videos";
std::vector<VideoFileStats> stats;
gotAllTestFiles(kFolder, &stats);
gotFilesStats(stats);
const size_t kRounds = 10;
auto new4x2 = measurePerformanceUs(stats, kRounds, 4, 2);
auto new8x8 = measurePerformanceUs(stats, kRounds, 8, 8);
auto new16x8 = measurePerformanceUs(stats, kRounds, 16, 8);
auto new32x4 = measurePerformanceUs(stats, kRounds, 32, 4);
LOG(INFO) << "Clip decoding (us)"
<< ", new(4x2): " << new4x2 << ", new(8x8): " << new8x8
<< ", new(16x8): " << new16x8 << ", new(32x4): " << new32x4;
}
TEST(SyncDecoder, Test) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.seekAccuracy = 100000;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
params.uri = "pytorch/vision/test/assets/videos/R6llTwEh07w.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
}
TEST(SyncDecoder, TestSubtitles) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
params.uri = "vue/synergy/data/robotsub.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
}
TEST(SyncDecoder, TestHeadersOnly) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.seekAccuracy = 100000;
params.headerOnly = true;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
params.uri = "pytorch/vision/test/assets/videos/R6llTwEh07w.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
params.uri = "pytorch/vision/test/assets/videos/SOX5yA1l24A.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
params.uri = "pytorch/vision/test/assets/videos/WUzgd7C1pWA.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
}
TEST(SyncDecoder, TestHeadersOnlyDownSampling) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.seekAccuracy = 100000;
params.headerOnly = true;
MediaFormat format;
format.type = TYPE_AUDIO;
format.format.audio.samples = 8000;
params.formats.insert(format);
format.type = TYPE_VIDEO;
format.format.video.width = 224;
format.format.video.height = 224;
params.formats.insert(format);
params.uri = "pytorch/vision/test/assets/videos/R6llTwEh07w.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
params.uri = "pytorch/vision/test/assets/videos/SOX5yA1l24A.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
params.uri = "pytorch/vision/test/assets/videos/WUzgd7C1pWA.mp4";
CHECK(decoder.init(params, nullptr, nullptr));
runDecoder(decoder);
decoder.shutdown();
}
TEST(SyncDecoder, TestInitOnlyNoShutdown) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.seekAccuracy = 100000;
params.headerOnly = false;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
params.uri = "pytorch/vision/test/assets/videos/R6llTwEh07w.mp4";
std::vector<DecoderMetadata> metadata;
CHECK(decoder.init(params, nullptr, &metadata));
}
TEST(SyncDecoder, TestMemoryBuffer) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.endOffset = 9000000;
params.seekAccuracy = 10000;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
FILE* f = fopen(
"pytorch/vision/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi",
"rb");
CHECK(f != nullptr);
fseek(f, 0, SEEK_END);
std::vector<uint8_t> buffer(ftell(f));
rewind(f);
CHECK_EQ(buffer.size(), fread(buffer.data(), 1, buffer.size(), f));
fclose(f);
CHECK(decoder.init(
params,
MemoryBuffer::getCallback(buffer.data(), buffer.size()),
nullptr));
LOG(INFO) << "Decoding from memory bytes: " << buffer.size();
runDecoder(decoder);
decoder.shutdown();
}
TEST(SyncDecoder, TestMemoryBufferNoSeekableWithFullRead) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.endOffset = 9000000;
params.seekAccuracy = 10000;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
FILE* f = fopen("pytorch/vision/test/assets/videos/R6llTwEh07w.mp4", "rb");
CHECK(f != nullptr);
fseek(f, 0, SEEK_END);
std::vector<uint8_t> buffer(ftell(f));
rewind(f);
CHECK_EQ(buffer.size(), fread(buffer.data(), 1, buffer.size(), f));
fclose(f);
params.maxSeekableBytes = buffer.size() + 1;
MemoryBuffer object(buffer.data(), buffer.size());
CHECK(decoder.init(
params,
[object](uint8_t* out, int size, int whence, uint64_t timeoutMs) mutable
-> int {
if (out) { // see defs.h file
// read mode
return object.read(out, size);
}
// seek mode
if (!timeoutMs) {
// seek capability probe: -1 means not seekable
return -1;
}
return object.seek(size, whence);
},
nullptr));
runDecoder(decoder);
decoder.shutdown();
}
TEST(SyncDecoder, TestMemoryBufferNoSeekableWithPartialRead) {
SyncDecoder decoder;
DecoderParameters params;
params.timeoutMs = 10000;
params.startOffset = 1000000;
params.endOffset = 9000000;
params.seekAccuracy = 10000;
params.formats = {MediaFormat(), MediaFormat(0), MediaFormat('0')};
FILE* f = fopen("pytorch/vision/test/assets/videos/R6llTwEh07w.mp4", "rb");
CHECK(f != nullptr);
fseek(f, 0, SEEK_END);
std::vector<uint8_t> buffer(ftell(f));
rewind(f);
CHECK_EQ(buffer.size(), fread(buffer.data(), 1, buffer.size(), f));
fclose(f);
params.maxSeekableBytes = buffer.size() / 2;
MemoryBuffer object(buffer.data(), buffer.size());
CHECK(!decoder.init(
params,
[object](uint8_t* out, int size, int whence, uint64_t timeoutMs) mutable
-> int {
if (out) { // see defs.h file
// read mode
return object.read(out, size);
}
// seek mode
if (!timeoutMs) {
// seek capability probe: -1 means not seekable
return -1;
}
return object.seek(size, whence);
},
nullptr));
}
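// ---------------------------------------------------------------------------
// The callback contract exercised by the two tests above (as inferred from
// their lambdas and defs.h): a single callable serves both reads and seeks.
//   - out != nullptr                 -> read mode: copy up to `size` bytes
//                                       into `out`, return the bytes read.
//   - out == nullptr, timeoutMs == 0 -> capability probe: a negative return
//                                       marks the source as not seekable.
//   - out == nullptr, timeoutMs != 0 -> seek mode: seek(`size`, `whence`)
//                                       and return the new position.
// With a non-seekable source, init succeeds when maxSeekableBytes covers the
// whole buffer (first test) and fails when it covers only half (second test).
// ---------------------------------------------------------------------------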
#include "time_keeper.h"
#include "defs.h"
namespace ffmpeg {
namespace {
const long kMaxTimeBaseDifference = 10;
}
long TimeKeeper::adjust(long& decoderTimestamp) {
const long now = std::chrono::duration_cast<std::chrono::microseconds>(
std::chrono::system_clock::now().time_since_epoch())
.count();
if (startTime_ == 0) {
startTime_ = now;
}
if (streamTimestamp_ == 0) {
streamTimestamp_ = decoderTimestamp;
}
const auto runOut = startTime_ + decoderTimestamp - streamTimestamp_;
if (std::labs((now - runOut) / AV_TIME_BASE) > kMaxTimeBaseDifference) {
streamTimestamp_ = startTime_ - now + decoderTimestamp;
}
const auto sleepAdvised = runOut - now;
decoderTimestamp += startTime_ - streamTimestamp_;
return sleepAdvised > 0 ? sleepAdvised : 0;
}
} // namespace ffmpeg
#pragma once
#include <stdlib.h>
#include <chrono>
namespace ffmpeg {
/**
* Class keeps track of the decoded timestamps (us) for media streams.
*/
class TimeKeeper {
public:
TimeKeeper() = default;
// adjusts the provided decoderTimestamp to the corrected value and
// returns the advised sleep time (us) before processing the next frame
long adjust(long& decoderTimestamp);
private:
long startTime_{0};
long streamTimestamp_{0};
};
} // namespace ffmpeg
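// ---------------------------------------------------------------------------
// A brief sketch (not part of the diff) of how a caller is expected to use
// TimeKeeper::adjust: rebase the decoder timestamp onto the wall clock and
// sleep by the advised amount to pace frame delivery in real time.
// ---------------------------------------------------------------------------
#include <chrono>
#include <thread>
#include "time_keeper.h"

void paceFrame(ffmpeg::TimeKeeper& keeper, long& framePtsUs) {
  // adjust() rewrites framePtsUs in place and returns the advised sleep (us).
  const long sleepUs = keeper.adjust(framePtsUs);
  if (sleepUs > 0) {
    std::this_thread::sleep_for(std::chrono::microseconds(sleepUs));
  }
}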
#include "util.h"
#include <c10/util/Logging.h>
namespace ffmpeg {
namespace Serializer {
// fixed size types
template <typename T>
inline size_t getSize(const T& x) {
return sizeof(x);
}
template <typename T>
inline bool serializeItem(
uint8_t* dest,
size_t len,
size_t& pos,
const T& src) {
VLOG(6) << "Generic serializeItem";
const auto required = sizeof(src);
if (len < pos + required) {
return false;
}
memcpy(dest + pos, &src, required);
pos += required;
return true;
}
template <typename T>
inline bool deserializeItem(
const uint8_t* src,
size_t len,
size_t& pos,
T& dest) {
const auto required = sizeof(dest);
if (len < pos + required) {
return false;
}
memcpy(&dest, src + pos, required);
pos += required;
return true;
}
// AVSubtitleRect specialization
inline size_t getSize(const AVSubtitleRect& x) {
auto rectBytes = [](const AVSubtitleRect& y) -> size_t {
size_t s = 0;
switch (y.type) {
case SUBTITLE_BITMAP:
for (int i = 0; i < y.nb_colors; ++i) {
s += sizeof(y.pict.linesize[i]);
s += y.pict.linesize[i];
}
break;
case SUBTITLE_TEXT:
s += sizeof(size_t);
s += strlen(y.text);
break;
case SUBTITLE_ASS:
s += sizeof(size_t);
s += strlen(y.ass);
break;
default:
break;
}
return s;
};
return getSize(x.x) + getSize(x.y) + getSize(x.w) + getSize(x.h) +
getSize(x.nb_colors) + getSize(x.type) + getSize(x.flags) + rectBytes(x);
}
// AVSubtitle specialization
inline size_t getSize(const AVSubtitle& x) {
auto rectBytes = [](const AVSubtitle& y) -> size_t {
size_t s = getSize(y.num_rects);
for (unsigned i = 0; i < y.num_rects; ++i) {
s += getSize(*y.rects[i]);
}
return s;
};
return getSize(x.format) + getSize(x.start_display_time) +
getSize(x.end_display_time) + getSize(x.pts) + rectBytes(x);
}
inline bool serializeItem(
uint8_t* dest,
size_t len,
size_t& pos,
const AVSubtitleRect& src) {
auto rectSerialize =
[](uint8_t* d, size_t l, size_t& p, const AVSubtitleRect& x) -> size_t {
switch (x.type) {
case SUBTITLE_BITMAP:
for (int i = 0; i < x.nb_colors; ++i) {
if (!serializeItem(d, l, p, x.pict.linesize[i])) {
return false;
}
if (p + x.pict.linesize[i] > l) {
return false;
}
memcpy(d + p, x.pict.data[i], x.pict.linesize[i]);
p += x.pict.linesize[i];
}
return true;
case SUBTITLE_TEXT: {
const size_t s = strlen(x.text);
if (!serializeItem(d, l, p, s)) {
return false;
}
if (p + s > l) {
return false;
}
memcpy(d + p, x.text, s);
p += s;
return true;
}
case SUBTITLE_ASS: {
const size_t s = strlen(x.ass);
if (!serializeItem(d, l, p, s)) {
return false;
}
if (p + s > l) {
return false;
}
memcpy(d + p, x.ass, s);
p += s;
return true;
}
default:
return true;
}
};
return serializeItem(dest, len, pos, src.x) &&
serializeItem(dest, len, pos, src.y) &&
serializeItem(dest, len, pos, src.w) &&
serializeItem(dest, len, pos, src.h) &&
serializeItem(dest, len, pos, src.nb_colors) &&
serializeItem(dest, len, pos, src.type) &&
serializeItem(dest, len, pos, src.flags) &&
rectSerialize(dest, len, pos, src);
}
inline bool serializeItem(
uint8_t* dest,
size_t len,
size_t& pos,
const AVSubtitle& src) {
auto rectSerialize =
[](uint8_t* d, size_t l, size_t& p, const AVSubtitle& x) -> bool {
bool res = serializeItem(d, l, p, x.num_rects);
for (unsigned i = 0; res && i < x.num_rects; ++i) {
res = serializeItem(d, l, p, *(x.rects[i]));
}
return res;
};
VLOG(6) << "AVSubtitle serializeItem";
return serializeItem(dest, len, pos, src.format) &&
serializeItem(dest, len, pos, src.start_display_time) &&
serializeItem(dest, len, pos, src.end_display_time) &&
serializeItem(dest, len, pos, src.pts) &&
rectSerialize(dest, len, pos, src);
}
inline bool deserializeItem(
const uint8_t* src,
size_t len,
size_t& pos,
AVSubtitleRect& dest) {
auto rectDeserialize =
[](const uint8_t* y, size_t l, size_t& p, AVSubtitleRect& x) -> bool {
switch (x.type) {
case SUBTITLE_BITMAP:
for (int i = 0; i < x.nb_colors; ++i) {
if (!deserializeItem(y, l, p, x.pict.linesize[i])) {
return false;
}
if (p + x.pict.linesize[i] > l) {
return false;
}
x.pict.data[i] = (uint8_t*)av_malloc(x.pict.linesize[i]);
memcpy(x.pict.data[i], y + p, x.pict.linesize[i]);
p += x.pict.linesize[i];
}
return true;
case SUBTITLE_TEXT: {
size_t s = 0;
if (!deserializeItem(y, l, p, s)) {
return false;
}
if (p + s > l) {
return false;
}
x.text = (char*)av_malloc(s + 1);
memcpy(x.text, y + p, s);
x.text[s] = 0;
p += s;
return true;
}
case SUBTITLE_ASS: {
size_t s = 0;
if (!deserializeItem(y, l, p, s)) {
return false;
}
if (p + s > l) {
return false;
}
x.ass = (char*)av_malloc(s + 1);
memcpy(x.ass, y + p, s);
x.ass[s] = 0;
p += s;
return true;
}
default:
return true;
}
};
return deserializeItem(src, len, pos, dest.x) &&
deserializeItem(src, len, pos, dest.y) &&
deserializeItem(src, len, pos, dest.w) &&
deserializeItem(src, len, pos, dest.h) &&
deserializeItem(src, len, pos, dest.nb_colors) &&
deserializeItem(src, len, pos, dest.type) &&
deserializeItem(src, len, pos, dest.flags) &&
rectDeserialize(src, len, pos, dest);
}
inline bool deserializeItem(
const uint8_t* src,
size_t len,
size_t& pos,
AVSubtitle& dest) {
auto rectDeserialize =
[](const uint8_t* y, size_t l, size_t& p, AVSubtitle& x) -> bool {
bool res = deserializeItem(y, l, p, x.num_rects);
if (res && x.num_rects) {
x.rects =
(AVSubtitleRect**)av_malloc(x.num_rects * sizeof(AVSubtitleRect*));
}
for (unsigned i = 0; res && i < x.num_rects; ++i) {
x.rects[i] = (AVSubtitleRect*)av_malloc(sizeof(AVSubtitleRect));
memset(x.rects[i], 0, sizeof(AVSubtitleRect));
res = deserializeItem(y, l, p, *x.rects[i]);
}
return res;
};
return deserializeItem(src, len, pos, dest.format) &&
deserializeItem(src, len, pos, dest.start_display_time) &&
deserializeItem(src, len, pos, dest.end_display_time) &&
deserializeItem(src, len, pos, dest.pts) &&
rectDeserialize(src, len, pos, dest);
}
} // namespace Serializer
namespace Util {
std::string generateErrorDesc(int errorCode) {
std::array<char, 1024> buffer;
if (av_strerror(errorCode, buffer.data(), buffer.size()) < 0) {
return std::string("Unknown error code: ") + std::to_string(errorCode);
}
buffer.back() = 0;
return std::string(buffer.data());
}
size_t serialize(const AVSubtitle& sub, ByteStorage* out) {
const auto len = size(sub);
CHECK_LE(len, out->tail());
size_t pos = 0;
if (!Serializer::serializeItem(out->writableTail(), len, pos, sub)) {
return 0;
}
out->append(len);
return len;
}
bool deserialize(const ByteStorage& buf, AVSubtitle* sub) {
size_t pos = 0;
return Serializer::deserializeItem(buf.data(), buf.length(), pos, *sub);
}
size_t size(const AVSubtitle& sub) {
return Serializer::getSize(sub);
}
bool validateVideoFormat(const VideoFormat& f) {
/*
Valid parameters values for decoder
____________________________________________________________________________________
| W | H | minDimension | maxDimension | cropImage | algorithm |
|__________________________________________________________________________________|
| 0 | 0 | 0 | 0 | N/A | original |
|__________________________________________________________________________________|
| >0 | 0 | N/A | N/A | N/A | scale keeping W |
|__________________________________________________________________________________|
| 0 | >0 | N/A | N/A | N/A | scale keeping H |
|__________________________________________________________________________________|
| >0 | >0 | N/A | N/A | 0 | stretch/scale |
|__________________________________________________________________________________|
| >0 | >0 | N/A | N/A | >0 | scale/crop |
|__________________________________________________________________________________|
| 0 | 0 | >0 | 0 | N/A |scale to min dimension |
|__________________________________________________________________________________|
| 0 | 0 | 0 | >0 | N/A |scale to max dimension |
|__________________________________________________________________________________|
| 0 | 0 | >0 | >0 | N/A |stretch to min/max dimension|
|_____|_____|______________|______________|___________|____________________________|
*/
return (f.width == 0 && // #1, #6, #7 and #8
f.height == 0 && f.cropImage == 0) ||
(f.width != 0 && // #4 and #5
f.height != 0 && f.minDimension == 0 && f.maxDimension == 0) ||
(((f.width != 0 && // #2
f.height == 0) ||
(f.width == 0 && // #3
f.height != 0)) &&
f.minDimension == 0 && f.maxDimension == 0 && f.cropImage == 0);
}
void setFormatDimensions(
size_t& destW,
size_t& destH,
size_t userW,
size_t userH,
size_t srcW,
size_t srcH,
size_t minDimension,
size_t maxDimension,
size_t cropImage) {
// Rounding rule: round half up when converting double to int,
// i.e. int result = int(double(value) + 0.5): the fraction is rounded
// up if it is >= 0.5 and down otherwise.
// #1, #6, #7 and #8
if (userW == 0 && userH == 0) {
if (minDimension > 0 && maxDimension == 0) { // #6
if (srcW > srcH) {
// landscape
destH = minDimension;
destW = round(double(srcW * minDimension) / srcH);
} else {
// portrait
destW = minDimension;
destH = round(double(srcH * minDimension) / srcW);
}
}
else if (minDimension == 0 && maxDimension > 0) { // #7
if (srcW > srcH) {
// landscape
destW = maxDimension;
destH = round(double(srcH * maxDimension) / srcW);
} else {
// portrait
destH = maxDimension;
destW = round(double(srcW * maxDimension) / srcH);
}
}
else if (minDimension > 0 && maxDimension > 0) { // #8
if (srcW > srcH) {
// landscape
destW = maxDimension;
destH = minDimension;
} else {
// portrait
destW = minDimension;
destH = maxDimension;
}
}
else { // #1
destW = srcW;
destH = srcH;
}
} else if (userW != 0 && userH == 0) { // #2
destW = userW;
destH = round(double(srcH * userW) / srcW);
} else if (userW == 0 && userH != 0) { // #3
destW = round(double(srcW * userH) / srcH);
destH = userH;
} else { // userW != 0 && userH != 0
if (cropImage == 0) { // #4
destW = userW;
destH = userH;
} else { // #5
double userSlope = double(userH) / userW;
double srcSlope = double(srcH) / srcW;
if (srcSlope < userSlope) {
destW = round(double(srcW * userH) / srcH);
destH = userH;
} else {
destW = userW;
destH = round(double(srcH * userW) / srcW);
}
}
}
// prevent zeros
destW = std::max(destW, 1UL);
destH = std::max(destH, 1UL);
}
} // namespace Util
} // namespace ffmpeg
#pragma once
#include "defs.h"
namespace ffmpeg {
/**
* FFMPEG library utility functions.
*/
namespace Util {
std::string generateErrorDesc(int errorCode);
size_t serialize(const AVSubtitle& sub, ByteStorage* out);
bool deserialize(const ByteStorage& buf, AVSubtitle* sub);
size_t size(const AVSubtitle& sub);
void setFormatDimensions(
size_t& destW,
size_t& destH,
size_t userW,
size_t userH,
size_t srcW,
size_t srcH,
size_t minDimension,
size_t maxDimension,
size_t cropImage);
bool validateVideoFormat(const VideoFormat& format);
} // namespace Util
} // namespace ffmpeg
#include <c10/util/Logging.h>
#include <dirent.h>
#include <gtest/gtest.h>
#include "util.h"
TEST(Util, TestSetFormatDimensions) {
const size_t test_cases[][9] = {
// (userW, userH, srcW, srcH, minDimension, maxDimension, cropImage, destW, destH)
{0, 0, 172, 128, 0, 0, 0, 172, 128}, // #1
{86, 0, 172, 128, 0, 0, 0, 86, 64}, // #2
{64, 0, 128, 172, 0, 0, 0, 64, 86}, // #2
{0, 32, 172, 128, 0, 0, 0, 43, 32}, // #3
{32, 0, 128, 172, 0, 0, 0, 32, 43}, // #3
{60, 50, 172, 128, 0, 0, 0, 60, 50}, // #4
{50, 60, 128, 172, 0, 0, 0, 50, 60}, // #4
{86, 40, 172, 128, 0, 0, 1, 86, 64}, // #5
{86, 92, 172, 128, 0, 0, 1, 124, 92}, // #5
{0, 0, 172, 128, 256, 0, 0, 344, 256}, // #6
{0, 0, 128, 172, 256, 0, 0, 256, 344}, // #6
{0, 0, 128, 172, 0, 344, 0, 256, 344}, // #7
{0, 0, 172, 128, 0, 344, 0, 344, 256}, // #7
{0, 0, 172, 128, 100, 344, 0, 344, 100},// #8
{0, 0, 128, 172, 100, 344, 0, 100, 344} // #8
};
for (const auto& tc : test_cases) {
size_t destW = 0;
size_t destH = 0;
ffmpeg::Util::setFormatDimensions(destW, destH, tc[0], tc[1], tc[2], tc[3], tc[4], tc[5], tc[6]);
CHECK(destW == tc[7]);
CHECK(destH == tc[8]);
}
}
#include "video_sampler.h"
#include <c10/util/Logging.h>
#include "util.h"
// www.ffmpeg.org/doxygen/0.5/swscale-example_8c-source.html
namespace ffmpeg {
namespace {
int preparePlanes(
const VideoFormat& fmt,
const uint8_t* buffer,
uint8_t** planes,
int* lineSize) {
int result;
if ((result = av_image_fill_arrays(
planes,
lineSize,
buffer,
(AVPixelFormat)fmt.format,
fmt.width,
fmt.height,
1)) < 0) {
LOG(ERROR) << "av_image_fill_arrays failed, err: "
<< Util::generateErrorDesc(result);
}
return result;
}
int transformImage(
SwsContext* context,
const uint8_t* const srcSlice[],
int srcStride[],
VideoFormat inFormat,
VideoFormat outFormat,
uint8_t* out,
uint8_t* planes[],
int lines[]) {
int result;
if ((result = preparePlanes(outFormat, out, planes, lines)) < 0) {
return result;
}
if ((result = sws_scale(
context, srcSlice, srcStride, 0, inFormat.height, planes, lines)) <
0) {
LOG(ERROR) << "sws_scale failed, err: " << Util::generateErrorDesc(result);
return result;
}
return 0;
}
} // namespace
VideoSampler::VideoSampler(int swsFlags, int64_t loggingUuid)
: swsFlags_(swsFlags), loggingUuid_(loggingUuid) {}
VideoSampler::~VideoSampler() {
cleanUp();
}
void VideoSampler::shutdown() {
cleanUp();
}
bool VideoSampler::init(const SamplerParameters& params) {
cleanUp();
if (params.out.video.cropImage != 0) {
if (!Util::validateVideoFormat(params.out.video)) {
LOG(ERROR) << "Invalid video format"
<< ", width: " << params.out.video.width
<< ", height: " << params.out.video.height
<< ", format: " << params.out.video.format
<< ", minDimension: " << params.out.video.minDimension
<< ", crop: " << params.out.video.cropImage;
return false;
}
scaleFormat_.format = params.out.video.format;
Util::setFormatDimensions(
scaleFormat_.width,
scaleFormat_.height,
params.out.video.width,
params.out.video.height,
params.in.video.width,
params.in.video.height,
0,
0,
1);
if (!(scaleFormat_ == params_.out.video)) { // crop required
cropContext_ = sws_getContext(
params.out.video.width,
params.out.video.height,
(AVPixelFormat)params.out.video.format,
params.out.video.width,
params.out.video.height,
(AVPixelFormat)params.out.video.format,
swsFlags_,
nullptr,
nullptr,
nullptr);
if (!cropContext_) {
LOG(ERROR) << "sws_getContext failed for crop context";
return false;
}
const auto scaleImageSize = av_image_get_buffer_size(
(AVPixelFormat)scaleFormat_.format,
scaleFormat_.width,
scaleFormat_.height,
1);
scaleBuffer_.resize(scaleImageSize);
}
} else {
scaleFormat_ = params.out.video;
}
VLOG(1) << "Input format #" << loggingUuid_ << ", width "
<< params.in.video.width << ", height " << params.in.video.height
<< ", format " << params.in.video.format << ", minDimension "
<< params.in.video.minDimension << ", cropImage "
<< params.in.video.cropImage;
VLOG(1) << "Scale format #" << loggingUuid_ << ", width "
<< scaleFormat_.width << ", height " << scaleFormat_.height
<< ", format " << scaleFormat_.format << ", minDimension "
<< scaleFormat_.minDimension << ", cropImage "
<< scaleFormat_.cropImage;
VLOG(1) << "Crop format #" << loggingUuid_ << ", width "
<< params.out.video.width << ", height " << params.out.video.height
<< ", format " << params.out.video.format << ", minDimension "
<< params.out.video.minDimension << ", cropImage "
<< params.out.video.cropImage;
scaleContext_ = sws_getContext(
params.in.video.width,
params.in.video.height,
(AVPixelFormat)params.in.video.format,
scaleFormat_.width,
scaleFormat_.height,
(AVPixelFormat)scaleFormat_.format,
swsFlags_,
nullptr,
nullptr,
nullptr);
// set output format
params_ = params;
return scaleContext_ != nullptr;
}
int VideoSampler::sample(
const uint8_t* const srcSlice[],
int srcStride[],
ByteStorage* out) {
int result;
// scaled and cropped image
int outImageSize = av_image_get_buffer_size(
(AVPixelFormat)params_.out.video.format,
params_.out.video.width,
params_.out.video.height,
1);
out->ensure(outImageSize);
uint8_t* scalePlanes[4] = {nullptr};
int scaleLines[4] = {0};
// perform scale first
if ((result = transformImage(
scaleContext_,
srcSlice,
srcStride,
params_.in.video,
scaleFormat_,
// for crop use internal buffer
cropContext_ ? scaleBuffer_.data() : out->writableTail(),
scalePlanes,
scaleLines))) {
return result;
}
// is crop required?
if (cropContext_) {
uint8_t* cropPlanes[4] = {nullptr};
int cropLines[4] = {0};
if (params_.out.video.height < scaleFormat_.height) {
// Destination image is wider than the source image: cut top and bottom
for (size_t i = 0; i < 4 && scalePlanes[i] != nullptr; ++i) {
scalePlanes[i] += scaleLines[i] *
(scaleFormat_.height - params_.out.video.height) / 2;
}
} else {
// Source image is wider than the destination image: cut the sides
for (size_t i = 0; i < 4 && scalePlanes[i] != nullptr; ++i) {
scalePlanes[i] += scaleLines[i] *
(scaleFormat_.width - params_.out.video.width) / 2 /
scaleFormat_.width;
}
}
// crop image
if ((result = transformImage(
cropContext_,
scalePlanes,
scaleLines,
params_.out.video,
params_.out.video,
out->writableTail(),
cropPlanes,
cropLines))) {
return result;
}
}
out->append(outImageSize);
return outImageSize;
}
int VideoSampler::sample(AVFrame* frame, ByteStorage* out) {
if (!frame) {
return 0; // no flush for videos
}
return sample(frame->data, frame->linesize, out);
}
int VideoSampler::sample(const ByteStorage* in, ByteStorage* out) {
if (!in) {
return 0; // no flush for videos
}
int result;
uint8_t* inPlanes[4] = {nullptr};
int inLineSize[4] = {0};
if ((result = preparePlanes(
params_.in.video, in->data(), inPlanes, inLineSize)) < 0) {
return result;
}
return sample(inPlanes, inLineSize, out);
}
void VideoSampler::cleanUp() {
if (scaleContext_) {
sws_freeContext(scaleContext_);
scaleContext_ = nullptr;
}
if (cropContext_) {
sws_freeContext(cropContext_);
cropContext_ = nullptr;
scaleBuffer_.clear();
}
}
} // namespace ffmpeg
#pragma once
#include "defs.h"
namespace ffmpeg {
/**
* Class transcodes video frames from one format into another.
*/
class VideoSampler : public MediaSampler {
public:
VideoSampler(int swsFlags = SWS_AREA, int64_t loggingUuid = 0);
~VideoSampler() override;
// MediaSampler overrides
bool init(const SamplerParameters& params) override;
int sample(const ByteStorage* in, ByteStorage* out) override;
void shutdown() override;
// returns the number of processed/scaled bytes
int sample(AVFrame* frame, ByteStorage* out);
int getImageBytes() const;
private:
// close resources
void cleanUp();
// helper functions for rescaling, cropping, etc.
int sample(
const uint8_t* const srcSlice[],
int srcStride[],
ByteStorage* out);
private:
VideoFormat scaleFormat_;
SwsContext* scaleContext_{nullptr};
SwsContext* cropContext_{nullptr};
int swsFlags_{SWS_AREA};
std::vector<uint8_t> scaleBuffer_;
int64_t loggingUuid_{0};
};
} // namespace ffmpeg
#include "video_stream.h"
#include <c10/util/Logging.h>
#include "util.h"
namespace ffmpeg {
namespace {
bool operator==(const VideoFormat& x, const AVFrame& y) {
return x.width == y.width && x.height == y.height && x.format == y.format;
}
bool operator==(const VideoFormat& x, const AVCodecContext& y) {
return x.width == y.width && x.height == y.height && x.format == y.pix_fmt;
}
VideoFormat& toVideoFormat(VideoFormat& x, const AVFrame& y) {
x.width = y.width;
x.height = y.height;
x.format = y.format;
return x;
}
VideoFormat& toVideoFormat(VideoFormat& x, const AVCodecContext& y) {
x.width = y.width;
x.height = y.height;
x.format = y.pix_fmt;
return x;
}
} // namespace
VideoStream::VideoStream(
AVFormatContext* inputCtx,
int index,
bool convertPtsToWallTime,
const VideoFormat& format,
int64_t loggingUuid)
: Stream(
inputCtx,
MediaFormat::makeMediaFormat(format, index),
convertPtsToWallTime,
loggingUuid) {}
VideoStream::~VideoStream() {
if (sampler_) {
sampler_->shutdown();
sampler_.reset();
}
}
int VideoStream::initFormat() {
// set output format
if (!Util::validateVideoFormat(format_.format.video)) {
LOG(ERROR) << "Invalid video format"
<< ", width: " << format_.format.video.width
<< ", height: " << format_.format.video.height
<< ", format: " << format_.format.video.format
<< ", minDimension: " << format_.format.video.minDimension
<< ", crop: " << format_.format.video.cropImage;
return -1;
}
// keep aspect ratio
Util::setFormatDimensions(
format_.format.video.width,
format_.format.video.height,
format_.format.video.width,
format_.format.video.height,
codecCtx_->width,
codecCtx_->height,
format_.format.video.minDimension,
format_.format.video.maxDimension,
0);
if (format_.format.video.format == AV_PIX_FMT_NONE) {
format_.format.video.format = codecCtx_->pix_fmt;
}
return format_.format.video.width != 0 && format_.format.video.height != 0 &&
format_.format.video.format != AV_PIX_FMT_NONE
? 0
: -1;
}
int VideoStream::copyFrameBytes(ByteStorage* out, bool flush) {
if (!sampler_) {
sampler_ = std::make_unique<VideoSampler>(SWS_AREA, loggingUuid_);
}
// check if input format gets changed
if (flush ? !(sampler_->getInputFormat().video == *codecCtx_)
: !(sampler_->getInputFormat().video == *frame_)) {
// - reinit sampler
SamplerParameters params;
params.type = format_.type;
params.out = format_.format;
params.in = FormatUnion(0);
flush ? toVideoFormat(params.in.video, *codecCtx_)
: toVideoFormat(params.in.video, *frame_);
if (!sampler_->init(params)) {
return -1;
}
VLOG(1) << "Set input video sampler format"
<< ", width: " << params.in.video.width
<< ", height: " << params.in.video.height
<< ", format: " << params.in.video.format
<< " : output video sampler format"
<< ", width: " << format_.format.video.width
<< ", height: " << format_.format.video.height
<< ", format: " << format_.format.video.format
<< ", minDimension: " << format_.format.video.minDimension
<< ", crop: " << format_.format.video.cropImage;
}
return sampler_->sample(flush ? nullptr : frame_, out);
}
void VideoStream::setHeader(DecoderHeader* header, bool flush) {
Stream::setHeader(header, flush);
if (!flush) { // no frames for video flush
header->keyFrame = frame_->key_frame;
header->fps = av_q2d(av_guess_frame_rate(
inputCtx_, inputCtx_->streams[format_.stream], nullptr));
}
}
} // namespace ffmpeg
#pragma once
#include "stream.h"
#include "video_sampler.h"
namespace ffmpeg {
/**
* Class uses FFMPEG library to decode one video stream.
*/
class VideoStream : public Stream {
public:
VideoStream(
AVFormatContext* inputCtx,
int index,
bool convertPtsToWallTime,
const VideoFormat& format,
int64_t loggingUuid);
~VideoStream() override;
private:
int initFormat() override;
int copyFrameBytes(ByteStorage* out, bool flush) override;
void setHeader(DecoderHeader* header, bool flush) override;
private:
std::unique_ptr<VideoSampler> sampler_;
};
} // namespace ffmpeg
#include "FfmpegAudioSampler.h"
#include <memory>
#include "FfmpegUtil.h"
using namespace std;
FfmpegAudioSampler::FfmpegAudioSampler(
const AudioFormat& in,
const AudioFormat& out)
: inFormat_(in), outFormat_(out) {}
FfmpegAudioSampler::~FfmpegAudioSampler() {
if (swrContext_) {
swr_free(&swrContext_);
}
}
int FfmpegAudioSampler::init() {
swrContext_ = swr_alloc_set_opts(
nullptr, // we're allocating a new context
av_get_default_channel_layout(outFormat_.channels), // out_ch_layout
static_cast<AVSampleFormat>(outFormat_.format), // out_sample_fmt
outFormat_.samples, // out_sample_rate
av_get_default_channel_layout(inFormat_.channels), // in_ch_layout
static_cast<AVSampleFormat>(inFormat_.format), // in_sample_fmt
inFormat_.samples, // in_sample_rate
0, // log_offset
nullptr); // log_ctx
if (swrContext_ == nullptr) {
LOG(ERROR) << "swr_alloc_set_opts fails";
return -1;
}
int result = 0;
if ((result = swr_init(swrContext_)) < 0) {
LOG(ERROR) << "swr_init failed, err: " << ffmpeg_util::getErrorDesc(result)
<< ", in -> format: " << inFormat_.format
<< ", channels: " << inFormat_.channels
<< ", samples: " << inFormat_.samples
<< ", out -> format: " << outFormat_.format
<< ", channels: " << outFormat_.channels
<< ", samples: " << outFormat_.samples;
return -1;
}
return 0;
}
int64_t FfmpegAudioSampler::getSampleBytes(const AVFrame* frame) const {
auto outSamples = getOutNumSamples(frame->nb_samples);
return av_samples_get_buffer_size(
nullptr,
outFormat_.channels,
outSamples,
static_cast<AVSampleFormat>(outFormat_.format),
1);
}
// https://www.ffmpeg.org/doxygen/3.2/group__lswr.html
unique_ptr<DecodedFrame> FfmpegAudioSampler::sample(const AVFrame* frame) {
if (!frame) {
return nullptr; // no flush for audio
}
auto inNumSamples = frame->nb_samples;
auto outNumSamples = getOutNumSamples(frame->nb_samples);
auto outSampleSize = getSampleBytes(frame);
AvDataPtr frameData(static_cast<uint8_t*>(av_malloc(outSampleSize)));
uint8_t* outPlanes[AVRESAMPLE_MAX_CHANNELS];
int result = 0;
if ((result = av_samples_fill_arrays(
outPlanes,
nullptr, // linesize is not needed
frameData.get(),
outFormat_.channels,
outNumSamples,
static_cast<AVSampleFormat>(outFormat_.format),
1)) < 0) {
LOG(ERROR) << "av_samples_fill_arrays failed, err: "
<< ffmpeg_util::getErrorDesc(result)
<< ", outNumSamples: " << outNumSamples
<< ", format: " << outFormat_.format;
return nullptr;
}
if ((result = swr_convert(
swrContext_,
&outPlanes[0],
outNumSamples,
(const uint8_t**)&frame->data[0],
inNumSamples)) < 0) {
LOG(ERROR) << "swr_convert faield, err: "
<< ffmpeg_util::getErrorDesc(result);
return nullptr;
}
// result returned by swr_convert is the No. of actual output samples.
// So update the buffer size using av_samples_get_buffer_size
result = av_samples_get_buffer_size(
nullptr,
outFormat_.channels,
result,
static_cast<AVSampleFormat>(outFormat_.format),
1);
return make_unique<DecodedFrame>(std::move(frameData), result, 0);
}
/*
Because of the resampler delay, the returned value is an upper bound on the
number of output samples.
*/
int64_t FfmpegAudioSampler::getOutNumSamples(int inNumSamples) const {
return av_rescale_rnd(
swr_get_delay(swrContext_, inFormat_.samples) + inNumSamples,
outFormat_.samples,
inFormat_.samples,
AV_ROUND_UP);
}
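// Worked example (illustrative numbers, not from the diff): ignoring the
// buffered resampler delay, converting 1024 input samples from 44100 Hz to
// 16000 Hz yields av_rescale_rnd(1024, 16000, 44100, AV_ROUND_UP)
// = ceil(1024 * 16000 / 44100) = 372 output samples at most; swr_get_delay()
// adds whatever is still queued inside the resampler, which is why the
// result is only an upper bound.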
#pragma once
#include "FfmpegSampler.h"
#define AVRESAMPLE_MAX_CHANNELS 32
/**
* Class transcodes audio frames from one format into another.
*/
class FfmpegAudioSampler : public FfmpegSampler {
public:
explicit FfmpegAudioSampler(const AudioFormat& in, const AudioFormat& out);
~FfmpegAudioSampler() override;
int init() override;
int64_t getSampleBytes(const AVFrame* frame) const;
// FfmpegSampler overrides
// returns number of bytes of the sampled data
std::unique_ptr<DecodedFrame> sample(const AVFrame* frame) override;
const AudioFormat& getInFormat() const {
return inFormat_;
}
private:
int64_t getOutNumSamples(int inNumSamples) const;
AudioFormat inFormat_;
AudioFormat outFormat_;
SwrContext* swrContext_{nullptr};
};
#include "FfmpegAudioStream.h"
#include "FfmpegUtil.h"
using namespace std;
namespace {
bool operator==(const AudioFormat& x, const AVCodecContext& y) {
return x.samples == y.sample_rate && x.channels == y.channels &&
x.format == y.sample_fmt;
}
AudioFormat& toAudioFormat(
AudioFormat& audioFormat,
const AVCodecContext& codecCtx) {
audioFormat.samples = codecCtx.sample_rate;
audioFormat.channels = codecCtx.channels;
audioFormat.format = codecCtx.sample_fmt;
return audioFormat;
}
} // namespace
FfmpegAudioStream::FfmpegAudioStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
MediaFormat mediaFormat,
double seekFrameMargin)
: FfmpegStream(inputCtx, index, avMediaType, seekFrameMargin),
mediaFormat_(mediaFormat) {}
FfmpegAudioStream::~FfmpegAudioStream() {}
void FfmpegAudioStream::checkStreamDecodeParams() {
auto timeBase = getTimeBase();
if (timeBase.first > 0) {
CHECK_EQ(timeBase.first, inputCtx_->streams[index_]->time_base.num);
CHECK_EQ(timeBase.second, inputCtx_->streams[index_]->time_base.den);
}
}
void FfmpegAudioStream::updateStreamDecodeParams() {
auto timeBase = getTimeBase();
if (timeBase.first == 0) {
mediaFormat_.format.audio.timeBaseNum =
inputCtx_->streams[index_]->time_base.num;
mediaFormat_.format.audio.timeBaseDen =
inputCtx_->streams[index_]->time_base.den;
}
mediaFormat_.format.audio.duration = inputCtx_->streams[index_]->duration;
}
int FfmpegAudioStream::initFormat() {
AudioFormat& format = mediaFormat_.format.audio;
if (format.samples == 0) {
format.samples = codecCtx_->sample_rate;
}
if (format.channels == 0) {
format.channels = codecCtx_->channels;
}
if (format.format == AV_SAMPLE_FMT_NONE) {
format.format = codecCtx_->sample_fmt;
VLOG(2) << "set stream format sample_fmt: " << format.format;
}
checkStreamDecodeParams();
updateStreamDecodeParams();
if (format.samples > 0 && format.channels > 0 &&
format.format != AV_SAMPLE_FMT_NONE) {
return 0;
} else {
return -1;
}
}
unique_ptr<DecodedFrame> FfmpegAudioStream::sampleFrameData() {
AudioFormat& audioFormat = mediaFormat_.format.audio;
if (!sampler_ || !(sampler_->getInFormat() == *codecCtx_)) {
AudioFormat newInFormat;
newInFormat = toAudioFormat(newInFormat, *codecCtx_);
sampler_ = make_unique<FfmpegAudioSampler>(newInFormat, audioFormat);
VLOG(1) << "Set sampler input audio format"
<< ", samples: " << newInFormat.samples
<< ", channels: " << newInFormat.channels
<< ", format: " << newInFormat.format
<< " : output audio sampler format"
<< ", samples: " << audioFormat.samples
<< ", channels: " << audioFormat.channels
<< ", format: " << audioFormat.format;
int ret = sampler_->init();
if (ret < 0) {
VLOG(1) << "Fail to initialize audio sampler";
return nullptr;
}
}
return sampler_->sample(frame_);
}
#pragma once
#include <utility>
#include "FfmpegAudioSampler.h"
#include "FfmpegStream.h"
/**
* Class uses the FFMPEG library to decode one audio stream.
*/
class FfmpegAudioStream : public FfmpegStream {
public:
explicit FfmpegAudioStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
MediaFormat mediaFormat,
double seekFrameMargin);
~FfmpegAudioStream() override;
// FfmpegStream overrides
MediaType getMediaType() const override {
return MediaType::TYPE_AUDIO;
}
FormatUnion getMediaFormat() const override {
return mediaFormat_.format;
}
int64_t getStartPts() const override {
return mediaFormat_.format.audio.startPts;
}
int64_t getEndPts() const override {
return mediaFormat_.format.audio.endPts;
}
// return numerator and denominator of time base
std::pair<int, int> getTimeBase() const {
return std::make_pair(
mediaFormat_.format.audio.timeBaseNum,
mediaFormat_.format.audio.timeBaseDen);
}
void checkStreamDecodeParams();
void updateStreamDecodeParams();
protected:
int initFormat() override;
std::unique_ptr<DecodedFrame> sampleFrameData() override;
private:
MediaFormat mediaFormat_;
std::unique_ptr<FfmpegAudioSampler> sampler_{nullptr};
};
#include "FfmpegDecoder.h"
#include "FfmpegAudioStream.h"
#include "FfmpegUtil.h"
#include "FfmpegVideoStream.h"
using namespace std;
static AVPacket avPkt;
namespace {
unique_ptr<FfmpegStream> createFfmpegStream(
MediaType type,
AVFormatContext* ctx,
int idx,
MediaFormat& mediaFormat,
double seekFrameMargin) {
enum AVMediaType avType;
CHECK(ffmpeg_util::mapMediaType(type, &avType));
switch (type) {
case MediaType::TYPE_VIDEO:
return make_unique<FfmpegVideoStream>(
ctx, idx, avType, mediaFormat, seekFrameMargin);
case MediaType::TYPE_AUDIO:
return make_unique<FfmpegAudioStream>(
ctx, idx, avType, mediaFormat, seekFrameMargin);
default:
return nullptr;
}
}
} // namespace
FfmpegAvioContext::FfmpegAvioContext()
: workBuffersize_(VIO_BUFFER_SZ),
workBuffer_((uint8_t*)av_malloc(workBuffersize_)),
inputFile_(nullptr),
inputBuffer_(nullptr),
inputBufferSize_(0) {}
int FfmpegAvioContext::initAVIOContext(const uint8_t* buffer, int64_t size) {
inputBuffer_ = buffer;
inputBufferSize_ = size;
avioCtx_ = avio_alloc_context(
workBuffer_,
workBuffersize_,
0,
reinterpret_cast<void*>(this),
&FfmpegAvioContext::readMemory,
nullptr, // no write function
&FfmpegAvioContext::seekMemory);
return 0;
}
FfmpegAvioContext::~FfmpegAvioContext() {
/* note: the internal buffer could have changed, and be != workBuffer_ */
if (avioCtx_) {
av_freep(&avioCtx_->buffer);
av_freep(&avioCtx_);
} else {
av_freep(&workBuffer_);
}
if (inputFile_) {
fclose(inputFile_);
}
}
int FfmpegAvioContext::read(uint8_t* buf, int buf_size) {
if (inputBuffer_) {
return readMemory(this, buf, buf_size);
} else {
return -1;
}
}
int FfmpegAvioContext::readMemory(void* opaque, uint8_t* buf, int buf_size) {
FfmpegAvioContext* h = static_cast<FfmpegAvioContext*>(opaque);
if (buf_size < 0) {
return -1;
}
int remainder = h->inputBufferSize_ - h->offset_;
int r = buf_size < remainder ? buf_size : remainder;
if (r < 0) {
return AVERROR_EOF;
}
memcpy(buf, h->inputBuffer_ + h->offset_, r);
h->offset_ += r;
return r;
}
int64_t FfmpegAvioContext::seek(int64_t offset, int whence) {
if (inputBuffer_) {
return seekMemory(this, offset, whence);
} else {
return -1;
}
}
int64_t FfmpegAvioContext::seekMemory(
void* opaque,
int64_t offset,
int whence) {
FfmpegAvioContext* h = static_cast<FfmpegAvioContext*>(opaque);
switch (whence) {
case SEEK_CUR: // from current position
h->offset_ += offset;
break;
case SEEK_END: // from eof
h->offset_ = h->inputBufferSize_ + offset;
break;
case SEEK_SET: // from beginning of file
h->offset_ = offset;
break;
case AVSEEK_SIZE:
return h->inputBufferSize_;
}
return h->offset_;
}
int FfmpegDecoder::init(
const std::string& filename,
bool isDecodeFile,
FfmpegAvioContext& ioctx,
DecoderOutput& decoderOutput) {
cleanUp();
int ret = 0;
if (!isDecodeFile) {
formatCtx_ = avformat_alloc_context();
if (!formatCtx_) {
LOG(ERROR) << "avformat_alloc_context failed";
return -1;
}
formatCtx_->pb = ioctx.get_avio();
formatCtx_->flags |= AVFMT_FLAG_CUSTOM_IO;
// Determining the input format:
int probeSz = AVPROBE_SIZE + AVPROBE_PADDING_SIZE;
uint8_t* probe((uint8_t*)av_malloc(probeSz));
memset(probe, 0, probeSz);
int len = ioctx.read(probe, probeSz - AVPROBE_PADDING_SIZE);
if (len < probeSz - AVPROBE_PADDING_SIZE) {
LOG(ERROR) << "Insufficient data to determine video format";
av_freep(&probe);
return -1;
}
// seek back to start of stream
ioctx.seek(0, SEEK_SET);
unique_ptr<AVProbeData> probeData(new AVProbeData());
probeData->buf = probe;
probeData->buf_size = len;
probeData->filename = "";
// Determine the input-format:
formatCtx_->iformat = av_probe_input_format(probeData.get(), 1);
// this is to avoid the double-free error
if (formatCtx_->iformat == nullptr) {
LOG(ERROR) << "av_probe_input_format fails";
return -1;
}
VLOG(1) << "av_probe_input_format succeeds";
av_freep(&probe);
ret = avformat_open_input(&formatCtx_, "", nullptr, nullptr);
} else {
ret = avformat_open_input(&formatCtx_, filename.c_str(), nullptr, nullptr);
}
if (ret < 0) {
LOG(ERROR) << "avformat_open_input failed, error: "
<< ffmpeg_util::getErrorDesc(ret);
cleanUp();
return ret;
}
ret = avformat_find_stream_info(formatCtx_, nullptr);
if (ret < 0) {
LOG(ERROR) << "avformat_find_stream_info failed, error: "
<< ffmpeg_util::getErrorDesc(ret);
cleanUp();
return ret;
}
if (!initStreams()) {
LOG(ERROR) << "Cannot activate streams";
cleanUp();
return -1;
}
for (auto& stream : streams_) {
MediaType mediaType = stream.second->getMediaType();
decoderOutput.initMediaType(mediaType, stream.second->getMediaFormat());
}
VLOG(1) << "FfmpegDecoder initialized";
return 0;
}
int FfmpegDecoder::decodeFile(
unique_ptr<DecoderParameters> params,
const string& fileName,
DecoderOutput& decoderOutput) {
VLOG(1) << "decode file: " << fileName;
FfmpegAvioContext ioctx;
int ret = decodeLoop(std::move(params), fileName, true, ioctx, decoderOutput);
return ret;
}
int FfmpegDecoder::decodeMemory(
unique_ptr<DecoderParameters> params,
const uint8_t* buffer,
int64_t size,
DecoderOutput& decoderOutput) {
VLOG(1) << "decode video data in memory";
FfmpegAvioContext ioctx;
int ret = ioctx.initAVIOContext(buffer, size);
if (ret == 0) {
ret =
decodeLoop(std::move(params), string(""), false, ioctx, decoderOutput);
}
return ret;
}
int FfmpegDecoder::probeFile(
unique_ptr<DecoderParameters> params,
const string& fileName,
DecoderOutput& decoderOutput) {
VLOG(1) << "probe file: " << fileName;
FfmpegAvioContext ioctx;
return probeVideo(std::move(params), fileName, true, ioctx, decoderOutput);
}
int FfmpegDecoder::probeMemory(
unique_ptr<DecoderParameters> params,
const uint8_t* buffer,
int64_t size,
DecoderOutput& decoderOutput) {
VLOG(1) << "probe video data in memory";
FfmpegAvioContext ioctx;
int ret = ioctx.initAVIOContext(buffer, size);
if (ret == 0) {
ret =
probeVideo(std::move(params), string(""), false, ioctx, decoderOutput);
}
return ret;
}
void FfmpegDecoder::cleanUp() {
if (formatCtx_) {
for (auto& stream : streams_) {
// Drain stream buffers.
DecoderOutput decoderOutput;
stream.second->flush(1, decoderOutput);
stream.second.reset();
}
streams_.clear();
avformat_close_input(&formatCtx_);
}
}
FfmpegStream* FfmpegDecoder::findStreamByIndex(int streamIndex) const {
auto it = streams_.find(streamIndex);
return it != streams_.end() ? it->second.get() : nullptr;
}
/*
Reference implementation:
https://ffmpeg.org/doxygen/3.4/demuxing_decoding_8c-example.html
*/
int FfmpegDecoder::decodeLoop(
unique_ptr<DecoderParameters> params,
const std::string& filename,
bool isDecodeFile,
FfmpegAvioContext& ioctx,
DecoderOutput& decoderOutput) {
params_ = std::move(params);
int ret = init(filename, isDecodeFile, ioctx, decoderOutput);
if (ret < 0) {
return ret;
}
// init package
av_init_packet(&avPkt);
avPkt.data = nullptr;
avPkt.size = 0;
int result = 0;
bool ptsInRange = true;
while (ptsInRange) {
result = av_read_frame(formatCtx_, &avPkt);
if (result == AVERROR(EAGAIN)) {
VLOG(1) << "Decoder is busy";
ret = 0;
break;
} else if (result == AVERROR_EOF) {
VLOG(1) << "Stream decoding is completed";
ret = 0;
break;
} else if (result < 0) {
VLOG(1) << "av_read_frame fails. Break decoder loop. Error: "
<< ffmpeg_util::getErrorDesc(result);
ret = result;
break;
}
ret = 0;
auto stream = findStreamByIndex(avPkt.stream_index);
if (stream == nullptr) {
// the packet is from a stream the caller is not interested in; ignore it
VLOG(2) << "avPkt ignored. stream index: " << avPkt.stream_index;
// Free the AVPacket's memory; otherwise it leaks
av_packet_unref(&avPkt);
continue;
}
do {
result = stream->sendPacket(&avPkt);
if (result == AVERROR(EAGAIN)) {
VLOG(2) << "avcodec_send_packet returns AVERROR(EAGAIN)";
// start receiving available frames from the internal buffer
stream->receiveAvailFrames(params_->getPtsOnly, decoderOutput);
if (isPtsExceedRange()) {
// exit the most-outer while loop
VLOG(1) << "In all streams, exceed the end pts. Exit decoding loop";
ret = 0;
ptsInRange = false;
break;
}
} else if (result < 0) {
LOG(WARNING) << "avcodec_send_packet failed. Error: "
<< ffmpeg_util::getErrorDesc(result);
ret = result;
break;
} else {
VLOG(2) << "avcodec_send_packet succeeds";
// success: read the next AVPacket and send it out
break;
}
} while (ptsInRange);
// Free the AVPacket's memory; otherwise it leaks
av_packet_unref(&avPkt);
}
/* flush cached frames */
flushStreams(decoderOutput);
return ret;
}
int FfmpegDecoder::probeVideo(
unique_ptr<DecoderParameters> params,
const std::string& filename,
bool isDecodeFile,
FfmpegAvioContext& ioctx,
DecoderOutput& decoderOutput) {
params_ = std::move(params);
return init(filename, isDecodeFile, ioctx, decoderOutput);
}
bool FfmpegDecoder::initStreams() {
for (auto it = params_->formats.begin(); it != params_->formats.end(); ++it) {
AVMediaType mediaType;
if (!ffmpeg_util::mapMediaType(it->first, &mediaType)) {
LOG(ERROR) << "Unknown media type: " << it->first;
return false;
}
int streamIdx =
av_find_best_stream(formatCtx_, mediaType, -1, -1, nullptr, 0);
if (streamIdx >= 0) {
VLOG(2) << "find stream index: " << streamIdx;
auto stream = createFfmpegStream(
it->first,
formatCtx_,
streamIdx,
it->second,
params_->seekFrameMargin);
CHECK(stream);
if (stream->openCodecContext() < 0) {
LOG(ERROR) << "Cannot open codec. Stream index: " << streamIdx;
return false;
}
streams_.emplace(streamIdx, move(stream));
} else {
VLOG(1) << "Cannot open find stream of type " << it->first;
}
}
// Seek frames in each stream
int ret = 0;
for (auto& stream : streams_) {
auto startPts = stream.second->getStartPts();
VLOG(1) << "stream: " << stream.first << " startPts: " << startPts;
if (startPts > 0 && (ret = stream.second->seekFrame(startPts)) < 0) {
LOG(WARNING) << "seekFrame in stream fails";
return false;
}
}
VLOG(1) << "initStreams succeeds";
return true;
}
bool FfmpegDecoder::isPtsExceedRange() {
bool exceed = true;
for (auto& stream : streams_) {
exceed = exceed && stream.second->isFramePtsExceedRange();
}
return exceed;
}
void FfmpegDecoder::flushStreams(DecoderOutput& decoderOutput) {
for (auto& stream : streams_) {
stream.second->flush(params_->getPtsOnly, decoderOutput);
}
}
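// ---------------------------------------------------------------------------
// For context: a minimal sketch (not part of the diff, written against the
// raw FFmpeg API rather than this file's wrapper classes) of the
// send/receive protocol the loop above follows. AVERROR(EAGAIN) from
// avcodec_send_packet means "drain decoded frames first, then retry";
// AVERROR(EAGAIN) from avcodec_receive_frame means "feed more input".
// ---------------------------------------------------------------------------
extern "C" {
#include <libavcodec/avcodec.h>
}

static int drainFrames(AVCodecContext* ctx, AVFrame* frame) {
  int ret;
  while ((ret = avcodec_receive_frame(ctx, frame)) == 0) {
    // ... hand the decoded frame to the sampler/consumer ...
    av_frame_unref(frame);
  }
  // EAGAIN (needs more input) and EOF are the expected ways a drain ends.
  return (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) ? 0 : ret;
}

static int feedPacket(AVCodecContext* ctx, AVPacket* pkt, AVFrame* frame) {
  int ret = avcodec_send_packet(ctx, pkt);
  if (ret == AVERROR(EAGAIN)) {
    // Decoder input queue is full: drain pending frames, then retry.
    if ((ret = drainFrames(ctx, frame)) < 0) {
      return ret;
    }
    ret = avcodec_send_packet(ctx, pkt);
  }
  return ret;
}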