Unverified commit 32e16805 authored by Francisco Massa, committed by GitHub

Update video reader to use new decoder (#1978)

* Base decoder for video. (#1747)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1747

Pull Request resolved: https://github.com/pytorch/vision/pull/1746

Added an implementation of an FFmpeg-based decoder with functionality that can be used in both VUE and TorchVision.

Reviewed By: fmassa

Differential Revision: D19358914

fbshipit-source-id: abb672f89bfaca6351dda2354f0d35cf8e47fa0f

* Integrated base decoder into VideoReader class and video_utils.py (#1766)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1766

Replaced FfmpegDecoder (incompatible with VUE) with the base decoder (compatible with VUE).
Modified the Python utilities in video_utils.py for internal simplification. The public interface is preserved.

Reviewed By: fmassa

Differential Revision: D19415903

fbshipit-source-id: 4d7a0158bd77bac0a18732fe4183fdd9a57f6402

* Optimized base decoder performance. (#1852)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1852

Changed base decoder internals for faster clip processing.

Reviewed By: stephenyan1231

Differential Revision: D19748379

fbshipit-source-id: 58a435f0a0b25545e7bd1a3edb0b1d558176a806

* Minor fix and protected access to decoder class members.

Summary:
Found and fixed a bug in the cropping algorithm (a simple typo).
Derived classes also need access to some decoder class members, such as the initialization parameters, so those members are now protected.

Reviewed By: stephenyan1231, fmassa

Differential Revision: D19895076

fbshipit-source-id: 691336c8e18526b085ae5792ac3546bc387a6db9

* Added missing headers to reduce dependencies. (#1898)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1898

The stream/sampler includes shouldn't depend on decoder headers. Add the dependencies directly where they are required.

Reviewed By: stephenyan1231

Differential Revision: D19911404

fbshipit-source-id: ef322a053708405c02cee4562b456b1602fb12fc

* Implemented VUE Asynchronous Decoder

Summary: For Mothership we have found that an asynchronous decoder provides better performance.

Differential Revision: D20026194

fbshipit-source-id: 627b91844b4e3f917002031dd32cb19c239f4ba8

* Fix a bug in the read_video_from_memory API (#1942)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1942

D18720474 introduced a bug in the `read_video_from_memory` API. Thanks to weiyaowang for reporting it.

Reviewed By: weiyaowang

Differential Revision: D20270179

fbshipit-source-id: 66348c99a5ad1f9129b90e934524ddfaad59de03

* Extend decoder to support the new video_max_dimension argument (#1924)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1924

Extend the `video reader` decoder Python API in TorchVision to support a new argument, `video_max_dimension`. This enables new video decoding use cases. When setting `video_width=0`, `video_height=0`, `video_min_dimension != 0`, and `video_max_dimension != 0`, we can rescale the video clips so that their spatial resolution (height, width) becomes
 - (video_min_dimension, video_max_dimension) if original height < original width
 - (video_max_dimension, video_min_dimension) if original height >= original width

This is useful at the video model testing stage, where we perform fully convolutional evaluation and take entire video frames, without cropping, as input. Previously we could only set, for instance, `video_width=0`, `video_height=0`, `video_min_dimension = 128`, which preserves the aspect ratio. In production datasets there are a small number of videos whose aspect ratio is either extremely large or extremely small, and when the shorter edge is rescaled to 128, the longer edge is still large. This easily causes GPU out-of-memory (OOM) errors when we sample multiple video clips and put them in a single minibatch.

Now we can set, for instance, `video_width=0`, `video_height=0`, `video_min_dimension = 128`, and `video_max_dimension = 171` so that the rescaled resolution is either (128, 171) or (171, 128), depending on whether the original height is larger than the original width. Thus, we are much less likely to hit GPU OOM because the spatial size of the video clips is fixed.
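For illustration only, here is a minimal C++ sketch of the (height, width) selection described above. This is not the decoder's actual implementation, and the helper name rescaleWithMinMax is hypothetical:

#include <utility>

// Hypothetical helper mirroring the rule above: when both dimensions are
// requested, the shorter edge maps to video_min_dimension and the longer
// edge to video_max_dimension. Returns {height, width}.
std::pair<int, int> rescaleWithMinMax(
    int srcWidth, int srcHeight, int minDimension, int maxDimension) {
  if (srcHeight < srcWidth) {
    return {minDimension, maxDimension}; // landscape: (128, 171) in the example
  }
  return {maxDimension, minDimension}; // portrait/square: (171, 128)
}
// e.g. a 1280x720 (WxH) clip with min=128 and max=171 is decoded as 171x128,
// while a 720x1280 clip is decoded as 128x171.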

Reviewed By: putivsky

Differential Revision: D20182529

fbshipit-source-id: f9c40afb7590e7c45e6908946597141efa35f57c

* Fix samplers initialization (#1967)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1967



No-op for the torchvision diff; this fixes the samplers.

Differential Revision: D20397218

fbshipit-source-id: 6dc4d04364f305fbda7ca4f67a25ceecd73d0f20

* Exclude C++ test files
Co-authored-by: Yuri Putivsky <yuri@fb.com>
Co-authored-by: Zhicheng Yan <zyan3@fb.com>
parent 8b9859d3
#pragma once
#include "FfmpegHeaders.h"
#include "Interface.h"
/**
* Class that samples data from an AVFrame
*/
class FfmpegSampler {
public:
virtual ~FfmpegSampler() = default;
// return 0 on success and negative number on failure
virtual int init() = 0;
// sample from the given frame
virtual std::unique_ptr<DecodedFrame> sample(const AVFrame* frame) = 0;
};
#include "FfmpegStream.h"
#include "FfmpegUtil.h"
using namespace std;
// TODO: currently the use of refCount is disabled
static int refCount = 0;
FfmpegStream::FfmpegStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
double seekFrameMargin)
: inputCtx_(inputCtx),
index_(index),
avMediaType_(avMediaType),
seekFrameMargin_(seekFrameMargin) {}
FfmpegStream::~FfmpegStream() {
if (frame_) {
av_frame_free(&frame_);
}
avcodec_free_context(&codecCtx_);
}
int FfmpegStream::openCodecContext() {
VLOG(2) << "stream start_time: " << inputCtx_->streams[index_]->start_time;
auto typeString = av_get_media_type_string(avMediaType_);
AVStream* st = inputCtx_->streams[index_];
auto codec_id = st->codecpar->codec_id;
VLOG(1) << "codec_id: " << codec_id;
AVCodec* codec = avcodec_find_decoder(codec_id);
if (!codec) {
LOG(ERROR) << "avcodec_find_decoder failed for codec_id: " << int(codec_id);
return AVERROR(EINVAL);
}
VLOG(1) << "Succeed to find decoder";
codecCtx_ = avcodec_alloc_context3(codec);
if (!codecCtx_) {
LOG(ERROR) << "avcodec_alloc_context3 fails";
return AVERROR(ENOMEM);
}
int ret;
/* Copy codec parameters from input stream to output codec context */
if ((ret = avcodec_parameters_to_context(codecCtx_, st->codecpar)) < 0) {
LOG(ERROR) << "Failed to copy " << typeString
<< " codec parameters to decoder context";
return ret;
}
AVDictionary* opts = nullptr;
av_dict_set(&opts, "refcounted_frames", refCount ? "1" : "0", 0);
// after avcodec_open2, value of codecCtx_->time_base is NOT meaningful
// But inputCtx_->streams[index_]->time_base has meaningful values
if ((ret = avcodec_open2(codecCtx_, codec, &opts)) < 0) {
LOG(ERROR) << "avcodec_open2 failed. " << ffmpeg_util::getErrorDesc(ret);
return ret;
}
VLOG(1) << "Succeed to open codec";
frame_ = av_frame_alloc();
return initFormat();
}
unique_ptr<DecodedFrame> FfmpegStream::getFrameData(int getPtsOnly) {
if (!codecCtx_) {
LOG(ERROR) << "Codec is not initialized";
return nullptr;
}
if (getPtsOnly) {
unique_ptr<DecodedFrame> decodedFrame = make_unique<DecodedFrame>();
decodedFrame->pts_ = frame_->pts;
return decodedFrame;
} else {
unique_ptr<DecodedFrame> decodedFrame = sampleFrameData();
if (decodedFrame) {
decodedFrame->pts_ = frame_->pts;
}
return decodedFrame;
}
}
void FfmpegStream::flush(int getPtsOnly, DecoderOutput& decoderOutput) {
VLOG(1) << "Media Type: " << getMediaType() << ", flush stream.";
// need to receive frames before entering draining mode
receiveAvailFrames(getPtsOnly, decoderOutput);
VLOG(2) << "send nullptr packet";
sendPacket(nullptr);
// receive remaining frames after entering draining mode
receiveAvailFrames(getPtsOnly, decoderOutput);
avcodec_flush_buffers(codecCtx_);
}
bool FfmpegStream::isFramePtsInRange() {
CHECK(frame_);
auto pts = frame_->pts;
auto startPts = this->getStartPts();
auto endPts = this->getEndPts();
VLOG(2) << "isPtsInRange. pts: " << pts << ", startPts: " << startPts
<< ", endPts: " << endPts;
return (pts == AV_NOPTS_VALUE) ||
(pts >= startPts && (endPts >= 0 ? pts <= endPts : true));
}
bool FfmpegStream::isFramePtsExceedRange() {
if (frame_) {
auto endPts = this->getEndPts();
VLOG(2) << "isFramePtsExceedRange. last_pts_: " << last_pts_
<< ", endPts: " << endPts;
return endPts >= 0 ? last_pts_ >= endPts : false;
} else {
return true;
}
}
// seek a frame
int FfmpegStream::seekFrame(int64_t seekPts) {
// translate margin from second to pts
int64_t margin = (int64_t)(
seekFrameMargin_ * (double)inputCtx_->streams[index_]->time_base.den /
(double)inputCtx_->streams[index_]->time_base.num);
int64_t real_seekPts = (seekPts - margin) > 0 ? (seekPts - margin) : 0;
VLOG(2) << "seek margin: " << margin;
VLOG(2) << "real seekPts: " << real_seekPts;
int ret = av_seek_frame(
inputCtx_,
index_,
(seekPts - margin) > 0 ? (seekPts - margin) : 0,
AVSEEK_FLAG_BACKWARD);
if (ret < 0) {
LOG(WARNING) << "av_seek_frame fails. Stream index: " << index_;
return ret;
}
return 0;
}
// send/receive encoding and decoding API overview
// https://ffmpeg.org/doxygen/3.4/group__lavc__encdec.html
int FfmpegStream::sendPacket(const AVPacket* packet) {
return avcodec_send_packet(codecCtx_, packet);
}
int FfmpegStream::receiveFrame() {
int ret = avcodec_receive_frame(codecCtx_, frame_);
if (ret >= 0) {
// succeed
frame_->pts = av_frame_get_best_effort_timestamp(frame_);
if (frame_->pts == AV_NOPTS_VALUE) {
// Trick: if we can not figure out pts, we just set it to be (last_pts +
// 1)
frame_->pts = last_pts_ + 1;
}
last_pts_ = frame_->pts;
VLOG(2) << "avcodec_receive_frame succeed";
} else if (ret == AVERROR(EAGAIN)) {
VLOG(2) << "avcodec_receive_frame fails and returns AVERROR(EAGAIN). ";
} else if (ret == AVERROR_EOF) {
// no more frame to read
VLOG(2) << "avcodec_receive_frame returns AVERROR_EOF";
} else {
LOG(WARNING) << "avcodec_receive_frame failed. Error: "
<< ffmpeg_util::getErrorDesc(ret);
}
return ret;
}
void FfmpegStream::receiveAvailFrames(
int getPtsOnly,
DecoderOutput& decoderOutput) {
int result = 0;
while ((result = receiveFrame()) >= 0) {
unique_ptr<DecodedFrame> decodedFrame = getFrameData(getPtsOnly);
if (decodedFrame &&
((!getPtsOnly && decodedFrame->frameSize_ > 0) || getPtsOnly)) {
if (isFramePtsInRange()) {
decoderOutput.addMediaFrame(getMediaType(), std::move(decodedFrame));
}
} // end-if
} // end-while
}
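The class above is driven by FFmpeg's send/receive API referenced in the comment. The following is a hedged sketch (not code from this commit) of how a caller might feed packets to a stream and drain it:

#include "FfmpegStream.h"

// Sketch: demux packets for one stream, decode all available frames, then
// flush once the demuxer reports end of file. Assumes `stream` was created
// by a concrete subclass and openCodecContext() has already succeeded.
void sketchDecodeLoop(
    FfmpegStream& stream,
    AVFormatContext* inputCtx,
    DecoderOutput& decoderOutput) {
  AVPacket packet;
  av_init_packet(&packet);
  packet.data = nullptr;
  packet.size = 0;
  while (av_read_frame(inputCtx, &packet) >= 0) {
    if (packet.stream_index == stream.getIndex()) {
      stream.sendPacket(&packet);
      stream.receiveAvailFrames(/*getPtsOnly=*/0, decoderOutput);
    }
    av_packet_unref(&packet);
  }
  // enter draining mode and collect the remaining buffered frames
  stream.flush(/*getPtsOnly=*/0, decoderOutput);
}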
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
#pragma once
#include <memory>
#include <unordered_map>
#include <utility>
#include "FfmpegHeaders.h"
#include "Interface.h"
/*
Class uses FFMPEG library to decode one media stream (audio or video).
*/
class FfmpegStream {
public:
FfmpegStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
double seekFrameMargin);
virtual ~FfmpegStream();
// returns 0 on success or a negative error code
int openCodecContext();
// returns stream index
int getIndex() const {
return index_;
}
// returns the decoded/sampled frame data
std::unique_ptr<DecodedFrame> getFrameData(int getPtsOnly);
// flush the stream at the end of decoding;
// remaining buffered frames are added to decoderOutput
void flush(int getPtsOnly, DecoderOutput& decoderOutput);
// seek a frame
int seekFrame(int64_t ts);
// send an AVPacket
int sendPacket(const AVPacket* packet);
// receive AVFrame
int receiveFrame();
// receive all available frames from the internal buffer
void receiveAvailFrames(int getPtsOnly, DecoderOutput& decoderOutput);
// return media type
virtual MediaType getMediaType() const = 0;
// return media format
virtual FormatUnion getMediaFormat() const = 0;
// return start presentation timestamp
virtual int64_t getStartPts() const = 0;
// return end presentation timestamp
virtual int64_t getEndPts() const = 0;
// is the pts of most recent frame within range?
bool isFramePtsInRange();
// does the pts of most recent frame exceed range?
bool isFramePtsExceedRange();
protected:
virtual int initFormat() = 0;
// returns a decoded frame
virtual std::unique_ptr<DecodedFrame> sampleFrameData() = 0;
protected:
AVFormatContext* const inputCtx_;
const int index_;
enum AVMediaType avMediaType_;
AVCodecContext* codecCtx_{nullptr};
AVFrame* frame_{nullptr};
// pts of last decoded frame
int64_t last_pts_{0};
double seekFrameMargin_{1.0};
};
#include "FfmpegUtil.h"
using namespace std;
namespace ffmpeg_util {
bool mapFfmpegType(AVMediaType media, MediaType* type) {
switch (media) {
case AVMEDIA_TYPE_VIDEO:
*type = MediaType::TYPE_VIDEO;
return true;
case AVMEDIA_TYPE_AUDIO:
*type = MediaType::TYPE_AUDIO;
return true;
default:
return false;
}
}
bool mapMediaType(MediaType type, AVMediaType* media) {
switch (type) {
case MediaType::TYPE_VIDEO:
*media = AVMEDIA_TYPE_VIDEO;
return true;
case MediaType::TYPE_AUDIO:
*media = AVMEDIA_TYPE_AUDIO;
return true;
default:
return false;
}
}
void setFormatDimensions(
int& destW,
int& destH,
int userW,
int userH,
int srcW,
int srcH,
int minDimension) {
// rounding rules
// int -> double -> round
// round up if fraction is >= 0.5 or round down if fraction is < 0.5
// int result = double(value) + 0.5
// here we round the double to int according to the above rule
if (userW == 0 && userH == 0) {
if (minDimension > 0) { // #2
if (srcW > srcH) {
// landscape
destH = minDimension;
destW = round(double(srcW * minDimension) / srcH);
} else {
// portrait
destW = minDimension;
destH = round(double(srcH * minDimension) / srcW);
}
} else { // #1
destW = srcW;
destH = srcH;
}
} else if (userW != 0 && userH == 0) { // #3
destW = userW;
destH = round(double(srcH * userW) / srcW);
} else if (userW == 0 && userH != 0) { // #4
destW = round(double(srcW * userH) / srcH);
destH = userH;
} else {
// userW != 0 && userH != 0. #5
destW = userW;
destH = userH;
}
// prevent zeros
destW = std::max(destW, 1);
destH = std::max(destH, 1);
}
bool validateVideoFormat(const VideoFormat& f) {
/*
Valid parameter values for the decoder
___________________________________________________
| W | H | minDimension | algorithm |
|_________________________________________________|
| 0 | 0 | 0 | original |
|_________________________________________________|
| 0 | 0 | >0 |scale to min dimension|
|_________________________________________________|
| >0 | 0 | 0 | scale keeping W |
|_________________________________________________|
| 0 | >0 | 0 | scale keeping H |
|_________________________________________________|
| >0 | >0 | 0 | stretch/scale |
|_________________________________________________|
*/
return (f.width == 0 && f.height == 0) || // #1 and #2
(f.width != 0 && f.height != 0 && f.minDimension == 0) || // # 5
(((f.width != 0 && f.height == 0) || // #3 and #4
(f.width == 0 && f.height != 0)) &&
f.minDimension == 0);
}
string getErrorDesc(int errnum) {
array<char, 1024> buffer;
if (av_strerror(errnum, buffer.data(), buffer.size()) < 0) {
return string("Unknown error code");
}
buffer.back() = 0;
return string(buffer.data());
}
} // namespace ffmpeg_util
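As a hedged usage sketch (not part of the library), the rounding rules and parameter table above give the following results for a 1920x1080 source; the main() wrapper is only for demonstration:

#include <iostream>
#include "FfmpegUtil.h"

int main() {
  int w = 0, h = 0;
  // case #2: userW = userH = 0, minDimension = 128 -> the shorter edge becomes
  // 128 and the longer edge keeps the aspect ratio: round(1920 * 128 / 1080) = 228
  ffmpeg_util::setFormatDimensions(w, h, 0, 0, 1920, 1080, 128);
  std::cout << w << "x" << h << std::endl; // prints 228x128
  // case #3: userW = 256, userH = 0 -> width is fixed and height keeps the
  // aspect ratio: round(1080 * 256 / 1920) = 144
  ffmpeg_util::setFormatDimensions(w, h, 256, 0, 1920, 1080, 0);
  std::cout << w << "x" << h << std::endl; // prints 256x144
  return 0;
}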
#pragma once
#include <array>
#include <string>
#include "FfmpegHeaders.h"
#include "Interface.h"
namespace ffmpeg_util {
bool mapFfmpegType(AVMediaType media, enum MediaType* type);
bool mapMediaType(MediaType type, enum AVMediaType* media);
void setFormatDimensions(
int& destW,
int& destH,
int userW,
int userH,
int srcW,
int srcH,
int minDimension);
bool validateVideoFormat(const VideoFormat& f);
std::string getErrorDesc(int errnum);
} // namespace ffmpeg_util
#include "FfmpegVideoSampler.h"
#include "FfmpegUtil.h"
using namespace std;
FfmpegVideoSampler::FfmpegVideoSampler(
const VideoFormat& in,
const VideoFormat& out,
int swsFlags)
: inFormat_(in), outFormat_(out), swsFlags_(swsFlags) {}
FfmpegVideoSampler::~FfmpegVideoSampler() {
if (scaleContext_) {
sws_freeContext(scaleContext_);
scaleContext_ = nullptr;
}
}
int FfmpegVideoSampler::init() {
VLOG(1) << "Input format: width " << inFormat_.width << ", height "
<< inFormat_.height << ", format " << inFormat_.format
<< ", minDimension " << inFormat_.minDimension;
VLOG(1) << "Scale format: width " << outFormat_.width << ", height "
<< outFormat_.height << ", format " << outFormat_.format
<< ", minDimension " << outFormat_.minDimension;
scaleContext_ = sws_getContext(
inFormat_.width,
inFormat_.height,
(AVPixelFormat)inFormat_.format,
outFormat_.width,
outFormat_.height,
static_cast<AVPixelFormat>(outFormat_.format),
swsFlags_,
nullptr,
nullptr,
nullptr);
if (scaleContext_) {
return 0;
} else {
return -1;
}
}
int32_t FfmpegVideoSampler::getImageBytes() const {
return av_image_get_buffer_size(
(AVPixelFormat)outFormat_.format, outFormat_.width, outFormat_.height, 1);
}
// https://ffmpeg.org/doxygen/3.4/scaling_video_8c-example.html#a10
unique_ptr<DecodedFrame> FfmpegVideoSampler::sample(const AVFrame* frame) {
if (!frame) {
return nullptr; // no flush for videos
}
// scaled and cropped image
auto outImageSize = getImageBytes();
AvDataPtr frameData(static_cast<uint8_t*>(av_malloc(outImageSize)));
uint8_t* scalePlanes[4] = {nullptr};
int scaleLines[4] = {0};
int result;
if ((result = av_image_fill_arrays(
scalePlanes,
scaleLines,
frameData.get(),
static_cast<AVPixelFormat>(outFormat_.format),
outFormat_.width,
outFormat_.height,
1)) < 0) {
LOG(ERROR) << "av_image_fill_arrays failed, err: "
<< ffmpeg_util::getErrorDesc(result);
return nullptr;
}
if ((result = sws_scale(
scaleContext_,
frame->data,
frame->linesize,
0,
inFormat_.height,
scalePlanes,
scaleLines)) < 0) {
LOG(ERROR) << "sws_scale failed, err: "
<< ffmpeg_util::getErrorDesc(result);
return nullptr;
}
return make_unique<DecodedFrame>(std::move(frameData), outImageSize, 0);
}
#pragma once
#include "FfmpegSampler.h"
/**
* Class that transcodes video frames from one format into another
*/
class FfmpegVideoSampler : public FfmpegSampler {
public:
explicit FfmpegVideoSampler(
const VideoFormat& in,
const VideoFormat& out,
int swsFlags = SWS_AREA);
~FfmpegVideoSampler() override;
int init() override;
int32_t getImageBytes() const;
// returns the sampled frame data (or nullptr on failure)
std::unique_ptr<DecodedFrame> sample(const AVFrame* frame) override;
const VideoFormat& getInFormat() const {
return inFormat_;
}
private:
VideoFormat inFormat_;
VideoFormat outFormat_;
int swsFlags_;
SwsContext* scaleContext_{nullptr};
};
#include "FfmpegVideoStream.h"
#include "FfmpegUtil.h"
using namespace std;
namespace {
bool operator==(const VideoFormat& x, const AVFrame& y) {
return x.width == y.width && x.height == y.height &&
x.format == static_cast<AVPixelFormat>(y.format);
}
VideoFormat toVideoFormat(const AVFrame& frame) {
VideoFormat videoFormat;
videoFormat.width = frame.width;
videoFormat.height = frame.height;
videoFormat.format = static_cast<AVPixelFormat>(frame.format);
return videoFormat;
}
} // namespace
FfmpegVideoStream::FfmpegVideoStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
MediaFormat mediaFormat,
double seekFrameMargin)
: FfmpegStream(inputCtx, index, avMediaType, seekFrameMargin),
mediaFormat_(mediaFormat) {}
FfmpegVideoStream::~FfmpegVideoStream() {}
void FfmpegVideoStream::checkStreamDecodeParams() {
auto timeBase = getTimeBase();
if (timeBase.first > 0) {
CHECK_EQ(timeBase.first, inputCtx_->streams[index_]->time_base.num);
CHECK_EQ(timeBase.second, inputCtx_->streams[index_]->time_base.den);
}
}
void FfmpegVideoStream::updateStreamDecodeParams() {
auto timeBase = getTimeBase();
if (timeBase.first == 0) {
mediaFormat_.format.video.timeBaseNum =
inputCtx_->streams[index_]->time_base.num;
mediaFormat_.format.video.timeBaseDen =
inputCtx_->streams[index_]->time_base.den;
}
mediaFormat_.format.video.duration = inputCtx_->streams[index_]->duration;
}
int FfmpegVideoStream::initFormat() {
// set output format
VideoFormat& format = mediaFormat_.format.video;
if (!ffmpeg_util::validateVideoFormat(format)) {
LOG(ERROR) << "Invalid video format";
return -1;
}
format.fps = av_q2d(
av_guess_frame_rate(inputCtx_, inputCtx_->streams[index_], nullptr));
// keep aspect ratio
ffmpeg_util::setFormatDimensions(
format.width,
format.height,
format.width,
format.height,
codecCtx_->width,
codecCtx_->height,
format.minDimension);
VLOG(1) << "After adjusting, video format"
<< ", width: " << format.width << ", height: " << format.height
<< ", format: " << format.format
<< ", minDimension: " << format.minDimension;
if (format.format == AV_PIX_FMT_NONE) {
format.format = codecCtx_->pix_fmt;
VLOG(1) << "Set pixel format: " << format.format;
}
checkStreamDecodeParams();
updateStreamDecodeParams();
return format.width != 0 && format.height != 0 &&
format.format != AV_PIX_FMT_NONE
? 0
: -1;
}
unique_ptr<DecodedFrame> FfmpegVideoStream::sampleFrameData() {
VideoFormat& format = mediaFormat_.format.video;
if (!sampler_ || !(sampler_->getInFormat() == *frame_)) {
VideoFormat newInFormat = toVideoFormat(*frame_);
sampler_ = make_unique<FfmpegVideoSampler>(newInFormat, format, SWS_AREA);
VLOG(1) << "Set input video sampler format"
<< ", width: " << newInFormat.width
<< ", height: " << newInFormat.height
<< ", format: " << newInFormat.format
<< " : output video sampler format"
<< ", width: " << format.width << ", height: " << format.height
<< ", format: " << format.format
<< ", minDimension: " << format.minDimension;
int ret = sampler_->init();
if (ret < 0) {
VLOG(1) << "Fail to initialize video sampler";
return nullptr;
}
}
return sampler_->sample(frame_);
}
#pragma once
#include <utility>
#include "FfmpegStream.h"
#include "FfmpegVideoSampler.h"
/**
* Class uses FFMPEG library to decode one video stream.
*/
class FfmpegVideoStream : public FfmpegStream {
public:
explicit FfmpegVideoStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
MediaFormat mediaFormat,
double seekFrameMargin);
~FfmpegVideoStream() override;
// FfmpegStream overrides
MediaType getMediaType() const override {
return MediaType::TYPE_VIDEO;
}
FormatUnion getMediaFormat() const override {
return mediaFormat_.format;
}
int64_t getStartPts() const override {
return mediaFormat_.format.video.startPts;
}
int64_t getEndPts() const override {
return mediaFormat_.format.video.endPts;
}
// return numerator and denominator of time base
std::pair<int, int> getTimeBase() const {
return std::make_pair(
mediaFormat_.format.video.timeBaseNum,
mediaFormat_.format.video.timeBaseDen);
}
void checkStreamDecodeParams();
void updateStreamDecodeParams();
protected:
int initFormat() override;
std::unique_ptr<DecodedFrame> sampleFrameData() override;
private:
MediaFormat mediaFormat_;
std::unique_ptr<FfmpegVideoSampler> sampler_{nullptr};
};
#include "Interface.h"
void DecoderOutput::initMediaType(MediaType mediaType, FormatUnion format) {
MediaData mediaData(format);
media_data_.emplace(mediaType, std::move(mediaData));
}
void DecoderOutput::addMediaFrame(
MediaType mediaType,
std::unique_ptr<DecodedFrame> frame) {
if (media_data_.find(mediaType) != media_data_.end()) {
VLOG(1) << "media type: " << mediaType
<< " add frame with pts: " << frame->pts_;
media_data_[mediaType].frames_.push_back(std::move(frame));
} else {
VLOG(1) << "media type: " << mediaType << " not found. Skip the frame.";
}
}
void DecoderOutput::clear() {
media_data_.clear();
}
#pragma once
#include <c10/util/Logging.h>
#include <sys/types.h>
#include <memory>
#include <unordered_map>
extern "C" {
#include <libavutil/pixfmt.h>
#include <libavutil/samplefmt.h>
void av_free(void* ptr);
}
struct avDeleter {
void operator()(uint8_t* p) const {
av_free(p);
}
};
const AVPixelFormat defaultVideoPixelFormat = AV_PIX_FMT_RGB24;
const AVSampleFormat defaultAudioSampleFormat = AV_SAMPLE_FMT_FLT;
using AvDataPtr = std::unique_ptr<uint8_t, avDeleter>;
enum MediaType : uint32_t {
TYPE_VIDEO = 1,
TYPE_AUDIO = 2,
};
struct EnumClassHash {
template <typename T>
uint32_t operator()(T t) const {
return static_cast<uint32_t>(t);
}
};
struct VideoFormat {
// fields are initialized for auto-detection
// the caller can specify some or all field values if a specific output is desired
int width{0}; // width in pixels
int height{0}; // height in pixels
int minDimension{0}; // choose min dimension and rescale accordingly
// Output image pixel format. data type AVPixelFormat
AVPixelFormat format{defaultVideoPixelFormat}; // type AVPixelFormat
int64_t startPts{0}, endPts{0}; // Start and end presentation timestamp
int timeBaseNum{0};
int timeBaseDen{1}; // numerator and denominator of time base
float fps{0.0};
int64_t duration{0}; // duration of the stream, in stream time base
};
struct AudioFormat {
// fields are initialized for auto-detection
// the caller can specify some or all field values if a specific output is desired
int samples{0}; // number of samples per second (sampling frequency)
int channels{0}; // number of channels
AVSampleFormat format{defaultAudioSampleFormat}; // type AVSampleFormat
int64_t startPts{0}, endPts{0}; // Start and end presentation timestamp
int timeBaseNum{0};
int timeBaseDen{1}; // numerator and denominator of time base
int64_t duration{0}; // duration of the stream, in stream time base
};
union FormatUnion {
FormatUnion() {}
VideoFormat video;
AudioFormat audio;
};
struct MediaFormat {
MediaFormat() {}
MediaFormat(const MediaFormat& mediaFormat) : type(mediaFormat.type) {
if (type == MediaType::TYPE_VIDEO) {
format.video = mediaFormat.format.video;
} else if (type == MediaType::TYPE_AUDIO) {
format.audio = mediaFormat.format.audio;
}
}
MediaFormat(MediaType mediaType) : type(mediaType) {
if (mediaType == MediaType::TYPE_VIDEO) {
format.video = VideoFormat();
} else if (mediaType == MediaType::TYPE_AUDIO) {
format.audio = AudioFormat();
}
}
// media type
MediaType type;
// format data
FormatUnion format;
};
class DecodedFrame {
public:
explicit DecodedFrame() : frame_(nullptr), frameSize_(0), pts_(0) {}
explicit DecodedFrame(AvDataPtr frame, int frameSize, int64_t pts)
: frame_(std::move(frame)), frameSize_(frameSize), pts_(pts) {}
AvDataPtr frame_{nullptr};
int frameSize_{0};
int64_t pts_{0};
};
struct MediaData {
MediaData() {}
MediaData(FormatUnion format) : format_(format) {}
FormatUnion format_;
std::vector<std::unique_ptr<DecodedFrame>> frames_;
};
class DecoderOutput {
public:
explicit DecoderOutput() {}
~DecoderOutput() {}
void initMediaType(MediaType mediaType, FormatUnion format);
void addMediaFrame(MediaType mediaType, std::unique_ptr<DecodedFrame> frame);
void clear();
std::unordered_map<MediaType, MediaData, EnumClassHash> media_data_;
};
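A hedged sketch of how the structures above fit together (illustrative only; the function name is hypothetical): a caller registers a stream's format once and then appends decoded frames for that media type:

#include "Interface.h"

// Register a video stream and push a single (empty) decoded frame.
void sketchDecoderOutputUsage(const FormatUnion& videoFormat) {
  DecoderOutput output;
  output.initMediaType(MediaType::TYPE_VIDEO, videoFormat);
  auto frame = std::make_unique<DecodedFrame>(); // empty payload, pts 0
  output.addMediaFrame(MediaType::TYPE_VIDEO, std::move(frame));
  // decoded frames are now available in
  // output.media_data_[MediaType::TYPE_VIDEO].frames_
  output.clear();
}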
@@ -3,11 +3,11 @@
#include <Python.h>
#include <c10/util/Logging.h>
#include <exception>
#include "FfmpegDecoder.h"
#include "FfmpegHeaders.h"
#include "util.h"
#include "memory_buffer.h"
#include "sync_decoder.h"
using namespace std;
using namespace ffmpeg;
// If we are in a Windows environment, we need to define
// initialization functions for the _custom_ops extension
@@ -27,121 +27,159 @@ PyMODINIT_FUNC PyInit_video_reader(void) {
namespace video_reader {
class UnknownPixelFormatException : public exception {
const char* what() const throw() override {
return "Unknown pixel format";
}
};
int getChannels(AVPixelFormat format) {
int numChannels = 0;
switch (format) {
case AV_PIX_FMT_BGR24:
case AV_PIX_FMT_RGB24:
numChannels = 3;
break;
default:
LOG(ERROR) << "Unknown format: " << format;
throw UnknownPixelFormatException();
}
return numChannels;
}
const AVPixelFormat defaultVideoPixelFormat = AV_PIX_FMT_RGB24;
const AVSampleFormat defaultAudioSampleFormat = AV_SAMPLE_FMT_FLT;
const size_t decoderTimeoutMs = 600000;
// A jitter can be added to the end of the range to avoid conversion/rounding
// errors. A small value of 100us won't be enough to select the next frame, but it
// is enough to compensate for rounding errors due to the multiple conversions.
const size_t timeBaseJitterUs = 100;
DecoderParameters getDecoderParams(
int64_t videoStartUs,
int64_t videoEndUs,
double seekFrameMarginUs,
int64_t getPtsOnly,
int64_t readVideoStream,
int videoWidth,
int videoHeight,
int videoMinDimension,
int videoMaxDimension,
int64_t readAudioStream,
int audioSamples,
int audioChannels) {
DecoderParameters params;
params.headerOnly = getPtsOnly != 0;
params.seekAccuracy = seekFrameMarginUs;
params.startOffset = videoStartUs;
params.endOffset = videoEndUs;
params.timeoutMs = decoderTimeoutMs;
params.preventStaleness = false;
void fillVideoTensor(
std::vector<unique_ptr<DecodedFrame>>& frames,
torch::Tensor& videoFrame,
torch::Tensor& videoFramePts) {
int frameSize = 0;
if (videoFrame.numel() > 0) {
frameSize = videoFrame.numel() / frames.size();
if (readVideoStream == 1) {
MediaFormat videoFormat(0);
videoFormat.type = TYPE_VIDEO;
videoFormat.format.video.format = defaultVideoPixelFormat;
videoFormat.format.video.width = videoWidth;
videoFormat.format.video.height = videoHeight;
videoFormat.format.video.minDimension = videoMinDimension;
videoFormat.format.video.maxDimension = videoMaxDimension;
params.formats.insert(videoFormat);
}
int frameCount = 0;
if (readAudioStream == 1) {
MediaFormat audioFormat;
audioFormat.type = TYPE_AUDIO;
audioFormat.format.audio.format = defaultAudioSampleFormat;
audioFormat.format.audio.samples = audioSamples;
audioFormat.format.audio.channels = audioChannels;
params.formats.insert(audioFormat);
}
uint8_t* videoFrameData =
videoFrame.numel() > 0 ? videoFrame.data_ptr<uint8_t>() : nullptr;
int64_t* videoFramePtsData = videoFramePts.data_ptr<int64_t>();
return params;
}
for (size_t i = 0; i < frames.size(); ++i) {
const auto& frame = frames[i];
if (videoFrameData) {
memcpy(
videoFrameData + (size_t)(frameCount++) * (size_t)frameSize,
frame->frame_.get(),
frameSize * sizeof(uint8_t));
// returns number of written bytes
template <typename T>
size_t fillTensor(
std::vector<DecoderOutputMessage>& msgs,
torch::Tensor& frame,
torch::Tensor& framePts,
int64_t num,
int64_t den) {
if (msgs.empty()) {
return 0;
}
T* frameData = frame.numel() > 0 ? frame.data_ptr<T>() : nullptr;
int64_t* framePtsData = framePts.data_ptr<int64_t>();
CHECK_EQ(framePts.size(0), msgs.size());
size_t avgElementsInFrame = frame.numel() / msgs.size();
size_t offset = 0;
for (size_t i = 0; i < msgs.size(); ++i) {
const auto& msg = msgs[i];
// convert pts into original time_base
AVRational avr = {(int)num, (int)den};
framePtsData[i] = av_rescale_q(msg.header.pts, AV_TIME_BASE_Q, avr);
VLOG(2) << "PTS type: " << sizeof(T) << ", us: " << msg.header.pts
<< ", original: " << framePtsData[i];
if (frameData) {
auto sizeInBytes = msg.payload->length();
memcpy(frameData + offset, msg.payload->data(), sizeInBytes);
if (sizeof(T) == sizeof(uint8_t)) {
// Video - move by allocated frame size
offset += avgElementsInFrame / sizeof(T);
} else {
// Audio - move by number of samples
offset += sizeInBytes / sizeof(T);
}
}
videoFramePtsData[i] = frame->pts_;
}
return offset * sizeof(T);
}
void getVideoMeta(
DecoderOutput& decoderOutput,
int& numFrames,
int& height,
int& width,
int& numChannels) {
auto& videoFrames = decoderOutput.media_data_[TYPE_VIDEO].frames_;
numFrames = videoFrames.size();
FormatUnion& videoFormat = decoderOutput.media_data_[TYPE_VIDEO].format_;
height = videoFormat.video.height;
width = videoFormat.video.width;
numChannels = getChannels(videoFormat.video.format);
size_t fillVideoTensor(
std::vector<DecoderOutputMessage>& msgs,
torch::Tensor& videoFrame,
torch::Tensor& videoFramePts,
int64_t num,
int64_t den) {
return fillTensor<uint8_t>(msgs, videoFrame, videoFramePts, num, den);
}
void fillAudioTensor(
std::vector<unique_ptr<DecodedFrame>>& frames,
size_t fillAudioTensor(
std::vector<DecoderOutputMessage>& msgs,
torch::Tensor& audioFrame,
torch::Tensor& audioFramePts) {
if (frames.size() == 0) {
return;
}
float* audioFrameData =
audioFrame.numel() > 0 ? audioFrame.data_ptr<float>() : nullptr;
CHECK_EQ(audioFramePts.size(0), frames.size());
int64_t* audioFramePtsData = audioFramePts.data_ptr<int64_t>();
int bytesPerSample = av_get_bytes_per_sample(defaultAudioSampleFormat);
int64_t frameDataOffset = 0;
for (size_t i = 0; i < frames.size(); ++i) {
audioFramePtsData[i] = frames[i]->pts_;
if (audioFrameData) {
memcpy(
audioFrameData + frameDataOffset,
frames[i]->frame_.get(),
frames[i]->frameSize_);
frameDataOffset += (frames[i]->frameSize_ / bytesPerSample);
}
}
torch::Tensor& audioFramePts,
int64_t num,
int64_t den) {
return fillTensor<float>(msgs, audioFrame, audioFramePts, num, den);
}
void getAudioMeta(
DecoderOutput& decoderOutput,
int64_t& numSamples,
int64_t& channels,
int64_t& numFrames) {
FormatUnion& audioFormat = decoderOutput.media_data_[TYPE_AUDIO].format_;
channels = audioFormat.audio.channels;
CHECK_EQ(audioFormat.audio.format, AV_SAMPLE_FMT_FLT);
int bytesPerSample = av_get_bytes_per_sample(
static_cast<AVSampleFormat>(audioFormat.audio.format));
// auto& audioFrames = decoderOutput.media_frames_[TYPE_AUDIO];
auto& audioFrames = decoderOutput.media_data_[TYPE_AUDIO].frames_;
numFrames = audioFrames.size();
int64_t frameSizeTotal = 0;
for (auto const& decodedFrame : audioFrames) {
frameSizeTotal += static_cast<int64_t>(decodedFrame->frameSize_);
void offsetsToUs(
double& seekFrameMargin,
int64_t readVideoStream,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
int64_t videoTimeBaseDen,
int64_t readAudioStream,
int64_t audioStartPts,
int64_t audioEndPts,
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen,
int64_t& videoStartUs,
int64_t& videoEndUs) {
seekFrameMargin *= AV_TIME_BASE;
videoStartUs = 0;
videoEndUs = -1;
if (readVideoStream) {
AVRational vr = {(int)videoTimeBaseNum, (int)videoTimeBaseDen};
if (videoStartPts > 0) {
videoStartUs = av_rescale_q(videoStartPts, vr, AV_TIME_BASE_Q);
}
if (videoEndPts > 0) {
// Add jitter to the end of the range to avoid conversion/rounding error.
// Small value 100us won't be enough to select the next frame, but enough
// to compensate rounding error due to the multiple conversions.
videoEndUs =
timeBaseJitterUs + av_rescale_q(videoEndPts, vr, AV_TIME_BASE_Q);
}
} else if (readAudioStream) {
AVRational ar = {(int)audioTimeBaseNum, (int)audioTimeBaseDen};
if (audioStartPts > 0) {
videoStartUs = av_rescale_q(audioStartPts, ar, AV_TIME_BASE_Q);
}
if (audioEndPts > 0) {
// Add jitter to the end of the range to avoid conversion/rounding error.
// Small value 100us won't be enough to select the next frame, but enough
// to compensate rounding error due to the multiple conversions.
videoEndUs =
timeBaseJitterUs + av_rescale_q(audioEndPts, ar, AV_TIME_BASE_Q);
}
}
VLOG(2) << "numFrames: " << numFrames;
VLOG(2) << "frameSizeTotal: " << frameSizeTotal;
VLOG(2) << "channels: " << channels;
VLOG(2) << "bytesPerSample: " << bytesPerSample;
CHECK_EQ(frameSizeTotal % (channels * bytesPerSample), 0);
numSamples = frameSizeTotal / (channels * bytesPerSample);
}
torch::List<torch::Tensor> readVideo(
@@ -154,6 +192,7 @@ torch::List<torch::Tensor> readVideo(
int64_t width,
int64_t height,
int64_t minDimension,
int64_t maxDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
@@ -165,38 +204,92 @@ torch::List<torch::Tensor> readVideo(
int64_t audioEndPts,
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen) {
unique_ptr<DecoderParameters> params = util::getDecoderParams(
int64_t videoStartUs, videoEndUs;
offsetsToUs(
seekFrameMargin,
getPtsOnly,
readVideoStream,
width,
height,
minDimension,
videoStartPts,
videoEndPts,
videoTimeBaseNum,
videoTimeBaseDen,
readAudioStream,
audioSamples,
audioChannels,
audioStartPts,
audioEndPts,
audioTimeBaseNum,
audioTimeBaseDen);
FfmpegDecoder decoder;
DecoderOutput decoderOutput;
audioTimeBaseDen,
videoStartUs,
videoEndUs);
DecoderParameters params = getDecoderParams(
videoStartUs, // videoStartPts
videoEndUs, // videoEndPts
seekFrameMargin, // seekFrameMargin
getPtsOnly, // getPtsOnly
readVideoStream, // readVideoStream
width, // width
height, // height
minDimension, // minDimension
maxDimension, // maxDimension
readAudioStream, // readAudioStream
audioSamples, // audioSamples
audioChannels // audioChannels
);
SyncDecoder decoder;
std::vector<DecoderOutputMessage> audioMessages, videoMessages;
DecoderInCallback callback = nullptr;
std::string logMessage, logType;
if (isReadFile) {
decoder.decodeFile(std::move(params), videoPath, decoderOutput);
params.uri = videoPath;
logType = "file";
logMessage = videoPath;
} else {
decoder.decodeMemory(
std::move(params),
input_video.data_ptr<uint8_t>(),
input_video.size(0),
decoderOutput);
callback = MemoryBuffer::getCallback(
input_video.data_ptr<uint8_t>(), input_video.size(0));
logType = "memory";
logMessage = std::to_string(input_video.size(0));
}
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] has started";
const auto now = std::chrono::system_clock::now();
bool succeeded;
DecoderMetadata audioMetadata, videoMetadata;
std::vector<DecoderMetadata> metadata;
if ((succeeded = decoder.init(params, std::move(callback), &metadata))) {
for (const auto& header : metadata) {
if (header.format.type == TYPE_VIDEO) {
videoMetadata = header;
} else if (header.format.type == TYPE_AUDIO) {
audioMetadata = header;
}
}
int res;
DecoderOutputMessage msg;
while (0 == (res = decoder.decode(&msg, decoderTimeoutMs))) {
if (msg.header.format.type == TYPE_VIDEO) {
videoMessages.push_back(std::move(msg));
}
if (msg.header.format.type == TYPE_AUDIO) {
audioMessages.push_back(std::move(msg));
}
msg.payload.reset();
}
} else {
LOG(ERROR) << "Decoder initialization has failed";
}
const auto then = std::chrono::system_clock::now();
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] has finished, "
<< std::chrono::duration_cast<std::chrono::microseconds>(then - now)
.count()
<< " us";
decoder.shutdown();
// video section
torch::Tensor videoFrame = torch::zeros({0}, torch::kByte);
torch::Tensor videoFramePts = torch::zeros({0}, torch::kLong);
@@ -204,37 +297,49 @@ torch::List<torch::Tensor> readVideo(
torch::Tensor videoFps = torch::zeros({0}, torch::kFloat);
torch::Tensor videoDuration = torch::zeros({0}, torch::kLong);
if (readVideoStream == 1) {
auto it = decoderOutput.media_data_.find(TYPE_VIDEO);
if (it != decoderOutput.media_data_.end()) {
int numVideoFrames, outHeight, outWidth, numChannels;
getVideoMeta(
decoderOutput, numVideoFrames, outHeight, outWidth, numChannels);
if (succeeded && readVideoStream == 1) {
if (!videoMessages.empty()) {
const auto& header = videoMetadata;
const auto& format = header.format.format.video;
int numVideoFrames = videoMessages.size();
int outHeight = format.height;
int outWidth = format.width;
int numChannels = 3; // decoder guarantees the default AV_PIX_FMT_RGB24
size_t expectedWrittenBytes = 0;
if (getPtsOnly == 0) {
videoFrame = torch::zeros(
{numVideoFrames, outHeight, outWidth, numChannels}, torch::kByte);
expectedWrittenBytes =
numVideoFrames * outHeight * outWidth * numChannels;
}
videoFramePts = torch::zeros({numVideoFrames}, torch::kLong);
fillVideoTensor(
decoderOutput.media_data_[TYPE_VIDEO].frames_,
videoFrame,
videoFramePts);
VLOG(2) << "video duration: " << header.duration
<< ", fps: " << header.fps << ", num: " << header.num
<< ", den: " << header.den << ", num frames: " << numVideoFrames;
auto numberWrittenBytes = fillVideoTensor(
videoMessages, videoFrame, videoFramePts, header.num, header.den);
CHECK_EQ(numberWrittenBytes, expectedWrittenBytes);
videoTimeBase = torch::zeros({2}, torch::kInt);
int* videoTimeBaseData = videoTimeBase.data_ptr<int>();
videoTimeBaseData[0] = it->second.format_.video.timeBaseNum;
videoTimeBaseData[1] = it->second.format_.video.timeBaseDen;
videoTimeBaseData[0] = header.num;
videoTimeBaseData[1] = header.den;
videoFps = torch::zeros({1}, torch::kFloat);
float* videoFpsData = videoFps.data_ptr<float>();
videoFpsData[0] = it->second.format_.video.fps;
videoFpsData[0] = header.fps;
videoDuration = torch::zeros({1}, torch::kLong);
int64_t* videoDurationData = videoDuration.data_ptr<int64_t>();
videoDurationData[0] = it->second.format_.video.duration;
AVRational vr = {(int)header.num, (int)header.den};
videoDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, vr);
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] filled video tensors";
} else {
VLOG(1) << "Miss video stream";
}
@@ -246,39 +351,57 @@ torch::List<torch::Tensor> readVideo(
torch::Tensor audioTimeBase = torch::zeros({0}, torch::kInt);
torch::Tensor audioSampleRate = torch::zeros({0}, torch::kInt);
torch::Tensor audioDuration = torch::zeros({0}, torch::kLong);
if (readAudioStream == 1) {
auto it = decoderOutput.media_data_.find(TYPE_AUDIO);
if (it != decoderOutput.media_data_.end()) {
VLOG(1) << "Find audio stream";
int64_t numAudioSamples = 0, outAudioChannels = 0, numAudioFrames = 0;
getAudioMeta(
decoderOutput, numAudioSamples, outAudioChannels, numAudioFrames);
VLOG(2) << "numAudioSamples: " << numAudioSamples;
VLOG(2) << "outAudioChannels: " << outAudioChannels;
VLOG(2) << "numAudioFrames: " << numAudioFrames;
if (succeeded && readAudioStream == 1) {
if (!audioMessages.empty()) {
const auto& header = audioMetadata;
const auto& format = header.format.format.audio;
int64_t outAudioChannels = format.channels;
int bytesPerSample =
av_get_bytes_per_sample(static_cast<AVSampleFormat>(format.format));
int numAudioFrames = audioMessages.size();
int64_t numAudioSamples = 0;
if (getPtsOnly == 0) {
int64_t frameSizeTotal = 0;
for (auto const& audioMessage : audioMessages) {
frameSizeTotal += audioMessage.payload->length();
}
CHECK_EQ(frameSizeTotal % (outAudioChannels * bytesPerSample), 0);
numAudioSamples = frameSizeTotal / (outAudioChannels * bytesPerSample);
audioFrame =
torch::zeros({numAudioSamples, outAudioChannels}, torch::kFloat);
}
audioFramePts = torch::zeros({numAudioFrames}, torch::kLong);
fillAudioTensor(
decoderOutput.media_data_[TYPE_AUDIO].frames_,
audioFrame,
audioFramePts);
VLOG(2) << "audio duration: " << header.duration
<< ", channels: " << format.channels
<< ", sample rate: " << format.samples << ", num: " << header.num
<< ", den: " << header.den;
auto numberWrittenBytes = fillAudioTensor(
audioMessages, audioFrame, audioFramePts, header.num, header.den);
CHECK_EQ(
numberWrittenBytes,
numAudioSamples * outAudioChannels * sizeof(float));
audioTimeBase = torch::zeros({2}, torch::kInt);
int* audioTimeBaseData = audioTimeBase.data_ptr<int>();
audioTimeBaseData[0] = it->second.format_.audio.timeBaseNum;
audioTimeBaseData[1] = it->second.format_.audio.timeBaseDen;
audioTimeBaseData[0] = header.num;
audioTimeBaseData[1] = header.den;
audioSampleRate = torch::zeros({1}, torch::kInt);
int* audioSampleRateData = audioSampleRate.data_ptr<int>();
audioSampleRateData[0] = it->second.format_.audio.samples;
audioSampleRateData[0] = format.samples;
audioDuration = torch::zeros({1}, torch::kLong);
int64_t* audioDurationData = audioDuration.data_ptr<int64_t>();
audioDurationData[0] = it->second.format_.audio.duration;
AVRational ar = {(int)header.num, (int)header.den};
audioDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, ar);
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] filled audio tensors";
} else {
VLOG(1) << "Miss audio stream";
}
@@ -296,6 +419,9 @@ torch::List<torch::Tensor> readVideo(
result.push_back(std::move(audioSampleRate));
result.push_back(std::move(audioDuration));
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] about to return";
return result;
}
@@ -307,6 +433,7 @@ torch::List<torch::Tensor> readVideoFromMemory(
int64_t width,
int64_t height,
int64_t minDimension,
int64_t maxDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
@@ -328,6 +455,7 @@ torch::List<torch::Tensor> readVideoFromMemory(
width,
height,
minDimension,
maxDimension,
videoStartPts,
videoEndPts,
videoTimeBaseNum,
@@ -349,6 +477,7 @@ torch::List<torch::Tensor> readVideoFromFile(
int64_t width,
int64_t height,
int64_t minDimension,
int64_t maxDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
@@ -371,6 +500,7 @@ torch::List<torch::Tensor> readVideoFromFile(
width,
height,
minDimension,
maxDimension,
videoStartPts,
videoEndPts,
videoTimeBaseNum,
@@ -388,59 +518,96 @@ torch::List<torch::Tensor> probeVideo(
bool isReadFile,
const torch::Tensor& input_video,
std::string videoPath) {
unique_ptr<DecoderParameters> params = util::getDecoderParams(
DecoderParameters params = getDecoderParams(
0, // videoStartUs
-1, // videoEndUs
0, // seekFrameMargin
0, // getPtsOnly
1, // getPtsOnly
1, // readVideoStream
0, // width
0, // height
0, // minDimension
0, // videoStartPts
0, // videoEndPts
0, // videoTimeBaseNum
1, // videoTimeBaseDen
0, // maxDimension
1, // readAudioStream
0, // audioSamples
0, // audioChannels
0, // audioStartPts
0, // audioEndPts
0, // audioTimeBaseNum
1 // audioTimeBaseDen
0 // audioChannels
);
FfmpegDecoder decoder;
DecoderOutput decoderOutput;
SyncDecoder decoder;
DecoderInCallback callback = nullptr;
std::string logMessage, logType;
if (isReadFile) {
decoder.probeFile(std::move(params), videoPath, decoderOutput);
params.uri = videoPath;
logType = "file";
logMessage = videoPath;
} else {
callback = MemoryBuffer::getCallback(
input_video.data_ptr<uint8_t>(), input_video.size(0));
logType = "memory";
logMessage = std::to_string(input_video.size(0));
}
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] has started";
const auto now = std::chrono::system_clock::now();
bool succeeded;
bool gotAudio = false, gotVideo = false;
DecoderMetadata audioMetadata, videoMetadata;
std::vector<DecoderMetadata> metadata;
if ((succeeded = decoder.init(params, std::move(callback), &metadata))) {
for (const auto& header : metadata) {
if (header.format.type == TYPE_VIDEO) {
gotVideo = true;
videoMetadata = header;
} else if (header.format.type == TYPE_AUDIO) {
gotAudio = true;
audioMetadata = header;
}
}
const auto then = std::chrono::system_clock::now();
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] has finished, "
<< std::chrono::duration_cast<std::chrono::microseconds>(then - now)
.count()
<< " us";
} else {
decoder.probeMemory(
std::move(params),
input_video.data_ptr<uint8_t>(),
input_video.size(0),
decoderOutput);
LOG(ERROR) << "Decoder initialization has failed";
}
decoder.shutdown();
// video section
torch::Tensor videoTimeBase = torch::zeros({0}, torch::kInt);
torch::Tensor videoFps = torch::zeros({0}, torch::kFloat);
torch::Tensor videoDuration = torch::zeros({0}, torch::kLong);
auto it = decoderOutput.media_data_.find(TYPE_VIDEO);
if (it != decoderOutput.media_data_.end()) {
VLOG(1) << "Find video stream";
if (succeeded && gotVideo) {
videoTimeBase = torch::zeros({2}, torch::kInt);
int* videoTimeBaseData = videoTimeBase.data_ptr<int>();
videoTimeBaseData[0] = it->second.format_.video.timeBaseNum;
videoTimeBaseData[1] = it->second.format_.video.timeBaseDen;
const auto& header = videoMetadata;
const auto& media = header.format;
videoTimeBaseData[0] = header.num;
videoTimeBaseData[1] = header.den;
videoFps = torch::zeros({1}, torch::kFloat);
float* videoFpsData = videoFps.data_ptr<float>();
videoFpsData[0] = it->second.format_.video.fps;
videoFpsData[0] = header.fps;
videoDuration = torch::zeros({1}, torch::kLong);
int64_t* videoDurationData = videoDuration.data_ptr<int64_t>();
videoDurationData[0] = it->second.format_.video.duration;
AVRational avr = {(int)header.num, (int)header.den};
videoDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, avr);
VLOG(2) << "Prob fps: " << header.fps << ", duration: " << header.duration
<< ", num: " << header.num << ", den: " << header.den;
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] filled video tensors";
} else {
VLOG(1) << "Miss video stream";
LOG(ERROR) << "Miss video stream";
}
// audio section
@@ -448,21 +615,31 @@ torch::List<torch::Tensor> probeVideo(
torch::Tensor audioSampleRate = torch::zeros({0}, torch::kInt);
torch::Tensor audioDuration = torch::zeros({0}, torch::kLong);
it = decoderOutput.media_data_.find(TYPE_AUDIO);
if (it != decoderOutput.media_data_.end()) {
VLOG(1) << "Find audio stream";
if (succeeded && gotAudio) {
audioTimeBase = torch::zeros({2}, torch::kInt);
int* audioTimeBaseData = audioTimeBase.data_ptr<int>();
audioTimeBaseData[0] = it->second.format_.audio.timeBaseNum;
audioTimeBaseData[1] = it->second.format_.audio.timeBaseDen;
const auto& header = audioMetadata;
const auto& media = header.format;
const auto& format = media.format.audio;
audioTimeBaseData[0] = header.num;
audioTimeBaseData[1] = header.den;
audioSampleRate = torch::zeros({1}, torch::kInt);
int* audioSampleRateData = audioSampleRate.data_ptr<int>();
audioSampleRateData[0] = it->second.format_.audio.samples;
audioSampleRateData[0] = format.samples;
audioDuration = torch::zeros({1}, torch::kLong);
int64_t* audioDurationData = audioDuration.data_ptr<int64_t>();
audioDurationData[0] = it->second.format_.audio.duration;
AVRational avr = {(int)header.num, (int)header.den};
audioDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, avr);
VLOG(2) << "Prob sample rate: " << format.samples
<< ", duration: " << header.duration << ", num: " << header.num
<< ", den: " << header.den;
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] filled audio tensors";
} else {
VLOG(1) << "Miss audio stream";
}
@@ -475,6 +652,9 @@ torch::List<torch::Tensor> probeVideo(
result.push_back(std::move(audioSampleRate));
result.push_back(std::move(audioDuration));
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] is about to return";
return result;
}
#pragma once
#include <torch/script.h>
// Interface for Python
/*
return:
videoFrame: tensor (N, H, W, C) kByte
videoFramePts: tensor (N) kLong
videoTimeBase: tensor (2) kInt
videoFps: tensor (1) kFloat
audioFrame: tensor (N, C) kFloat
audioFramePts: tensor (N) kLong
audioTimeBase: tensor (2) kInt
audioSampleRate: tensor (1) kInt
*/
torch::List<torch::Tensor> readVideoFromMemory(
// 1D tensor of data type uint8, storing the compressed video data
torch::Tensor input_video,
// seeking a frame in the video/audio stream is imprecise, so seek to a
// timestamp earlier by a margin. The unit of the margin is seconds
double seekFrameMargin,
// If only pts is needed and video/audio frames are not needed, set it
// to 1
int64_t getPtsOnly,
// bool variable. Set it to 1 if video stream should be read. Otherwise, set
// it to 0
int64_t readVideoStream,
/*
Valid parameter values for rescaling video frames
___________________________________________________
| width | height | min_dimension | algorithm |
|_________________________________________________|
| 0 | 0 | 0 | original |
|_________________________________________________|
| 0 | 0 | >0 |scale to min dimension|
|_________________________________________________|
| >0 | 0 | 0 | scale keeping W |
|_________________________________________________|
| 0 | >0 | 0 | scale keeping H |
|_________________________________________________|
| >0 | >0 | 0 | stretch/scale |
|_________________________________________________|
*/
int64_t width,
int64_t height,
int64_t minDimension,
// video frames with pts in [videoStartPts, videoEndPts] will be decoded
// For decoding all video frames, use [0, -1]
int64_t videoStartPts,
int64_t videoEndPts,
// numerator and denominator of time base of video stream.
// For decoding all video frames, supply dummy 0 (numerator) and 1
// (denominator). For decoding localized video frames, they need to be
// supplied and will be checked during decoding
int64_t videoTimeBaseNum,
int64_t videoTimeBaseDen,
// bool variable. Set it to 1 if audio stream should be read. Otherwise, set
// it to 0
int64_t readAudioStream,
// audio stream sampling rate.
// If not resampling audio waveform, supply 0
// Otherwise, supply a positive integer.
int64_t audioSamples,
// audio stream channels
// Supply 0 to use the same number of channels as in the original audio
// stream
int64_t audioChannels,
// audio frames with pts in [audioStartPts, audioEndPts] will be decoded
// For decoding all audio frames, use [0, -1]
int64_t audioStartPts,
int64_t audioEndPts,
// numerator and denominator of time base of audio stream.
// For decoding all audio frames, supply dummy 0 (numerator) and 1
// (denominator). For decoding localized audio frames, they need to be
// supplied and will be checked during decoding
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen);
torch::List<torch::Tensor> readVideoFromFile(
std::string videoPath,
double seekFrameMargin,
int64_t getPtsOnly,
int64_t readVideoStream,
int64_t width,
int64_t height,
int64_t minDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
int64_t videoTimeBaseDen,
int64_t readAudioStream,
int64_t audioSamples,
int64_t audioChannels,
int64_t audioStartPts,
int64_t audioEndPts,
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen);
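For reference, a hedged C++ sketch of calling the declaration above with its documented defaults (decode every video and audio frame at the original resolution). This uses the older signature shown above, before video_max_dimension was added, and the include path is an assumption:

#include <string>
#include "video_reader.h" // assumed header name for the declarations above

void sketchReadWholeVideo(const std::string& videoPath) {
  torch::List<torch::Tensor> result = readVideoFromFile(
      videoPath,
      /*seekFrameMargin=*/0.25,
      /*getPtsOnly=*/0,
      /*readVideoStream=*/1,
      /*width=*/0,
      /*height=*/0,
      /*minDimension=*/0,
      /*videoStartPts=*/0,
      /*videoEndPts=*/-1,
      /*videoTimeBaseNum=*/0,
      /*videoTimeBaseDen=*/1,
      /*readAudioStream=*/1,
      /*audioSamples=*/0,
      /*audioChannels=*/0,
      /*audioStartPts=*/0,
      /*audioEndPts=*/-1,
      /*audioTimeBaseNum=*/0,
      /*audioTimeBaseDen=*/1);
  torch::Tensor videoFrames = result.get(0); // (N, H, W, C) uint8 frames
  torch::Tensor videoPts = result.get(1);    // (N) int64 presentation timestamps
}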
#include "util.h"
using namespace std;
namespace util {
unique_ptr<DecoderParameters> getDecoderParams(
double seekFrameMargin,
int64_t getPtsOnly,
int64_t readVideoStream,
int videoWidth,
int videoHeight,
int videoMinDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int videoTimeBaseNum,
int videoTimeBaseDen,
int64_t readAudioStream,
int audioSamples,
int audioChannels,
int64_t audioStartPts,
int64_t audioEndPts,
int audioTimeBaseNum,
int audioTimeBaseDen) {
unique_ptr<DecoderParameters> params = make_unique<DecoderParameters>();
if (readVideoStream == 1) {
params->formats.emplace(
MediaType::TYPE_VIDEO, MediaFormat(MediaType::TYPE_VIDEO));
MediaFormat& videoFormat = params->formats[MediaType::TYPE_VIDEO];
videoFormat.format.video.width = videoWidth;
videoFormat.format.video.height = videoHeight;
videoFormat.format.video.minDimension = videoMinDimension;
videoFormat.format.video.startPts = videoStartPts;
videoFormat.format.video.endPts = videoEndPts;
videoFormat.format.video.timeBaseNum = videoTimeBaseNum;
videoFormat.format.video.timeBaseDen = videoTimeBaseDen;
}
if (readAudioStream == 1) {
params->formats.emplace(
MediaType::TYPE_AUDIO, MediaFormat(MediaType::TYPE_AUDIO));
MediaFormat& audioFormat = params->formats[MediaType::TYPE_AUDIO];
audioFormat.format.audio.samples = audioSamples;
audioFormat.format.audio.channels = audioChannels;
audioFormat.format.audio.startPts = audioStartPts;
audioFormat.format.audio.endPts = audioEndPts;
audioFormat.format.audio.timeBaseNum = audioTimeBaseNum;
audioFormat.format.audio.timeBaseDen = audioTimeBaseDen;
}
params->seekFrameMargin = seekFrameMargin;
params->getPtsOnly = getPtsOnly;
return params;
}
} // namespace util
#pragma once
#include <memory>
#include "FfmpegDecoder.h"
namespace util {
std::unique_ptr<DecoderParameters> getDecoderParams(
double seekFrameMargin,
int64_t getPtsOnly,
int64_t readVideoStream,
int videoWidth,
int videoHeight,
int videoMinDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int videoTimeBaseNum,
int videoTimeBaseDen,
int64_t readAudioStream,
int audioSamples,
int audioChannels,
int64_t audioStartPts,
int64_t audioEndPts,
int audioTimeBaseNum,
int audioTimeBaseDen);
} // namespace util
@@ -98,6 +98,7 @@ class VideoClips(object):
_video_width=0,
_video_height=0,
_video_min_dimension=0,
_video_max_dimension=0,
_audio_samples=0,
_audio_channels=0,
):
@@ -109,6 +110,7 @@ class VideoClips(object):
self._video_width = _video_width
self._video_height = _video_height
self._video_min_dimension = _video_min_dimension
self._video_max_dimension = _video_max_dimension
self._audio_samples = _audio_samples
self._audio_channels = _audio_channels
@@ -179,6 +181,7 @@ class VideoClips(object):
_video_width=self._video_width,
_video_height=self._video_height,
_video_min_dimension=self._video_min_dimension,
_video_max_dimension=self._video_max_dimension,
_audio_samples=self._audio_samples,
_audio_channels=self._audio_channels,
)
@@ -299,6 +302,10 @@ class VideoClips(object):
raise ValueError(
"pyav backend doesn't support _video_min_dimension != 0"
)
if self._video_max_dimension != 0:
raise ValueError(
"pyav backend doesn't support _video_max_dimension != 0"
)
if self._audio_samples != 0:
raise ValueError("pyav backend doesn't support _audio_samples != 0")
@@ -335,6 +342,7 @@ class VideoClips(object):
video_width=self._video_width,
video_height=self._video_height,
video_min_dimension=self._video_min_dimension,
video_max_dimension=self._video_max_dimension,
video_pts_range=(video_start_pts, video_end_pts),
video_timebase=video_timebase,
audio_samples=self._audio_samples,
@@ -138,6 +138,7 @@ def _read_video_from_file(
video_width=0,
video_height=0,
video_min_dimension=0,
video_max_dimension=0,
video_pts_range=(0, -1),
video_timebase=default_timebase,
read_audio_stream=True,
@@ -155,21 +156,34 @@ def _read_video_from_file(
filename : str
path to the video file
seek_frame_margin: double, optional
seeking frame in the stream is imprecise. Thus, when video_start_pts is specified,
we seek the pts earlier by seek_frame_margin seconds
seeking frame in the stream is imprecise. Thus, when video_start_pts
is specified, we seek the pts earlier by seek_frame_margin seconds
read_video_stream: int, optional
whether read video stream. If yes, set to 1. Otherwise, 0
video_width/video_height/video_min_dimension: int
video_width/video_height/video_min_dimension/video_max_dimension: int
together decide the size of decoded frames
- when video_width = 0, video_height = 0, and video_min_dimension = 0, keep the original frame resolution
- when video_width = 0, video_height = 0, and video_min_dimension != 0, keep the aspect ratio and resize
the frame so that shorter edge size is video_min_dimension
- When video_width = 0, and video_height != 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_height
- When video_width != 0, and video_height == 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_width
- When video_width != 0, and video_height != 0, resize the frame so that frame video_width and video_height
are set to $video_width and $video_height, respectively
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the original frame resolution
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension = 0, keep the aspect ratio and resize the
frame so that shorter edge size is video_min_dimension
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension != 0, keep the aspect ratio and resize
the frame so that longer edge size is video_max_dimension
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension != 0, resize the frame so that shorter
edge size is video_min_dimension, and longer edge size is
video_max_dimension. The aspect ratio may not be preserved
- When video_width = 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_height is $video_height
- When video_width != 0, video_height == 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_width is $video_width
- When video_width != 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, resize the frame so that frame
video_width and video_height are set to $video_width and
$video_height, respectively
video_pts_range : list(int), optional
the start and end presentation timestamp of video stream
video_timebase: Fraction, optional
@@ -207,6 +221,7 @@ def _read_video_from_file(
video_width,
video_height,
video_min_dimension,
video_max_dimension,
video_pts_range[0],
video_pts_range[1],
video_timebase.numerator,
@@ -244,6 +259,7 @@ def _read_video_timestamps_from_file(filename):
0, # video_width
0, # video_height
0, # video_min_dimension
0, # video_max_dimension
0, # video_start_pts
-1, # video_end_pts
0, # video_timebase_num
@@ -282,6 +298,7 @@ def _read_video_from_memory(
video_width=0, # type: int
video_height=0, # type: int
video_min_dimension=0, # type: int
video_max_dimension=0, # type: int
video_pts_range=(0, -1), # type: List[int]
video_timebase_numerator=0, # type: int
video_timebase_denominator=1, # type: int
@@ -307,17 +324,30 @@ def _read_video_from_memory(
we seek the pts earlier by seek_frame_margin seconds
read_video_stream: int, optional
whether read video stream. If yes, set to 1. Otherwise, 0
video_width/video_height/video_min_dimension: int
video_width/video_height/video_min_dimension/video_max_dimension: int
together decide the size of decoded frames
- when video_width = 0, video_height = 0, and video_min_dimension = 0, keep the original frame resolution
- when video_width = 0, video_height = 0, and video_min_dimension != 0, keep the aspect ratio and resize
the frame so that shorter edge size is video_min_dimension
- When video_width = 0, and video_height != 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_height
- When video_width != 0, and video_height == 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_width
- When video_width != 0, and video_height != 0, resize the frame so that frame video_width and video_height
are set to $video_width and $video_height, respectively
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the original frame resolution
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension = 0, keep the aspect ratio and resize the
frame so that shorter edge size is video_min_dimension
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension != 0, keep the aspect ratio and resize
the frame so that longer edge size is video_max_dimension
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension != 0, resize the frame so that shorter
edge size is video_min_dimension, and longer edge size is
video_max_dimension. The aspect ratio may not be preserved
- When video_width = 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_height is $video_height
- When video_width != 0, video_height == 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_width is $video_width
- When video_width != 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, resize the frame so that frame
video_width and video_height are set to $video_width and
$video_height, respectively
video_pts_range : list(int), optional
the start and end presentation timestamp of video stream
video_timebase_numerator / video_timebase_denominator: optional
@@ -353,6 +383,7 @@ def _read_video_from_memory(
video_width,
video_height,
video_min_dimension,
video_max_dimension,
video_pts_range[0],
video_pts_range[1],
video_timebase_numerator,
@@ -394,6 +425,7 @@ def _read_video_timestamps_from_memory(video_data):
0, # video_width
0, # video_height
0, # video_min_dimension
0, # video_max_dimension
0, # video_start_pts
-1, # video_end_pts
0, # video_timebase_num
@@ -370,15 +370,17 @@ def read_video_from_memory(
audio_pts_range=(0, -1), # type: List[int]
audio_timebase_numerator=0, # type: int
audio_timebase_denominator=1, # type: int
video_max_dimension=0, # type: int
):
# type: (...) -> Tuple[torch.Tensor, torch.Tensor]
return _video_opt._read_video_from_memory(
video_data,
seek_frame_margin,
read_audio_stream,
read_video_stream,
video_width,
video_height,
video_min_dimension,
video_max_dimension,
video_pts_range,
video_timebase_numerator,
video_timebase_denominator,