Unverified commit 32e16805 authored by Francisco Massa, committed by GitHub

Update video reader to use new decoder (#1978)

* Base decoder for video. (#1747)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1747

Pull Request resolved: https://github.com/pytorch/vision/pull/1746

Added an implementation of an FFmpeg-based decoder with functionality that can be used in both VUE and TorchVision.

Reviewed By: fmassa

Differential Revision: D19358914

fbshipit-source-id: abb672f89bfaca6351dda2354f0d35cf8e47fa0f

* Integrated base decoder into VideoReader class and video_utils.py (#1766)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1766

Replaced FfmpegDecoder (incompatible with VUE) with the base decoder (compatible with VUE).
Modified the Python utilities in video_utils.py for internal simplification. The public interface is preserved.

Reviewed By: fmassa

Differential Revision: D19415903

fbshipit-source-id: 4d7a0158bd77bac0a18732fe4183fdd9a57f6402

* Optimized base decoder performance. (#1852)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1852

Changed base decoder internals for faster clip processing.

Reviewed By: stephenyan1231

Differential Revision: D19748379

fbshipit-source-id: 58a435f0a0b25545e7bd1a3edb0b1d558176a806

* Minor fix and protected access to decoder class members.

Summary:
Found and fixed a bug in the cropping algorithm (a simple typo).
Derived classes also need access to some decoder class members, such as the initialization parameters, so those members are now protected.

Reviewed By: stephenyan1231, fmassa

Differential Revision: D19895076

fbshipit-source-id: 691336c8e18526b085ae5792ac3546bc387a6db9

* Added missing headers to reduce dependencies. (#1898)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1898

The stream/sampler includes shouldn't depend on decoder headers. Add the dependencies directly where they are required.

Reviewed By: stephenyan1231

Differential Revision: D19911404

fbshipit-source-id: ef322a053708405c02cee4562b456b1602fb12fc

* Implemented VUE Asynchronous Decoder

Summary: For Mothership we have found that an asynchronous decoder provides better performance.

Differential Revision: D20026194

fbshipit-source-id: 627b91844b4e3f917002031dd32cb19c239f4ba8

* Fix a bug in the read_video_from_memory API (#1942)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1942

D18720474 introduced a bug in the `read_video_from_memory` API. Thanks to weiyaowang for reporting it.

Reviewed By: weiyaowang

Differential Revision: D20270179

fbshipit-source-id: 66348c99a5ad1f9129b90e934524ddfaad59de03

* Extend decoder to support the new video_max_dimension argument (#1924)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1924

Extend the `video reader` decoder Python API in TorchVision to support a new argument, `video_max_dimension`. This enables new video decoding use cases. When setting `video_width=0`, `video_height=0`, `video_min_dimension != 0`, and `video_max_dimension != 0`, we can rescale the video clips so that their spatial resolution (height, width) becomes
 - (video_min_dimension, video_max_dimension) if original height < original width
 - (video_max_dimension, video_min_dimension) if original height >= original width

This is useful at the video model testing stage, where we perform fully convolutional evaluation and take entire video frames, without cropping, as input. Previously we could only set, for instance, `video_width=0`, `video_height=0`, `video_min_dimension = 128`, which preserves the aspect ratio. In production datasets there are a small number of videos whose aspect ratio is either extremely large or extremely small, and when the shorter edge is rescaled to 128, the longer edge is still large. This easily causes GPU out-of-memory (OOM) errors when we sample multiple video clips and put them in a single minibatch.

Now we can set, for instance, `video_width=0`, `video_height=0`, `video_min_dimension = 128`, and `video_max_dimension = 171` so that the rescaled resolution is either (128, 171) or (171, 128), depending on whether the original height is larger than the original width. Thus, we are much less likely to hit GPU OOM because the spatial size of the video clips is fixed.
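For illustration only, here is a minimal C++ sketch of the (height, width) selection described above. This is not the decoder's actual implementation, and the helper name rescaleWithMinMax is hypothetical:

#include <utility>

// Hypothetical helper mirroring the rule above: when both dimensions are
// requested, the shorter edge maps to video_min_dimension and the longer
// edge to video_max_dimension. Returns {height, width}.
std::pair<int, int> rescaleWithMinMax(
    int srcWidth, int srcHeight, int minDimension, int maxDimension) {
  if (srcHeight < srcWidth) {
    return {minDimension, maxDimension}; // landscape: (128, 171) in the example
  }
  return {maxDimension, minDimension}; // portrait/square: (171, 128)
}
// e.g. a 1280x720 (WxH) clip with min=128 and max=171 is decoded as 171x128,
// while a 720x1280 clip is decoded as 128x171.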

Reviewed By: putivsky

Differential Revision: D20182529

fbshipit-source-id: f9c40afb7590e7c45e6908946597141efa35f57c

* Fix samplers initialization (#1967)

Summary:
Pull Request resolved: https://github.com/pytorch/vision/pull/1967



No-op for the torchvision diff; this fixes the samplers.

Differential Revision: D20397218

fbshipit-source-id: 6dc4d04364f305fbda7ca4f67a25ceecd73d0f20

* Exclude C++ test files
Co-authored-by: Yuri Putivsky <yuri@fb.com>
Co-authored-by: Zhicheng Yan <zyan3@fb.com>
parent 8b9859d3
#pragma once
#include "FfmpegHeaders.h"
#include "Interface.h"
/**
* Class that samples data from an AVFrame
*/
class FfmpegSampler {
public:
virtual ~FfmpegSampler() = default;
// return 0 on success and negative number on failure
virtual int init() = 0;
// sample from the given frame
virtual std::unique_ptr<DecodedFrame> sample(const AVFrame* frame) = 0;
};
#include "FfmpegStream.h"
#include "FfmpegUtil.h"
using namespace std;
// TODO: currently the use of refCount is disabled
static int refCount = 0;
FfmpegStream::FfmpegStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
double seekFrameMargin)
: inputCtx_(inputCtx),
index_(index),
avMediaType_(avMediaType),
seekFrameMargin_(seekFrameMargin) {}
FfmpegStream::~FfmpegStream() {
if (frame_) {
av_frame_free(&frame_);
}
avcodec_free_context(&codecCtx_);
}
int FfmpegStream::openCodecContext() {
VLOG(2) << "stream start_time: " << inputCtx_->streams[index_]->start_time;
auto typeString = av_get_media_type_string(avMediaType_);
AVStream* st = inputCtx_->streams[index_];
auto codec_id = st->codecpar->codec_id;
VLOG(1) << "codec_id: " << codec_id;
AVCodec* codec = avcodec_find_decoder(codec_id);
if (!codec) {
LOG(ERROR) << "avcodec_find_decoder failed for codec_id: " << int(codec_id);
return AVERROR(EINVAL);
}
VLOG(1) << "Succeed to find decoder";
codecCtx_ = avcodec_alloc_context3(codec);
if (!codecCtx_) {
LOG(ERROR) << "avcodec_alloc_context3 fails";
return AVERROR(ENOMEM);
}
int ret;
/* Copy codec parameters from input stream to output codec context */
if ((ret = avcodec_parameters_to_context(codecCtx_, st->codecpar)) < 0) {
LOG(ERROR) << "Failed to copy " << typeString
<< " codec parameters to decoder context";
return ret;
}
AVDictionary* opts = nullptr;
av_dict_set(&opts, "refcounted_frames", refCount ? "1" : "0", 0);
// after avcodec_open2, value of codecCtx_->time_base is NOT meaningful
// But inputCtx_->streams[index_]->time_base has meaningful values
if ((ret = avcodec_open2(codecCtx_, codec, &opts)) < 0) {
LOG(ERROR) << "avcodec_open2 failed. " << ffmpeg_util::getErrorDesc(ret);
return ret;
}
VLOG(1) << "Succeed to open codec";
frame_ = av_frame_alloc();
return initFormat();
}
unique_ptr<DecodedFrame> FfmpegStream::getFrameData(int getPtsOnly) {
if (!codecCtx_) {
LOG(ERROR) << "Codec is not initialized";
return nullptr;
}
if (getPtsOnly) {
unique_ptr<DecodedFrame> decodedFrame = make_unique<DecodedFrame>();
decodedFrame->pts_ = frame_->pts;
return decodedFrame;
} else {
unique_ptr<DecodedFrame> decodedFrame = sampleFrameData();
if (decodedFrame) {
decodedFrame->pts_ = frame_->pts;
}
return decodedFrame;
}
}
void FfmpegStream::flush(int getPtsOnly, DecoderOutput& decoderOutput) {
VLOG(1) << "Media Type: " << getMediaType() << ", flush stream.";
// need to receive frames before entering draining mode
receiveAvailFrames(getPtsOnly, decoderOutput);
VLOG(2) << "send nullptr packet";
sendPacket(nullptr);
// receive remaining frames after entering draining mode
receiveAvailFrames(getPtsOnly, decoderOutput);
avcodec_flush_buffers(codecCtx_);
}
bool FfmpegStream::isFramePtsInRange() {
CHECK(frame_);
auto pts = frame_->pts;
auto startPts = this->getStartPts();
auto endPts = this->getEndPts();
VLOG(2) << "isPtsInRange. pts: " << pts << ", startPts: " << startPts
<< ", endPts: " << endPts;
return (pts == AV_NOPTS_VALUE) ||
(pts >= startPts && (endPts >= 0 ? pts <= endPts : true));
}
bool FfmpegStream::isFramePtsExceedRange() {
if (frame_) {
auto endPts = this->getEndPts();
VLOG(2) << "isFramePtsExceedRange. last_pts_: " << last_pts_
<< ", endPts: " << endPts;
return endPts >= 0 ? last_pts_ >= endPts : false;
} else {
return true;
}
}
// seek a frame
int FfmpegStream::seekFrame(int64_t seekPts) {
// translate margin from second to pts
int64_t margin = (int64_t)(
seekFrameMargin_ * (double)inputCtx_->streams[index_]->time_base.den /
(double)inputCtx_->streams[index_]->time_base.num);
int64_t real_seekPts = (seekPts - margin) > 0 ? (seekPts - margin) : 0;
VLOG(2) << "seek margin: " << margin;
VLOG(2) << "real seekPts: " << real_seekPts;
int ret = av_seek_frame(
inputCtx_,
index_,
(seekPts - margin) > 0 ? (seekPts - margin) : 0,
AVSEEK_FLAG_BACKWARD);
if (ret < 0) {
LOG(WARNING) << "av_seek_frame fails. Stream index: " << index_;
return ret;
}
return 0;
}
// send/receive encoding and decoding API overview
// https://ffmpeg.org/doxygen/3.4/group__lavc__encdec.html
int FfmpegStream::sendPacket(const AVPacket* packet) {
return avcodec_send_packet(codecCtx_, packet);
}
int FfmpegStream::receiveFrame() {
int ret = avcodec_receive_frame(codecCtx_, frame_);
if (ret >= 0) {
// succeed
frame_->pts = av_frame_get_best_effort_timestamp(frame_);
if (frame_->pts == AV_NOPTS_VALUE) {
// Trick: if we can not figure out pts, we just set it to be (last_pts +
// 1)
frame_->pts = last_pts_ + 1;
}
last_pts_ = frame_->pts;
VLOG(2) << "avcodec_receive_frame succeed";
} else if (ret == AVERROR(EAGAIN)) {
VLOG(2) << "avcodec_receive_frame fails and returns AVERROR(EAGAIN). ";
} else if (ret == AVERROR_EOF) {
// no more frame to read
VLOG(2) << "avcodec_receive_frame returns AVERROR_EOF";
} else {
LOG(WARNING) << "avcodec_receive_frame failed. Error: "
<< ffmpeg_util::getErrorDesc(ret);
}
return ret;
}
void FfmpegStream::receiveAvailFrames(
int getPtsOnly,
DecoderOutput& decoderOutput) {
int result = 0;
while ((result = receiveFrame()) >= 0) {
unique_ptr<DecodedFrame> decodedFrame = getFrameData(getPtsOnly);
if (decodedFrame &&
((!getPtsOnly && decodedFrame->frameSize_ > 0) || getPtsOnly)) {
if (isFramePtsInRange()) {
decoderOutput.addMediaFrame(getMediaType(), std::move(decodedFrame));
}
} // end-if
} // end-while
}
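The class above is driven by FFmpeg's send/receive API referenced in the comment. The following is a hedged sketch (not code from this commit) of how a caller might feed packets to a stream and drain it:

#include "FfmpegStream.h"

// Sketch: demux packets for one stream, decode all available frames, then
// flush once the demuxer reports end of file. Assumes `stream` was created
// by a concrete subclass and openCodecContext() has already succeeded.
void sketchDecodeLoop(
    FfmpegStream& stream,
    AVFormatContext* inputCtx,
    DecoderOutput& decoderOutput) {
  AVPacket packet;
  av_init_packet(&packet);
  packet.data = nullptr;
  packet.size = 0;
  while (av_read_frame(inputCtx, &packet) >= 0) {
    if (packet.stream_index == stream.getIndex()) {
      stream.sendPacket(&packet);
      stream.receiveAvailFrames(/*getPtsOnly=*/0, decoderOutput);
    }
    av_packet_unref(&packet);
  }
  // enter draining mode and collect the remaining buffered frames
  stream.flush(/*getPtsOnly=*/0, decoderOutput);
}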
// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
#pragma once
#include <memory>
#include <unordered_map>
#include <utility>
#include "FfmpegHeaders.h"
#include "Interface.h"
/*
Class uses FFMPEG library to decode one media stream (audio or video).
*/
class FfmpegStream {
public:
FfmpegStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
double seekFrameMargin);
virtual ~FfmpegStream();
// returns 0 on success or a negative error code
int openCodecContext();
// returns stream index
int getIndex() const {
return index_;
}
// returns the decoded/sampled frame data
std::unique_ptr<DecodedFrame> getFrameData(int getPtsOnly);
// flush the stream at the end of decoding;
// remaining buffered frames are added to decoderOutput
void flush(int getPtsOnly, DecoderOutput& decoderOutput);
// seek a frame
int seekFrame(int64_t ts);
// send an AVPacket
int sendPacket(const AVPacket* packet);
// receive AVFrame
int receiveFrame();
// receive all available frames from the internal buffer
void receiveAvailFrames(int getPtsOnly, DecoderOutput& decoderOutput);
// return media type
virtual MediaType getMediaType() const = 0;
// return media format
virtual FormatUnion getMediaFormat() const = 0;
// return start presentation timestamp
virtual int64_t getStartPts() const = 0;
// return end presentation timestamp
virtual int64_t getEndPts() const = 0;
// is the pts of most recent frame within range?
bool isFramePtsInRange();
// does the pts of most recent frame exceed range?
bool isFramePtsExceedRange();
protected:
virtual int initFormat() = 0;
// returns a decoded frame
virtual std::unique_ptr<DecodedFrame> sampleFrameData() = 0;
protected:
AVFormatContext* const inputCtx_;
const int index_;
enum AVMediaType avMediaType_;
AVCodecContext* codecCtx_{nullptr};
AVFrame* frame_{nullptr};
// pts of last decoded frame
int64_t last_pts_{0};
double seekFrameMargin_{1.0};
};
#include "FfmpegUtil.h"
using namespace std;
namespace ffmpeg_util {
bool mapFfmpegType(AVMediaType media, MediaType* type) {
switch (media) {
case AVMEDIA_TYPE_VIDEO:
*type = MediaType::TYPE_VIDEO;
return true;
case AVMEDIA_TYPE_AUDIO:
*type = MediaType::TYPE_AUDIO;
return true;
default:
return false;
}
}
bool mapMediaType(MediaType type, AVMediaType* media) {
switch (type) {
case MediaType::TYPE_VIDEO:
*media = AVMEDIA_TYPE_VIDEO;
return true;
case MediaType::TYPE_AUDIO:
*media = AVMEDIA_TYPE_AUDIO;
return true;
default:
return false;
}
}
void setFormatDimensions(
int& destW,
int& destH,
int userW,
int userH,
int srcW,
int srcH,
int minDimension) {
// rounding rules
// int -> double -> round
// round up if fraction is >= 0.5 or round down if fraction is < 0.5
// int result = double(value) + 0.5
// here we round the double to int according to the above rule
if (userW == 0 && userH == 0) {
if (minDimension > 0) { // #2
if (srcW > srcH) {
// landscape
destH = minDimension;
destW = round(double(srcW * minDimension) / srcH);
} else {
// portrait
destW = minDimension;
destH = round(double(srcH * minDimension) / srcW);
}
} else { // #1
destW = srcW;
destH = srcH;
}
} else if (userW != 0 && userH == 0) { // #3
destW = userW;
destH = round(double(srcH * userW) / srcW);
} else if (userW == 0 && userH != 0) { // #4
destW = round(double(srcW * userH) / srcH);
destH = userH;
} else {
// userW != 0 && userH != 0. #5
destW = userW;
destH = userH;
}
// prevent zeros
destW = std::max(destW, 1);
destH = std::max(destH, 1);
}
bool validateVideoFormat(const VideoFormat& f) {
/*
Valid parameter values for the decoder
___________________________________________________
| W | H | minDimension | algorithm |
|_________________________________________________|
| 0 | 0 | 0 | original |
|_________________________________________________|
| 0 | 0 | >0 |scale to min dimension|
|_________________________________________________|
| >0 | 0 | 0 | scale keeping W |
|_________________________________________________|
| 0 | >0 | 0 | scale keeping H |
|_________________________________________________|
| >0 | >0 | 0 | stretch/scale |
|_________________________________________________|
*/
return (f.width == 0 && f.height == 0) || // #1 and #2
(f.width != 0 && f.height != 0 && f.minDimension == 0) || // # 5
(((f.width != 0 && f.height == 0) || // #3 and #4
(f.width == 0 && f.height != 0)) &&
f.minDimension == 0);
}
string getErrorDesc(int errnum) {
array<char, 1024> buffer;
if (av_strerror(errnum, buffer.data(), buffer.size()) < 0) {
return string("Unknown error code");
}
buffer.back() = 0;
return string(buffer.data());
}
} // namespace ffmpeg_util
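As a hedged usage sketch (not part of the library), the rounding rules and parameter table above give the following results for a 1920x1080 source; the main() wrapper is only for demonstration:

#include <iostream>
#include "FfmpegUtil.h"

int main() {
  int w = 0, h = 0;
  // case #2: userW = userH = 0, minDimension = 128 -> the shorter edge becomes
  // 128 and the longer edge keeps the aspect ratio: round(1920 * 128 / 1080) = 228
  ffmpeg_util::setFormatDimensions(w, h, 0, 0, 1920, 1080, 128);
  std::cout << w << "x" << h << std::endl; // prints 228x128
  // case #3: userW = 256, userH = 0 -> width is fixed and height keeps the
  // aspect ratio: round(1080 * 256 / 1920) = 144
  ffmpeg_util::setFormatDimensions(w, h, 256, 0, 1920, 1080, 0);
  std::cout << w << "x" << h << std::endl; // prints 256x144
  return 0;
}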
#pragma once
#include <array>
#include <string>
#include "FfmpegHeaders.h"
#include "Interface.h"
namespace ffmpeg_util {
bool mapFfmpegType(AVMediaType media, enum MediaType* type);
bool mapMediaType(MediaType type, enum AVMediaType* media);
void setFormatDimensions(
int& destW,
int& destH,
int userW,
int userH,
int srcW,
int srcH,
int minDimension);
bool validateVideoFormat(const VideoFormat& f);
std::string getErrorDesc(int errnum);
} // namespace ffmpeg_util
#include "FfmpegVideoSampler.h"
#include "FfmpegUtil.h"
using namespace std;
FfmpegVideoSampler::FfmpegVideoSampler(
const VideoFormat& in,
const VideoFormat& out,
int swsFlags)
: inFormat_(in), outFormat_(out), swsFlags_(swsFlags) {}
FfmpegVideoSampler::~FfmpegVideoSampler() {
if (scaleContext_) {
sws_freeContext(scaleContext_);
scaleContext_ = nullptr;
}
}
int FfmpegVideoSampler::init() {
VLOG(1) << "Input format: width " << inFormat_.width << ", height "
<< inFormat_.height << ", format " << inFormat_.format
<< ", minDimension " << inFormat_.minDimension;
VLOG(1) << "Scale format: width " << outFormat_.width << ", height "
<< outFormat_.height << ", format " << outFormat_.format
<< ", minDimension " << outFormat_.minDimension;
scaleContext_ = sws_getContext(
inFormat_.width,
inFormat_.height,
(AVPixelFormat)inFormat_.format,
outFormat_.width,
outFormat_.height,
static_cast<AVPixelFormat>(outFormat_.format),
swsFlags_,
nullptr,
nullptr,
nullptr);
if (scaleContext_) {
return 0;
} else {
return -1;
}
}
int32_t FfmpegVideoSampler::getImageBytes() const {
return av_image_get_buffer_size(
(AVPixelFormat)outFormat_.format, outFormat_.width, outFormat_.height, 1);
}
// https://ffmpeg.org/doxygen/3.4/scaling_video_8c-example.html#a10
unique_ptr<DecodedFrame> FfmpegVideoSampler::sample(const AVFrame* frame) {
if (!frame) {
return nullptr; // no flush for videos
}
// scaled and cropped image
auto outImageSize = getImageBytes();
AvDataPtr frameData(static_cast<uint8_t*>(av_malloc(outImageSize)));
uint8_t* scalePlanes[4] = {nullptr};
int scaleLines[4] = {0};
int result;
if ((result = av_image_fill_arrays(
scalePlanes,
scaleLines,
frameData.get(),
static_cast<AVPixelFormat>(outFormat_.format),
outFormat_.width,
outFormat_.height,
1)) < 0) {
LOG(ERROR) << "av_image_fill_arrays failed, err: "
<< ffmpeg_util::getErrorDesc(result);
return nullptr;
}
if ((result = sws_scale(
scaleContext_,
frame->data,
frame->linesize,
0,
inFormat_.height,
scalePlanes,
scaleLines)) < 0) {
LOG(ERROR) << "sws_scale failed, err: "
<< ffmpeg_util::getErrorDesc(result);
return nullptr;
}
return make_unique<DecodedFrame>(std::move(frameData), outImageSize, 0);
}
#pragma once
#include "FfmpegSampler.h"
/**
* Class that transcodes video frames from one format into another
*/
class FfmpegVideoSampler : public FfmpegSampler {
public:
explicit FfmpegVideoSampler(
const VideoFormat& in,
const VideoFormat& out,
int swsFlags = SWS_AREA);
~FfmpegVideoSampler() override;
int init() override;
int32_t getImageBytes() const;
// returns the sampled frame data (or nullptr on failure)
std::unique_ptr<DecodedFrame> sample(const AVFrame* frame) override;
const VideoFormat& getInFormat() const {
return inFormat_;
}
private:
VideoFormat inFormat_;
VideoFormat outFormat_;
int swsFlags_;
SwsContext* scaleContext_{nullptr};
};
#include "FfmpegVideoStream.h"
#include "FfmpegUtil.h"
using namespace std;
namespace {
bool operator==(const VideoFormat& x, const AVFrame& y) {
return x.width == y.width && x.height == y.height &&
x.format == static_cast<AVPixelFormat>(y.format);
}
VideoFormat toVideoFormat(const AVFrame& frame) {
VideoFormat videoFormat;
videoFormat.width = frame.width;
videoFormat.height = frame.height;
videoFormat.format = static_cast<AVPixelFormat>(frame.format);
return videoFormat;
}
} // namespace
FfmpegVideoStream::FfmpegVideoStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
MediaFormat mediaFormat,
double seekFrameMargin)
: FfmpegStream(inputCtx, index, avMediaType, seekFrameMargin),
mediaFormat_(mediaFormat) {}
FfmpegVideoStream::~FfmpegVideoStream() {}
void FfmpegVideoStream::checkStreamDecodeParams() {
auto timeBase = getTimeBase();
if (timeBase.first > 0) {
CHECK_EQ(timeBase.first, inputCtx_->streams[index_]->time_base.num);
CHECK_EQ(timeBase.second, inputCtx_->streams[index_]->time_base.den);
}
}
void FfmpegVideoStream::updateStreamDecodeParams() {
auto timeBase = getTimeBase();
if (timeBase.first == 0) {
mediaFormat_.format.video.timeBaseNum =
inputCtx_->streams[index_]->time_base.num;
mediaFormat_.format.video.timeBaseDen =
inputCtx_->streams[index_]->time_base.den;
}
mediaFormat_.format.video.duration = inputCtx_->streams[index_]->duration;
}
int FfmpegVideoStream::initFormat() {
// set output format
VideoFormat& format = mediaFormat_.format.video;
if (!ffmpeg_util::validateVideoFormat(format)) {
LOG(ERROR) << "Invalid video format";
return -1;
}
format.fps = av_q2d(
av_guess_frame_rate(inputCtx_, inputCtx_->streams[index_], nullptr));
// keep aspect ratio
ffmpeg_util::setFormatDimensions(
format.width,
format.height,
format.width,
format.height,
codecCtx_->width,
codecCtx_->height,
format.minDimension);
VLOG(1) << "After adjusting, video format"
<< ", width: " << format.width << ", height: " << format.height
<< ", format: " << format.format
<< ", minDimension: " << format.minDimension;
if (format.format == AV_PIX_FMT_NONE) {
format.format = codecCtx_->pix_fmt;
VLOG(1) << "Set pixel format: " << format.format;
}
checkStreamDecodeParams();
updateStreamDecodeParams();
return format.width != 0 && format.height != 0 &&
format.format != AV_PIX_FMT_NONE
? 0
: -1;
}
unique_ptr<DecodedFrame> FfmpegVideoStream::sampleFrameData() {
VideoFormat& format = mediaFormat_.format.video;
if (!sampler_ || !(sampler_->getInFormat() == *frame_)) {
VideoFormat newInFormat = toVideoFormat(*frame_);
sampler_ = make_unique<FfmpegVideoSampler>(newInFormat, format, SWS_AREA);
VLOG(1) << "Set input video sampler format"
<< ", width: " << newInFormat.width
<< ", height: " << newInFormat.height
<< ", format: " << newInFormat.format
<< " : output video sampler format"
<< ", width: " << format.width << ", height: " << format.height
<< ", format: " << format.format
<< ", minDimension: " << format.minDimension;
int ret = sampler_->init();
if (ret < 0) {
VLOG(1) << "Fail to initialize video sampler";
return nullptr;
}
}
return sampler_->sample(frame_);
}
#pragma once
#include <utility>
#include "FfmpegStream.h"
#include "FfmpegVideoSampler.h"
/**
* Class uses FFMPEG library to decode one video stream.
*/
class FfmpegVideoStream : public FfmpegStream {
public:
explicit FfmpegVideoStream(
AVFormatContext* inputCtx,
int index,
enum AVMediaType avMediaType,
MediaFormat mediaFormat,
double seekFrameMargin);
~FfmpegVideoStream() override;
// FfmpegStream overrides
MediaType getMediaType() const override {
return MediaType::TYPE_VIDEO;
}
FormatUnion getMediaFormat() const override {
return mediaFormat_.format;
}
int64_t getStartPts() const override {
return mediaFormat_.format.video.startPts;
}
int64_t getEndPts() const override {
return mediaFormat_.format.video.endPts;
}
// return numerator and denominator of time base
std::pair<int, int> getTimeBase() const {
return std::make_pair(
mediaFormat_.format.video.timeBaseNum,
mediaFormat_.format.video.timeBaseDen);
}
void checkStreamDecodeParams();
void updateStreamDecodeParams();
protected:
int initFormat() override;
std::unique_ptr<DecodedFrame> sampleFrameData() override;
private:
MediaFormat mediaFormat_;
std::unique_ptr<FfmpegVideoSampler> sampler_{nullptr};
};
#include "Interface.h"
void DecoderOutput::initMediaType(MediaType mediaType, FormatUnion format) {
MediaData mediaData(format);
media_data_.emplace(mediaType, std::move(mediaData));
}
void DecoderOutput::addMediaFrame(
MediaType mediaType,
std::unique_ptr<DecodedFrame> frame) {
if (media_data_.find(mediaType) != media_data_.end()) {
VLOG(1) << "media type: " << mediaType
<< " add frame with pts: " << frame->pts_;
media_data_[mediaType].frames_.push_back(std::move(frame));
} else {
VLOG(1) << "media type: " << mediaType << " not found. Skip the frame.";
}
}
void DecoderOutput::clear() {
media_data_.clear();
}
#pragma once
#include <c10/util/Logging.h>
#include <sys/types.h>
#include <memory>
#include <unordered_map>
extern "C" {
#include <libavutil/pixfmt.h>
#include <libavutil/samplefmt.h>
void av_free(void* ptr);
}
struct avDeleter {
void operator()(uint8_t* p) const {
av_free(p);
}
};
const AVPixelFormat defaultVideoPixelFormat = AV_PIX_FMT_RGB24;
const AVSampleFormat defaultAudioSampleFormat = AV_SAMPLE_FMT_FLT;
using AvDataPtr = std::unique_ptr<uint8_t, avDeleter>;
enum MediaType : uint32_t {
TYPE_VIDEO = 1,
TYPE_AUDIO = 2,
};
struct EnumClassHash {
template <typename T>
uint32_t operator()(T t) const {
return static_cast<uint32_t>(t);
}
};
struct VideoFormat {
// fields are initialized for auto-detection
// the caller can specify some or all field values if a specific output is desired
int width{0}; // width in pixels
int height{0}; // height in pixels
int minDimension{0}; // choose min dimension and rescale accordingly
// Output image pixel format. data type AVPixelFormat
AVPixelFormat format{defaultVideoPixelFormat}; // type AVPixelFormat
int64_t startPts{0}, endPts{0}; // Start and end presentation timestamp
int timeBaseNum{0};
int timeBaseDen{1}; // numerator and denominator of time base
float fps{0.0};
int64_t duration{0}; // duration of the stream, in stream time base
};
struct AudioFormat {
// fields are initialized for auto-detection
// the caller can specify some or all field values if a specific output is desired
int samples{0}; // number of samples per second (sampling frequency)
int channels{0}; // number of channels
AVSampleFormat format{defaultAudioSampleFormat}; // type AVSampleFormat
int64_t startPts{0}, endPts{0}; // Start and end presentation timestamp
int timeBaseNum{0};
int timeBaseDen{1}; // numerator and denominator of time base
int64_t duration{0}; // duration of the stream, in stream time base
};
union FormatUnion {
FormatUnion() {}
VideoFormat video;
AudioFormat audio;
};
struct MediaFormat {
MediaFormat() {}
MediaFormat(const MediaFormat& mediaFormat) : type(mediaFormat.type) {
if (type == MediaType::TYPE_VIDEO) {
format.video = mediaFormat.format.video;
} else if (type == MediaType::TYPE_AUDIO) {
format.audio = mediaFormat.format.audio;
}
}
MediaFormat(MediaType mediaType) : type(mediaType) {
if (mediaType == MediaType::TYPE_VIDEO) {
format.video = VideoFormat();
} else if (mediaType == MediaType::TYPE_AUDIO) {
format.audio = AudioFormat();
}
}
// media type
MediaType type;
// format data
FormatUnion format;
};
class DecodedFrame {
public:
explicit DecodedFrame() : frame_(nullptr), frameSize_(0), pts_(0) {}
explicit DecodedFrame(AvDataPtr frame, int frameSize, int64_t pts)
: frame_(std::move(frame)), frameSize_(frameSize), pts_(pts) {}
AvDataPtr frame_{nullptr};
int frameSize_{0};
int64_t pts_{0};
};
struct MediaData {
MediaData() {}
MediaData(FormatUnion format) : format_(format) {}
FormatUnion format_;
std::vector<std::unique_ptr<DecodedFrame>> frames_;
};
class DecoderOutput {
public:
explicit DecoderOutput() {}
~DecoderOutput() {}
void initMediaType(MediaType mediaType, FormatUnion format);
void addMediaFrame(MediaType mediaType, std::unique_ptr<DecodedFrame> frame);
void clear();
std::unordered_map<MediaType, MediaData, EnumClassHash> media_data_;
};
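A hedged sketch of how the structures above fit together (illustrative only; the function name is hypothetical): a caller registers a stream's format once and then appends decoded frames for that media type:

#include "Interface.h"

// Register a video stream and push a single (empty) decoded frame.
void sketchDecoderOutputUsage(const FormatUnion& videoFormat) {
  DecoderOutput output;
  output.initMediaType(MediaType::TYPE_VIDEO, videoFormat);
  auto frame = std::make_unique<DecodedFrame>(); // empty payload, pts 0
  output.addMediaFrame(MediaType::TYPE_VIDEO, std::move(frame));
  // decoded frames are now available in
  // output.media_data_[MediaType::TYPE_VIDEO].frames_
  output.clear();
}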
@@ -3,11 +3,11 @@
#include <Python.h>
#include <c10/util/Logging.h>
#include <exception>
#include "FfmpegDecoder.h"
#include "FfmpegHeaders.h"
#include "util.h"
#include "memory_buffer.h"
#include "sync_decoder.h"
using namespace std;
using namespace ffmpeg;
// If we are in a Windows environment, we need to define
// initialization functions for the _custom_ops extension
@@ -27,121 +27,159 @@ PyMODINIT_FUNC PyInit_video_reader(void) {
namespace video_reader {
class UnknownPixelFormatException : public exception {
const char* what() const throw() override {
return "Unknown pixel format";
}
};
int getChannels(AVPixelFormat format) {
int numChannels = 0;
switch (format) {
case AV_PIX_FMT_BGR24:
case AV_PIX_FMT_RGB24:
numChannels = 3;
break;
default:
LOG(ERROR) << "Unknown format: " << format;
throw UnknownPixelFormatException();
}
return numChannels;
}
const AVPixelFormat defaultVideoPixelFormat = AV_PIX_FMT_RGB24;
const AVSampleFormat defaultAudioSampleFormat = AV_SAMPLE_FMT_FLT;
const size_t decoderTimeoutMs = 600000;
// A jitter can be added to the end of the range to avoid conversion/rounding
// errors. A small value of 100us won't be enough to select the next frame, but it
// is enough to compensate for rounding errors due to the multiple conversions.
const size_t timeBaseJitterUs = 100;
DecoderParameters getDecoderParams(
int64_t videoStartUs,
int64_t videoEndUs,
double seekFrameMarginUs,
int64_t getPtsOnly,
int64_t readVideoStream,
int videoWidth,
int videoHeight,
int videoMinDimension,
int videoMaxDimension,
int64_t readAudioStream,
int audioSamples,
int audioChannels) {
DecoderParameters params;
params.headerOnly = getPtsOnly != 0;
params.seekAccuracy = seekFrameMarginUs;
params.startOffset = videoStartUs;
params.endOffset = videoEndUs;
params.timeoutMs = decoderTimeoutMs;
params.preventStaleness = false;
void fillVideoTensor(
std::vector<unique_ptr<DecodedFrame>>& frames,
torch::Tensor& videoFrame,
torch::Tensor& videoFramePts) {
int frameSize = 0;
if (videoFrame.numel() > 0) {
frameSize = videoFrame.numel() / frames.size();
if (readVideoStream == 1) {
MediaFormat videoFormat(0);
videoFormat.type = TYPE_VIDEO;
videoFormat.format.video.format = defaultVideoPixelFormat;
videoFormat.format.video.width = videoWidth;
videoFormat.format.video.height = videoHeight;
videoFormat.format.video.minDimension = videoMinDimension;
videoFormat.format.video.maxDimension = videoMaxDimension;
params.formats.insert(videoFormat);
}
int frameCount = 0;
if (readAudioStream == 1) {
MediaFormat audioFormat;
audioFormat.type = TYPE_AUDIO;
audioFormat.format.audio.format = defaultAudioSampleFormat;
audioFormat.format.audio.samples = audioSamples;
audioFormat.format.audio.channels = audioChannels;
params.formats.insert(audioFormat);
}
uint8_t* videoFrameData =
videoFrame.numel() > 0 ? videoFrame.data_ptr<uint8_t>() : nullptr;
int64_t* videoFramePtsData = videoFramePts.data_ptr<int64_t>();
return params;
}
for (size_t i = 0; i < frames.size(); ++i) {
const auto& frame = frames[i];
if (videoFrameData) {
memcpy(
videoFrameData + (size_t)(frameCount++) * (size_t)frameSize,
frame->frame_.get(),
frameSize * sizeof(uint8_t));
// returns number of written bytes
template <typename T>
size_t fillTensor(
std::vector<DecoderOutputMessage>& msgs,
torch::Tensor& frame,
torch::Tensor& framePts,
int64_t num,
int64_t den) {
if (msgs.empty()) {
return 0;
}
T* frameData = frame.numel() > 0 ? frame.data_ptr<T>() : nullptr;
int64_t* framePtsData = framePts.data_ptr<int64_t>();
CHECK_EQ(framePts.size(0), msgs.size());
size_t avgElementsInFrame = frame.numel() / msgs.size();
size_t offset = 0;
for (size_t i = 0; i < msgs.size(); ++i) {
const auto& msg = msgs[i];
// convert pts into original time_base
AVRational avr = {(int)num, (int)den};
framePtsData[i] = av_rescale_q(msg.header.pts, AV_TIME_BASE_Q, avr);
VLOG(2) << "PTS type: " << sizeof(T) << ", us: " << msg.header.pts
<< ", original: " << framePtsData[i];
if (frameData) {
auto sizeInBytes = msg.payload->length();
memcpy(frameData + offset, msg.payload->data(), sizeInBytes);
if (sizeof(T) == sizeof(uint8_t)) {
// Video - move by allocated frame size
offset += avgElementsInFrame / sizeof(T);
} else {
// Audio - move by number of samples
offset += sizeInBytes / sizeof(T);
}
}
videoFramePtsData[i] = frame->pts_;
}
return offset * sizeof(T);
}
void getVideoMeta(
DecoderOutput& decoderOutput,
int& numFrames,
int& height,
int& width,
int& numChannels) {
auto& videoFrames = decoderOutput.media_data_[TYPE_VIDEO].frames_;
numFrames = videoFrames.size();
FormatUnion& videoFormat = decoderOutput.media_data_[TYPE_VIDEO].format_;
height = videoFormat.video.height;
width = videoFormat.video.width;
numChannels = getChannels(videoFormat.video.format);
size_t fillVideoTensor(
std::vector<DecoderOutputMessage>& msgs,
torch::Tensor& videoFrame,
torch::Tensor& videoFramePts,
int64_t num,
int64_t den) {
return fillTensor<uint8_t>(msgs, videoFrame, videoFramePts, num, den);
}
void fillAudioTensor(
std::vector<unique_ptr<DecodedFrame>>& frames,
size_t fillAudioTensor(
std::vector<DecoderOutputMessage>& msgs,
torch::Tensor& audioFrame,
torch::Tensor& audioFramePts) {
if (frames.size() == 0) {
return;
}
float* audioFrameData =
audioFrame.numel() > 0 ? audioFrame.data_ptr<float>() : nullptr;
CHECK_EQ(audioFramePts.size(0), frames.size());
int64_t* audioFramePtsData = audioFramePts.data_ptr<int64_t>();
int bytesPerSample = av_get_bytes_per_sample(defaultAudioSampleFormat);
int64_t frameDataOffset = 0;
for (size_t i = 0; i < frames.size(); ++i) {
audioFramePtsData[i] = frames[i]->pts_;
if (audioFrameData) {
memcpy(
audioFrameData + frameDataOffset,
frames[i]->frame_.get(),
frames[i]->frameSize_);
frameDataOffset += (frames[i]->frameSize_ / bytesPerSample);
}
}
torch::Tensor& audioFramePts,
int64_t num,
int64_t den) {
return fillTensor<float>(msgs, audioFrame, audioFramePts, num, den);
}
void getAudioMeta(
DecoderOutput& decoderOutput,
int64_t& numSamples,
int64_t& channels,
int64_t& numFrames) {
FormatUnion& audioFormat = decoderOutput.media_data_[TYPE_AUDIO].format_;
channels = audioFormat.audio.channels;
CHECK_EQ(audioFormat.audio.format, AV_SAMPLE_FMT_FLT);
int bytesPerSample = av_get_bytes_per_sample(
static_cast<AVSampleFormat>(audioFormat.audio.format));
// auto& audioFrames = decoderOutput.media_frames_[TYPE_AUDIO];
auto& audioFrames = decoderOutput.media_data_[TYPE_AUDIO].frames_;
numFrames = audioFrames.size();
int64_t frameSizeTotal = 0;
for (auto const& decodedFrame : audioFrames) {
frameSizeTotal += static_cast<int64_t>(decodedFrame->frameSize_);
void offsetsToUs(
double& seekFrameMargin,
int64_t readVideoStream,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
int64_t videoTimeBaseDen,
int64_t readAudioStream,
int64_t audioStartPts,
int64_t audioEndPts,
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen,
int64_t& videoStartUs,
int64_t& videoEndUs) {
seekFrameMargin *= AV_TIME_BASE;
videoStartUs = 0;
videoEndUs = -1;
if (readVideoStream) {
AVRational vr = {(int)videoTimeBaseNum, (int)videoTimeBaseDen};
if (videoStartPts > 0) {
videoStartUs = av_rescale_q(videoStartPts, vr, AV_TIME_BASE_Q);
}
if (videoEndPts > 0) {
// Add jitter to the end of the range to avoid conversion/rounding error.
// Small value 100us won't be enough to select the next frame, but enough
// to compensate rounding error due to the multiple conversions.
videoEndUs =
timeBaseJitterUs + av_rescale_q(videoEndPts, vr, AV_TIME_BASE_Q);
}
} else if (readAudioStream) {
AVRational ar = {(int)audioTimeBaseNum, (int)audioTimeBaseDen};
if (audioStartPts > 0) {
videoStartUs = av_rescale_q(audioStartPts, ar, AV_TIME_BASE_Q);
}
if (audioEndPts > 0) {
// Add jitter to the end of the range to avoid conversion/rounding error.
// Small value 100us won't be enough to select the next frame, but enough
// to compensate rounding error due to the multiple conversions.
videoEndUs =
timeBaseJitterUs + av_rescale_q(audioEndPts, ar, AV_TIME_BASE_Q);
}
}
VLOG(2) << "numFrames: " << numFrames;
VLOG(2) << "frameSizeTotal: " << frameSizeTotal;
VLOG(2) << "channels: " << channels;
VLOG(2) << "bytesPerSample: " << bytesPerSample;
CHECK_EQ(frameSizeTotal % (channels * bytesPerSample), 0);
numSamples = frameSizeTotal / (channels * bytesPerSample);
}
torch::List<torch::Tensor> readVideo(
@@ -154,6 +192,7 @@ torch::List<torch::Tensor> readVideo(
int64_t width,
int64_t height,
int64_t minDimension,
int64_t maxDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
@@ -165,38 +204,92 @@ torch::List<torch::Tensor> readVideo(
int64_t audioEndPts,
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen) {
unique_ptr<DecoderParameters> params = util::getDecoderParams(
int64_t videoStartUs, videoEndUs;
offsetsToUs(
seekFrameMargin,
getPtsOnly,
readVideoStream,
width,
height,
minDimension,
videoStartPts,
videoEndPts,
videoTimeBaseNum,
videoTimeBaseDen,
readAudioStream,
audioSamples,
audioChannels,
audioStartPts,
audioEndPts,
audioTimeBaseNum,
audioTimeBaseDen);
FfmpegDecoder decoder;
DecoderOutput decoderOutput;
audioTimeBaseDen,
videoStartUs,
videoEndUs);
DecoderParameters params = getDecoderParams(
videoStartUs, // videoStartPts
videoEndUs, // videoEndPts
seekFrameMargin, // seekFrameMargin
getPtsOnly, // getPtsOnly
readVideoStream, // readVideoStream
width, // width
height, // height
minDimension, // minDimension
maxDimension, // maxDimension
readAudioStream, // readAudioStream
audioSamples, // audioSamples
audioChannels // audioChannels
);
SyncDecoder decoder;
std::vector<DecoderOutputMessage> audioMessages, videoMessages;
DecoderInCallback callback = nullptr;
std::string logMessage, logType;
if (isReadFile) {
decoder.decodeFile(std::move(params), videoPath, decoderOutput);
params.uri = videoPath;
logType = "file";
logMessage = videoPath;
} else {
decoder.decodeMemory(
std::move(params),
input_video.data_ptr<uint8_t>(),
input_video.size(0),
decoderOutput);
callback = MemoryBuffer::getCallback(
input_video.data_ptr<uint8_t>(), input_video.size(0));
logType = "memory";
logMessage = std::to_string(input_video.size(0));
}
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] has started";
const auto now = std::chrono::system_clock::now();
bool succeeded;
DecoderMetadata audioMetadata, videoMetadata;
std::vector<DecoderMetadata> metadata;
if ((succeeded = decoder.init(params, std::move(callback), &metadata))) {
for (const auto& header : metadata) {
if (header.format.type == TYPE_VIDEO) {
videoMetadata = header;
} else if (header.format.type == TYPE_AUDIO) {
audioMetadata = header;
}
}
int res;
DecoderOutputMessage msg;
while (0 == (res = decoder.decode(&msg, decoderTimeoutMs))) {
if (msg.header.format.type == TYPE_VIDEO) {
videoMessages.push_back(std::move(msg));
}
if (msg.header.format.type == TYPE_AUDIO) {
audioMessages.push_back(std::move(msg));
}
msg.payload.reset();
}
} else {
LOG(ERROR) << "Decoder initialization has failed";
}
const auto then = std::chrono::system_clock::now();
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] has finished, "
<< std::chrono::duration_cast<std::chrono::microseconds>(then - now)
.count()
<< " us";
decoder.shutdown();
// video section
torch::Tensor videoFrame = torch::zeros({0}, torch::kByte);
torch::Tensor videoFramePts = torch::zeros({0}, torch::kLong);
@@ -204,37 +297,49 @@ torch::List<torch::Tensor> readVideo(
torch::Tensor videoFps = torch::zeros({0}, torch::kFloat);
torch::Tensor videoDuration = torch::zeros({0}, torch::kLong);
if (readVideoStream == 1) {
auto it = decoderOutput.media_data_.find(TYPE_VIDEO);
if (it != decoderOutput.media_data_.end()) {
int numVideoFrames, outHeight, outWidth, numChannels;
getVideoMeta(
decoderOutput, numVideoFrames, outHeight, outWidth, numChannels);
if (succeeded && readVideoStream == 1) {
if (!videoMessages.empty()) {
const auto& header = videoMetadata;
const auto& format = header.format.format.video;
int numVideoFrames = videoMessages.size();
int outHeight = format.height;
int outWidth = format.width;
int numChannels = 3; // decoder guarantees the default AV_PIX_FMT_RGB24
size_t expectedWrittenBytes = 0;
if (getPtsOnly == 0) {
videoFrame = torch::zeros(
{numVideoFrames, outHeight, outWidth, numChannels}, torch::kByte);
expectedWrittenBytes =
numVideoFrames * outHeight * outWidth * numChannels;
}
videoFramePts = torch::zeros({numVideoFrames}, torch::kLong);
fillVideoTensor(
decoderOutput.media_data_[TYPE_VIDEO].frames_,
videoFrame,
videoFramePts);
VLOG(2) << "video duration: " << header.duration
<< ", fps: " << header.fps << ", num: " << header.num
<< ", den: " << header.den << ", num frames: " << numVideoFrames;
auto numberWrittenBytes = fillVideoTensor(
videoMessages, videoFrame, videoFramePts, header.num, header.den);
CHECK_EQ(numberWrittenBytes, expectedWrittenBytes);
videoTimeBase = torch::zeros({2}, torch::kInt);
int* videoTimeBaseData = videoTimeBase.data_ptr<int>();
videoTimeBaseData[0] = it->second.format_.video.timeBaseNum;
videoTimeBaseData[1] = it->second.format_.video.timeBaseDen;
videoTimeBaseData[0] = header.num;
videoTimeBaseData[1] = header.den;
videoFps = torch::zeros({1}, torch::kFloat);
float* videoFpsData = videoFps.data_ptr<float>();
videoFpsData[0] = it->second.format_.video.fps;
videoFpsData[0] = header.fps;
videoDuration = torch::zeros({1}, torch::kLong);
int64_t* videoDurationData = videoDuration.data_ptr<int64_t>();
videoDurationData[0] = it->second.format_.video.duration;
AVRational vr = {(int)header.num, (int)header.den};
videoDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, vr);
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] filled video tensors";
} else {
VLOG(1) << "Miss video stream";
}
@@ -246,39 +351,57 @@ torch::List<torch::Tensor> readVideo(
torch::Tensor audioTimeBase = torch::zeros({0}, torch::kInt);
torch::Tensor audioSampleRate = torch::zeros({0}, torch::kInt);
torch::Tensor audioDuration = torch::zeros({0}, torch::kLong);
if (readAudioStream == 1) {
auto it = decoderOutput.media_data_.find(TYPE_AUDIO);
if (it != decoderOutput.media_data_.end()) {
VLOG(1) << "Find audio stream";
int64_t numAudioSamples = 0, outAudioChannels = 0, numAudioFrames = 0;
getAudioMeta(
decoderOutput, numAudioSamples, outAudioChannels, numAudioFrames);
VLOG(2) << "numAudioSamples: " << numAudioSamples;
VLOG(2) << "outAudioChannels: " << outAudioChannels;
VLOG(2) << "numAudioFrames: " << numAudioFrames;
if (succeeded && readAudioStream == 1) {
if (!audioMessages.empty()) {
const auto& header = audioMetadata;
const auto& format = header.format.format.audio;
int64_t outAudioChannels = format.channels;
int bytesPerSample =
av_get_bytes_per_sample(static_cast<AVSampleFormat>(format.format));
int numAudioFrames = audioMessages.size();
int64_t numAudioSamples = 0;
if (getPtsOnly == 0) {
int64_t frameSizeTotal = 0;
for (auto const& audioMessage : audioMessages) {
frameSizeTotal += audioMessage.payload->length();
}
CHECK_EQ(frameSizeTotal % (outAudioChannels * bytesPerSample), 0);
numAudioSamples = frameSizeTotal / (outAudioChannels * bytesPerSample);
audioFrame =
torch::zeros({numAudioSamples, outAudioChannels}, torch::kFloat);
}
audioFramePts = torch::zeros({numAudioFrames}, torch::kLong);
fillAudioTensor(
decoderOutput.media_data_[TYPE_AUDIO].frames_,
audioFrame,
audioFramePts);
VLOG(2) << "audio duration: " << header.duration
<< ", channels: " << format.channels
<< ", sample rate: " << format.samples << ", num: " << header.num
<< ", den: " << header.den;
auto numberWrittenBytes = fillAudioTensor(
audioMessages, audioFrame, audioFramePts, header.num, header.den);
CHECK_EQ(
numberWrittenBytes,
numAudioSamples * outAudioChannels * sizeof(float));
audioTimeBase = torch::zeros({2}, torch::kInt);
int* audioTimeBaseData = audioTimeBase.data_ptr<int>();
audioTimeBaseData[0] = it->second.format_.audio.timeBaseNum;
audioTimeBaseData[1] = it->second.format_.audio.timeBaseDen;
audioTimeBaseData[0] = header.num;
audioTimeBaseData[1] = header.den;
audioSampleRate = torch::zeros({1}, torch::kInt);
int* audioSampleRateData = audioSampleRate.data_ptr<int>();
audioSampleRateData[0] = it->second.format_.audio.samples;
audioSampleRateData[0] = format.samples;
audioDuration = torch::zeros({1}, torch::kLong);
int64_t* audioDurationData = audioDuration.data_ptr<int64_t>();
audioDurationData[0] = it->second.format_.audio.duration;
AVRational ar = {(int)header.num, (int)header.den};
audioDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, ar);
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] filled audio tensors";
} else {
VLOG(1) << "Miss audio stream";
}
@@ -296,6 +419,9 @@ torch::List<torch::Tensor> readVideo(
result.push_back(std::move(audioSampleRate));
result.push_back(std::move(audioDuration));
VLOG(1) << "Video decoding from " << logType << " [" << logMessage
<< "] about to return";
return result;
}
@@ -307,6 +433,7 @@ torch::List<torch::Tensor> readVideoFromMemory(
int64_t width,
int64_t height,
int64_t minDimension,
int64_t maxDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
@@ -328,6 +455,7 @@ torch::List<torch::Tensor> readVideoFromMemory(
width,
height,
minDimension,
maxDimension,
videoStartPts,
videoEndPts,
videoTimeBaseNum,
@@ -349,6 +477,7 @@ torch::List<torch::Tensor> readVideoFromFile(
int64_t width,
int64_t height,
int64_t minDimension,
int64_t maxDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
@@ -371,6 +500,7 @@ torch::List<torch::Tensor> readVideoFromFile(
width,
height,
minDimension,
maxDimension,
videoStartPts,
videoEndPts,
videoTimeBaseNum,
@@ -388,59 +518,96 @@ torch::List<torch::Tensor> probeVideo(
bool isReadFile,
const torch::Tensor& input_video,
std::string videoPath) {
unique_ptr<DecoderParameters> params = util::getDecoderParams(
DecoderParameters params = getDecoderParams(
0, // videoStartUs
-1, // videoEndUs
0, // seekFrameMargin
0, // getPtsOnly
1, // getPtsOnly
1, // readVideoStream
0, // width
0, // height
0, // minDimension
0, // videoStartPts
0, // videoEndPts
0, // videoTimeBaseNum
1, // videoTimeBaseDen
0, // maxDimension
1, // readAudioStream
0, // audioSamples
0, // audioChannels
0, // audioStartPts
0, // audioEndPts
0, // audioTimeBaseNum
1 // audioTimeBaseDen
0 // audioChannels
);
FfmpegDecoder decoder;
DecoderOutput decoderOutput;
SyncDecoder decoder;
DecoderInCallback callback = nullptr;
std::string logMessage, logType;
if (isReadFile) {
decoder.probeFile(std::move(params), videoPath, decoderOutput);
params.uri = videoPath;
logType = "file";
logMessage = videoPath;
} else {
callback = MemoryBuffer::getCallback(
input_video.data_ptr<uint8_t>(), input_video.size(0));
logType = "memory";
logMessage = std::to_string(input_video.size(0));
}
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] has started";
const auto now = std::chrono::system_clock::now();
bool succeeded;
bool gotAudio = false, gotVideo = false;
DecoderMetadata audioMetadata, videoMetadata;
std::vector<DecoderMetadata> metadata;
if ((succeeded = decoder.init(params, std::move(callback), &metadata))) {
for (const auto& header : metadata) {
if (header.format.type == TYPE_VIDEO) {
gotVideo = true;
videoMetadata = header;
} else if (header.format.type == TYPE_AUDIO) {
gotAudio = true;
audioMetadata = header;
}
}
const auto then = std::chrono::system_clock::now();
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] has finished, "
<< std::chrono::duration_cast<std::chrono::microseconds>(then - now)
.count()
<< " us";
} else {
decoder.probeMemory(
std::move(params),
input_video.data_ptr<uint8_t>(),
input_video.size(0),
decoderOutput);
LOG(ERROR) << "Decoder initialization has failed";
}
decoder.shutdown();
// video section
torch::Tensor videoTimeBase = torch::zeros({0}, torch::kInt);
torch::Tensor videoFps = torch::zeros({0}, torch::kFloat);
torch::Tensor videoDuration = torch::zeros({0}, torch::kLong);
auto it = decoderOutput.media_data_.find(TYPE_VIDEO);
if (it != decoderOutput.media_data_.end()) {
VLOG(1) << "Find video stream";
if (succeeded && gotVideo) {
videoTimeBase = torch::zeros({2}, torch::kInt);
int* videoTimeBaseData = videoTimeBase.data_ptr<int>();
videoTimeBaseData[0] = it->second.format_.video.timeBaseNum;
videoTimeBaseData[1] = it->second.format_.video.timeBaseDen;
const auto& header = videoMetadata;
const auto& media = header.format;
videoTimeBaseData[0] = header.num;
videoTimeBaseData[1] = header.den;
videoFps = torch::zeros({1}, torch::kFloat);
float* videoFpsData = videoFps.data_ptr<float>();
videoFpsData[0] = it->second.format_.video.fps;
videoFpsData[0] = header.fps;
videoDuration = torch::zeros({1}, torch::kLong);
int64_t* videoDurationData = videoDuration.data_ptr<int64_t>();
videoDurationData[0] = it->second.format_.video.duration;
AVRational avr = {(int)header.num, (int)header.den};
videoDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, avr);
VLOG(2) << "Prob fps: " << header.fps << ", duration: " << header.duration
<< ", num: " << header.num << ", den: " << header.den;
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] filled video tensors";
} else {
VLOG(1) << "Miss video stream";
LOG(ERROR) << "Miss video stream";
}
// audio section
@@ -448,21 +615,31 @@ torch::List<torch::Tensor> probeVideo(
torch::Tensor audioSampleRate = torch::zeros({0}, torch::kInt);
torch::Tensor audioDuration = torch::zeros({0}, torch::kLong);
it = decoderOutput.media_data_.find(TYPE_AUDIO);
if (it != decoderOutput.media_data_.end()) {
VLOG(1) << "Find audio stream";
if (succeeded && gotAudio) {
audioTimeBase = torch::zeros({2}, torch::kInt);
int* audioTimeBaseData = audioTimeBase.data_ptr<int>();
audioTimeBaseData[0] = it->second.format_.audio.timeBaseNum;
audioTimeBaseData[1] = it->second.format_.audio.timeBaseDen;
const auto& header = audioMetadata;
const auto& media = header.format;
const auto& format = media.format.audio;
audioTimeBaseData[0] = header.num;
audioTimeBaseData[1] = header.den;
audioSampleRate = torch::zeros({1}, torch::kInt);
int* audioSampleRateData = audioSampleRate.data_ptr<int>();
audioSampleRateData[0] = it->second.format_.audio.samples;
audioSampleRateData[0] = format.samples;
audioDuration = torch::zeros({1}, torch::kLong);
int64_t* audioDurationData = audioDuration.data_ptr<int64_t>();
audioDurationData[0] = it->second.format_.audio.duration;
AVRational avr = {(int)header.num, (int)header.den};
audioDurationData[0] = av_rescale_q(header.duration, AV_TIME_BASE_Q, avr);
VLOG(2) << "Prob sample rate: " << format.samples
<< ", duration: " << header.duration << ", num: " << header.num
<< ", den: " << header.den;
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] filled audio tensors";
} else {
VLOG(1) << "Miss audio stream";
}
@@ -475,6 +652,9 @@ torch::List<torch::Tensor> probeVideo(
result.push_back(std::move(audioSampleRate));
result.push_back(std::move(audioDuration));
VLOG(1) << "Video probing from " << logType << " [" << logMessage
<< "] is about to return";
return result;
}
#pragma once
#include <torch/script.h>
// Interface for Python
/*
return:
videoFrame: tensor (N, H, W, C) kByte
videoFramePts: tensor (N) kLong
videoTimeBase: tensor (2) kInt
videoFps: tensor (1) kFloat
audioFrame: tensor (N, C) kFloat
audioFramePts: tensor (N) kLong
audioTimeBase: tensor (2) kInt
audioSampleRate: tensor (1) kInt
*/
torch::List<torch::Tensor> readVideoFromMemory(
// 1D tensor of data type uint8, storing the compressed video data
torch::Tensor input_video,
// seeking a frame in the video/audio stream is imprecise, so seek to a
// timestamp earlier by a margin. The unit of the margin is seconds
double seekFrameMargin,
// If only pts is needed and video/audio frames are not needed, set it
// to 1
int64_t getPtsOnly,
// bool variable. Set it to 1 if video stream should be read. Otherwise, set
// it to 0
int64_t readVideoStream,
/*
Valid parameter values for rescaling video frames
___________________________________________________
| width | height | min_dimension | algorithm |
|_________________________________________________|
| 0 | 0 | 0 | original |
|_________________________________________________|
| 0 | 0 | >0 |scale to min dimension|
|_________________________________________________|
| >0 | 0 | 0 | scale keeping W |
|_________________________________________________|
| 0 | >0 | 0 | scale keeping H |
|_________________________________________________|
| >0 | >0 | 0 | stretch/scale |
|_________________________________________________|
*/
int64_t width,
int64_t height,
int64_t minDimension,
// video frames with pts in [videoStartPts, videoEndPts] will be decoded
// For decoding all video frames, use [0, -1]
int64_t videoStartPts,
int64_t videoEndPts,
// numerator and denominator of time base of video stream.
// For decoding all video frames, supply dummy 0 (numerator) and 1
// (denominator). For decoding localized video frames, they need to be
// supplied and will be checked during decoding
int64_t videoTimeBaseNum,
int64_t videoTimeBaseDen,
// bool variable. Set it to 1 if audio stream should be read. Otherwise, set
// it to 0
int64_t readAudioStream,
// audio stream sampling rate.
// If not resampling audio waveform, supply 0
// Otherwise, supply a positive integer.
int64_t audioSamples,
// audio stream channels
// Supply 0 to use the same number of channels as in the original audio
// stream
int64_t audioChannels,
// audio frames with pts in [audioStartPts, audioEndPts] will be decoded
// For decoding all audio frames, use [0, -1]
int64_t audioStartPts,
int64_t audioEndPts,
// numerator and denominator of time base of audio stream.
// For decoding all audio frames, supply dummy 0 (numerator) and 1
// (denominator). For decoding localized audio frames, they need to be
// supplied and will be checked during decoding
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen);
torch::List<torch::Tensor> readVideoFromFile(
std::string videoPath,
double seekFrameMargin,
int64_t getPtsOnly,
int64_t readVideoStream,
int64_t width,
int64_t height,
int64_t minDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int64_t videoTimeBaseNum,
int64_t videoTimeBaseDen,
int64_t readAudioStream,
int64_t audioSamples,
int64_t audioChannels,
int64_t audioStartPts,
int64_t audioEndPts,
int64_t audioTimeBaseNum,
int64_t audioTimeBaseDen);
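For reference, a hedged C++ sketch of calling the declaration above with its documented defaults (decode every video and audio frame at the original resolution). This uses the older signature shown above, before video_max_dimension was added, and the include path is an assumption:

#include <string>
#include "video_reader.h" // assumed header name for the declarations above

void sketchReadWholeVideo(const std::string& videoPath) {
  torch::List<torch::Tensor> result = readVideoFromFile(
      videoPath,
      /*seekFrameMargin=*/0.25,
      /*getPtsOnly=*/0,
      /*readVideoStream=*/1,
      /*width=*/0,
      /*height=*/0,
      /*minDimension=*/0,
      /*videoStartPts=*/0,
      /*videoEndPts=*/-1,
      /*videoTimeBaseNum=*/0,
      /*videoTimeBaseDen=*/1,
      /*readAudioStream=*/1,
      /*audioSamples=*/0,
      /*audioChannels=*/0,
      /*audioStartPts=*/0,
      /*audioEndPts=*/-1,
      /*audioTimeBaseNum=*/0,
      /*audioTimeBaseDen=*/1);
  torch::Tensor videoFrames = result.get(0); // (N, H, W, C) uint8 frames
  torch::Tensor videoPts = result.get(1);    // (N) int64 presentation timestamps
}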
#include "util.h"
using namespace std;
namespace util {
unique_ptr<DecoderParameters> getDecoderParams(
double seekFrameMargin,
int64_t getPtsOnly,
int64_t readVideoStream,
int videoWidth,
int videoHeight,
int videoMinDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int videoTimeBaseNum,
int videoTimeBaseDen,
int64_t readAudioStream,
int audioSamples,
int audioChannels,
int64_t audioStartPts,
int64_t audioEndPts,
int audioTimeBaseNum,
int audioTimeBaseDen) {
unique_ptr<DecoderParameters> params = make_unique<DecoderParameters>();
if (readVideoStream == 1) {
params->formats.emplace(
MediaType::TYPE_VIDEO, MediaFormat(MediaType::TYPE_VIDEO));
MediaFormat& videoFormat = params->formats[MediaType::TYPE_VIDEO];
videoFormat.format.video.width = videoWidth;
videoFormat.format.video.height = videoHeight;
videoFormat.format.video.minDimension = videoMinDimension;
videoFormat.format.video.startPts = videoStartPts;
videoFormat.format.video.endPts = videoEndPts;
videoFormat.format.video.timeBaseNum = videoTimeBaseNum;
videoFormat.format.video.timeBaseDen = videoTimeBaseDen;
}
if (readAudioStream == 1) {
params->formats.emplace(
MediaType::TYPE_AUDIO, MediaFormat(MediaType::TYPE_AUDIO));
MediaFormat& audioFormat = params->formats[MediaType::TYPE_AUDIO];
audioFormat.format.audio.samples = audioSamples;
audioFormat.format.audio.channels = audioChannels;
audioFormat.format.audio.startPts = audioStartPts;
audioFormat.format.audio.endPts = audioEndPts;
audioFormat.format.audio.timeBaseNum = audioTimeBaseNum;
audioFormat.format.audio.timeBaseDen = audioTimeBaseDen;
}
params->seekFrameMargin = seekFrameMargin;
params->getPtsOnly = getPtsOnly;
return params;
}
} // namespace util
#pragma once
#include <memory>
#include "FfmpegDecoder.h"
namespace util {
std::unique_ptr<DecoderParameters> getDecoderParams(
double seekFrameMargin,
int64_t getPtsOnly,
int64_t readVideoStream,
int videoWidth,
int videoHeight,
int videoMinDimension,
int64_t videoStartPts,
int64_t videoEndPts,
int videoTimeBaseNum,
int videoTimeBaseDen,
int64_t readAudioStream,
int audioSamples,
int audioChannels,
int64_t audioStartPts,
int64_t audioEndPts,
int audioTimeBaseNum,
int audioTimeBaseDen);
} // namespace util
@@ -98,6 +98,7 @@ class VideoClips(object):
_video_width=0,
_video_height=0,
_video_min_dimension=0,
_video_max_dimension=0,
_audio_samples=0,
_audio_channels=0,
):
@@ -109,6 +110,7 @@ class VideoClips(object):
self._video_width = _video_width
self._video_height = _video_height
self._video_min_dimension = _video_min_dimension
self._video_max_dimension = _video_max_dimension
self._audio_samples = _audio_samples
self._audio_channels = _audio_channels
@@ -179,6 +181,7 @@ class VideoClips(object):
_video_width=self._video_width,
_video_height=self._video_height,
_video_min_dimension=self._video_min_dimension,
_video_max_dimension=self._video_max_dimension,
_audio_samples=self._audio_samples,
_audio_channels=self._audio_channels,
)
@@ -299,6 +302,10 @@ class VideoClips(object):
raise ValueError(
"pyav backend doesn't support _video_min_dimension != 0"
)
if self._video_max_dimension != 0:
raise ValueError(
"pyav backend doesn't support _video_max_dimension != 0"
)
if self._audio_samples != 0:
raise ValueError("pyav backend doesn't support _audio_samples != 0")
@@ -335,6 +342,7 @@ class VideoClips(object):
video_width=self._video_width,
video_height=self._video_height,
video_min_dimension=self._video_min_dimension,
video_max_dimension=self._video_max_dimension,
video_pts_range=(video_start_pts, video_end_pts),
video_timebase=video_timebase,
audio_samples=self._audio_samples,
@@ -138,6 +138,7 @@ def _read_video_from_file(
video_width=0,
video_height=0,
video_min_dimension=0,
video_max_dimension=0,
video_pts_range=(0, -1),
video_timebase=default_timebase,
read_audio_stream=True,
@@ -155,21 +156,34 @@ def _read_video_from_file(
filename : str
path to the video file
seek_frame_margin: double, optional
seeking frame in the stream is imprecise. Thus, when video_start_pts is specified,
we seek the pts earlier by seek_frame_margin seconds
seeking frame in the stream is imprecise. Thus, when video_start_pts
is specified, we seek the pts earlier by seek_frame_margin seconds
read_video_stream: int, optional
whether read video stream. If yes, set to 1. Otherwise, 0
video_width/video_height/video_min_dimension: int
video_width/video_height/video_min_dimension/video_max_dimension: int
together decide the size of decoded frames
- when video_width = 0, video_height = 0, and video_min_dimension = 0, keep the original frame resolution
- when video_width = 0, video_height = 0, and video_min_dimension != 0, keep the aspect ratio and resize
the frame so that shorter edge size is video_min_dimension
- When video_width = 0, and video_height != 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_height
- When video_width != 0, and video_height == 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_width
- When video_width != 0, and video_height != 0, resize the frame so that frame video_width and video_height
are set to $video_width and $video_height, respectively
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the original frame resolution
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension = 0, keep the aspect ratio and resize the
frame so that shorter edge size is video_min_dimension
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension != 0, keep the aspect ratio and resize
the frame so that longer edge size is video_max_dimension
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension != 0, resize the frame so that shorter
edge size is video_min_dimension, and longer edge size is
video_max_dimension. The aspect ratio may not be preserved
- When video_width = 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_height is $video_height
- When video_width != 0, video_height == 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_width is $video_width
- When video_width != 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, resize the frame so that frame
video_width and video_height are set to $video_width and
$video_height, respectively
video_pts_range : list(int), optional
the start and end presentation timestamp of video stream
video_timebase: Fraction, optional
@@ -207,6 +221,7 @@ def _read_video_from_file(
video_width,
video_height,
video_min_dimension,
video_max_dimension,
video_pts_range[0],
video_pts_range[1],
video_timebase.numerator,
@@ -244,6 +259,7 @@ def _read_video_timestamps_from_file(filename):
0, # video_width
0, # video_height
0, # video_min_dimension
0, # video_max_dimension
0, # video_start_pts
-1, # video_end_pts
0, # video_timebase_num
@@ -282,6 +298,7 @@ def _read_video_from_memory(
video_width=0, # type: int
video_height=0, # type: int
video_min_dimension=0, # type: int
video_max_dimension=0, # type: int
video_pts_range=(0, -1), # type: List[int]
video_timebase_numerator=0, # type: int
video_timebase_denominator=1, # type: int
@@ -307,17 +324,30 @@ def _read_video_from_memory(
we seek the pts earlier by seek_frame_margin seconds
read_video_stream: int, optional
whether read video stream. If yes, set to 1. Otherwise, 0
video_width/video_height/video_min_dimension: int
video_width/video_height/video_min_dimension/video_max_dimension: int
together decide the size of decoded frames
- when video_width = 0, video_height = 0, and video_min_dimension = 0, keep the original frame resolution
- when video_width = 0, video_height = 0, and video_min_dimension != 0, keep the aspect ratio and resize
the frame so that shorter edge size is video_min_dimension
- When video_width = 0, and video_height != 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_height
- When video_width != 0, and video_height == 0, keep the aspect ratio and resize the frame
so that frame video_height is $video_width
- When video_width != 0, and video_height != 0, resize the frame so that frame video_width and video_height
are set to $video_width and $video_height, respectively
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the original frame resolution
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension = 0, keep the aspect ratio and resize the
frame so that shorter edge size is video_min_dimension
- When video_width = 0, video_height = 0, video_min_dimension = 0,
and video_max_dimension != 0, keep the aspect ratio and resize
the frame so that longer edge size is video_max_dimension
- When video_width = 0, video_height = 0, video_min_dimension != 0,
and video_max_dimension != 0, resize the frame so that shorter
edge size is video_min_dimension, and longer edge size is
video_max_dimension. The aspect ratio may not be preserved
- When video_width = 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_height is $video_height
- When video_width != 0, video_height == 0, video_min_dimension = 0,
and video_max_dimension = 0, keep the aspect ratio and resize
the frame so that frame video_width is $video_width
- When video_width != 0, video_height != 0, video_min_dimension = 0,
and video_max_dimension = 0, resize the frame so that frame
video_width and video_height are set to $video_width and
$video_height, respectively
video_pts_range : list(int), optional
the start and end presentation timestamp of video stream
video_timebase_numerator / video_timebase_denominator: optional
@@ -353,6 +383,7 @@ def _read_video_from_memory(
video_width,
video_height,
video_min_dimension,
video_max_dimension,
video_pts_range[0],
video_pts_range[1],
video_timebase_numerator,
@@ -394,6 +425,7 @@ def _read_video_timestamps_from_memory(video_data):
0, # video_width
0, # video_height
0, # video_min_dimension
0, # video_max_dimension
0, # video_start_pts
-1, # video_end_pts
0, # video_timebase_num
@@ -370,15 +370,17 @@ def read_video_from_memory(
audio_pts_range=(0, -1), # type: List[int]
audio_timebase_numerator=0, # type: int
audio_timebase_denominator=1, # type: int
video_max_dimension=0, # type: int
):
# type: (...) -> Tuple[torch.Tensor, torch.Tensor]
return _video_opt._read_video_from_memory(
video_data,
seek_frame_margin,
read_audio_stream,
read_video_stream,
video_width,
video_height,
video_min_dimension,
video_max_dimension,
video_pts_range,
video_timebase_numerator,
video_timebase_denominator,