Update precise seek behavior for t=0 (#2915)

Summary: It was reported that when videos with invalid PTS values are fed to StreamReader, StreamReader returns only the last frame. https://github.com/pytorch/vision/blob/677fc939b21a8893f07db4c1f90482b648b6573f/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi ``` import torchaudio src = "RATRACE_wave_f_nm_np1_fr_goo_37.avi" streamer = torchaudio.io.StreamReader(src=src) streamer.add_basic_video_stream(frames_per_chunk=-1) streamer.process_all_packets() video, = streamer.pop_chunks() print(video.size(0)) # prints 1, but there are more than 70 frames ``` The reason why all the frames are not returned is due to invalid PTS values. All the frames's PTS values are `-9223372036854775808` so the internal mechanism discards them. The reason why the last frame is output is because when entering drain mode, the discard value of -1 is used, which is interpreted as no discard. For the second issue, the discard behavior should be consistent across regular decoding and drain mode. For the first issue, although the normal behavior is not guaranteed for such invalid input, we can support the case where one reads video from start (or when one seeks into t=0) --- This commits make the following changes to address the above two. 1. Define the discard_before_pts attribtue on StreamProcessor, so that StreamProcessor is aware of the discard behavior without being told by StreamReader, and its behavior is consistent between regular decoding and drain. This gets rid of the discard_before_pts computation that is currently happening at the every time a frame is processed, so this should improve the peformance a bit. 2. Change the meaning of discard_before_pts, so that when it's 0, no discard happens. With this change, the negative value is not necessary so we put it a UB status. Note: Even with those changes seeking videos with invalid PTS is not plausible, client codes can implement a fallback which decodes frames first and discard undesired ones. Pull Request resolved: https://github.com/pytorch/audio/pull/2915 Reviewed By: nateanl Differential Revision: D41957784 Pulled By: mthrok fbshipit-source-id: 2dafdbada5aa33bfc81c986306f80642ba6277df

Update precise seek behavior for t=0 (#2915)
Summary: It was reported that when videos with invalid PTS values are fed to StreamReader, StreamReader returns only the last frame. https://github.com/pytorch/vision/blob/677fc939b21a8893f07db4c1f90482b648b6573f/test/assets/videos/RATRACE_wave_f_nm_np1_fr_goo_37.avi ``` import torchaudio src = "RATRACE_wave_f_nm_np1_fr_goo_37.avi" streamer = torchaudio.io.StreamReader(src=src) streamer.add_basic_video_stream(frames_per_chunk=-1) streamer.process_all_packets() video, = streamer.pop_chunks() print(video.size(0)) # prints 1, but there are more than 70 frames ``` The reason why all the frames are not returned is due to invalid PTS values. All the frames's PTS values are `-9223372036854775808` so the internal mechanism discards them. The reason why the last frame is output is because when entering drain mode, the discard value of -1 is used, which is interpreted as no discard. For the second issue, the discard behavior should be consistent across regular decoding and drain mode. For the first issue, although the normal behavior is not guaranteed for such invalid input, we can support the case where one reads video from start (or when one seeks into t=0) --- This commits make the following changes to address the above two. 1. Define the discard_before_pts attribtue on StreamProcessor, so that StreamProcessor is aware of the discard behavior without being told by StreamReader, and its behavior is consistent between regular decoding and drain. This gets rid of the discard_before_pts computation that is currently happening at the every time a frame is processed, so this should improve the peformance a bit. 2. Change the meaning of discard_before_pts, so that when it's 0, no discard happens. With this change, the negative value is not necessary so we put it a UB status. Note: Even with those changes seeking videos with invalid PTS is not plausible, client codes can implement a fallback which decodes frames first and discard undesired ones. Pull Request resolved: https://github.com/pytorch/audio/pull/2915 Reviewed By: nateanl Differential Revision: D41957784 Pulled By: mthrok fbshipit-source-id: 2dafdbada5aa33bfc81c986306f80642ba6277df
cbd35438 · moto · Facebook GitHub Bot · 123c7319 · cbd35438 · cbd35438
Commit cbd35438 authored Dec 12, 2022 by moto Committed by Facebook GitHub Bot Dec 12, 2022
4 changed files
--- a/torchaudio/csrc/ffmpeg/stream_reader/stream_processor.cpp
+++ b/torchaudio/csrc/ffmpeg/stream_reader/stream_processor.cpp
@@ -47,6 +47,12 @@ void StreamProcessor::remove_stream(KeyType key) {
  sinks.erase(key);
 }

+void StreamProcessor::set_discard_timestamp(int64_t timestamp) {
+  TORCH_CHECK(timestamp >= 0, "timestamp must be non-negative.");
+  discard_before_pts =
+      av_rescale_q(timestamp, av_get_time_base_q(), stream->time_base);
+}
+
 ////////////////////////////////////////////////////////////////////////////////
 // Query methods
 ////////////////////////////////////////////////////////////////////////////////
@@ -68,9 +74,7 @@ bool StreamProcessor::is_buffer_ready() const {
 ////////////////////////////////////////////////////////////////////////////////
 // 0: some kind of success
 // <0: Some error happened
-int StreamProcessor::process_packet(
-    AVPacket* packet,
-    int64_t discard_before_pts) {
+int StreamProcessor::process_packet(AVPacket* packet) {
  int ret = decoder.process_packet(packet);
  while (ret >= 0) {
    ret = decoder.get_frame(pFrame1);
@@ -83,7 +87,22 @@ int StreamProcessor::process_packet(
    if (ret < 0)
      return ret;

-    if (discard_before_pts < 0 || pFrame1->pts >= discard_before_pts) {
+    // When the value of discard_before_pts is 0, we consider that the seek is
+    // not performed and all the frames are passed to downstream
+    // unconditionally.
+    //
+    // Two reasons for this behavior;
+    // 1. When seek mode is not precise, we do not discard any frame.
+    //    In this case discard_before_pts is set to zero.
+    // 2. When users seek to zero, what they expect is to get to the beginning
+    //    of the data.
+    //    There are many videos with invalid PTS values, such as
+    //    -9223372036854775808, and though it is not possible to seek videos
+    //    without decoding, we can still support `seek(0)` as a special case,
+    //    and just not discard any.
+    //
+    // Note: discard_before_pts < 0 is UB.
+    if (discard_before_pts <= 0 || pFrame1->pts >= discard_before_pts) {
      send_frame(pFrame1);
    }


--- a/torchaudio/csrc/ffmpeg/stream_reader/stream_processor.h
+++ b/torchaudio/csrc/ffmpeg/stream_reader/stream_processor.h
@@ -25,6 +25,13 @@ class StreamProcessor {
  KeyType current_key = 0;
  std::map<KeyType, Sink> sinks;

+  // Used for precise seek.
+  // 0: no discard
+  // Positive Values: decoded frames with PTS values less than this are
+  // discarded.
+  // Negative values: UB. Should not happen.
+  int64_t discard_before_pts = 0;
+
 public:
  StreamProcessor(
      AVStream* stream,
@@ -57,6 +64,10 @@ class StreamProcessor {
  // 1. Remove the stream
  void remove_stream(KeyType key);

+  // Set discard
+  // The input timestamp must be expressed in AV_TIME_BASE unit.
+  void set_discard_timestamp(int64_t timestamp);
+
  //////////////////////////////////////////////////////////////////////////////
  // Query methods
  //////////////////////////////////////////////////////////////////////////////
@@ -70,7 +81,7 @@ class StreamProcessor {
  // 2. pass the decoded data to filters
  // 3. each filter store the result to the corresponding buffer
  // - Sending NULL will drain (flush) the internal
-  int process_packet(AVPacket* packet, int64_t discard_before_pts = -1);
+  int process_packet(AVPacket* packet);

  // flush the internal buffer of decoder.
  // To be use when seeking

--- a/torchaudio/csrc/ffmpeg/stream_reader/stream_reader.cpp
+++ b/torchaudio/csrc/ffmpeg/stream_reader/stream_reader.cpp
@@ -174,13 +174,13 @@ void StreamReader::seek(double timestamp_s, int64_t mode) {
  int flag = AVSEEK_FLAG_BACKWARD;
  switch (mode) {
    case 0:
-      seek_timestamp =
-          -1; // reset seek_timestap as it is only used for precise seek
+      // reset seek_timestap as it is only used for precise seek
+      seek_timestamp = 0;
      break;
    case 1:
      flag |= AVSEEK_FLAG_ANY;
-      seek_timestamp =
-          -1; // reset seek_timestap as it is only used for precise seek
+      // reset seek_timestap as it is only used for precise seek
+      seek_timestamp = 0;
      break;
    case 2:
      seek_timestamp = timestamp_av_tb;
@@ -192,12 +192,13 @@ void StreamReader::seek(double timestamp_s, int64_t mode) {
  int ret = av_seek_frame(pFormatContext, -1, timestamp_av_tb, flag);

  if (ret < 0) {
-    seek_timestamp = -1;
+    seek_timestamp = 0;
    TORCH_CHECK(false, "Failed to seek. (" + av_err2string(ret) + ".)");
  }
  for (const auto& it : processors) {
    if (it) {
      it->flush();
+      it->set_discard_timestamp(seek_timestamp);
    }
  }
 }
@@ -278,6 +279,7 @@ void StreamReader::add_stream(
  if (!processors[i]) {
    processors[i] = std::make_unique<StreamProcessor>(
        stream, decoder, decoder_option, device);
+    processors[i]->set_discard_timestamp(seek_timestamp);
  }
  stream->discard = AVDISCARD_DEFAULT;
  int key = processors[i]->add_stream(
@@ -328,12 +330,7 @@ int StreamReader::process_packet() {
    return 0;
  }

-  AVRational stream_tb =
-      pFormatContext->streams[pPacket->stream_index]->time_base;
-  int64_t seek_timestamp_in_stream_tb =
-      av_rescale_q(seek_timestamp, av_get_time_base_q(), stream_tb);
-
-  ret = processor->process_packet(packet, seek_timestamp_in_stream_tb);
+  ret = processor->process_packet(packet);

  return (ret < 0) ? ret : 0;
 }

--- a/torchaudio/csrc/ffmpeg/stream_reader/stream_reader.h
+++ b/torchaudio/csrc/ffmpeg/stream_reader/stream_reader.h
@@ -18,9 +18,16 @@ class StreamReader {
  std::vector<std::pair<int, int>> stream_indices;

  // timestamp to seek to expressed in AV_TIME_BASE
-  // < 0 : No seek
+  //
+  // 0 : No seek
  // Positive value: Skip AVFrames with timestamps before it
-  int64_t seek_timestamp = -1;
+  // Negative value: UB. Should not happen
+  //
+  // Note:
+  // When precise seek is performed, this value is set to the value provided
+  // by client code, and PTS values of decoded frames are compared against it
+  // to determine whether the frames should be passed to downstream.
+  int64_t seek_timestamp = 0;

 public:
  explicit StreamReader(AVFormatInputContextPtr&& p);