Improve the performance of YUV420P frame conversion (#3342)

Summary:
This commit improves the performance of converting YUV420P frames from AVFrame to torch Tensor.

It changes two things:

1. Change the implementation of nearest-neighbor upsampling from `torch::nn::functional::interpolate` to manual data copy.
2. Get rid of the intermediate UV plane copy.

The following compares the time it takes to process 30 seconds of YUV420P frames at 25 FPS with a resolution of 320x240. The measured times are sorted by value and given in seconds.

Some observations:

* `torch::nn::functional::interpolate` with the `torch::kNearest` option is not as fast as copying data manually.
* Switching from `interpolate` to manual data copy reduces the variance.

run | main | 1 | 1+2 | improvement (from main to 1+2)
-- | -- | -- | -- | --
1 | 0.452250583 | 0.417490125 | 0.40155375 | 11.21%
2 | 0.462039958 | 0.42006675 | 0.401764125 | 13.05%
3 | 0.463067666 | 0.42416 | 0.402651334 | 13.05%
4 | 0.464228166 | 0.424545458 | 0.402985667 | 13.19%
5 | 0.465777375 | 0.425629208 | 0.405604625 | 12.92%
6 | 0.469628666 | 0.427044333 | 0.40628525 | 13.49%
7 | 0.475935125 | 0.42805875 | 0.406412167 | 14.61%
8 | 0.482277667 | 0.429921209 | 0.407279 | 15.55%
9 | 0.496695208 | 0.431182792 | 0.442013791 | 11.01%
10 | 0.546653625 | 0.541639584 | 0.4711585 | 13.81%

At a higher resolution, the improvement is smaller but still consistent.

run | main | 1+2 | improvement
-- | -- | -- | --
1 | 4.032393 | 3.991784667 | 1.01%
2 | 4.052248084 | 3.992672208 | 1.47%
3 | 4.07705575 | 4.000541666 | 1.88%
4 | 4.143954792 | 4.020671584 | 2.98%
5 | 4.170711959 | 4.025753125 | 3.48%
6 | 4.240229292 | 4.045504875 | 4.59%
7 | 4.267384042 | 4.045588125 | 5.20%
8 | 4.277025958 | 4.061980083 | 5.03%
9 | 4.312192042 | 4.163251959 | 3.45%
10 | 4.406109875 | 4.312560334 | 2.12%

<details><summary>code</summary>

```python
import time

from torchaudio.io import StreamReader


def test():
    r = StreamReader(src="testsrc=duration=30", format="lavfi")
    # r = StreamReader(src="testsrc=duration=30:size=1080x720", format="lavfi")
    r.add_video_stream(-1, filter_desc="format=yuv420p")
    t0 = time.monotonic()
    r.process_all_packets()
    elapsed = time.monotonic() - t0
    print(elapsed)


for _ in range(10):
    test()
```

</details>

<details><summary>env</summary>

```
PyTorch version: 2.1.0.dev20230325
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.9.16 (main, Mar 8 2023, 04:29:24) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] torch==2.1.0.dev20230325
[pip3] torchaudio==2.1.0a0+541b525
[conda] pytorch 2.1.0.dev20230325 py3.9_0 pytorch-nightly
[conda] torchaudio 2.1.0a0+541b525 dev_0 <develop>
```

</details>

Pull Request resolved: https://github.com/pytorch/audio/pull/3342

Reviewed By: xiaohui-zhang

Differential Revision: D45947716

Pulled By: mthrok

fbshipit-source-id: 17e5930f57544b4f2e48a9b2185464694a88ab68
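The improvement column appears to be the relative reduction in elapsed time, `(main - optimized) / main`. That formula is inferred from the reported numbers rather than stated in the commit message, and the helper name `improvement` is hypothetical. A minimal sketch that recomputes three rows of the 320x240 measurements:

```python
def improvement(baseline: float, optimized: float) -> float:
    """Relative reduction in elapsed time, as a percentage of the baseline."""
    return (baseline - optimized) / baseline * 100.0


# (main, 1+2) timings in seconds, copied from runs 1, 8, and 9 of the
# 320x240 measurements in the commit message.
runs = [
    (0.452250583, 0.40155375),
    (0.482277667, 0.407279),
    (0.496695208, 0.442013791),
]

# Round to two decimals, the precision used in the commit message.
percentages = [round(improvement(main, new), 2) for main, new in runs]
print(percentages)
```

Under this assumed formula, the three values agree with the 11.21%, 15.55%, and 11.01% reported for those runs.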

Commit 72d3fe09 authored May 17, 2023 by moto Committed by Facebook GitHub Bot May 17, 2023
Showing 2 changed files with 31 additions and 31 deletions

torchaudio/csrc/ffmpeg/stream_reader/conversion.cpp +31 -29

torchaudio/csrc/ffmpeg/stream_reader/conversion.h +0 -2

--- a/torchaudio/csrc/ffmpeg/stream_reader/conversion.cpp
+++ b/torchaudio/csrc/ffmpeg/stream_reader/conversion.cpp
@@ -205,9 +205,7 @@ torch::Tensor PlanarImageConverter::convert(const AVFrame* src) {
 ////////////////////////////////////////////////////////////////////////////////
 // YUV420P
 ////////////////////////////////////////////////////////////////////////////////
-YUV420PConverter::YUV420PConverter(int h, int w)
-    : ImageConverterBase(h, w, 3),
-      tmp_uv(get_image_buffer({1, 2, height / 2, width / 2})) {
+YUV420PConverter::YUV420PConverter(int h, int w) : ImageConverterBase(h, w, 3) {
  TORCH_WARN_ONCE(
      "The output format YUV420P is selected. "
      "This will be implicitly converted to YUV444P, "
@@ -234,33 +232,37 @@ void YUV420PConverter::convert(const AVFrame* src, torch::Tensor& dst) {
      p_src += src->linesize[0];
    }
  }
-  // Write intermediate UV plane
-  {
-    uint8_t* p_dst = tmp_uv.data_ptr<uint8_t>();
-    uint8_t* p_src = src->data[1];
-    for (int h = 0; h < height / 2; ++h) {
-      memcpy(p_dst, p_src, width / 2);
-      p_dst += width / 2;
-      p_src += src->linesize[1];
-    }
-    p_src = src->data[2];
-    for (int h = 0; h < height / 2; ++h) {
-      memcpy(p_dst, p_src, width / 2);
-      p_dst += width / 2;
-      p_src += src->linesize[2];
-    }
+  // Chroma (U and V planes) are subsamapled by 2 in both vertical and
+  // holizontal directions.
+  // https://en.wikipedia.org/wiki/Chroma_subsampling
+  // Since we are returning data in Tensor, which has the same size for all
+  // color planes, we need to upsample the UV planes. PyTorch has interpolate
+  // function but it does not work for int16 type. So we manually copy them.
+  //
+  //              block1  block2  block3  block4
+  // ab -> aabb = a  b   *  a  b *       *
+  // cd    aabb                   a  b      a  b
+  //       ccdd   c  d      c  d
+  //       ccdd                   c  d      c  d
+  //
+  auto block00 = dst.slice(2, 0, {}, 2).slice(3, 0, {}, 2);
+  auto block01 = dst.slice(2, 0, {}, 2).slice(3, 1, {}, 2);
+  auto block10 = dst.slice(2, 1, {}, 2).slice(3, 0, {}, 2);
+  auto block11 = dst.slice(2, 1, {}, 2).slice(3, 1, {}, 2);
+  for (int i = 1; i < 3; ++i) {
+    // borrow data
+    auto tmp = torch::from_blob(
+        src->data[i],
+        {height / 2, width / 2},
+        {src->linesize[i], 1},
+        [](void*) {},
+        torch::TensorOptions().dtype(torch::kUInt8).layout(torch::kStrided));
+    // Copy to each block
+    block00.slice(1, i, i + 1).copy_(tmp);
+    block01.slice(1, i, i + 1).copy_(tmp);
+    block10.slice(1, i, i + 1).copy_(tmp);
+    block11.slice(1, i, i + 1).copy_(tmp);
  }
-  // Upsample width and height
-  namespace F = torch::nn::functional;
-  torch::Tensor uv = F::interpolate(
-      tmp_uv,
-      F::InterpolateFuncOptions()
-          .mode(torch::kNearest)
-          .size(std::vector<int64_t>({height, width})));
-  // Write to the UV plane
-  // dst[:, 1:] = uv
-  using namespace torch::indexing;
-  dst.index_put_({Slice(), Slice(1)}, uv);
 }
 torch::Tensor YUV420PConverter::convert(const AVFrame* src) {

--- a/torchaudio/csrc/ffmpeg/stream_reader/conversion.h
+++ b/torchaudio/csrc/ffmpeg/stream_reader/conversion.h
@@ -65,8 +65,6 @@ struct PlanarImageConverter : public ImageConverterBase {
 // Family of YUVs - NCHW
 ////////////////////////////////////////////////////////////////////////////////
 class YUV420PConverter : public ImageConverterBase {
-  torch::Tensor tmp_uv;
 public:
  YUV420PConverter(int height, int width);
  void convert(const AVFrame* src, torch::Tensor& dst);
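
The block-copy upsampling in the `conversion.cpp` hunk can be illustrated outside of PyTorch. The following is a pure-Python sketch of the same idea, not the committed C++ code: each half-resolution chroma row is written to the even and odd columns of two adjacent output rows, which corresponds to the four strided blocks in the diff. The function name `upsample_chroma_2x` is hypothetical, `linesize` stands in for `src->linesize[i]` (row stride including padding), and the sketch assumes an even `height` and `width`.

```python
def upsample_chroma_2x(plane: bytes, linesize: int, height: int, width: int) -> list:
    """Nearest-neighbor 2x upsampling of one subsampled chroma plane.

    `plane` holds (height // 2) rows of `linesize` bytes each; only the
    first (width // 2) bytes of every row are pixel data, the rest is padding.
    Assumes `height` and `width` are even.
    """
    half_h, half_w = height // 2, width // 2
    dst = [bytearray(width) for _ in range(height)]
    for r in range(half_h):
        # Take the valid part of the source row, skipping the stride padding.
        src_row = plane[r * linesize : r * linesize + half_w]
        for row_offset in (0, 1):
            out = dst[2 * r + row_offset]
            out[0::2] = src_row  # even columns of this output row
            out[1::2] = src_row  # odd columns of this output row
    return dst


# 2x2 chroma plane "ab / cd" stored with a row stride of 4 bytes (2 padding bytes).
plane = bytes([1, 2, 0, 0, 3, 4, 0, 0])
upsampled = upsample_chroma_2x(plane, linesize=4, height=4, width=4)
for row in upsampled:
    print(list(row))
```

Every source pixel ends up in a 2x2 square of the output, matching the `ab / cd -> aabb / aabb / ccdd / ccdd` picture in the code comment. The four stepped slice assignments per source row play the role of `block00`, `block01`, `block10`, and `block11`.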