Commit c11661e0 authored by moto, committed by Facebook GitHub Bot

Improve the performance of NV12 frame conversion (#3344)

Summary:
Similar to https://github.com/pytorch/audio/pull/3342, this commit improves the performance of NV12 frame conversion.

It changes two things:

- Change the implementation of nearest-neighbor upsampling of the UV plane from `torch::nn::functional::interpolate` to manual strided data copies.
- Get rid of the intermediate UV plane copy by wrapping the frame data directly with `torch::from_blob` (a Python sketch of both ideas follows below).
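
The UV handling can be pictured with a short PyTorch sketch. This is an illustration of the idea only (made-up sizes and random data), not the library's C++ code: the interleaved, half-resolution UV plane of an NV12 frame is de-interleaved with `view`/`permute`, and four strided copies write it into every other row and column of the full-resolution output, which is exactly nearest-neighbor 2x upsampling.

<details><summary>sketch</summary>

```python
import torch
import torch.nn.functional as F

H, W = 240, 320  # example frame size; any even height/width works

# Stand-in for the NV12 UV plane: H/2 rows of interleaved U, V bytes (W bytes per row).
uv_plane = torch.randint(0, 256, (H // 2, W), dtype=torch.uint8)

# De-interleave U and V: (H/2, W/2, 2) -> (2, H/2, W/2).
uv = uv_plane.view(H // 2, W // 2, 2).permute(2, 0, 1)

# Destination U and V planes at full resolution.
dst_uv = torch.empty(2, H, W, dtype=torch.uint8)

# Four strided copies replicate each chroma sample into a 2x2 block,
# i.e. nearest-neighbor upsampling by a factor of two.
dst_uv[:, 0::2, 0::2] = uv
dst_uv[:, 0::2, 1::2] = uv
dst_uv[:, 1::2, 0::2] = uv
dst_uv[:, 1::2, 1::2] = uv

# Sanity check against interpolate(mode="nearest"), the approach used before this change.
ref = F.interpolate(uv[None].float(), size=(H, W), mode="nearest")[0].to(torch.uint8)
assert torch.equal(dst_uv, ref)
```

</details>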

With a 320x240 video:

run | main | pr | improvement
-- | -- | -- | --
1 | 0.600671417 | 0.464993125 | 22.59%
2 | 0.638846084 | 0.456763542 | 28.50%
3 | 0.64158175 | 0.458295333 | 28.57%
4 | 0.649868584 | 0.455450583 | 29.92%
5 | 0.612171333 | 0.462435625 | 24.46%
6 | 0.6128095 | 0.456716166 | 25.47%
7 | 0.632084583 | 0.463357083 | 26.69%
8 | 0.610733083 | 0.46148625 | 24.44%
9 | 0.613825834 | 0.4559555 | 25.72%
10 | 0.653857458 | 0.455375375 | 30.36%

Times are in seconds; improvement = (main - pr) / main.

With a 1080x720 video:

run | main | pr | improvement
-- | -- | -- | --
1 | 4.984154333 | 4.21090375 | 15.51%
2 | 4.988090625 | 4.239649375 | 15.00%
3 | 4.988896375 | 4.227277458 | 15.27%
4 | 4.998186584 | 4.161077042 | 16.75%
5 | 5.06180425 | 4.191672584 | 17.19%
6 | 5.108769667 | 4.198468458 | 17.82%
7 | 5.151363625 | 4.181942167 | 18.82%
8 | 5.199527875 | 4.239319084 | 18.47%
9 | 5.224903708 | 4.194901959 | 19.71%
10 | 5.333422583 | 4.320925792 | 18.98%

Times are in seconds.

<details><summary>code</summary>

```python
import time

from torchaudio.io import StreamReader

def test():
    # Decode a 30-second synthetic test pattern from FFmpeg's lavfi "testsrc" source.
    r = StreamReader(src="testsrc=duration=30", format="lavfi")
    # r = StreamReader(src="testsrc=duration=30:size=1080x720", format="lavfi")
    # Request NV12 frames so that the NV12 conversion path is exercised.
    r.add_video_stream(-1, filter_desc="format=nv12")
    t0 = time.monotonic()
    r.process_all_packets()
    elapsed = time.monotonic() - t0
    print(elapsed)

for _ in range(10):
    test()
```
</details>

<details><summary>env</summary>

```
PyTorch version: 2.1.0.dev20230325
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.9.16 (main, Mar  8 2023, 04:29:24)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] torch==2.1.0.dev20230325
[pip3] torchaudio==2.1.0a0+541b525
[conda] pytorch                   2.1.0.dev20230325         py3.9_0    pytorch-nightly
[conda] torchaudio                2.1.0a0+541b525           dev_0    <develop>
```

</details>

Pull Request resolved: https://github.com/pytorch/audio/pull/3344

Reviewed By: xiaohui-zhang

Differential Revision: D45948511

Pulled By: mthrok

fbshipit-source-id: ae9b300cbcb4295f3f7470736f258280005a21e5
Parent: 3ffd76c8

```diff
@@ -344,9 +344,7 @@ torch::Tensor YUV420P10LEConverter::convert(const AVFrame* src) {
 ////////////////////////////////////////////////////////////////////////////////
 // NV12
 ////////////////////////////////////////////////////////////////////////////////
-NV12Converter::NV12Converter(int h, int w)
-    : ImageConverterBase(h, w, 3),
-      tmp_uv(get_image_buffer({1, height / 2, width / 2, 2})) {
+NV12Converter::NV12Converter(int h, int w) : ImageConverterBase(h, w, 3) {
   TORCH_WARN_ONCE(
       "The output format NV12 is selected. "
       "This will be implicitly converted to YUV444P, "
@@ -375,26 +373,19 @@ void NV12Converter::convert(const AVFrame* src, torch::Tensor& dst) {
   }
   // Write intermediate UV plane
   {
-    uint8_t* p_dst = tmp_uv.data_ptr<uint8_t>();
-    uint8_t* p_src = src->data[1];
-    for (int h = 0; h < height / 2; ++h) {
-      memcpy(p_dst, p_src, width);
-      p_dst += width;
-      p_src += src->linesize[1];
-    }
+    auto tmp = torch::from_blob(
+        src->data[1],
+        {height / 2, width},
+        {src->linesize[1], 1},
+        [](void*) {},
+        torch::TensorOptions().dtype(torch::kUInt8).layout(torch::kStrided));
+    tmp = tmp.view({1, height / 2, width / 2, 2}).permute({0, 3, 1, 2});
+    auto dst_uv = dst.slice(1, 1, 3);
+    dst_uv.slice(2, 0, {}, 2).slice(3, 0, {}, 2).copy_(tmp);
+    dst_uv.slice(2, 0, {}, 2).slice(3, 1, {}, 2).copy_(tmp);
+    dst_uv.slice(2, 1, {}, 2).slice(3, 0, {}, 2).copy_(tmp);
+    dst_uv.slice(2, 1, {}, 2).slice(3, 1, {}, 2).copy_(tmp);
   }
-  // Upsample width and height
-  namespace F = torch::nn::functional;
-  torch::Tensor uv = F::interpolate(
-      tmp_uv.permute({0, 3, 1, 2}),
-      F::InterpolateFuncOptions()
-          .mode(torch::kNearest)
-          .size(std::vector<int64_t>({height, width})));
-  // Write to the UV plane
-  // dst[:, 1:] = uv
-  using namespace torch::indexing;
-  dst.index_put_({Slice(), Slice(1)}, uv);
 }

 torch::Tensor NV12Converter::convert(const AVFrame* src) {
@@ -81,8 +81,6 @@ class YUV420P10LEConverter : public ImageConverterBase {
 };

 class NV12Converter : public ImageConverterBase {
-  torch::Tensor tmp_uv;
-
  public:
   NV12Converter(int height, int width);
   void convert(const AVFrame* src, torch::Tensor& dst);
```
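The other half of the change, reading the FFmpeg buffer in place, can be sketched in Python as well. This is a rough analogue with made-up values, not the actual binding: the decoder stores each UV row with a pitch (`AVFrame::linesize[1]`) that may exceed the visible width, so the C++ code wraps that buffer with `torch::from_blob` and a row stride of `linesize[1]` instead of memcpy-ing it into a temporary tensor. `as_strided` below stands in for `from_blob` with explicit strides.

```python
import torch

H, W = 240, 320
linesize = 384  # hypothetical row pitch; FFmpeg may pad rows beyond W

# Stand-in for the decoder's UV buffer: H/2 padded rows laid out back to back.
raw = torch.randint(0, 256, (H // 2 * linesize,), dtype=torch.uint8)

# View the first W bytes of each padded row without copying, analogous to
# torch::from_blob(src->data[1], {height / 2, width}, {linesize, 1}, ...).
uv_plane = raw.as_strided((H // 2, W), (linesize, 1))

# From here, the de-interleave and the four strided copies from the sketch
# in the summary above apply unchanged.
uv = uv_plane.view(H // 2, W // 2, 2).permute(2, 0, 1)
print(uv.shape)  # torch.Size([2, 120, 160])
```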