Improve the performance of YUV420P frame conversion (#3342)
Summary:
This commit improve the performance of conversions of YUV420P format from AVFrame to torch Tensor.
It changes two things;
1. Change the implementation of nearest-neighbor upsampling from `torch::nn::functional::interpolate` to manual data copy.
2. Get rid of intermediate UV plane copy
The following compares the time it takes to process 30 seconds of YUV420P frame at 25 FPS of resolution 320x240. The measurement times are sorted by values.
Some observations
* `torch::nn::functional::interpolate` with `torch::kNearest` option is not as fast as copying data manually.
* switching from `interpolate` to manual data copy reduces the variance.
run | main | 1 | 1+2 | improvement (from main to 1+2)
-- | -- | -- | -- | --
1 | 0.452250583 | 0.417490125 | 0.40155375 | 11.21%
2 | 0.462039958 | 0.42006675 | 0.401764125 | 13.05%
3 | 0.463067666 | 0.42416 | 0.402651334 | 13.05%
4 | 0.464228166 | 0.424545458 | 0.402985667 | 13.19%
5 | 0.465777375 | 0.425629208 | 0.405604625 | 12.92%
6 | 0.469628666 | 0.427044333 | 0.40628525 | 13.49%
7 | 0.475935125 | 0.42805875 | 0.406412167 | 14.61%
8 | 0.482277667 | 0.429921209 | 0.407279 | 15.55%
9 | 0.496695208 | 0.431182792 | 0.442013791 | 11.01%
10 | 0.546653625 | 0.541639584 | 0.4711585 | 13.81%
[second]
Increasing the resolution, the improvement is smaller but is consistent.
run | main | 1+2 | improvement
-- | -- | -- | --
1 | 4.032393 | 3.991784667 | 1.01%
2 | 4.052248084 | 3.992672208 | 1.47%
3 | 4.07705575 | 4.000541666 | 1.88%
4 | 4.143954792 | 4.020671584 | 2.98%
5 | 4.170711959 | 4.025753125 | 3.48%
6 | 4.240229292 | 4.045504875 | 4.59%
7 | 4.267384042 | 4.045588125 | 5.20%
8 | 4.277025958 | 4.061980083 | 5.03%
9 | 4.312192042 | 4.163251959 | 3.45%
10 | 4.406109875 | 4.312560334 | 2.12%
<details><summary>code</summary>
```python
import time
from torchaudio.io import StreamReader
def test():
r = StreamReader(src="testsrc=duration=30", format="lavfi")
# r = StreamReader(src="testsrc=duration=30:size=1080x720", format="lavfi")
r.add_video_stream(-1, filter_desc="format=yuv420p")
t0 = time.monotonic()
r.process_all_packets()
elapsed = time.monotonic() - t0
print(elapsed)
for _ in range(10):
test()
```
</details>
<details><summary>env</summary>
```
PyTorch version: 2.1.0.dev20230325
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A
Python version: 3.9.16 (main, Mar 8 2023, 04:29:24) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1
Versions of relevant libraries:
[pip3] torch==2.1.0.dev20230325
[pip3] torchaudio==2.1.0a0+541b525
[conda] pytorch 2.1.0.dev20230325 py3.9_0 pytorch-nightly
[conda] torchaudio 2.1.0a0+541b525 dev_0 <develop>
```
</details>
Pull Request resolved: https://github.com/pytorch/audio/pull/3342
Reviewed By: xiaohui-zhang
Differential Revision: D45947716
Pulled By: mthrok
fbshipit-source-id: 17e5930f57544b4f2e48a9b2185464694a88ab68
Showing
Please register or sign in to comment