Commit 72d3fe09 authored by moto, committed by Facebook GitHub Bot

Improve the performance of YUV420P frame conversion (#3342)

Summary:
This commit improves the performance of converting YUV420P-format frames from AVFrame to torch Tensor.

It changes two things:
1. Change the implementation of nearest-neighbor upsampling from `torch::nn::functional::interpolate` to manual data copy.
2. Get rid of the intermediate UV plane copy.
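The second change relies on the fact that a decoded frame plane with row padding (`linesize >= width`) can be viewed in place with explicit strides instead of being copied into a contiguous buffer first. A minimal Python sketch of the same idea, where `as_strided` plays the role that `torch::from_blob` with custom strides plays in the C++ change (the sizes here are made up for illustration):

```python
import torch

# A decoded frame plane is often stored with row padding: linesize >= width.
height, width, linesize = 4, 6, 8
raw = torch.arange(height * linesize, dtype=torch.uint8)  # stand-in for src->data[i]

# Borrow the valid region without copying, by giving the row stride explicitly.
plane = raw.as_strided(size=(height, width), stride=(linesize, 1))

assert plane.data_ptr() == raw.data_ptr()  # no copy: same underlying memory
assert plane[1, 0].item() == linesize      # row 1 starts linesize bytes in
```

A subsequent `copy_` out of such a view reads straight from the frame buffer, so the intermediate plane copy disappears.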

The following compares the time it takes to process 30 seconds of YUV420P frames at 25 FPS and a resolution of 320x240. The measured times are sorted in ascending order.

Some observations:
* `torch::nn::functional::interpolate` with the `torch::kNearest` option is not as fast as copying the data manually.
* Switching from `interpolate` to manual data copy reduces the variance.

run | main | 1 | 1+2 | improvement (from main to 1+2)
-- | -- | -- | -- | --
1 | 0.452250583 | 0.417490125 | 0.40155375 | 11.21%
2 | 0.462039958 | 0.42006675 | 0.401764125 | 13.05%
3 | 0.463067666 | 0.42416 | 0.402651334 | 13.05%
4 | 0.464228166 | 0.424545458 | 0.402985667 | 13.19%
5 | 0.465777375 | 0.425629208 | 0.405604625 | 12.92%
6 | 0.469628666 | 0.427044333 | 0.40628525 | 13.49%
7 | 0.475935125 | 0.42805875 | 0.406412167 | 14.61%
8 | 0.482277667 | 0.429921209 | 0.407279 | 15.55%
9 | 0.496695208 | 0.431182792 | 0.442013791 | 11.01%
10 | 0.546653625 | 0.541639584 | 0.4711585 | 13.81%

[unit: second]

When the resolution is increased, the improvement is smaller but consistent.

run | main | 1+2 | improvement
-- | -- | -- | --
1 | 4.032393 | 3.991784667 | 1.01%
2 | 4.052248084 | 3.992672208 | 1.47%
3 | 4.07705575 | 4.000541666 | 1.88%
4 | 4.143954792 | 4.020671584 | 2.98%
5 | 4.170711959 | 4.025753125 | 3.48%
6 | 4.240229292 | 4.045504875 | 4.59%
7 | 4.267384042 | 4.045588125 | 5.20%
8 | 4.277025958 | 4.061980083 | 5.03%
9 | 4.312192042 | 4.163251959 | 3.45%
10 | 4.406109875 | 4.312560334 | 2.12%

<details><summary>code</summary>

```python
import time

from torchaudio.io import StreamReader

def test():
    r = StreamReader(src="testsrc=duration=30", format="lavfi")
    # r = StreamReader(src="testsrc=duration=30:size=1080x720", format="lavfi")
    r.add_video_stream(-1, filter_desc="format=yuv420p")
    t0 = time.monotonic()
    r.process_all_packets()
    elapsed = time.monotonic() - t0
    print(elapsed)

for _ in range(10):
    test()
```
</details>

<details><summary>env</summary>

```
PyTorch version: 2.1.0.dev20230325
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.9.16 (main, Mar  8 2023, 04:29:24)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] torch==2.1.0.dev20230325
[pip3] torchaudio==2.1.0a0+541b525
[conda] pytorch                   2.1.0.dev20230325         py3.9_0    pytorch-nightly
[conda] torchaudio                2.1.0a0+541b525           dev_0    <develop>
```

</details>

Pull Request resolved: https://github.com/pytorch/audio/pull/3342

Reviewed By: xiaohui-zhang

Differential Revision: D45947716

Pulled By: mthrok

fbshipit-source-id: 17e5930f57544b4f2e48a9b2185464694a88ab68
parent c11661e0
@@ -205,9 +205,7 @@ torch::Tensor PlanarImageConverter::convert(const AVFrame* src) {
 ////////////////////////////////////////////////////////////////////////////////
 // YUV420P
 ////////////////////////////////////////////////////////////////////////////////
-YUV420PConverter::YUV420PConverter(int h, int w)
-    : ImageConverterBase(h, w, 3),
-      tmp_uv(get_image_buffer({1, 2, height / 2, width / 2})) {
+YUV420PConverter::YUV420PConverter(int h, int w) : ImageConverterBase(h, w, 3) {
   TORCH_WARN_ONCE(
       "The output format YUV420P is selected. "
       "This will be implicitly converted to YUV444P, "
@@ -234,33 +232,37 @@ void YUV420PConverter::convert(const AVFrame* src, torch::Tensor& dst) {
       p_src += src->linesize[0];
     }
   }
-  // Write intermediate UV plane
-  {
-    uint8_t* p_dst = tmp_uv.data_ptr<uint8_t>();
-    uint8_t* p_src = src->data[1];
-    for (int h = 0; h < height / 2; ++h) {
-      memcpy(p_dst, p_src, width / 2);
-      p_dst += width / 2;
-      p_src += src->linesize[1];
-    }
-    p_src = src->data[2];
-    for (int h = 0; h < height / 2; ++h) {
-      memcpy(p_dst, p_src, width / 2);
-      p_dst += width / 2;
-      p_src += src->linesize[2];
-    }
+  // Chroma (U and V planes) are subsampled by 2 in both the vertical and
+  // horizontal directions.
+  // https://en.wikipedia.org/wiki/Chroma_subsampling
+  // Since we are returning data in a Tensor, which has the same size for all
+  // color planes, we need to upsample the UV planes. PyTorch has an
+  // interpolate function, but it does not work for the int16 type, so we copy
+  // the data manually.
+  //
+  //             block1  block2  block3  block4
+  // ab -> aabb  =  a b  *  a b  *  *
+  // cd    aabb     a b     a b
+  //       ccdd     c d     c d
+  //       ccdd     c d     c d
+  //
+  auto block00 = dst.slice(2, 0, {}, 2).slice(3, 0, {}, 2);
+  auto block01 = dst.slice(2, 0, {}, 2).slice(3, 1, {}, 2);
+  auto block10 = dst.slice(2, 1, {}, 2).slice(3, 0, {}, 2);
+  auto block11 = dst.slice(2, 1, {}, 2).slice(3, 1, {}, 2);
+  for (int i = 1; i < 3; ++i) {
+    // borrow data
+    auto tmp = torch::from_blob(
+        src->data[i],
+        {height / 2, width / 2},
+        {src->linesize[i], 1},
+        [](void*) {},
+        torch::TensorOptions().dtype(torch::kUInt8).layout(torch::kStrided));
+    // Copy to each block
+    block00.slice(1, i, i + 1).copy_(tmp);
+    block01.slice(1, i, i + 1).copy_(tmp);
+    block10.slice(1, i, i + 1).copy_(tmp);
+    block11.slice(1, i, i + 1).copy_(tmp);
+  }
-    // Upsample width and height
-    namespace F = torch::nn::functional;
-    torch::Tensor uv = F::interpolate(
-        tmp_uv,
-        F::InterpolateFuncOptions()
-            .mode(torch::kNearest)
-            .size(std::vector<int64_t>({height, width})));
-    // Write to the UV plane
-    // dst[:, 1:] = uv
-    using namespace torch::indexing;
-    dst.index_put_({Slice(), Slice(1)}, uv);
-  }
 }
 
 torch::Tensor YUV420PConverter::convert(const AVFrame* src) {
@@ -65,8 +65,6 @@ struct PlanarImageConverter : public ImageConverterBase {
 // Family of YUVs - NCHW
 ////////////////////////////////////////////////////////////////////////////////
 class YUV420PConverter : public ImageConverterBase {
-  torch::Tensor tmp_uv;
-
 public:
   YUV420PConverter(int height, int width);
   void convert(const AVFrame* src, torch::Tensor& dst);