Commit c11661e0 authored by moto, committed by Facebook GitHub Bot

Improve the performance of NV12 frame conversion (#3344)

Summary:
Similar to https://github.com/pytorch/audio/pull/3342, this commit improves the performance of NV12 frame conversion.

It changes two things:

- Change the implementation of nearest-neighbor upsampling of the UV plane from `torch::nn::functional::interpolate` to manual strided data copies.
- Get rid of the intermediate UV plane copy by wrapping the frame data directly with `torch::from_blob` (a Python sketch of both ideas follows below).
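
The UV handling can be pictured with a short PyTorch sketch. This is an illustration of the idea only (made-up sizes and random data), not the library's C++ code: the interleaved, half-resolution UV plane of an NV12 frame is de-interleaved with `view`/`permute`, and four strided copies write it into every other row and column of the full-resolution output, which is exactly nearest-neighbor 2x upsampling.

<details><summary>sketch</summary>

```python
import torch
import torch.nn.functional as F

H, W = 240, 320  # example frame size; any even height/width works

# Stand-in for the NV12 UV plane: H/2 rows of interleaved U, V bytes (W bytes per row).
uv_plane = torch.randint(0, 256, (H // 2, W), dtype=torch.uint8)

# De-interleave U and V: (H/2, W/2, 2) -> (2, H/2, W/2).
uv = uv_plane.view(H // 2, W // 2, 2).permute(2, 0, 1)

# Destination U and V planes at full resolution.
dst_uv = torch.empty(2, H, W, dtype=torch.uint8)

# Four strided copies replicate each chroma sample into a 2x2 block,
# i.e. nearest-neighbor upsampling by a factor of two.
dst_uv[:, 0::2, 0::2] = uv
dst_uv[:, 0::2, 1::2] = uv
dst_uv[:, 1::2, 0::2] = uv
dst_uv[:, 1::2, 1::2] = uv

# Sanity check against interpolate(mode="nearest"), the approach used before this change.
ref = F.interpolate(uv[None].float(), size=(H, W), mode="nearest")[0].to(torch.uint8)
assert torch.equal(dst_uv, ref)
```

</details>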

With a 320x240 video:

run | main | pr | improvement
-- | -- | -- | --
1 | 0.600671417 | 0.464993125 | 22.59%
2 | 0.638846084 | 0.456763542 | 28.50%
3 | 0.64158175 | 0.458295333 | 28.57%
4 | 0.649868584 | 0.455450583 | 29.92%
5 | 0.612171333 | 0.462435625 | 24.46%
6 | 0.6128095 | 0.456716166 | 25.47%
7 | 0.632084583 | 0.463357083 | 26.69%
8 | 0.610733083 | 0.46148625 | 24.44%
9 | 0.613825834 | 0.4559555 | 25.72%
10 | 0.653857458 | 0.455375375 | 30.36%

Times are in seconds; improvement = (main - pr) / main.

With a 1080x720 video:

run | main | pr | improvement
-- | -- | -- | --
1 | 4.984154333 | 4.21090375 | 15.51%
2 | 4.988090625 | 4.239649375 | 15.00%
3 | 4.988896375 | 4.227277458 | 15.27%
4 | 4.998186584 | 4.161077042 | 16.75%
5 | 5.06180425 | 4.191672584 | 17.19%
6 | 5.108769667 | 4.198468458 | 17.82%
7 | 5.151363625 | 4.181942167 | 18.82%
8 | 5.199527875 | 4.239319084 | 18.47%
9 | 5.224903708 | 4.194901959 | 19.71%
10 | 5.333422583 | 4.320925792 | 18.98%

Times are in seconds.

<details><summary>code</summary>

```python
import time

from torchaudio.io import StreamReader

def test():
    # Decode a 30-second synthetic test pattern from FFmpeg's lavfi "testsrc" source.
    r = StreamReader(src="testsrc=duration=30", format="lavfi")
    # r = StreamReader(src="testsrc=duration=30:size=1080x720", format="lavfi")
    # Request NV12 frames so that the NV12 conversion path is exercised.
    r.add_video_stream(-1, filter_desc="format=nv12")
    t0 = time.monotonic()
    r.process_all_packets()
    elapsed = time.monotonic() - t0
    print(elapsed)

for _ in range(10):
    test()
```
</details>

<details><summary>env</summary>

```
PyTorch version: 2.1.0.dev20230325
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.6
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.9.16 (main, Mar  8 2023, 04:29:24)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1

Versions of relevant libraries:
[pip3] torch==2.1.0.dev20230325
[pip3] torchaudio==2.1.0a0+541b525
[conda] pytorch                   2.1.0.dev20230325         py3.9_0    pytorch-nightly
[conda] torchaudio                2.1.0a0+541b525           dev_0    <develop>
```

</details>

Pull Request resolved: https://github.com/pytorch/audio/pull/3344

Reviewed By: xiaohui-zhang

Differential Revision: D45948511

Pulled By: mthrok

fbshipit-source-id: ae9b300cbcb4295f3f7470736f258280005a21e5
Parent: 3ffd76c8

```diff
@@ -344,9 +344,7 @@ torch::Tensor YUV420P10LEConverter::convert(const AVFrame* src) {
 ////////////////////////////////////////////////////////////////////////////////
 // NV12
 ////////////////////////////////////////////////////////////////////////////////
-NV12Converter::NV12Converter(int h, int w)
-    : ImageConverterBase(h, w, 3),
-      tmp_uv(get_image_buffer({1, height / 2, width / 2, 2})) {
+NV12Converter::NV12Converter(int h, int w) : ImageConverterBase(h, w, 3) {
   TORCH_WARN_ONCE(
       "The output format NV12 is selected. "
       "This will be implicitly converted to YUV444P, "
@@ -375,26 +373,19 @@ void NV12Converter::convert(const AVFrame* src, torch::Tensor& dst) {
   }
   // Write intermediate UV plane
   {
-    uint8_t* p_dst = tmp_uv.data_ptr<uint8_t>();
-    uint8_t* p_src = src->data[1];
-    for (int h = 0; h < height / 2; ++h) {
-      memcpy(p_dst, p_src, width);
-      p_dst += width;
-      p_src += src->linesize[1];
-    }
+    auto tmp = torch::from_blob(
+        src->data[1],
+        {height / 2, width},
+        {src->linesize[1], 1},
+        [](void*) {},
+        torch::TensorOptions().dtype(torch::kUInt8).layout(torch::kStrided));
+    tmp = tmp.view({1, height / 2, width / 2, 2}).permute({0, 3, 1, 2});
+    auto dst_uv = dst.slice(1, 1, 3);
+    dst_uv.slice(2, 0, {}, 2).slice(3, 0, {}, 2).copy_(tmp);
+    dst_uv.slice(2, 0, {}, 2).slice(3, 1, {}, 2).copy_(tmp);
+    dst_uv.slice(2, 1, {}, 2).slice(3, 0, {}, 2).copy_(tmp);
+    dst_uv.slice(2, 1, {}, 2).slice(3, 1, {}, 2).copy_(tmp);
   }
-  // Upsample width and height
-  namespace F = torch::nn::functional;
-  torch::Tensor uv = F::interpolate(
-      tmp_uv.permute({0, 3, 1, 2}),
-      F::InterpolateFuncOptions()
-          .mode(torch::kNearest)
-          .size(std::vector<int64_t>({height, width})));
-  // Write to the UV plane
-  // dst[:, 1:] = uv
-  using namespace torch::indexing;
-  dst.index_put_({Slice(), Slice(1)}, uv);
 }

 torch::Tensor NV12Converter::convert(const AVFrame* src) {
@@ -81,8 +81,6 @@ class YUV420P10LEConverter : public ImageConverterBase {
 };

 class NV12Converter : public ImageConverterBase {
-  torch::Tensor tmp_uv;
-
  public:
   NV12Converter(int height, int width);
   void convert(const AVFrame* src, torch::Tensor& dst);
```
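The other half of the change, reading the FFmpeg buffer in place, can be sketched in Python as well. This is a rough analogue with made-up values, not the actual binding: the decoder stores each UV row with a pitch (`AVFrame::linesize[1]`) that may exceed the visible width, so the C++ code wraps that buffer with `torch::from_blob` and a row stride of `linesize[1]` instead of memcpy-ing it into a temporary tensor. `as_strided` below stands in for `from_blob` with explicit strides.

```python
import torch

H, W = 240, 320
linesize = 384  # hypothetical row pitch; FFmpeg may pad rows beyond W

# Stand-in for the decoder's UV buffer: H/2 padded rows laid out back to back.
raw = torch.randint(0, 256, (H // 2 * linesize,), dtype=torch.uint8)

# View the first W bytes of each padded row without copying, analogous to
# torch::from_blob(src->data[1], {height / 2, width}, {linesize, 1}, ...).
uv_plane = raw.as_strided((H // 2, W), (linesize, 1))

# From here, the de-interleave and the four strided copies from the sketch
# in the summary above apply unchanged.
uv = uv_plane.view(H // 2, W // 2, 2).permute(2, 0, 1)
print(uv.shape)  # torch.Size([2, 120, 160])
```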