    Add HW acceleration support on Streamer (#2331) · 54d2d04f
    moto authored
    Summary:
    This commit adds the `hw_accel` option to the `Streamer::add_video_stream` method.
    Specifying `hw_accel="cuda"` creates the chunk Tensor directly on the CUDA device
    when the following conditions are met:
    1. the video format is H264,
    2. the underlying FFmpeg is compiled with NVDEC (cuvid) support, and
    3. the client code specifies `decoder="h264_cuvid"`.
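
    The conditions above can be captured in a small helper that picks the arguments for `add_video_stream`. This is a minimal sketch, not part of this commit: `cuda_stream_kwargs` is a hypothetical name, and whether the FFmpeg build actually provides `h264_cuvid` still has to be verified separately.

    ```python
    from typing import Any, Dict, Optional


    def cuda_stream_kwargs(codec: str, device: Optional[str]) -> Dict[str, Any]:
        """Return keyword arguments for `add_video_stream` (hypothetical helper).

        Falls back to plain CPU decoding unless the codec is H264 and a CUDA
        device such as "cuda:0" is requested.
        """
        if codec.lower() == "h264" and device is not None and device.startswith("cuda"):
            # NVDEC decoder plus `hw_accel` keeps the decoded chunk on the GPU.
            return {"decoder": "h264_cuvid", "hw_accel": device}
        # Default software decoder; chunks are created on the CPU.
        return {"decoder": None, "hw_accel": None}
    ```

    Client code could then call `s.add_video_stream(5, **cuda_stream_kwargs("h264", "cuda:0"))`.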
    
    A simple benchmark shows a roughly 7x improvement in decoding speed over CPU decoding.
    
    <details>
    
    ```python
    import time
    
    from torchaudio.prototype.io import Streamer
    
    srcs = [
        "https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4",
        "./NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4",  # offline version
    ]
    
    patterns = [
        ("h264_cuvid", None, "cuda:0"),  # NVDEC on CUDA:0 -> CUDA:0
        ("h264_cuvid", None, "cuda:1"),  # NVDEC on CUDA:1 -> CUDA:1
        ("h264_cuvid", None, None),  # NVDEC -> CPU
        (None, None, None),  # CPU
    ]
    
    for src in srcs:
        print(src, flush=True)
        for (decoder, decoder_options, hw_accel) in patterns:
            s = Streamer(src)
            s.add_video_stream(5, decoder=decoder, decoder_options=decoder_options, hw_accel=hw_accel)
    
            t0 = time.monotonic()
            num_frames = 0
            # Iterate over the stream and count the decoded frames.
            for (chunk,) in s.stream():
                num_frames += chunk.shape[0]
            t1 = time.monotonic()
            print(chunk.dtype, chunk.shape, chunk.device)
            print(t1 - t0, num_frames, flush=True)
    ```
    </details>
    
    ```
    https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cuda:0
    10.781158386962488 6175
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cuda:1
    10.771313901990652 6175
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cpu
    27.88662809302332 6175
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cpu
    83.22728440898936 6175
    ./NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cuda:0
    12.945253834011964 6175
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cuda:1
    12.870224556012545 6175
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cpu
    28.03406483103754 6175
    torch.uint8 torch.Size([5, 3, 1080, 1920]) cpu
    82.6120332319988 6175
    ```
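
    To relate these timings to the speedup figure, the ratios against the CPU baseline can be computed directly. The sketch below uses the elapsed times reported above for the remote source; the `speedups` helper and the dictionary labels are illustrative names only.

    ```python
    # Elapsed seconds reported above for the remote (HTTPS) source.
    timings = {
        "NVDEC -> cuda:0": 10.781158386962488,
        "NVDEC -> cuda:1": 10.771313901990652,
        "NVDEC -> CPU": 27.88662809302332,
        "CPU": 83.22728440898936,
    }
    NUM_FRAMES = 6175  # frames decoded in every run


    def speedups(timings, baseline="CPU"):
        """Return each configuration's speedup relative to the baseline run."""
        base = timings[baseline]
        return {name: base / elapsed for name, elapsed in timings.items()}


    for name, ratio in speedups(timings).items():
        fps = NUM_FRAMES / timings[name]
        print(f"{name:16s} {ratio:4.1f}x  {fps:6.1f} frames/s")
    ```

    For the local file the same calculation gives a smaller ratio, since the CUDA runs took about 12.9 seconds there.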
    
    With HW resizing
    
    <details>
    
    ```python
    import time
    
    from torchaudio.prototype.io import Streamer
    
    srcs = [
        "./NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4",
        "https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4",
    ]
    
    patterns = [
        # Decode with NVDEC, CUDA HW scaling -> CUDA:0
        ("h264_cuvid", {"resize": "960x540"}, "", "cuda:0"),
        # Decode with NVDEC, CUDA HW scaling -> CPU
        ("h264_cuvid", {"resize": "960x540"}, "", None),
        # CPU decoding, CPU scaling
        (None, None, "scale=width=960:height=540", None),
    ]
    
    for src in srcs:
        print(src, flush=True)
        for (decoder, decoder_options, filter_desc, hw_accel) in patterns:
            s = Streamer(src)
            s.add_video_stream(
                5,
                decoder=decoder,
                decoder_options=decoder_options,
                filter_desc=filter_desc,
                hw_accel=hw_accel,
            )
    
            t0 = time.monotonic()
            num_frames = 0
            # Iterate over the stream and count the decoded frames.
            for (chunk,) in s.stream():
                num_frames += chunk.shape[0]
            t1 = time.monotonic()
            print(chunk.dtype, chunk.shape, chunk.device)
            print(t1 - t0, num_frames, flush=True)
    ```
    
    </details>
    
    ```
    ./NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4
    torch.uint8 torch.Size([5, 3, 540, 960]) cuda:0
    12.890056837990414 6175
    torch.uint8 torch.Size([5, 3, 540, 960]) cpu
    10.697489063022658 6175
    torch.uint8 torch.Size([5, 3, 540, 960]) cpu
    85.19899423001334 6175
    
    https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4
    torch.uint8 torch.Size([5, 3, 540, 960]) cuda:0
    10.712715593050234 6175
    torch.uint8 torch.Size([5, 3, 540, 960]) cpu
    11.030170071986504 6175
    torch.uint8 torch.Size([5, 3, 540, 960]) cpu
    84.8515750519582 6175
    ```
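
    Both benchmarks share the same timing loop, which could be factored into a helper. This is a minimal sketch under the assumption that the stream yields 1-tuples of chunks whose first dimension is the frame count, as in the scripts above; `time_stream` is a hypothetical name.

    ```python
    import time
    from typing import Iterable, Tuple


    def time_stream(chunks: Iterable[Tuple]) -> Tuple[float, int]:
        """Consume a stream of 1-tuples and return (elapsed seconds, frame count)."""
        t0 = time.monotonic()
        num_frames = 0
        for (chunk,) in chunks:
            # The first dimension of each chunk is the number of frames.
            num_frames += chunk.shape[0]
        return time.monotonic() - t0, num_frames
    ```

    With a configured `Streamer`, the call would look like `elapsed, num_frames = time_stream(s.stream())`.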
    
    Pull Request resolved: https://github.com/pytorch/audio/pull/2331
    
    Reviewed By: hwangjeff
    
    Differential Revision: D36217169
    
    Pulled By: mthrok
    
    fbshipit-source-id: 7979570b083cfc238ad4735b44305d8649f0607b