Commit 56e22664 authored by moto, committed by Facebook GitHub Bot

Update nvdec/nvenc tutorials (#3483)

Summary: Pull Request resolved: https://github.com/pytorch/audio/pull/3483

Differential Revision: D47725664

Pulled By: mthrok

fbshipit-source-id: e4249e1488fa7af8670be4a5077957912ff3420b
parent df655604
@@ -9,7 +9,7 @@ Using NVIDIA's GPU decoder and encoder, it is also possible to pass around CUDA
This improves the video throughput significantly. However, please note that not all the video formats are supported by hardware acceleration.
This page goes through how to build FFmpeg with hardware acceleration. For details on the performance of the GPU decoder and encoder, please see :ref:`NVDEC tutorial <nvdec_tutorial>` and :ref:`NVENC tutorial <nvenc_tutorial>`.

Overview
--------
@@ -25,16 +25,16 @@ To use NVENC/NVDEC with TorchAudio, the following items are required.
3. PyTorch / TorchAudio with CUDA support.

TorchAudio’s official binary distributions are compiled to work with FFmpeg libraries, and they contain the logic to use hardware decoding/encoding.
In the following, we build FFmpeg 4 libraries with NVDEC/NVENC support. You can also use FFmpeg 5 or 6.
The following procedure was tested on Ubuntu.

† For details on NVDEC/NVENC and FFmpeg, please refer to the following articles.

- https://docs.nvidia.com/video-technologies/video-codec-sdk/11.1/nvdec-video-decoder-api-prog-guide/
- https://docs.nvidia.com/video-technologies/video-codec-sdk/11.1/ffmpeg-with-nvidia-gpu/index.html#compiling-ffmpeg
- https://developer.nvidia.com/blog/nvidia-ffmpeg-transcoding-guide/

Check the GPU and CUDA version
@@ -156,7 +156,7 @@ which we use later for verifying the installation.
Build FFmpeg with NVDEC/NVENC support
-------------------------------------

Next we download the source code of FFmpeg 4. We use 4.4.2 here.

.. code-block:: bash
@@ -465,9 +465,9 @@ It is often the case where there are multiple FFmpeg installations in the system
   ['h264_nvenc', 'nvenc', 'nvenc_h264', 'nvenc_hevc', 'hevc_nvenc']

Using the hardware decoder and encoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the installation and the runtime linking work, you can test the GPU decoding with the following.
For details on the performance of the GPU decoder and encoder, please see :ref:`NVDEC tutorial <nvdec_tutorial>` and :ref:`NVENC tutorial <nvenc_tutorial>`.
@@ -44,8 +44,8 @@ model implementations and application components.
   tutorials/streamreader_advanced_tutorial
   tutorials/streamwriter_basic_tutorial
   tutorials/streamwriter_advanced
   tutorials/nvdec_tutorial
   tutorials/nvenc_tutorial
   tutorials/effector_tutorial
   tutorials/audio_resampling_tutorial
@@ -203,11 +203,11 @@ Tutorials
   :tags: I/O,StreamReader

.. customcarditem::
   :header: Hardware accelerated video encoding with NVENC
   :card_description: Learn how to use the HW video encoder.
   :image: https://download.pytorch.org/torchaudio/tutorial-assets/thumbnails/hw_acceleration_tutorial.png
   :link: tutorials/nvenc_tutorial.html
   :tags: I/O,StreamWriter

.. customcarditem::
   :header: Apply effects and codecs to waveform
@@ -30,12 +30,11 @@ print(torchaudio.__version__)

######################################################################
#

import os
import time

import matplotlib
import matplotlib.pyplot as plt
from torchaudio.io import StreamReader

matplotlib.rcParams["image.interpolation"] = "none"
@@ -56,7 +55,7 @@ from torchaudio.utils import ffmpeg_utils

print("FFmpeg Library versions:")
for k, ver in ffmpeg_utils.get_versions().items():
    print(f" {k}:\t{'.'.join(str(v) for v in ver)}")

######################################################################
#
@@ -81,14 +80,11 @@ print(torch.cuda.get_device_properties(0))
# - FPS: 29.97
# - Pixel format: YUV420P
#
# .. raw:: html
#
#    <video style="max-width: 100%" controls>
#      <source src="https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4" type="video/mp4">
#    </video>

######################################################################
#
@@ -102,11 +98,8 @@ src = torchaudio.utils.download_asset(
# --------------------------
#
# To use the HW video decoder, you need to specify the HW decoder when
# defining the output video stream, by passing the ``decoder`` option to
# the :py:meth:`~torchaudio.io.StreamReader.add_video_stream` method.
#

s = StreamReader(src)
@@ -116,14 +109,22 @@ s.fill_buffer()

######################################################################
#
# The video frames are decoded and returned as a tensor in NCHW format.

print(video.shape, video.dtype)

######################################################################
#
# By default, the decoded frames are sent back to CPU memory, and
# CPU tensors are created.

print(video.device)
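As a quick sanity check of the NCHW layout, here is a stdlib-only sketch. The shape values are made up for illustration; the real ones depend on ``frames_per_chunk`` and the video resolution.

```python
from math import prod

# Hypothetical chunk shape in NCHW order: 3 frames, 3 (RGB) channels,
# and a 540x960 video.
shape = (3, 3, 540, 960)
n, c, h, w = shape
print(f"{n} frames, {c} channels, {h}x{w} pixels each")
print(prod(shape))  # 4665600 elements in the chunk
```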
######################################################################
#
# By specifying the ``hw_accel`` option, you can convert the decoded frames
# to CUDA tensors.
# The ``hw_accel`` option takes a string value, which is passed
# to :py:class:`torch.device`.
#
# .. note::
@@ -144,12 +145,11 @@ print(video.shape, video.dtype, video.device)

######################################################################
# .. note::
#
#    When there are multiple GPUs available, ``StreamReader`` by
#    default uses the first GPU. You can change this by providing
#    the ``"gpu"`` option.
#
# .. code::
#
@@ -162,6 +162,15 @@ print(video.shape, video.dtype, video.device)
#        hw_accel="cuda:0",
#    )
#
# .. note::
#
#    The ``"gpu"`` option and the ``hw_accel`` option can be specified
#    independently. If they do not match, decoded frames are
#    transferred to the device specified by ``hw_accel``
#    automatically.
#
# .. code::
#
#    # Video data is sent to CUDA device 0, and decoded there.
#    # Then it is transferred to CUDA device 1, and converted to
#    # a CUDA tensor.
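The device bookkeeping described above can be sketched in plain Python. ``plan_transfer`` is a hypothetical helper written for this illustration, not a TorchAudio API; it only shows when a cross-device transfer would be needed.

```python
def plan_transfer(gpu_option, hw_accel):
    """Return (decode_device, transfer_target) for the given options.

    ``gpu_option`` stands in for the ``"gpu"`` decoder option (a device
    index as a string); ``hw_accel`` stands in for the ``hw_accel``
    argument. When the two name different devices, the decoded frames
    must be transferred.
    """
    decode_device = f"cuda:{gpu_option}"
    transfer_target = None if hw_accel == decode_device else hw_accel
    return decode_device, transfer_target


print(plan_transfer("0", "cuda:0"))  # ('cuda:0', None) - no transfer needed
print(plan_transfer("0", "cuda:1"))  # ('cuda:0', 'cuda:1') - frames are moved
```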
@@ -246,7 +255,6 @@ def plot():
    axes[0][1].set_title("HW decoder")
    plt.setp(axes, xticks=[], yticks=[])
    plt.tight_layout()


plot()
@@ -285,7 +293,7 @@ def test_options(option):
    s.add_video_stream(1, decoder="h264_cuvid", hw_accel="cuda:0", decoder_option=option)
    s.fill_buffer()
    (video,) = s.pop_chunks()
    print(f"Option: {option}:\t{video.shape}")
    return video[0]
@@ -323,8 +331,8 @@ plot()
# Comparing resizing methods
# --------------------------
#
# Unlike software scaling, NVDEC does not provide an option to choose
# the scaling algorithm.
# In ML applications, it is often necessary to construct a
# preprocessing pipeline with similar numerical properties.
# So here we compare the result of hardware resizing with software
@@ -335,15 +343,13 @@ plot()
#
# .. code::
#
#    ffmpeg -y -f lavfi -t 12.05 -i mptestsrc -movflags +faststart mptestsrc.mp4
#
# .. raw:: html
#
#    <video style="max-width: 100%" controls>
#      <source src="https://download.pytorch.org/torchaudio/tutorial-assets/mptestsrc.mp4" type="video/mp4">
#    </video>
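To see why matching the scaling algorithm matters numerically, here is a toy, pure-Python 1-D comparison (illustrative only; real scalers such as bicubic are more elaborate). Nearest-neighbour sampling and box averaging produce very different values on a high-frequency pattern, which is why a preprocessing pipeline should match the decoder's resizing behavior.

```python
def downscale_nearest(row, factor):
    # Pick every ``factor``-th sample; cheap, but drops information.
    return row[::factor]


def downscale_average(row, factor):
    # Average each window of ``factor`` samples (box filter).
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]


row = [0, 100, 0, 100, 0, 100, 0, 100]  # alternating dark/bright pixels
print(downscale_nearest(row, 2))  # [0, 0, 0, 0]
print(downscale_average(row, 2))  # [50.0, 50.0, 50.0, 50.0]
```

The two results share the same shape but entirely different values, so a model trained on one kind of resized input may behave differently on the other.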
######################################################################

@@ -422,7 +428,6 @@ def plot():
    plt.setp(axes, xticks=[], yticks=[])
    plt.tight_layout()


plot()
@@ -441,74 +446,160 @@ plot()
# decoding and HW video decoding.
#

######################################################################
# Decode as CUDA frames
# ---------------------
#
# First, we compare the time it takes for the software decoder and
# the hardware decoder to decode the same video.
# To make the result comparable, when using the software decoder, we
# move the resulting tensor to CUDA.
#
# The test procedures look like the following:
#
# - Use the hardware decoder and place data on CUDA directly.
# - Use the software decoder, generate CPU tensors and move them to CUDA.
#
# .. note::
#
#    Because the HW decoder currently only supports reading videos as
#    YUV444P format, we decode frames into YUV444P format for the
#    software decoder as well.
#
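The throughput bookkeeping shared by these test cases can be sketched in isolation. ``FakeChunk`` and ``measure_fps`` are hypothetical stand-ins used here so the sketch runs without FFmpeg or a GPU; the real test functions apply the same count-frames-then-divide-by-wall-clock logic to StreamReader chunks.

```python
import time


class FakeChunk:
    """Stand-in for a decoded chunk: exposes ``shape`` like an NCHW tensor."""

    def __init__(self, num_frames):
        self.shape = (num_frames, 3, 240, 320)


def measure_fps(chunks):
    # Count frames across chunks and divide by wall-clock time; this is
    # the same bookkeeping the benchmark functions below perform.
    num_frames = 0
    t0 = time.monotonic()
    for chunk in chunks:
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed if elapsed > 0 else float("inf")
    return num_frames, fps


num_frames, fps = measure_fps(FakeChunk(5) for _ in range(30))
print(num_frames)  # 150
```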
######################################################################
# The following function implements the hardware decoder test case.


def test_decode_cuda(src, decoder, hw_accel="cuda", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, hw_accel=hw_accel)
    num_frames = 0
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    print(f" - Shape: {chunk.shape}")
    fps = num_frames / elapsed
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps
######################################################################
# The following function implements the software decoder test case.


def test_decode_cpu(src, threads, decoder=None, frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option={"threads": f"{threads}"})
    num_frames = 0
    device = torch.device("cuda")
    t0 = time.monotonic()
    for i, (chunk,) in enumerate(s.stream()):
        if i == 0:
            print(f" - Shape: {chunk.shape}")
        num_frames += chunk.shape[0]
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps
######################################################################
# For each video resolution, we run multiple software decoder test
# cases with different numbers of threads.


def run_decode_tests(src, frames_per_chunk=5):
    fps = []
    print(f"Testing: {os.path.basename(src)}")
    for threads in [1, 4, 8, 16]:
        print(f"* Software decoding (num_threads={threads})")
        fps.append(test_decode_cpu(src, threads))
    print("* Hardware decoding")
    fps.append(test_decode_cuda(src, decoder="h264_cuvid"))
    return fps
######################################################################
# Now we run the tests with videos of different resolutions.
#
# QVGA
# ----

src_qvga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_qvga.h264.mp4")
fps_qvga = run_decode_tests(src_qvga)
######################################################################
# VGA
# ---

src_vga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_vga.h264.mp4")
fps_vga = run_decode_tests(src_vga)

######################################################################
# XGA
# ---

src_xga = torchaudio.utils.download_asset("tutorial-assets/testsrc2_xga.h264.mp4")
fps_xga = run_decode_tests(src_xga)
######################################################################
# Result
# ------
#
# Now we plot the result.


def plot():
    fig, ax = plt.subplots(figsize=[9.6, 6.4])
    for items in zip(fps_qvga, fps_vga, fps_xga, "ov^sx"):
        ax.plot(items[:-1], marker=items[-1])
    ax.grid(axis="both")
    ax.set_xticks([0, 1, 2], ["QVGA (320x240)", "VGA (640x480)", "XGA (1024x768)"])
    ax.legend(
        [
            "Software Decoding (threads=1)",
            "Software Decoding (threads=4)",
            "Software Decoding (threads=8)",
            "Software Decoding (threads=16)",
            "Hardware Decoding (CUDA Tensor)",
        ]
    )
    ax.set_title("Speed of processing video frames")
    ax.set_ylabel("Frames per second")
    plt.tight_layout()


plot()
######################################################################
#
# We observe a couple of things:
#
# - Increasing the number of threads in software decoding makes the
#   pipeline faster, but the performance saturates around 8 threads.
# - The performance gain from using the hardware decoder depends on the
#   resolution of the video.
#
#   - At lower resolutions like QVGA, hardware decoding is slower than
#     software decoding.
#   - At higher resolutions like XGA, hardware decoding is faster
#     than software decoding.
#
# It is worth noting that the performance gain also depends on the
# type of GPU.
# We observed that when decoding VGA videos using V100 or A100 GPUs,
# hardware decoders are slower than software decoders. But with an A10
# GPU, the hardware decoder is faster than the software decoder.
#
######################################################################
# Decode and resize
@@ -527,9 +618,7 @@ xga_cuda = test_decode_cuda(src, decoder="h264_cuvid", hw_accel="cuda")
# 3. Decode and resize video simultaneously with the HW decoder, read the
#    resulting frames as CUDA tensors.
#
# Pipeline 1 represents common video loading implementations.
#
# Pipeline 2 uses FFmpeg's filter graph, which allows manipulating
# raw frames before converting them to Tensors.
@@ -547,22 +636,22 @@ xga_cuda = test_decode_cuda(src, decoder="h264_cuvid", hw_accel="cuda")
#
def test_decode_then_resize(src, height, width, mode="bicubic", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder_option={"threads": "8"})
    num_frames = 0
    device = torch.device("cuda")
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
        chunk = torch.nn.functional.interpolate(chunk, [height, width], mode=mode, antialias=True)
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps
@@ -575,21 +664,23 @@ def test_decode_then_resize(src, device="cuda", height=224, width=224, mode="bic
#


def test_decode_and_resize(src, height, width, mode="bicubic", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(
        frames_per_chunk, filter_desc=f"scale={width}:{height}:sws_flags={mode}", decoder_option={"threads": "8"}
    )
    num_frames = 0
    device = torch.device("cuda")
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
        chunk = chunk.to(device)
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps
@@ -598,150 +689,106 @@ def test_decode_and_resize(src, device="cuda", height=224, width=224, mode="bicu
# performed by NVDEC and the resulting tensor is placed on CUDA memory.


def test_hw_decode_and_resize(src, decoder, decoder_option, hw_accel="cuda", frames_per_chunk=5):
    s = StreamReader(src)
    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option=decoder_option, hw_accel=hw_accel)
    num_frames = 0
    chunk = None
    t0 = time.monotonic()
    for (chunk,) in s.stream():
        num_frames += chunk.shape[0]
    elapsed = time.monotonic() - t0
    fps = num_frames / elapsed
    print(f" - Shape: {chunk.shape}")
    print(f" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    return fps
######################################################################
#
# The following function runs the benchmark functions on the given sources.


def run_resize_tests(src):
    print(f"Testing: {os.path.basename(src)}")
    height, width = 224, 224
    print("* Software decoding with PyTorch interpolate")
    cpu_resize1 = test_decode_then_resize(src, height=height, width=width)
    print("* Software decoding with FFmpeg scale")
    cpu_resize2 = test_decode_and_resize(src, height=height, width=width)
    print("* Hardware decoding with resize")
    cuda_resize = test_hw_decode_and_resize(src, decoder="h264_cuvid", decoder_option={"resize": f"{width}x{height}"})
    return [cpu_resize1, cpu_resize2, cuda_resize]
######################################################################
#
# Now we run the tests.
######################################################################
# QVGA
# ----

fps_qvga = run_resize_tests(src_qvga)
######################################################################
# VGA
# ---

fps_vga = run_resize_tests(src_vga)

######################################################################
# XGA
# ---

fps_xga = run_resize_tests(src_xga)
######################################################################
# Result
# ------
#
# Now we plot the result.


def plot():
    fig, ax = plt.subplots(figsize=[9.6, 6.4])
    for items in zip(fps_qvga, fps_vga, fps_xga, "ov^sx"):
        ax.plot(items[:-1], marker=items[-1])
    ax.grid(axis="both")
    ax.set_xticks([0, 1, 2], ["QVGA (320x240)", "VGA (640x480)", "XGA (1024x768)"])
    ax.legend(
        [
            "Software decoding\nwith resize\n(PyTorch interpolate)",
            "Software decoding\nwith resize\n(FFmpeg scale)",
            "NVDEC\nwith resizing",
        ]
    )
    ax.set_title("Speed of processing video frames")
    ax.set_xlabel("Input video resolution")
    ax.set_ylabel("Frames per second")
    plt.tight_layout()


######################################################################
#

plot()
######################################################################
#
# The hardware decoder shows a similar trend to the previous experiment.
# In fact, the performance is almost the same. Hardware resizing has
# almost zero overhead for scaling down the frames.
#
# Software decoding also shows a similar trend. Performing resizing as
# part of decoding is faster. One possible explanation is that video
# frames are internally stored as YUV420P, which has half the number
# of pixels compared to RGB24 or YUV444P. This means that if resizing is
# done before copying the frame data to a PyTorch tensor, the number of
# pixels manipulated and copied is smaller than when resizing is applied
# after the frames are converted to Tensor.
#
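The pixel-count argument above can be checked with quick arithmetic. ``samples_per_frame`` is an illustrative helper written for this sketch, not a library function; it counts stored samples per frame for a few pixel formats.

```python
def samples_per_frame(width, height, pix_fmt):
    """Number of stored samples per frame for a few pixel formats."""
    if pix_fmt == "yuv420p":
        # Full-resolution luma plane, plus two chroma planes
        # subsampled 2x in both dimensions.
        return width * height + 2 * (width // 2) * (height // 2)
    if pix_fmt in ("yuv444p", "rgb24"):
        return 3 * width * height
    raise ValueError(f"unexpected format: {pix_fmt}")


w, h = 1024, 768  # XGA
ratio = samples_per_frame(w, h, "yuv420p") / samples_per_frame(w, h, "rgb24")
print(ratio)  # 0.5
```

So a YUV420P frame carries exactly half the samples of the equivalent RGB24 or YUV444P frame, which is consistent with resize-before-convert touching less data.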
plot([vga_cuda, vga_cpu, vga_cuda_resize, vga_cpu_resize2, vga_cpu_resize1], "vga (640x480)")
###################################################################### ######################################################################
# #
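######################################################################
# The pixel-count explanation above can be checked with simple
# arithmetic. This is a standalone sketch (the helper name is ours,
# not part of TorchAudio) counting the 8-bit sample values stored per
# frame for each pixel format.

```python
def values_per_frame(width, height, pix_fmt):
    # Number of 8-bit sample values stored for one frame.
    if pix_fmt in ("rgb24", "yuv444p"):
        # Packed RGB, or three full-resolution planes.
        return width * height * 3
    if pix_fmt == "yuv420p":
        # Full-resolution Y plane; U and V subsampled by 2 in both dimensions.
        return width * height + 2 * (width // 2) * (height // 2)
    raise ValueError(f"unsupported pixel format: {pix_fmt}")


# YUV420P stores half as many values as RGB24 or YUV444P.
print(values_per_frame(640, 480, "yuv420p") / values_per_frame(640, 480, "rgb24"))  # 0.5
```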
# Tag: :obj:`torchaudio.io`
"""
Accelerated video encoding with NVENC
=====================================
.. _nvenc_tutorial:
**Author**: `Moto Hira <moto@meta.com>`__
This tutorial shows how to use NVIDIA’s hardware video encoder (NVENC)
with TorchAudio, and how it improves the performance of video encoding.
"""
######################################################################
# .. note::
#
#    This tutorial requires FFmpeg libraries compiled with HW
#    acceleration enabled.
#
#    Please refer to
#    :ref:`Enabling GPU video decoder/encoder <enabling_hw_decoder>`
#    for how to build FFmpeg with HW acceleration.
#
# .. note::
#
#    Most modern GPUs have both a HW decoder and an encoder, but some
#    high-end GPUs, such as the A100 and H100, do not have a HW encoder.
#    Please refer to the NVIDIA Video Encode and Decode GPU Support
#    Matrix for availability and format coverage:
#    https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new
#
#    Attempting to use the HW encoder on these GPUs fails with an error
#    message like ``Generic error in an external library``.
#    You can enable debug logging with
#    :py:func:`torchaudio.utils.ffmpeg_utils.set_log_level` to see more
#    detailed error messages issued along the way.
#
import torch
import torchaudio
print(torch.__version__)
print(torchaudio.__version__)
import io
import time
import matplotlib.pyplot as plt
from IPython.display import Video
from torchaudio.io import StreamReader, StreamWriter
######################################################################
#
# Check the prerequisites
# -----------------------
#
# First, we check that TorchAudio correctly detects FFmpeg libraries
# that support HW decoder/encoder.
#
from torchaudio.utils import ffmpeg_utils
######################################################################
#
print("FFmpeg Library versions:")
for k, ver in ffmpeg_utils.get_versions().items():
    print(f" {k}:\t{'.'.join(str(v) for v in ver)}")
######################################################################
#
print("Available NVENC Encoders:")
for k in ffmpeg_utils.get_video_encoders().keys():
    if "nvenc" in k:
        print(f" - {k}")
######################################################################
#
print("Avaialbe GPU:")
print(torch.cuda.get_device_properties(0))
######################################################################
# We use the following helper function to generate test frame data.
# For details on synthetic video generation, please refer to
# :ref:`StreamReader Advanced Usage <lavfi>`.
def get_data(height, width, format="yuv444p", frame_rate=30000 / 1001, duration=4):
    src = f"testsrc2=rate={frame_rate}:size={width}x{height}:duration={duration}"
    s = StreamReader(src=src, format="lavfi")
    s.add_basic_video_stream(-1, format=format)
    s.process_all_packets()
    (video,) = s.pop_chunks()
    return video
######################################################################
# Encoding videos with NVENC
# --------------------------
#
# To use the HW video encoder, specify it when defining the output
# video stream, by providing the ``encoder`` option to
# :py:meth:`~torchaudio.io.StreamWriter.add_video_stream`.
#
######################################################################
#
pict_config = {
    "height": 360,
    "width": 640,
    "frame_rate": 30000 / 1001,
    "format": "yuv444p",
}
frame_data = get_data(**pict_config)
######################################################################
#
w = StreamWriter(io.BytesIO(), format="mp4")
w.add_video_stream(**pict_config, encoder="h264_nvenc", encoder_format="yuv444p")
with w.open():
    w.write_video_chunk(0, frame_data)
######################################################################
# Similar to the HW decoder, by default, the encoder expects the frame
# data to be in CPU memory. To send data from CUDA memory, you need to
# specify the ``hw_accel`` option.
#
buffer = io.BytesIO()
w = StreamWriter(buffer, format="mp4")
w.add_video_stream(**pict_config, encoder="h264_nvenc", encoder_format="yuv444p", hw_accel="cuda:0")
with w.open():
    w.write_video_chunk(0, frame_data.to(torch.device("cuda:0")))
buffer.seek(0)
video_cuda = buffer.read()
######################################################################
#
Video(video_cuda, embed=True, mimetype="video/mp4")
######################################################################
# Benchmark NVENC with StreamWriter
# ---------------------------------
#
# Now we compare the performance of software encoder and hardware
# encoder.
#
# Similar to the benchmark in NVDEC, we process the videos of different
# resolution, and measure the time it takes to encode them.
#
# We also measure the size of the resulting video file.
######################################################################
# The following function encodes the given frames and measures the
# encoding time and the size of the resulting video data.
#
def test_encode(data, encoder, width, height, hw_accel=None, **config):
    assert data.is_cuda

    buffer = io.BytesIO()
    s = StreamWriter(buffer, format="mp4")
    s.add_video_stream(encoder=encoder, width=width, height=height, hw_accel=hw_accel, **config)
    with s.open():
        t0 = time.monotonic()
        if hw_accel is None:
            data = data.to("cpu")
        s.write_video_chunk(0, data)
        elapsed = time.monotonic() - t0
    size = buffer.tell()
    fps = len(data) / elapsed
    print(f" - Processed {len(data)} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)")
    print(f" - Encoded data size: {size} bytes")
    return elapsed, size
######################################################################
# We conduct the tests for the following configurations:
#
# - Software encoder with 1, 4, and 8 threads
# - Hardware encoder with and without the ``hw_accel`` option.
#
def run_tests(height, width, duration=4):
    # Generate the test data
    print(f"Testing resolution: {width}x{height}")
    pict_config = {
        "height": height,
        "width": width,
        "frame_rate": 30000 / 1001,
        "format": "yuv444p",
    }
    data = get_data(**pict_config, duration=duration)
    data = data.to(torch.device("cuda:0"))

    times = []
    sizes = []

    # Test software encoding
    encoder_config = {
        "encoder": "libx264",
        "encoder_format": "yuv444p",
    }
    for i, num_threads in enumerate([1, 4, 8]):
        print(f"* Software Encoder (num_threads={num_threads})")
        time_, size = test_encode(
            data,
            encoder_option={"threads": str(num_threads)},
            **pict_config,
            **encoder_config,
        )
        times.append(time_)
        if i == 0:
            sizes.append(size)

    # Test hardware encoding
    encoder_config = {
        "encoder": "h264_nvenc",
        "encoder_format": "yuv444p",
        "encoder_option": {"gpu": "0"},
    }
    for i, hw_accel in enumerate([None, "cuda"]):
        print(f"* Hardware Encoder {'(CUDA frames)' if hw_accel else ''}")
        time_, size = test_encode(
            data,
            **pict_config,
            **encoder_config,
            hw_accel=hw_accel,
        )
        times.append(time_)
        if i == 0:
            sizes.append(size)
    return times, sizes
######################################################################
# We change the resolution of the videos to see how these
# measurements change.
#
# 360P
# ----
#
time_360, size_360 = run_tests(360, 640)
######################################################################
# 720P
# ----
#
time_720, size_720 = run_tests(720, 1280)
######################################################################
# 1080P
# -----
#
time_1080, size_1080 = run_tests(1080, 1920)
######################################################################
# Now we plot the result.
#
def plot():
    fig, axes = plt.subplots(2, 1, sharex=True, figsize=[9.6, 7.2])

    for items in zip(time_360, time_720, time_1080, "ov^X+"):
        axes[0].plot(items[:-1], marker=items[-1])
    axes[0].grid(axis="both")
    axes[0].set_xticks([0, 1, 2], ["360p", "720p", "1080p"], visible=True)
    axes[0].tick_params(labeltop=False)
    axes[0].legend(
        [
            "Software Encoding (threads=1)",
            "Software Encoding (threads=4)",
            "Software Encoding (threads=8)",
            "Hardware Encoding (CPU Tensor)",
            "Hardware Encoding (CUDA Tensor)",
        ]
    )
    axes[0].set_title("Time to encode videos with different resolutions")
    axes[0].set_ylabel("Time [s]")

    for items in zip(size_360, size_720, size_1080, "v^"):
        axes[1].plot(items[:-1], marker=items[-1])
    axes[1].grid(axis="both")
    axes[1].set_xticks([0, 1, 2], ["360p", "720p", "1080p"])
    axes[1].set_ylabel("The encoded size [bytes]")
    axes[1].set_title("The size of encoded videos")
    axes[1].legend(
        [
            "Software Encoding",
            "Hardware Encoding",
        ]
    )

    plt.tight_layout()
plot()
######################################################################
# Result
# ------
#
# We observe a couple of things:
#
# - The time to encode video grows as the resolution becomes larger.
# - In the case of software encoding, increasing the number of threads
#   helps reduce the encoding time.
# - The gain from extra threads diminishes around 8.
# - Hardware encoding is faster than software encoding in general.
# - Using ``hw_accel`` does not improve the speed of encoding itself
#   by much.
# - The size of the resulting videos grows as the resolution becomes
#   larger.
# - The hardware encoder produces smaller video files at larger
#   resolutions.
#
# The last point is somewhat strange to the author (who is not an
# expert in video production).
# It is often said that hardware encoders produce larger videos
# compared to software encoders.
# Some say that software encoders allow fine-grained control over
# the encoding configuration, so the resulting video is more optimal.
# Meanwhile, hardware encoders are optimized for performance, and thus
# do not provide as much control over quality and binary size.
#
######################################################################
# Quality Spotcheck
# -----------------
#
# So, how is the quality of the videos produced with the hardware
# encoder? A quick spot check of the high-resolution videos reveals
# that they have more noticeable artifacts at higher resolutions,
# which might explain the smaller binary size (that is, the encoder
# is not allocating enough bits to produce quality output).
#
# The following images are raw frames of videos encoded with hardware
# encoders.
#
######################################################################
# 360P
# ----
#
# .. raw:: html
#
# <img style="max-width: 100%" src="https://download.pytorch.org/torchaudio/tutorial-assets/nvenc_testsrc2_360_097.png" alt="NVENC sample 360P">
######################################################################
# 720P
# ----
#
# .. raw:: html
#
# <img style="max-width: 100%" src="https://download.pytorch.org/torchaudio/tutorial-assets/nvenc_testsrc2_720_097.png" alt="NVENC sample 720P">
######################################################################
# 1080P
# -----
#
# .. raw:: html
#
# <img style="max-width: 100%" src="https://download.pytorch.org/torchaudio/tutorial-assets/nvenc_testsrc2_1080_097.png" alt="NVENC sample 1080P">
######################################################################
#
# We can see that there are more artifacts at higher resolutions, and
# they are noticeable.
#
# Perhaps one might be able to reduce these artifacts using the
# ``encoder_option`` argument.
# We did not try, but if you try that and find a better quality
# setting, feel free to let us know. ;)
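######################################################################
# As a starting point for such experiments, rate-control options can
# be passed through the ``encoder_option`` argument of
# :py:meth:`~torchaudio.io.StreamWriter.add_video_stream`. The
# following is an untested sketch: the keys are FFmpeg ``h264_nvenc``
# AVOptions, and their availability depends on the FFmpeg build and
# the driver.

```python
# Hypothetical quality-oriented options for h264_nvenc (not benchmarked here).
# Usage sketch:
#   add_video_stream(..., encoder="h264_nvenc", encoder_option=nvenc_quality_option)
nvenc_quality_option = {
    "preset": "slow",  # trade encoding speed for quality
    "rc": "vbr",       # variable-bit-rate rate control
    "cq": "19",        # constant-quality target; lower means higher quality
}
print(nvenc_quality_option)
```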
######################################################################
#
# Tag: :obj:`torchaudio.io`