Unverified commit 7cef624e authored by Hongtao Zhang, committed by GitHub

Benchmarks: micro benchmarks - add nvbandwidth benchmark (#669)



**Description**

Add nvbandwidth benchmark.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
parent c8c52eb2
...@@ -384,6 +384,82 @@ with topology distance of 2, 4, 6, respectively.
| ib-traffic/ib\_write\_bw\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client} | bandwidth (GB/s) | The max bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction}('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config, ${server} and ${client} are the hostname of server and client. |
| ib-traffic/ib\_write\_lat\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client} | time (us) | The max latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction}('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config, ${server} and ${client} are the hostname of server and client. |
### `nvbandwidth`
#### Introduction
Measures bandwidth and latency for various memcpy patterns across different links, using copy engine (CE) or kernel (SM) copy methods,
as performed by [nvbandwidth](https://github.com/NVIDIA/nvbandwidth).
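Under the hood, SuperBench shells out to the `nvbandwidth` binary. A sketch of the equivalent direct invocation that the wrapper constructs (buffer size and test-case indices here are illustrative; flag spellings follow the wrapper's command construction):

```shell
# Illustrative command line as assembled by the SuperBench nvbandwidth wrapper.
# --testSamples maps from SuperBench's --num_loops parameter.
cmd="nvbandwidth --bufferSize 128 --testcase 0 1 19 20 --skipVerification --disableAffinity --useMean --testSamples 10"
echo "$cmd"
```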
#### Metrics
| Metrics | Unit | Description |
|---------------------------------------------------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host_to_device_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device CE memcpy using cuMemcpyAsync |
| host_to_device_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host CE memcpy using cuMemcpyAsync |
| device_to_host_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_device_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. |
| host_to_device_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. |
| device_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| all_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| host_to_all_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_ce_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_ce_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_device_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device SM memcpy using a copy kernel |
| host_to_device_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host SM memcpy using a copy kernel |
| device_to_host_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| all_to_host_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| host_to_all_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_sm_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_sm_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_sm_sum_bw | GB/s | Sum of the output matrix |
| host_device_latency_sm_cpu[0-9]_gpu[0-9]_lat | µs | Host-device SM copy latency using a pointer-chase kernel |
| host_device_latency_sm_sum_lat | µs | Sum of the output matrix |
| device_to_device_latency_sm_gpu[0-9]_gpu[0-9]_lat | µs | Measures latency of a pointer dereference operation between each pair of accessible peers. Memory is allocated on a GPU and is accessed by the peer GPU to determine latency. |
| device_to_device_latency_sm_sum_lat | µs | Sum of the output matrix |
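Each per-pair metric above corresponds to one cell of nvbandwidth's output matrix, named `{test}_cpu{row}_gpu{column}_{bw|lat}`, and each `*_sum_*` metric comes from the `SUM` line printed after the matrix. A simplified, self-contained sketch of that mapping (the regexes mirror those in the parser added by this change; variable names are illustrative):

```python
import re

# Sample lines copied from the example nvbandwidth output in this change.
lines = [
    'Running host_to_device_memcpy_ce.',
    'memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)',
    '          0          1          2',
    '0     369.36     269.33     412.11',
    'SUM host_to_device_memcpy_ce 1985.60',
]

results = {}
test_name, suffix, header = '', '', []
for raw in lines:
    line = raw.strip()
    start = re.match(r'^Running\s+(.+)\.$', line)
    if start:
        # A new test block begins; reset the matrix header.
        test_name, header = start.group(1).lower(), []
        continue
    if re.match(r'^(memcpy|memory latency)', line):
        # The banner tells us whether this block reports bandwidth or latency.
        suffix = 'bw' if 'bandwidth' in line else 'lat'
        continue
    summary = re.search(r'SUM (\S+) (\d+\.\d+)', line)
    if summary:
        results[f'{test_name}_sum_{suffix}'] = float(summary.group(2))
        continue
    if re.match(r'^\d', line):
        cells = line.split()
        if not header:
            header = cells    # first numeric line: GPU column indices
        else:
            row, values = cells[0], cells[1:]
            for col, v in zip(header, values):
                results[f'{test_name}_cpu{row}_gpu{col}_{suffix}'] = float(v)

print(results['host_to_device_memcpy_ce_cpu0_gpu1_bw'])  # 269.33
print(results['host_to_device_memcpy_ce_sum_bw'])        # 1985.6
```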
## Computation-communication Benchmarks
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Micro benchmark example for nvbandwidth benchmark.
Commands to run:
python3 examples/benchmarks/nvbandwidth.py
"""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger
if __name__ == '__main__':
    context = BenchmarkRegistry.create_benchmark_context(
        'nvbandwidth',
        platform=Platform.CUDA,
        parameters=(
            '--buffer_size 128 '
            '--test_cases 0,1,19,20 '
            '--skip_verification '
            '--disable_affinity '
            '--use_mean '
            '--num_loops 10'
        )
    )

    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
        logger.info(
            'benchmark: {}, return code: {}, result: {}'.format(
                benchmark.name, benchmark.return_code, benchmark.result
            )
        )
...@@ -37,6 +37,7 @@
from superbench.benchmarks.micro_benchmarks.directx_gpu_copy_performance import DirectXGPUCopyBw
from superbench.benchmarks.micro_benchmarks.directx_mem_bw_performance import DirectXGPUMemBw
from superbench.benchmarks.micro_benchmarks.directx_gemm_flops_performance import DirectXGPUCoreFlops
from superbench.benchmarks.micro_benchmarks.nvbandwidth import NvBandwidthBenchmark

__all__ = [
    'BlasLtBaseBenchmark',
...@@ -73,4 +74,5 @@
    'DirectXGPUCopyBw',
    'DirectXGPUMemBw',
    'DirectXGPUCoreFlops',
    'NvBandwidthBenchmark',
]
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

"""Module of the NV Bandwidth Test."""

import os
import re

from superbench.common.utils import logger
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke


class NvBandwidthBenchmark(MicroBenchmarkWithInvoke):
    """The NV Bandwidth Test benchmark class."""
    def __init__(self, name, parameters=''):
        """Constructor.

        Args:
            name (str): benchmark name.
            parameters (str): benchmark parameters.
        """
        super().__init__(name, parameters)

        self._bin_name = 'nvbandwidth'

    def add_parser_arguments(self):
        """Add the specified arguments."""
        super().add_parser_arguments()

        self._parser.add_argument(
            '--buffer_size',
            type=int,
            default=64,
            required=False,
            help='Memcpy buffer size in MiB. Default is 64.',
        )
        self._parser.add_argument(
            '--test_cases',
            type=str,
            default='',
            required=False,
            help=(
                'Specify the test case(s) to run, either by name or index. By default, all test cases are executed. '
                'Example: --test_cases 0,1,2,19,20'
            ),
        )
        self._parser.add_argument(
            '--skip_verification',
            action='store_true',
            help='Skip data verification after copy. Default is False.',
        )
        self._parser.add_argument(
            '--disable_affinity',
            action='store_true',
            help='Disable automatic CPU affinity control. Default is False.',
        )
        self._parser.add_argument(
            '--use_mean',
            action='store_true',
            help='Use mean instead of median for results. Default is False.',
        )
        self._parser.add_argument(
            '--num_loops',
            type=int,
            default=3,
            required=False,
            help='Iterations of the benchmark. Default is 3.',
        )

    def _preprocess(self):
        """Preprocess/preparation operations before the benchmarking.

        Return:
            True if _preprocess() succeeds.
        """
        if not super()._preprocess():
            return False

        if not self._set_binary_path():
            return False

        # Construct the command for nvbandwidth.
        command = os.path.join(self._args.bin_dir, self._bin_name)

        if self._args.buffer_size:
            command += f' --bufferSize {self._args.buffer_size}'

        if self._args.test_cases:
            command += ' --testcase ' + ' '.join([testcase.strip() for testcase in self._args.test_cases.split(',')])

        if self._args.skip_verification:
            command += ' --skipVerification'

        if self._args.disable_affinity:
            command += ' --disableAffinity'

        if self._args.use_mean:
            command += ' --useMean'

        if self._args.num_loops:
            command += f' --testSamples {self._args.num_loops}'

        self._commands.append(command)

        return True

    def _process_raw_line(self, line, parse_status):
        """Process a single line of raw output from the nvbandwidth benchmark.

        This function updates the `parse_status` dictionary with parsed results from the given `line`.
        It detects the start of a test, parses matrix headers and rows, and extracts summary results.

        Args:
            line (str): A single line of raw output from the benchmark.
            parse_status (dict): A dictionary that maintains the current parsing state and results. It should contain:
                - 'test_name' (str): The name of the test currently being parsed.
                - 'benchmark_type' (str): 'bw' or 'lat'; also indicates whether matrix data is being parsed.
                - 'matrix_header' (list): The header of the matrix being parsed.
                - 'results' (dict): A dictionary to store the parsed results.

        Return:
            None
        """
        # Regular expressions for summary line and matrix header detection.
        block_start_pattern = re.compile(r'^Running\s+(.+)$')
        summary_pattern = re.compile(r'SUM (\S+) (\d+\.\d+)')
        matrix_header_line = re.compile(r'^(memcpy|memory latency)')
        matrix_row_pattern = re.compile(r'^\s*\d')

        line = line.strip()

        # Detect the start of a test.
        if block_start_pattern.match(line):
            parse_status['test_name'] = block_start_pattern.match(line).group(1).lower()[:-1]
            return

        # Detect the start of matrix data.
        if parse_status['test_name'] and matrix_header_line.match(line):
            parse_status['benchmark_type'] = 'bw' if 'bandwidth' in line else 'lat'
            return

        # Parse the matrix header.
        if (
            parse_status['test_name'] and parse_status['benchmark_type'] and not parse_status['matrix_header']
            and matrix_row_pattern.match(line)
        ):
            parse_status['matrix_header'] = line.split()
            return

        # Parse matrix rows.
        if parse_status['test_name'] and parse_status['benchmark_type'] and matrix_row_pattern.match(line):
            row_data = line.split()
            row_index = row_data[0]
            for col_index, value in enumerate(row_data[1:], start=1):
                col_header = parse_status['matrix_header'][col_index - 1]
                test_name = parse_status['test_name']
                benchmark_type = parse_status['benchmark_type']
                metric_name = f'{test_name}_cpu{row_index}_gpu{col_header}_{benchmark_type}'
                parse_status['results'][metric_name] = float(value)
            return

        # Parse summary results.
        summary_match = summary_pattern.search(line)
        if summary_match:
            value = float(summary_match.group(2))
            test_name = parse_status['test_name']
            benchmark_type = parse_status['benchmark_type']
            parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = value

            # Reset parsing state for the next test.
            parse_status['test_name'] = ''
            parse_status['benchmark_type'] = None
            parse_status['matrix_header'].clear()

    def _process_raw_result(self, cmd_idx, raw_output):
        """Function to parse raw results and save the summarized results.

        self._result.add_raw_data() and self._result.add_result() need to be called to save the results.

        Args:
            cmd_idx (int): the index of command corresponding with the raw_output.
            raw_output (str): raw output string of the micro-benchmark.

        Return:
            True if the raw output string is valid and result can be extracted.
        """
        try:
            self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)

            content = raw_output.splitlines()
            parsing_status = {
                'results': {},
                'benchmark_type': None,
                'matrix_header': [],
                'test_name': '',
            }

            for line in content:
                self._process_raw_line(line, parsing_status)

            if not parsing_status['results']:
                self._result.add_raw_data('nvbandwidth', 'No valid results found', self._args.log_raw_data)
                return False

            # Store parsed results.
            for metric, value in parsing_status['results'].items():
                self._result.add_result(metric, value)

            return True
        except Exception as e:
            logger.error(
                'The result format is invalid - round: {}, benchmark: {}, raw output: {}, message: {}.'.format(
                    self._curr_run_index, self._name, raw_output, str(e)
                )
            )
            self._result.add_result('abort', 1)
            return False


BenchmarkRegistry.register_benchmark('nvbandwidth', NvBandwidthBenchmark, platform=Platform.CUDA)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

"""Tests for nvbandwidth benchmark."""

import unittest

from tests.helper import decorator
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks import BenchmarkRegistry, ReturnCode, Platform


class TestNvBandwidthBenchmark(BenchmarkTestCase, unittest.TestCase):
    """Test class for NV Bandwidth benchmark."""
    @classmethod
    def setUpClass(cls):
        """Hook method for setting up class fixture before running tests in the class."""
        super().setUpClass()
        cls.createMockEnvs(cls)
        cls.createMockFiles(cls, ['bin/nvbandwidth'])

    def test_nvbandwidth_preprocess(self):
        """Test NV Bandwidth benchmark preprocess."""
        benchmark_name = 'nvbandwidth'
        (benchmark_class,
         predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, Platform.CUDA)
        assert (benchmark_class)

        # Test preprocess with default parameters.
        benchmark = benchmark_class(benchmark_name, parameters='')
        assert benchmark._preprocess()
        assert benchmark.return_code == ReturnCode.SUCCESS

        # Test preprocess with specified parameters.
        parameters = (
            '--buffer_size 256 '
            '--test_cases 0,1,2,19,20 '
            '--skip_verification '
            '--disable_affinity '
            '--use_mean '
            '--num_loops 100'
        )
        benchmark = benchmark_class(benchmark_name, parameters=parameters)
        assert benchmark._preprocess()
        assert benchmark.return_code == ReturnCode.SUCCESS

        # Check command.
        assert (1 == len(benchmark._commands))
        assert ('--bufferSize 256' in benchmark._commands[0])
        assert ('--testcase 0 1 2 19 20' in benchmark._commands[0])
        assert ('--skipVerification' in benchmark._commands[0])
        assert ('--disableAffinity' in benchmark._commands[0])
        assert ('--useMean' in benchmark._commands[0])
        assert ('--testSamples 100' in benchmark._commands[0])

    @decorator.load_data('tests/data/nvbandwidth_results.log')
    def test_nvbandwidth_result_parsing_real_output(self, results):
        """Test NV Bandwidth benchmark result parsing."""
        benchmark_name = 'nvbandwidth'
        (benchmark_class,
         predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, Platform.CUDA)
        assert (benchmark_class)

        benchmark = benchmark_class(benchmark_name, parameters='')

        # Preprocess and validate command.
        assert benchmark._preprocess()

        # Parse the provided raw output.
        assert benchmark._process_raw_result(0, results)
        assert benchmark.return_code == ReturnCode.SUCCESS

        # Validate parsed results.
        assert benchmark.result['host_to_device_memcpy_ce_cpu0_gpu0_bw'][0] == 369.36
        assert benchmark.result['host_to_device_memcpy_ce_cpu0_gpu1_bw'][0] == 269.33
        assert benchmark.result['host_to_device_memcpy_ce_sum_bw'][0] == 1985.60
        assert benchmark.result['device_to_host_memcpy_ce_cpu0_gpu1_bw'][0] == 312.11
        assert benchmark.result['device_to_host_memcpy_ce_sum_bw'][0] == 607.26
        assert benchmark.result['host_device_latency_sm_cpu0_gpu0_lat'][0] == 772.58
        assert benchmark.result['host_device_latency_sm_sum_lat'][0] == 772.58
nvbandwidth Version: v0.6
Built from Git version: v0.6
CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.54.15
Device 0: NVIDIA GH200 480GB (00000009:01:00)
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2
0 369.36 269.33 412.11
1 323.36 299.33 312.11
SUM host_to_device_memcpy_ce 1985.60
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1
0 295.15 312.11
SUM device_to_host_memcpy_ce 607.26
Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 176.92
SUM host_to_device_bidirectional_memcpy_ce 176.92
Running device_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 187.26
SUM device_to_host_bidirectional_memcpy_ce 187.26
Waived:
Waived:
Waived:
Waived:
Running all_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 295.15
SUM all_to_host_memcpy_ce 295.15
Running all_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 187.00
SUM all_to_host_bidirectional_memcpy_ce 187.00
Running host_to_all_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 370.13
SUM host_to_all_memcpy_ce 370.13
Running host_to_all_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 176.86
SUM host_to_all_bidirectional_memcpy_ce 176.86
Waived:
Waived:
Waived:
Waived:
Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 372.33
SUM host_to_device_memcpy_sm 372.33
Running device_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 351.93
SUM device_to_host_memcpy_sm 351.93
Waived:
Waived:
Waived:
Waived:
Running all_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 352.98
SUM all_to_host_memcpy_sm 352.98
Running all_to_host_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 156.53
SUM all_to_host_bidirectional_memcpy_sm 156.53
Running host_to_all_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 360.93
SUM host_to_all_memcpy_sm 360.93
Running host_to_all_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 247.56
SUM host_to_all_bidirectional_memcpy_sm 247.56
Waived:
Waived:
Waived:
Waived:
Running host_device_latency_sm.
memory latency SM CPU(row) <-> GPU(column) (ns)
0
0 772.58
SUM host_device_latency_sm 772.58
Waived:
NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.