Unverified commit 7cef624e authored by Hongtao Zhang, committed by GitHub

Benchmarks: micro benchmarks - add nvbandwidth benchmark (#669)



**Description**

Add nvbandwidth benchmark.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
parent c8c52eb2
...@@ -384,6 +384,82 @@ with topology distance of 2, 4, 6, respectively.
| ib-traffic/ib\_write\_bw\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client} | bandwidth (GB/s) | The max bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction}('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config, ${server} and ${client} are the hostname of server and client. |
| ib-traffic/ib\_write\_lat\_${msg_size}\_${direction}\_${line}\_${pair}:${server}\_${client} | time (us) | The max latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction}('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu') run between the ${pair}<sup>th</sup> node pair in the ${line}<sup>th</sup> line of the config, ${server} and ${client} are the hostname of server and client. |
### `nvbandwidth`
#### Introduction
Measures bandwidth and latency for various memcpy patterns across different links, using copy engine (CE) or kernel (SM) copy methods,
as performed by [nvbandwidth](https://github.com/NVIDIA/nvbandwidth).
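Under the hood, SuperBench shells out to the `nvbandwidth` binary. A sketch of the equivalent direct invocation that the wrapper constructs (buffer size and test-case indices here are illustrative; flag spellings follow the wrapper's command construction):

```shell
# Illustrative command line as assembled by the SuperBench nvbandwidth wrapper.
# --testSamples maps from SuperBench's --num_loops parameter.
cmd="nvbandwidth --bufferSize 128 --testcase 0 1 19 20 --skipVerification --disableAffinity --useMean --testSamples 10"
echo "$cmd"
```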
#### Metrics
| Metrics | Unit | Description |
|---------------------------------------------------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host_to_device_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device CE memcpy using cuMemcpyAsync |
| host_to_device_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host CE memcpy using cuMemcpyAsync |
| device_to_host_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_device_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. |
| host_to_device_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. |
| device_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_ce_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| all_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| host_to_all_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_ce_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_ce_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_ce_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_ce_sum_bw | GB/s | Sum of the output matrix |
| host_to_device_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device SM memcpy using a copy kernel |
| host_to_device_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host SM memcpy using a copy kernel |
| device_to_host_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_sm_sum_bw | GB/s | Sum of the output matrix |
| device_to_device_bidirectional_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_host_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| all_to_host_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel from the host to a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| host_to_all_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| host_to_all_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_sm_sum_bw | GB/s | Sum of the output matrix |
| all_to_one_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_sm_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_sm_sum_bw | GB/s | Sum of the output matrix |
| one_to_all_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_sm_sum_bw | GB/s | Sum of the output matrix |
| host_device_latency_sm_cpu[0-9]_gpu[0-9]_lat | µs | Host-device SM copy latency using a pointer-chase kernel |
| host_device_latency_sm_sum_lat | µs | Sum of the output matrix |
| device_to_device_latency_sm_gpu[0-9]_gpu[0-9]_lat | µs | Measures latency of a pointer dereference operation between each pair of accessible peers. Memory is allocated on a GPU and is accessed by the peer GPU to determine latency. |
| device_to_device_latency_sm_sum_lat | µs | Sum of the output matrix |
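Each per-pair metric above corresponds to one cell of nvbandwidth's output matrix, named `{test}_cpu{row}_gpu{column}_{bw|lat}`, and each `*_sum_*` metric comes from the `SUM` line printed after the matrix. A simplified, self-contained sketch of that mapping (the regexes mirror those in the parser added by this change; variable names are illustrative):

```python
import re

# Sample lines copied from the example nvbandwidth output in this change.
lines = [
    'Running host_to_device_memcpy_ce.',
    'memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)',
    '          0          1          2',
    '0     369.36     269.33     412.11',
    'SUM host_to_device_memcpy_ce 1985.60',
]

results = {}
test_name, suffix, header = '', '', []
for raw in lines:
    line = raw.strip()
    start = re.match(r'^Running\s+(.+)\.$', line)
    if start:
        # A new test block begins; reset the matrix header.
        test_name, header = start.group(1).lower(), []
        continue
    if re.match(r'^(memcpy|memory latency)', line):
        # The banner tells us whether this block reports bandwidth or latency.
        suffix = 'bw' if 'bandwidth' in line else 'lat'
        continue
    summary = re.search(r'SUM (\S+) (\d+\.\d+)', line)
    if summary:
        results[f'{test_name}_sum_{suffix}'] = float(summary.group(2))
        continue
    if re.match(r'^\d', line):
        cells = line.split()
        if not header:
            header = cells    # first numeric line: GPU column indices
        else:
            row, values = cells[0], cells[1:]
            for col, v in zip(header, values):
                results[f'{test_name}_cpu{row}_gpu{col}_{suffix}'] = float(v)

print(results['host_to_device_memcpy_ce_cpu0_gpu1_bw'])  # 269.33
print(results['host_to_device_memcpy_ce_sum_bw'])        # 1985.6
```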
## Computation-communication Benchmarks
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Micro benchmark example for nvbandwidth benchmark.
Commands to run:
python3 examples/benchmarks/nvbandwidth.py
"""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger
if __name__ == '__main__':
    context = BenchmarkRegistry.create_benchmark_context(
        'nvbandwidth',
        platform=Platform.CUDA,
        parameters=(
            '--buffer_size 128 '
            '--test_cases 0,1,19,20 '
            '--skip_verification '
            '--disable_affinity '
            '--use_mean '
            '--num_loops 10'
        )
    )

    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
        logger.info(
            'benchmark: {}, return code: {}, result: {}'.format(
                benchmark.name, benchmark.return_code, benchmark.result
            )
        )
...@@ -37,6 +37,7 @@
from superbench.benchmarks.micro_benchmarks.directx_gpu_copy_performance import DirectXGPUCopyBw
from superbench.benchmarks.micro_benchmarks.directx_mem_bw_performance import DirectXGPUMemBw
from superbench.benchmarks.micro_benchmarks.directx_gemm_flops_performance import DirectXGPUCoreFlops
from superbench.benchmarks.micro_benchmarks.nvbandwidth import NvBandwidthBenchmark

__all__ = [
    'BlasLtBaseBenchmark',
...@@ -73,4 +74,5 @@
    'DirectXGPUCopyBw',
    'DirectXGPUMemBw',
    'DirectXGPUCoreFlops',
    'NvBandwidthBenchmark',
]
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

"""Module of the NV Bandwidth Test."""

import os
import re

from superbench.common.utils import logger
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke


class NvBandwidthBenchmark(MicroBenchmarkWithInvoke):
    """The NV Bandwidth Test benchmark class."""
    def __init__(self, name, parameters=''):
        """Constructor.

        Args:
            name (str): benchmark name.
            parameters (str): benchmark parameters.
        """
        super().__init__(name, parameters)

        self._bin_name = 'nvbandwidth'

    def add_parser_arguments(self):
        """Add the specified arguments."""
        super().add_parser_arguments()

        self._parser.add_argument(
            '--buffer_size',
            type=int,
            default=64,
            required=False,
            help='Memcpy buffer size in MiB. Default is 64.',
        )
        self._parser.add_argument(
            '--test_cases',
            type=str,
            default='',
            required=False,
            help=(
                'Specify the test case(s) to run, either by name or index. By default, all test cases are executed. '
                'Example: --test_cases 0,1,2,19,20'
            ),
        )
        self._parser.add_argument(
            '--skip_verification',
            action='store_true',
            help='Skip data verification after copy. Default is False.',
        )
        self._parser.add_argument(
            '--disable_affinity',
            action='store_true',
            help='Disable automatic CPU affinity control. Default is False.',
        )
        self._parser.add_argument(
            '--use_mean',
            action='store_true',
            help='Use mean instead of median for results. Default is False.',
        )
        self._parser.add_argument(
            '--num_loops',
            type=int,
            default=3,
            required=False,
            help='Iterations of the benchmark. Default is 3.',
        )

    def _preprocess(self):
        """Preprocess/preparation operations before the benchmarking.

        Return:
            True if _preprocess() succeeds.
        """
        if not super()._preprocess():
            return False

        if not self._set_binary_path():
            return False

        # Construct the command for nvbandwidth.
        command = os.path.join(self._args.bin_dir, self._bin_name)

        if self._args.buffer_size:
            command += f' --bufferSize {self._args.buffer_size}'

        if self._args.test_cases:
            command += ' --testcase ' + ' '.join([testcase.strip() for testcase in self._args.test_cases.split(',')])

        if self._args.skip_verification:
            command += ' --skipVerification'

        if self._args.disable_affinity:
            command += ' --disableAffinity'

        if self._args.use_mean:
            command += ' --useMean'

        if self._args.num_loops:
            command += f' --testSamples {self._args.num_loops}'

        self._commands.append(command)

        return True

    def _process_raw_line(self, line, parse_status):
        """Process a single line of raw output from the nvbandwidth benchmark.

        This function updates the `parse_status` dictionary with parsed results from the given `line`.
        It detects the start of a test, parses matrix headers and rows, and extracts summary results.

        Args:
            line (str): A single line of raw output from the benchmark.
            parse_status (dict): A dictionary that maintains the current parsing state and results. It should contain:
                - 'test_name' (str): The name of the test currently being parsed.
                - 'benchmark_type' (str): 'bw' or 'lat'; also indicates whether matrix data is being parsed.
                - 'matrix_header' (list): The header of the matrix being parsed.
                - 'results' (dict): A dictionary to store the parsed results.

        Return:
            None
        """
        # Regular expressions for summary line and matrix header detection.
        block_start_pattern = re.compile(r'^Running\s+(.+)$')
        summary_pattern = re.compile(r'SUM (\S+) (\d+\.\d+)')
        matrix_header_line = re.compile(r'^(memcpy|memory latency)')
        matrix_row_pattern = re.compile(r'^\s*\d')

        line = line.strip()

        # Detect the start of a test.
        if block_start_pattern.match(line):
            parse_status['test_name'] = block_start_pattern.match(line).group(1).lower()[:-1]
            return

        # Detect the start of matrix data.
        if parse_status['test_name'] and matrix_header_line.match(line):
            parse_status['benchmark_type'] = 'bw' if 'bandwidth' in line else 'lat'
            return

        # Parse the matrix header.
        if (
            parse_status['test_name'] and parse_status['benchmark_type'] and not parse_status['matrix_header']
            and matrix_row_pattern.match(line)
        ):
            parse_status['matrix_header'] = line.split()
            return

        # Parse matrix rows.
        if parse_status['test_name'] and parse_status['benchmark_type'] and matrix_row_pattern.match(line):
            row_data = line.split()
            row_index = row_data[0]
            for col_index, value in enumerate(row_data[1:], start=1):
                col_header = parse_status['matrix_header'][col_index - 1]
                test_name = parse_status['test_name']
                benchmark_type = parse_status['benchmark_type']
                metric_name = f'{test_name}_cpu{row_index}_gpu{col_header}_{benchmark_type}'
                parse_status['results'][metric_name] = float(value)
            return

        # Parse summary results.
        summary_match = summary_pattern.search(line)
        if summary_match:
            value = float(summary_match.group(2))
            test_name = parse_status['test_name']
            benchmark_type = parse_status['benchmark_type']
            parse_status['results'][f'{test_name}_sum_{benchmark_type}'] = value

            # Reset parsing state for the next test.
            parse_status['test_name'] = ''
            parse_status['benchmark_type'] = None
            parse_status['matrix_header'].clear()

    def _process_raw_result(self, cmd_idx, raw_output):
        """Function to parse raw results and save the summarized results.

        self._result.add_raw_data() and self._result.add_result() need to be called to save the results.

        Args:
            cmd_idx (int): the index of command corresponding with the raw_output.
            raw_output (str): raw output string of the micro-benchmark.

        Return:
            True if the raw output string is valid and result can be extracted.
        """
        try:
            self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)

            content = raw_output.splitlines()
            parsing_status = {
                'results': {},
                'benchmark_type': None,
                'matrix_header': [],
                'test_name': '',
            }

            for line in content:
                self._process_raw_line(line, parsing_status)

            if not parsing_status['results']:
                self._result.add_raw_data('nvbandwidth', 'No valid results found', self._args.log_raw_data)
                return False

            # Store parsed results.
            for metric, value in parsing_status['results'].items():
                self._result.add_result(metric, value)

            return True
        except Exception as e:
            logger.error(
                'The result format is invalid - round: {}, benchmark: {}, raw output: {}, message: {}.'.format(
                    self._curr_run_index, self._name, raw_output, str(e)
                )
            )
            self._result.add_result('abort', 1)
            return False


BenchmarkRegistry.register_benchmark('nvbandwidth', NvBandwidthBenchmark, platform=Platform.CUDA)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

"""Tests for nvbandwidth benchmark."""

import unittest

from tests.helper import decorator
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks import BenchmarkRegistry, ReturnCode, Platform


class TestNvBandwidthBenchmark(BenchmarkTestCase, unittest.TestCase):
    """Test class for NV Bandwidth benchmark."""
    @classmethod
    def setUpClass(cls):
        """Hook method for setting up class fixture before running tests in the class."""
        super().setUpClass()
        cls.createMockEnvs(cls)
        cls.createMockFiles(cls, ['bin/nvbandwidth'])

    def test_nvbandwidth_preprocess(self):
        """Test NV Bandwidth benchmark preprocess."""
        benchmark_name = 'nvbandwidth'
        (benchmark_class,
         predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, Platform.CUDA)
        assert (benchmark_class)

        # Test preprocess with default parameters.
        benchmark = benchmark_class(benchmark_name, parameters='')
        assert benchmark._preprocess()
        assert benchmark.return_code == ReturnCode.SUCCESS

        # Test preprocess with specified parameters.
        parameters = (
            '--buffer_size 256 '
            '--test_cases 0,1,2,19,20 '
            '--skip_verification '
            '--disable_affinity '
            '--use_mean '
            '--num_loops 100'
        )
        benchmark = benchmark_class(benchmark_name, parameters=parameters)
        assert benchmark._preprocess()
        assert benchmark.return_code == ReturnCode.SUCCESS

        # Check command.
        assert (1 == len(benchmark._commands))
        assert ('--bufferSize 256' in benchmark._commands[0])
        assert ('--testcase 0 1 2 19 20' in benchmark._commands[0])
        assert ('--skipVerification' in benchmark._commands[0])
        assert ('--disableAffinity' in benchmark._commands[0])
        assert ('--useMean' in benchmark._commands[0])
        assert ('--testSamples 100' in benchmark._commands[0])

    @decorator.load_data('tests/data/nvbandwidth_results.log')
    def test_nvbandwidth_result_parsing_real_output(self, results):
        """Test NV Bandwidth benchmark result parsing."""
        benchmark_name = 'nvbandwidth'
        (benchmark_class,
         predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, Platform.CUDA)
        assert (benchmark_class)

        benchmark = benchmark_class(benchmark_name, parameters='')

        # Preprocess and validate command.
        assert benchmark._preprocess()

        # Parse the provided raw output.
        assert benchmark._process_raw_result(0, results)
        assert benchmark.return_code == ReturnCode.SUCCESS

        # Validate parsed results.
        assert benchmark.result['host_to_device_memcpy_ce_cpu0_gpu0_bw'][0] == 369.36
        assert benchmark.result['host_to_device_memcpy_ce_cpu0_gpu1_bw'][0] == 269.33
        assert benchmark.result['host_to_device_memcpy_ce_sum_bw'][0] == 1985.60
        assert benchmark.result['device_to_host_memcpy_ce_cpu0_gpu1_bw'][0] == 312.11
        assert benchmark.result['device_to_host_memcpy_ce_sum_bw'][0] == 607.26
        assert benchmark.result['host_device_latency_sm_cpu0_gpu0_lat'][0] == 772.58
        assert benchmark.result['host_device_latency_sm_sum_lat'][0] == 772.58
nvbandwidth Version: v0.6
Built from Git version: v0.6
CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.54.15
Device 0: NVIDIA GH200 480GB (00000009:01:00)
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2
0 369.36 269.33 412.11
1 323.36 299.33 312.11
SUM host_to_device_memcpy_ce 1985.60
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1
0 295.15 312.11
SUM device_to_host_memcpy_ce 607.26
Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 176.92
SUM host_to_device_bidirectional_memcpy_ce 176.92
Running device_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 187.26
SUM device_to_host_bidirectional_memcpy_ce 187.26
Waived:
Waived:
Waived:
Waived:
Running all_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 295.15
SUM all_to_host_memcpy_ce 295.15
Running all_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 187.00
SUM all_to_host_bidirectional_memcpy_ce 187.00
Running host_to_all_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 370.13
SUM host_to_all_memcpy_ce 370.13
Running host_to_all_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 176.86
SUM host_to_all_bidirectional_memcpy_ce 176.86
Waived:
Waived:
Waived:
Waived:
Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 372.33
SUM host_to_device_memcpy_sm 372.33
Running device_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 351.93
SUM device_to_host_memcpy_sm 351.93
Waived:
Waived:
Waived:
Waived:
Running all_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 352.98
SUM all_to_host_memcpy_sm 352.98
Running all_to_host_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 156.53
SUM all_to_host_bidirectional_memcpy_sm 156.53
Running host_to_all_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 360.93
SUM host_to_all_memcpy_sm 360.93
Running host_to_all_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0
0 247.56
SUM host_to_all_bidirectional_memcpy_sm 247.56
Waived:
Waived:
Waived:
Waived:
Running host_device_latency_sm.
memory latency SM CPU(row) <-> GPU(column) (ns)
0
0 772.58
SUM host_device_latency_sm 772.58
Waived:
NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.