Unverified Commit 4eddd50a authored by WenqingLan1's avatar WenqingLan1 Committed by GitHub

Benchmarks - Add GPU Stream Micro Benchmark (#697)

Added GPU Stream benchmark, which measures GPU memory bandwidth and
efficiency for the double data type through various memory operations,
including copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction,
metrics, and descriptions.
- added unit tests for `gpu-stream`. Example output is in
`superbenchmark/tests/data/gpu_stream.log`.
parent 991c0051
...@@ -262,6 +262,25 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU
| gpu\_all\_to\_gpu[0-9]+\_write\_by\_sm\_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
| gpu\_all\_to\_gpu\_all\_write\_by\_sm\_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |
### `gpu-stream`
#### Introduction
Measure the memory bandwidth of a GPU using the STREAM benchmark. The benchmark tests various memory operations, including copy, scale, add, and triad, for the double data type.
#### Metrics
| Metric Name | Unit | Description |
|------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
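The `_ratio` metrics above are the measured bandwidth divided by the device's theoretical peak memory bandwidth (memory clock × bus width × 2 transfers per clock for DDR memory). A minimal Python sketch of that calculation; the device numbers below are illustrative, not from any particular GPU:

```python
def theoretical_peak_gbps(mem_clock_mhz, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s (DDR: 2 transfers per clock)."""
    return mem_clock_mhz * (bus_width_bits / 8) * 2 / 1000.0

def efficiency_pct(measured_gbps, mem_clock_mhz, bus_width_bits):
    """Bandwidth efficiency as reported by the `_ratio` metrics."""
    return measured_gbps / theoretical_peak_gbps(mem_clock_mhz, bus_width_bits) * 100

# Illustrative HBM2-like device: 1215 MHz memory clock, 5120-bit bus
peak = theoretical_peak_gbps(1215, 5120)  # 1555.2 GB/s
```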
### `ib-loopback`
#### Introduction
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Micro benchmark example for GPU Stream performance.
Commands to run:
python3 examples/benchmarks/gpu_stream.py
"""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger
if __name__ == '__main__':
context = BenchmarkRegistry.create_benchmark_context(
'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10'
)
# For ROCm environment, please specify the benchmark name and the platform as the following.
# context = BenchmarkRegistry.create_benchmark_context(
# 'gpu-stream', platform=Platform.ROCM, parameters='--num_warm_up 1 --num_loops 10'
# )
# To enable data checking, please add '--check_data'.
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
logger.info(
'benchmark: {}, return code: {}, result: {}'.format(
benchmark.name, benchmark.return_code, benchmark.result
)
)
...@@ -23,6 +23,7 @@
from superbench.benchmarks.micro_benchmarks.cpu_hpl_performance import CpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpcnet_performance import GPCNetBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_copy_bw_performance import GpuCopyBwBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_stream import GpuStreamBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_burn_test import GpuBurnBenchmark
from superbench.benchmarks.micro_benchmarks.ib_loopback_performance import IBLoopbackBenchmark
from superbench.benchmarks.micro_benchmarks.ib_validation_performance import IBBenchmark
...@@ -58,6 +59,7 @@
'GemmFlopsBenchmark',
'GpuBurnBenchmark',
'GpuCopyBwBenchmark',
'GpuStreamBenchmark',
'IBBenchmark',
'IBLoopbackBenchmark',
'KernelLaunch',
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the GPU Stream Performance benchmark."""
import os
from superbench.common.utils import logger
from superbench.benchmarks import BenchmarkRegistry, ReturnCode
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke
class GpuStreamBenchmark(MicroBenchmarkWithInvoke):
"""The GPU stream performance benchmark class."""
def __init__(self, name, parameters=''):
"""Constructor.
Args:
name (str): benchmark name.
parameters (str): benchmark parameters.
"""
super().__init__(name, parameters)
self._bin_name = 'gpu_stream'
def add_parser_arguments(self):
"""Add the specified arguments."""
super().add_parser_arguments()
self._parser.add_argument(
'--size',
type=int,
default=4096 * 1024**2,
required=False,
help='Size of data buffer in bytes.',
)
self._parser.add_argument(
'--num_warm_up',
type=int,
default=20,
required=False,
help='Number of warm-up rounds.',
)
self._parser.add_argument(
'--num_loops',
type=int,
default=100,
required=False,
help='Number of data buffer copies performed.',
)
self._parser.add_argument(
'--check_data',
action='store_true',
help='Enable data checking.',
)
def _preprocess(self):
"""Preprocess/preparation operations before the benchmarking.
Return:
True if _preprocess() succeed.
"""
if not super()._preprocess():
return False
self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name)
args = '--size %d --num_warm_up %d --num_loops %d' % (
self._args.size, self._args.num_warm_up, self._args.num_loops
)
if self._args.check_data:
args += ' --check_data'
self._commands = ['%s %s' % (self.__bin_path, args)]
return True
def _process_raw_result(self, cmd_idx, raw_output):
"""Function to parse raw results and save the summarized results.
self._result.add_raw_data() and self._result.add_result() need to be called to save the results.
Args:
cmd_idx (int): the index of command corresponding with the raw_output.
raw_output (str): raw output string of the micro-benchmark.
Return:
True if the raw output string is valid and result can be extracted.
"""
self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)
try:
output_lines = [x.strip() for x in raw_output.strip().splitlines()]
count = 0
for output_line in output_lines:
if output_line.startswith('STREAM_'):
count += 1
tag, bw_str, ratio = output_line.split()
self._result.add_result(tag + '_bw', float(bw_str))
self._result.add_result(tag + '_ratio', float(ratio))
if count == 0:
raise ValueError('No valid results found.')
except Exception as e:
self._result.set_return_code(ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
logger.error(
'The result format is invalid - round: {}, benchmark: {}, raw output: {}, message: {}.'.format(
self._curr_run_index, self._name, raw_output, str(e)
)
)
return False
return True
BenchmarkRegistry.register_benchmark('gpu-stream', GpuStreamBenchmark)
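The `_process_raw_result` method above turns each `STREAM_...` line emitted by the `gpu_stream` binary into a `_bw` and a `_ratio` metric. A minimal sketch of that parsing, assuming (as the parser does) a whitespace-separated line of tag, bandwidth, and efficiency; the sample line below is illustrative:

```python
def parse_stream_line(line):
    """Split one 'STREAM_...' output line into (tag, bandwidth, ratio)."""
    tag, bw_str, ratio_str = line.split()
    return tag, float(bw_str), float(ratio_str)

# Illustrative sample line; metrics are reported as '<tag>_bw' and '<tag>_ratio'
sample = 'STREAM_COPY_double_gpu_0_buffer_1073741824_block_128\t1620.12\t91.50'
tag, bw, ratio = parse_stream_line(sample)
```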
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
cmake_minimum_required(VERSION 3.18)
project(gpu_stream LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
find_package(CUDAToolkit QUIET)
# Source files
set(SOURCES
gpu_stream_test.cpp
gpu_stream_utils.cpp
gpu_stream.cu
gpu_stream_kernels.cu
)
# Cuda environment
if(CUDAToolkit_FOUND)
message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})
include(../cuda_common.cmake)
add_executable(gpu_stream ${SOURCES})
set_property(TARGET gpu_stream PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
target_include_directories(gpu_stream PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
target_link_libraries(gpu_stream numa nvidia-ml)
else()
# TODO: test for ROCm
# ROCm environment
include(../rocm_common.cmake)
find_package(hip QUIET)
if(hip_FOUND)
message(STATUS "Found ROCm: " ${HIP_VERSION})
# Convert cuda code to hip code in cpp
execute_process(COMMAND hipify-perl -print-stats -o gpu_stream.cpp ${SOURCES} WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# link hip device lib
add_executable(gpu_stream gpu_stream.cpp)
include(CheckSymbolExists)
check_symbol_exists("hipDeviceMallocUncached" "hip/hip_runtime_api.h" HIP_UNCACHED_MEMORY)
if(${HIP_UNCACHED_MEMORY})
target_compile_definitions(gpu_stream PRIVATE HIP_UNCACHED_MEMORY)
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
target_link_libraries(gpu_stream numa hip::device)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
endif()
install(TARGETS gpu_stream RUNTIME DESTINATION bin)
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
// GPU stream benchmark
// This benchmark is based on the STREAM benchmark, which is a simple benchmark program that measures
// sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple (COPY, SCALE, ADD, TRIAD)
// kernels.
#include "gpu_stream.hpp"
#include <cassert>
#include <iostream>
/**
* @brief Destroys the CUDA events used for benchmarking.
*
* @details This function cleans up and releases the resources associated with the CUDA events
* used for benchmarking based on the provided arguments. It ensures that all allocated resources
* are properly freed.
*
* @param[in,out] args A unique pointer to a BenchArgs structure
*
* @return int The status code indicating success or failure of the destruction process.
*/
template <typename T> int GpuStream::DestroyEvent(std::unique_ptr<BenchArgs<T>> &args) {
cudaError_t cuda_err = cudaSuccess;
if (SetGpu(args->gpu_id)) {
return -1;
}
cuda_err = cudaEventDestroy(args->sub.start_event);
if (cuda_err != cudaSuccess) {
std::cerr << "DestroyEvent::cudaEventDestroy error: " << cuda_err << std::endl;
return -1;
}
cuda_err = cudaEventDestroy(args->sub.end_event);
if (cuda_err != cudaSuccess) {
std::cerr << "DestroyEvent::cudaEventDestroy error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Constructor for the GpuStream class.
*
* This constructor initializes the GpuStream opts with the given parameters.
*
* @param opts Parsed command line options.
*/
GpuStream::GpuStream(Opts &opts) noexcept : opts_(opts) { PrintInputInfo(opts_); }
/**
* @brief Sets the active GPU.
*
* @details This function sets the active GPU to the specified GPU ID.
*
* @param[in] gpu_id The ID of the GPU to set as active.
*
* @return int The status code indicating success or failure.
*/
int GpuStream::SetGpu(int gpu_id) {
cudaError_t cuda_err = cudaSetDevice(gpu_id);
if (cuda_err != cudaSuccess) {
std::cerr << "SetGpu::cudaSetDevice " << gpu_id << " error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Retrieves the number of GPUs available.
*
* @details This function retrieves the number of GPUs available on the system and stores the count
* in the provided pointer.
*
* @param[out] gpu_count Pointer to an integer where the GPU count will be stored.
*
* @return int The status code indicating success or failure.
*/
int GpuStream::GetGpuCount(int *gpu_count) {
cudaError_t cuda_err = cudaGetDeviceCount(gpu_count);
if (cuda_err != cudaSuccess) {
std::cerr << "GetGpuCount::cudaGetDeviceCount error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief destroys buff and stream resources.
*
* @details This method cleans up and releases the resources associated with the buffer and stream
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments.
*
* @return int The status code indicating success or failure.
*/
template <typename T> int GpuStream::Destroy(std::unique_ptr<BenchArgs<T>> &args) {
int ret = DestroyBufAndStream(args);
if (ret == 0) {
ret = DestroyEvent(args);
}
return ret;
}
/**
* @brief Prints CUDA device information.
*
* @details This function prints the properties of a CUDA device specified by the device ID.
*
* @param[in] device_id The ID of the CUDA device.
* @param[out] prop The properties of the CUDA device.
* @return void
* */
void GpuStream::PrintCudaDeviceInfo(int device_id, const cudaDeviceProp &prop) {
std::cout << "\nDevice " << device_id << ": \"" << prop.name << "\"";
std::cout << " " << prop.multiProcessorCount << " SMs(" << prop.major << "." << prop.minor << ")";
// Compute theoretical bw:
// https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/#theoretical_bandwidth
std::cout << " Memory: " << prop.memoryClockRate / 1000.0 << "MHz x " << prop.memoryBusWidth
<< "-bit = " << (prop.memoryClockRate / 1000.0) * (prop.memoryBusWidth / 8) * 2 / 1000.0 << " GB/s PEAK ";
std::cout << " ECC is " << (prop.ECCEnabled ? "ON" : "OFF") << std::endl;
}
/**
* @brief Allocates a buffer on the GPU.
*
* @details This function allocates a buffer of the specified size on the GPU and returns a pointer
* to the allocated memory.
*
* @param[out] ptr Pointer to a pointer where the address of the allocated buffer will be stored.
* @param[in] size The size of the buffer to allocate in bytes.
*
* @return cudaError_t The status code indicating success or failure of the memory allocation.
*/
template <typename T> cudaError_t GpuStream::GpuMallocDataBuf(T **ptr, uint64_t size) { return cudaMalloc(ptr, size); }
/**
* @brief Prepares validation buffers for GPU stream benchmark.
*
* @details This function allocates and initializes validation buffers for different
* kernels (copy, scale, add, and triad) used in the GPU stream benchmark. The buffer order
* matches the kernel order in the Kernel enum.
*
* @param args A unique pointer to a BenchArgs structure containing the necessary arguments
* for preparing the buffer and stream.
*
* @return int The status code indicating success or failure of the preparation.
*/
template <typename T> int GpuStream::PrepareValidationBuf(std::unique_ptr<BenchArgs<T>> &args) {
args->sub.validation_buf_ptrs.resize(kNumValidationBuffers);
// Compute and allocate validation buffers for add, scale and triad
uint64_t size = args->size / sizeof(T);
for (auto &buf_ptr : args->sub.validation_buf_ptrs) {
buf_ptr.resize(size);
}
// Initialize validation buffer
for (size_t j = 0; j < size; j++) {
args->sub.validation_buf_ptrs[0][j] = static_cast<T>(j % kUInt8Mod);
args->sub.validation_buf_ptrs[1][j] = static_cast<T>(j % kUInt8Mod) * scalar;
args->sub.validation_buf_ptrs[2][j] = static_cast<T>(j % kUInt8Mod) + static_cast<T>(j % kUInt8Mod);
args->sub.validation_buf_ptrs[3][j] = static_cast<T>(j % kUInt8Mod) + static_cast<T>(j % kUInt8Mod) * scalar;
}
return 0;
}
/**
* @brief Prepares the buffer and stream for benchmarking.
*
* @details This function prepares the necessary buffer and stream for benchmarking based on the
* provided arguments. It initializes and configures the buffer and stream as required.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for preparing the buffer and stream.
*
* @return int The status code indicating success or failure of the preparation.
*/
template <typename T> int GpuStream::PrepareBufAndStream(std::unique_ptr<BenchArgs<T>> &args) {
cudaError_t cuda_err = cudaSuccess;
if (args->check_data) {
// Generate data to copy
args->sub.data_buf = static_cast<T *>(numa_alloc_onnode(args->size, args->numa_id));
for (uint64_t j = 0; j < args->size / sizeof(T); j++) {
args->sub.data_buf[j] = static_cast<T>(j % kUInt8Mod);
}
// Allocate check buffer
args->sub.check_buf = static_cast<T *>(numa_alloc_onnode(args->size, args->numa_id));
}
// Allocate buffers
args->sub.gpu_buf_ptrs.resize(kNumBuffers);
// Set to buffer device for GPU buffer
if (SetGpu(args->gpu_id)) {
return -1;
}
// Allocate buffers
for (auto &buf_ptr : args->sub.gpu_buf_ptrs) {
T *raw_ptr = nullptr;
cuda_err = GpuMallocDataBuf(&raw_ptr, args->size);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaMalloc error: " << cuda_err << std::endl;
return -1;
}
buf_ptr.reset(raw_ptr); // Transfer ownership to the smart pointer
}
// Initialize source buffer
if (args->check_data) {
cuda_err = cudaMemcpy(args->sub.gpu_buf_ptrs[0].get(), args->sub.data_buf, args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaMemcpy error: " << cuda_err << std::endl;
return -1;
}
cuda_err = cudaMemcpy(args->sub.gpu_buf_ptrs[1].get(), args->sub.data_buf, args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaMemcpy error: " << cuda_err << std::endl;
return -1;
}
PrepareValidationBuf<T>(args);
}
cuda_err = cudaStreamCreateWithFlags(&(args->sub.stream), cudaStreamNonBlocking);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaStreamCreate error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Prepares CUDA events for benchmarking.
*
* @details This function creates the necessary CUDA events for benchmarking based on the
* provided arguments. It initializes and configures the events as required.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for preparing the CUDA events.
*
* @return int The status code indicating success or failure of the preparation.
*/
template <typename T> int GpuStream::PrepareEvent(std::unique_ptr<BenchArgs<T>> &args) {
cudaError_t cuda_err = cudaSuccess;
if (SetGpu(args->gpu_id)) {
return -1;
}
cuda_err = cudaEventCreate(&(args->sub.start_event));
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareEvent::cudaEventCreate error: " << cuda_err << std::endl;
return -1;
}
cuda_err = cudaEventCreate(&(args->sub.end_event));
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareEvent::cudaEventCreate error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Validates the result of data transfer.
*
* @details This function checks the buffer to validate the result of a data transfer operation
* based on the provided arguments. It ensures that the data transfer was successful and that
* the buffer contains the expected data.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for validating the buffer.
*
* @return int The status code indicating success or failure of the validation.
*/
template <typename T> int GpuStream::CheckBuf(std::unique_ptr<BenchArgs<T>> &args, int kernel_idx) {
cudaError_t cuda_err = cudaSuccess;
int memcmp_result = 0;
if (SetGpu(args->gpu_id)) {
return -1;
}
// Copy buffer output from stream kernel to local buffer
cuda_err = cudaMemcpy(args->sub.check_buf, args->sub.gpu_buf_ptrs[2].get(), args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
std::cerr << "CheckBuf::cudaMemcpy error: " << cuda_err << std::endl;
return -1;
}
// Validate result by comparing the data buffer and check buffer
memcmp_result = memcmp(args->sub.validation_buf_ptrs[kernel_idx].data(), args->sub.check_buf, args->size);
if (memcmp_result) {
std::cerr << "CheckBuf::Memory check failed for kernel index " << kernel_idx << std::endl;
return -1;
}
return 0;
}
/**
* @brief Destroys the buffer and stream used for benchmarking.
*
* @details This function cleans up and releases the resources associated with the buffer and stream
* used for benchmarking based on the provided arguments. It ensures that all allocated buffers and streams
* resources are properly freed.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for destroying the buffer and stream.
*
* @return int The status code indicating success or failure of the destruction process.
*/
template <typename T> int GpuStream::DestroyBufAndStream(std::unique_ptr<BenchArgs<T>> &args) {
int ret = 0;
cudaError_t cuda_err = cudaSuccess;
// Destroy original data buffer and check buffer
if (args->sub.data_buf != nullptr) {
numa_free(args->sub.data_buf, args->size);
}
if (args->sub.check_buf != nullptr) {
numa_free(args->sub.check_buf, args->size);
}
// Set to buffer device for GPU buffer
if (SetGpu(args->gpu_id)) {
return -1;
}
cuda_err = cudaStreamDestroy(args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "DestroyBufAndStream::cudaStreamDestroy error: " << cuda_err << std::endl;
return -1;
}
return ret;
}
/**
* @brief Runs the STREAM benchmark.
*
* @details This function runs the STREAM benchmark using the specified kernel and number of threads per block.
* It prepares the necessary arguments and configurations for the benchmark execution.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments for the
benchmark.
* @param[in] kernel The kernel function to be used for the benchmark.
* @param[in] num_threads_per_block The number of threads per block to be used in the kernel execution.
*
* @return int The status code indicating success or failure of the benchmark execution.
*/
template <typename T>
int GpuStream::RunStreamKernel(std::unique_ptr<BenchArgs<T>> &args, Kernel kernel, int num_threads_per_block) {
cudaError_t cuda_err = cudaSuccess;
uint64_t num_thread_blocks;
int size_factor = 2;
// Validate data size
uint64_t num_elements_in_thread_block = kNumLoopUnroll * num_threads_per_block;
uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(T);
if (args->size % num_bytes_in_thread_block) {
std::cerr << "RunStreamKernel: Data size should be a multiple of " << num_bytes_in_thread_block << std::endl;
return -1;
}
num_thread_blocks = args->size / num_bytes_in_thread_block;
args->sub.times_in_ms.resize(static_cast<int>(Kernel::kCount));
if (SetGpu(args->gpu_id)) {
return -1;
}
// Launch jobs and collect running time
for (int i = 0; i < args->num_loops + args->num_warm_up; i++) {
// Record start event once warm up iterations are done
if (i == args->num_warm_up) {
cuda_err = cudaEventRecord(args->sub.start_event, args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaEventRecord error: " << cuda_err << std::endl;
return -1;
}
}
switch (kernel) {
case Kernel::kCopy:
CopyKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()));
args->sub.kernel_name = "COPY";
break;
case Kernel::kScale:
ScaleKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()), scalar);
args->sub.kernel_name = "SCALE";
break;
case Kernel::kAdd:
AddKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[1].get()));
size_factor = 3;
args->sub.kernel_name = "ADD";
break;
case Kernel::kTriad:
TriadKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[1].get()), scalar);
size_factor = 3;
args->sub.kernel_name = "TRIAD";
break;
default:
std::cerr << "RunStreamKernel::Invalid kernel" << std::endl;
return -1;
break;
}
// Record end event at the end of iterations
if (i + 1 == args->num_loops + args->num_warm_up) {
cuda_err = cudaEventRecord(args->sub.end_event, args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaEventRecord error: " << cuda_err << std::endl;
return -1;
}
}
}
// Wait for the stream to finish
cuda_err = cudaStreamSynchronize(args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaStreamSynchronize error: " << cuda_err << std::endl;
return -1;
}
// Calculate time
float time_in_ms = 0;
cuda_err = cudaEventElapsedTime(&time_in_ms, args->sub.start_event, args->sub.end_event);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaEventElapsedTime error: " << cuda_err << std::endl;
return -1;
}
args->sub.times_in_ms[static_cast<int>(kernel)].push_back(time_in_ms / size_factor);
return 0;
}
/**
* @brief Runs the benchmark for various kernels and processes the results for a BenchArgs config.
*
* @details This function prepares the necessary buffers and streams, runs the benchmark for each kernel
* with different thread per block configurations, checks the results, and processes the benchmark results.
* It also handles cleanup of resources in case of errors.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments for the
benchmark.
*
* @return int The status code indicating success or failure of the benchmark execution.
* */
template <typename T> int GpuStream::RunStream(std::unique_ptr<BenchArgs<T>> &args, const std::string &data_type) {
int ret = 0;
ret = PrepareBufAndStream<T>(args);
float peak_bw =
args->gpu_device_prop.memoryClockRate / 1000.0 * args->gpu_device_prop.memoryBusWidth / 8 * 2 / 1000.0;
if (ret != 0) {
DestroyBufAndStream(args);
return ret;
}
ret = PrepareEvent(args);
if (ret != 0) {
DestroyBufAndStream(args);
return ret;
}
// benchmark over the kThreadsPerBlock array
for (const int num_threads_in_block : kThreadsPerBlock) {
// run the stream benchmark over the stream kernels
for (int i = 0; i < static_cast<int>(Kernel::kCount); ++i) {
Kernel kernel = static_cast<Kernel>(i);
ret = RunStreamKernel<T>(args, kernel, num_threads_in_block);
if (ret == 0 && args->check_data) {
// Compare buffer based on the kernel
ret = CheckBuf(args, i);
}
}
}
// output formatted results to stdout
// Tags are of format:
// STREAM_<Kernelname>_datatype_gpu_<gpu_id>_buffer_<buffer_size>_block_<block_size>
for (int i = 0; i < args->sub.times_in_ms.size(); i++) {
std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_gpu_" + std::to_string(args->gpu_id) +
"_buffer_" + std::to_string(args->size);
for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) {
// Calculate and display bandwidth
double bw = args->size * args->num_loops / args->sub.times_in_ms[i][j] / 1e6;
std::cout << tag << "_block_" << kThreadsPerBlock[j] << "\t" << bw << "\t" << std::fixed
<< std::setprecision(2) << bw / peak_bw * 100 << std::endl;
}
}
// cleanup buffer and streams for the curr arg
Destroy(args);
return ret;
}
/**
* @brief Runs the Stream benchmark.
*
* @details This function processes the input args, validates and composes the BenchArgs structure for the available
* GPUs, and runs the benchmark.
*
* @return int The status code indicating success or failure of the benchmark execution.
* */
int GpuStream::Run() {
int ret = 0;
int gpu_count = 0;
// Check NUMA availability
if (numa_available() < 0) {
std::cerr << "Run::numa_available error" << std::endl;
return -1;
}
// Get number of GPUs
ret = GetGpuCount(&gpu_count);
if (ret != 0) {
return ret;
}
// find all GPUs and compose the Benchmarking data structure
for (int j = 0; j < gpu_count; j++) {
auto args = std::make_unique<BenchArgs<double>>();
args->numa_id = 0;
args->gpu_id = j;
cudaGetDeviceProperties(&args->gpu_device_prop, j);
args->num_warm_up = opts_.num_warm_up;
args->num_loops = opts_.num_loops;
args->size = opts_.size;
args->check_data = opts_.check_data;
// add data to vector
bench_args_.emplace_back(std::move(args));
}
bool has_error = false;
// Run the benchmark for all the configured data
for (auto &variant_args : bench_args_) {
std::visit(
[&](auto &curr_args) {
PrintCudaDeviceInfo(curr_args->gpu_id, curr_args->gpu_device_prop);
// Set the NUMA node
ret = numa_run_on_node(curr_args->numa_id);
if (ret != 0) {
std::cerr << "Run::numa_run_on_node error: " << errno << std::endl;
has_error = true;
return;
}
// Run the stream benchmark for the configured data
if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<float>>) {
ret = RunStream<float>(curr_args, "float");
} else if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<double>>) {
ret = RunStream<double>(curr_args, "double");
} else {
std::cerr << "Run::Unknown type error" << std::endl;
has_error = true;
return;
}
if (ret != 0) {
std::cerr << "Run::RunStream error: " << errno << std::endl;
has_error = true;
}
},
variant_args);
}
if (has_error) {
return -1;
}
return ret;
}
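`RunStream` above converts each kernel's elapsed time into bandwidth as `args->size * args->num_loops / time_in_ms / 1e6` GB/s, where `RunStreamKernel` has already divided the elapsed time by a size factor (2 for COPY/SCALE, 3 for ADD/TRIAD) to account for the number of buffers each kernel touches. A quick Python check of that arithmetic with illustrative numbers:

```python
def stream_bandwidth_gbps(size_bytes, num_loops, elapsed_ms, size_factor):
    """Mirror RunStreamKernel/RunStream: divide elapsed time by the number of
    buffers the kernel touches, then bandwidth = bytes * loops / ms / 1e6."""
    time_in_ms = elapsed_ms / size_factor
    return size_bytes * num_loops / time_in_ms / 1e6

# 4 GiB buffer, 100 timed COPY loops measured at 600 ms total
bw = stream_bandwidth_gbps(4 * 1024**3, 100, 600, 2)
```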
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <getopt.h>
#include <iostream>
#include <memory>
#include <variant>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <numa.h>
#include "gpu_stream_kernels.hpp"
#include "gpu_stream_utils.hpp"
#define NON_HIP (!defined(__HIP_PLATFORM_HCC__) && !defined(__HCC__) && !defined(__HIPCC__))
using namespace stream_config;
class GpuStream {
public:
GpuStream() = delete; // Delete default constructor
GpuStream(Opts &) noexcept; // Constructor
~GpuStream() noexcept = default; // Destructor
GpuStream(const GpuStream &) = delete;
GpuStream &operator=(const GpuStream &) = delete;
GpuStream(GpuStream &&) noexcept = default;
GpuStream &operator=(GpuStream &&) noexcept = default;
int Run();
private:
using BenchArgsVariant = std::variant<std::unique_ptr<BenchArgs<double>>>;
std::vector<BenchArgsVariant> bench_args_;
Opts opts_;
// Memory management functions
template <typename T> cudaError_t GpuMallocDataBuf(T **, uint64_t);
template <typename T> int PrepareValidationBuf(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int CheckBuf(std::unique_ptr<BenchArgs<T>> &, int);
template <typename T> int PrepareEvent(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int PrepareBufAndStream(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int DestroyEvent(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int DestroyBufAndStream(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int Destroy(std::unique_ptr<BenchArgs<T>> &);
// Benchmark functions
template <typename T> int RunStreamKernel(std::unique_ptr<BenchArgs<T>> &, Kernel, int);
template <typename T> int RunStream(std::unique_ptr<BenchArgs<T>> &, const std::string &data_type);
// Helper functions
int GetGpuCount(int *);
int SetGpu(int gpu_id);
void PrintCudaDeviceInfo(int, const cudaDeviceProp &);
};
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream_kernels.hpp"
/**
* @brief Fetches a value from source memory and writes it to a register.
*
* @details This inline device function fetches a value from the specified source memory
* location and writes it to the provided register. The implementation references the following:
* 1) NCCL:
* https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483
* 2) RCCL:
* https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268
*
* @tparam T The type of the value to fetch.
* @param[out] v The register to write the fetched value to.
* @param[in] p The source memory location to fetch the value from.
*/
template <typename T> inline __device__ void Fetch(T &v, const T *p) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
v = *p;
#else
if constexpr (std::is_same<T, float>::value) {
asm volatile("ld.volatile.global.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory");
} else if constexpr (std::is_same<T, double>::value) {
asm volatile("ld.volatile.global.f64 %0, [%1];" : "=d"(v) : "l"(p) : "memory");
}
#endif
}
/**
* @brief Stores a value from register and writes it to target memory.
*
* @details This inline device function stores a value from the provided register
* and writes it to the specified target memory location. The implementation references the following:
* 1) NCCL:
* https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486
* 2) RCCL:
* https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276
*
* @tparam T The type of the value to store.
* @param[out] p The target memory location to write the value to.
* @param[in] v The register containing the value to be stored.
*/
template <typename T> inline __device__ void Store(T *p, const T &v) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
*p = v;
#else
if constexpr (std::is_same<T, float>::value) {
asm volatile("st.volatile.global.f32 [%0], %1;" ::"l"(p), "f"(v) : "memory");
} else if constexpr (std::is_same<T, double>::value) {
asm volatile("st.volatile.global.f64 [%0], %1;" ::"l"(p), "d"(v) : "memory");
}
#endif
}
/**
* @brief Performs COPY, a simple copy operation from source to target. b = a
*
* @details This CUDA kernel performs a simple copy operation, copying data from the source array
* to the target array. This is used to measure transfer rates without any arithmetic operations.
*
* @param[out] tgt The target array where data will be copied to.
* @param[in] src The source array from which data will be copied.
*/
__global__ void CopyKernel(double *tgt, const double *src) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x; // 64-bit index math to avoid 32-bit overflow on large buffers
double val[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Fetch(val[i], src + index + i * blockDim.x);
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Store(tgt + index + i * blockDim.x, val[i]);
}
/**
* @brief Performs SCALE, a scaling operation on the source data. b = x * a
*
* @details This CUDA kernel performs a simple arithmetic operation by scaling the source data
* with a given scalar value and storing the result in the target array.
*
* @param[out] tgt The target array where the scaled data will be stored.
* @param[in] src The source array containing the data to be scaled.
* @param[in] scalar The scalar value used to scale the source data.
*/
__global__ void ScaleKernel(double *tgt, const double *src, const long scalar) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Fetch(val[i], src + index + i * blockDim.x);
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val[i] *= scalar;
Store(tgt + index + i * blockDim.x, val[i]);
}
}
/**
* @brief Performs ADD, an addition operation on two source arrays. c = a + b
*
* @details This CUDA kernel adds corresponding elements from two source arrays and stores the result
* in the target array. This operation is used to measure transfer rates with a simple arithmetic addition.
*
* @param[out] tgt The target array where the result of the addition will be stored.
* @param[in] src_a The first source array containing the first set of operands.
* @param[in] src_b The second source array containing the second set of operands.
*/
__global__ void AddKernel(double *tgt, const double *src_a, const double *src_b) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val_a[kNumLoopUnrollAlias];
double val_b[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
Fetch(val_a[i], src_a + index + i * blockDim.x);
Fetch(val_b[i], src_b + index + i * blockDim.x);
}
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val_a[i] += val_b[i];
Store(tgt + index + i * blockDim.x, val_a[i]);
}
}
/**
* @brief Performs TRIAD, fused multiply/add operations on source arrays. a = b + x * c
*
* @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from
* the second source array with a scalar value, adding the result to corresponding elements from
* the first source array, and storing the result in the target array.
*
* @param[out] tgt The target array where the result of the fused multiply/add operation will be stored.
* @param[in] src_a The first source array containing the first set of operands.
* @param[in] src_b The second source array containing the second set of operands to be multiplied by the scalar.
* @param[in] scalar The scalar value used in the multiply/add operation.
*/
__global__ void TriadKernel(double *tgt, const double *src_a, const double *src_b, const long scalar) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val_a[kNumLoopUnrollAlias];
double val_b[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
Fetch(val_a[i], src_a + index + i * blockDim.x);
Fetch(val_b[i], src_b + index + i * blockDim.x);
}
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val_b[i] += (val_a[i] * scalar);
Store(tgt + index + i * blockDim.x, val_b[i]);
}
}
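Copy and scale read one buffer and write another (two transfers per element), while add and triad read two buffers and write one (three transfers), which matches the `kBufferBwMultipliers` constant `{2, 2, 3, 3}` declared in `gpu_stream_utils.hpp`. As a hedged sketch (the exact formula the harness uses is not shown in this diff), the reported bandwidth and efficiency columns could be derived from a timed run as follows:

```python
# Illustrative post-processing only; the benchmark's exact computation
# is not part of this diff chunk.
KERNEL_MULTIPLIERS = {'COPY': 2, 'SCALE': 2, 'ADD': 3, 'TRIAD': 3}

def stream_bandwidth_gbps(kernel, buffer_bytes, time_ms):
    """Bandwidth in GB/s, assuming bytes moved = multiplier * buffer size."""
    bytes_moved = KERNEL_MULTIPLIERS[kernel] * buffer_bytes
    return bytes_moved / (time_ms * 1e-3) / 1e9

def efficiency_pct(bw_gbps, peak_gbps):
    """Second column of each result line: measured bandwidth as a percentage of peak."""
    return bw_gbps / peak_gbps * 100.0
```

For example, the first result line in the sample log (6711.67 GB/s against the 8192 GB/s peak printed in the device banner) corresponds to an efficiency of about 81.93%, matching the second column.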
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <cuda.h>
#include <cuda_runtime.h>
#include "gpu_stream_utils.hpp"
constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll;
// Function declarations
template <typename T> inline __device__ void Fetch(T &v, const T *p);
template <typename T> inline __device__ void Store(T *p, const T &v);
__global__ void CopyKernel(double *, const double *);
__global__ void ScaleKernel(double *, const double *, const long);
__global__ void AddKernel(double *, const double *, const double *);
__global__ void TriadKernel(double *, const double *, const double *, const long);
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream.hpp"
/**
* @brief Main function and entry of gpu stream benchmark
* @details
* params list:
* num_warm_up: warm up count
* num_loops: num of runs for timing
* size: number of bytes to set up for the test
* @param argc argument count
* @param argv argument vector
* @return int
*/
int main(int argc, char **argv) {
int ret = 0;
stream_config::Opts opts;
// parse arguments from cmd
ret = stream_config::ParseOpts(argc, argv, &opts);
if (ret != 0) {
return ret;
}
// run the stream benchmark
GpuStream gpu_stream(opts);
ret = gpu_stream.Run();
if (ret != 0) {
return ret;
}
return 0;
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream_utils.hpp"
namespace stream_config {
/**
* @brief Converts a kernel index to its corresponding string representation.
*
* @details This function takes an integer representing a kernel index and returns the corresponding
* string representation of the kernel. The mapping between kernel indices and their string representations
* is defined within the function.
*
* @param[in] kernel_idx The index of the kernel to be converted to a string.
*
* @return std::string The string representation of the kernel.
*/
std::string KernelToString(int kernel_idx) {
switch (kernel_idx) {
case static_cast<int>(Kernel::kCopy):
return "COPY";
case static_cast<int>(Kernel::kScale):
return "SCALE";
case static_cast<int>(Kernel::kAdd):
return "ADD";
case static_cast<int>(Kernel::kTriad):
return "TRIAD";
default:
return "UNKNOWN";
}
}
/**
* @brief Print the usage of this program.
*
* @details This function prints the usage of this program.
*
* @return void.
* */
void PrintUsage() {
std::cout << "Usage: gpu_stream "
<< "--size <size in bytes> "
<< "--num_warm_up <num_warm_up> "
<< "--num_loops <num_loops> "
<< "[--check_data]" << std::endl;
}
/**
* @brief Print the user provided inputs info.
*
* @details This function prints the parsed user-provided inputs of this program.
*
* @param[in] opts The Opts struct that stores the parsed values.
*
* @return void
* */
void PrintInputInfo(Opts &opts) {
std::cout << "STREAM Benchmark" << std::endl;
std::cout << "Buffer size(bytes): " << opts.size << std::endl;
std::cout << "Number of warm up runs: " << opts.num_warm_up << std::endl;
std::cout << "Number of loops: " << opts.num_loops << std::endl;
std::cout << "Check data: " << (opts.check_data ? "Yes" : "No") << std::endl;
}
/**
* @brief Parse the command line options.
*
* @details This function parses the command line options and stores the values in the Opts struct.
*
* @param[in] argc The number of command line options.
* @param[in] argv The command line options.
* @param[out] opts The Opts struct to store the parsed values.
*
* @return int The status code.
* */
int ParseOpts(int argc, char **argv, Opts *opts) {
enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData };
const struct option options[] = {{"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
{"num_warm_up", required_argument, nullptr, static_cast<int>(OptIdx::kNumWarmUp)},
{"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumLoops)},
{"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)}};
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true; // --size is optional; Opts already provides a default buffer size
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
} else if (getopt_ret == '?') {
parse_err = true;
break;
}
switch (opt_idx) {
case static_cast<int>(OptIdx::kSize):
if (1 != sscanf(optarg, "%lu", &(opts->size))) {
std::cerr << "Invalid size: " << optarg << std::endl;
parse_err = true;
} else {
size_specified = true;
}
break;
case static_cast<int>(OptIdx::kNumWarmUp):
if (1 != sscanf(optarg, "%lu", &(opts->num_warm_up))) {
std::cerr << "Invalid num_warm_up: " << optarg << std::endl;
parse_err = true;
} else {
num_warm_up_specified = true;
}
break;
case static_cast<int>(OptIdx::kNumLoops):
if (1 != sscanf(optarg, "%lu", &(opts->num_loops))) {
std::cerr << "Invalid num_loops: " << optarg << std::endl;
parse_err = true;
} else {
num_loops_specified = true;
}
break;
case static_cast<int>(OptIdx::kEnableCheckData):
opts->check_data = true;
break;
default:
parse_err = true;
}
if (parse_err) {
break;
}
}
if (parse_err) {
PrintUsage();
return -1;
}
return 0;
}
} // namespace stream_config
unsigned long long getCurrentTimestampInMicroseconds() {
// Get the current time point
auto now = std::chrono::system_clock::now();
// Convert to time since epoch
auto duration = now.time_since_epoch();
// Convert to microseconds
auto microseconds = std::chrono::duration_cast<std::chrono::microseconds>(duration).count();
return static_cast<unsigned long long>(microseconds);
}
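The host-side timer above converts the system clock to microseconds since the epoch via `std::chrono`. A purely illustrative Python equivalent of the same conversion:

```python
import time

def current_timestamp_us():
    """Python analog of getCurrentTimestampInMicroseconds():
    microseconds since the Unix epoch, as an integer."""
    return int(time.time() * 1e6)

# Successive calls are non-decreasing wall-clock timestamps.
t1 = current_timestamp_us()
t2 = current_timestamp_us()
```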
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <array>
#include <chrono>
#include <getopt.h>
#include <iomanip>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <numa.h>
#include <nvml.h>
// Custom deleter for GPU buffers
struct GpuBufferDeleter {
template <typename T> void operator()(T *ptr) const {
if (ptr) {
cudaFree(ptr);
}
}
};
unsigned long long getCurrentTimestampInMicroseconds();
namespace stream_config {
constexpr std::array<int, 4> kThreadsPerBlock = {128, 256, 512, 1024}; // Threads per block
constexpr uint64_t kDefaultBufferSizeInBytes = 4294967296; // Default buffer size 4GB
constexpr int kNumLoopUnroll = 2; // Unroll depth in SM copy kernel
constexpr int kNumBuffers = 3; // Number of buffers for triad, add kernel
constexpr int kNumValidationBuffers = 4; // Number of validation buffers, one for each kernel
constexpr int kUInt8Mod = 256;                                    // Modulo for uint8_t data type
constexpr std::array<int, 4> kBufferBwMultipliers = {2, 2, 3, 3}; // Bandwidth multipliers for copy, scale, add, triad kernels
constexpr long scalar = 11; // Scalar for scale, triad kernel
// Enum for different kernels
enum class Kernel {
kCopy,
kScale,
kAdd,
kTriad,
kCount // Add a count to keep track of the number of enums. Helpful for iterating over enums.
};
// Arguments for each sub benchmark run.
template <typename T> struct SubBenchArgs {
// Unique pointer for GPU buffers
using GpuBufferUniquePtr = std::unique_ptr<T, GpuBufferDeleter>;
// Original data buffer.
T *data_buf = nullptr;
// Buffer to validate the correctness of data transfer.
T *check_buf = nullptr;
// GPU pointer of the data buffer on source devices.
std::vector<GpuBufferUniquePtr> gpu_buf_ptrs;
// Pointers to the validation buffers for each kernel. Order matches the Kernel enum.
std::vector<std::vector<T>> validation_buf_ptrs;
// CUDA stream to be used.
cudaStream_t stream;
// CUDA event to record start time.
cudaEvent_t start_event;
// CUDA event to record end time.
cudaEvent_t end_event;
// Recorded kernel execution times in milliseconds.
std::vector<std::vector<float>> times_in_ms;
// Stream Kernel name.
std::string kernel_name;
};
// Arguments for each benchmark run.
template <typename T> struct BenchArgs {
// NUMA node under which the benchmark is done.
uint64_t numa_id = 0;
// GPU ID for device.
int gpu_id = 0;
// GPU device info
cudaDeviceProp gpu_device_prop;
// Data buffer size used.
uint64_t size = kDefaultBufferSizeInBytes;
// Number of warm up rounds to run.
uint64_t num_warm_up = 0;
// Number of loops to run.
uint64_t num_loops = 1;
// Whether check data after copy.
bool check_data = false;
// Arguments for the sub-benchmark.
SubBenchArgs<T> sub;
};
// Options accepted by this program.
struct Opts {
// Data buffer size for copy benchmark.
uint64_t size = kDefaultBufferSizeInBytes;
// Number of warm up rounds to run.
uint64_t num_warm_up = 0;
// Number of loops to run.
uint64_t num_loops = 0;
// Whether check data after copy.
bool check_data = false;
};
std::string KernelToString(int); // Function to convert enum to string
int ParseOpts(int, char **, Opts *);
void PrintInputInfo(Opts &);
void PrintUsage();
} // namespace stream_config
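Each kernel thread consumes `kNumLoopUnroll` elements strided by `blockDim.x`, so a block of `b` threads covers `b * kNumLoopUnroll` contiguous elements. A small host-side Python simulation of the kernels' index math (illustrative only) shows that every element is touched exactly once when the grid is sized as `elements / (blockDim * kNumLoopUnroll)`:

```python
NUM_LOOP_UNROLL = 2  # mirrors stream_config::kNumLoopUnroll

def touched_indices(num_blocks, block_dim):
    """Replicate the kernel's addressing: base = blockIdx * blockDim * unroll + threadIdx,
    then each thread touches base + i * blockDim for i in range(unroll)."""
    out = []
    for block in range(num_blocks):
        for thread in range(block_dim):
            base = block * block_dim * NUM_LOOP_UNROLL + thread
            for i in range(NUM_LOOP_UNROLL):
                out.append(base + i * block_dim)
    return out

# With 4 blocks of 128 threads, all 4 * 128 * 2 = 1024 elements are covered exactly once.
idx = touched_indices(4, 128)
```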
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""Tests for gpu_stream benchmark."""
import numbers
import unittest
from tests.helper import decorator
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks import BenchmarkRegistry, BenchmarkType, ReturnCode, Platform
class GpuStreamBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
"""Test class for gpu_stream benchmark."""
@classmethod
def setUpClass(cls):
"""Hook method for setting up class fixture before running tests in the class."""
super().setUpClass()
cls.createMockEnvs(cls)
cls.createMockFiles(cls, ['bin/gpu_stream'])
def _test_gpu_stream_command_generation(self, platform):
"""Test gpu-stream benchmark command generation."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
num_warm_up = 5
num_loops = 10
size = 25769803776
parameters = '--num_warm_up %d --num_loops %d --size %d ' \
'--check_data' % \
(num_warm_up, num_loops, size)
benchmark = benchmark_class(benchmark_name, parameters=parameters)
# Check basic information
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == benchmark_name)
assert (benchmark.type == BenchmarkType.MICRO)
# Check parameters specified in BenchmarkContext.
assert (benchmark._args.size == size)
assert (benchmark._args.num_warm_up == num_warm_up)
assert (benchmark._args.num_loops == num_loops)
assert (benchmark._args.check_data)
# Check command
assert (1 == len(benchmark._commands))
assert (benchmark._commands[0].startswith(benchmark._GpuStreamBenchmark__bin_path))
assert ('--size %d' % size in benchmark._commands[0])
assert ('--num_warm_up %d' % num_warm_up in benchmark._commands[0])
assert ('--num_loops %d' % num_loops in benchmark._commands[0])
assert ('--check_data' in benchmark._commands[0])
@decorator.cuda_test
def test_gpu_stream_command_generation_cuda(self):
"""Test gpu-stream benchmark command generation, CUDA case."""
self._test_gpu_stream_command_generation(Platform.CUDA)
@decorator.rocm_test
def test_gpu_stream_command_generation_rocm(self):
"""Test gpu-stream benchmark command generation, ROCm case."""
self._test_gpu_stream_command_generation(Platform.ROCM)
@decorator.load_data('tests/data/gpu_stream.log')
def _test_gpu_stream_result_parsing(self, platform, test_raw_output):
"""Test gpu-stream benchmark result parsing."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
benchmark = benchmark_class(benchmark_name, parameters='')
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == 'gpu-stream')
assert (benchmark.type == BenchmarkType.MICRO)
# Positive case - valid raw output.
assert (benchmark._process_raw_result(0, test_raw_output))
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (1 == len(benchmark.raw_data))
test_raw_output_dict = {
x.split()[0]: [float(x.split()[1]), float(x.split()[2])]
for x in test_raw_output.strip().splitlines() if x.startswith('STREAM_')
}
assert (len(test_raw_output_dict) * 2 + benchmark.default_metric_count == len(benchmark.result))
for output_key in benchmark.result:
if output_key == 'return_code':
assert (benchmark.result[output_key] == [0])
else:
assert (len(benchmark.result[output_key]) == 1)
assert (isinstance(benchmark.result[output_key][0], numbers.Number))
if output_key.endswith('_bw'):
assert (output_key.strip('_bw') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0])
else:
assert (output_key.strip('_ratio') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0])
# Negative case - invalid raw output.
assert (benchmark._process_raw_result(1, 'Invalid raw output') is False)
assert (benchmark.return_code == ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
@decorator.cuda_test
def test_gpu_stream_result_parsing_cuda(self):
"""Test gpu-stream benchmark result parsing, CUDA case."""
self._test_gpu_stream_result_parsing(Platform.CUDA)
@decorator.rocm_test
def test_gpu_stream_result_parsing_rocm(self):
"""Test gpu-stream benchmark result parsing, ROCm case."""
self._test_gpu_stream_result_parsing(Platform.ROCM)
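The parsing test above treats each `STREAM_...` line of the raw output as a whitespace-separated triple of metric name, bandwidth (GB/s), and efficiency ratio (%). A minimal standalone sketch of that per-line parsing (a hypothetical helper, not part of the benchmark code):

```python
def parse_stream_line(line):
    """Split one 'STREAM_...' result line into (metric, bandwidth_gbps, ratio_pct)."""
    name, bw, ratio = line.split()
    return name, float(bw), float(ratio)

# Example using the first result line of the sample log below.
metric, bw, ratio = parse_stream_line(
    'STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93'
)
```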
STREAM Benchmark
Buffer size(bytes): 4294967296
Number of warm up runs: 10
Number of loops: 40
Check data: No
Device 0: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93
STREAM_COPY_double_gpu_0_buffer_4294967296_block_256 6549.50 79.95
STREAM_COPY_double_gpu_0_buffer_4294967296_block_512 6195.43 75.63
STREAM_COPY_double_gpu_0_buffer_4294967296_block_1024 5721.52 69.84
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_128 6680.42 81.55
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_256 6515.51 79.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_512 6106.69 74.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_1024 5626.68 68.69
STREAM_ADD_double_gpu_0_buffer_4294967296_block_128 7379.25 90.08
STREAM_ADD_double_gpu_0_buffer_4294967296_block_256 7407.27 90.42
STREAM_ADD_double_gpu_0_buffer_4294967296_block_512 7309.59 89.23
STREAM_ADD_double_gpu_0_buffer_4294967296_block_1024 6788.64 82.87
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_128 7378.19 90.07
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_256 7414.01 90.50
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_512 7295.50 89.06
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024 6730.42 82.16
Device 1: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000.00MHz x 8192-bit = 8192.00 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_1_buffer_4294967296_block_128 6708.74 81.89
STREAM_COPY_double_gpu_1_buffer_4294967296_block_256 6549.47 79.95
STREAM_COPY_double_gpu_1_buffer_4294967296_block_512 6195.39 75.63
STREAM_COPY_double_gpu_1_buffer_4294967296_block_1024 5725.07 69.89
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_128 6678.56 81.53
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_256 6514.05 79.52
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_512 6103.80 74.51
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_1024 5630.41 68.73
STREAM_ADD_double_gpu_1_buffer_4294967296_block_128 7377.74 90.06
STREAM_ADD_double_gpu_1_buffer_4294967296_block_256 7410.97 90.47
STREAM_ADD_double_gpu_1_buffer_4294967296_block_512 7310.80 89.24
STREAM_ADD_double_gpu_1_buffer_4294967296_block_1024 6789.91 82.88
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_128 7379.03 90.08
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_256 7414.04 90.50
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_512 7298.26 89.09
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_1024 6732.15 82.18