Unverified Commit 4eddd50a authored by WenqingLan1's avatar WenqingLan1 Committed by GitHub

Benchmarks - Add GPU Stream Micro Benchmark (#697)

Added GPU Stream benchmark, which measures GPU memory bandwidth and
efficiency for the double data type through various memory operations,
including copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction,
metrics, and descriptions.
- added unit tests for `gpu-stream`. Example output is in
`superbenchmark/tests/data/gpu_stream.log`.
parent 991c0051
...@@ -262,6 +262,25 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU
| gpu\_all\_to\_gpu[0-9]+\_write\_by\_sm\_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
| gpu\_all\_to\_gpu\_all\_write\_by\_sm\_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |
### `gpu-stream`
#### Introduction
Measure the memory bandwidth of a GPU using the STREAM benchmark. The benchmark tests various memory operations, including copy, scale, add, and triad, for the double data type.
#### Metrics
| Metric Name | Unit | Description |
|------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
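The `_ratio` metrics above are the measured bandwidth divided by the device's theoretical peak memory bandwidth (memory clock × bus width × 2 transfers per clock for DDR memory). A minimal Python sketch of that calculation; the device numbers below are illustrative, not from any particular GPU:

```python
def theoretical_peak_gbps(mem_clock_mhz, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s (DDR: 2 transfers per clock)."""
    return mem_clock_mhz * (bus_width_bits / 8) * 2 / 1000.0

def efficiency_pct(measured_gbps, mem_clock_mhz, bus_width_bits):
    """Bandwidth efficiency as reported by the `_ratio` metrics."""
    return measured_gbps / theoretical_peak_gbps(mem_clock_mhz, bus_width_bits) * 100

# Illustrative HBM2-like device: 1215 MHz memory clock, 5120-bit bus
peak = theoretical_peak_gbps(1215, 5120)  # 1555.2 GB/s
```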
### `ib-loopback`
#### Introduction
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Micro benchmark example for GPU Stream performance.
Commands to run:
python3 examples/benchmarks/gpu_stream.py
"""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger
if __name__ == '__main__':
context = BenchmarkRegistry.create_benchmark_context(
'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10'
)
# For ROCm environment, please specify the benchmark name and the platform as the following.
# context = BenchmarkRegistry.create_benchmark_context(
# 'gpu-stream', platform=Platform.ROCM, parameters='--num_warm_up 1 --num_loops 10'
# )
# To enable data checking, please add '--check_data'.
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
logger.info(
'benchmark: {}, return code: {}, result: {}'.format(
benchmark.name, benchmark.return_code, benchmark.result
)
)
...@@ -23,6 +23,7 @@
from superbench.benchmarks.micro_benchmarks.cpu_hpl_performance import CpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpcnet_performance import GPCNetBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_copy_bw_performance import GpuCopyBwBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_stream import GpuStreamBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_burn_test import GpuBurnBenchmark
from superbench.benchmarks.micro_benchmarks.ib_loopback_performance import IBLoopbackBenchmark
from superbench.benchmarks.micro_benchmarks.ib_validation_performance import IBBenchmark
...@@ -58,6 +59,7 @@
'GemmFlopsBenchmark',
'GpuBurnBenchmark',
'GpuCopyBwBenchmark',
'GpuStreamBenchmark',
'IBBenchmark',
'IBLoopbackBenchmark',
'KernelLaunch',
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the GPU Stream Performance benchmark."""
import os
from superbench.common.utils import logger
from superbench.benchmarks import BenchmarkRegistry, ReturnCode
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke
class GpuStreamBenchmark(MicroBenchmarkWithInvoke):
"""The GPU stream performance benchmark class."""
def __init__(self, name, parameters=''):
"""Constructor.
Args:
name (str): benchmark name.
parameters (str): benchmark parameters.
"""
super().__init__(name, parameters)
self._bin_name = 'gpu_stream'
def add_parser_arguments(self):
"""Add the specified arguments."""
super().add_parser_arguments()
self._parser.add_argument(
'--size',
type=int,
default=4096 * 1024**2,
required=False,
help='Size of data buffer in bytes.',
)
self._parser.add_argument(
'--num_warm_up',
type=int,
default=20,
required=False,
help='Number of warm-up rounds.',
)
self._parser.add_argument(
'--num_loops',
type=int,
default=100,
required=False,
help='Number of data buffer copies performed.',
)
self._parser.add_argument(
'--check_data',
action='store_true',
help='Enable data checking.',
)
def _preprocess(self):
"""Preprocess/preparation operations before the benchmarking.
Return:
True if _preprocess() succeed.
"""
if not super()._preprocess():
return False
self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name)
args = '--size %d --num_warm_up %d --num_loops %d' % (
self._args.size, self._args.num_warm_up, self._args.num_loops
)
if self._args.check_data:
args += ' --check_data'
self._commands = ['%s %s' % (self.__bin_path, args)]
return True
def _process_raw_result(self, cmd_idx, raw_output):
"""Function to parse raw results and save the summarized results.
self._result.add_raw_data() and self._result.add_result() need to be called to save the results.
Args:
cmd_idx (int): the index of command corresponding with the raw_output.
raw_output (str): raw output string of the micro-benchmark.
Return:
True if the raw output string is valid and result can be extracted.
"""
self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)
try:
output_lines = [x.strip() for x in raw_output.strip().splitlines()]
count = 0
for output_line in output_lines:
if output_line.startswith('STREAM_'):
count += 1
tag, bw_str, ratio = output_line.split()
self._result.add_result(tag + '_bw', float(bw_str))
self._result.add_result(tag + '_ratio', float(ratio))
if count == 0:
raise ValueError('No valid results found.')
except Exception as e:
self._result.set_return_code(ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
logger.error(
'The result format is invalid - round: {}, benchmark: {}, raw output: {}, message: {}.'.format(
self._curr_run_index, self._name, raw_output, str(e)
)
)
return False
return True
BenchmarkRegistry.register_benchmark('gpu-stream', GpuStreamBenchmark)
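The `_process_raw_result` method above turns each `STREAM_...` line emitted by the `gpu_stream` binary into a `_bw` and a `_ratio` metric. A minimal sketch of that parsing, assuming (as the parser does) a whitespace-separated line of tag, bandwidth, and efficiency; the sample line below is illustrative:

```python
def parse_stream_line(line):
    """Split one 'STREAM_...' output line into (tag, bandwidth, ratio)."""
    tag, bw_str, ratio_str = line.split()
    return tag, float(bw_str), float(ratio_str)

# Illustrative sample line; metrics are reported as '<tag>_bw' and '<tag>_ratio'
sample = 'STREAM_COPY_double_gpu_0_buffer_1073741824_block_128\t1620.12\t91.50'
tag, bw, ratio = parse_stream_line(sample)
```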
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
cmake_minimum_required(VERSION 3.18)
project(gpu_stream LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
find_package(CUDAToolkit QUIET)
# Source files
set(SOURCES
gpu_stream_test.cpp
gpu_stream_utils.cpp
gpu_stream.cu
gpu_stream_kernels.cu
)
# Cuda environment
if(CUDAToolkit_FOUND)
message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})
include(../cuda_common.cmake)
add_executable(gpu_stream ${SOURCES})
set_property(TARGET gpu_stream PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
target_include_directories(gpu_stream PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
target_link_libraries(gpu_stream numa nvidia-ml)
else()
# TODO: test for ROCm
# ROCm environment
include(../rocm_common.cmake)
find_package(hip QUIET)
if(hip_FOUND)
message(STATUS "Found ROCm: " ${HIP_VERSION})
# Convert cuda code to hip code in cpp
execute_process(COMMAND hipify-perl -print-stats -o gpu_stream.cpp ${SOURCES} WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# link hip device lib
add_executable(gpu_stream gpu_stream.cpp)
include(CheckSymbolExists)
check_symbol_exists("hipDeviceMallocUncached" "hip/hip_runtime_api.h" HIP_UNCACHED_MEMORY)
if(${HIP_UNCACHED_MEMORY})
target_compile_definitions(gpu_stream PRIVATE HIP_UNCACHED_MEMORY)
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
target_link_libraries(gpu_stream numa hip::device)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
endif()
install(TARGETS gpu_stream RUNTIME DESTINATION bin)
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
// GPU stream benchmark
// This benchmark is based on the STREAM benchmark, which is a simple benchmark program that measures
// sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple (COPY, SCALE, ADD, TRIAD)
// kernels.
#include "gpu_stream.hpp"
#include <cassert>
#include <iostream>
/**
* @brief Destroys the CUDA events used for benchmarking.
*
* @details This function cleans up and releases the resources associated with the CUDA events
* used for benchmarking based on the provided arguments. It ensures that all allocated resources
* are properly freed.
*
* @param[in,out] args A unique pointer to a BenchArgs structure
*
* @return int The status code indicating success or failure of the destruction process.
*/
template <typename T> int GpuStream::DestroyEvent(std::unique_ptr<BenchArgs<T>> &args) {
cudaError_t cuda_err = cudaSuccess;
if (SetGpu(args->gpu_id)) {
return -1;
}
cuda_err = cudaEventDestroy(args->sub.start_event);
if (cuda_err != cudaSuccess) {
std::cerr << "DestroyEvent::cudaEventDestroy error: " << cuda_err << std::endl;
return -1;
}
cuda_err = cudaEventDestroy(args->sub.end_event);
if (cuda_err != cudaSuccess) {
std::cerr << "DestroyEvent::cudaEventDestroy error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Constructor for the GpuStream class.
*
* This constructor initializes the GpuStream opts with the given parameters.
*
* @param opts Parsed command line options.
*/
GpuStream::GpuStream(Opts &opts) noexcept : opts_(opts) { PrintInputInfo(opts_); }
/**
* @brief Sets the active GPU.
*
* @details This function sets the active GPU to the specified GPU ID.
*
* @param[in] gpu_id The ID of the GPU to set as active.
*
* @return int The status code indicating success or failure.
*/
int GpuStream::SetGpu(int gpu_id) {
cudaError_t cuda_err = cudaSetDevice(gpu_id);
if (cuda_err != cudaSuccess) {
std::cerr << "SetGpu::cudaSetDevice " << gpu_id << " error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Retrieves the number of GPUs available.
*
* @details This function retrieves the number of GPUs available on the system and stores the count
* in the provided pointer.
*
* @param[out] gpu_count Pointer to an integer where the GPU count will be stored.
*
* @return int The status code indicating success or failure.
*/
int GpuStream::GetGpuCount(int *gpu_count) {
cudaError_t cuda_err = cudaGetDeviceCount(gpu_count);
if (cuda_err != cudaSuccess) {
std::cerr << "GetGpuCount::cudaGetDeviceCount error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief destroys buff and stream resources.
*
* @details This method cleans up and releases the resources associated with the buffer and stream
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments.
*
* @return int The status code indicating success or failure.
*/
template <typename T> int GpuStream::Destroy(std::unique_ptr<BenchArgs<T>> &args) {
int ret = DestroyBufAndStream(args);
if (ret == 0) {
ret = DestroyEvent(args);
}
return ret;
}
/**
* @brief Prints CUDA device information.
*
* @details This function prints the properties of a CUDA device specified by the device ID.
*
* @param[in] device_id The ID of the CUDA device.
* @param[out] prop The properties of the CUDA device.
* @return void
* */
void GpuStream::PrintCudaDeviceInfo(int device_id, const cudaDeviceProp &prop) {
std::cout << "\nDevice " << device_id << ": \"" << prop.name << "\"";
std::cout << " " << prop.multiProcessorCount << " SMs(" << prop.major << "." << prop.minor << ")";
// Compute theoretical bw:
// https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/#theoretical_bandwidth
std::cout << " Memory: " << prop.memoryClockRate / 1000.0 << "MHz x " << prop.memoryBusWidth
<< "-bit = " << (prop.memoryClockRate / 1000.0) * (prop.memoryBusWidth / 8) * 2 / 1000.0 << " GB/s PEAK ";
std::cout << " ECC is " << (prop.ECCEnabled ? "ON" : "OFF") << std::endl;
}
/**
* @brief Allocates a buffer on the GPU.
*
* @details This function allocates a buffer of the specified size on the GPU and returns a pointer
* to the allocated memory.
*
* @param[out] ptr Pointer to a pointer where the address of the allocated buffer will be stored.
* @param[in] size The size of the buffer to allocate in bytes.
*
* @return cudaError_t The status code indicating success or failure of the memory allocation.
*/
template <typename T> cudaError_t GpuStream::GpuMallocDataBuf(T **ptr, uint64_t size) { return cudaMalloc(ptr, size); }
/**
* @brief Prepares validation buffers for GPU stream benchmark.
*
* @details This function allocates and initializes validation buffers for different
* kernels (copy, scale, add, and triad) used in the GPU stream benchmark. The buffer order
* matches the kernel order in the Kernel enum.
*
* @param args A unique pointer to a BenchArgs structure containing the necessary arguments
* for preparing the buffer and stream.
*
* @return int The status code indicating success or failure of the preparation.
*/
template <typename T> int GpuStream::PrepareValidationBuf(std::unique_ptr<BenchArgs<T>> &args) {
args->sub.validation_buf_ptrs.resize(kNumValidationBuffers);
// Compute and allocate validation buffers for add, scale and triad
uint64_t size = args->size / sizeof(T);
for (auto &buf_ptr : args->sub.validation_buf_ptrs) {
buf_ptr.resize(size);
}
// Initialize validation buffer
for (size_t j = 0; j < size; j++) {
args->sub.validation_buf_ptrs[0][j] = static_cast<T>(j % kUInt8Mod);
args->sub.validation_buf_ptrs[1][j] = static_cast<T>(j % kUInt8Mod) * scalar;
args->sub.validation_buf_ptrs[2][j] = static_cast<T>(j % kUInt8Mod) + static_cast<T>(j % kUInt8Mod);
args->sub.validation_buf_ptrs[3][j] = static_cast<T>(j % kUInt8Mod) + static_cast<T>(j % kUInt8Mod) * scalar;
}
return 0;
}
/**
* @brief Prepares the buffer and stream for benchmarking.
*
* @details This function prepares the necessary buffer and stream for benchmarking based on the
* provided arguments. It initializes and configures the buffer and stream as required.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for preparing the buffer and stream.
*
* @return int The status code indicating success or failure of the preparation.
*/
template <typename T> int GpuStream::PrepareBufAndStream(std::unique_ptr<BenchArgs<T>> &args) {
cudaError_t cuda_err = cudaSuccess;
if (args->check_data) {
// Generate data to copy
args->sub.data_buf = static_cast<T *>(numa_alloc_onnode(args->size, args->numa_id));
for (uint64_t j = 0; j < args->size / sizeof(T); j++) {
args->sub.data_buf[j] = static_cast<T>(j % kUInt8Mod);
}
// Allocate check buffer
args->sub.check_buf = static_cast<T *>(numa_alloc_onnode(args->size, args->numa_id));
}
// Allocate buffers
args->sub.gpu_buf_ptrs.resize(kNumBuffers);
// Set to buffer device for GPU buffer
if (SetGpu(args->gpu_id)) {
return -1;
}
// Allocate buffers
for (auto &buf_ptr : args->sub.gpu_buf_ptrs) {
T *raw_ptr = nullptr;
cuda_err = GpuMallocDataBuf(&raw_ptr, args->size);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaMalloc error: " << cuda_err << std::endl;
return -1;
}
buf_ptr.reset(raw_ptr); // Transfer ownership to the smart pointer
}
// Initialize source buffer
if (args->check_data) {
cuda_err = cudaMemcpy(args->sub.gpu_buf_ptrs[0].get(), args->sub.data_buf, args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaMemcpy error: " << cuda_err << std::endl;
return -1;
}
cuda_err = cudaMemcpy(args->sub.gpu_buf_ptrs[1].get(), args->sub.data_buf, args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaMemcpy error: " << cuda_err << std::endl;
return -1;
}
PrepareValidationBuf<T>(args);
}
cuda_err = cudaStreamCreateWithFlags(&(args->sub.stream), cudaStreamNonBlocking);
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareBufAndStream::cudaStreamCreate error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Prepares CUDA events for benchmarking.
*
* @details This function creates the necessary CUDA events for benchmarking based on the
* provided arguments. It initializes and configures the events as required.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for preparing the CUDA events.
*
* @return int The status code indicating success or failure of the preparation.
*/
template <typename T> int GpuStream::PrepareEvent(std::unique_ptr<BenchArgs<T>> &args) {
cudaError_t cuda_err = cudaSuccess;
if (SetGpu(args->gpu_id)) {
return -1;
}
cuda_err = cudaEventCreate(&(args->sub.start_event));
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareEvent::cudaEventCreate error: " << cuda_err << std::endl;
return -1;
}
cuda_err = cudaEventCreate(&(args->sub.end_event));
if (cuda_err != cudaSuccess) {
std::cerr << "PrepareEvent::cudaEventCreate error: " << cuda_err << std::endl;
return -1;
}
return 0;
}
/**
* @brief Validates the result of data transfer.
*
* @details This function checks the buffer to validate the result of a data transfer operation
* based on the provided arguments. It ensures that the data transfer was successful and that
* the buffer contains the expected data.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for validating the buffer.
*
* @return int The status code indicating success or failure of the validation.
*/
template <typename T> int GpuStream::CheckBuf(std::unique_ptr<BenchArgs<T>> &args, int kernel_idx) {
cudaError_t cuda_err = cudaSuccess;
int memcmp_result = 0;
if (SetGpu(args->gpu_id)) {
return -1;
}
// Copy buffer output from stream kernel to local buffer
cuda_err = cudaMemcpy(args->sub.check_buf, args->sub.gpu_buf_ptrs[2].get(), args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
std::cerr << "CheckBuf::cudaMemcpy error: " << cuda_err << std::endl;
return -1;
}
// Validate result by comparing the data buffer and check buffer
memcmp_result = memcmp(args->sub.validation_buf_ptrs[kernel_idx].data(), args->sub.check_buf, args->size);
if (memcmp_result) {
std::cerr << "CheckBuf::Memory check failed for kernel index " << kernel_idx << std::endl;
return -1;
}
return 0;
}
/**
* @brief Destroys the buffer and stream used for benchmarking.
*
* @details This function cleans up and releases the resources associated with the buffer and stream
* used for benchmarking based on the provided arguments. It ensures that all allocated buffers and streams
* resources are properly freed.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments
* for destroying the buffer and stream.
*
* @return int The status code indicating success or failure of the destruction process.
*/
template <typename T> int GpuStream::DestroyBufAndStream(std::unique_ptr<BenchArgs<T>> &args) {
int ret = 0;
cudaError_t cuda_err = cudaSuccess;
// Destroy original data buffer and check buffer
if (args->sub.data_buf != nullptr) {
numa_free(args->sub.data_buf, args->size);
}
if (args->sub.check_buf != nullptr) {
numa_free(args->sub.check_buf, args->size);
}
// Set to buffer device for GPU buffer
if (SetGpu(args->gpu_id)) {
return -1;
}
cuda_err = cudaStreamDestroy(args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "DestroyBufAndStream::cudaStreamDestroy error: " << cuda_err << std::endl;
return -1;
}
return ret;
}
/**
* @brief Runs the STREAM benchmark.
*
* @details This function runs the STREAM benchmark using the specified kernel and number of threads per block.
* It prepares the necessary arguments and configurations for the benchmark execution.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments for the
benchmark.
* @param[in] kernel The kernel function to be used for the benchmark.
* @param[in] num_threads_per_block The number of threads per block to be used in the kernel execution.
*
* @return int The status code indicating success or failure of the benchmark execution.
*/
template <typename T>
int GpuStream::RunStreamKernel(std::unique_ptr<BenchArgs<T>> &args, Kernel kernel, int num_threads_per_block) {
cudaError_t cuda_err = cudaSuccess;
uint64_t num_thread_blocks;
int size_factor = 2;
// Validate data size
uint64_t num_elements_in_thread_block = kNumLoopUnroll * num_threads_per_block;
uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(T);
if (args->size % num_bytes_in_thread_block) {
std::cerr << "RunStreamKernel: Data size should be a multiple of " << num_bytes_in_thread_block << std::endl;
return -1;
}
num_thread_blocks = args->size / num_bytes_in_thread_block;
args->sub.times_in_ms.resize(static_cast<int>(Kernel::kCount));
if (SetGpu(args->gpu_id)) {
return -1;
}
// Launch jobs and collect running time
for (int i = 0; i < args->num_loops + args->num_warm_up; i++) {
// Record start event once warm up iterations are done
if (i == args->num_warm_up) {
cuda_err = cudaEventRecord(args->sub.start_event, args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaEventRecord error: " << cuda_err << std::endl;
return -1;
}
}
switch (kernel) {
case Kernel::kCopy:
CopyKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()));
args->sub.kernel_name = "COPY";
break;
case Kernel::kScale:
ScaleKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()), scalar);
args->sub.kernel_name = "SCALE";
break;
case Kernel::kAdd:
AddKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[1].get()));
size_factor = 3;
args->sub.kernel_name = "ADD";
break;
case Kernel::kTriad:
TriadKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()),
reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[1].get()), scalar);
size_factor = 3;
args->sub.kernel_name = "TRIAD";
break;
default:
std::cerr << "RunStreamKernel::Invalid kernel" << std::endl;
return -1;
break;
}
// Record end event at the end of iterations
if (i + 1 == args->num_loops + args->num_warm_up) {
cuda_err = cudaEventRecord(args->sub.end_event, args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaEventRecord error: " << cuda_err << std::endl;
return -1;
}
}
}
// Wait for the stream to finish
cuda_err = cudaStreamSynchronize(args->sub.stream);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaStreamSynchronize error: " << cuda_err << std::endl;
return -1;
}
// Calculate time
float time_in_ms = 0;
cuda_err = cudaEventElapsedTime(&time_in_ms, args->sub.start_event, args->sub.end_event);
if (cuda_err != cudaSuccess) {
std::cerr << "RunStreamKernel::cudaEventElapsedTime error: " << cuda_err << std::endl;
return -1;
}
args->sub.times_in_ms[static_cast<int>(kernel)].push_back(time_in_ms / size_factor);
return 0;
}
/**
* @brief Runs the benchmark for various kernels and processes the results for a BenchArgs config.
*
* @details This function prepares the necessary buffers and streams, runs the benchmark for each kernel
* with different thread per block configurations, checks the results, and processes the benchmark results.
* It also handles cleanup of resources in case of errors.
*
* @param[in,out] args A unique pointer to a BenchArgs structure containing the necessary arguments for the
benchmark.
*
* @return int The status code indicating success or failure of the benchmark execution.
* */
template <typename T> int GpuStream::RunStream(std::unique_ptr<BenchArgs<T>> &args, const std::string &data_type) {
int ret = 0;
ret = PrepareBufAndStream<T>(args);
float peak_bw =
args->gpu_device_prop.memoryClockRate / 1000.0 * args->gpu_device_prop.memoryBusWidth / 8 * 2 / 1000.0;
if (ret != 0) {
DestroyBufAndStream(args);
return ret;
}
ret = PrepareEvent(args);
if (ret != 0) {
DestroyBufAndStream(args);
return ret;
}
// benchmark over the kThreadsPerBlock array
for (const int num_threads_in_block : kThreadsPerBlock) {
// run the stream benchmark over the stream kernels
for (int i = 0; i < static_cast<int>(Kernel::kCount); ++i) {
Kernel kernel = static_cast<Kernel>(i);
ret = RunStreamKernel<T>(args, kernel, num_threads_in_block);
if (ret == 0 && args->check_data) {
// Compare buffer based on the kernel
ret = CheckBuf(args, i);
}
}
}
// output formatted results to stdout
// Tags are of format:
// STREAM_<Kernelname>_datatype_gpu_<gpu_id>_buffer_<buffer_size>_block_<block_size>
for (int i = 0; i < args->sub.times_in_ms.size(); i++) {
std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_gpu_" + std::to_string(args->gpu_id) +
"_buffer_" + std::to_string(args->size);
for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) {
// Calculate and display bandwidth
double bw = args->size * args->num_loops / args->sub.times_in_ms[i][j] / 1e6;
std::cout << tag << "_block_" << kThreadsPerBlock[j] << "\t" << bw << "\t" << std::fixed
<< std::setprecision(2) << bw / peak_bw * 100 << std::endl;
}
}
// cleanup buffer and streams for the curr arg
Destroy(args);
return ret;
}
/**
* @brief Runs the Stream benchmark.
*
* @details This function processes the input args, validates and composes the BenchArgs structure for the available
* GPUs, and runs the benchmark.
*
* @return int The status code indicating success or failure of the benchmark execution.
* */
int GpuStream::Run() {
int ret = 0;
int gpu_count = 0;
// Check NUMA availability
if (numa_available() < 0) {
std::cerr << "Run::numa_available error" << std::endl;
return -1;
}
// Get number of GPUs
ret = GetGpuCount(&gpu_count);
if (ret != 0) {
return ret;
}
// find all GPUs and compose the Benchmarking data structure
for (int j = 0; j < gpu_count; j++) {
auto args = std::make_unique<BenchArgs<double>>();
args->numa_id = 0;
args->gpu_id = j;
cudaGetDeviceProperties(&args->gpu_device_prop, j);
args->num_warm_up = opts_.num_warm_up;
args->num_loops = opts_.num_loops;
args->size = opts_.size;
args->check_data = opts_.check_data;
// add data to vector
bench_args_.emplace_back(std::move(args));
}
bool has_error = false;
// Run the benchmark for all the configured data
for (auto &variant_args : bench_args_) {
std::visit(
[&](auto &curr_args) {
PrintCudaDeviceInfo(curr_args->gpu_id, curr_args->gpu_device_prop);
// Set the NUMA node
ret = numa_run_on_node(curr_args->numa_id);
if (ret != 0) {
std::cerr << "Run::numa_run_on_node error: " << errno << std::endl;
has_error = true;
return;
}
// Run the stream benchmark for the configured data
if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<float>>) {
ret = RunStream<float>(curr_args, "float");
} else if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<double>>) {
ret = RunStream<double>(curr_args, "double");
} else {
std::cerr << "Run::Unknown type error" << std::endl;
has_error = true;
return;
}
if (ret != 0) {
std::cerr << "Run::RunStream error: " << errno << std::endl;
has_error = true;
}
},
variant_args);
}
if (has_error) {
return -1;
}
return ret;
}
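`RunStream` above converts each kernel's elapsed time into bandwidth as `args->size * args->num_loops / time_in_ms / 1e6` GB/s, where `RunStreamKernel` has already divided the elapsed time by a size factor (2 for COPY/SCALE, 3 for ADD/TRIAD) to account for the number of buffers each kernel touches. A quick Python check of that arithmetic with illustrative numbers:

```python
def stream_bandwidth_gbps(size_bytes, num_loops, elapsed_ms, size_factor):
    """Mirror RunStreamKernel/RunStream: divide elapsed time by the number of
    buffers the kernel touches, then bandwidth = bytes * loops / ms / 1e6."""
    time_in_ms = elapsed_ms / size_factor
    return size_bytes * num_loops / time_in_ms / 1e6

# 4 GiB buffer, 100 timed COPY loops measured at 600 ms total
bw = stream_bandwidth_gbps(4 * 1024**3, 100, 600, 2)
```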
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <getopt.h>
#include <iostream>
#include <memory>
#include <variant>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <numa.h>
#include "gpu_stream_kernels.hpp"
#include "gpu_stream_utils.hpp"
#define NON_HIP (!defined(__HIP_PLATFORM_HCC__) && !defined(__HCC__) && !defined(__HIPCC__))
using namespace stream_config;
class GpuStream {
public:
GpuStream() = delete; // Delete default constructor
GpuStream(Opts &) noexcept; // Constructor
~GpuStream() noexcept = default; // Destructor
GpuStream(const GpuStream &) = delete;
GpuStream &operator=(const GpuStream &) = delete;
GpuStream(GpuStream &&) noexcept = default;
GpuStream &operator=(GpuStream &&) noexcept = default;
int Run();
private:
using BenchArgsVariant = std::variant<std::unique_ptr<BenchArgs<double>>>;
std::vector<BenchArgsVariant> bench_args_;
Opts opts_;
// Memory management functions
template <typename T> cudaError_t GpuMallocDataBuf(T **, uint64_t);
template <typename T> int PrepareValidationBuf(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int CheckBuf(std::unique_ptr<BenchArgs<T>> &, int);
template <typename T> int PrepareEvent(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int PrepareBufAndStream(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int DestroyEvent(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int DestroyBufAndStream(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int Destroy(std::unique_ptr<BenchArgs<T>> &);
// Benchmark functions
template <typename T> int RunStreamKernel(std::unique_ptr<BenchArgs<T>> &, Kernel, int);
template <typename T> int RunStream(std::unique_ptr<BenchArgs<T>> &, const std::string &data_type);
// Helper functions
int GetGpuCount(int *);
int SetGpu(int gpu_id);
void PrintCudaDeviceInfo(int, const cudaDeviceProp &);
};
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream_kernels.hpp"
/**
* @brief Fetches a value from source memory and writes it to a register.
*
* @details This inline device function fetches a value from the specified source memory
* location and writes it to the provided register. The implementation references the following:
* 1) NCCL:
* https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483
* 2) RCCL:
* https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268
*
* @tparam T The type of the value to fetch.
* @param[out] v The register to write the fetched value to.
* @param[in] p The source memory location to fetch the value from.
*/
template <typename T> inline __device__ void Fetch(T &v, const T *p) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
v = *p;
#else
if constexpr (std::is_same<T, float>::value) {
asm volatile("ld.volatile.global.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory");
} else if constexpr (std::is_same<T, double>::value) {
asm volatile("ld.volatile.global.f64 %0, [%1];" : "=d"(v) : "l"(p) : "memory");
}
#endif
}
/**
* @brief Stores a value from register and writes it to target memory.
*
* @details This inline device function stores a value from the provided register
* and writes it to the specified target memory location. The implementation references the following:
* 1) NCCL:
* https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486
* 2) RCCL:
* https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276
*
* @tparam T The type of the value to store.
* @param[out] p The target memory location to write the value to.
* @param[in] v The register containing the value to be stored.
*/
template <typename T> inline __device__ void Store(T *p, const T &v) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
*p = v;
#else
if constexpr (std::is_same<T, float>::value) {
asm volatile("st.volatile.global.f32 [%0], %1;" ::"l"(p), "f"(v) : "memory");
} else if constexpr (std::is_same<T, double>::value) {
asm volatile("st.volatile.global.f64 [%0], %1;" ::"l"(p), "d"(v) : "memory");
}
#endif
}
/**
* @brief Performs COPY, a simple copy operation from source to target. b = a
*
* @details This CUDA kernel performs a simple copy operation, copying data from the source array
* to the target array. This is used to measure transfer rates without any arithmetic operations.
*
* @param[out] tgt The target array where data will be copied to.
* @param[in] src The source array from which data will be copied.
*/
__global__ void CopyKernel(double *tgt, const double *src) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x; // 64-bit index math to avoid 32-bit overflow on large buffers
double val[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Fetch(val[i], src + index + i * blockDim.x);
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Store(tgt + index + i * blockDim.x, val[i]);
}
/**
* @brief Performs SCALE, a scaling operation on the source data. b = x * a
*
* @details This CUDA kernel performs a simple arithmetic operation by scaling the source data
* with a given scalar value and storing the result in the target array.
*
* @param[out] tgt The target array where the scaled data will be stored.
* @param[in] src The source array containing the data to be scaled.
* @param[in] scalar The scalar value used to scale the source data.
*/
__global__ void ScaleKernel(double *tgt, const double *src, const long scalar) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Fetch(val[i], src + index + i * blockDim.x);
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val[i] *= scalar;
Store(tgt + index + i * blockDim.x, val[i]);
}
}
/**
* @brief Performs ADD, an addition operation on two source arrays. c = a + b
*
* @details This CUDA kernel adds corresponding elements from two source arrays and stores the result
* in the target array. This operation is used to measure transfer rates with a simple arithmetic addition.
*
* @param[out] tgt The target array where the result of the addition will be stored.
* @param[in] src_a The first source array containing the first set of operands.
* @param[in] src_b The second source array containing the second set of operands.
*/
__global__ void AddKernel(double *tgt, const double *src_a, const double *src_b) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val_a[kNumLoopUnrollAlias];
double val_b[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
Fetch(val_a[i], src_a + index + i * blockDim.x);
Fetch(val_b[i], src_b + index + i * blockDim.x);
}
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val_a[i] += val_b[i];
Store(tgt + index + i * blockDim.x, val_a[i]);
}
}
/**
* @brief Performs TRIAD, fused multiply/add operations on source arrays. a = b + x * c
*
* @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from
* the second source array with a scalar value, adding the result to corresponding elements from
* the first source array, and storing the result in the target array.
*
* @param[out] tgt The target array where the result of the fused multiply/add operation will be stored.
* @param[in] src_a The first source array containing the first set of operands.
* @param[in] src_b The second source array containing the second set of operands to be multiplied by the scalar.
* @param[in] scalar The scalar value used in the multiply/add operation.
*/
__global__ void TriadKernel(double *tgt, const double *src_a, const double *src_b, const long scalar) {
uint64_t index = static_cast<uint64_t>(blockIdx.x) * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val_a[kNumLoopUnrollAlias];
double val_b[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
Fetch(val_a[i], src_a + index + i * blockDim.x);
Fetch(val_b[i], src_b + index + i * blockDim.x);
}
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val_b[i] += (val_a[i] * scalar);
Store(tgt + index + i * blockDim.x, val_b[i]);
}
}
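Copy and scale read one buffer and write another (two transfers per element), while add and triad read two buffers and write one (three transfers), which matches the `kBufferBwMultipliers` constant `{2, 2, 3, 3}` declared in `gpu_stream_utils.hpp`. As a hedged sketch (the exact formula the harness uses is not shown in this diff), the reported bandwidth and efficiency columns could be derived from a timed run as follows:

```python
# Illustrative post-processing only; the benchmark's exact computation
# is not part of this diff chunk.
KERNEL_MULTIPLIERS = {'COPY': 2, 'SCALE': 2, 'ADD': 3, 'TRIAD': 3}

def stream_bandwidth_gbps(kernel, buffer_bytes, time_ms):
    """Bandwidth in GB/s, assuming bytes moved = multiplier * buffer size."""
    bytes_moved = KERNEL_MULTIPLIERS[kernel] * buffer_bytes
    return bytes_moved / (time_ms * 1e-3) / 1e9

def efficiency_pct(bw_gbps, peak_gbps):
    """Second column of each result line: measured bandwidth as a percentage of peak."""
    return bw_gbps / peak_gbps * 100.0
```

For example, the first result line in the sample log (6711.67 GB/s against the 8192 GB/s peak printed in the device banner) corresponds to an efficiency of about 81.93%, matching the second column.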
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <cuda.h>
#include <cuda_runtime.h>
#include "gpu_stream_utils.hpp"
constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll;
// Function declarations
template <typename T> inline __device__ void Fetch(T &v, const T *p);
template <typename T> inline __device__ void Store(T *p, const T &v);
__global__ void CopyKernel(double *, const double *);
__global__ void ScaleKernel(double *, const double *, const long);
__global__ void AddKernel(double *, const double *, const double *);
__global__ void TriadKernel(double *, const double *, const double *, const long);
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream.hpp"
/**
* @brief Main function and entry of gpu stream benchmark
* @details
* params list:
* num_warm_up: warm up count
* num_loops: num of runs for timing
* size: number of bytes to set up for the test
* @param argc argument count
* @param argv argument vector
* @return int
*/
int main(int argc, char **argv) {
int ret = 0;
stream_config::Opts opts;
// parse arguments from cmd
ret = stream_config::ParseOpts(argc, argv, &opts);
if (ret != 0) {
return ret;
}
// run the stream benchmark
GpuStream gpu_stream(opts);
ret = gpu_stream.Run();
if (ret != 0) {
return ret;
}
return 0;
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream_utils.hpp"
namespace stream_config {
/**
* @brief Converts a kernel index to its corresponding string representation.
*
* @details This function takes an integer representing a kernel index and returns the corresponding
* string representation of the kernel. The mapping between kernel indices and their string representations
* is defined within the function.
*
* @param[in] kernel_idx The index of the kernel to be converted to a string.
*
* @return std::string The string representation of the kernel.
*/
std::string KernelToString(int kernel_idx) {
switch (kernel_idx) {
case static_cast<int>(Kernel::kCopy):
return "COPY";
case static_cast<int>(Kernel::kScale):
return "SCALE";
case static_cast<int>(Kernel::kAdd):
return "ADD";
case static_cast<int>(Kernel::kTriad):
return "TRIAD";
default:
return "UNKNOWN";
}
}
/**
* @brief Print the usage of this program.
*
* @details This function prints the usage of this program.
*
* @return void.
* */
void PrintUsage() {
std::cout << "Usage: gpu_stream "
<< "--size <size in bytes> "
<< "--num_warm_up <num_warm_up> "
<< "--num_loops <num_loops> "
<< "[--check_data]" << std::endl;
}
/**
* @brief Print the user provided inputs info.
*
* @details This function prints the parsed user-provided inputs of this program.
*
* @param[in] opts The Opts struct that stores the parsed values.
*
* @return void
* */
void PrintInputInfo(Opts &opts) {
std::cout << "STREAM Benchmark" << std::endl;
std::cout << "Buffer size(bytes): " << opts.size << std::endl;
std::cout << "Number of warm up runs: " << opts.num_warm_up << std::endl;
std::cout << "Number of loops: " << opts.num_loops << std::endl;
std::cout << "Check data: " << (opts.check_data ? "Yes" : "No") << std::endl;
}
/**
* @brief Parse the command line options.
*
* @details This function parses the command line options and stores the values in the Opts struct.
*
* @param[in] argc The number of command line options.
* @param[in] argv The command line options.
* @param[out] opts The Opts struct to store the parsed values.
*
* @return int The status code.
* */
int ParseOpts(int argc, char **argv, Opts *opts) {
enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData };
const struct option options[] = {{"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
{"num_warm_up", required_argument, nullptr, static_cast<int>(OptIdx::kNumWarmUp)},
{"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumLoops)},
{"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)}};
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true; // --size is optional; Opts already provides a default buffer size
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
} else if (getopt_ret == '?') {
parse_err = true;
break;
}
switch (opt_idx) {
case static_cast<int>(OptIdx::kSize):
if (1 != sscanf(optarg, "%lu", &(opts->size))) {
std::cerr << "Invalid size: " << optarg << std::endl;
parse_err = true;
} else {
size_specified = true;
}
break;
case static_cast<int>(OptIdx::kNumWarmUp):
if (1 != sscanf(optarg, "%lu", &(opts->num_warm_up))) {
std::cerr << "Invalid num_warm_up: " << optarg << std::endl;
parse_err = true;
} else {
num_warm_up_specified = true;
}
break;
case static_cast<int>(OptIdx::kNumLoops):
if (1 != sscanf(optarg, "%lu", &(opts->num_loops))) {
std::cerr << "Invalid num_loops: " << optarg << std::endl;
parse_err = true;
} else {
num_loops_specified = true;
}
break;
case static_cast<int>(OptIdx::kEnableCheckData):
opts->check_data = true;
break;
default:
parse_err = true;
}
if (parse_err) {
break;
}
}
if (parse_err) {
PrintUsage();
return -1;
}
return 0;
}
} // namespace stream_config
unsigned long long getCurrentTimestampInMicroseconds() {
// Get the current time point
auto now = std::chrono::system_clock::now();
// Convert to time since epoch
auto duration = now.time_since_epoch();
// Convert to microseconds
auto microseconds = std::chrono::duration_cast<std::chrono::microseconds>(duration).count();
return static_cast<unsigned long long>(microseconds);
}
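The host-side timer above converts the system clock to microseconds since the epoch via `std::chrono`. A purely illustrative Python equivalent of the same conversion:

```python
import time

def current_timestamp_us():
    """Python analog of getCurrentTimestampInMicroseconds():
    microseconds since the Unix epoch, as an integer."""
    return int(time.time() * 1e6)

# Successive calls are non-decreasing wall-clock timestamps.
t1 = current_timestamp_us()
t2 = current_timestamp_us()
```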
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <array>
#include <chrono>
#include <getopt.h>
#include <iomanip>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <numa.h>
#include <nvml.h>
// Custom deleter for GPU buffers
struct GpuBufferDeleter {
template <typename T> void operator()(T *ptr) const {
if (ptr) {
cudaFree(ptr);
}
}
};
unsigned long long getCurrentTimestampInMicroseconds();
namespace stream_config {
constexpr std::array<int, 4> kThreadsPerBlock = {128, 256, 512, 1024}; // Threads per block
constexpr uint64_t kDefaultBufferSizeInBytes = 4294967296; // Default buffer size 4GB
constexpr int kNumLoopUnroll = 2; // Unroll depth in SM copy kernel
constexpr int kNumBuffers = 3; // Number of buffers for triad, add kernel
constexpr int kNumValidationBuffers = 4; // Number of validation buffers, one for each kernel
constexpr int kUInt8Mod = 256;                                    // Modulo for uint8_t data type
constexpr std::array<int, 4> kBufferBwMultipliers = {2, 2, 3, 3}; // Bandwidth multipliers for copy, scale, add, triad kernels
constexpr long scalar = 11; // Scalar for scale, triad kernel
// Enum for different kernels
enum class Kernel {
kCopy,
kScale,
kAdd,
kTriad,
kCount // Add a count to keep track of the number of enums. Helpful for iterating over enums.
};
// Arguments for each sub benchmark run.
template <typename T> struct SubBenchArgs {
// Unique pointer for GPU buffers
using GpuBufferUniquePtr = std::unique_ptr<T, GpuBufferDeleter>;
// Original data buffer.
T *data_buf = nullptr;
// Buffer to validate the correctness of data transfer.
T *check_buf = nullptr;
// GPU pointer of the data buffer on source devices.
std::vector<GpuBufferUniquePtr> gpu_buf_ptrs;
// Pointers to the validation buffers for each kernel. Order matches the Kernel enum.
std::vector<std::vector<T>> validation_buf_ptrs;
// CUDA stream to be used.
cudaStream_t stream;
// CUDA event to record start time.
cudaEvent_t start_event;
// CUDA event to record end time.
cudaEvent_t end_event;
// Recorded kernel execution times in milliseconds.
std::vector<std::vector<float>> times_in_ms;
// Stream Kernel name.
std::string kernel_name;
};
// Arguments for each benchmark run.
template <typename T> struct BenchArgs {
// NUMA node under which the benchmark is done.
uint64_t numa_id = 0;
// GPU ID for device.
int gpu_id = 0;
// GPU device info
cudaDeviceProp gpu_device_prop;
// Data buffer size used.
uint64_t size = kDefaultBufferSizeInBytes;
// Number of warm up rounds to run.
uint64_t num_warm_up = 0;
// Number of loops to run.
uint64_t num_loops = 1;
// Whether check data after copy.
bool check_data = false;
// Arguments for the sub-benchmark.
SubBenchArgs<T> sub;
};
// Options accepted by this program.
struct Opts {
// Data buffer size for copy benchmark.
uint64_t size = kDefaultBufferSizeInBytes;
// Number of warm up rounds to run.
uint64_t num_warm_up = 0;
// Number of loops to run.
uint64_t num_loops = 0;
// Whether check data after copy.
bool check_data = false;
};
std::string KernelToString(int); // Function to convert enum to string
int ParseOpts(int, char **, Opts *);
void PrintInputInfo(Opts &);
void PrintUsage();
} // namespace stream_config
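Each kernel thread consumes `kNumLoopUnroll` elements strided by `blockDim.x`, so a block of `b` threads covers `b * kNumLoopUnroll` contiguous elements. A small host-side Python simulation of the kernels' index math (illustrative only) shows that every element is touched exactly once when the grid is sized as `elements / (blockDim * kNumLoopUnroll)`:

```python
NUM_LOOP_UNROLL = 2  # mirrors stream_config::kNumLoopUnroll

def touched_indices(num_blocks, block_dim):
    """Replicate the kernel's addressing: base = blockIdx * blockDim * unroll + threadIdx,
    then each thread touches base + i * blockDim for i in range(unroll)."""
    out = []
    for block in range(num_blocks):
        for thread in range(block_dim):
            base = block * block_dim * NUM_LOOP_UNROLL + thread
            for i in range(NUM_LOOP_UNROLL):
                out.append(base + i * block_dim)
    return out

# With 4 blocks of 128 threads, all 4 * 128 * 2 = 1024 elements are covered exactly once.
idx = touched_indices(4, 128)
```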
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""Tests for gpu_stream benchmark."""
import numbers
import unittest
from tests.helper import decorator
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks import BenchmarkRegistry, BenchmarkType, ReturnCode, Platform
class GpuStreamBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
"""Test class for gpu_stream benchmark."""
@classmethod
def setUpClass(cls):
"""Hook method for setting up class fixture before running tests in the class."""
super().setUpClass()
cls.createMockEnvs(cls)
cls.createMockFiles(cls, ['bin/gpu_stream'])
def _test_gpu_stream_command_generation(self, platform):
"""Test gpu-stream benchmark command generation."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
num_warm_up = 5
num_loops = 10
size = 25769803776
parameters = '--num_warm_up %d --num_loops %d --size %d ' \
'--check_data' % \
(num_warm_up, num_loops, size)
benchmark = benchmark_class(benchmark_name, parameters=parameters)
# Check basic information
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == benchmark_name)
assert (benchmark.type == BenchmarkType.MICRO)
# Check parameters specified in BenchmarkContext.
assert (benchmark._args.size == size)
assert (benchmark._args.num_warm_up == num_warm_up)
assert (benchmark._args.num_loops == num_loops)
assert (benchmark._args.check_data)
# Check command
assert (1 == len(benchmark._commands))
assert (benchmark._commands[0].startswith(benchmark._GpuStreamBenchmark__bin_path))
assert ('--size %d' % size in benchmark._commands[0])
assert ('--num_warm_up %d' % num_warm_up in benchmark._commands[0])
assert ('--num_loops %d' % num_loops in benchmark._commands[0])
assert ('--check_data' in benchmark._commands[0])
@decorator.cuda_test
def test_gpu_stream_command_generation_cuda(self):
"""Test gpu-stream benchmark command generation, CUDA case."""
self._test_gpu_stream_command_generation(Platform.CUDA)
@decorator.rocm_test
def test_gpu_stream_command_generation_rocm(self):
"""Test gpu-stream benchmark command generation, ROCm case."""
self._test_gpu_stream_command_generation(Platform.ROCM)
@decorator.load_data('tests/data/gpu_stream.log')
def _test_gpu_stream_result_parsing(self, platform, test_raw_output):
"""Test gpu-stream benchmark result parsing."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
benchmark = benchmark_class(benchmark_name, parameters='')
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == 'gpu-stream')
assert (benchmark.type == BenchmarkType.MICRO)
# Positive case - valid raw output.
assert (benchmark._process_raw_result(0, test_raw_output))
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (1 == len(benchmark.raw_data))
test_raw_output_dict = {
x.split()[0]: [float(x.split()[1]), float(x.split()[2])]
for x in test_raw_output.strip().splitlines() if x.startswith('STREAM_')
}
assert (len(test_raw_output_dict) * 2 + benchmark.default_metric_count == len(benchmark.result))
for output_key in benchmark.result:
if output_key == 'return_code':
assert (benchmark.result[output_key] == [0])
else:
assert (len(benchmark.result[output_key]) == 1)
assert (isinstance(benchmark.result[output_key][0], numbers.Number))
if output_key.endswith('_bw'):
assert (output_key.strip('_bw') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0])
else:
assert (output_key.strip('_ratio') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0])
# Negative case - invalid raw output.
assert (benchmark._process_raw_result(1, 'Invalid raw output') is False)
assert (benchmark.return_code == ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
@decorator.cuda_test
def test_gpu_stream_result_parsing_cuda(self):
"""Test gpu-stream benchmark result parsing, CUDA case."""
self._test_gpu_stream_result_parsing(Platform.CUDA)
@decorator.rocm_test
def test_gpu_stream_result_parsing_rocm(self):
"""Test gpu-stream benchmark result parsing, ROCm case."""
self._test_gpu_stream_result_parsing(Platform.ROCM)
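The parsing test above treats each `STREAM_...` line of the raw output as a whitespace-separated triple of metric name, bandwidth (GB/s), and efficiency ratio (%). A minimal standalone sketch of that per-line parsing (a hypothetical helper, not part of the benchmark code):

```python
def parse_stream_line(line):
    """Split one 'STREAM_...' result line into (metric, bandwidth_gbps, ratio_pct)."""
    name, bw, ratio = line.split()
    return name, float(bw), float(ratio)

# Example using the first result line of the sample log below.
metric, bw, ratio = parse_stream_line(
    'STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93'
)
```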
STREAM Benchmark
Buffer size(bytes): 4294967296
Number of warm up runs: 10
Number of loops: 40
Check data: No
Device 0: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93
STREAM_COPY_double_gpu_0_buffer_4294967296_block_256 6549.50 79.95
STREAM_COPY_double_gpu_0_buffer_4294967296_block_512 6195.43 75.63
STREAM_COPY_double_gpu_0_buffer_4294967296_block_1024 5721.52 69.84
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_128 6680.42 81.55
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_256 6515.51 79.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_512 6106.69 74.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_1024 5626.68 68.69
STREAM_ADD_double_gpu_0_buffer_4294967296_block_128 7379.25 90.08
STREAM_ADD_double_gpu_0_buffer_4294967296_block_256 7407.27 90.42
STREAM_ADD_double_gpu_0_buffer_4294967296_block_512 7309.59 89.23
STREAM_ADD_double_gpu_0_buffer_4294967296_block_1024 6788.64 82.87
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_128 7378.19 90.07
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_256 7414.01 90.50
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_512 7295.50 89.06
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024 6730.42 82.16
Device 1: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000.00MHz x 8192-bit = 8192.00 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_1_buffer_4294967296_block_128 6708.74 81.89
STREAM_COPY_double_gpu_1_buffer_4294967296_block_256 6549.47 79.95
STREAM_COPY_double_gpu_1_buffer_4294967296_block_512 6195.39 75.63
STREAM_COPY_double_gpu_1_buffer_4294967296_block_1024 5725.07 69.89
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_128 6678.56 81.53
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_256 6514.05 79.52
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_512 6103.80 74.51
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_1024 5630.41 68.73
STREAM_ADD_double_gpu_1_buffer_4294967296_block_128 7377.74 90.06
STREAM_ADD_double_gpu_1_buffer_4294967296_block_256 7410.97 90.47
STREAM_ADD_double_gpu_1_buffer_4294967296_block_512 7310.80 89.24
STREAM_ADD_double_gpu_1_buffer_4294967296_block_1024 6789.91 82.88
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_128 7379.03 90.08
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_256 7414.04 90.50
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_512 7298.26 89.09
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_1024 6732.15 82.18