Unverified commit 4eddd50a authored by WenqingLan1, committed by GitHub

Benchmarks - Add GPU Stream Micro Benchmark (#697)

Added the GPU Stream benchmark, which measures GPU memory bandwidth and
efficiency for the double data type through four memory operations:
copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction and
metric descriptions.
- added unit tests for `gpu-stream`. Example output is in
`superbenchmark/tests/data/gpu_stream.log`.
parent 991c0051
@@ -262,6 +262,25 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU
| gpu\_all\_to\_gpu[0-9]+\_write\_by\_sm\_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
| gpu\_all\_to\_gpu\_all\_write\_by\_sm\_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |
### `gpu-stream`
#### Introduction
Measure the memory bandwidth of the GPU using the STREAM benchmark. The benchmark runs four memory operations, copy, scale, add, and triad, on the double data type, as sketched below.
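For reference, the four operations compute the following. This is a minimal Python sketch of the operation semantics only; the benchmark itself implements them as unrolled CUDA/HIP kernels on `double` buffers.

```python
# Semantics of the four STREAM operations on arrays a, b, c and a scalar x.
def copy(a, b):          # COPY:  b = a
    for i in range(len(a)):
        b[i] = a[i]

def scale(a, b, x):      # SCALE: b = x * a
    for i in range(len(a)):
        b[i] = x * a[i]

def add(a, b, c):        # ADD:   c = a + b
    for i in range(len(a)):
        c[i] = a[i] + b[i]

def triad(a, b, c, x):   # TRIAD: a = b + x * c
    for i in range(len(a)):
        a[i] = b[i] + x * c[i]
```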
#### Metrics
| Metric Name | Unit | Description |
|------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | efficiency (%) | The fp64 memory bandwidth efficiency (measured bandwidth as a percentage of the theoretical peak) for the copy operation with the specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | efficiency (%) | The fp64 memory bandwidth efficiency (measured bandwidth as a percentage of the theoretical peak) for the scale operation with the specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | efficiency (%) | The fp64 memory bandwidth efficiency (measured bandwidth as a percentage of the theoretical peak) for the add operation with the specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | efficiency (%) | The fp64 memory bandwidth efficiency (measured bandwidth as a percentage of the theoretical peak) for the triad operation with the specified buffer size and block size. |
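The `_bw` and `_ratio` metrics are related through the device's theoretical peak bandwidth, which the benchmark prints in its device info line. A minimal sketch of that relationship, using values from the sample output in `tests/data/gpu_stream.log` (the exact peak figure comes from the device query, so treat this as illustrative):

```python
# Efficiency ratio = measured bandwidth / theoretical peak bandwidth * 100.
measured_bw_gbps = 6711.67  # STREAM_COPY_double_gpu_0_buffer_4294967296_block_128
peak_bw_gbps = 8192.0       # from '... = 8192 GB/s PEAK' in the device info line
print('%.2f' % (measured_bw_gbps / peak_bw_gbps * 100))  # 81.93, the reported _ratio
```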
### `ib-loopback`
#### Introduction
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Micro benchmark example for GPU Stream performance.
Commands to run:
python3 examples/benchmarks/gpu_stream.py
"""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger
if __name__ == '__main__':
context = BenchmarkRegistry.create_benchmark_context(
'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10'
)
    # For a ROCm environment, specify the benchmark name and platform as follows.
# context = BenchmarkRegistry.create_benchmark_context(
# 'gpu-stream', platform=Platform.ROCM, parameters='--num_warm_up 1 --num_loops 10'
# )
# To enable data checking, please add '--check_data'.
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
logger.info(
'benchmark: {}, return code: {}, result: {}'.format(
benchmark.name, benchmark.return_code, benchmark.result
)
)
@@ -23,6 +23,7 @@
from superbench.benchmarks.micro_benchmarks.cpu_hpl_performance import CpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpcnet_performance import GPCNetBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_copy_bw_performance import GpuCopyBwBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_stream import GpuStreamBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_burn_test import GpuBurnBenchmark
from superbench.benchmarks.micro_benchmarks.ib_loopback_performance import IBLoopbackBenchmark
from superbench.benchmarks.micro_benchmarks.ib_validation_performance import IBBenchmark
@@ -58,6 +59,7 @@
'GemmFlopsBenchmark',
'GpuBurnBenchmark',
'GpuCopyBwBenchmark',
'GpuStreamBenchmark',
'IBBenchmark',
'IBLoopbackBenchmark',
'KernelLaunch',
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the GPU Stream Performance benchmark."""
import os
from superbench.common.utils import logger
from superbench.benchmarks import BenchmarkRegistry, ReturnCode
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke
class GpuStreamBenchmark(MicroBenchmarkWithInvoke):
"""The GPU stream performance benchmark class."""
def __init__(self, name, parameters=''):
"""Constructor.
Args:
name (str): benchmark name.
parameters (str): benchmark parameters.
"""
super().__init__(name, parameters)
self._bin_name = 'gpu_stream'
def add_parser_arguments(self):
"""Add the specified arguments."""
super().add_parser_arguments()
self._parser.add_argument(
'--size',
type=int,
default=4096 * 1024**2,
required=False,
help='Size of data buffer in bytes.',
)
self._parser.add_argument(
'--num_warm_up',
type=int,
default=20,
required=False,
            help='Number of warm-up rounds.',
)
self._parser.add_argument(
'--num_loops',
type=int,
default=100,
required=False,
            help='Number of timed loops to run.',
)
self._parser.add_argument(
'--check_data',
action='store_true',
help='Enable data checking',
)
def _preprocess(self):
"""Preprocess/preparation operations before the benchmarking.
Return:
            True if _preprocess() succeeds.
"""
if not super()._preprocess():
return False
self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name)
        args = '--size %d --num_warm_up %d --num_loops %d' % (
            self._args.size, self._args.num_warm_up, self._args.num_loops
        )
if self._args.check_data:
args += ' --check_data'
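        # Resulting command with default arguments, e.g.:
        #   <bin_dir>/gpu_stream --size 4294967296 --num_warm_up 20 --num_loops 100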
self._commands = ['%s %s' % (self.__bin_path, args)]
return True
def _process_raw_result(self, cmd_idx, raw_output):
"""Function to parse raw results and save the summarized results.
self._result.add_raw_data() and self._result.add_result() need to be called to save the results.
Args:
cmd_idx (int): the index of command corresponding with the raw_output.
raw_output (str): raw output string of the micro-benchmark.
Return:
True if the raw output string is valid and result can be extracted.
"""
self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)
try:
output_lines = [x.strip() for x in raw_output.strip().splitlines()]
count = 0
for output_line in output_lines:
if output_line.startswith('STREAM_'):
count += 1
                    tag, bw_str, ratio_str = output_line.split()
                    self._result.add_result(tag + '_bw', float(bw_str))
                    self._result.add_result(tag + '_ratio', float(ratio_str))
if count == 0:
raise BaseException('No valid results found.')
except BaseException as e:
self._result.set_return_code(ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
logger.error(
'The result format is invalid - round: {}, benchmark: {}, raw output: {}, message: {}.'.format(
self._curr_run_index, self._name, raw_output, str(e)
)
)
return False
return True
BenchmarkRegistry.register_benchmark('gpu-stream', GpuStreamBenchmark)
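Each result line emitted by the `gpu_stream` binary carries three whitespace-separated fields: a metric tag, the measured bandwidth, and the efficiency ratio. A minimal sketch of how `_process_raw_result` above maps one such line (taken from `tests/data/gpu_stream.log`) into two result metrics:

```python
line = 'STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93'
tag, bw_str, ratio_str = line.split()
# add_result() is called twice per line, producing:
#   STREAM_COPY_double_gpu_0_buffer_4294967296_block_128_bw    -> 6711.67
#   STREAM_COPY_double_gpu_0_buffer_4294967296_block_128_ratio -> 81.93
print(tag + '_bw', float(bw_str))
print(tag + '_ratio', float(ratio_str))
```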
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
cmake_minimum_required(VERSION 3.18)
project(gpu_stream LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
find_package(CUDAToolkit QUIET)
# Source files
set(SOURCES
gpu_stream_test.cpp
gpu_stream_utils.cpp
gpu_stream.cu
gpu_stream_kernels.cu
)
# Cuda environment
if(CUDAToolkit_FOUND)
message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})
include(../cuda_common.cmake)
add_executable(gpu_stream ${SOURCES})
set_property(TARGET gpu_stream PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
target_include_directories(gpu_stream PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
target_link_libraries(gpu_stream numa nvidia-ml)
else()
    # TODO: test for ROCm
# ROCm environment
include(../rocm_common.cmake)
find_package(hip QUIET)
if(hip_FOUND)
message(STATUS "Found ROCm: " ${HIP_VERSION})
        # Convert CUDA code to HIP code with hipify-perl
execute_process(COMMAND hipify-perl -print-stats -o gpu_stream.cpp ${SOURCES} WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/)
# link hip device lib
add_executable(gpu_stream gpu_stream.cpp)
include(CheckSymbolExists)
check_symbol_exists("hipDeviceMallocUncached" "hip/hip_runtime_api.h" HIP_UNCACHED_MEMORY)
        if(HIP_UNCACHED_MEMORY)
target_compile_definitions(gpu_stream PRIVATE HIP_UNCACHED_MEMORY)
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
target_link_libraries(gpu_stream numa hip::device)
else()
message(FATAL_ERROR "No CUDA or ROCm environment found.")
endif()
endif()
install(TARGETS gpu_stream RUNTIME DESTINATION bin)
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <getopt.h>
#include <iostream>
#include <memory>
#include <variant>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <numa.h>
#include "gpu_stream_kernels.hpp"
#include "gpu_stream_utils.hpp"
#define NON_HIP (!defined(__HIP_PLATFORM_HCC__) && !defined(__HCC__) && !defined(__HIPCC__))
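// Convenience guard: true when not compiling under a HIP compiler. (Expanding defined()
// inside a macro body is compiler-specific behavior per the C++ standard.)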
using namespace stream_config;
class GpuStream {
public:
GpuStream() = delete; // Delete default constructor
GpuStream(Opts &) noexcept; // Constructor
~GpuStream() noexcept = default; // Destructor
GpuStream(const GpuStream &) = delete;
GpuStream &operator=(const GpuStream &) = delete;
GpuStream(GpuStream &&) noexcept = default;
GpuStream &operator=(GpuStream &&) noexcept = default;
int Run();
private:
using BenchArgsVariant = std::variant<std::unique_ptr<BenchArgs<double>>>;
std::vector<BenchArgsVariant> bench_args_;
Opts opts_;
// Memory management functions
template <typename T> cudaError_t GpuMallocDataBuf(T **, uint64_t);
template <typename T> int PrepareValidationBuf(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int CheckBuf(std::unique_ptr<BenchArgs<T>> &, int);
template <typename T> int PrepareEvent(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int PrepareBufAndStream(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int DestroyEvent(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int DestroyBufAndStream(std::unique_ptr<BenchArgs<T>> &);
template <typename T> int Destroy(std::unique_ptr<BenchArgs<T>> &);
// Benchmark functions
template <typename T> int RunStreamKernel(std::unique_ptr<BenchArgs<T>> &, Kernel, int);
template <typename T> int RunStream(std::unique_ptr<BenchArgs<T>> &, const std::string &data_type);
// Helper functions
int GetGpuCount(int *);
int SetGpu(int gpu_id);
void PrintCudaDeviceInfo(int, const cudaDeviceProp &);
};
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream_kernels.hpp"
/**
* @brief Fetches a value from source memory and writes it to a register.
*
* @details This inline device function fetches a value from the specified source memory
* location and writes it to the provided register. The implementation references the following:
* 1) NCCL:
* https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483
* 2) RCCL:
* https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268
*
* @tparam T The type of the value to fetch.
* @param[out] v The register to write the fetched value to.
* @param[in] p The source memory location to fetch the value from.
*/
template <typename T> inline __device__ void Fetch(T &v, const T *p) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
v = *p;
#else
if constexpr (std::is_same<T, float>::value) {
asm volatile("ld.volatile.global.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory");
} else if constexpr (std::is_same<T, double>::value) {
asm volatile("ld.volatile.global.f64 %0, [%1];" : "=d"(v) : "l"(p) : "memory");
}
#endif
}
/**
* @brief Stores a value from register and writes it to target memory.
*
* @details This inline device function stores a value from the provided register
* and writes it to the specified target memory location. The implementation references the following:
* 1) NCCL:
* https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486
* 2) RCCL:
* https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276
*
* @tparam T The type of the value to store.
* @param[out] p The target memory location to write the value to.
* @param[in] v The register containing the value to be stored.
*/
template <typename T> inline __device__ void Store(T *p, const T &v) {
#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
*p = v;
#else
if constexpr (std::is_same<T, float>::value) {
asm volatile("st.volatile.global.f32 [%0], %1;" ::"l"(p), "f"(v) : "memory");
} else if constexpr (std::is_same<T, double>::value) {
asm volatile("st.volatile.global.f64 [%0], %1;" ::"l"(p), "d"(v) : "memory");
}
#endif
}
/**
* @brief Performs COPY, a simple copy operation from source to target. b = a
*
* @details This CUDA kernel performs a simple copy operation, copying data from the source array
* to the target array. This is used to measure transfer rates without any arithmetic operations.
*
* @param[out] tgt The target array where data will be copied to.
* @param[in] src The source array from which data will be copied.
*/
__global__ void CopyKernel(double *tgt, const double *src) {
uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
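    // With kNumLoopUnrollAlias = 2, thread t of block b touches elements
    // b * blockDim.x * 2 + t and b * blockDim.x * 2 + t + blockDim.x, so consecutive
    // threads access consecutive addresses on every unrolled pass (coalesced).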
double val[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Fetch(val[i], src + index + i * blockDim.x);
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Store(tgt + index + i * blockDim.x, val[i]);
}
/**
* @brief Performs SCALE, a scaling operation on the source data. b = x * a
*
* @details This CUDA kernel performs a simple arithmetic operation by scaling the source data
* with a given scalar value and storing the result in the target array.
*
* @param[out] tgt The target array where the scaled data will be stored.
* @param[in] src The source array containing the data to be scaled.
* @param[in] scalar The scalar value used to scale the source data.
*/
__global__ void ScaleKernel(double *tgt, const double *src, const long scalar) {
uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
Fetch(val[i], src + index + i * blockDim.x);
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val[i] *= scalar;
Store(tgt + index + i * blockDim.x, val[i]);
}
}
/**
* @brief Performs ADD, an addition operation on two source arrays. c = a + b
*
* @details This CUDA kernel adds corresponding elements from two source arrays and stores the result
* in the target array. This operation is used to measure transfer rates with a simple arithmetic addition.
*
* @param[out] tgt The target array where the result of the addition will be stored.
* @param[in] src_a The first source array containing the first set of operands.
* @param[in] src_b The second source array containing the second set of operands.
*/
__global__ void AddKernel(double *tgt, const double *src_a, const double *src_b) {
uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val_a[kNumLoopUnrollAlias];
double val_b[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
Fetch(val_a[i], src_a + index + i * blockDim.x);
Fetch(val_b[i], src_b + index + i * blockDim.x);
}
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val_a[i] += val_b[i];
Store(tgt + index + i * blockDim.x, val_a[i]);
}
}
/**
* @brief Performs TRIAD, fused multiply/add operations on source arrays. a = b + x * c
*
* @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from
* the second source array with a scalar value, adding the result to corresponding elements from
* the first source array, and storing the result in the target array.
*
* @param[out] tgt The target array where the result of the fused multiply/add operation will be stored.
* @param[in] src_a The first source array containing the first set of operands.
* @param[in] src_b The second source array containing the second set of operands to be multiplied by the scalar.
* @param[in] scalar The scalar value used in the multiply/add operation.
*/
__global__ void TriadKernel(double *tgt, const double *src_a, const double *src_b, const long scalar) {
uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
double val_a[kNumLoopUnrollAlias];
double val_b[kNumLoopUnrollAlias];
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
Fetch(val_a[i], src_a + index + i * blockDim.x);
Fetch(val_b[i], src_b + index + i * blockDim.x);
}
#pragma unroll
for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
val_b[i] += (val_a[i] * scalar);
Store(tgt + index + i * blockDim.x, val_b[i]);
}
}
\ No newline at end of file
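The per-kernel bandwidth figures follow the conventional STREAM accounting: bytes moved per loop equal the buffer size times the number of buffers the kernel touches, matching `kBufferBwMultipliers = {2, 2, 3, 3}` in `gpu_stream_utils.hpp` (copy and scale read one buffer and write one; add and triad read two and write one). A minimal sketch of that calculation, assuming this accounting (illustrative, not the benchmark's code):

```python
def stream_bandwidth_gbps(size_bytes, time_ms, num_buffers_touched):
    """Conventional STREAM bandwidth: bytes moved divided by kernel time."""
    bytes_moved = num_buffers_touched * size_bytes
    return bytes_moved / (time_ms / 1e3) / 1e9

# Example: a 4 GiB copy (2 buffers touched) finishing in ~1.28 ms gives
# ~6711 GB/s, consistent with the sample output in tests/data/gpu_stream.log.
print(stream_bandwidth_gbps(4294967296, 1.28, 2))
```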
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <cuda.h>
#include <cuda_runtime.h>
#include "gpu_stream_utils.hpp"
constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll;
// Function declarations
template <typename T> inline __device__ void Fetch(T &v, const T *p);
template <typename T> inline __device__ void Store(T *p, const T &v);
__global__ void CopyKernel(double *, const double *);
__global__ void ScaleKernel(double *, const double *, const long);
__global__ void AddKernel(double *, const double *, const double *);
__global__ void TriadKernel(double *, const double *, const double *, const long);
\ No newline at end of file
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream.hpp"
/**
 * @brief Main function and entry point of the GPU stream benchmark.
* @details
 * Parameter list:
 *   num_warm_up: number of warm-up runs
 *   num_loops: number of timed runs
 *   size: data buffer size in bytes
* @param argc argument count
* @param argv argument vector
* @return int
*/
int main(int argc, char **argv) {
int ret = 0;
stream_config::Opts opts;
// parse arguments from cmd
ret = stream_config::ParseOpts(argc, argv, &opts);
if (ret != 0) {
return ret;
}
// run the stream benchmark
GpuStream gpu_stream(opts);
ret = gpu_stream.Run();
if (ret != 0) {
return ret;
}
return 0;
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#include "gpu_stream_utils.hpp"
namespace stream_config {
/**
* @brief Converts a kernel index to its corresponding string representation.
*
 * @details This function takes an integer representing a kernel index and returns the corresponding
 * string representation of the kernel. The mapping between kernel indices and their string
 * representations follows the Kernel enum.
*
* @param[in] kernel_idx The index of the kernel to be converted to a string.
*
* @return std::string The string representation of the kernel.
*/
std::string KernelToString(int kernel_idx) {
switch (kernel_idx) {
case static_cast<int>(Kernel::kCopy):
return "COPY";
case static_cast<int>(Kernel::kScale):
return "SCALE";
case static_cast<int>(Kernel::kAdd):
return "ADD";
case static_cast<int>(Kernel::kTriad):
return "TRIAD";
default:
return "UNKNOWN";
}
}
/**
* @brief Print the usage of this program.
*
 * @details This function prints the usage of this program.
*
* @return void.
* */
void PrintUsage() {
std::cout << "Usage: gpu_stream "
<< "--size <size in bytes> "
<< "--num_warm_up <num_warm_up> "
<< "--num_loops <num_loops> "
<< "[--check_data]" << std::endl;
}
/**
 * @brief Print the user-provided input info.
 *
 * @details This function prints the parsed user-provided inputs of this program.
*
* @param[in] opts The Opts struct that stores the parsed values.
*
* @return void
* */
void PrintInputInfo(Opts &opts) {
std::cout << "STREAM Benchmark" << std::endl;
std::cout << "Buffer size(bytes): " << opts.size << std::endl;
std::cout << "Number of warm up runs: " << opts.num_warm_up << std::endl;
std::cout << "Number of loops: " << opts.num_loops << std::endl;
std::cout << "Check data: " << (opts.check_data ? "Yes" : "No") << std::endl;
}
/**
* @brief Parse the command line options.
*
 * @details This function parses the command line options and stores the values in the Opts struct.
*
* @param[in] argc The number of command line options.
* @param[in] argv The command line options.
* @param[out] opts The Opts struct to store the parsed values.
*
* @return int The status code.
* */
int ParseOpts(int argc, char **argv, Opts *opts) {
enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData };
const struct option options[] = {{"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
{"num_warm_up", required_argument, nullptr, static_cast<int>(OptIdx::kNumWarmUp)},
{"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumLoops)},
{"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)}};
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
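    // size is optional: Opts already defaults it to kDefaultBufferSizeInBytes.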
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
} else if (getopt_ret == '?') {
parse_err = true;
break;
}
switch (opt_idx) {
case static_cast<int>(OptIdx::kSize):
if (1 != sscanf(optarg, "%lu", &(opts->size))) {
std::cerr << "Invalid size: " << optarg << std::endl;
parse_err = true;
} else {
size_specified = true;
}
break;
case static_cast<int>(OptIdx::kNumWarmUp):
if (1 != sscanf(optarg, "%lu", &(opts->num_warm_up))) {
std::cerr << "Invalid num_warm_up: " << optarg << std::endl;
parse_err = true;
} else {
num_warm_up_specified = true;
}
break;
case static_cast<int>(OptIdx::kNumLoops):
if (1 != sscanf(optarg, "%lu", &(opts->num_loops))) {
std::cerr << "Invalid num_loops: " << optarg << std::endl;
parse_err = true;
} else {
num_loops_specified = true;
}
break;
case static_cast<int>(OptIdx::kEnableCheckData):
opts->check_data = true;
break;
default:
parse_err = true;
}
if (parse_err) {
break;
}
}
if (parse_err) {
PrintUsage();
return -1;
}
return 0;
}
} // namespace stream_config
unsigned long long getCurrentTimestampInMicroseconds() {
// Get the current time point
auto now = std::chrono::system_clock::now();
// Convert to time since epoch
auto duration = now.time_since_epoch();
// Convert to microseconds
auto microseconds = std::chrono::duration_cast<std::chrono::microseconds>(duration).count();
return static_cast<unsigned long long>(microseconds);
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.
#pragma once
#include <array>
#include <chrono>
#include <getopt.h>
#include <iomanip>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <numa.h>
#include <nvml.h>
// Custom deleter for GPU buffers
struct GpuBufferDeleter {
template <typename T> void operator()(T *ptr) const {
if (ptr) {
cudaFree(ptr);
}
}
};
unsigned long long getCurrentTimestampInMicroseconds();
namespace stream_config {
constexpr std::array<int, 4> kThreadsPerBlock = {128, 256, 512, 1024}; // Threads-per-block configurations to sweep
constexpr uint64_t kDefaultBufferSizeInBytes = 4294967296;             // Default buffer size: 4 GiB
constexpr int kNumLoopUnroll = 2;                                      // Loop unroll depth in the stream kernels
constexpr int kNumBuffers = 3;                                         // Maximum number of data buffers (add/triad use three)
constexpr int kNumValidationBuffers = 4;                               // Number of validation buffers, one per kernel
constexpr int kUInt8Mod = 256;                                         // Modulo for 8-bit unsigned data values
constexpr std::array<int, 4> kBufferBwMultipliers = {2, 2, 3, 3};      // Buffers touched per kernel: copy/scale move 2, add/triad move 3
constexpr long scalar = 11;                                            // Scalar used by the scale and triad kernels
// Enum for different kernels
enum class Kernel {
kCopy,
kScale,
kAdd,
kTriad,
kCount // Add a count to keep track of the number of enums. Helpful for iterating over enums.
};
// Arguments for each sub benchmark run.
template <typename T> struct SubBenchArgs {
// Unique pointer for GPU buffers
using GpuBufferUniquePtr = std::unique_ptr<T, GpuBufferDeleter>;
// Original data buffer.
T *data_buf = nullptr;
// Buffer to validate the correctness of data transfer.
T *check_buf = nullptr;
// GPU pointer of the data buffer on source devices.
std::vector<GpuBufferUniquePtr> gpu_buf_ptrs;
// Pointer of the validation buffers for each kernel. Order is same as Kernel enum.
std::vector<std::vector<T>> validation_buf_ptrs;
// CUDA stream to be used.
cudaStream_t stream;
// CUDA event to record start time.
cudaEvent_t start_event;
// CUDA event to record end time.
cudaEvent_t end_event;
    // Recorded kernel execution times in milliseconds.
std::vector<std::vector<float>> times_in_ms;
// Stream Kernel name.
std::string kernel_name;
};
// Arguments for each benchmark run.
template <typename T> struct BenchArgs {
// NUMA node under which the benchmark is done.
uint64_t numa_id = 0;
// GPU ID for device.
int gpu_id = 0;
// GPU device info
cudaDeviceProp gpu_device_prop;
// Data buffer size used.
uint64_t size = kDefaultBufferSizeInBytes;
// Number of warm up rounds to run.
uint64_t num_warm_up = 0;
// Number of loops to run.
uint64_t num_loops = 1;
// Whether check data after copy.
bool check_data = false;
    // Arguments of the sub-benchmark.
SubBenchArgs<T> sub;
};
// Options accepted by this program.
struct Opts {
// Data buffer size for copy benchmark.
uint64_t size = kDefaultBufferSizeInBytes;
// Number of warm up rounds to run.
uint64_t num_warm_up = 0;
// Number of loops to run.
uint64_t num_loops = 0;
// Whether check data after copy.
bool check_data = false;
};
std::string KernelToString(int); // Function to convert enum to string
int ParseOpts(int, char **, Opts *);
void PrintInputInfo(Opts &);
void PrintUsage();
} // namespace stream_config
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""Tests for gpu_stream benchmark."""
import numbers
import unittest
from tests.helper import decorator
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks import BenchmarkRegistry, BenchmarkType, ReturnCode, Platform
class GpuStreamBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
"""Test class for gpu_stream benchmark."""
@classmethod
def setUpClass(cls):
"""Hook method for setting up class fixture before running tests in the class."""
super().setUpClass()
cls.createMockEnvs(cls)
cls.createMockFiles(cls, ['bin/gpu_stream'])
def _test_gpu_stream_command_generation(self, platform):
"""Test gpu-stream benchmark command generation."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
num_warm_up = 5
num_loops = 10
size = 25769803776
parameters = '--num_warm_up %d --num_loops %d --size %d ' \
'--check_data' % \
(num_warm_up, num_loops, size)
benchmark = benchmark_class(benchmark_name, parameters=parameters)
# Check basic information
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == benchmark_name)
assert (benchmark.type == BenchmarkType.MICRO)
# Check parameters specified in BenchmarkContext.
assert (benchmark._args.size == size)
assert (benchmark._args.num_warm_up == num_warm_up)
assert (benchmark._args.num_loops == num_loops)
assert (benchmark._args.check_data)
# Check command
assert (1 == len(benchmark._commands))
assert (benchmark._commands[0].startswith(benchmark._GpuStreamBenchmark__bin_path))
assert ('--size %d' % size in benchmark._commands[0])
assert ('--num_warm_up %d' % num_warm_up in benchmark._commands[0])
assert ('--num_loops %d' % num_loops in benchmark._commands[0])
assert ('--check_data' in benchmark._commands[0])
@decorator.cuda_test
def test_gpu_stream_command_generation_cuda(self):
"""Test gpu-stream benchmark command generation, CUDA case."""
self._test_gpu_stream_command_generation(Platform.CUDA)
@decorator.rocm_test
def test_gpu_stream_command_generation_rocm(self):
"""Test gpu-stream benchmark command generation, ROCm case."""
self._test_gpu_stream_command_generation(Platform.ROCM)
@decorator.load_data('tests/data/gpu_stream.log')
def _test_gpu_stream_result_parsing(self, platform, test_raw_output):
"""Test gpu-stream benchmark result parsing."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
benchmark = benchmark_class(benchmark_name, parameters='')
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == 'gpu-stream')
assert (benchmark.type == BenchmarkType.MICRO)
# Positive case - valid raw output.
assert (benchmark._process_raw_result(0, test_raw_output))
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (1 == len(benchmark.raw_data))
test_raw_output_dict = {
x.split()[0]: [float(x.split()[1]), float(x.split()[2])]
for x in test_raw_output.strip().splitlines() if x.startswith('STREAM_')
}
assert (len(test_raw_output_dict) * 2 + benchmark.default_metric_count == len(benchmark.result))
for output_key in benchmark.result:
if output_key == 'return_code':
assert (benchmark.result[output_key] == [0])
else:
assert (len(benchmark.result[output_key]) == 1)
assert (isinstance(benchmark.result[output_key][0], numbers.Number))
if output_key.endswith('_bw'):
assert (output_key.strip('_bw') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0])
else:
assert (output_key.strip('_ratio') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0])
# Negative case - invalid raw output.
assert (benchmark._process_raw_result(1, 'Invalid raw output') is False)
assert (benchmark.return_code == ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
@decorator.cuda_test
def test_gpu_stream_result_parsing_cuda(self):
"""Test gpu-stream benchmark result parsing, CUDA case."""
self._test_gpu_stream_result_parsing(Platform.CUDA)
@decorator.rocm_test
def test_gpu_stream_result_parsing_rocm(self):
"""Test gpu-stream benchmark result parsing, ROCm case."""
self._test_gpu_stream_result_parsing(Platform.ROCM)
STREAM Benchmark
Buffer size(bytes): 4294967296
Number of warm up runs: 10
Number of loops: 40
Check data: No
Device 0: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93
STREAM_COPY_double_gpu_0_buffer_4294967296_block_256 6549.50 79.95
STREAM_COPY_double_gpu_0_buffer_4294967296_block_512 6195.43 75.63
STREAM_COPY_double_gpu_0_buffer_4294967296_block_1024 5721.52 69.84
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_128 6680.42 81.55
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_256 6515.51 79.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_512 6106.69 74.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_1024 5626.68 68.69
STREAM_ADD_double_gpu_0_buffer_4294967296_block_128 7379.25 90.08
STREAM_ADD_double_gpu_0_buffer_4294967296_block_256 7407.27 90.42
STREAM_ADD_double_gpu_0_buffer_4294967296_block_512 7309.59 89.23
STREAM_ADD_double_gpu_0_buffer_4294967296_block_1024 6788.64 82.87
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_128 7378.19 90.07
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_256 7414.01 90.50
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_512 7295.50 89.06
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024 6730.42 82.16
Device 1: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000.00MHz x 8192-bit = 8192.00 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_1_buffer_4294967296_block_128 6708.74 81.89
STREAM_COPY_double_gpu_1_buffer_4294967296_block_256 6549.47 79.95
STREAM_COPY_double_gpu_1_buffer_4294967296_block_512 6195.39 75.63
STREAM_COPY_double_gpu_1_buffer_4294967296_block_1024 5725.07 69.89
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_128 6678.56 81.53
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_256 6514.05 79.52
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_512 6103.80 74.51
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_1024 5630.41 68.73
STREAM_ADD_double_gpu_1_buffer_4294967296_block_128 7377.74 90.06
STREAM_ADD_double_gpu_1_buffer_4294967296_block_256 7410.97 90.47
STREAM_ADD_double_gpu_1_buffer_4294967296_block_512 7310.80 89.24
STREAM_ADD_double_gpu_1_buffer_4294967296_block_1024 6789.91 82.88
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_128 7379.03 90.08
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_256 7414.04 90.50
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_512 7298.26 89.09
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_1024 6732.15 82.18
\ No newline at end of file