Commit d4051602 authored by one

Migrate gpu-stream to BabelStream v5.0

parent 1a57f2d6
......@@ -267,20 +267,42 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU
#### Introduction
Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype.
Measure the memory bandwidth of the GPU using the BabelStream (`hip-stream`) backend.
The benchmark executes copy, scale, add, triad, and dot operations.
The `array_size` parameter represents the number of elements.
Each benchmark run measures the GPU visible to the current process.
#### Metrics
| Metric Name | Unit | Description |
|------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. |
| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. |
| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. |
| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. |
| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. |
| Metric Name | Unit | Description |
|-------------------------------------------------------------------|------------------|------------------------------------------------------------------------------------------------|
| STREAM\_INIT\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Initialization phase bandwidth for the current benchmark run and one array size. |
| STREAM\_INIT\_[float\|double]\_array\_[0-9]+\_time | time (s) | Initialization phase runtime for the current benchmark run and one array size. |
| STREAM\_READ\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Read phase bandwidth for the current benchmark run and one array size. |
| STREAM\_READ\_[float\|double]\_array\_[0-9]+\_time | time (s) | Read phase runtime for the current benchmark run and one array size. |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Maximum copy bandwidth for the current benchmark run and one array size. |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_time\_min | time (s) | Minimum copy runtime for the current benchmark run and one array size. |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_time\_max | time (s) | Maximum copy runtime for the current benchmark run and one array size. |
| STREAM\_COPY\_[float\|double]\_array\_[0-9]+\_time\_avg | time (s) | Average copy runtime for the current benchmark run and one array size. |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Maximum mul bandwidth for the current benchmark run and one array size. |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_time\_min | time (s) | Minimum mul runtime for the current benchmark run and one array size. |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_time\_max | time (s) | Maximum mul runtime for the current benchmark run and one array size. |
| STREAM\_MUL\_[float\|double]\_array\_[0-9]+\_time\_avg | time (s) | Average mul runtime for the current benchmark run and one array size. |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Maximum add bandwidth for the current benchmark run and one array size. |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_time\_min | time (s) | Minimum add runtime for the current benchmark run and one array size. |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_time\_max | time (s) | Maximum add runtime for the current benchmark run and one array size. |
| STREAM\_ADD\_[float\|double]\_array\_[0-9]+\_time\_avg | time (s) | Average add runtime for the current benchmark run and one array size. |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Maximum triad bandwidth for the current benchmark run and one array size. |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_time\_min | time (s) | Minimum triad runtime for the current benchmark run and one array size. |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_time\_max | time (s) | Maximum triad runtime for the current benchmark run and one array size. |
| STREAM\_TRIAD\_[float\|double]\_array\_[0-9]+\_time\_avg | time (s) | Average triad runtime for the current benchmark run and one array size. |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_bw | bandwidth (GB/s) | Maximum dot bandwidth for the current benchmark run and one array size. |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_time\_min | time (s) | Minimum dot runtime for the current benchmark run and one array size. |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_time\_max | time (s) | Maximum dot runtime for the current benchmark run and one array size. |
| STREAM\_DOT\_[float\|double]\_array\_[0-9]+\_time\_avg | time (s) | Average dot runtime for the current benchmark run and one array size. |
`gpu-stream` now reports per-phase (`INIT`, `READ`) and per-function (`COPY`, `MUL`, `ADD`, `TRIAD`, `DOT`) metrics; the former `_ratio` and `block_*` metrics are removed.
Bandwidth metrics are converted from BabelStream's `max_mbytes_per_sec` using `GB/s = MB/s / 1000`.
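The metric naming scheme and unit conversion above can be sketched with two small helpers (hypothetical names for illustration, not the benchmark's actual API):

```python
def metric_name(operation, precision, array_size, suffix):
    # Builds names like STREAM_TRIAD_double_array_268435456_bw.
    return f'STREAM_{operation}_{precision}_array_{array_size}_{suffix}'

def mbps_to_gbps(mbps):
    # BabelStream reports max_mbytes_per_sec; metrics are stored in GB/s.
    return mbps / 1000

print(metric_name('TRIAD', 'double', 268435456, 'bw'))
# STREAM_TRIAD_double_array_268435456_bw
print(mbps_to_gbps(1.29252e+06))
# 1292.52
```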
### `ib-loopback`
......
......@@ -12,13 +12,12 @@
if __name__ == '__main__':
context = BenchmarkRegistry.create_benchmark_context(
'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10'
'gpu-stream', platform=Platform.CUDA, parameters='--array_size 268435456 --num_loops 10 --precision double'
)
# For ROCm environment, please specify the benchmark name and the platform as the following.
# context = BenchmarkRegistry.create_benchmark_context(
# 'gpu-stream', platform=Platform.ROCM, parameters='--num_warm_up 1 --num_loops 10'
# 'gpu-stream', platform=Platform.ROCM, parameters='--array_size 268435456 --num_loops 10 --precision float'
# )
# To enable data checking, please add '--check_data'.
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
......
......@@ -3,6 +3,8 @@
"""Module of the GPU Stream Performance benchmark."""
import csv
import io
import os
from superbench.common.utils import logger
......@@ -12,6 +14,18 @@
class GpuStreamBenchmark(MicroBenchmarkWithInvoke):
"""The GPU stream performance benchmark class."""
_function_metric_map = {
'Copy': 'COPY',
'Mul': 'MUL',
'Add': 'ADD',
'Triad': 'TRIAD',
'Dot': 'DOT',
}
_phase_metric_map = {
'Init': 'INIT',
'Read': 'READ',
}
def __init__(self, name, parameters=''):
"""Constructor.
......@@ -21,26 +35,19 @@ def __init__(self, name, parameters=''):
"""
super().__init__(name, parameters)
self._bin_name = 'gpu_stream'
self._bin_name = 'hip-stream'
self.__bin_path = None
def add_parser_arguments(self):
"""Add the specified arguments."""
super().add_parser_arguments()
self._parser.add_argument(
'--size',
type=int,
default=4096 * 1024**2,
required=False,
help='Size of data buffer in bytes.',
)
self._parser.add_argument(
'--num_warm_up',
'--array_size',
type=int,
default=20,
default=268435456,
required=False,
help='Number of warm up rounds',
help='Number of elements in array.',
)
self._parser.add_argument(
......@@ -48,13 +55,16 @@ def add_parser_arguments(self):
type=int,
default=100,
required=False,
help='Number of data buffer copies performed.',
help='Number of benchmark runs, mapping to --numtimes in BabelStream.',
)
self._parser.add_argument(
'--check_data',
action='store_true',
help='Enable data checking',
'--precision',
type=str,
default='double',
choices=['double', 'float'],
required=False,
help='Data type for benchmark.',
)
def _preprocess(self):
......@@ -68,17 +78,138 @@ def _preprocess(self):
self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name)
args = '--size %d --num_warm_up %d --num_loops %d ' % (
self._args.size, self._args.num_warm_up, self._args.num_loops
)
if self._args.check_data:
args += ' --check_data'
self._commands = ['%s %s' % (self.__bin_path, args)]
args = f'--arraysize {self._args.array_size} --numtimes {self._args.num_loops} --csv'
if self._args.precision == 'float':
args += ' --float'
self._commands = [f'{self.__bin_path} {args}']
return True
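With the default parameters, the command assembled by `_preprocess` comes out as follows (a standalone reconstruction; the bin path is an assumed example, not a fixed install location):

```python
# Illustrative reconstruction of the command assembly above.
# The bin path is an assumption for demonstration only.
array_size, num_loops, precision = 268435456, 100, 'double'
bin_path = '/opt/superbench/bin/hip-stream'

args = f'--arraysize {array_size} --numtimes {num_loops} --csv'
if precision == 'float':
    args += ' --float'  # BabelStream defaults to double; --float switches precision
command = f'{bin_path} {args}'

print(command)
# /opt/superbench/bin/hip-stream --arraysize 268435456 --numtimes 100 --csv
```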
def _get_device_name(self, raw_output):
"""Extract device name from BabelStream output when available."""
for line in raw_output.splitlines():
if line.startswith('Using HIP device '):
return line[len('Using HIP device '):].strip()
return 'Unknown'
@staticmethod
def _mbps_to_gbps(value):
"""Convert MB/s to GB/s."""
return float(value) / 1000
def _parse_csv_phase_rows(self, raw_output):
"""Extract phase rows from BabelStream CSV output."""
lines = [line.strip() for line in raw_output.strip().splitlines() if line.strip()]
header = 'phase,n_elements,sizeof,max_mbytes_per_sec,runtime'
if header not in lines:
raise ValueError('No phase CSV header found in output.')
start_idx = lines.index(header)
csv_content = '\n'.join(lines[start_idx:])
reader = csv.DictReader(io.StringIO(csv_content))
phase_rows = []
for row in reader:
phase_name = row.get('phase', '').strip()
if phase_name in self._phase_metric_map:
metric_tag = self._phase_metric_map[phase_name]
array_size = int(row['n_elements'])
phase_rows.append({
'metric_name': self._get_phase_bw_metric_name(metric_tag, array_size),
'value': self._mbps_to_gbps(row['max_mbytes_per_sec']),
})
phase_rows.append({
'metric_name': self._get_phase_time_metric_name(metric_tag, array_size),
'value': float(row['runtime']),
})
if not phase_rows:
raise ValueError('No valid phase rows found in CSV output.')
return phase_rows
def _parse_csv_function_rows(self, raw_output):
"""Extract function rows from BabelStream CSV output."""
lines = [line.strip() for line in raw_output.strip().splitlines() if line.strip()]
header = 'function,num_times,n_elements,sizeof,max_mbytes_per_sec,min_runtime,max_runtime,avg_runtime'
if header not in lines:
raise ValueError('No function CSV header found in output.')
start_idx = lines.index(header)
csv_content = '\n'.join(lines[start_idx:])
reader = csv.DictReader(io.StringIO(csv_content))
function_rows = []
for row in reader:
function_name = row.get('function', '').strip()
if function_name in self._function_metric_map:
metric_tag = self._function_metric_map[function_name]
array_size = int(row['n_elements'])
function_rows.append({
'metric_name': self._get_function_bw_metric_name(metric_tag, array_size),
'value': self._mbps_to_gbps(row['max_mbytes_per_sec']),
})
function_rows.append({
'metric_name': self._get_function_time_metric_name(metric_tag, array_size, 'min'),
'value': float(row['min_runtime']),
})
function_rows.append({
'metric_name': self._get_function_time_metric_name(metric_tag, array_size, 'max'),
'value': float(row['max_runtime']),
})
function_rows.append({
'metric_name': self._get_function_time_metric_name(metric_tag, array_size, 'avg'),
'value': float(row['avg_runtime']),
})
if not function_rows:
raise ValueError('No valid function rows found in CSV output.')
return function_rows
def _get_phase_bw_metric_name(self, metric_tag, array_size):
"""Build phase bandwidth metric name."""
return 'STREAM_{}_{}_array_{}_bw'.format(metric_tag, self._args.precision, array_size)
def _get_phase_time_metric_name(self, metric_tag, array_size):
"""Build phase runtime metric name."""
return 'STREAM_{}_{}_array_{}_time'.format(metric_tag, self._args.precision, array_size)
def _get_function_bw_metric_name(self, metric_tag, array_size):
"""Build function bandwidth metric name."""
return 'STREAM_{}_{}_array_{}_bw'.format(metric_tag, self._args.precision, array_size)
def _get_function_time_metric_name(self, metric_tag, array_size, metric_type):
"""Build function runtime metric name."""
return 'STREAM_{}_{}_array_{}_time_{}'.format(metric_tag, self._args.precision, array_size, metric_type)
def _format_device_output(self, device_name, metrics):
"""Render one device section in a human-readable format."""
metric_width = max(len(metric['metric_name']) for metric in metrics)
output_lines = ['Device: {}'.format(device_name)]
for metric in metrics:
output_lines.append('{:<{width}} {:.6f}'.format(metric['metric_name'], metric['value'], width=metric_width))
return output_lines
def _get_text_output_header(self):
"""Render benchmark metadata in the text output header."""
return [
'STREAM Benchmark (BabelStream backend)',
'Array size(elements): {}'.format(self._args.array_size),
'Number of loops: {}'.format(self._args.num_loops),
'Precision: {}'.format(self._args.precision),
'Bandwidth unit: GB/s (max_mbytes_per_sec / 1000)',
]
def _parse_device_output(self, raw_output):
"""Parse one device output and return rendered lines and parsed metrics."""
device_name = self._get_device_name(raw_output)
metrics = self._parse_csv_phase_rows(raw_output) + self._parse_csv_function_rows(raw_output)
rendered_lines = self._get_text_output_header()
rendered_lines.append('')
rendered_lines.extend(self._format_device_output(device_name, metrics))
return rendered_lines, metrics
def _process_raw_result(self, cmd_idx, raw_output):
"""Function to parse raw results and save the summarized results.
......@@ -91,19 +222,11 @@ def _process_raw_result(self, cmd_idx, raw_output):
Return:
True if the raw output string is valid and result can be extracted.
"""
self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)
try:
output_lines = [x.strip() for x in raw_output.strip().splitlines()]
count = 0
for output_line in output_lines:
if output_line.startswith('STREAM_'):
count += 1
tag, bw_str, ratio = output_line.split()
self._result.add_result(tag + '_bw', float(bw_str))
self._result.add_result(tag + '_ratio', float(ratio))
if count == 0:
raise BaseException('No valid results found.')
rendered_lines, metrics = self._parse_device_output(raw_output)
self._result.add_raw_data('raw_output_' + str(cmd_idx), '\n'.join(rendered_lines), self._args.log_raw_data)
for metric in metrics:
self._result.add_result(metric['metric_name'], metric['value'])
except BaseException as e:
self._result.set_return_code(ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
logger.error(
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""Tests for gpu_stream benchmark."""
"""Tests for gpu-stream benchmark."""
import numbers
import unittest
from pathlib import Path
from tests.helper import decorator
from tests.helper.testcase import BenchmarkTestCase
......@@ -12,51 +12,50 @@
class GpuStreamBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
"""Test class for gpu_stream benchmark."""
"""Test class for gpu-stream benchmark."""
@classmethod
def setUpClass(cls):
"""Hook method for setting up class fixture before running tests in the class."""
super().setUpClass()
cls.createMockEnvs(cls)
cls.createMockFiles(cls, ['bin/gpu_stream'])
cls.createMockFiles(cls, ['bin/hip-stream'])
@staticmethod
def _load_fixture(filename):
return (Path('tests/data') / filename).read_text()
def _test_gpu_stream_command_generation(self, platform):
"""Test gpu-stream benchmark command generation."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
(benchmark_class, _) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
num_warm_up = 5
num_loops = 10
size = 25769803776
parameters = '--num_warm_up %d --num_loops %d --size %d ' \
'--check_data' % \
(num_warm_up, num_loops, size)
parameters = '--array_size 268435456 --num_loops 20 --precision float'
benchmark = benchmark_class(benchmark_name, parameters=parameters)
# Check basic information
assert (benchmark)
ret = benchmark._preprocess()
assert (benchmark)
assert (ret is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == benchmark_name)
assert (benchmark.type == BenchmarkType.MICRO)
assert (benchmark._args.array_size == 268435456)
assert (benchmark._args.num_loops == 20)
assert (benchmark._args.precision == 'float')
# Check parameters specified in BenchmarkContext.
assert (benchmark._args.size == size)
assert (benchmark._args.num_warm_up == num_warm_up)
assert (benchmark._args.num_loops == num_loops)
assert (benchmark._args.check_data)
# Check command
assert (1 == len(benchmark._commands))
assert (benchmark._commands[0].startswith(benchmark._GpuStreamBenchmark__bin_path))
assert ('--size %d' % size in benchmark._commands[0])
assert ('--num_warm_up %d' % num_warm_up in benchmark._commands[0])
assert ('--num_loops %d' % num_loops in benchmark._commands[0])
assert ('--check_data' in benchmark._commands[0])
assert ('--arraysize 268435456' in benchmark._commands[0])
assert ('--numtimes 20' in benchmark._commands[0])
assert ('--csv' in benchmark._commands[0])
assert ('--float' in benchmark._commands[0])
assert ('--device' not in benchmark._commands[0])
benchmark = benchmark_class(benchmark_name, parameters='--array_size 1024 --num_loops 2')
assert (benchmark._preprocess() is True)
assert (benchmark._args.precision == 'double')
assert ('--float' not in benchmark._commands[0])
@decorator.cuda_test
def test_gpu_stream_command_generation_cuda(self):
......@@ -68,47 +67,61 @@ def test_gpu_stream_command_generation_rocm(self):
"""Test gpu-stream benchmark command generation, ROCm case."""
self._test_gpu_stream_command_generation(Platform.ROCM)
@decorator.load_data('tests/data/gpu_stream.log')
def _test_gpu_stream_result_parsing(self, platform, test_raw_output):
def _test_gpu_stream_result_parsing(self, platform):
"""Test gpu-stream benchmark result parsing."""
benchmark_name = 'gpu-stream'
(benchmark_class,
predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
(benchmark_class, _) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
assert (benchmark_class)
benchmark = benchmark_class(benchmark_name, parameters='')
benchmark = benchmark_class(benchmark_name, parameters='--precision double')
assert (benchmark)
ret = benchmark._preprocess()
assert (ret is True)
assert (benchmark._preprocess() is True)
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert (benchmark.name == 'gpu-stream')
assert (benchmark.name == benchmark_name)
assert (benchmark.type == BenchmarkType.MICRO)
# Positive case - valid raw output.
assert (benchmark._process_raw_result(0, test_raw_output))
assert (benchmark.return_code == ReturnCode.SUCCESS)
valid_output = self._load_fixture('gpu_stream.log')
assert (1 == len(benchmark.raw_data))
# print(test_raw_output.splitlines())
test_raw_output_dict = {
x.split()[0]: [float(x.split()[1]), float(x.split()[2])]
for x in test_raw_output.strip().splitlines() if x.startswith('STREAM_')
assert (benchmark._process_raw_result(0, valid_output))
assert (benchmark.return_code == ReturnCode.SUCCESS)
assert ('raw_output_0' in benchmark.raw_data)
assert ('Device: BW150' in benchmark.raw_data['raw_output_0'][0])
expected_metric_values = {
'STREAM_INIT_double_array_268435456_bw': 6.77961,
'STREAM_INIT_double_array_268435456_time': 0.950269,
'STREAM_READ_double_array_268435456_bw': 1255.98,
'STREAM_READ_double_array_268435456_time': 0.00512943,
'STREAM_COPY_double_array_268435456_bw': 1345.22,
'STREAM_COPY_double_array_268435456_time_min': 0.00319277,
'STREAM_COPY_double_array_268435456_time_max': 0.00320985,
'STREAM_COPY_double_array_268435456_time_avg': 0.00319879,
'STREAM_MUL_double_array_268435456_bw': 1370.7,
'STREAM_MUL_double_array_268435456_time_min': 0.00313342,
'STREAM_MUL_double_array_268435456_time_max': 0.00314978,
'STREAM_MUL_double_array_268435456_time_avg': 0.00313862,
'STREAM_ADD_double_array_268435456_bw': 1292.74,
'STREAM_ADD_double_array_268435456_time_min': 0.00498358,
'STREAM_ADD_double_array_268435456_time_max': 0.00499938,
'STREAM_ADD_double_array_268435456_time_avg': 0.00498747,
'STREAM_TRIAD_double_array_268435456_bw': 1292.52,
'STREAM_TRIAD_double_array_268435456_time_min': 0.00498439,
'STREAM_TRIAD_double_array_268435456_time_max': 0.00499791,
'STREAM_TRIAD_double_array_268435456_time_avg': 0.00498815,
'STREAM_DOT_double_array_268435456_bw': 1271.19,
'STREAM_DOT_double_array_268435456_time_min': 0.00337869,
'STREAM_DOT_double_array_268435456_time_max': 0.00359398,
'STREAM_DOT_double_array_268435456_time_avg': 0.0033883,
}
assert (len(test_raw_output_dict) * 2 + benchmark.default_metric_count == len(benchmark.result))
for output_key in benchmark.result:
if output_key == 'return_code':
assert (benchmark.result[output_key] == [0])
else:
assert (len(benchmark.result[output_key]) == 1)
assert (isinstance(benchmark.result[output_key][0], numbers.Number))
if output_key.endswith('_bw'):
assert (output_key.strip('_bw') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0])
else:
assert (output_key.strip('_ratio') in test_raw_output_dict)
assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0])
# Negative case - invalid raw output.
assert (benchmark._process_raw_result(1, 'Invalid raw output') is False)
for metric_name, expected_value in expected_metric_values.items():
assert (metric_name in benchmark.result)
assert (abs(benchmark.result[metric_name][0] - expected_value) < 1e-6)
assert (all(not metric.endswith('_ratio') for metric in benchmark.result))
benchmark = benchmark_class(benchmark_name, parameters='--precision double')
assert (benchmark._preprocess() is True)
assert (benchmark._process_raw_result(0, 'Invalid raw output') is False)
assert (benchmark.return_code == ReturnCode.MICROBENCHMARK_RESULT_PARSING_FAILURE)
@decorator.cuda_test
......
STREAM Benchmark
Buffer size(bytes): 4294967296
Number of warm up runs: 10
Number of loops: 40
Check data: No
Device 0: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93
STREAM_COPY_double_gpu_0_buffer_4294967296_block_256 6549.50 79.95
STREAM_COPY_double_gpu_0_buffer_4294967296_block_512 6195.43 75.63
STREAM_COPY_double_gpu_0_buffer_4294967296_block_1024 5721.52 69.84
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_128 6680.42 81.55
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_256 6515.51 79.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_512 6106.69 74.54
STREAM_SCALE_double_gpu_0_buffer_4294967296_block_1024 5626.68 68.69
STREAM_ADD_double_gpu_0_buffer_4294967296_block_128 7379.25 90.08
STREAM_ADD_double_gpu_0_buffer_4294967296_block_256 7407.27 90.42
STREAM_ADD_double_gpu_0_buffer_4294967296_block_512 7309.59 89.23
STREAM_ADD_double_gpu_0_buffer_4294967296_block_1024 6788.64 82.87
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_128 7378.19 90.07
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_256 7414.01 90.50
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_512 7295.50 89.06
STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024 6730.42 82.16
Device 1: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000.00MHz x 8192-bit = 8192.00 GB/s PEAK ECC is ON
STREAM_COPY_double_gpu_1_buffer_4294967296_block_128 6708.74 81.89
STREAM_COPY_double_gpu_1_buffer_4294967296_block_256 6549.47 79.95
STREAM_COPY_double_gpu_1_buffer_4294967296_block_512 6195.39 75.63
STREAM_COPY_double_gpu_1_buffer_4294967296_block_1024 5725.07 69.89
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_128 6678.56 81.53
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_256 6514.05 79.52
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_512 6103.80 74.51
STREAM_SCALE_double_gpu_1_buffer_4294967296_block_1024 5630.41 68.73
STREAM_ADD_double_gpu_1_buffer_4294967296_block_128 7377.74 90.06
STREAM_ADD_double_gpu_1_buffer_4294967296_block_256 7410.97 90.47
STREAM_ADD_double_gpu_1_buffer_4294967296_block_512 7310.80 89.24
STREAM_ADD_double_gpu_1_buffer_4294967296_block_1024 6789.91 82.88
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_128 7379.03 90.08
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_256 7414.04 90.50
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_512 7298.26 89.09
STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_1024 6732.15 82.18
\ No newline at end of file
Using HIP device BW150
Driver: 60326045
Memory: DEFAULT
phase,n_elements,sizeof,max_mbytes_per_sec,runtime
Init,268435456,8,6779.61,0.950269
Read,268435456,8,1.25598e+06,0.00512943
function,num_times,n_elements,sizeof,max_mbytes_per_sec,min_runtime,max_runtime,avg_runtime
Copy,100,268435456,8,1.34522e+06,0.00319277,0.00320985,0.00319879
Mul,100,268435456,8,1.3707e+06,0.00313342,0.00314978,0.00313862
Add,100,268435456,8,1.29274e+06,0.00498358,0.00499938,0.00498747
Triad,100,268435456,8,1.29252e+06,0.00498439,0.00499791,0.00498815
Dot,100,268435456,8,1.27119e+06,0.00337869,0.00359398,0.0033883
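As a sanity check, the function block of this CSV fixture parses directly with the standard `csv` module; a minimal sketch independent of the benchmark class (only two rows shown for brevity):

```python
import csv
import io

# Two rows excerpted from the BabelStream CSV fixture above.
fixture = """function,num_times,n_elements,sizeof,max_mbytes_per_sec,min_runtime,max_runtime,avg_runtime
Copy,100,268435456,8,1.34522e+06,0.00319277,0.00320985,0.00319879
Triad,100,268435456,8,1.29252e+06,0.00498439,0.00499791,0.00498815
"""

reader = csv.DictReader(io.StringIO(fixture))
results = {
    row['function']: float(row['max_mbytes_per_sec']) / 1000  # MB/s -> GB/s
    for row in reader
}
print(results)
# {'Copy': 1345.22, 'Triad': 1292.52}
```

These are the same GB/s values the test's `expected_metric_values` dictionary asserts for `STREAM_COPY` and `STREAM_TRIAD`.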