Unverified commit 74421ffe, authored by Ziyue Yang, committed by GitHub

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)

**Description**
This commit adds bidirectional tests to the gpu_copy benchmark for both device-host and device-device transfers, and revises the related tests.
parent fd2bc9e0
...@@ -186,11 +186,16 @@ Measure the memory copy bandwidth performed by GPU SM/DMA engine, including devi

#### Metrics

| Name | Unit | Description |
|---|---|---|
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing to one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
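The metric-name patterns in the table above can be sanity-checked with a small regex sketch. This is a hypothetical helper, not part of the benchmark; the patterns simply mirror the documented names:

```python
import re

# Regexes mirroring the documented gpu-copy metric-name patterns
# (uni- and bi-directional variants collapsed into one alternation each).
PATTERNS = [
    r'cpu_to_gpu[0-9]+_by_(sm|dma)_under_numa[0-9]+_(uni|bi)_bw',
    r'gpu[0-9]+_to_cpu_by_(sm|dma)_under_numa[0-9]+_(uni|bi)_bw',
    r'gpu[0-9]+_to_gpu[0-9]+_by_(sm|dma)_under_numa[0-9]+_(uni|bi)_bw',
    r'gpu[0-9]+_to_gpu[0-9]+_(read|write)_by_(sm|dma)_under_numa[0-9]+_(uni|bi)_bw',
]


def matches_metric(name):
    """Return True if `name` matches any documented gpu-copy metric pattern."""
    return any(re.fullmatch(p, name) for p in PATTERNS)


print(matches_metric('gpu0_to_gpu1_read_by_dma_under_numa0_uni_bw'))  # True
print(matches_metric('cpu_to_gpu0_by_sm_under_numa1_bi_bw'))          # True
```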
### `ib-loopback`
...
...@@ -18,6 +18,8 @@
# context = BenchmarkRegistry.create_benchmark_context(
#     'gpu-copy-bw', platform=Platform.ROCM, parameters='--mem_type htod dtoh dtod --copy_type sm dma'
# )
# For bidirectional tests, specify parameters as follows:
# parameters='--mem_type htod dtod --copy_type sm dma --bidirectional'
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
...
...@@ -61,6 +61,12 @@ def add_parser_arguments(self):
            help='Number of data buffer copies performed.',
        )
        self._parser.add_argument(
            '--bidirectional',
            action='store_true',
            help='Enable bidirectional test.',
        )

    def _preprocess(self):
        """Preprocess/preparation operations before the benchmarking.
...@@ -78,6 +84,9 @@ def _preprocess(self):
        for copy_type in self._args.copy_type:
            args += ' --%s_copy' % copy_type
        if self._args.bidirectional:
            args += ' --bidirectional'
        self._commands = ['%s %s' % (self.__bin_path, args)]
        return True
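The command assembly above can be sketched as a standalone function. This is a minimal sketch, not the actual SuperBench class; the `--htod`-style flag form for memory types is an assumption, while the `--sm_copy`/`--dma_copy` and `--bidirectional` forms come from the diff:

```python
def build_command(bin_path, mem_types, copy_types, bidirectional):
    """Assemble a gpu_copy command line from parsed arguments (sketch)."""
    args = ''
    for mem_type in mem_types:
        args += ' --%s' % mem_type          # e.g. --htod (assumed flag form)
    for copy_type in copy_types:
        args += ' --%s_copy' % copy_type    # e.g. --sm_copy, --dma_copy
    if bidirectional:
        args += ' --bidirectional'          # appended only when enabled
    return '%s%s' % (bin_path, args)


cmd = build_command('gpu_copy', ['htod', 'dtod'], ['sm', 'dma'], True)
print(cmd)  # gpu_copy --htod --dtod --sm_copy --dma_copy --bidirectional
```

Keeping `--bidirectional` as a `store_true` flag means unidirectional runs need no extra arguments, so existing configurations keep working unchanged.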
...
...@@ -32,7 +32,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        mem_types = ['htod', 'dtoh', 'dtod']
        copy_types = ['sm', 'dma']
        parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d --bidirectional' % \
            (' '.join(mem_types), ' '.join(copy_types), size, num_loops)
        benchmark = benchmark_class(benchmark_name, parameters=parameters)
...@@ -49,6 +49,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        assert (benchmark._args.copy_type == copy_types)
        assert (benchmark._args.size == size)
        assert (benchmark._args.num_loops == num_loops)
        assert (benchmark._args.bidirectional)

        # Check command
        assert (1 == len(benchmark._commands))
...@@ -59,6 +60,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
            assert ('--%s_copy' % copy_type in benchmark._commands[0])
            assert ('--size %d' % size in benchmark._commands[0])
            assert ('--num_loops %d' % num_loops in benchmark._commands[0])
            assert ('--bidirectional' in benchmark._commands[0])
    @decorator.cuda_test
    def test_gpu_copy_bw_performance_command_generation_cuda(self):
...@@ -70,7 +72,8 @@ def test_gpu_copy_bw_performance_command_generation_rocm(self):
        """Test gpu-copy benchmark command generation, ROCm case."""
        self._test_gpu_copy_bw_performance_command_generation(Platform.ROCM)

    @decorator.load_data('tests/data/gpu_copy_bw_performance.log')
    def _test_gpu_copy_bw_performance_result_parsing(self, platform, test_raw_output):
        """Test gpu-copy benchmark result parsing."""
        benchmark_name = 'gpu-copy-bw'
        (benchmark_class,
...@@ -85,20 +88,6 @@ def _test_gpu_copy_bw_performance_result_parsing(self, platform):
        assert (benchmark.type == BenchmarkType.MICRO)

        # Positive case - valid raw output.
test_raw_output = """
cpu_to_gpu0_by_gpu0_using_sm_under_numa0 26.1755
cpu_to_gpu0_by_gpu0_using_dma_under_numa0 26.1894
gpu0_to_cpu_by_gpu0_using_sm_under_numa0 5.72584
gpu0_to_cpu_by_gpu0_using_dma_under_numa0 26.2623
gpu0_to_gpu0_by_gpu0_using_sm_under_numa0 659.275
gpu0_to_gpu0_by_gpu0_using_dma_under_numa0 636.401
cpu_to_gpu0_by_gpu0_using_sm_under_numa1 26.1589
cpu_to_gpu0_by_gpu0_using_dma_under_numa1 26.18
gpu0_to_cpu_by_gpu0_using_sm_under_numa1 5.07597
gpu0_to_cpu_by_gpu0_using_dma_under_numa1 25.2851
gpu0_to_gpu0_by_gpu0_using_sm_under_numa1 656.825
gpu0_to_gpu0_by_gpu0_using_dma_under_numa1 634.203
"""
        assert (benchmark._process_raw_result(0, test_raw_output))
        assert (benchmark.return_code == ReturnCode.SUCCESS)
...
cpu_to_gpu0_by_sm_under_numa0_uni 26.1736
cpu_to_gpu0_by_dma_under_numa0_uni 26.1878
gpu0_to_cpu_by_sm_under_numa0_uni 5.01589
gpu0_to_cpu_by_dma_under_numa0_uni 21.8659
gpu0_to_gpu0_by_sm_under_numa0_uni 655.759
gpu0_to_gpu0_by_dma_under_numa0_uni 633.325
gpu0_to_gpu1_write_by_sm_under_numa0_uni 250.122
gpu0_to_gpu1_write_by_dma_under_numa0_uni 274.951
gpu0_to_gpu1_read_by_sm_under_numa0_uni 253.563
gpu0_to_gpu1_read_by_dma_under_numa0_uni 264.009
cpu_to_gpu1_by_sm_under_numa0_uni 26.187
cpu_to_gpu1_by_dma_under_numa0_uni 26.207
gpu1_to_cpu_by_sm_under_numa0_uni 5.01132
gpu1_to_cpu_by_dma_under_numa0_uni 21.8635
gpu1_to_gpu0_write_by_sm_under_numa0_uni 249.824
gpu1_to_gpu0_write_by_dma_under_numa0_uni 275.123
gpu1_to_gpu0_read_by_sm_under_numa0_uni 253.469
gpu1_to_gpu0_read_by_dma_under_numa0_uni 264.908
gpu1_to_gpu1_by_sm_under_numa0_uni 658.338
gpu1_to_gpu1_by_dma_under_numa0_uni 631.148
cpu_to_gpu0_by_sm_under_numa1_uni 26.1542
cpu_to_gpu0_by_dma_under_numa1_uni 26.2007
gpu0_to_cpu_by_sm_under_numa1_uni 5.67356
gpu0_to_cpu_by_dma_under_numa1_uni 21.8599
gpu0_to_gpu0_by_sm_under_numa1_uni 656.935
gpu0_to_gpu0_by_dma_under_numa1_uni 631.974
gpu0_to_gpu1_write_by_sm_under_numa1_uni 250.118
gpu0_to_gpu1_write_by_dma_under_numa1_uni 274.778
gpu0_to_gpu1_read_by_sm_under_numa1_uni 253.625
gpu0_to_gpu1_read_by_dma_under_numa1_uni 264.347
cpu_to_gpu1_by_sm_under_numa1_uni 26.1905
cpu_to_gpu1_by_dma_under_numa1_uni 26.2007
gpu1_to_cpu_by_sm_under_numa1_uni 5.67716
gpu1_to_cpu_by_dma_under_numa1_uni 21.8579
gpu1_to_gpu0_write_by_sm_under_numa1_uni 250.064
gpu1_to_gpu0_write_by_dma_under_numa1_uni 274.924
gpu1_to_gpu0_read_by_sm_under_numa1_uni 253.746
gpu1_to_gpu0_read_by_dma_under_numa1_uni 264.256
gpu1_to_gpu1_by_sm_under_numa1_uni 655.623
gpu1_to_gpu1_by_dma_under_numa1_uni 634.062
cpu_to_gpu0_by_sm_under_numa0_bi 8.45975
cpu_to_gpu0_by_dma_under_numa0_bi 36.4282
gpu0_to_gpu0_by_sm_under_numa0_bi 689.063
gpu0_to_gpu0_by_dma_under_numa0_bi 661.7
gpu0_to_gpu1_write_by_sm_under_numa0_bi 427.446
gpu0_to_gpu1_write_by_dma_under_numa0_bi 521.577
gpu0_to_gpu1_read_by_sm_under_numa0_bi 446.835
gpu0_to_gpu1_read_by_dma_under_numa0_bi 503.158
cpu_to_gpu1_by_sm_under_numa0_bi 8.4487
cpu_to_gpu1_by_dma_under_numa0_bi 36.4272
cpu_to_gpu0_by_sm_under_numa1_bi 9.36164
cpu_to_gpu0_by_dma_under_numa1_bi 36.411
gpu0_to_gpu0_by_sm_under_numa1_bi 688.156
gpu0_to_gpu0_by_dma_under_numa1_bi 662.077
gpu0_to_gpu1_write_by_sm_under_numa1_bi 427.033
gpu0_to_gpu1_write_by_dma_under_numa1_bi 521.367
gpu0_to_gpu1_read_by_sm_under_numa1_bi 446.179
gpu0_to_gpu1_read_by_dma_under_numa1_bi 503.843
cpu_to_gpu1_by_sm_under_numa1_bi 9.37368
cpu_to_gpu1_by_dma_under_numa1_bi 36.4128
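The raw output above is a flat list of `metric value` pairs, which can be parsed with a simple sketch. This is a hypothetical helper, not the benchmark's actual `_process_raw_result`; whether the `_bw` suffix seen in the documented metric names is appended during parsing is an assumption left out here:

```python
def parse_gpu_copy_output(raw_output):
    """Parse 'metric value' lines from gpu_copy raw output into a dict (sketch)."""
    results = {}
    for line in raw_output.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:                 # skip anything that is not a pair
            name, value = parts
            results[name] = float(value)    # bandwidth in GB/s
    return results


sample = ('cpu_to_gpu0_by_sm_under_numa0_uni 26.1736\n'
          'gpu0_to_cpu_by_dma_under_numa0_uni 21.8659')
parsed = parse_gpu_copy_output(sample)
```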