Unverified commit 74421ffe, authored by Ziyue Yang, committed by GitHub

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)

**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
parent fd2bc9e0
...@@ -186,11 +186,16 @@ Measure the memory copy bandwidth performed by GPU SM/DMA engine, including devi

#### Metrics

| Name | Unit | Description |
|------|------|-------------|
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM, reported under the device-to-host name. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
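The metric names above follow a fixed grammar. As a sketch, a hypothetical helper (not part of SuperBench) can parse any of them with a single regular expression:

```python
import re

# Hypothetical helper (not part of SuperBench): parse a gpu-copy-bw metric name
# following the naming scheme in the table above.
METRIC_RE = re.compile(
    r'(?P<src>cpu|gpu\d+)_to_(?P<dst>cpu|gpu\d+)'
    r'(?:_(?P<direction>read|write))?'
    r'_by_(?P<engine>sm|dma)_under_numa(?P<numa>\d+)_(?P<mode>uni|bi)_bw'
)

def parse_metric(name):
    """Return the components of a metric name, or None if it does not match."""
    m = METRIC_RE.fullmatch(name)
    return m.groupdict() if m else None
```

For example, `parse_metric('gpu0_to_gpu1_read_by_dma_under_numa0_bi_bw')` splits the name into source, destination, direction, copy engine, NUMA node, and uni/bi mode.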
### `ib-loopback` ### `ib-loopback`
...
...@@ -18,6 +18,8 @@
#     context = BenchmarkRegistry.create_benchmark_context(
#         'gpu-copy-bw', platform=Platform.ROCM, parameters='--mem_type htod dtoh dtod --copy_type sm dma'
#     )
#     For bidirectional tests, specify parameters as follows:
#     parameters='--mem_type htod dtod --copy_type sm dma --bidirectional'
    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
...
...@@ -61,6 +61,12 @@ def add_parser_arguments(self):
            help='Number of data buffer copies performed.',
        )
        self._parser.add_argument(
            '--bidirectional',
            action='store_true',
            help='Enable bidirectional test.',
        )

    def _preprocess(self):
        """Preprocess/preparation operations before the benchmarking.
...@@ -78,6 +84,9 @@ def _preprocess(self):
        for copy_type in self._args.copy_type:
            args += ' --%s_copy' % copy_type
        if self._args.bidirectional:
            args += ' --bidirectional'
        self._commands = ['%s %s' % (self.__bin_path, args)]
        return True
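The command assembled by `_preprocess` can be sketched as a standalone function. The flag spellings (`--size`, `--num_loops`, `--htod`/`--dtoh`/`--dtod`, `--<type>_copy`, `--bidirectional`) follow the gpu_copy usage string elsewhere in this diff; `bin_path` is a placeholder:

```python
# Sketch (assumed argument order) of the command line that _preprocess builds
# from parsed arguments; bin_path stands in for the resolved gpu_copy binary.
def build_command(bin_path, mem_types, copy_types, size, num_loops, bidirectional):
    args = '--size %d --num_loops %d' % (size, num_loops)
    for mem_type in mem_types:
        args += ' --%s' % mem_type
    for copy_type in copy_types:
        args += ' --%s_copy' % copy_type
    if bidirectional:
        args += ' --bidirectional'
    return '%s %s' % (bin_path, args)
```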
...
...@@ -15,8 +15,8 @@
#include <cuda.h>
#include <cuda_runtime.h>

// Arguments for each sub benchmark run.
struct SubBenchArgs {
    // Whether source device is GPU.
    bool is_src_dev_gpu = false;
...@@ -32,20 +32,6 @@ struct BenchArgs {
    // GPU IDs for worker device.
    int worker_gpu_id = 0;

    // Original data buffer.
    uint8_t *data_buf = nullptr;
...@@ -63,6 +49,34 @@ struct Buffers {
    // GPU pointer of the data buffer on destination devices.
    uint8_t *dst_dev_gpu_buf_ptr = nullptr;

    // CUDA stream to be used.
    cudaStream_t stream;
};

// Arguments for each benchmark run.
struct BenchArgs {
    // Max number of sub benchmarks.
    static const int kMaxNumSubs = 2;

    // Number of sub benchmarks in this benchmark run.
    // 1 for unidirectional, 2 for bidirectional.
    int num_subs = 0;

    // NUMA node under which the benchmark is done.
    uint64_t numa_id = 0;

    // Data buffer size used.
    uint64_t size = 0;

    // Number of loops to run.
    uint64_t num_loops = 0;

    // Uses SM copy, otherwise DMA copy.
    bool is_sm_copy = false;

    // Sub-benchmarks in parallel.
    SubBenchArgs subs[kMaxNumSubs];
};
// Options accepted by this program.
...@@ -87,9 +101,12 @@ struct Opts {
    // Whether device-to-device transfer needs to be evaluated.
    bool dtod_enabled = false;

    // Whether bidirectional transfer is enabled.
    bool bidirectional_enabled = false;
};

// Print usage of this program.
void PrintUsage() {
    printf("Usage: gpu_copy "
           "--size <size> "
...@@ -98,19 +115,31 @@ void PrintUsage() {
           "[--dma_copy] "
           "[--htod] "
           "[--dtoh] "
           "[--dtod] "
           "[--bidirectional]\n");
}

// Parse options of this program.
int ParseOpts(int argc, char **argv, Opts *opts) {
    enum class OptIdx {
        kSize,
        kNumIters,
        kEnableSmCopy,
        kEnableDmaCopy,
        kEnableHToD,
        kEnableDToH,
        kEnableDToD,
        kEnableBidirectional
    };
    const struct option options[] = {
        {"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
        {"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumIters)},
        {"sm_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableSmCopy)},
        {"dma_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDmaCopy)},
        {"htod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableHToD)},
        {"dtoh", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToH)},
        {"dtod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToD)},
        {"bidirectional", no_argument, nullptr, static_cast<int>(OptIdx::kEnableBidirectional)}};
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = false;
...@@ -159,6 +188,9 @@ int ParseOpts(int argc, char **argv, Opts *opts) {
        case static_cast<int>(OptIdx::kEnableDToD):
            opts->dtod_enabled = true;
            break;
        case static_cast<int>(OptIdx::kEnableBidirectional):
            opts->bidirectional_enabled = true;
            break;
        default:
            parse_err = true;
        }
...@@ -193,145 +225,182 @@ int SetGpu(int gpu_id) {
    return 0;
}

// Prepare data buffers and streams to be used.
int PrepareBufAndStream(BenchArgs *args) {
    cudaError_t cuda_err = cudaSuccess;
    constexpr int uint8_mod = 256;

    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];

        // Generate data to copy
        sub.data_buf = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
        for (int j = 0; j < args->size; j++) {
            sub.data_buf[j] = static_cast<uint8_t>(j % uint8_mod);
        }

        // Allocate check buffer
        sub.check_buf = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));

        // Allocate buffers for src/dst devices
        constexpr int num_devices = 2;
        bool is_dev_gpu[num_devices] = {sub.is_src_dev_gpu, sub.is_dst_dev_gpu};
        int dev_ids[num_devices] = {sub.src_gpu_id, sub.dst_gpu_id};
        uint8_t **host_buf_ptrs[num_devices] = {&(sub.src_dev_host_buf_ptr), &(sub.dst_dev_host_buf_ptr)};
        uint8_t **gpu_buf_ptrs[num_devices] = {&(sub.src_dev_gpu_buf_ptr), &(sub.dst_dev_gpu_buf_ptr)};
        for (int j = 0; j < num_devices; j++) {
            // Allocate buffers
            if (is_dev_gpu[j]) {
                // Set to buffer device for GPU buffer
                if (SetGpu(dev_ids[j])) {
                    return -1;
                }
                *(host_buf_ptrs[j]) = nullptr;
                cuda_err = cudaMalloc(gpu_buf_ptrs[j], args->size);
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "PrepareBufAndStream::cudaMalloc error: %d\n", cuda_err);
                    return -1;
                }
            } else {
                // Set to worker device for host memory buffer
                if (SetGpu(sub.worker_gpu_id)) {
                    return -1;
                }
                *(host_buf_ptrs[j]) = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
                cuda_err = cudaHostRegister(*(host_buf_ptrs[j]), args->size, cudaHostRegisterMapped);
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "PrepareBufAndStream::cudaHostRegister error: %d\n", cuda_err);
                    return -1;
                }
                cuda_err = cudaHostGetDevicePointer((void **)gpu_buf_ptrs[j], *(host_buf_ptrs[j]), 0);
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "PrepareBufAndStream::cudaHostGetDevicePointer error: %d\n", cuda_err);
                    return -1;
                }
            }
        }

        // Initialize source buffer
        if (sub.is_src_dev_gpu) {
            if (SetGpu(sub.src_gpu_id)) {
                return -1;
            }
        }
        cuda_err = cudaMemcpy(sub.src_dev_gpu_buf_ptr, sub.data_buf, args->size, cudaMemcpyDefault);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "PrepareBufAndStream::cudaMemcpy error: %d\n", cuda_err);
            return -1;
        }

        // Initialize stream on worker device
        if (SetGpu(sub.worker_gpu_id)) {
            return -1;
        }
        cuda_err = cudaStreamCreateWithFlags(&(sub.stream), cudaStreamNonBlocking);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "PrepareBufAndStream::cudaStreamCreate error: %d\n", cuda_err);
            return -1;
        }
    }

    return 0;
}
// Validate the result of data transfer.
int CheckBuf(BenchArgs *args) {
    cudaError_t cuda_err = cudaSuccess;
    int memcmp_result = 0;

    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];

        // Copy result
        memset(sub.check_buf, 0, args->size);
        if (SetGpu(sub.dst_gpu_id)) {
            return -1;
        }
        cuda_err = cudaMemcpy(sub.check_buf, sub.dst_dev_gpu_buf_ptr, args->size, cudaMemcpyDefault);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "CheckBuf::cudaMemcpy error: %d\n", cuda_err);
            return -1;
        }

        // Validate result
        memcmp_result = memcmp(sub.data_buf, sub.check_buf, args->size);
        if (memcmp_result) {
            fprintf(stderr, "CheckBuf: Memory check failed\n");
            return -1;
        }
    }

    return 0;
}
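The fill-and-verify scheme used by `PrepareBufAndStream` and `CheckBuf` is simple enough to restate in a few lines of Python: byte `j` of the source buffer holds `j % 256`, and validation is a plain byte-for-byte comparison after the copy.

```python
# Mirror of the buffer pattern above: byte j of the source buffer holds j % 256;
# validation compares the source pattern against the copied-back bytes.
def make_pattern(size):
    return bytes(j % 256 for j in range(size))

def verify_copy(data_buf, check_buf):
    return data_buf == check_buf
```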
// Destroy data buffers and streams
int DestroyBufAndStream(BenchArgs *args) {
    int ret = 0;
    cudaError_t cuda_err = cudaSuccess;

    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];

        // Destroy original data buffer and check buffer
        if (sub.data_buf != nullptr) {
            numa_free(sub.data_buf, args->size);
        }
        if (sub.check_buf != nullptr) {
            numa_free(sub.check_buf, args->size);
        }

        // Only destroy buffers for src/dst devices
        constexpr int num_devices = 2;
        bool is_dev_gpu[num_devices] = {sub.is_src_dev_gpu, sub.is_dst_dev_gpu};
        int dev_ids[num_devices] = {sub.src_gpu_id, sub.dst_gpu_id};
        uint8_t **host_buf_ptrs[num_devices] = {&(sub.src_dev_host_buf_ptr), &(sub.dst_dev_host_buf_ptr)};
        uint8_t **gpu_buf_ptrs[num_devices] = {&(sub.src_dev_gpu_buf_ptr), &(sub.dst_dev_gpu_buf_ptr)};
        for (int j = 0; j < num_devices; j++) {
            // Destroy buffers
            if (is_dev_gpu[j]) {
                if (*(gpu_buf_ptrs[j]) == nullptr) {
                    continue;
                }
                // Set to buffer device for GPU buffer
                if (SetGpu(dev_ids[j])) {
                    return -1;
                }
                cuda_err = cudaFree(*(gpu_buf_ptrs[j]));
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "DestroyBufAndStream::cudaFree error: %d\n", cuda_err);
                    ret = -1;
                }
                *(gpu_buf_ptrs[j]) = nullptr;
            } else {
                if (*(host_buf_ptrs[j]) == nullptr) {
                    continue;
                }
                // Set to worker device for host memory buffer
                if (SetGpu(sub.worker_gpu_id)) {
                    return -1;
                }
                cuda_err = cudaHostUnregister(*(host_buf_ptrs[j]));
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "DestroyBufAndStream::cudaHostUnregister error: %d\n", cuda_err);
                    ret = -1;
                }
                numa_free(*(host_buf_ptrs[j]), args->size);
                *(host_buf_ptrs[j]) = nullptr;
                *(gpu_buf_ptrs[j]) = nullptr;
            }
        }

        // Destroy stream on worker device
        if (SetGpu(sub.worker_gpu_id)) {
            return -1;
        }
        cuda_err = cudaStreamDestroy(sub.stream);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "DestroyBufAndStream::cudaStreamDestroy error: %d\n", cuda_err);
            return -1;
        }
    }
...@@ -388,81 +457,83 @@ __global__ void SMCopyKernel(ulong2 *tgt, const ulong2 *src) {
}

// Print result tag as <src_dev>_to_<dst_dev>[_read|_write]_by_<sm|dma>_under_<numa_node>_<uni|bi>.
void PrintResultTag(const BenchArgs &args) {
    if (args.subs[0].is_src_dev_gpu) {
        printf("gpu%d", args.subs[0].src_gpu_id);
    } else {
        printf("cpu");
    }
    printf("_to_");
    if (args.subs[0].is_dst_dev_gpu) {
        printf("gpu%d", args.subs[0].dst_gpu_id);
    } else {
        printf("cpu");
    }
    if (args.subs[0].is_src_dev_gpu && args.subs[0].is_dst_dev_gpu &&
        args.subs[0].src_gpu_id != args.subs[0].dst_gpu_id) {
        if (args.subs[0].src_gpu_id == args.subs[0].worker_gpu_id) {
            printf("_write");
        } else {
            printf("_read");
        }
    }
    printf("_by_%s_under_numa%lu", args.is_sm_copy ? "sm" : "dma", args.numa_id);
    if (args.num_subs == 1) {
        printf("_uni");
    } else {
        printf("_bi");
    }
}
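The tag grammar is easier to see in a short Python transcription of `PrintResultTag` (field names simplified; `src`/`dst` are `'cpu'` or an integer GPU id):

```python
# Python transcription of the PrintResultTag logic: worker is the GPU driving
# the copy, num_subs is 1 (unidirectional) or 2 (bidirectional).
def result_tag(src, dst, worker, is_sm_copy, numa_id, num_subs):
    tag = ('gpu%d' % src if src != 'cpu' else 'cpu') + '_to_'
    tag += 'gpu%d' % dst if dst != 'cpu' else 'cpu'
    if src != 'cpu' and dst != 'cpu' and src != dst:
        # Peer-to-peer: the worker writing its own data is a "write",
        # the worker pulling from its peer is a "read".
        tag += '_write' if src == worker else '_read'
    tag += '_by_%s_under_numa%d' % ('sm' if is_sm_copy else 'dma', numa_id)
    tag += '_uni' if num_subs == 1 else '_bi'
    return tag
```

The benchmark prints the measured bandwidth after this tag; the result parser presumably appends `_bw` to form the metric names listed in the documentation above.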
// Run copy benchmark.
int RunCopy(BenchArgs *args) {
    cudaError_t cuda_err = cudaSuccess;
    uint64_t num_thread_blocks;

    // Validate data size for SM copy
    if (args->is_sm_copy) {
        uint64_t num_elements_in_thread_block = NUM_LOOP_UNROLL * NUM_THREADS_IN_BLOCK;
        uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(ulong2);
        if (args->size % num_bytes_in_thread_block) {
            fprintf(stderr, "RunCopy: Data size should be multiple of %lu\n", num_bytes_in_thread_block);
            return -1;
        }
        num_thread_blocks = args->size / num_bytes_in_thread_block;
    }

    // Launch jobs and collect running time
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < args->num_loops; i++) {
        for (int j = 0; j < args->num_subs; j++) {
            SubBenchArgs &sub = args->subs[j];
            if (SetGpu(sub.worker_gpu_id)) {
                return -1;
            }
            if (args->is_sm_copy) {
                SMCopyKernel<<<num_thread_blocks, NUM_THREADS_IN_BLOCK, 0, sub.stream>>>(
                    reinterpret_cast<ulong2 *>(sub.dst_dev_gpu_buf_ptr),
                    reinterpret_cast<ulong2 *>(sub.src_dev_gpu_buf_ptr));
            } else {
                cudaMemcpyAsync(sub.dst_dev_gpu_buf_ptr, sub.src_dev_gpu_buf_ptr, args->size, cudaMemcpyDefault,
                                sub.stream);
            }
        }
    }
    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];
        cuda_err = cudaStreamSynchronize(sub.stream);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "RunCopy::cudaStreamSynchronize error: %d\n", cuda_err);
            return -1;
        }
    }
    auto end = std::chrono::steady_clock::now();

    // Calculate and display bandwidth if no problem
    double time_in_sec = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
    PrintResultTag(*args);
    printf(" %g\n", args->size * args->num_loops * args->num_subs / time_in_sec / 1e9);

    return 0;
}
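The bandwidth formula at the end of `RunCopy` counts the bytes moved by all parallel sub-benchmarks:

```python
# Bandwidth as computed in RunCopy: size * num_loops * num_subs bytes moved in
# time_in_sec seconds, reported in GB/s (1 GB = 1e9 bytes).
def bandwidth_gbps(size, num_loops, num_subs, time_in_sec):
    return size * num_loops * num_subs / time_in_sec / 1e9
```

A bidirectional run (`num_subs == 2`) therefore reports the aggregate of both directions, not the per-direction rate.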
...@@ -492,24 +563,73 @@ int EnablePeerAccess(int src_gpu_id, int dst_gpu_id, int *can_access) {
    return 0;
}

int RunBench(BenchArgs *args) {
    int ret = 0;
    int destroy_buf_ret = 0;
    ret = PrepareBufAndStream(args);
    if (ret == 0) {
        ret = RunCopy(args);
        if (ret == 0) {
            ret = CheckBuf(args);
        }
    }
    destroy_buf_ret = DestroyBufAndStream(args);
    if (ret == 0) {
        ret = destroy_buf_ret;
    }
    return ret;
}
void SetSubBenchArgsForHToD(int gpu_id, bool is_bidirectional, BenchArgs *args) {
    args->subs[0].is_src_dev_gpu = false;
    args->subs[0].is_dst_dev_gpu = true;
    args->subs[0].dst_gpu_id = gpu_id;
    args->subs[0].worker_gpu_id = gpu_id;
    if (is_bidirectional) {
        args->num_subs = 2;
        args->subs[1].is_src_dev_gpu = true;
        args->subs[1].is_dst_dev_gpu = false;
        args->subs[1].src_gpu_id = gpu_id;
        args->subs[1].worker_gpu_id = gpu_id;
    } else {
        args->num_subs = 1;
    }
}

void SetSubBenchArgsForDToH(int gpu_id, bool is_bidirectional, BenchArgs *args) {
    args->subs[0].is_src_dev_gpu = true;
    args->subs[0].is_dst_dev_gpu = false;
    args->subs[0].src_gpu_id = gpu_id;
    args->subs[0].worker_gpu_id = gpu_id;
    if (is_bidirectional) {
        args->num_subs = 2;
        args->subs[1].is_src_dev_gpu = false;
        args->subs[1].is_dst_dev_gpu = true;
        args->subs[1].dst_gpu_id = gpu_id;
        args->subs[1].worker_gpu_id = gpu_id;
    } else {
        args->num_subs = 1;
    }
}

void SetSubBenchArgsForDToD(int src_gpu_id, int dst_gpu_id, bool is_read, bool is_bidirectional, BenchArgs *args) {
    args->subs[0].is_src_dev_gpu = true;
    args->subs[0].is_dst_dev_gpu = true;
    args->subs[0].src_gpu_id = src_gpu_id;
    args->subs[0].dst_gpu_id = dst_gpu_id;
    args->subs[0].worker_gpu_id = is_read ? dst_gpu_id : src_gpu_id;
    if (is_bidirectional) {
        args->num_subs = 2;
        args->subs[1].is_src_dev_gpu = true;
        args->subs[1].is_dst_dev_gpu = true;
        args->subs[1].src_gpu_id = dst_gpu_id;
        args->subs[1].dst_gpu_id = src_gpu_id;
        args->subs[1].worker_gpu_id = is_read ? src_gpu_id : dst_gpu_id;
    } else {
        args->num_subs = 1;
    }
}
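The pairing rule implemented by `SetSubBenchArgsForDToD` can be sketched in Python (a hypothetical dict-based model, not SuperBench code): in read mode the worker is the destination GPU (it reads from its peer), in write mode it is the source GPU, and a bidirectional run adds a mirrored sub-benchmark with source and destination swapped.

```python
# Sketch of SetSubBenchArgsForDToD's sub-benchmark layout for peer-to-peer runs.
def dtod_subs(src_gpu, dst_gpu, is_read, bidirectional):
    def sub(src, dst):
        # The worker GPU drives the copy: reader pulls, writer pushes.
        return {'src': src, 'dst': dst, 'worker': dst if is_read else src}
    subs = [sub(src_gpu, dst_gpu)]
    if bidirectional:
        subs.append(sub(dst_gpu, src_gpu))
    return subs
```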
int main(int argc, char **argv) {
    int ret = 0;
    int numa_count = 0;
...@@ -546,55 +666,52 @@ int main(int argc, char **argv) {
    for (int j = 0; j < gpu_count; j++) {
        // Host-to-device benchmark
        if (opts.htod_enabled) {
            if (opts.sm_copy_enabled) {
                args.is_sm_copy = true;
                SetSubBenchArgsForHToD(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
            if (opts.dma_copy_enabled) {
                args.is_sm_copy = false;
                SetSubBenchArgsForHToD(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
        }

        // Device-to-host benchmark
        if (opts.dtoh_enabled) {
            if (opts.sm_copy_enabled) {
                args.is_sm_copy = true;
                SetSubBenchArgsForDToH(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
            if (opts.dma_copy_enabled) {
                args.is_sm_copy = false;
                SetSubBenchArgsForDToH(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
        }

        // Device-to-device benchmark
        if (opts.dtod_enabled) {
            // Scan all peers
            for (int k = 0; k < gpu_count; k++) {
                // Skip second half of the pair matrix for bidirectional test
                if (opts.bidirectional_enabled && j > k) {
                    continue;
                }
                // P2P write
                ret = EnablePeerAccess(j, k, &can_access);
                if (ret != 0) {
                    return -1;
                }
                if (can_access) {
                    if (opts.sm_copy_enabled) {
                        args.is_sm_copy = true;
                        SetSubBenchArgsForDToD(j, k, false, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                    if (opts.dma_copy_enabled) {
                        args.is_sm_copy = false;
                        SetSubBenchArgsForDToD(j, k, false, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                }
...@@ -607,13 +724,14 @@ int main(int argc, char **argv) {
                    return -1;
                }
                if (can_access) {
                    if (opts.sm_copy_enabled) {
                        args.is_sm_copy = true;
                        SetSubBenchArgsForDToD(j, k, true, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                    if (opts.dma_copy_enabled) {
                        args.is_sm_copy = false;
                        SetSubBenchArgsForDToD(j, k, true, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                }
...@@ -622,13 +740,13 @@ int main(int argc, char **argv) {
        }
    }

    for (BenchArgs &curr_args : args_list) {
        ret = numa_run_on_node(curr_args.numa_id);
        if (ret != 0) {
            fprintf(stderr, "main::numa_run_on_node error: %d\n", errno);
            return -1;
        }
        ret = RunBench(&curr_args);
        if (ret != 0) {
            return -1;
        }
...
@@ -32,7 +32,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        mem_types = ['htod', 'dtoh', 'dtod']
        copy_types = ['sm', 'dma']
-       parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d' % \
+       parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d --bidirectional' % \
            (' '.join(mem_types), ' '.join(copy_types), size, num_loops)
        benchmark = benchmark_class(benchmark_name, parameters=parameters)
@@ -49,6 +49,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        assert (benchmark._args.copy_type == copy_types)
        assert (benchmark._args.size == size)
        assert (benchmark._args.num_loops == num_loops)
+       assert (benchmark._args.bidirectional)
        # Check command
        assert (1 == len(benchmark._commands))
@@ -59,6 +60,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        assert ('--%s_copy' % copy_type in benchmark._commands[0])
        assert ('--size %d' % size in benchmark._commands[0])
        assert ('--num_loops %d' % num_loops in benchmark._commands[0])
+       assert ('--bidirectional' in benchmark._commands[0])

    @decorator.cuda_test
    def test_gpu_copy_bw_performance_command_generation_cuda(self):
@@ -70,7 +72,8 @@ def test_gpu_copy_bw_performance_command_generation_rocm(self):
        """Test gpu-copy benchmark command generation, ROCm case."""
        self._test_gpu_copy_bw_performance_command_generation(Platform.ROCM)

-   def _test_gpu_copy_bw_performance_result_parsing(self, platform):
+   @decorator.load_data('tests/data/gpu_copy_bw_performance.log')
+   def _test_gpu_copy_bw_performance_result_parsing(self, platform, test_raw_output):
        """Test gpu-copy benchmark result parsing."""
        benchmark_name = 'gpu-copy-bw'
        (benchmark_class,
@@ -85,20 +88,6 @@ def _test_gpu_copy_bw_performance_result_parsing(self, platform):
        assert (benchmark.type == BenchmarkType.MICRO)
        # Positive case - valid raw output.
-       test_raw_output = """
-cpu_to_gpu0_by_gpu0_using_sm_under_numa0 26.1755
-cpu_to_gpu0_by_gpu0_using_dma_under_numa0 26.1894
-gpu0_to_cpu_by_gpu0_using_sm_under_numa0 5.72584
-gpu0_to_cpu_by_gpu0_using_dma_under_numa0 26.2623
-gpu0_to_gpu0_by_gpu0_using_sm_under_numa0 659.275
-gpu0_to_gpu0_by_gpu0_using_dma_under_numa0 636.401
-cpu_to_gpu0_by_gpu0_using_sm_under_numa1 26.1589
-cpu_to_gpu0_by_gpu0_using_dma_under_numa1 26.18
-gpu0_to_cpu_by_gpu0_using_sm_under_numa1 5.07597
-gpu0_to_cpu_by_gpu0_using_dma_under_numa1 25.2851
-gpu0_to_gpu0_by_gpu0_using_sm_under_numa1 656.825
-gpu0_to_gpu0_by_gpu0_using_dma_under_numa1 634.203
-"""
        assert (benchmark._process_raw_result(0, test_raw_output))
        assert (benchmark.return_code == ReturnCode.SUCCESS)
...

New data file tests/data/gpu_copy_bw_performance.log:
cpu_to_gpu0_by_sm_under_numa0_uni 26.1736
cpu_to_gpu0_by_dma_under_numa0_uni 26.1878
gpu0_to_cpu_by_sm_under_numa0_uni 5.01589
gpu0_to_cpu_by_dma_under_numa0_uni 21.8659
gpu0_to_gpu0_by_sm_under_numa0_uni 655.759
gpu0_to_gpu0_by_dma_under_numa0_uni 633.325
gpu0_to_gpu1_write_by_sm_under_numa0_uni 250.122
gpu0_to_gpu1_write_by_dma_under_numa0_uni 274.951
gpu0_to_gpu1_read_by_sm_under_numa0_uni 253.563
gpu0_to_gpu1_read_by_dma_under_numa0_uni 264.009
cpu_to_gpu1_by_sm_under_numa0_uni 26.187
cpu_to_gpu1_by_dma_under_numa0_uni 26.207
gpu1_to_cpu_by_sm_under_numa0_uni 5.01132
gpu1_to_cpu_by_dma_under_numa0_uni 21.8635
gpu1_to_gpu0_write_by_sm_under_numa0_uni 249.824
gpu1_to_gpu0_write_by_dma_under_numa0_uni 275.123
gpu1_to_gpu0_read_by_sm_under_numa0_uni 253.469
gpu1_to_gpu0_read_by_dma_under_numa0_uni 264.908
gpu1_to_gpu1_by_sm_under_numa0_uni 658.338
gpu1_to_gpu1_by_dma_under_numa0_uni 631.148
cpu_to_gpu0_by_sm_under_numa1_uni 26.1542
cpu_to_gpu0_by_dma_under_numa1_uni 26.2007
gpu0_to_cpu_by_sm_under_numa1_uni 5.67356
gpu0_to_cpu_by_dma_under_numa1_uni 21.8599
gpu0_to_gpu0_by_sm_under_numa1_uni 656.935
gpu0_to_gpu0_by_dma_under_numa1_uni 631.974
gpu0_to_gpu1_write_by_sm_under_numa1_uni 250.118
gpu0_to_gpu1_write_by_dma_under_numa1_uni 274.778
gpu0_to_gpu1_read_by_sm_under_numa1_uni 253.625
gpu0_to_gpu1_read_by_dma_under_numa1_uni 264.347
cpu_to_gpu1_by_sm_under_numa1_uni 26.1905
cpu_to_gpu1_by_dma_under_numa1_uni 26.2007
gpu1_to_cpu_by_sm_under_numa1_uni 5.67716
gpu1_to_cpu_by_dma_under_numa1_uni 21.8579
gpu1_to_gpu0_write_by_sm_under_numa1_uni 250.064
gpu1_to_gpu0_write_by_dma_under_numa1_uni 274.924
gpu1_to_gpu0_read_by_sm_under_numa1_uni 253.746
gpu1_to_gpu0_read_by_dma_under_numa1_uni 264.256
gpu1_to_gpu1_by_sm_under_numa1_uni 655.623
gpu1_to_gpu1_by_dma_under_numa1_uni 634.062
cpu_to_gpu0_by_sm_under_numa0_bi 8.45975
cpu_to_gpu0_by_dma_under_numa0_bi 36.4282
gpu0_to_gpu0_by_sm_under_numa0_bi 689.063
gpu0_to_gpu0_by_dma_under_numa0_bi 661.7
gpu0_to_gpu1_write_by_sm_under_numa0_bi 427.446
gpu0_to_gpu1_write_by_dma_under_numa0_bi 521.577
gpu0_to_gpu1_read_by_sm_under_numa0_bi 446.835
gpu0_to_gpu1_read_by_dma_under_numa0_bi 503.158
cpu_to_gpu1_by_sm_under_numa0_bi 8.4487
cpu_to_gpu1_by_dma_under_numa0_bi 36.4272
cpu_to_gpu0_by_sm_under_numa1_bi 9.36164
cpu_to_gpu0_by_dma_under_numa1_bi 36.411
gpu0_to_gpu0_by_sm_under_numa1_bi 688.156
gpu0_to_gpu0_by_dma_under_numa1_bi 662.077
gpu0_to_gpu1_write_by_sm_under_numa1_bi 427.033
gpu0_to_gpu1_write_by_dma_under_numa1_bi 521.367
gpu0_to_gpu1_read_by_sm_under_numa1_bi 446.179
gpu0_to_gpu1_read_by_dma_under_numa1_bi 503.843
cpu_to_gpu1_by_sm_under_numa1_bi 9.37368
cpu_to_gpu1_by_dma_under_numa1_bi 36.4128
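The raw output above is a plain sequence of `<metric_name> <bandwidth>` pairs, which keeps result parsing simple. A minimal sketch of how such output could be turned into a metrics dict (illustrative only, not the actual superbench parser):

```python
def parse_gpu_copy_output(raw_output):
    """Parse `<metric_name> <bandwidth>` lines into a dict of floats.

    Lines that do not have exactly two whitespace-separated columns,
    or whose second column is not numeric, are skipped.
    """
    metrics = {}
    for line in raw_output.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # non-numeric second column
    return metrics


sample = """
cpu_to_gpu0_by_sm_under_numa0_uni 26.1736
gpu0_to_gpu1_write_by_dma_under_numa0_bi 521.577
"""
print(parse_gpu_copy_output(sample))
```

Metric-name suffixes (`_uni` vs `_bi`, `_sm` vs `_dma`, `numa[0-9]+`) can then be matched downstream to group results by direction mode, copy engine, and NUMA node.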