Unverified commit 74421ffe, authored by Ziyue Yang, committed by GitHub

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)

**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
parent fd2bc9e0
...@@ -186,11 +186,16 @@ Measure the memory copy bandwidth performed by GPU SM/DMA engine, including devi

#### Metrics

| Name | Unit | Description |
|------|------|-------------|
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM, reported under the device-to-host name. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
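The metric names above follow a fixed grammar. As a sketch, a hypothetical helper (not part of SuperBench) can parse any of them with a single regular expression:

```python
import re

# Hypothetical helper (not part of SuperBench): parse a gpu-copy-bw metric name
# following the naming scheme in the table above.
METRIC_RE = re.compile(
    r'(?P<src>cpu|gpu\d+)_to_(?P<dst>cpu|gpu\d+)'
    r'(?:_(?P<direction>read|write))?'
    r'_by_(?P<engine>sm|dma)_under_numa(?P<numa>\d+)_(?P<mode>uni|bi)_bw'
)

def parse_metric(name):
    """Return the components of a metric name, or None if it does not match."""
    m = METRIC_RE.fullmatch(name)
    return m.groupdict() if m else None
```

For example, `parse_metric('gpu0_to_gpu1_read_by_dma_under_numa0_bi_bw')` splits the name into source, destination, direction, copy engine, NUMA node, and uni/bi mode.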
### `ib-loopback` ### `ib-loopback`
...
...@@ -18,6 +18,8 @@
#     context = BenchmarkRegistry.create_benchmark_context(
#         'gpu-copy-bw', platform=Platform.ROCM, parameters='--mem_type htod dtoh dtod --copy_type sm dma'
#     )
#     For bidirectional tests, specify parameters as follows:
#     parameters='--mem_type htod dtod --copy_type sm dma --bidirectional'
    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
...
...@@ -61,6 +61,12 @@ def add_parser_arguments(self):
            help='Number of data buffer copies performed.',
        )
        self._parser.add_argument(
            '--bidirectional',
            action='store_true',
            help='Enable bidirectional test.',
        )

    def _preprocess(self):
        """Preprocess/preparation operations before the benchmarking.
...@@ -78,6 +84,9 @@ def _preprocess(self):
        for copy_type in self._args.copy_type:
            args += ' --%s_copy' % copy_type
        if self._args.bidirectional:
            args += ' --bidirectional'
        self._commands = ['%s %s' % (self.__bin_path, args)]
        return True
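The command assembled by `_preprocess` can be sketched as a standalone function. The flag spellings (`--size`, `--num_loops`, `--htod`/`--dtoh`/`--dtod`, `--<type>_copy`, `--bidirectional`) follow the gpu_copy usage string elsewhere in this diff; `bin_path` is a placeholder:

```python
# Sketch (assumed argument order) of the command line that _preprocess builds
# from parsed arguments; bin_path stands in for the resolved gpu_copy binary.
def build_command(bin_path, mem_types, copy_types, size, num_loops, bidirectional):
    args = '--size %d --num_loops %d' % (size, num_loops)
    for mem_type in mem_types:
        args += ' --%s' % mem_type
    for copy_type in copy_types:
        args += ' --%s_copy' % copy_type
    if bidirectional:
        args += ' --bidirectional'
    return '%s %s' % (bin_path, args)
```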
...
...@@ -15,8 +15,8 @@
#include <cuda.h>
#include <cuda_runtime.h>

// Arguments for each sub benchmark run.
struct SubBenchArgs {
    // Whether source device is GPU.
    bool is_src_dev_gpu = false;
...@@ -32,20 +32,6 @@ struct BenchArgs {
    // GPU IDs for worker device.
    int worker_gpu_id = 0;

    // Original data buffer.
    uint8_t *data_buf = nullptr;
...@@ -63,6 +49,34 @@ struct Buffers {
    // GPU pointer of the data buffer on destination devices.
    uint8_t *dst_dev_gpu_buf_ptr = nullptr;

    // CUDA stream to be used.
    cudaStream_t stream;
};

// Arguments for each benchmark run.
struct BenchArgs {
    // Max number of sub benchmarks.
    static const int kMaxNumSubs = 2;

    // Number of sub benchmarks in this benchmark run.
    // 1 for unidirectional, 2 for bidirectional.
    int num_subs = 0;

    // NUMA node under which the benchmark is done.
    uint64_t numa_id = 0;

    // Data buffer size used.
    uint64_t size = 0;

    // Number of loops to run.
    uint64_t num_loops = 0;

    // Uses SM copy, otherwise DMA copy.
    bool is_sm_copy = false;

    // Sub-benchmarks in parallel.
    SubBenchArgs subs[kMaxNumSubs];
};
// Options accepted by this program.
...@@ -87,9 +101,12 @@ struct Opts {
    // Whether device-to-device transfer needs to be evaluated.
    bool dtod_enabled = false;

    // Whether bidirectional transfer is enabled.
    bool bidirectional_enabled = false;
};

// Print usage of this program.
void PrintUsage() {
    printf("Usage: gpu_copy "
           "--size <size> "
...@@ -98,19 +115,31 @@ void PrintUsage() {
           "[--dma_copy] "
           "[--htod] "
           "[--dtoh] "
           "[--dtod] "
           "[--bidirectional]\n");
}

// Parse options of this program.
int ParseOpts(int argc, char **argv, Opts *opts) {
    enum class OptIdx {
        kSize,
        kNumIters,
        kEnableSmCopy,
        kEnableDmaCopy,
        kEnableHToD,
        kEnableDToH,
        kEnableDToD,
        kEnableBidirectional
    };
    const struct option options[] = {
        {"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
        {"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumIters)},
        {"sm_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableSmCopy)},
        {"dma_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDmaCopy)},
        {"htod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableHToD)},
        {"dtoh", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToH)},
        {"dtod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToD)},
        {"bidirectional", no_argument, nullptr, static_cast<int>(OptIdx::kEnableBidirectional)}};
    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = false;
...@@ -159,6 +188,9 @@ int ParseOpts(int argc, char **argv, Opts *opts) {
        case static_cast<int>(OptIdx::kEnableDToD):
            opts->dtod_enabled = true;
            break;
        case static_cast<int>(OptIdx::kEnableBidirectional):
            opts->bidirectional_enabled = true;
            break;
        default:
            parse_err = true;
        }
...@@ -193,145 +225,182 @@ int SetGpu(int gpu_id) {
    return 0;
}

// Prepare data buffers and streams to be used.
int PrepareBufAndStream(BenchArgs *args) {
    cudaError_t cuda_err = cudaSuccess;
    constexpr int uint8_mod = 256;

    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];

        // Generate data to copy
        sub.data_buf = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
        for (int j = 0; j < args->size; j++) {
            sub.data_buf[j] = static_cast<uint8_t>(j % uint8_mod);
        }

        // Allocate check buffer
        sub.check_buf = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));

        // Allocate buffers for src/dst devices
        constexpr int num_devices = 2;
        bool is_dev_gpu[num_devices] = {sub.is_src_dev_gpu, sub.is_dst_dev_gpu};
        int dev_ids[num_devices] = {sub.src_gpu_id, sub.dst_gpu_id};
        uint8_t **host_buf_ptrs[num_devices] = {&(sub.src_dev_host_buf_ptr), &(sub.dst_dev_host_buf_ptr)};
        uint8_t **gpu_buf_ptrs[num_devices] = {&(sub.src_dev_gpu_buf_ptr), &(sub.dst_dev_gpu_buf_ptr)};
        for (int j = 0; j < num_devices; j++) {
            // Allocate buffers
            if (is_dev_gpu[j]) {
                // Set to buffer device for GPU buffer
                if (SetGpu(dev_ids[j])) {
                    return -1;
                }
                *(host_buf_ptrs[j]) = nullptr;
                cuda_err = cudaMalloc(gpu_buf_ptrs[j], args->size);
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "PrepareBufAndStream::cudaMalloc error: %d\n", cuda_err);
                    return -1;
                }
            } else {
                // Set to worker device for host memory buffer
                if (SetGpu(sub.worker_gpu_id)) {
                    return -1;
                }
                *(host_buf_ptrs[j]) = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
                cuda_err = cudaHostRegister(*(host_buf_ptrs[j]), args->size, cudaHostRegisterMapped);
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "PrepareBufAndStream::cudaHostRegister error: %d\n", cuda_err);
                    return -1;
                }
                cuda_err = cudaHostGetDevicePointer((void **)gpu_buf_ptrs[j], *(host_buf_ptrs[j]), 0);
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "PrepareBufAndStream::cudaHostGetDevicePointer error: %d\n", cuda_err);
                    return -1;
                }
            }
        }

        // Initialize source buffer
        if (sub.is_src_dev_gpu) {
            if (SetGpu(sub.src_gpu_id)) {
                return -1;
            }
        }
        cuda_err = cudaMemcpy(sub.src_dev_gpu_buf_ptr, sub.data_buf, args->size, cudaMemcpyDefault);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "PrepareBufAndStream::cudaMemcpy error: %d\n", cuda_err);
            return -1;
        }

        // Initialize stream on worker device
        if (SetGpu(sub.worker_gpu_id)) {
            return -1;
        }
        cuda_err = cudaStreamCreateWithFlags(&(sub.stream), cudaStreamNonBlocking);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "PrepareBufAndStream::cudaStreamCreate error: %d\n", cuda_err);
            return -1;
        }
    }

    return 0;
}
// Validate the result of data transfer.
int CheckBuf(BenchArgs *args) {
    cudaError_t cuda_err = cudaSuccess;
    int memcmp_result = 0;

    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];

        // Copy result
        memset(sub.check_buf, 0, args->size);
        if (SetGpu(sub.dst_gpu_id)) {
            return -1;
        }
        cuda_err = cudaMemcpy(sub.check_buf, sub.dst_dev_gpu_buf_ptr, args->size, cudaMemcpyDefault);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "CheckBuf::cudaMemcpy error: %d\n", cuda_err);
            return -1;
        }

        // Validate result
        memcmp_result = memcmp(sub.data_buf, sub.check_buf, args->size);
        if (memcmp_result) {
            fprintf(stderr, "CheckBuf: Memory check failed\n");
            return -1;
        }
    }

    return 0;
}
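The fill-and-verify scheme used by `PrepareBufAndStream` and `CheckBuf` is simple enough to restate in a few lines of Python: byte `j` of the source buffer holds `j % 256`, and validation is a plain byte-for-byte comparison after the copy.

```python
# Mirror of the buffer pattern above: byte j of the source buffer holds j % 256;
# validation compares the source pattern against the copied-back bytes.
def make_pattern(size):
    return bytes(j % 256 for j in range(size))

def verify_copy(data_buf, check_buf):
    return data_buf == check_buf
```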
// Destroy data buffers and streams
int DestroyBufAndStream(BenchArgs *args) {
    int ret = 0;
    cudaError_t cuda_err = cudaSuccess;

    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];

        // Destroy original data buffer and check buffer
        if (sub.data_buf != nullptr) {
            numa_free(sub.data_buf, args->size);
        }
        if (sub.check_buf != nullptr) {
            numa_free(sub.check_buf, args->size);
        }

        // Only destroy buffers for src/dst devices
        constexpr int num_devices = 2;
        bool is_dev_gpu[num_devices] = {sub.is_src_dev_gpu, sub.is_dst_dev_gpu};
        int dev_ids[num_devices] = {sub.src_gpu_id, sub.dst_gpu_id};
        uint8_t **host_buf_ptrs[num_devices] = {&(sub.src_dev_host_buf_ptr), &(sub.dst_dev_host_buf_ptr)};
        uint8_t **gpu_buf_ptrs[num_devices] = {&(sub.src_dev_gpu_buf_ptr), &(sub.dst_dev_gpu_buf_ptr)};
        for (int j = 0; j < num_devices; j++) {
            // Destroy buffers
            if (is_dev_gpu[j]) {
                if (*(gpu_buf_ptrs[j]) == nullptr) {
                    continue;
                }
                // Set to buffer device for GPU buffer
                if (SetGpu(dev_ids[j])) {
                    return -1;
                }
                cuda_err = cudaFree(*(gpu_buf_ptrs[j]));
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "DestroyBufAndStream::cudaFree error: %d\n", cuda_err);
                    ret = -1;
                }
                *(gpu_buf_ptrs[j]) = nullptr;
            } else {
                if (*(host_buf_ptrs[j]) == nullptr) {
                    continue;
                }
                // Set to worker device for host memory buffer
                if (SetGpu(sub.worker_gpu_id)) {
                    return -1;
                }
                cuda_err = cudaHostUnregister(*(host_buf_ptrs[j]));
                if (cuda_err != cudaSuccess) {
                    fprintf(stderr, "DestroyBufAndStream::cudaHostUnregister error: %d\n", cuda_err);
                    ret = -1;
                }
                numa_free(*(host_buf_ptrs[j]), args->size);
                *(host_buf_ptrs[j]) = nullptr;
                *(gpu_buf_ptrs[j]) = nullptr;
            }
        }

        // Destroy stream on worker device
        if (SetGpu(sub.worker_gpu_id)) {
            return -1;
        }
        cuda_err = cudaStreamDestroy(sub.stream);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "DestroyBufAndStream::cudaStreamDestroy error: %d\n", cuda_err);
            return -1;
        }
    }
...@@ -388,81 +457,83 @@ __global__ void SMCopyKernel(ulong2 *tgt, const ulong2 *src) {
}

// Print result tag as <src_dev>_to_<dst_dev>[_read|_write]_by_<sm|dma>_under_<numa_node>_<uni|bi>.
void PrintResultTag(const BenchArgs &args) {
    if (args.subs[0].is_src_dev_gpu) {
        printf("gpu%d", args.subs[0].src_gpu_id);
    } else {
        printf("cpu");
    }
    printf("_to_");
    if (args.subs[0].is_dst_dev_gpu) {
        printf("gpu%d", args.subs[0].dst_gpu_id);
    } else {
        printf("cpu");
    }
    if (args.subs[0].is_src_dev_gpu && args.subs[0].is_dst_dev_gpu &&
        args.subs[0].src_gpu_id != args.subs[0].dst_gpu_id) {
        if (args.subs[0].src_gpu_id == args.subs[0].worker_gpu_id) {
            printf("_write");
        } else {
            printf("_read");
        }
    }
    printf("_by_%s_under_numa%lu", args.is_sm_copy ? "sm" : "dma", args.numa_id);
    if (args.num_subs == 1) {
        printf("_uni");
    } else {
        printf("_bi");
    }
}
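The tag grammar is easier to see in a short Python transcription of `PrintResultTag` (field names simplified; `src`/`dst` are `'cpu'` or an integer GPU id):

```python
# Python transcription of the PrintResultTag logic: worker is the GPU driving
# the copy, num_subs is 1 (unidirectional) or 2 (bidirectional).
def result_tag(src, dst, worker, is_sm_copy, numa_id, num_subs):
    tag = ('gpu%d' % src if src != 'cpu' else 'cpu') + '_to_'
    tag += 'gpu%d' % dst if dst != 'cpu' else 'cpu'
    if src != 'cpu' and dst != 'cpu' and src != dst:
        # Peer-to-peer: the worker writing its own data is a "write",
        # the worker pulling from its peer is a "read".
        tag += '_write' if src == worker else '_read'
    tag += '_by_%s_under_numa%d' % ('sm' if is_sm_copy else 'dma', numa_id)
    tag += '_uni' if num_subs == 1 else '_bi'
    return tag
```

The benchmark prints the measured bandwidth after this tag; the result parser presumably appends `_bw` to form the metric names listed in the documentation above.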
// Run copy benchmark.
int RunCopy(BenchArgs *args) {
    cudaError_t cuda_err = cudaSuccess;
    uint64_t num_thread_blocks;

    // Validate data size for SM copy
    if (args->is_sm_copy) {
        uint64_t num_elements_in_thread_block = NUM_LOOP_UNROLL * NUM_THREADS_IN_BLOCK;
        uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(ulong2);
        if (args->size % num_bytes_in_thread_block) {
            fprintf(stderr, "RunCopy: Data size should be multiple of %lu\n", num_bytes_in_thread_block);
            return -1;
        }
        num_thread_blocks = args->size / num_bytes_in_thread_block;
    }

    // Launch jobs and collect running time
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < args->num_loops; i++) {
        for (int j = 0; j < args->num_subs; j++) {
            SubBenchArgs &sub = args->subs[j];
            if (SetGpu(sub.worker_gpu_id)) {
                return -1;
            }
            if (args->is_sm_copy) {
                SMCopyKernel<<<num_thread_blocks, NUM_THREADS_IN_BLOCK, 0, sub.stream>>>(
                    reinterpret_cast<ulong2 *>(sub.dst_dev_gpu_buf_ptr),
                    reinterpret_cast<ulong2 *>(sub.src_dev_gpu_buf_ptr));
            } else {
                cudaMemcpyAsync(sub.dst_dev_gpu_buf_ptr, sub.src_dev_gpu_buf_ptr, args->size, cudaMemcpyDefault,
                                sub.stream);
            }
        }
    }
    for (int i = 0; i < args->num_subs; i++) {
        SubBenchArgs &sub = args->subs[i];
        cuda_err = cudaStreamSynchronize(sub.stream);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "RunCopy::cudaStreamSynchronize error: %d\n", cuda_err);
            return -1;
        }
    }
    auto end = std::chrono::steady_clock::now();

    // Calculate and display bandwidth if no problem
    double time_in_sec = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
    PrintResultTag(*args);
    printf(" %g\n", args->size * args->num_loops * args->num_subs / time_in_sec / 1e9);

    return 0;
}
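The bandwidth formula at the end of `RunCopy` counts the bytes moved by all parallel sub-benchmarks:

```python
# Bandwidth as computed in RunCopy: size * num_loops * num_subs bytes moved in
# time_in_sec seconds, reported in GB/s (1 GB = 1e9 bytes).
def bandwidth_gbps(size, num_loops, num_subs, time_in_sec):
    return size * num_loops * num_subs / time_in_sec / 1e9
```

A bidirectional run (`num_subs == 2`) therefore reports the aggregate of both directions, not the per-direction rate.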
...@@ -492,24 +563,73 @@ int EnablePeerAccess(int src_gpu_id, int dst_gpu_id, int *can_access) {
    return 0;
}

int RunBench(BenchArgs *args) {
    int ret = 0;
    int destroy_buf_ret = 0;
    ret = PrepareBufAndStream(args);
    if (ret == 0) {
        ret = RunCopy(args);
        if (ret == 0) {
            ret = CheckBuf(args);
        }
    }
    destroy_buf_ret = DestroyBufAndStream(args);
    if (ret == 0) {
        ret = destroy_buf_ret;
    }
    return ret;
}
void SetSubBenchArgsForHToD(int gpu_id, bool is_bidirectional, BenchArgs *args) {
    args->subs[0].is_src_dev_gpu = false;
    args->subs[0].is_dst_dev_gpu = true;
    args->subs[0].dst_gpu_id = gpu_id;
    args->subs[0].worker_gpu_id = gpu_id;
    if (is_bidirectional) {
        args->num_subs = 2;
        args->subs[1].is_src_dev_gpu = true;
        args->subs[1].is_dst_dev_gpu = false;
        args->subs[1].src_gpu_id = gpu_id;
        args->subs[1].worker_gpu_id = gpu_id;
    } else {
        args->num_subs = 1;
    }
}

void SetSubBenchArgsForDToH(int gpu_id, bool is_bidirectional, BenchArgs *args) {
    args->subs[0].is_src_dev_gpu = true;
    args->subs[0].is_dst_dev_gpu = false;
    args->subs[0].src_gpu_id = gpu_id;
    args->subs[0].worker_gpu_id = gpu_id;
    if (is_bidirectional) {
        args->num_subs = 2;
        args->subs[1].is_src_dev_gpu = false;
        args->subs[1].is_dst_dev_gpu = true;
        args->subs[1].dst_gpu_id = gpu_id;
        args->subs[1].worker_gpu_id = gpu_id;
    } else {
        args->num_subs = 1;
    }
}

void SetSubBenchArgsForDToD(int src_gpu_id, int dst_gpu_id, bool is_read, bool is_bidirectional, BenchArgs *args) {
    args->subs[0].is_src_dev_gpu = true;
    args->subs[0].is_dst_dev_gpu = true;
    args->subs[0].src_gpu_id = src_gpu_id;
    args->subs[0].dst_gpu_id = dst_gpu_id;
    args->subs[0].worker_gpu_id = is_read ? dst_gpu_id : src_gpu_id;
    if (is_bidirectional) {
        args->num_subs = 2;
        args->subs[1].is_src_dev_gpu = true;
        args->subs[1].is_dst_dev_gpu = true;
        args->subs[1].src_gpu_id = dst_gpu_id;
        args->subs[1].dst_gpu_id = src_gpu_id;
        args->subs[1].worker_gpu_id = is_read ? src_gpu_id : dst_gpu_id;
    } else {
        args->num_subs = 1;
    }
}
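The pairing rule implemented by `SetSubBenchArgsForDToD` can be sketched in Python (a hypothetical dict-based model, not SuperBench code): in read mode the worker is the destination GPU (it reads from its peer), in write mode it is the source GPU, and a bidirectional run adds a mirrored sub-benchmark with source and destination swapped.

```python
# Sketch of SetSubBenchArgsForDToD's sub-benchmark layout for peer-to-peer runs.
def dtod_subs(src_gpu, dst_gpu, is_read, bidirectional):
    def sub(src, dst):
        # The worker GPU drives the copy: reader pulls, writer pushes.
        return {'src': src, 'dst': dst, 'worker': dst if is_read else src}
    subs = [sub(src_gpu, dst_gpu)]
    if bidirectional:
        subs.append(sub(dst_gpu, src_gpu))
    return subs
```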
int main(int argc, char **argv) {
    int ret = 0;
    int numa_count = 0;
...@@ -546,55 +666,52 @@ int main(int argc, char **argv) {
    for (int j = 0; j < gpu_count; j++) {
        // Host-to-device benchmark
        if (opts.htod_enabled) {
            if (opts.sm_copy_enabled) {
                args.is_sm_copy = true;
                SetSubBenchArgsForHToD(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
            if (opts.dma_copy_enabled) {
                args.is_sm_copy = false;
                SetSubBenchArgsForHToD(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
        }

        // Device-to-host benchmark
        if (opts.dtoh_enabled) {
            if (opts.sm_copy_enabled) {
                args.is_sm_copy = true;
                SetSubBenchArgsForDToH(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
            if (opts.dma_copy_enabled) {
                args.is_sm_copy = false;
                SetSubBenchArgsForDToH(j, opts.bidirectional_enabled, &args);
                args_list.push_back(args);
            }
        }

        // Device-to-device benchmark
        if (opts.dtod_enabled) {
            // Scan all peers
            for (int k = 0; k < gpu_count; k++) {
                // Skip second half of the pair matrix for bidirectional test
                if (opts.bidirectional_enabled && j > k) {
                    continue;
                }
                // P2P write
                ret = EnablePeerAccess(j, k, &can_access);
                if (ret != 0) {
                    return -1;
                }
                if (can_access) {
                    if (opts.sm_copy_enabled) {
                        args.is_sm_copy = true;
                        SetSubBenchArgsForDToD(j, k, false, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                    if (opts.dma_copy_enabled) {
                        args.is_sm_copy = false;
                        SetSubBenchArgsForDToD(j, k, false, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                }
...@@ -607,13 +724,14 @@ int main(int argc, char **argv) {
                    return -1;
                }
                if (can_access) {
                    if (opts.sm_copy_enabled) {
                        args.is_sm_copy = true;
                        SetSubBenchArgsForDToD(j, k, true, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                    if (opts.dma_copy_enabled) {
                        args.is_sm_copy = false;
                        SetSubBenchArgsForDToD(j, k, true, opts.bidirectional_enabled, &args);
                        args_list.push_back(args);
                    }
                }
...@@ -622,13 +740,13 @@ int main(int argc, char **argv) {
        }
    }

    for (BenchArgs &curr_args : args_list) {
        ret = numa_run_on_node(curr_args.numa_id);
        if (ret != 0) {
            fprintf(stderr, "main::numa_run_on_node error: %d\n", errno);
            return -1;
        }
        ret = RunBench(&curr_args);
        if (ret != 0) {
            return -1;
        }
...
@@ -32,7 +32,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        mem_types = ['htod', 'dtoh', 'dtod']
        copy_types = ['sm', 'dma']
-       parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d' % \
+       parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d --bidirectional' % \
            (' '.join(mem_types), ' '.join(copy_types), size, num_loops)
        benchmark = benchmark_class(benchmark_name, parameters=parameters)
@@ -49,6 +49,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        assert (benchmark._args.copy_type == copy_types)
        assert (benchmark._args.size == size)
        assert (benchmark._args.num_loops == num_loops)
+       assert (benchmark._args.bidirectional)
        # Check command
        assert (1 == len(benchmark._commands))
@@ -59,6 +60,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
        assert ('--%s_copy' % copy_type in benchmark._commands[0])
        assert ('--size %d' % size in benchmark._commands[0])
        assert ('--num_loops %d' % num_loops in benchmark._commands[0])
+       assert ('--bidirectional' in benchmark._commands[0])

    @decorator.cuda_test
    def test_gpu_copy_bw_performance_command_generation_cuda(self):
@@ -70,7 +72,8 @@ def test_gpu_copy_bw_performance_command_generation_rocm(self):
        """Test gpu-copy benchmark command generation, ROCm case."""
        self._test_gpu_copy_bw_performance_command_generation(Platform.ROCM)

-   def _test_gpu_copy_bw_performance_result_parsing(self, platform):
+   @decorator.load_data('tests/data/gpu_copy_bw_performance.log')
+   def _test_gpu_copy_bw_performance_result_parsing(self, platform, test_raw_output):
        """Test gpu-copy benchmark result parsing."""
        benchmark_name = 'gpu-copy-bw'
        (benchmark_class,
@@ -85,20 +88,6 @@ def _test_gpu_copy_bw_performance_result_parsing(self, platform):
        assert (benchmark.type == BenchmarkType.MICRO)
        # Positive case - valid raw output.
-       test_raw_output = """
-cpu_to_gpu0_by_gpu0_using_sm_under_numa0 26.1755
-cpu_to_gpu0_by_gpu0_using_dma_under_numa0 26.1894
-gpu0_to_cpu_by_gpu0_using_sm_under_numa0 5.72584
-gpu0_to_cpu_by_gpu0_using_dma_under_numa0 26.2623
-gpu0_to_gpu0_by_gpu0_using_sm_under_numa0 659.275
-gpu0_to_gpu0_by_gpu0_using_dma_under_numa0 636.401
-cpu_to_gpu0_by_gpu0_using_sm_under_numa1 26.1589
-cpu_to_gpu0_by_gpu0_using_dma_under_numa1 26.18
-gpu0_to_cpu_by_gpu0_using_sm_under_numa1 5.07597
-gpu0_to_cpu_by_gpu0_using_dma_under_numa1 25.2851
-gpu0_to_gpu0_by_gpu0_using_sm_under_numa1 656.825
-gpu0_to_gpu0_by_gpu0_using_dma_under_numa1 634.203
-"""
        assert (benchmark._process_raw_result(0, test_raw_output))
        assert (benchmark.return_code == ReturnCode.SUCCESS)
...

New data file tests/data/gpu_copy_bw_performance.log:
cpu_to_gpu0_by_sm_under_numa0_uni 26.1736
cpu_to_gpu0_by_dma_under_numa0_uni 26.1878
gpu0_to_cpu_by_sm_under_numa0_uni 5.01589
gpu0_to_cpu_by_dma_under_numa0_uni 21.8659
gpu0_to_gpu0_by_sm_under_numa0_uni 655.759
gpu0_to_gpu0_by_dma_under_numa0_uni 633.325
gpu0_to_gpu1_write_by_sm_under_numa0_uni 250.122
gpu0_to_gpu1_write_by_dma_under_numa0_uni 274.951
gpu0_to_gpu1_read_by_sm_under_numa0_uni 253.563
gpu0_to_gpu1_read_by_dma_under_numa0_uni 264.009
cpu_to_gpu1_by_sm_under_numa0_uni 26.187
cpu_to_gpu1_by_dma_under_numa0_uni 26.207
gpu1_to_cpu_by_sm_under_numa0_uni 5.01132
gpu1_to_cpu_by_dma_under_numa0_uni 21.8635
gpu1_to_gpu0_write_by_sm_under_numa0_uni 249.824
gpu1_to_gpu0_write_by_dma_under_numa0_uni 275.123
gpu1_to_gpu0_read_by_sm_under_numa0_uni 253.469
gpu1_to_gpu0_read_by_dma_under_numa0_uni 264.908
gpu1_to_gpu1_by_sm_under_numa0_uni 658.338
gpu1_to_gpu1_by_dma_under_numa0_uni 631.148
cpu_to_gpu0_by_sm_under_numa1_uni 26.1542
cpu_to_gpu0_by_dma_under_numa1_uni 26.2007
gpu0_to_cpu_by_sm_under_numa1_uni 5.67356
gpu0_to_cpu_by_dma_under_numa1_uni 21.8599
gpu0_to_gpu0_by_sm_under_numa1_uni 656.935
gpu0_to_gpu0_by_dma_under_numa1_uni 631.974
gpu0_to_gpu1_write_by_sm_under_numa1_uni 250.118
gpu0_to_gpu1_write_by_dma_under_numa1_uni 274.778
gpu0_to_gpu1_read_by_sm_under_numa1_uni 253.625
gpu0_to_gpu1_read_by_dma_under_numa1_uni 264.347
cpu_to_gpu1_by_sm_under_numa1_uni 26.1905
cpu_to_gpu1_by_dma_under_numa1_uni 26.2007
gpu1_to_cpu_by_sm_under_numa1_uni 5.67716
gpu1_to_cpu_by_dma_under_numa1_uni 21.8579
gpu1_to_gpu0_write_by_sm_under_numa1_uni 250.064
gpu1_to_gpu0_write_by_dma_under_numa1_uni 274.924
gpu1_to_gpu0_read_by_sm_under_numa1_uni 253.746
gpu1_to_gpu0_read_by_dma_under_numa1_uni 264.256
gpu1_to_gpu1_by_sm_under_numa1_uni 655.623
gpu1_to_gpu1_by_dma_under_numa1_uni 634.062
cpu_to_gpu0_by_sm_under_numa0_bi 8.45975
cpu_to_gpu0_by_dma_under_numa0_bi 36.4282
gpu0_to_gpu0_by_sm_under_numa0_bi 689.063
gpu0_to_gpu0_by_dma_under_numa0_bi 661.7
gpu0_to_gpu1_write_by_sm_under_numa0_bi 427.446
gpu0_to_gpu1_write_by_dma_under_numa0_bi 521.577
gpu0_to_gpu1_read_by_sm_under_numa0_bi 446.835
gpu0_to_gpu1_read_by_dma_under_numa0_bi 503.158
cpu_to_gpu1_by_sm_under_numa0_bi 8.4487
cpu_to_gpu1_by_dma_under_numa0_bi 36.4272
cpu_to_gpu0_by_sm_under_numa1_bi 9.36164
cpu_to_gpu0_by_dma_under_numa1_bi 36.411
gpu0_to_gpu0_by_sm_under_numa1_bi 688.156
gpu0_to_gpu0_by_dma_under_numa1_bi 662.077
gpu0_to_gpu1_write_by_sm_under_numa1_bi 427.033
gpu0_to_gpu1_write_by_dma_under_numa1_bi 521.367
gpu0_to_gpu1_read_by_sm_under_numa1_bi 446.179
gpu0_to_gpu1_read_by_dma_under_numa1_bi 503.843
cpu_to_gpu1_by_sm_under_numa1_bi 9.37368
cpu_to_gpu1_by_dma_under_numa1_bi 36.4128
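The raw output above is a plain sequence of `<metric_name> <bandwidth>` pairs, which keeps result parsing simple. A minimal sketch of how such output could be turned into a metrics dict (illustrative only, not the actual superbench parser):

```python
def parse_gpu_copy_output(raw_output):
    """Parse `<metric_name> <bandwidth>` lines into a dict of floats.

    Lines that do not have exactly two whitespace-separated columns,
    or whose second column is not numeric, are skipped.
    """
    metrics = {}
    for line in raw_output.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # non-numeric second column
    return metrics


sample = """
cpu_to_gpu0_by_sm_under_numa0_uni 26.1736
gpu0_to_gpu1_write_by_dma_under_numa0_bi 521.577
"""
print(parse_gpu_copy_output(sample))
```

Metric-name suffixes (`_uni` vs `_bi`, `_sm` vs `_dma`, `numa[0-9]+`) can then be matched downstream to group results by direction mode, copy engine, and NUMA node.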