Unverified Commit 74421ffe authored by Ziyue Yang, committed by GitHub

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285)

**Description**
This commit adds bidirectional tests to the gpu_copy benchmark for both host-device and device-to-device transfers, and revises the related tests.
parent fd2bc9e0
......@@ -186,11 +186,16 @@ Measure the memory copy bandwidth performed by GPU SM/DMA engine, including devi
#### Metrics
| Name | Unit | Description |
|-------------------------------------------------------------------------------|------------------|----------------------------------------------------------------------------------------------------------------------------|
| cpu\_to\_gpu[0-9]+\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+_bw | bandwidth (GB/s) | The bandwidth reading from all NUMA nodes' host memory using DMA engine or GPU SM by all GPUs. |
| gpu[0-9]+\_to\_cpu\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+_bw | bandwidth (GB/s) | The bandwidth writing to all NUMA nodes' host memory using DMA engine or GPU SM by all GPUs. |
| gpu[0-9]+\_to_gpu[0-9]+\_by\_gpu[0-9]+\_using\_(sm\|dma)\_under_numa[0-9]+_bw | bandwidth (GB/s) | The bandwidth reading from or writing to all GPUs using DMA engine or GPU SM by all GPUs with peer communication enabled. |
| Name | Unit | Description |
|------------------------------------------------------------------------------------|------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_uni\_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_cpu\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+\_to\_gpu[0-9]+\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+\_to\_gpu[0-9]+\_(read\|write)\_by\_(sm\|dma)\_under\_numa[0-9]+\_bi\_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
### `ib-loopback`
......
......@@ -18,6 +18,8 @@
# context = BenchmarkRegistry.create_benchmark_context(
# 'gpu-copy-bw', platform=Platform.ROCM, parameters='--mem_type htod dtoh dtod --copy_type sm dma'
# )
# For bidirectional tests, specify parameters as follows.
# parameters='--mem_type htod dtod --copy_type sm dma --bidirectional'
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
......
......@@ -61,6 +61,12 @@ def add_parser_arguments(self):
help='Number of data buffer copies performed.',
)
self._parser.add_argument(
'--bidirectional',
action='store_true',
help='Enable bidirectional test.',
)
def _preprocess(self):
"""Preprocess/preparation operations before the benchmarking.
......@@ -78,6 +84,9 @@ def _preprocess(self):
for copy_type in self._args.copy_type:
args += ' --%s_copy' % copy_type
if self._args.bidirectional:
args += ' --bidirectional'
self._commands = ['%s %s' % (self.__bin_path, args)]
return True
......
......@@ -15,8 +15,8 @@
#include <cuda.h>
#include <cuda_runtime.h>
// Arguments for each benchmark run.
struct BenchArgs {
// Arguments for each sub benchmark run.
struct SubBenchArgs {
// Whether source device is GPU.
bool is_src_dev_gpu = false;
......@@ -32,20 +32,6 @@ struct BenchArgs {
// GPU IDs for worker device.
int worker_gpu_id = 0;
// Uses SM copy, otherwise DMA copy.
bool is_sm_copy = false;
// NUMA node under which the benchmark is done.
uint64_t numa_id = 0;
// Data buffer size used.
uint64_t size = 0;
// Number of loops to run.
uint64_t num_loops = 0;
};
struct Buffers {
// Original data buffer.
uint8_t *data_buf = nullptr;
......@@ -63,6 +49,34 @@ struct Buffers {
// GPU pointer of the data buffer on destination devices.
uint8_t *dst_dev_gpu_buf_ptr = nullptr;
// CUDA stream to be used.
cudaStream_t stream;
};
// Arguments for each benchmark run.
struct BenchArgs {
// Max number of sub benchmarks.
static const int kMaxNumSubs = 2;
// Number of sub benchmarks in this benchmark run.
// 1 for unidirectional, 2 for bidirectional.
int num_subs = 0;
// NUMA node under which the benchmark is done.
uint64_t numa_id = 0;
// Data buffer size used.
uint64_t size = 0;
// Number of loops to run.
uint64_t num_loops = 0;
// Uses SM copy, otherwise DMA copy.
bool is_sm_copy = false;
// Sub-benchmarks to be run in parallel.
SubBenchArgs subs[kMaxNumSubs];
};
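For illustration, a hedged sketch of how the new struct describes a unidirectional host-to-GPU0 copy under NUMA node 0 (all values hypothetical; fields left untouched keep the defaults shown above):

    BenchArgs args;
    args.num_subs = 1;               // unidirectional: one sub-benchmark
    args.numa_id = 0;
    args.size = 268435456;           // 256 MiB, hypothetical
    args.num_loops = 100;            // hypothetical
    args.is_sm_copy = true;          // SM copy rather than DMA copy
    // Source stays host memory (is_src_dev_gpu defaults to false).
    args.subs[0].is_dst_dev_gpu = true;
    args.subs[0].dst_gpu_id = 0;
    args.subs[0].worker_gpu_id = 0;  // GPU 0 performs the copy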
// Options accepted by this program.
......@@ -87,9 +101,12 @@ struct Opts {
// Whether device-to-device transfer needs to be evaluated.
bool dtod_enabled = false;
// Whether bidirectional transfer is enabled.
bool bidirectional_enabled = false;
};
// Pring usage of this program.
// Print usage of this program.
void PrintUsage() {
printf("Usage: gpu_copy "
"--size <size> "
......@@ -98,19 +115,31 @@ void PrintUsage() {
"[--dma_copy] "
"[--htod] "
"[--dtoh] "
"[--dtod]\n");
"[--dtod] "
"[--bidirectional]\n");
}
// Parse options of this program.
int ParseOpts(int argc, char **argv, Opts *opts) {
enum class OptIdx { kSize, kNumIters, kEnableSmCopy, kEnableDmaCopy, kEnableHToD, kEnableDToH, kEnableDToD };
const struct option options[] = {{"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
{"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumIters)},
{"sm_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableSmCopy)},
{"dma_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDmaCopy)},
{"htod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableHToD)},
{"dtoh", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToH)},
{"dtod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToD)}};
enum class OptIdx {
kSize,
kNumIters,
kEnableSmCopy,
kEnableDmaCopy,
kEnableHToD,
kEnableDToH,
kEnableDToD,
kEnableBidirectional
};
const struct option options[] = {
{"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
{"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumIters)},
{"sm_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableSmCopy)},
{"dma_copy", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDmaCopy)},
{"htod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableHToD)},
{"dtoh", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToH)},
{"dtod", no_argument, nullptr, static_cast<int>(OptIdx::kEnableDToD)},
{"bidirectional", no_argument, nullptr, static_cast<int>(OptIdx::kEnableBidirectional)}};
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = false;
......@@ -159,6 +188,9 @@ int ParseOpts(int argc, char **argv, Opts *opts) {
case static_cast<int>(OptIdx::kEnableDToD):
opts->dtod_enabled = true;
break;
case static_cast<int>(OptIdx::kEnableBidirectional):
opts->bidirectional_enabled = true;
break;
default:
parse_err = true;
}
......@@ -193,145 +225,182 @@ int SetGpu(int gpu_id) {
return 0;
}
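The body of SetGpu is collapsed by the hunk above; a minimal sketch of such a helper, assuming it only wraps cudaSetDevice with the error-reporting style used elsewhere in this file:

    // Sketch (assumption): bind the calling thread to one CUDA device.
    int SetGpu(int gpu_id) {
        cudaError_t cuda_err = cudaSetDevice(gpu_id);
        if (cuda_err != cudaSuccess) {
            fprintf(stderr, "SetGpu::cudaSetDevice error: %d\n", cuda_err);
            return -1;
        }
        return 0;
    }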
// Prepare data buffers to be used.
int PrepareBuf(const BenchArgs &args, Buffers *buffers) {
// Prepare data buffers and streams to be used.
int PrepareBufAndStream(BenchArgs *args) {
cudaError_t cuda_err = cudaSuccess;
constexpr int uint8_mod = 256;
// Generate data to copy
buffers->data_buf = static_cast<uint8_t *>(numa_alloc_onnode(args.size, args.numa_id));
for (int i = 0; i < args.size; i++) {
buffers->data_buf[i] = static_cast<uint8_t>(i % uint8_mod);
}
for (int i = 0; i < args->num_subs; i++) {
SubBenchArgs &sub = args->subs[i];
// Reset check buffer
buffers->check_buf = static_cast<uint8_t *>(numa_alloc_onnode(args.size, args.numa_id));
memset(buffers->check_buf, 0, args.size);
// Allocate buffers for src/dst devices
constexpr int num_devices = 2;
bool is_dev_gpu[num_devices] = {args.is_src_dev_gpu, args.is_dst_dev_gpu};
int dev_ids[num_devices] = {args.src_gpu_id, args.dst_gpu_id};
uint8_t **host_buf_ptrs[num_devices] = {&(buffers->src_dev_host_buf_ptr), &(buffers->dst_dev_host_buf_ptr)};
uint8_t **gpu_buf_ptrs[num_devices] = {&(buffers->src_dev_gpu_buf_ptr), &(buffers->dst_dev_gpu_buf_ptr)};
for (int i = 0; i < num_devices; i++) {
// Allocate buffers
if (is_dev_gpu[i]) {
// Set to buffer device for GPU buffer
if (SetGpu(dev_ids[i])) {
return -1;
}
*(host_buf_ptrs[i]) = nullptr;
cuda_err = cudaMalloc(gpu_buf_ptrs[i], args.size);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBuf::cudaMalloc error: %d\n", cuda_err);
return -1;
}
} else {
// Set to worker device for host memory buffer
if (SetGpu(args.worker_gpu_id)) {
return -1;
}
*(host_buf_ptrs[i]) = static_cast<uint8_t *>(numa_alloc_onnode(args.size, args.numa_id));
cuda_err = cudaHostRegister(*(host_buf_ptrs[i]), args.size, cudaHostRegisterMapped);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBuf::cudaHostRegister error: %d\n", cuda_err);
return -1;
// Generate data to copy
sub.data_buf = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
for (int j = 0; j < args->size; j++) {
sub.data_buf[j] = static_cast<uint8_t>(j % uint8_mod);
}
// Allocate check buffer
sub.check_buf = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
// Allocate buffers for src/dst devices
constexpr int num_devices = 2;
bool is_dev_gpu[num_devices] = {sub.is_src_dev_gpu, sub.is_dst_dev_gpu};
int dev_ids[num_devices] = {sub.src_gpu_id, sub.dst_gpu_id};
uint8_t **host_buf_ptrs[num_devices] = {&(sub.src_dev_host_buf_ptr), &(sub.dst_dev_host_buf_ptr)};
uint8_t **gpu_buf_ptrs[num_devices] = {&(sub.src_dev_gpu_buf_ptr), &(sub.dst_dev_gpu_buf_ptr)};
for (int j = 0; j < num_devices; j++) {
// Allocate buffers
if (is_dev_gpu[j]) {
// Set to buffer device for GPU buffer
if (SetGpu(dev_ids[j])) {
return -1;
}
*(host_buf_ptrs[j]) = nullptr;
cuda_err = cudaMalloc(gpu_buf_ptrs[j], args->size);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBufAndStream::cudaMalloc error: %d\n", cuda_err);
return -1;
}
} else {
// Set to worker device for host memory buffer
if (SetGpu(sub.worker_gpu_id)) {
return -1;
}
*(host_buf_ptrs[j]) = static_cast<uint8_t *>(numa_alloc_onnode(args->size, args->numa_id));
cuda_err = cudaHostRegister(*(host_buf_ptrs[j]), args->size, cudaHostRegisterMapped);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBufAndStream::cudaHostRegister error: %d\n", cuda_err);
return -1;
}
cuda_err = cudaHostGetDevicePointer((void **)gpu_buf_ptrs[j], *(host_buf_ptrs[j]), 0);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBufAndStream::cudaHostGetDevicePointer error: %d\n", cuda_err);
return -1;
}
}
cuda_err = cudaHostGetDevicePointer((void **)gpu_buf_ptrs[i], *(host_buf_ptrs[i]), 0);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBuf::cudaHostGetDevicePointer error: %d\n", cuda_err);
}
// Initialize source buffer
if (sub.is_src_dev_gpu) {
if (SetGpu(sub.src_gpu_id)) {
return -1;
}
}
}
cuda_err = cudaMemcpy(sub.src_dev_gpu_buf_ptr, sub.data_buf, args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBufAndStream::cudaMemcpy error: %d\n", cuda_err);
return -1;
}
// Initialize source buffer
if (SetGpu(args.src_gpu_id)) {
return -1;
}
cuda_err = cudaMemcpy(buffers->src_dev_gpu_buf_ptr, buffers->data_buf, args.size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBuf::cudaMemcpy error: %d\n", cuda_err);
return -1;
// Initialize stream on worker device
if (SetGpu(sub.worker_gpu_id)) {
return -1;
}
cuda_err = cudaStreamCreateWithFlags(&(sub.stream), cudaStreamNonBlocking);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "PrepareBufAndStream::cudaStreamCreate error: %d\n", cuda_err);
return -1;
}
}
return 0;
}
// Validate the result of data transfer.
int CheckBuf(const BenchArgs &args, const Buffers &buffers) {
int CheckBuf(BenchArgs *args) {
cudaError_t cuda_err = cudaSuccess;
int memcmp_result = 0;
// Copy result
if (SetGpu(args.dst_gpu_id)) {
return -1;
}
cuda_err = cudaMemcpy(buffers.check_buf, buffers.src_dev_gpu_buf_ptr, args.size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "CheckBuf::cudaMemcpy error: %d\n", cuda_err);
return -1;
}
for (int i = 0; i < args->num_subs; i++) {
SubBenchArgs &sub = args->subs[i];
// Validate result
int memcmp_result = memcmp(buffers.data_buf, buffers.check_buf, args.size);
if (memcmp_result) {
fprintf(stderr, "CheckBuf: Memory check failed\n");
return -1;
// Copy result
memset(sub.check_buf, 0, args->size);
if (SetGpu(sub.dst_gpu_id)) {
return -1;
}
cuda_err = cudaMemcpy(sub.check_buf, sub.dst_dev_gpu_buf_ptr, args->size, cudaMemcpyDefault);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "CheckBuf::cudaMemcpy error: %d\n", cuda_err);
return -1;
}
// Validate result
memcmp_result = memcmp(sub.data_buf, sub.check_buf, args->size);
if (memcmp_result) {
fprintf(stderr, "CheckBuf: Memory check failed\n");
return -1;
}
}
return 0;
}
// Destroy data buffers
int DestroyBuf(const BenchArgs &args, Buffers *buffers) {
// Destroy data buffers and streams
int DestroyBufAndStream(BenchArgs *args) {
int ret = 0;
cudaError_t cuda_err = cudaSuccess;
// Destroy original data buffer and check buffer
if (buffers->data_buf != nullptr)
numa_free(buffers->data_buf, args.size);
if (buffers->check_buf != nullptr)
numa_free(buffers->check_buf, args.size);
// Only destroy buffers for src/dst devices
constexpr int num_devices = 2;
bool is_dev_gpu[num_devices] = {args.is_src_dev_gpu, args.is_dst_dev_gpu};
int dev_ids[num_devices] = {args.src_gpu_id, args.dst_gpu_id};
uint8_t **host_buf_ptrs[num_devices] = {&(buffers->src_dev_host_buf_ptr), &(buffers->dst_dev_host_buf_ptr)};
uint8_t **gpu_buf_ptrs[num_devices] = {&(buffers->src_dev_gpu_buf_ptr), &(buffers->dst_dev_gpu_buf_ptr)};
for (int i = 0; i < num_devices; i++) {
// Destroy buffers
if (is_dev_gpu[i]) {
if (*(gpu_buf_ptrs[i]) == nullptr) {
continue;
}
// Set to buffer device for GPU buffer
if (SetGpu(dev_ids[i])) {
return -1;
}
cuda_err = cudaFree(*(gpu_buf_ptrs[i]));
if (cuda_err != cudaSuccess) {
fprintf(stderr, "DestroyBuf::cudaFree error: %d\n", cuda_err);
ret = -1;
}
*(gpu_buf_ptrs[i]) = nullptr;
} else {
if (*(host_buf_ptrs[i]) == nullptr) {
continue;
}
// Set to worker device for host memory buffer
if (SetGpu(args.worker_gpu_id)) {
return -1;
}
cuda_err = cudaHostUnregister(*(host_buf_ptrs[i]));
if (cuda_err != cudaSuccess) {
fprintf(stderr, "DestroyBuf::cudaHostUnregister error: %d\n", cuda_err);
ret = -1;
for (int i = 0; i < args->num_subs; i++) {
SubBenchArgs &sub = args->subs[i];
// Destroy original data buffer and check buffer
if (sub.data_buf != nullptr) {
numa_free(sub.data_buf, args->size);
}
if (sub.check_buf != nullptr) {
numa_free(sub.check_buf, args->size);
}
// Only destroy buffers for src/dst devices
constexpr int num_devices = 2;
bool is_dev_gpu[num_devices] = {sub.is_src_dev_gpu, sub.is_dst_dev_gpu};
int dev_ids[num_devices] = {sub.src_gpu_id, sub.dst_gpu_id};
uint8_t **host_buf_ptrs[num_devices] = {&(sub.src_dev_host_buf_ptr), &(sub.dst_dev_host_buf_ptr)};
uint8_t **gpu_buf_ptrs[num_devices] = {&(sub.src_dev_gpu_buf_ptr), &(sub.dst_dev_gpu_buf_ptr)};
for (int i = 0; i < num_devices; i++) {
// Destroy buffers
if (is_dev_gpu[i]) {
if (*(gpu_buf_ptrs[i]) == nullptr) {
continue;
}
// Set to buffer device for GPU buffer
if (SetGpu(dev_ids[i])) {
return -1;
}
cuda_err = cudaFree(*(gpu_buf_ptrs[i]));
if (cuda_err != cudaSuccess) {
fprintf(stderr, "DestroyBufAndStream::cudaFree error: %d\n", cuda_err);
ret = -1;
}
*(gpu_buf_ptrs[i]) = nullptr;
} else {
if (*(host_buf_ptrs[i]) == nullptr) {
continue;
}
// Set to worker device for host memory buffer
if (SetGpu(sub.worker_gpu_id)) {
return -1;
}
cuda_err = cudaHostUnregister(*(host_buf_ptrs[i]));
if (cuda_err != cudaSuccess) {
fprintf(stderr, "DestroyBufAndStream::cudaHostUnregister error: %d\n", cuda_err);
ret = -1;
}
numa_free(*(host_buf_ptrs[i]), args->size);
*(host_buf_ptrs[i]) = nullptr;
*(gpu_buf_ptrs[i]) = nullptr;
}
numa_free(*(host_buf_ptrs[i]), args.size);
*(host_buf_ptrs[i]) = nullptr;
*(gpu_buf_ptrs[i]) = nullptr;
}
// Destroy stream on worker device
if (SetGpu(sub.worker_gpu_id)) {
return -1;
}
cuda_err = cudaStreamDestroy(sub.stream);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "DestoryBufAndStream::cudaStreamDestroy error: %d\n", cuda_err);
return -1;
}
}
......@@ -388,81 +457,83 @@ __global__ void SMCopyKernel(ulong2 *tgt, const ulong2 *src) {
}
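The kernel body is elided by this hunk; a sketch of what such an unrolled copy kernel plausibly looks like, consistent with the thread-block size math in RunCopy below and assuming NUM_LOOP_UNROLL and NUM_THREADS_IN_BLOCK are the compile-time constants defined earlier in this file:

    // Sketch, not the verbatim kernel: each thread copies NUM_LOOP_UNROLL
    // ulong2 elements (16 bytes each) with a blockDim.x stride so that
    // accesses stay coalesced across the warp.
    __global__ void SMCopyKernel(ulong2 *tgt, const ulong2 *src) {
        uint64_t index = blockIdx.x * blockDim.x * NUM_LOOP_UNROLL + threadIdx.x;
    #pragma unroll
        for (uint64_t i = 0; i < NUM_LOOP_UNROLL; i++) {
            tgt[index + i * blockDim.x] = src[index + i * blockDim.x];
        }
    }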
// Print result tag as <src_dev>_to_<dst_dev>[_read|_write]_by_<sm|dma>_under_numa<numa_node>_<uni|bi>.
void PringResultTag(const BenchArgs &args) {
if (args.is_src_dev_gpu) {
printf("gpu%d", args.src_gpu_id);
void PrintResultTag(const BenchArgs &args) {
if (args.subs[0].is_src_dev_gpu) {
printf("gpu%d", args.subs[0].src_gpu_id);
} else {
printf("cpu");
}
printf("_to_");
if (args.is_dst_dev_gpu) {
printf("gpu%d", args.dst_gpu_id);
if (args.subs[0].is_dst_dev_gpu) {
printf("gpu%d", args.subs[0].dst_gpu_id);
} else {
printf("cpu");
}
printf("_by_gpu%d_using_%s_under_numa%lu", args.worker_gpu_id, args.is_sm_copy ? "sm" : "dma", args.numa_id);
if (args.subs[0].is_src_dev_gpu && args.subs[0].is_dst_dev_gpu &&
args.subs[0].src_gpu_id != args.subs[0].dst_gpu_id) {
if (args.subs[0].src_gpu_id == args.subs[0].worker_gpu_id) {
printf("_write");
} else {
printf("_read");
}
}
printf("_by_%s_under_numa%lu", args.is_sm_copy ? "sm" : "dma", args.numa_id);
if (args.num_subs == 1) {
printf("_uni");
} else {
printf("_bi");
}
}
// Run copy benchmark.
int RunCopy(const BenchArgs &args, const Buffers &buffers) {
int RunCopy(BenchArgs *args) {
cudaError_t cuda_err = cudaSuccess;
cudaStream_t stream;
uint64_t num_thread_blocks;
// Set to worker device
if (SetGpu(args.worker_gpu_id)) {
return -1;
}
// Validate data size for SM copy
if (args.is_sm_copy) {
if (args->is_sm_copy) {
uint64_t num_elements_in_thread_block = NUM_LOOP_UNROLL * NUM_THREADS_IN_BLOCK;
uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(ulong2);
if (args.size % num_bytes_in_thread_block) {
if (args->size % num_bytes_in_thread_block) {
fprintf(stderr, "RunCopy: Data size should be multiple of %lu\n", num_bytes_in_thread_block);
return -1;
}
num_thread_blocks = args.size / num_bytes_in_thread_block;
}
// Create stream to launch kernels
cuda_err = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "RunCopy::cudaStreamCreate error: %d\n", cuda_err);
return -1;
num_thread_blocks = args->size / num_bytes_in_thread_block;
}
// Launch jobs and collect running time
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < args.num_loops; i++) {
if (args.is_sm_copy) {
SMCopyKernel<<<num_thread_blocks, NUM_THREADS_IN_BLOCK, 0, stream>>>(
reinterpret_cast<ulong2 *>(buffers.dst_dev_gpu_buf_ptr),
reinterpret_cast<ulong2 *>(buffers.src_dev_gpu_buf_ptr));
} else {
cudaMemcpyAsync(buffers.dst_dev_gpu_buf_ptr, buffers.src_dev_gpu_buf_ptr, args.size, cudaMemcpyDefault,
stream);
for (int i = 0; i < args->num_loops; i++) {
for (int j = 0; j < args->num_subs; j++) {
SubBenchArgs &sub = args->subs[j];
if (SetGpu(sub.worker_gpu_id)) {
return -1;
}
if (args->is_sm_copy) {
SMCopyKernel<<<num_thread_blocks, NUM_THREADS_IN_BLOCK, 0, sub.stream>>>(
reinterpret_cast<ulong2 *>(sub.dst_dev_gpu_buf_ptr),
reinterpret_cast<ulong2 *>(sub.src_dev_gpu_buf_ptr));
} else {
cudaMemcpyAsync(sub.dst_dev_gpu_buf_ptr, sub.src_dev_gpu_buf_ptr, args->size, cudaMemcpyDefault,
sub.stream);
}
}
}
cuda_err = cudaStreamSynchronize(stream);
auto end = std::chrono::steady_clock::now();
if (cuda_err != cudaSuccess) {
fprintf(stderr, "RunCopy::cudaStreamSynchronize error: %d\n", cuda_err);
return -1;
}
// Destroy stream
cuda_err = cudaStreamDestroy(stream);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "RunCopy::cudaStreamDestroy error: %d\n", cuda_err);
return -1;
for (int i = 0; i < args->num_subs; i++) {
SubBenchArgs &sub = args->subs[i];
cuda_err = cudaStreamSynchronize(sub.stream);
if (cuda_err != cudaSuccess) {
fprintf(stderr, "RunCopy::cudaStreamSynchronize error: %d\n", cuda_err);
return -1;
}
}
auto end = std::chrono::steady_clock::now();
// Calculate and display bandwidth if no problem
double time_in_sec = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
PringResultTag(args);
printf(" %g\n", args.size * args.num_loops / time_in_sec / 1e9);
PrintResultTag(*args);
printf(" %g\n", args->size * args->num_loops * args->num_subs / time_in_sec / 1e9);
return 0;
}
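Note the num_subs factor in the printf above: in bidirectional mode both sub-copies run concurrently on separate streams, so the reported figure counts bytes moved in both directions over one wall-clock interval. A worked example with hypothetical values:

    // Hypothetical numbers, for illustration only.
    double size = 268435456.0;             // 256 MiB per copy
    double num_loops = 100, num_subs = 2;  // bidirectional run
    double time_in_sec = 2.0;              // measured wall time
    double bw = size * num_loops * num_subs / time_in_sec / 1e9;  // ~26.8 GB/s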
......@@ -492,24 +563,73 @@ int EnablePeerAccess(int src_gpu_id, int dst_gpu_id, int *can_access) {
return 0;
}
int RunBench(const BenchArgs &args) {
int RunBench(BenchArgs *args) {
int ret = 0;
int destroy_buf_ret = 0;
Buffers buffers;
ret = PrepareBuf(args, &buffers);
ret = PrepareBufAndStream(args);
if (ret == 0) {
ret = RunCopy(args, buffers);
ret = RunCopy(args);
if (ret == 0) {
ret = CheckBuf(args, buffers);
ret = CheckBuf(args);
}
}
destroy_buf_ret = DestroyBuf(args, &buffers);
destroy_buf_ret = DestroyBufAndStream(args);
if (ret == 0) {
ret = destroy_buf_ret;
}
return ret;
}
void SetSubBenchArgsForHToD(int gpu_id, bool is_bidirectional, BenchArgs *args) {
args->subs[0].is_src_dev_gpu = false;
args->subs[0].is_dst_dev_gpu = true;
args->subs[0].dst_gpu_id = gpu_id;
args->subs[0].worker_gpu_id = gpu_id;
if (is_bidirectional) {
args->num_subs = 2;
args->subs[1].is_src_dev_gpu = true;
args->subs[1].is_dst_dev_gpu = false;
args->subs[1].src_gpu_id = gpu_id;
args->subs[1].worker_gpu_id = gpu_id;
} else {
args->num_subs = 1;
}
}
void SetSubBenchArgsForDToH(int gpu_id, bool is_bidirectional, BenchArgs *args) {
args->subs[0].is_src_dev_gpu = true;
args->subs[0].is_dst_dev_gpu = false;
args->subs[0].src_gpu_id = gpu_id;
args->subs[0].worker_gpu_id = gpu_id;
if (is_bidirectional) {
args->num_subs = 2;
args->subs[1].is_src_dev_gpu = false;
args->subs[1].is_dst_dev_gpu = true;
args->subs[1].dst_gpu_id = gpu_id;
args->subs[1].worker_gpu_id = gpu_id;
} else {
args->num_subs = 1;
}
}
void SetSubBenchArgsForDToD(int src_gpu_id, int dst_gpu_id, bool is_read, bool is_bidirectional, BenchArgs *args) {
args->subs[0].is_src_dev_gpu = true;
args->subs[0].is_dst_dev_gpu = true;
args->subs[0].src_gpu_id = src_gpu_id;
args->subs[0].dst_gpu_id = dst_gpu_id;
args->subs[0].worker_gpu_id = is_read ? dst_gpu_id : src_gpu_id;
if (is_bidirectional) {
args->num_subs = 2;
args->subs[1].is_src_dev_gpu = true;
args->subs[1].is_dst_dev_gpu = true;
args->subs[1].src_gpu_id = dst_gpu_id;
args->subs[1].dst_gpu_id = src_gpu_id;
args->subs[1].worker_gpu_id = is_read ? src_gpu_id : dst_gpu_id;
} else {
args->num_subs = 1;
}
}
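As used in main() below, a bidirectional device-to-device "read" configuration produces two mirrored sub-benchmarks; a hedged usage sketch, assuming GPUs 0 and 1 are peer-accessible:

    BenchArgs args;
    // Sub 0: gpu0 -> gpu1 with gpu1 as worker (gpu1 reads gpu0's memory).
    // Sub 1: gpu1 -> gpu0 with gpu0 as worker, so both GPUs read concurrently.
    SetSubBenchArgsForDToD(0, 1, /*is_read=*/true, /*is_bidirectional=*/true, &args);
    // args.num_subs == 2 after this call.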
int main(int argc, char **argv) {
int ret = 0;
int numa_count = 0;
......@@ -546,55 +666,52 @@ int main(int argc, char **argv) {
for (int j = 0; j < gpu_count; j++) {
// Host-to-device benchmark
if (opts.htod_enabled) {
args.is_src_dev_gpu = false;
args.is_dst_dev_gpu = true;
args.dst_gpu_id = j;
args.worker_gpu_id = j;
if (opts.sm_copy_enabled) {
args.is_sm_copy = true;
SetSubBenchArgsForHToD(j, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
if (opts.dma_copy_enabled) {
args.is_sm_copy = false;
SetSubBenchArgsForHToD(j, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
}
// Device-to-host benchmark
if (opts.dtoh_enabled) {
args.is_src_dev_gpu = true;
args.src_gpu_id = j;
args.is_dst_dev_gpu = false;
args.worker_gpu_id = j;
if (opts.sm_copy_enabled) {
args.is_sm_copy = true;
SetSubBenchArgsForDToH(j, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
if (opts.dma_copy_enabled) {
args.is_sm_copy = false;
SetSubBenchArgsForDToH(j, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
}
// Device-to-device benchmark
if (opts.dtod_enabled) {
args.is_src_dev_gpu = true;
args.src_gpu_id = j;
args.is_dst_dev_gpu = true;
// Scan all peers
for (int k = 0; k < gpu_count; k++) {
args.dst_gpu_id = k;
// Skip second half for bidirectional test
if (opts.bidirectional_enabled && j > k) {
break;
}
// P2P write
ret = EnablePeerAccess(j, k, &can_access);
if (ret != 0) {
return -1;
}
if (can_access) {
args.worker_gpu_id = j;
if (opts.sm_copy_enabled) {
args.is_sm_copy = true;
SetSubBenchArgsForDToD(j, k, false, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
if (opts.dma_copy_enabled) {
args.is_sm_copy = false;
SetSubBenchArgsForDToD(j, k, false, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
}
......@@ -607,13 +724,14 @@ int main(int argc, char **argv) {
return -1;
}
if (can_access) {
args.worker_gpu_id = k;
if (opts.sm_copy_enabled) {
args.is_sm_copy = true;
SetSubBenchArgsForDToD(j, k, true, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
if (opts.dma_copy_enabled) {
args.is_sm_copy = false;
SetSubBenchArgsForDToD(j, k, true, opts.bidirectional_enabled, &args);
args_list.push_back(args);
}
}
......@@ -622,13 +740,13 @@ int main(int argc, char **argv) {
}
}
for (const BenchArgs &curr_args : args_list) {
for (BenchArgs &curr_args : args_list) {
ret = numa_run_on_node(curr_args.numa_id);
if (ret != 0) {
fprintf(stderr, "main::numa_run_on_node error: %d\n", errno);
return -1;
}
ret = RunBench(curr_args);
ret = RunBench(&curr_args);
if (ret != 0) {
return -1;
}
......
......@@ -32,7 +32,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
mem_types = ['htod', 'dtoh', 'dtod']
copy_types = ['sm', 'dma']
parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d' % \
parameters = '--mem_type %s --copy_type %s --size %d --num_loops %d --bidirectional' % \
(' '.join(mem_types), ' '.join(copy_types), size, num_loops)
benchmark = benchmark_class(benchmark_name, parameters=parameters)
......@@ -49,6 +49,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
assert (benchmark._args.copy_type == copy_types)
assert (benchmark._args.size == size)
assert (benchmark._args.num_loops == num_loops)
assert (benchmark._args.bidirectional)
# Check command
assert (1 == len(benchmark._commands))
......@@ -59,6 +60,7 @@ def _test_gpu_copy_bw_performance_command_generation(self, platform):
assert ('--%s_copy' % copy_type in benchmark._commands[0])
assert ('--size %d' % size in benchmark._commands[0])
assert ('--num_loops %d' % num_loops in benchmark._commands[0])
assert ('--bidirectional' in benchmark._commands[0])
@decorator.cuda_test
def test_gpu_copy_bw_performance_command_generation_cuda(self):
......@@ -70,7 +72,8 @@ def test_gpu_copy_bw_performance_command_generation_rocm(self):
"""Test gpu-copy benchmark command generation, ROCm case."""
self._test_gpu_copy_bw_performance_command_generation(Platform.ROCM)
def _test_gpu_copy_bw_performance_result_parsing(self, platform):
@decorator.load_data('tests/data/gpu_copy_bw_performance.log')
def _test_gpu_copy_bw_performance_result_parsing(self, platform, test_raw_output):
"""Test gpu-copy benchmark result parsing."""
benchmark_name = 'gpu-copy-bw'
(benchmark_class,
......@@ -85,20 +88,6 @@ def _test_gpu_copy_bw_performance_result_parsing(self, platform):
assert (benchmark.type == BenchmarkType.MICRO)
# Positive case - valid raw output.
test_raw_output = """
cpu_to_gpu0_by_gpu0_using_sm_under_numa0 26.1755
cpu_to_gpu0_by_gpu0_using_dma_under_numa0 26.1894
gpu0_to_cpu_by_gpu0_using_sm_under_numa0 5.72584
gpu0_to_cpu_by_gpu0_using_dma_under_numa0 26.2623
gpu0_to_gpu0_by_gpu0_using_sm_under_numa0 659.275
gpu0_to_gpu0_by_gpu0_using_dma_under_numa0 636.401
cpu_to_gpu0_by_gpu0_using_sm_under_numa1 26.1589
cpu_to_gpu0_by_gpu0_using_dma_under_numa1 26.18
gpu0_to_cpu_by_gpu0_using_sm_under_numa1 5.07597
gpu0_to_cpu_by_gpu0_using_dma_under_numa1 25.2851
gpu0_to_gpu0_by_gpu0_using_sm_under_numa1 656.825
gpu0_to_gpu0_by_gpu0_using_dma_under_numa1 634.203
"""
assert (benchmark._process_raw_result(0, test_raw_output))
assert (benchmark.return_code == ReturnCode.SUCCESS)
......
cpu_to_gpu0_by_sm_under_numa0_uni 26.1736
cpu_to_gpu0_by_dma_under_numa0_uni 26.1878
gpu0_to_cpu_by_sm_under_numa0_uni 5.01589
gpu0_to_cpu_by_dma_under_numa0_uni 21.8659
gpu0_to_gpu0_by_sm_under_numa0_uni 655.759
gpu0_to_gpu0_by_dma_under_numa0_uni 633.325
gpu0_to_gpu1_write_by_sm_under_numa0_uni 250.122
gpu0_to_gpu1_write_by_dma_under_numa0_uni 274.951
gpu0_to_gpu1_read_by_sm_under_numa0_uni 253.563
gpu0_to_gpu1_read_by_dma_under_numa0_uni 264.009
cpu_to_gpu1_by_sm_under_numa0_uni 26.187
cpu_to_gpu1_by_dma_under_numa0_uni 26.207
gpu1_to_cpu_by_sm_under_numa0_uni 5.01132
gpu1_to_cpu_by_dma_under_numa0_uni 21.8635
gpu1_to_gpu0_write_by_sm_under_numa0_uni 249.824
gpu1_to_gpu0_write_by_dma_under_numa0_uni 275.123
gpu1_to_gpu0_read_by_sm_under_numa0_uni 253.469
gpu1_to_gpu0_read_by_dma_under_numa0_uni 264.908
gpu1_to_gpu1_by_sm_under_numa0_uni 658.338
gpu1_to_gpu1_by_dma_under_numa0_uni 631.148
cpu_to_gpu0_by_sm_under_numa1_uni 26.1542
cpu_to_gpu0_by_dma_under_numa1_uni 26.2007
gpu0_to_cpu_by_sm_under_numa1_uni 5.67356
gpu0_to_cpu_by_dma_under_numa1_uni 21.8599
gpu0_to_gpu0_by_sm_under_numa1_uni 656.935
gpu0_to_gpu0_by_dma_under_numa1_uni 631.974
gpu0_to_gpu1_write_by_sm_under_numa1_uni 250.118
gpu0_to_gpu1_write_by_dma_under_numa1_uni 274.778
gpu0_to_gpu1_read_by_sm_under_numa1_uni 253.625
gpu0_to_gpu1_read_by_dma_under_numa1_uni 264.347
cpu_to_gpu1_by_sm_under_numa1_uni 26.1905
cpu_to_gpu1_by_dma_under_numa1_uni 26.2007
gpu1_to_cpu_by_sm_under_numa1_uni 5.67716
gpu1_to_cpu_by_dma_under_numa1_uni 21.8579
gpu1_to_gpu0_write_by_sm_under_numa1_uni 250.064
gpu1_to_gpu0_write_by_dma_under_numa1_uni 274.924
gpu1_to_gpu0_read_by_sm_under_numa1_uni 253.746
gpu1_to_gpu0_read_by_dma_under_numa1_uni 264.256
gpu1_to_gpu1_by_sm_under_numa1_uni 655.623
gpu1_to_gpu1_by_dma_under_numa1_uni 634.062
cpu_to_gpu0_by_sm_under_numa0_bi 8.45975
cpu_to_gpu0_by_dma_under_numa0_bi 36.4282
gpu0_to_gpu0_by_sm_under_numa0_bi 689.063
gpu0_to_gpu0_by_dma_under_numa0_bi 661.7
gpu0_to_gpu1_write_by_sm_under_numa0_bi 427.446
gpu0_to_gpu1_write_by_dma_under_numa0_bi 521.577
gpu0_to_gpu1_read_by_sm_under_numa0_bi 446.835
gpu0_to_gpu1_read_by_dma_under_numa0_bi 503.158
cpu_to_gpu1_by_sm_under_numa0_bi 8.4487
cpu_to_gpu1_by_dma_under_numa0_bi 36.4272
cpu_to_gpu0_by_sm_under_numa1_bi 9.36164
cpu_to_gpu0_by_dma_under_numa1_bi 36.411
gpu0_to_gpu0_by_sm_under_numa1_bi 688.156
gpu0_to_gpu0_by_dma_under_numa1_bi 662.077
gpu0_to_gpu1_write_by_sm_under_numa1_bi 427.033
gpu0_to_gpu1_write_by_dma_under_numa1_bi 521.367
gpu0_to_gpu1_read_by_sm_under_numa1_bi 446.179
gpu0_to_gpu1_read_by_dma_under_numa1_bi 503.843
cpu_to_gpu1_by_sm_under_numa1_bi 9.37368
cpu_to_gpu1_by_dma_under_numa1_bi 36.4128