Unverified Commit cc0e9cb4 authored by gilbertlee-amd, committed by GitHub

TransferBench v1.11 (#9)

* Adding MIMO support, DMA executor, Null memory type
parent 3b47b874
# Changelog for TransferBench
## v1.11
### Added
- New multi-input / multi-output (MIMO) support. Transfers can now reduce (element-wise summation) multiple input memory arrays and write the sums to multiple outputs (see the sketch after this list)
- New GPU-DMA executor 'D' (uses hipMemcpy for SDMA copies). Previously this was done via USE_HIP_CALL, which applied to all GPU executors globally; GPU-GFX kernels can now run in parallel with GPU-DMA copies
  - The GPU-DMA executor can only be used for single-input / single-output Transfers
  - The GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only Transfers
- Added new GPU_KERNEL environment variable that allows for switching between various GPU-GFX reduction kernels
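
The MIMO reduction semantics described above can be illustrated with a minimal sketch (the function below is illustrative only and is not part of TransferBench):

```cpp
#include <cstddef>

// Element-wise reduce numSrcs input arrays and write the sums to numDsts outputs.
// With the Null memory type 'N', either side may be empty, giving read-only or
// write-only Transfers.
void ReduceMIMO(float* const* src, int numSrcs,
                float* const* dst, int numDsts, std::size_t N)
{
  for (std::size_t j = 0; j < N; ++j)
  {
    float sum = 0.0f;
    for (int i = 0; i < numSrcs; ++i) sum += src[i][j];  // reduce all inputs
    for (int i = 0; i < numDsts; ++i) dst[i][j] = sum;   // broadcast to all outputs
  }
}
```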
### Optimized
- Slightly improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs
### Changed
- Updated the example.cfg file to cover the new features
- Updated output to support MIMO
- Changed CU / CPU-thread naming to SubExecutors for consistency
- Sweep Preset:
  - Default sweep preset executors now include the GPU-DMA executor
- P2P Benchmarks:
  - Now run only via "p2p". Removed "p2p_rr", "g2g" and "g2g_rr"
  - Setting NUM_CPU_DEVICES=0 can be used to benchmark only GPU devices (like "g2g")
  - New environment variable USE_REMOTE_READ replaces the "_rr" presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with the GPU-DMA executor
  - The number of GPU SubExecutors for the benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - The number of CPU SubExecutors for the benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted so that each input array within the same Transfer gets a different pattern (see the sketch after this list)
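
For reference, the default (no FILL_PATTERN) initialization reported by the env-var summaries in this change follows `(Element i = i modulo 383 + 31) * (InputIdx + 1)`; a hedged sketch of that formula (the function name is illustrative only):

```cpp
#include <cstddef>

// Default pseudo-random fill: each input array of a Transfer gets its own
// variant of the pattern, so inputs stay distinguishable after reduction.
void FillDefaultPattern(float* src, std::size_t N, int inputIdx)
{
  for (std::size_t i = 0; i < N; ++i)
    src[i] = float(i % 383 + 31) * float(inputIdx + 1);
}
```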
### Removed
- USE_HIP_CALL has been removed. Use the GPU-DMA executor 'D', or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently a warning is issued if USE_HIP_CALL is set and the program terminates
- Removed NUM_CPU_PER_TRANSFER - The number of CPU SubExecutors will be whatever is specified for the Transfer
- Removed USE_MEMSET environment variable. This can now be done via a Transfer using the null memory type
## v1.10
### Fixed
- Fix incorrect bandwidth calculation when using single stream mode and per-Transfer data sizes
......
/*
Copyright (c) 2021-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2021-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -26,9 +26,12 @@ THE SOFTWARE.
#include <algorithm>
#include <random>
#include <time.h>
#define TB_VERSION "1.10"
#include "Kernels.hpp"
#define TB_VERSION "1.11"
extern char const MemTypeStr[];
extern char const ExeTypeStr[];
enum ConfigModeEnum
{
......@@ -45,10 +48,13 @@ public:
int const DEFAULT_NUM_WARMUPS = 1;
int const DEFAULT_NUM_ITERATIONS = 10;
int const DEFAULT_SAMPLING_FACTOR = 1;
int const DEFAULT_NUM_CPU_PER_TRANSFER = 4;
// Peer-to-peer Benchmark preset defaults
int const DEFAULT_P2P_NUM_CPU_SE = 4;
// Sweep-preset defaults
std::string const DEFAULT_SWEEP_SRC = "CG";
std::string const DEFAULT_SWEEP_EXE = "CG";
std::string const DEFAULT_SWEEP_EXE = "CDG";
std::string const DEFAULT_SWEEP_DST = "CG";
int const DEFAULT_SWEEP_MIN = 1;
int const DEFAULT_SWEEP_MAX = 24;
......@@ -59,21 +65,24 @@ public:
int blockBytes; // Each CU, except the last, gets a multiple of this many bytes to copy
int byteOffset; // Byte-offset for memory allocations
int numCpuDevices; // Number of CPU devices to use (defaults to # NUMA nodes detected)
int numCpuPerTransfer; // Number of CPU child threads to use per CPU Transfer
int numGpuDevices; // Number of GPU devices to use (defaults to # HIP devices detected)
int numIterations; // Number of timed iterations to perform. If negative, run for -numIterations seconds instead
int numWarmups; // Number of un-timed warmup iterations to perform
int outputToCsv; // Output in CSV format
int samplingFactor; // Affects how many different values of N are generated (when N set to 0)
int sharedMemBytes; // Amount of shared memory to use per threadblock
int useHipCall; // Use hipMemcpy/hipMemset instead of custom shader kernels
int useInteractive; // Pause for user-input before starting transfer loop
int useMemset; // Perform a memset instead of a copy (ignores source memory)
int usePcieIndexing; // Base GPU indexing on PCIe address instead of HIP device
int useSingleStream; // Use a single stream per device instead of per Tink. Can not be used with USE_HIP_CALL
int useSingleStream; // Use a single stream per GPU GFX executor instead of stream per Transfer
std::vector<float> fillPattern; // Pattern of floats used to fill source data
// Environment variables only for Benchmark-preset
int useRemoteRead; // Use destination memory type as executor instead of source memory type
int useDmaCopy; // Use DMA copy instead of GPU copy
int numGpuSubExecs; // Number of GPU subexecutors to use
int numCpuSubExecs; // Number of CPU subexecutors to use
// Environment variables only for Sweep-preset
int sweepMin; // Min number of simultaneous Transfers to be executed per test
int sweepMax; // Max number of simultaneous Transfers to be executed per test
......@@ -87,6 +96,10 @@ public:
std::string sweepExe; // Set of executors to be swept
std::string sweepDst; // Set of dst memory types to be swept
// Developer features
int enableDebug; // Enable debug output
int gpuKernel; // Which GPU kernel to use
// Used to track current configuration mode
ConfigModeEnum configMode;
......@@ -100,29 +113,48 @@ public:
EnvVars()
{
int maxSharedMemBytes = 0;
hipDeviceGetAttribute(&maxSharedMemBytes,
hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, 0);
HIP_CALL(hipDeviceGetAttribute(&maxSharedMemBytes,
hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, 0));
int numDeviceCUs = 0;
HIP_CALL(hipDeviceGetAttribute(&numDeviceCUs, hipDeviceAttributeMultiprocessorCount, 0));
int numDetectedCpus = numa_num_configured_nodes();
int numDetectedGpus;
hipGetDeviceCount(&numDetectedGpus);
HIP_CALL(hipGetDeviceCount(&numDetectedGpus));
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, 0));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
// Different hardware picks different default GPU kernels
// This performance difference is generally only noticeable when executing on fewer CUs
int defaultGpuKernel = 0;
if (archName == "gfx906") defaultGpuKernel = 13;
else if (archName == "gfx90a") defaultGpuKernel = 9;
blockBytes = GetEnvVar("BLOCK_BYTES" , 256);
byteOffset = GetEnvVar("BYTE_OFFSET" , 0);
numCpuDevices = GetEnvVar("NUM_CPU_DEVICES" , numDetectedCpus);
numCpuPerTransfer = GetEnvVar("NUM_CPU_PER_TRANSFER", DEFAULT_NUM_CPU_PER_TRANSFER);
numGpuDevices = GetEnvVar("NUM_GPU_DEVICES" , numDetectedGpus);
numIterations = GetEnvVar("NUM_ITERATIONS" , DEFAULT_NUM_ITERATIONS);
numWarmups = GetEnvVar("NUM_WARMUPS" , DEFAULT_NUM_WARMUPS);
outputToCsv = GetEnvVar("OUTPUT_TO_CSV" , 0);
samplingFactor = GetEnvVar("SAMPLING_FACTOR" , DEFAULT_SAMPLING_FACTOR);
sharedMemBytes = GetEnvVar("SHARED_MEM_BYTES" , maxSharedMemBytes / 2 + 1);
useHipCall = GetEnvVar("USE_HIP_CALL" , 0);
useInteractive = GetEnvVar("USE_INTERACTIVE" , 0);
useMemset = GetEnvVar("USE_MEMSET" , 0);
usePcieIndexing = GetEnvVar("USE_PCIE_INDEX" , 0);
useSingleStream = GetEnvVar("USE_SINGLE_STREAM" , 0);
enableDebug = GetEnvVar("DEBUG" , 0);
gpuKernel = GetEnvVar("GPU_KERNEL" , defaultGpuKernel);
// P2P Benchmark related
useRemoteRead = GetEnvVar("USE_REMOTE_READ" , 0);
useDmaCopy = GetEnvVar("USE_GPU_DMA" , 0);
numGpuSubExecs = GetEnvVar("NUM_GPU_SE" , useDmaCopy ? 1 : numDeviceCUs);
numCpuSubExecs = GetEnvVar("NUM_CPU_SE" , DEFAULT_P2P_NUM_CPU_SE);
// Sweep related
sweepMin = GetEnvVar("SWEEP_MIN" , DEFAULT_SWEEP_MIN);
sweepMax = GetEnvVar("SWEEP_MAX" , DEFAULT_SWEEP_MAX);
sweepSrc = GetEnvVar("SWEEP_SRC" , DEFAULT_SWEEP_SRC);
......@@ -135,7 +167,6 @@ public:
sweepRandBytes = GetEnvVar("SWEEP_RAND_BYTES" , 0);
// Determine random seed
char *sweepSeedStr = getenv("SWEEP_SEED");
sweepSeed = (sweepSeedStr != NULL ? atoi(sweepSeedStr) : time(NULL));
generator = new std::default_random_engine(sweepSeed);
......@@ -224,11 +255,6 @@ public:
printf("[ERROR] SAMPLING_FACTOR must be greater or equal to 1\n");
exit(1);
}
if (numCpuPerTransfer < 1)
{
printf("[ERROR] NUM_CPU_PER_TRANSFER must be greater or equal to 1\n");
exit(1);
}
if (sharedMemBytes < 0 || sharedMemBytes > maxSharedMemBytes)
{
printf("[ERROR] SHARED_MEM_BYTES must be between 0 and %d\n", maxSharedMemBytes);
......@@ -239,9 +265,16 @@ public:
printf("[ERROR] BLOCK_BYTES must be a positive multiple of 4\n");
exit(1);
}
if (useSingleStream && useHipCall)
if (numGpuSubExecs <= 0)
{
printf("[ERROR] NUM_GPU_SE must be greater than 0\n");
exit(1);
}
if (numCpuSubExecs <= 0)
{
printf("[ERROR] Single stream mode cannot be used with HIP calls\n");
printf("[ERROR] NUM_CPU_SE must be greater than 0\n");
exit(1);
}
......@@ -273,10 +306,9 @@ public:
}
}
char const* permittedExecutors = "CG";
for (auto ch : sweepExe)
{
if (!strchr(permittedExecutors, ch))
if (!strchr(ExeTypeStr, ch))
{
printf("[ERROR] Unrecognized executor type '%c' specified for sweep executor\n", ch);
exit(1);
......@@ -287,12 +319,30 @@ public:
exit(1);
}
}
if (gpuKernel < 0 || gpuKernel > NUM_GPU_KERNELS)
{
printf("[ERROR] GPU kernel must be between 0 and %d\n", NUM_GPU_KERNELS);
exit(1);
}
// Determine how many CPUs exist per NUMA node (to avoid executing on NUMA nodes without CPUs)
numCpusPerNuma.resize(numDetectedCpus);
int const totalCpus = numa_num_configured_cpus();
for (int i = 0; i < totalCpus; i++)
numCpusPerNuma[numa_node_of_cpu(i)]++;
// Check for deprecated env vars
if (getenv("USE_HIP_CALL"))
{
printf("[WARN] USE_HIP_CALL has been deprecated. Please use DMA executor 'D' or set USE_GPU_DMA for P2P-Benchmark preset\n");
exit(1);
}
char* enableSdma = getenv("HSA_ENABLE_SDMA");
if (enableSdma && !strcmp(enableSdma, "0"))
{
printf("[WARN] DMA functionality disabled due to environment variable HSA_ENABLE_SDMA=0. Copies will fallback to blit kernels\n");
}
}
// Display info on the env vars that can be used
......@@ -304,18 +354,15 @@ public:
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4. Defaults to 0\n");
printf(" FILL_PATTERN=STR - Fill input buffer with pattern specified in hex digits (0-9,a-f,A-F). Must be even number of digits, (byte-level big-endian)\n");
printf(" NUM_CPU_DEVICES=X - Restrict number of CPUs to X. May not be greater than # detected NUMA nodes\n");
printf(" NUM_CPU_PER_TRANSFER=C - Use C threads per Transfer for CPU-executed copies\n");
printf(" NUM_GPU_DEVICES=X - Restrict number of GCPUs to X. May not be greater than # detected HIP devices\n");
printf(" NUM_GPU_DEVICES=X - Restrict number of GPUs to X. May not be greater than # detected HIP devices\n");
printf(" NUM_ITERATIONS=I - Perform I timed iteration(s) per test\n");
printf(" NUM_WARMUPS=W - Perform W untimed warmup iteration(s) per test\n");
printf(" OUTPUT_TO_CSV - Outputs to CSV format if set\n");
printf(" SAMPLING_FACTOR=F - Add F samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHARED_MEM_BYTES=X - Use X shared mem bytes per threadblock, potentially to avoid multiple threadblocks per CU\n");
printf(" USE_HIP_CALL - Use hipMemcpy/hipMemset instead of custom shader kernels for GPU-executed copies\n");
printf(" USE_INTERACTIVE - Pause for user-input before starting transfer loop\n");
printf(" USE_MEMSET - Perform a memset instead of a copy (ignores source memory)\n");
printf(" USE_PCIE_INDEX - Index GPUs by PCIe address-ordering instead of HIP-provided indexing\n");
printf(" USE_SINGLE_STREAM - Use single stream per device instead of per Transfer. Cannot be used with USE_HIP_CALL\n");
printf(" USE_SINGLE_STREAM - Use a single stream per GPU GFX executor instead of stream per Transfer\n");
}
// Display env var settings
......@@ -331,10 +378,10 @@ public:
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31)");
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("%-20s = %12d : Using GPU kernel %d [%s]\n" , "GPU_KERNEL", gpuKernel, gpuKernel, GpuKernelNames[gpuKernel].c_str());
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d CPU thread(s) per CPU-executed Transfer\n", "NUM_CPU_PER_TRANSFER", numCpuPerTransfer, numCpuPerTransfer);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Running %d %s per Test\n", "NUM_ITERATIONS", numIterations,
numIterations > 0 ? numIterations : -numIterations,
......@@ -344,18 +391,8 @@ public:
outputToCsv ? "CSV" : "console");
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Using %s for GPU-executed copies\n", "USE_HIP_CALL", useHipCall,
useHipCall ? "HIP functions" : "custom kernels");
if (useHipCall && !useMemset)
{
char* env = getenv("HSA_ENABLE_SDMA");
printf("%-20s = %12s : %s\n", "HSA_ENABLE_SDMA", env,
(env && !strcmp(env, "0")) ? "Using blit kernels for hipMemcpy" : "Using DMA copy engines");
}
printf("%-20s = %12d : Running in %s mode\n", "USE_INTERACTIVE", useInteractive,
useInteractive ? "interactive" : "non-interactive");
printf("%-20s = %12d : Performing %s\n", "USE_MEMSET", useMemset,
useMemset ? "memset" : "memcopy");
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("%-20s = %12d : Using single stream per %s\n", "USE_SINGLE_STREAM",
......@@ -371,23 +408,82 @@ public:
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31)");
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_CPU_PER_TRANSFER,%d,Using %d CPU thread(s) per CPU-executed Transfer\n", numCpuPerTransfer, numCpuPerTransfer);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("NUM_ITERATIONS,%d,Running %d %s per Test\n", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_HIP_CALL,%d,Using %s for GPU-executed copies\n", useHipCall, useHipCall ? "HIP functions" : "custom kernels");
printf("USE_MEMSET,%d,Performing %s\n", useMemset, useMemset ? "memset" : "memcopy");
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
}
};
// Display env var for P2P Benchmark preset
void DisplayP2PBenchmarkEnvVars() const
{
if (!outputToCsv)
{
printf("Peer-to-peer Benchmark configuration (TransferBench v%s)\n", TB_VERSION);
printf("=====================================================\n");
printf("%-20s = %12d : Using %s as executor\n", "USE_REMOTE_READ", useRemoteRead , useRemoteRead ? "DST" : "SRC");
printf("%-20s = %12d : Using GPU-%s as GPU executor\n", "USE_GPU_DMA" , useDmaCopy , useDmaCopy ? "DMA" : "GFX");
printf("%-20s = %12d : Using %d CPU subexecutors\n", "NUM_CPU_SE" , numCpuSubExecs, numCpuSubExecs);
printf("%-20s = %12d : Using %d GPU subexecutors\n", "NUM_GPU_SE" , numGpuSubExecs, numGpuSubExecs);
printf("%-20s = %12d : Each CU gets a multiple of %d bytes to copy\n", "BLOCK_BYTES", blockBytes, blockBytes);
printf("%-20s = %12d : Using byte offset of %d\n", "BYTE_OFFSET", byteOffset, byteOffset);
printf("%-20s = %12s : ", "FILL_PATTERN", getenv("FILL_PATTERN") ? "(specified)" : "(unset)");
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Running %d %s per Test\n", "NUM_ITERATIONS", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("%-20s = %12d : Running %d warmup iteration(s) per Test\n", "NUM_WARMUPS", numWarmups, numWarmups);
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Running in %s mode\n", "USE_INTERACTIVE", useInteractive,
useInteractive ? "interactive" : "non-interactive");
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("\n");
}
else
{
printf("EnvVar,Value,Description,(TransferBench v%s)\n", TB_VERSION);
printf("USE_REMOTE_READ,%d,Using %s as executor\n", useRemoteRead, useRemoteRead ? "DST" : "SRC");
printf("USE_GPU_DMA,%d,Using GPU-%s as GPU executor\n", useDmaCopy , useDmaCopy ? "DMA" : "GFX");
printf("NUM_CPU_SE,%d,Using %d CPU subexecutors\n", numCpuSubExecs, numCpuSubExecs);
printf("NUM_GPU_SE,%d,Using %d GPU subexecutors\n", numGpuSubExecs, numGpuSubExecs);
printf("BLOCK_BYTES,%d,Each CU gets a multiple of %d bytes to copy\n", blockBytes, blockBytes);
printf("BYTE_OFFSET,%d,Using byte offset of %d\n", byteOffset, byteOffset);
printf("FILL_PATTERN,%s,", getenv("FILL_PATTERN") ? "(specified)" : "(unset)");
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("NUM_ITERATIONS,%d,Running %d %s per Test\n", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
printf("\n");
}
}
// Display env var settings
void DisplaySweepEnvVars() const
{
......@@ -407,7 +503,6 @@ public:
printf("%-20s = %12d : Max number of XGMI hops for Transfers (-1 = no limit)\n", "SWEEP_XGMI_MAX", sweepXgmiMax);
printf("%-20s = %12d : Using %s number of bytes per Transfer\n", "SWEEP_RAND_BYTES", sweepRandBytes, sweepRandBytes ? "random" : "constant");
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d CPU thread(s) per CPU-executed Transfer\n", "NUM_CPU_PER_TRANSFER", numCpuPerTransfer, numCpuPerTransfer);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Each CU gets a multiple of %d bytes to copy\n", "BLOCK_BYTES", blockBytes, blockBytes);
printf("%-20s = %12d : Using byte offset of %d\n", "BYTE_OFFSET", byteOffset, byteOffset);
......@@ -425,14 +520,6 @@ public:
outputToCsv ? "CSV" : "console");
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Using %s for GPU-executed copies\n", "USE_HIP_CALL", useHipCall,
useHipCall ? "HIP functions" : "custom kernels");
if (useHipCall && !useMemset)
{
char* env = getenv("HSA_ENABLE_SDMA");
printf("%-20s = %12s : %s\n", "HSA_ENABLE_SDMA", env,
(env && !strcmp(env, "0")) ? "Using blit kernels for hipMemcpy" : "Using DMA copy engines");
}
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("%-20s = %12d : Using single stream per %s\n", "USE_SINGLE_STREAM",
......@@ -454,7 +541,6 @@ public:
printf("SWEEP_XGMI_MAX,%d,Max number of XGMI hops for Transfers (-1 = no limit)\n", sweepXgmiMax);
printf("SWEEP_RAND_BYTES,%d,Using %s number of bytes per Transfer\n", sweepRandBytes, sweepRandBytes ? "random" : "constant");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_CPU_PER_TRANSFER,%d,Using %d CPU thread(s) per CPU-executed Transfer\n", numCpuPerTransfer, numCpuPerTransfer);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("BLOCK_BYTES,%d,Each CU gets a multiple of %d bytes to copy\n", blockBytes, blockBytes);
printf("BYTE_OFFSET,%d,Using byte offset of %d\n", byteOffset, byteOffset);
......@@ -469,7 +555,6 @@ public:
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_HIP_CALL,%d,Using %s for GPU-executed copies\n", useHipCall, useHipCall ? "HIP functions" : "custom kernels");
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
}
......
/*
Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -22,66 +22,145 @@ THE SOFTWARE.
#pragma once
#define PackedFloat_t float4
#define WARP_SIZE 64
#define BLOCKSIZE 256
#define FLOATS_PER_PACK (sizeof(PackedFloat_t) / sizeof(float))
#define MEMSET_CHAR 75
#define MEMSET_VAL 13323083.0f
// GPU copy kernel
__global__ void __launch_bounds__(BLOCKSIZE)
GpuCopyKernel(BlockParam* blockParams)
// Each subExecutor is provided with subarrays to work on
#define MAX_SRCS 16
#define MAX_DSTS 16
struct SubExecParam
{
#define PackedFloat_t float4
#define FLOATS_PER_PACK (sizeof(PackedFloat_t) / sizeof(float))
size_t N; // Number of floats this subExecutor works on
int numSrcs; // Number of source arrays
int numDsts; // Number of destination arrays
float* src[MAX_SRCS]; // Source array pointers
float* dst[MAX_DSTS]; // Destination array pointers
long long startCycle; // Start timestamp for in-kernel timing (GPU-GFX executor)
long long stopCycle; // Stop timestamp for in-kernel timing (GPU-GFX executor)
};
// Collect the arguments for this threadblock
int Nrem = blockParams[blockIdx.x].N;
float const* src = blockParams[blockIdx.x].src;
float* dst = blockParams[blockIdx.x].dst;
if (threadIdx.x == 0) blockParams[blockIdx.x].startCycle = __builtin_amdgcn_s_memrealtime();
void CpuReduceKernel(SubExecParam const& p)
{
int const& numSrcs = p.numSrcs;
int const& numDsts = p.numDsts;
if (numSrcs == 0)
{
for (int i = 0; i < numDsts; ++i)
memset((float* __restrict__)p.dst[i], MEMSET_CHAR, p.N * sizeof(float));
}
else if (numSrcs == 1)
{
float const* __restrict__ src = p.src[0];
for (int i = 0; i < numDsts; ++i)
{
memcpy((float* __restrict__)p.dst[i], src, p.N * sizeof(float));
}
}
else
{
for (int j = 0; j < p.N; j++)
{
float sum = p.src[0][j];
for (int i = 1; i < numSrcs; i++) sum += p.src[i][j];
for (int i = 0; i < numDsts; i++) p.dst[i][j] = sum;
}
}
}
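// --- Illustrative only; not part of this commit ---
// A minimal sketch of filling a SubExecParam for a 2-input / 1-output reduction
// and running it through the CPU executor path above. The buffers and the
// function name are hypothetical.
inline void ExampleCpuReduce()
{
  constexpr size_t N = 8;
  static float in0[N], in1[N], out[N];
  for (size_t j = 0; j < N; ++j) { in0[j] = float(j); in1[j] = 2.0f * float(j); }
  SubExecParam p = {};
  p.N       = N;
  p.numSrcs = 2;  p.src[0] = in0;  p.src[1] = in1;
  p.numDsts = 1;  p.dst[0] = out;
  CpuReduceKernel(p);  // afterwards out[j] == in0[j] + in1[j] == 3 * j
}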
// Helper function for memset
template <typename T> __device__ __forceinline__ T MemsetVal();
template <> __device__ __forceinline__ float MemsetVal(){ return MEMSET_VAL; };
template <> __device__ __forceinline__ float4 MemsetVal(){ return make_float4(MEMSET_VAL, MEMSET_VAL, MEMSET_VAL, MEMSET_VAL); }
// GPU copy kernel 0: 3 loops: unrolled float4s, then float4s, then individual floats
template <int LOOP1_UNROLL>
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel(SubExecParam* params)
{
int64_t startCycle = __builtin_amdgcn_s_memrealtime();
// Operate on wavefront granularity
int numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
int waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
SubExecParam& p = params[blockIdx.x];
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
int const numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
int const waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int const threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
#define LOOP1_UNROLL 8
// 1st loop - each wavefront operates on LOOP1_UNROLL x FLOATS_PER_PACK per thread per iteration
// Determine the number of packed floats processed by the first loop
int const loop1Npack = (Nrem / (FLOATS_PER_PACK * LOOP1_UNROLL * WARP_SIZE)) * (LOOP1_UNROLL * WARP_SIZE);
int const loop1Nelem = loop1Npack * FLOATS_PER_PACK;
int const loop1Inc = BLOCKSIZE * LOOP1_UNROLL;
int loop1Offset = waveId * LOOP1_UNROLL * WARP_SIZE + threadId;
size_t Nrem = p.N;
size_t const loop1Npack = (Nrem / (FLOATS_PER_PACK * LOOP1_UNROLL * WARP_SIZE)) * (LOOP1_UNROLL * WARP_SIZE);
size_t const loop1Nelem = loop1Npack * FLOATS_PER_PACK;
size_t const loop1Inc = BLOCKSIZE * LOOP1_UNROLL;
size_t loop1Offset = waveId * LOOP1_UNROLL * WARP_SIZE + threadId;
PackedFloat_t const* packedSrc = (PackedFloat_t const*)(src) + loop1Offset;
PackedFloat_t* packedDst = (PackedFloat_t *)(dst) + loop1Offset;
while (loop1Offset < loop1Npack)
{
PackedFloat_t vals[LOOP1_UNROLL];
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u)
vals[u] = *(packedSrc + u * WARP_SIZE);
PackedFloat_t vals[LOOP1_UNROLL] = {};
if (numSrcs == 0)
{
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u) vals[u] = MemsetVal<float4>();
}
else
{
for (int i = 0; i < numSrcs; ++i)
{
PackedFloat_t const* __restrict__ packedSrc = (PackedFloat_t const*)(p.src[i]) + loop1Offset;
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u)
*(packedDst + u * WARP_SIZE) = vals[u];
vals[u] += *(packedSrc + u * WARP_SIZE);
}
}
packedSrc += loop1Inc;
packedDst += loop1Inc;
for (int i = 0; i < numDsts; ++i)
{
PackedFloat_t* __restrict__ packedDst = (PackedFloat_t*)(p.dst[i]) + loop1Offset;
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u) *(packedDst + u * WARP_SIZE) = vals[u];
}
loop1Offset += loop1Inc;
}
Nrem -= loop1Nelem;
if (Nrem > 0)
{
// 2nd loop - Each thread operates on FLOATS_PER_PACK per iteration
int const loop2Npack = Nrem / FLOATS_PER_PACK;
int const loop2Nelem = loop2Npack * FLOATS_PER_PACK;
int const loop2Inc = BLOCKSIZE;
int loop2Offset = threadIdx.x;
// NOTE: Using int32_t due to smaller size requirements
int32_t const loop2Npack = Nrem / FLOATS_PER_PACK;
int32_t const loop2Nelem = loop2Npack * FLOATS_PER_PACK;
int32_t const loop2Inc = BLOCKSIZE;
int32_t loop2Offset = threadIdx.x;
packedSrc = (PackedFloat_t const*)(src + loop1Nelem);
packedDst = (PackedFloat_t *)(dst + loop1Nelem);
while (loop2Offset < loop2Npack)
{
packedDst[loop2Offset] = packedSrc[loop2Offset];
PackedFloat_t val;
if (numSrcs == 0)
{
val = MemsetVal<float4>();
}
else
{
val = {};
for (int i = 0; i < numSrcs; ++i)
{
PackedFloat_t const* __restrict__ packedSrc = (PackedFloat_t const*)(p.src[i] + loop1Nelem) + loop2Offset;
val += *packedSrc;
}
}
for (int i = 0; i < numDsts; ++i)
{
PackedFloat_t* __restrict__ packedDst = (PackedFloat_t*)(p.dst[i] + loop1Nelem) + loop2Offset;
*packedDst = val;
}
loop2Offset += loop2Inc;
}
Nrem -= loop2Nelem;
......@@ -90,40 +169,221 @@ GpuCopyKernel(BlockParam* blockParams)
if (threadIdx.x < Nrem)
{
int offset = loop1Nelem + loop2Nelem + threadIdx.x;
dst[offset] = src[offset];
float val = 0;
if (numSrcs == 0)
{
val = MEMSET_VAL;
}
else
{
for (int i = 0; i < numSrcs; ++i)
val += ((float const* __restrict__)p.src[i])[offset];
}
for (int i = 0; i < numDsts; ++i)
((float* __restrict__)p.dst[i])[offset] = val;
}
}
__threadfence_system();
__syncthreads();
if (threadIdx.x == 0)
blockParams[blockIdx.x].stopCycle = __builtin_amdgcn_s_memrealtime();
{
p.startCycle = startCycle;
p.stopCycle = __builtin_amdgcn_s_memrealtime();
}
}
#define MEMSET_UNROLL 8
__global__ void __launch_bounds__(BLOCKSIZE)
GpuMemsetKernel(BlockParam* blockParams)
template <typename FLOAT_TYPE, int UNROLL_FACTOR>
__device__ size_t GpuReduceFuncImpl2(SubExecParam const &p, size_t const offset, size_t const N)
{
// Collect the arguments for this block
int N = blockParams[blockIdx.x].N;
float* __restrict__ dst = (float*)blockParams[blockIdx.x].dst;
int constexpr numFloatsPerPack = sizeof(FLOAT_TYPE) / sizeof(float); // Number of floats handled at a time per thread
int constexpr numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
size_t constexpr loopPackInc = BLOCKSIZE * UNROLL_FACTOR;
size_t constexpr numPacksPerWave = WARP_SIZE * UNROLL_FACTOR;
int const waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int const threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
size_t const numPacksDone = (numFloatsPerPack == 1 && UNROLL_FACTOR == 1) ? N : (N / (FLOATS_PER_PACK * numPacksPerWave)) * numPacksPerWave;
size_t const numFloatsLeft = N - numPacksDone * numFloatsPerPack;
size_t loopPackOffset = waveId * numPacksPerWave + threadId;
while (loopPackOffset < numPacksDone)
{
FLOAT_TYPE vals[UNROLL_FACTOR];
if (numSrcs == 0)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u) vals[u] = MemsetVal<FLOAT_TYPE>();
}
else
{
FLOAT_TYPE const* __restrict__ src0Ptr = ((FLOAT_TYPE const*)(p.src[0] + offset)) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
vals[u] = *(src0Ptr + u * WARP_SIZE);
// Use non-zero value
#pragma unroll MEMSET_UNROLL
for (int tid = threadIdx.x; tid < N; tid += BLOCKSIZE)
for (int i = 1; i < numSrcs; ++i)
{
dst[tid] = 1234.0;
FLOAT_TYPE const* __restrict__ srcPtr = ((FLOAT_TYPE const*)(p.src[i] + offset)) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
vals[u] += *(srcPtr + u * WARP_SIZE);
}
}
for (int i = 0; i < numDsts; ++i)
{
FLOAT_TYPE* __restrict__ dstPtr = (FLOAT_TYPE*)(p.dst[i + offset]) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
*(dstPtr + u * WARP_SIZE) = vals[u];
}
loopPackOffset += loopPackInc;
}
return numFloatsLeft;
}
// CPU copy kernel
void CpuCopyKernel(BlockParam const& blockParams)
template <typename FLOAT_TYPE, int UNROLL_FACTOR>
__device__ size_t GpuReduceFuncImpl(SubExecParam const &p, size_t const offset, size_t const N)
{
memcpy(blockParams.dst, blockParams.src, blockParams.N * sizeof(float));
// Each thread in the block works on UNROLL_FACTOR FLOAT_TYPEs during each iteration of the loop
int constexpr numFloatsPerRead = sizeof(FLOAT_TYPE) / sizeof(float);
size_t constexpr numFloatsPerInnerLoop = BLOCKSIZE * numFloatsPerRead;
size_t constexpr numFloatsPerOuterLoop = numFloatsPerInnerLoop * UNROLL_FACTOR;
size_t const numFloatsLeft = (numFloatsPerRead == 1 && UNROLL_FACTOR == 1) ? 0 : N % numFloatsPerOuterLoop;
size_t const numFloatsDone = N - numFloatsLeft;
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
for (size_t idx = threadIdx.x * numFloatsPerRead; idx < numFloatsDone; idx += numFloatsPerOuterLoop)
{
FLOAT_TYPE tmp[UNROLL_FACTOR];
if (numSrcs == 0)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] = MemsetVal<FLOAT_TYPE>();
}
else
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] = *((FLOAT_TYPE*)(&p.src[0][offset + idx + u * numFloatsPerInnerLoop]));
for (int i = 1; i < numSrcs; ++i)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] += *((FLOAT_TYPE*)(&p.src[i][offset + idx + u * numFloatsPerInnerLoop]));
}
}
for (int i = 0; i < numDsts; ++i)
{
for (int u = 0; u < UNROLL_FACTOR; ++u)
{
*((FLOAT_TYPE*)(&p.dst[i][offset + idx + u * numFloatsPerInnerLoop])) = tmp[u];
}
}
}
return numFloatsLeft;
}
template <typename FLOAT_TYPE>
__device__ size_t GpuReduceFunc(SubExecParam const &p, size_t const offset, size_t const N, int const unroll)
{
switch (unroll)
{
case 1: return GpuReduceFuncImpl<FLOAT_TYPE, 1>(p, offset, N);
case 2: return GpuReduceFuncImpl<FLOAT_TYPE, 2>(p, offset, N);
case 3: return GpuReduceFuncImpl<FLOAT_TYPE, 3>(p, offset, N);
case 4: return GpuReduceFuncImpl<FLOAT_TYPE, 4>(p, offset, N);
case 5: return GpuReduceFuncImpl<FLOAT_TYPE, 5>(p, offset, N);
case 6: return GpuReduceFuncImpl<FLOAT_TYPE, 6>(p, offset, N);
case 7: return GpuReduceFuncImpl<FLOAT_TYPE, 7>(p, offset, N);
case 8: return GpuReduceFuncImpl<FLOAT_TYPE, 8>(p, offset, N);
case 9: return GpuReduceFuncImpl<FLOAT_TYPE, 9>(p, offset, N);
case 10: return GpuReduceFuncImpl<FLOAT_TYPE, 10>(p, offset, N);
case 11: return GpuReduceFuncImpl<FLOAT_TYPE, 11>(p, offset, N);
case 12: return GpuReduceFuncImpl<FLOAT_TYPE, 12>(p, offset, N);
case 13: return GpuReduceFuncImpl<FLOAT_TYPE, 13>(p, offset, N);
case 14: return GpuReduceFuncImpl<FLOAT_TYPE, 14>(p, offset, N);
case 15: return GpuReduceFuncImpl<FLOAT_TYPE, 15>(p, offset, N);
case 16: return GpuReduceFuncImpl<FLOAT_TYPE, 16>(p, offset, N);
default: return GpuReduceFuncImpl<FLOAT_TYPE, 1>(p, offset, N);
}
}
// CPU memset kernel
void CpuMemsetKernel(BlockParam const& blockParams)
// GPU copy kernel
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel2(SubExecParam* params)
{
for (int i = 0; i < blockParams.N; i++)
blockParams.dst[i] = 1234.0;
int64_t startCycle = __builtin_amdgcn_s_memrealtime();
SubExecParam& p = params[blockIdx.x];
size_t numFloatsLeft = GpuReduceFunc<float4>(p, 0, p.N, 8);
if (numFloatsLeft)
numFloatsLeft = GpuReduceFunc<float4>(p, p.N - numFloatsLeft, numFloatsLeft, 1);
if (numFloatsLeft)
GpuReduceFunc<float>(p, p.N - numFloatsLeft, numFloatsLeft, 1);
__threadfence_system();
if (threadIdx.x == 0)
{
p.startCycle = startCycle;
p.stopCycle = __builtin_amdgcn_s_memrealtime();
}
}
#define NUM_GPU_KERNELS 18
typedef void (*GpuKernelFuncPtr)(SubExecParam*);
GpuKernelFuncPtr GpuKernelTable[NUM_GPU_KERNELS] =
{
GpuReduceKernel<8>,
GpuReduceKernel<1>,
GpuReduceKernel<2>,
GpuReduceKernel<3>,
GpuReduceKernel<4>,
GpuReduceKernel<5>,
GpuReduceKernel<6>,
GpuReduceKernel<7>,
GpuReduceKernel<8>,
GpuReduceKernel<9>,
GpuReduceKernel<10>,
GpuReduceKernel<11>,
GpuReduceKernel<12>,
GpuReduceKernel<13>,
GpuReduceKernel<14>,
GpuReduceKernel<15>,
GpuReduceKernel<16>,
GpuReduceKernel2
};
std::string GpuKernelNames[NUM_GPU_KERNELS] =
{
"Default - 8xUnroll",
"Unroll x1",
"Unroll x2",
"Unroll x3",
"Unroll x4",
"Unroll x5",
"Unroll x6",
"Unroll x7",
"Unroll x8",
"Unroll x9",
"Unroll x10",
"Unroll x11",
"Unroll x12",
"Unroll x13",
"Unroll x14",
"Unroll x15",
"Unroll x16",
"8xUnrollB",
};
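// --- Illustrative only; not part of this commit ---
// A minimal sketch of how a GPU-GFX executor might launch the kernel selected by
// the GPU_KERNEL environment variable. Stream, grid sizing and the device-side
// parameter buffer are assumed to be prepared elsewhere; the function name and
// signature are hypothetical.
inline void LaunchSelectedGpuKernel(int gpuKernelIdx, int numSubExecs, int sharedMemBytes,
                                    hipStream_t stream, SubExecParam* subExecParamGpu)
{
  // One threadblock of BLOCKSIZE threads per SubExecutor; each block reads its
  // own SubExecParam entry via blockIdx.x inside the kernel.
  hipLaunchKernelGGL(GpuKernelTable[gpuKernelIdx],
                     dim3(numSubExecs), dim3(BLOCKSIZE),
                     sharedMemBytes, stream,
                     subExecParamGpu);
}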
Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
# Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
ROCM_PATH ?= /opt/rocm
HIPCC=$(ROCM_PATH)/bin/hipcc
EXE=TransferBench
CXXFLAGS = -O3 -I. -lnuma -L$(ROCM_PATH)/hsa/lib -lhsa-runtime64
CXXFLAGS = -O3 -I. -lnuma -L$(ROCM_PATH)/hsa/lib -lhsa-runtime64 -ferror-limit=5
all: $(EXE)
......
/*
Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -30,7 +30,6 @@ THE SOFTWARE.
#include "TransferBench.hpp"
#include "GetClosestNumaNode.hpp"
#include "Kernels.hpp"
int main(int argc, char **argv)
{
......@@ -76,30 +75,18 @@ int main(int argc, char **argv)
// - Tests that sweep across possible sets of Transfers
if (!strcmp(argv[1], "sweep") || !strcmp(argv[1], "rsweep"))
{
int numBlocksToUse = (argc > 3 ? atoi(argv[3]) : 4);
int numGpuSubExecs = (argc > 3 ? atoi(argv[3]) : 4);
int numCpuSubExecs = (argc > 4 ? atoi(argv[4]) : 4);
ev.configMode = CFG_SWEEP;
RunSweepPreset(ev, numBytesPerTransfer, numBlocksToUse, !strcmp(argv[1], "rsweep"));
RunSweepPreset(ev, numBytesPerTransfer, numGpuSubExecs, numCpuSubExecs, !strcmp(argv[1], "rsweep"));
exit(0);
}
// - Tests that benchmark peer-to-peer performance
else if (!strcmp(argv[1], "p2p") || !strcmp(argv[1], "p2p_rr") ||
!strcmp(argv[1], "g2g") || !strcmp(argv[1], "g2g_rr"))
else if (!strcmp(argv[1], "p2p"))
{
int numBlocksToUse = 0;
if (argc > 3)
numBlocksToUse = atoi(argv[3]);
else
HIP_CALL(hipDeviceGetAttribute(&numBlocksToUse, hipDeviceAttributeMultiprocessorCount, 0));
// Perform either local read (+remote write) [EXE = SRC] or
// remote read (+local write) [EXE = DST]
int readMode = (!strcmp(argv[1], "p2p_rr") || !strcmp(argv[1], "g2g_rr") ? 1 : 0);
int skipCpu = (!strcmp(argv[1], "g2g" ) || !strcmp(argv[1], "g2g_rr") ? 1 : 0);
// Execute peer to peer benchmark mode
ev.configMode = CFG_P2P;
RunPeerToPeerBenchmarks(ev, numBytesPerTransfer / sizeof(float), numBlocksToUse, readMode, skipCpu);
RunPeerToPeerBenchmarks(ev, numBytesPerTransfer / sizeof(float));
exit(0);
}
......@@ -116,8 +103,7 @@ int main(int argc, char **argv)
ev.DisplayEnvVars();
if (ev.outputToCsv)
{
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),"
"ExeToSrcLinkType,ExeToDstLinkType,SrcAddr,DstAddr\n");
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
}
int testNum = 0;
......@@ -170,71 +156,70 @@ void ExecuteTransfers(EnvVars const& ev,
TransferMap transferMap;
for (Transfer& transfer : transfers)
{
Executor executor(transfer.exeMemType, transfer.exeIndex);
Executor executor(transfer.exeType, transfer.exeIndex);
ExecutorInfo& executorInfo = transferMap[executor];
executorInfo.transfers.push_back(&transfer);
}
// Loop over each executor and prepare GPU resources
// Loop over each executor and prepare sub-executors
std::map<int, Transfer*> transferList;
for (auto& exeInfoPair : transferMap)
{
Executor const& executor = exeInfoPair.first;
ExecutorInfo& exeInfo = exeInfoPair.second;
ExeType const exeType = executor.first;
int const exeIndex = RemappedIndex(executor.second, IsCpuType(exeType));
exeInfo.totalTime = 0.0;
exeInfo.totalBlocks = 0;
exeInfo.totalSubExecs = 0;
// Loop over each transfer this executor is involved in
for (Transfer* transfer : exeInfo.transfers)
{
// Get some aliases to transfer variables
MemType const& exeMemType = transfer->exeMemType;
MemType const& srcMemType = transfer->srcMemType;
MemType const& dstMemType = transfer->dstMemType;
int const& blocksToUse = transfer->numBlocksToUse;
// Get potentially remapped device indices
int const srcIndex = RemappedIndex(transfer->srcIndex, srcMemType);
int const exeIndex = RemappedIndex(transfer->exeIndex, exeMemType);
int const dstIndex = RemappedIndex(transfer->dstIndex, dstMemType);
// Determine how many bytes to copy for this Transfer (use custom if pre-specified)
transfer->numBytesActual = (transfer->numBytes ? transfer->numBytes : N * sizeof(float));
// Enable peer-to-peer access if necessary (can only be called once per unique pair)
if (exeMemType == MEM_GPU)
// Allocate source memory
transfer->srcMem.resize(transfer->numSrcs);
for (int iSrc = 0; iSrc < transfer->numSrcs; ++iSrc)
{
MemType const& srcType = transfer->srcType[iSrc];
int const srcIndex = RemappedIndex(transfer->srcIndex[iSrc], IsCpuType(srcType));
// Ensure executing GPU can access source memory
if ((srcMemType == MEM_GPU || srcMemType == MEM_GPU_FINE) && srcIndex != exeIndex)
if (IsGpuType(exeType) && IsGpuType(srcType) && srcIndex != exeIndex)
EnablePeerAccess(exeIndex, srcIndex);
AllocateMemory(srcType, srcIndex, transfer->numBytesActual + ev.byteOffset, (void**)&transfer->srcMem[iSrc]);
}
// Allocate destination memory
transfer->dstMem.resize(transfer->numDsts);
for (int iDst = 0; iDst < transfer->numDsts; ++iDst)
{
MemType const& dstType = transfer->dstType[iDst];
int const dstIndex = RemappedIndex(transfer->dstIndex[iDst], IsCpuType(dstType));
// Ensure executing GPU can access destination memory
if ((dstMemType == MEM_GPU || dstMemType == MEM_GPU_FINE) && dstIndex != exeIndex)
if (IsGpuType(exeType) && IsGpuType(dstType) && dstIndex != exeIndex)
EnablePeerAccess(exeIndex, dstIndex);
}
// Allocate (maximum) source / destination memory based on type / device index
transfer->numBytesToCopy = (transfer->numBytes ? transfer->numBytes : N * sizeof(float));
AllocateMemory(srcMemType, srcIndex, transfer->numBytesToCopy + ev.byteOffset, (void**)&transfer->srcMem);
AllocateMemory(dstMemType, dstIndex, transfer->numBytesToCopy + ev.byteOffset, (void**)&transfer->dstMem);
AllocateMemory(dstType, dstIndex, transfer->numBytesActual + ev.byteOffset, (void**)&transfer->dstMem[iDst]);
}
transfer->blockParam.resize(exeMemType == MEM_CPU ? ev.numCpuPerTransfer : blocksToUse);
exeInfo.totalBlocks += transfer->blockParam.size();
exeInfo.totalSubExecs += transfer->numSubExecs;
transferList[transfer->transferIndex] = transfer;
}
// Prepare per-threadblock parameters for GPU executors
MemType const exeMemType = executor.first;
int const exeIndex = RemappedIndex(executor.second, exeMemType);
if (exeMemType == MEM_GPU)
// Prepare additional requirement for GPU-based executors
if (IsGpuType(exeType))
{
// Allocate one contiguous chunk of GPU memory for threadblock parameters
// This allows support for executing one transfer per stream, or all transfers in a single stream
AllocateMemory(exeMemType, exeIndex, exeInfo.totalBlocks * sizeof(BlockParam),
(void**)&exeInfo.blockParamGpu);
int const numTransfersToRun = ev.useSingleStream ? 1 : exeInfo.transfers.size();
exeInfo.streams.resize(numTransfersToRun);
exeInfo.startEvents.resize(numTransfersToRun);
exeInfo.stopEvents.resize(numTransfersToRun);
for (int i = 0; i < numTransfersToRun; ++i)
// Single-stream is only supported for GFX-based executors
int const numStreamsToUse = (exeType == EXE_GPU_DMA || !ev.useSingleStream) ? exeInfo.transfers.size() : 1;
exeInfo.streams.resize(numStreamsToUse);
exeInfo.startEvents.resize(numStreamsToUse);
exeInfo.stopEvents.resize(numStreamsToUse);
for (int i = 0; i < numStreamsToUse; ++i)
{
HIP_CALL(hipSetDevice(exeIndex));
HIP_CALL(hipStreamCreate(&exeInfo.streams[i]));
......@@ -242,12 +227,12 @@ void ExecuteTransfers(EnvVars const& ev,
HIP_CALL(hipEventCreate(&exeInfo.stopEvents[i]));
}
// Assign each transfer its portion of threadblock parameters
int transferOffset = 0;
for (int i = 0; i < exeInfo.transfers.size(); i++)
if (exeType == EXE_GPU_GFX)
{
exeInfo.transfers[i]->blockParamGpuPtr = exeInfo.blockParamGpu + transferOffset;
transferOffset += exeInfo.transfers[i]->blockParam.size();
// Allocate one contiguous chunk of GPU memory for threadblock parameters
// This allows support for executing one transfer per stream, or all transfers in a single stream
AllocateMemory(MEM_GPU, exeIndex, exeInfo.totalSubExecs * sizeof(SubExecParam),
(void**)&exeInfo.subExecParamGpu);
}
}
}
......@@ -265,17 +250,20 @@ void ExecuteTransfers(EnvVars const& ev,
{
// Prepare subarrays each threadblock works on and fill src memory with patterned data
Transfer* transfer = exeInfo.transfers[i];
transfer->PrepareBlockParams(ev, transfer->numBytesToCopy / sizeof(float));
exeInfo.totalBytes += transfer->numBytesToCopy;
transfer->PrepareSubExecParams(ev);
transfer->PrepareSrc(ev);
exeInfo.totalBytes += transfer->numBytesActual;
// Copy block parameters to GPU for GPU executors
if (transfer->exeMemType == MEM_GPU)
if (transfer->exeType == EXE_GPU_GFX)
{
HIP_CALL(hipMemcpy(&exeInfo.blockParamGpu[transferOffset],
transfer->blockParam.data(),
transfer->blockParam.size() * sizeof(BlockParam),
exeInfo.transfers[i]->subExecParamGpuPtr = exeInfo.subExecParamGpu + transferOffset;
HIP_CALL(hipMemcpy(&exeInfo.subExecParamGpu[transferOffset],
transfer->subExecParam.data(),
transfer->subExecParam.size() * sizeof(SubExecParam),
hipMemcpyHostToDevice));
transferOffset += transfer->blockParam.size();
transferOffset += transfer->subExecParam.size();
}
}
}
......@@ -296,7 +284,11 @@ void ExecuteTransfers(EnvVars const& ev,
for (Transfer& transfer : transfers)
{
printf("Transfer %03d: SRC: %p DST: %p\n", transfer.transferIndex, transfer.srcMem, transfer.dstMem);
printf("Transfer %03d:\n", transfer.transferIndex);
for (int iSrc = 0; iSrc < transfer.numSrcs; ++iSrc)
printf(" SRC %0d: %p\n", iSrc, transfer.srcMem[iSrc]);
for (int iDst = 0; iDst < transfer.numDsts; ++iDst)
printf(" DST %0d: %p\n", iDst, transfer.dstMem[iDst]);
}
printf("Hit <Enter> to continue: ");
scanf("%*c");
......@@ -310,8 +302,9 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto& exeInfoPair : transferMap)
{
ExecutorInfo& exeInfo = exeInfoPair.second;
int const numTransfersToRun = (IsGpuType(exeInfoPair.first.first) && ev.useSingleStream) ?
1 : exeInfo.transfers.size();
ExeType exeType = exeInfoPair.first.first;
int const numTransfersToRun = (exeType == EXE_GPU_GFX && ev.useSingleStream) ? 1 : exeInfo.transfers.size();
for (int i = 0; i < numTransfersToRun; ++i)
threads.push(std::thread(RunTransfer, std::ref(ev), iteration, std::ref(exeInfo), i));
}
......@@ -349,8 +342,8 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto transferPair : transferList)
{
Transfer* transfer = transferPair.second;
CheckOrFill(MODE_CHECK, transfer->numBytesToCopy / sizeof(float), ev.useMemset, ev.useHipCall, ev.fillPattern, transfer->dstMem + initOffset);
totalBytesTransferred += transfer->numBytesToCopy;
transfer->ValidateDst(ev);
totalBytesTransferred += transfer->numBytesActual;
}
// Report timings
......@@ -363,11 +356,11 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto& exeInfoPair : transferMap)
{
ExecutorInfo exeInfo = exeInfoPair.second;
MemType const exeMemType = exeInfoPair.first.first;
ExeType const exeType = exeInfoPair.first.first;
int const exeIndex = exeInfoPair.first.second;
// Compute total time for CPU executors
if (!IsGpuType(exeMemType))
// Compute total time for non-GFX executors
if (exeType != EXE_GPU_GFX)
{
exeInfo.totalTime = 0;
for (auto const& transfer : exeInfo.transfers)
......@@ -380,51 +373,49 @@ void ExecuteTransfers(EnvVars const& ev,
if (verbose && !ev.outputToCsv)
{
printf(" Executor: %cPU %02d (# Transfers %02lu)| %9.3f GB/s | %8.3f ms | %12lu bytes\n",
MemTypeStr[exeMemType], exeIndex, exeInfo.transfers.size(), exeBandwidthGbs, exeDurationMsec, exeInfo.totalBytes);
printf(" Executor: %3s %02d | %7.3f GB/s | %8.3f ms | %12lu bytes\n",
ExeTypeName[exeType], exeIndex, exeBandwidthGbs, exeDurationMsec, exeInfo.totalBytes);
}
int totalCUs = 0;
for (auto const& transfer : exeInfo.transfers)
{
double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations);
double transferBandwidthGbs = (transfer->numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f;
totalCUs += transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse;
double transferBandwidthGbs = (transfer->numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
totalCUs += transfer->numSubExecs;
if (!verbose) continue;
if (!ev.outputToCsv)
{
printf(" Transfer %02d | %9.3f GB/s | %8.3f ms | %12lu bytes | %c%02d -> %c%02d:(%03d) -> %c%02d\n",
printf(" Transfer %02d | %7.3f GB/s | %8.3f ms | %12lu bytes | %s -> %s%02d:%03d -> %s\n",
transfer->transferIndex,
transferBandwidthGbs,
transferDurationMsec,
transfer->numBytesToCopy,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
MemTypeStr[transfer->dstMemType], transfer->dstIndex);
transfer->numBytesActual,
transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->numSubExecs,
transfer->DstToStr().c_str());
}
else
{
printf("%d,%d,%lu,%c%02d,%c%02d,%c%02d,%d,%.3f,%.3f,%s,%s,%p,%p\n",
testNum, transfer->transferIndex, transfer->numBytesToCopy,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
printf("%d,%d,%lu,%s,%c%02d,%s,%d,%.3f,%.3f,%s,%s\n",
testNum, transfer->transferIndex, transfer->numBytesActual,
transfer->SrcToStr().c_str(),
MemTypeStr[transfer->exeType], transfer->exeIndex,
transfer->DstToStr().c_str(),
transfer->numSubExecs,
transferBandwidthGbs, transferDurationMsec,
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->srcMemType, transfer->srcIndex).c_str(),
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->dstMemType, transfer->dstIndex).c_str(),
transfer->srcMem + initOffset, transfer->dstMem + initOffset);
PtrVectorToStr(transfer->srcMem, initOffset).c_str(),
PtrVectorToStr(transfer->dstMem, initOffset).c_str());
}
}
if (verbose && ev.outputToCsv)
{
printf("%d,ALL,%lu,ALL,%c%02d,ALL,%d,%.3f,%.3f,ALL,ALL,ALL,ALL\n",
printf("%d,ALL,%lu,ALL,%c%02d,ALL,%d,%.3f,%.3f,ALL,ALL\n",
testNum, totalBytesTransferred,
MemTypeStr[exeMemType], exeIndex, totalCUs,
MemTypeStr[exeType], exeIndex, totalCUs,
exeBandwidthGbs, exeDurationMsec);
}
}
......@@ -435,33 +426,31 @@ void ExecuteTransfers(EnvVars const& ev,
{
Transfer* transfer = transferPair.second;
double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations);
double transferBandwidthGbs = (transfer->numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f;
double transferBandwidthGbs = (transfer->numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
maxGpuTime = std::max(maxGpuTime, transferDurationMsec);
if (!verbose) continue;
if (!ev.outputToCsv)
{
printf(" Transfer %02d: %c%02d -> [%cPU %02d:%03d] -> %c%02d | %9.3f GB/s | %8.3f ms | %12lu bytes | %-16s\n",
printf(" Transfer %02d | %7.3f GB/s | %8.3f ms | %12lu bytes | %s -> %s%02d:%03d -> %s\n",
transfer->transferIndex,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transferBandwidthGbs, transferDurationMsec,
transfer->numBytesToCopy,
GetTransferDesc(*transfer).c_str());
transfer->numBytesActual,
transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->numSubExecs,
transfer->DstToStr().c_str());
}
else
{
printf("%d,%d,%lu,%c%02d,%c%02d,%c%02d,%d,%.3f,%.3f,%s,%s,%p,%p\n",
testNum, transfer->transferIndex, transfer->numBytesToCopy,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
printf("%d,%d,%lu,%s,%s%02d,%s,%d,%.3f,%.3f,%s,%s\n",
testNum, transfer->transferIndex, transfer->numBytesActual,
transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->DstToStr().c_str(),
transfer->numSubExecs,
transferBandwidthGbs, transferDurationMsec,
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->srcMemType, transfer->srcIndex).c_str(),
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->dstMemType, transfer->dstIndex).c_str(),
transfer->srcMem + initOffset, transfer->dstMem + initOffset);
PtrVectorToStr(transfer->srcMem, initOffset).c_str(),
PtrVectorToStr(transfer->dstMem, initOffset).c_str());
}
}
}
......@@ -471,12 +460,12 @@ void ExecuteTransfers(EnvVars const& ev,
{
if (!ev.outputToCsv)
{
printf(" Aggregate Bandwidth (CPU timed) | %9.3f GB/s | %8.3f ms | %12lu bytes | Overhead: %.3f ms\n",
printf(" Aggregate (CPU) | %7.3f GB/s | %8.3f ms | %12lu bytes | Overhead: %.3f ms\n",
totalBandwidthGbs, totalCpuTime, totalBytesTransferred, totalCpuTime - maxGpuTime);
}
else
{
printf("%d,ALL,%lu,ALL,ALL,ALL,ALL,%.3f,%.3f,ALL,ALL,ALL,ALL\n",
printf("%d,ALL,%lu,ALL,ALL,ALL,ALL,%.3f,%.3f,ALL,ALL\n",
testNum, totalBytesTransferred, totalBandwidthGbs, totalCpuTime);
}
}
......@@ -485,31 +474,38 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto exeInfoPair : transferMap)
{
ExecutorInfo& exeInfo = exeInfoPair.second;
ExeType const exeType = exeInfoPair.first.first;
int const exeIndex = RemappedIndex(exeInfoPair.first.second, IsCpuType(exeType));
for (auto& transfer : exeInfo.transfers)
{
// Get some aliases to Transfer variables
MemType const& exeMemType = transfer->exeMemType;
MemType const& srcMemType = transfer->srcMemType;
MemType const& dstMemType = transfer->dstMemType;
// Allocate (maximum) source / destination memory based on type / device index
DeallocateMemory(srcMemType, transfer->srcMem, N * sizeof(float) + ev.byteOffset);
DeallocateMemory(dstMemType, transfer->dstMem, N * sizeof(float) + ev.byteOffset);
transfer->blockParam.clear();
for (int iSrc = 0; iSrc < transfer->numSrcs; ++iSrc)
{
MemType const& srcType = transfer->srcType[iSrc];
DeallocateMemory(srcType, transfer->srcMem[iSrc], transfer->numBytesActual + ev.byteOffset);
}
for (int iDst = 0; iDst < transfer->numDsts; ++iDst)
{
MemType const& dstType = transfer->dstType[iDst];
DeallocateMemory(dstType, transfer->dstMem[iDst], transfer->numBytesActual + ev.byteOffset);
}
transfer->subExecParam.clear();
}
MemType const exeMemType = exeInfoPair.first.first;
int const exeIndex = RemappedIndex(exeInfoPair.first.second, exeMemType);
if (exeMemType == MEM_GPU)
if (IsGpuType(exeType))
{
DeallocateMemory(exeMemType, exeInfo.blockParamGpu);
int const numTransfersToRun = ev.useSingleStream ? 1 : exeInfo.transfers.size();
for (int i = 0; i < numTransfersToRun; ++i)
int const numStreams = (int)exeInfo.streams.size();
for (int i = 0; i < numStreams; ++i)
{
HIP_CALL(hipEventDestroy(exeInfo.startEvents[i]));
HIP_CALL(hipEventDestroy(exeInfo.stopEvents[i]));
HIP_CALL(hipStreamDestroy(exeInfo.streams[i]));
}
if (exeType == EXE_GPU_GFX)
{
DeallocateMemory(MEM_GPU, exeInfo.subExecParamGpu);
}
}
}
}
......@@ -531,12 +527,10 @@ void DisplayUsage(char const* cmdName)
printf("Usage: %s config <N>\n", cmdName);
printf(" config: Either:\n");
printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
printf(" - Name of preset benchmark:\n");
printf(" p2p{_rr} - All CPU/GPU pairs benchmark {with remote reads}\n");
printf(" g2g{_rr} - All GPU/GPU pairs benchmark {with remote reads}\n");
printf(" sweep - Sweep across possible sets of Transfers\n");
printf(" rsweep - Randomly sweep across possible sets of Transfers\n");
printf(" - 3rd optional argument used as # of CUs to use (all by default for p2p / 4 for sweep)\n");
printf(" - Name of preset config:\n");
printf(" p2p - Peer-to-peer benchmark tests\n");
printf(" sweep/rsweep - Sweep/random sweep across possible sets of Transfers\n");
printf(" - 3rd/4th optional args for # GPU SubExecs / # CPU SubExecs per Transfer\n");
printf(" N : (Optional) Number of bytes to copy per Transfer.\n");
printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
......@@ -547,7 +541,7 @@ void DisplayUsage(char const* cmdName)
EnvVars::DisplayUsage();
}
int RemappedIndex(int const origIdx, MemType const memType)
int RemappedIndex(int const origIdx, bool const isCpuType)
{
static std::vector<int> remappingCpu;
static std::vector<int> remappingGpu;
......@@ -591,7 +585,7 @@ int RemappedIndex(int const origIdx, MemType const memType)
remappingGpu[i] = mapping[i].second;
}
}
return IsCpuType(memType) ? remappingCpu[origIdx] : remappingGpu[origIdx];
return isCpuType ? remappingCpu[origIdx] : remappingGpu[origIdx];
}
void DisplayTopology(bool const outputToCsv)
......@@ -634,11 +628,11 @@ void DisplayTopology(bool const outputToCsv)
for (int i = 0; i < numCpuDevices; i++)
{
int nodeI = RemappedIndex(i, MEM_CPU);
int nodeI = RemappedIndex(i, true);
printf("NUMA %02d (%02d)%s", i, nodeI, outputToCsv ? "," : "|");
for (int j = 0; j < numCpuDevices; j++)
{
int nodeJ = RemappedIndex(j, MEM_CPU);
int nodeJ = RemappedIndex(j, true);
int numaDist = numa_distance(nodeI, nodeJ);
if (outputToCsv)
printf("%d,", numaDist);
@@ -657,7 +651,7 @@ void DisplayTopology(bool const outputToCsv)
bool isFirst = true;
for (int j = 0; j < numGpuDevices; j++)
{
if (GetClosestNumaNode(RemappedIndex(j, MEM_GPU)) == i)
if (GetClosestNumaNode(RemappedIndex(j, false)) == i)
{
if (isFirst) isFirst = false;
else printf(",");
@@ -678,19 +672,30 @@ void DisplayTopology(bool const outputToCsv)
}
else
{
printf(" |");
for (int j = 0; j < numGpuDevices; j++)
{
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, j));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
printf(" %6s |", archName.c_str());
}
printf("\n");
printf(" |");
for (int j = 0; j < numGpuDevices; j++)
printf(" GPU %02d |", j);
printf(" PCIe Bus ID | Closest NUMA\n");
printf(" PCIe Bus ID | #CUs | Closest NUMA\n");
for (int j = 0; j <= numGpuDevices; j++)
printf("--------+");
printf("--------------+-------------\n");
printf("--------------+------+-------------\n");
}
char pciBusId[20];
for (int i = 0; i < numGpuDevices; i++)
{
int const deviceIdx = RemappedIndex(i, false);
printf("%sGPU %02d%s", outputToCsv ? "" : " ", i, outputToCsv ? "," : " |");
for (int j = 0; j < numGpuDevices; j++)
{
@@ -704,8 +709,8 @@ void DisplayTopology(bool const outputToCsv)
else
{
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(i, MEM_GPU),
RemappedIndex(j, MEM_GPU),
HIP_CALL(hipExtGetLinkTypeAndHopCount(deviceIdx,
RemappedIndex(j, false),
&linkType, &hopCount));
printf("%s%s-%d%s",
outputToCsv ? "" : " ",
@@ -717,44 +722,78 @@ void DisplayTopology(bool const outputToCsv)
hopCount, outputToCsv ? "," : " |");
}
}
HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, RemappedIndex(i, MEM_GPU)));
HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, deviceIdx));
int numDeviceCUs = 0;
HIP_CALL(hipDeviceGetAttribute(&numDeviceCUs, hipDeviceAttributeMultiprocessorCount, deviceIdx));
if (outputToCsv)
printf("%s,%d\n", pciBusId, GetClosestNumaNode(RemappedIndex(i, MEM_GPU)));
printf("%s,%d,%d\n", pciBusId, numDeviceCUs, GetClosestNumaNode(deviceIdx));
else
printf(" %11s | %d \n", pciBusId, GetClosestNumaNode(RemappedIndex(i, MEM_GPU)));
printf(" %11s | %4d | %d\n", pciBusId, numDeviceCUs, GetClosestNumaNode(deviceIdx));
}
}
void ParseMemType(std::string const& token, int const numCpus, int const numGpus, MemType* memType, int* memIndex)
void ParseMemType(std::string const& token, int const numCpus, int const numGpus,
std::vector<MemType>& memTypes, std::vector<int>& memIndices)
{
char typeChar;
if (sscanf(token.c_str(), " %c %d", &typeChar, memIndex) != 2)
int offset = 0, devIndex, inc;
bool found = false;
memTypes.clear();
memIndices.clear();
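// A token may contain several (type, index) pairs, e.g. "G0G1" yields {MEM_GPU:0, MEM_GPU:1};
// MEM_NULL entries are dropped, so a token such as "N0" produces empty type/index lists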
while (sscanf(token.c_str() + offset, " %c %d%n", &typeChar, &devIndex, &inc) == 2)
{
offset += inc;
MemType memType = CharToMemType(typeChar);
if (IsCpuType(memType) && (devIndex < 0 || devIndex >= numCpus))
{
printf("[ERROR] Unable to parse memory type token %s - expecting either 'B,C,G or F' followed by an index\n",
token.c_str());
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, devIndex);
exit(1);
}
if (IsGpuType(memType) && (devIndex < 0 || devIndex >= numGpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, devIndex);
exit(1);
}
switch (typeChar)
found = true;
if (memType != MEM_NULL)
{
memTypes.push_back(memType);
memIndices.push_back(devIndex);
}
}
if (!found)
{
case 'C': case 'c': case 'B': case 'b': case 'U': case 'u':
*memType = (typeChar == 'C' || typeChar == 'c') ? MEM_CPU : ((typeChar == 'B' || typeChar == 'b') ? MEM_CPU_FINE : MEM_CPU_UNPINNED);
if (*memIndex < 0 || *memIndex >= numCpus)
printf("[ERROR] Unable to parse memory type token %s. Expected one of %s followed by an index\n",
token.c_str(), MemTypeStr);
exit(1);
}
}
void ParseExeType(std::string const& token, int const numCpus, int const numGpus,
ExeType &exeType, int& exeIndex)
{
char typeChar;
if (sscanf(token.c_str(), " %c%d", &typeChar, &exeIndex) != 2)
{
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, *memIndex);
printf("[ERROR] Unable to parse valid executor token (%s). Expected one of %s followed by an index\n",
token.c_str(), ExeTypeStr);
exit(1);
}
break;
case 'G': case 'g': case 'F': case 'f':
*memType = (typeChar == 'G' || typeChar == 'g') ? MEM_GPU : MEM_GPU_FINE;
if (*memIndex < 0 || *memIndex >= numGpus)
exeType = CharToExeType(typeChar);
if (IsCpuType(exeType) && (exeIndex < 0 || exeIndex >= numCpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, *memIndex);
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, exeIndex);
exit(1);
}
break;
default:
printf("[ERROR] Unrecognized memory type %s. Expecting either 'B','C','U','G' or 'F'\n", token.c_str());
if (IsGpuType(exeType) && (exeIndex < 0 || exeIndex >= numGpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, exeIndex);
exit(1);
}
}
@@ -777,18 +816,18 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
std::string srcMem;
std::string dstMem;
// If numTransfers < 0, read quads (srcMem, exeMem, dstMem, #CUs)
// If numTransfers < 0, read 5-tuples (srcMem, exeMem, dstMem, #SubExecs, #Bytes)
// otherwise read triples (srcMem, exeMem, dstMem)
bool const advancedMode = (numTransfers < 0);
numTransfers = abs(numTransfers);
int numBlocksToUse;
int numSubExecs;
if (!advancedMode)
{
iss >> numBlocksToUse;
if (numBlocksToUse <= 0 || iss.fail())
iss >> numSubExecs;
if (numSubExecs <= 0 || iss.fail())
{
printf("Parsing error: Number of blocks to use (%d) must be greater than 0\n", numBlocksToUse);
printf("Parsing error: Number of SubExecutors to use (%d) must be greater than 0\n", numSubExecs);
exit(1);
}
}
@@ -799,7 +838,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
Transfer transfer;
transfer.transferIndex = i;
transfer.numBytes = 0;
transfer.numBytesToCopy = 0;
transfer.numBytesActual = 0;
if (!advancedMode)
{
iss >> srcMem >> exeMem >> dstMem;
@@ -812,7 +851,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
else
{
std::string numBytesToken;
iss >> srcMem >> exeMem >> dstMem >> numBlocksToUse >> numBytesToken;
iss >> srcMem >> exeMem >> dstMem >> numSubExecs >> numBytesToken;
if (iss.fail())
{
printf("Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple\n", i+1);
@@ -824,18 +863,33 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
exit(1);
}
char units = numBytesToken.back();
switch (units)
switch (toupper(units))
{
case 'K': numBytes *= 1024; break;
case 'M': numBytes *= 1024*1024; break;
case 'G': numBytes *= 1024*1024*1024; break;
}
}
ParseMemType(srcMem, numCpus, numGpus, transfer.srcType, transfer.srcIndex);
ParseMemType(dstMem, numCpus, numGpus, transfer.dstType, transfer.dstIndex);
ParseExeType(exeMem, numCpus, numGpus, transfer.exeType, transfer.exeIndex);
transfer.numSrcs = (int)transfer.srcType.size();
transfer.numDsts = (int)transfer.dstType.size();
if (transfer.numSrcs == 0 && transfer.numDsts == 0)
{
case 'K': case 'k': numBytes *= 1024; break;
case 'M': case 'm': numBytes *= 1024*1024; break;
case 'G': case 'g': numBytes *= 1024*1024*1024; break;
printf("[ERROR] Transfer must have at least one src or dst\n");
exit(1);
}
if (transfer.exeType == EXE_GPU_DMA && (transfer.numSrcs > 1 || transfer.numDsts > 1))
{
printf("[ERROR] GPU DMA executor can only be used for single source / single dst Transfers\n");
exit(1);
}
ParseMemType(srcMem, numCpus, numGpus, &transfer.srcMemType, &transfer.srcIndex);
ParseMemType(exeMem, numCpus, numGpus, &transfer.exeMemType, &transfer.exeIndex);
ParseMemType(dstMem, numCpus, numGpus, &transfer.dstMemType, &transfer.dstIndex);
transfer.numBlocksToUse = numBlocksToUse;
transfer.numSubExecs = numSubExecs;
transfer.numBytes = numBytes;
transfers.push_back(transfer);
}
@@ -971,158 +1025,31 @@ void CheckPages(char* array, size_t numBytes, int targetId)
}
}
// Helper function to either fill a device pointer with pseudo-random data, or to check to see if it matches
void CheckOrFill(ModeType mode, int N, bool isMemset, bool isHipCall, std::vector<float>const& fillPattern, float* ptr)
{
// Prepare reference result
float* refBuffer = (float*)malloc(N * sizeof(float));
if (isMemset)
{
if (isHipCall)
{
memset(refBuffer, 42, N * sizeof(float));
}
else
{
for (int i = 0; i < N; i++)
refBuffer[i] = 1234.0f;
}
}
else
{
// Fill with repeated pattern if specified
size_t patternLen = fillPattern.size();
if (patternLen > 0)
{
for (int i = 0; i < N; i++)
refBuffer[i] = fillPattern[i % patternLen];
}
else // Otherwise fill with pseudo-random values
{
for (int i = 0; i < N; i++)
refBuffer[i] = (i % 383 + 31);
}
}
// Either fill the memory with the reference buffer, or compare against it
if (mode == MODE_FILL)
{
HIP_CALL(hipMemcpy(ptr, refBuffer, N * sizeof(float), hipMemcpyDefault));
}
else if (mode == MODE_CHECK)
{
float* hostBuffer = (float*) malloc(N * sizeof(float));
HIP_CALL(hipMemcpy(hostBuffer, ptr, N * sizeof(float), hipMemcpyDefault));
for (int i = 0; i < N; i++)
{
if (refBuffer[i] != hostBuffer[i])
{
printf("[ERROR] Mismatch at element %d Ref: %f Actual: %f\n", i, refBuffer[i], hostBuffer[i]);
exit(1);
}
}
free(hostBuffer);
}
free(refBuffer);
}
std::string GetLinkTypeDesc(uint32_t linkType, uint32_t hopCount)
{
char result[10];
switch (linkType)
{
case HSA_AMD_LINK_INFO_TYPE_HYPERTRANSPORT: sprintf(result, " HT-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_QPI : sprintf(result, " QPI-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_PCIE : sprintf(result, "PCIE-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_INFINBAND : sprintf(result, "INFB-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_XGMI : sprintf(result, "XGMI-%d", hopCount); break;
default: sprintf(result, "??????");
}
return result;
}
std::string GetDesc(MemType srcMemType, int srcIndex,
MemType dstMemType, int dstIndex)
{
if (IsCpuType(srcMemType))
{
if (IsCpuType(dstMemType)) return (srcIndex == dstIndex) ? "LOCAL" : "NUMA";
if (IsGpuType(dstMemType)) return "PCIE";
goto error;
}
if (IsGpuType(srcMemType))
{
if (IsCpuType(dstMemType)) return "PCIE";
if (IsGpuType(dstMemType))
{
if (srcIndex == dstIndex) return "LOCAL";
else
{
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(srcIndex, MEM_GPU),
RemappedIndex(dstIndex, MEM_GPU),
&linkType, &hopCount));
return GetLinkTypeDesc(linkType, hopCount);
}
}
}
error:
printf("[ERROR] Unrecognized memory type\n");
exit(1);
}
std::string GetTransferDesc(Transfer const& transfer)
{
return GetDesc(transfer.srcMemType, transfer.srcIndex, transfer.exeMemType, transfer.exeIndex) + "-"
+ GetDesc(transfer.exeMemType, transfer.exeIndex, transfer.dstMemType, transfer.dstIndex);
}
void RunTransfer(EnvVars const& ev, int const iteration,
ExecutorInfo& exeInfo, int const transferIdx)
{
Transfer* transfer = exeInfo.transfers[transferIdx];
// GPU execution agent
if (transfer->exeMemType == MEM_GPU)
if (transfer->exeType == EXE_GPU_GFX)
{
// Switch to executing GPU
int const exeIndex = RemappedIndex(transfer->exeIndex, MEM_GPU);
int const exeIndex = RemappedIndex(transfer->exeIndex, false);
HIP_CALL(hipSetDevice(exeIndex));
hipStream_t& stream = exeInfo.streams[transferIdx];
hipEvent_t& startEvent = exeInfo.startEvents[transferIdx];
hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx];
int const initOffset = ev.byteOffset / sizeof(float);
if (ev.useHipCall)
{
// Record start event
HIP_CALL(hipEventRecord(startEvent, stream));
// Execute hipMemset / hipMemcpy
if (ev.useMemset)
HIP_CALL(hipMemsetAsync(transfer->dstMem + initOffset, 42, transfer->numBytesToCopy, stream));
else
HIP_CALL(hipMemcpyAsync(transfer->dstMem + initOffset,
transfer->srcMem + initOffset,
transfer->numBytesToCopy, hipMemcpyDefault,
stream));
// Record stop event
HIP_CALL(hipEventRecord(stopEvent, stream));
}
else
{
int const numBlocksToRun = ev.useSingleStream ? exeInfo.totalBlocks : transfer->numBlocksToUse;
hipExtLaunchKernelGGL(ev.useMemset ? GpuMemsetKernel : GpuCopyKernel,
// Figure out how many threadblocks to use.
// In single stream mode, all the threadblocks for this GPU are launched
// Otherwise, just launch the threadblocks associated with this single Transfer
int const numBlocksToRun = ev.useSingleStream ? exeInfo.totalSubExecs : transfer->numSubExecs;
hipExtLaunchKernelGGL(GpuKernelTable[ev.gpuKernel],
dim3(numBlocksToRun, 1, 1),
dim3(BLOCKSIZE, 1, 1),
ev.sharedMemBytes, stream,
startEvent, stopEvent,
0, transfer->blockParamGpuPtr);
}
0, transfer->subExecParamGpuPtr);
// Synchronize per iteration, unless in single sync mode, in which case
// synchronize during last warmup / last actual iteration
@@ -1136,14 +1063,15 @@ void RunTransfer(EnvVars const& ev, int const iteration,
if (ev.useSingleStream)
{
// Figure out individual timings for Transfers that were all launched together
for (Transfer* currTransfer : exeInfo.transfers)
{
long long minStartCycle = currTransfer->blockParamGpuPtr[0].startCycle;
long long maxStopCycle = currTransfer->blockParamGpuPtr[0].stopCycle;
for (int i = 1; i < currTransfer->numBlocksToUse; i++)
long long minStartCycle = currTransfer->subExecParamGpuPtr[0].startCycle;
long long maxStopCycle = currTransfer->subExecParamGpuPtr[0].stopCycle;
for (int i = 1; i < currTransfer->numSubExecs; i++)
{
minStartCycle = std::min(minStartCycle, currTransfer->blockParamGpuPtr[i].startCycle);
maxStopCycle = std::max(maxStopCycle, currTransfer->blockParamGpuPtr[i].stopCycle);
minStartCycle = std::min(minStartCycle, currTransfer->subExecParamGpuPtr[i].startCycle);
maxStopCycle = std::max(maxStopCycle, currTransfer->subExecParamGpuPtr[i].stopCycle);
}
int const wallClockRate = GetWallClockRate(exeIndex);
double iterationTimeMs = (maxStopCycle - minStartCycle) / (double)(wallClockRate);
@@ -1157,10 +1085,43 @@ void RunTransfer(EnvVars const& ev, int const iteration,
}
}
}
else if (transfer->exeMemType == MEM_CPU) // CPU execution agent
else if (transfer->exeType == EXE_GPU_DMA)
{
// Switch to executing GPU
int const exeIndex = RemappedIndex(transfer->exeIndex, false);
HIP_CALL(hipSetDevice(exeIndex));
hipStream_t& stream = exeInfo.streams[transferIdx];
hipEvent_t& startEvent = exeInfo.startEvents[transferIdx];
hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx];
HIP_CALL(hipEventRecord(startEvent, stream));
if (transfer->numSrcs == 0 && transfer->numDsts == 1)
{
HIP_CALL(hipMemsetAsync(transfer->dstMem[0],
MEMSET_CHAR, transfer->numBytesActual, stream));
}
else if (transfer->numSrcs == 1 && transfer->numDsts == 1)
{
HIP_CALL(hipMemcpyAsync(transfer->dstMem[0], transfer->srcMem[0],
transfer->numBytesActual, hipMemcpyDefault,
stream));
}
HIP_CALL(hipEventRecord(stopEvent, stream));
HIP_CALL(hipStreamSynchronize(stream));
if (iteration >= 0)
{
// Record GPU timing
float gpuDeltaMsec;
HIP_CALL(hipEventElapsedTime(&gpuDeltaMsec, startEvent, stopEvent));
transfer->transferTime += gpuDeltaMsec;
}
}
else if (transfer->exeType == EXE_CPU) // CPU execution agent
{
// Force this thread and all child threads onto correct NUMA node
int const exeIndex = RemappedIndex(transfer->exeIndex, MEM_CPU);
int const exeIndex = RemappedIndex(transfer->exeIndex, true);
if (numa_run_on_node(exeIndex))
{
printf("[ERROR] Unable to set CPU to NUMA node %d\n", exeIndex);
@@ -1171,12 +1132,12 @@ void RunTransfer(EnvVars const& ev, int const iteration,
auto cpuStart = std::chrono::high_resolution_clock::now();
// Launch child-threads to perform memcopies
for (int i = 0; i < ev.numCpuPerTransfer; i++)
childThreads.push_back(std::thread(ev.useMemset ? CpuMemsetKernel : CpuCopyKernel, std::ref(transfer->blockParam[i])));
// Launch each subExecutor in child-threads to perform memcopies
for (int i = 0; i < transfer->numSubExecs; ++i)
childThreads.push_back(std::thread(CpuReduceKernel, std::ref(transfer->subExecParam[i])));
// Wait for child-threads to finish
for (int i = 0; i < ev.numCpuPerTransfer; i++)
for (int i = 0; i < transfer->numSubExecs; ++i)
childThreads[i].join();
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
@@ -1187,11 +1148,13 @@ void RunTransfer(EnvVars const& ev, int const iteration,
}
}
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, int readMode, int skipCpu)
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N)
{
ev.DisplayP2PBenchmarkEnvVars();
// Collect the number of available CPUs/GPUs on this machine
int const numGpus = ev.numGpuDevices;
int const numCpus = ev.numCpuDevices;
int const numGpus = ev.numGpuDevices;
int const numDevices = numCpus + numGpus;
// Enable peer to peer for each GPU
@@ -1199,52 +1162,38 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
for (int j = 0; j < numGpus; j++)
if (i != j) EnablePeerAccess(i, j);
if (!ev.outputToCsv)
{
printf("Performing copies in each direction of %lu bytes\n", N * sizeof(float));
printf("Using %d threads per NUMA node for CPU copies\n", ev.numCpuPerTransfer);
printf("Using %d CUs per transfer\n", numBlocksToUse);
}
else
{
printf("SRC,DST,Direction,ReadMode,BW(GB/s),Bytes\n");
}
// Perform unidirectional / bidirectional
for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++)
{
// Print header
if (!ev.outputToCsv)
{
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write]\n", isBidirectional ? "Bi" : "Uni",
readMode == 0 ? "Local" : "Remote",
readMode == 0 ? "Remote" : "Local");
printf("%10s", "D/D");
if (!skipCpu)
{
for (int i = 0; i < numCpus; i++)
printf("%7s %02d", "CPU", i);
}
for (int i = 0; i < numGpus; i++)
printf("%7s %02d", "GPU", i);
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write] (GPU-Executor: %s)\n", isBidirectional ? "Bi" : "Uni",
ev.useRemoteRead ? "Remote" : "Local",
ev.useRemoteRead ? "Local" : "Remote",
ev.useDmaCopy ? "DMA" : "GFX");
printf("%10s", "SRC\\DST");
for (int i = 0; i < numCpus; i++) printf("%7s %02d", "CPU", i);
for (int i = 0; i < numGpus; i++) printf("%7s %02d", "GPU", i);
printf("\n");
}
// Loop over all possible src/dst pairs
for (int src = 0; src < numDevices; src++)
{
MemType const& srcMemType = (src < numCpus ? MEM_CPU : MEM_GPU);
if (skipCpu && srcMemType == MEM_CPU) continue;
int srcIndex = (srcMemType == MEM_CPU ? src : src - numCpus);
MemType const srcType = (src < numCpus ? MEM_CPU : MEM_GPU);
int const srcIndex = (srcType == MEM_CPU ? src : src - numCpus);
if (!ev.outputToCsv)
printf("%7s %02d", (srcMemType == MEM_CPU) ? "CPU" : "GPU", srcIndex);
printf("%7s %02d", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex);
for (int dst = 0; dst < numDevices; dst++)
{
MemType const& dstMemType = (dst < numCpus ? MEM_CPU : MEM_GPU);
if (skipCpu && dstMemType == MEM_CPU) continue;
int dstIndex = (dstMemType == MEM_CPU ? dst : dst - numCpus);
double bandwidth = GetPeakBandwidth(ev, N, isBidirectional, readMode, numBlocksToUse,
srcMemType, srcIndex, dstMemType, dstIndex);
MemType const dstType = (dst < numCpus ? MEM_CPU : MEM_GPU);
int const dstIndex = (dstType == MEM_CPU ? dst : dst - numCpus);
double bandwidth = GetPeakBandwidth(ev, N, isBidirectional, srcType, srcIndex, dstType, dstIndex);
if (!ev.outputToCsv)
{
if (bandwidth == 0)
@@ -1254,13 +1203,12 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
}
else
{
printf("%s %02d,%s %02d,%s,%s,%.2f,%lu\n",
srcMemType == MEM_CPU ? "CPU" : "GPU",
srcIndex,
dstMemType == MEM_CPU ? "CPU" : "GPU",
dstIndex,
printf("%s %02d,%s %02d,%s,%s,%s,%.2f,%lu\n",
srcType == MEM_CPU ? "CPU" : "GPU", srcIndex,
dstType == MEM_CPU ? "CPU" : "GPU", dstIndex,
isBidirectional ? "bidirectional" : "unidirectional",
readMode == 0 ? "Local" : "Remote",
ev.useRemoteRead ? "Remote" : "Local",
ev.useDmaCopy ? "DMA" : "GFX",
bandwidth,
N * sizeof(float));
}
@@ -1272,42 +1220,50 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
}
}
double GetPeakBandwidth(EnvVars const& ev,
size_t const N,
double GetPeakBandwidth(EnvVars const& ev, size_t const N,
int const isBidirectional,
int const readMode,
int const numBlocksToUse,
MemType const srcMemType,
int const srcIndex,
MemType const dstMemType,
int const dstIndex)
MemType const srcType, int const srcIndex,
MemType const dstType, int const dstIndex)
{
// Skip bidirectional on same device
if (isBidirectional && srcMemType == dstMemType && srcIndex == dstIndex) return 0.0f;
if (isBidirectional && srcType == dstType && srcIndex == dstIndex) return 0.0f;
int const initOffset = ev.byteOffset / sizeof(float);
// Prepare Transfers
std::vector<Transfer> transfers(2);
transfers[0].srcMemType = transfers[1].dstMemType = srcMemType;
transfers[0].dstMemType = transfers[1].srcMemType = dstMemType;
transfers[0].srcIndex = transfers[1].dstIndex = srcIndex;
transfers[0].dstIndex = transfers[1].srcIndex = dstIndex;
transfers[0].numBytes = transfers[1].numBytes = N * sizeof(float);
transfers[0].numBlocksToUse = transfers[1].numBlocksToUse = numBlocksToUse;
// Either perform (local read + remote write), or (remote read + local write)
transfers[0].exeMemType = (readMode == 0 ? srcMemType : dstMemType);
transfers[1].exeMemType = (readMode == 0 ? dstMemType : srcMemType);
transfers[0].exeIndex = (readMode == 0 ? srcIndex : dstIndex);
transfers[1].exeIndex = (readMode == 0 ? dstIndex : srcIndex);
// SRC -> DST
transfers[0].numSrcs = transfers[0].numDsts = 1;
transfers[0].srcType.push_back(srcType);
transfers[0].dstType.push_back(dstType);
transfers[0].srcIndex.push_back(srcIndex);
transfers[0].dstIndex.push_back(dstIndex);
// DST -> SRC
transfers[1].numSrcs = transfers[1].numDsts = 1;
transfers[1].srcType.push_back(dstType);
transfers[1].dstType.push_back(srcType);
transfers[1].srcIndex.push_back(dstIndex);
transfers[1].dstIndex.push_back(srcIndex);
// Either perform (local read + remote write), or (remote read + local write)
ExeType gpuExeType = ev.useDmaCopy ? EXE_GPU_DMA : EXE_GPU_GFX;
transfers[0].exeType = IsGpuType(ev.useRemoteRead ? dstType : srcType) ? gpuExeType : EXE_CPU;
transfers[1].exeType = IsGpuType(ev.useRemoteRead ? srcType : dstType) ? gpuExeType : EXE_CPU;
transfers[0].exeIndex = (ev.useRemoteRead ? dstIndex : srcIndex);
transfers[1].exeIndex = (ev.useRemoteRead ? srcIndex : dstIndex);
transfers[0].numSubExecs = IsGpuType(transfers[0].exeType) ? ev.numGpuSubExecs : ev.numCpuSubExecs;
transfers[1].numSubExecs = IsGpuType(transfers[1].exeType) ? ev.numGpuSubExecs : ev.numCpuSubExecs;
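// Example: for a GPU0->GPU1 pair with USE_REMOTE_READ unset, transfers[0] executes on GPU0
// (local read, remote write) and transfers[1] executes on GPU1 for the reverse direction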
// Remove (DST->SRC) if not bidirectional
transfers.resize(isBidirectional + 1);
// Abort if executing on NUMA node with no CPUs
for (int i = 0; i <= isBidirectional; i++)
{
if (transfers[i].exeMemType == MEM_CPU && ev.numCpusPerNuma[transfers[i].exeIndex] == 0)
if (transfers[i].exeType == EXE_CPU && ev.numCpusPerNuma[transfers[i].exeIndex] == 0)
return 0;
}
@@ -1318,45 +1274,176 @@ double GetPeakBandwidth(EnvVars const& ev,
for (int i = 0; i <= isBidirectional; i++)
{
double transferDurationMsec = transfers[i].transferTime / (1.0 * ev.numIterations);
double transferBandwidthGbs = (transfers[i].numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f;
double transferBandwidthGbs = (transfers[i].numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
totalBandwidth += transferBandwidthGbs;
}
return totalBandwidth;
}
void Transfer::PrepareBlockParams(EnvVars const& ev, size_t const N)
void Transfer::PrepareSubExecParams(EnvVars const& ev)
{
// Each subExecutor needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each subExecutor works on for this Transfer
// - Partition N as evenly as possible, but try to keep subarray sizes as multiples of BLOCK_BYTES bytes,
// except the very last one, for alignment reasons
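// Illustrative split (example values only, assuming blockBytes = 256, i.e. 64 floats): 262144 floats
// divided across 3 SubExecutors works out to 87360 / 87360 / 87424 floats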
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
int const targetMultiple = ev.blockBytes / sizeof(float);
// Initialize source memory with patterned data
CheckOrFill(MODE_FILL, N, ev.useMemset, ev.useHipCall, ev.fillPattern, this->srcMem + initOffset);
// In some cases, there may not be enough data for all subExecutors
int const maxSubExecToUse = std::min((int)(N + targetMultiple - 1) / targetMultiple, this->numSubExecs);
this->subExecParam.clear();
this->subExecParam.resize(this->numSubExecs);
// Each block needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each block does for this Transfer
// - Partition N as evenly as possible, but try to keep blocks as multiples of BLOCK_BYTES bytes,
// except the very last one, for alignment reasons
int const targetMultiple = ev.blockBytes / sizeof(float);
int const maxNumBlocksToUse = std::min((N + targetMultiple - 1) / targetMultiple, this->blockParam.size());
size_t assigned = 0;
for (int j = 0; j < this->blockParam.size(); j++)
for (int i = 0; i < this->numSubExecs; ++i)
{
int const blocksLeft = std::max(0, maxNumBlocksToUse - j);
int const subExecLeft = std::max(0, maxSubExecToUse - i);
size_t const leftover = N - assigned;
size_t const roundedN = (leftover + targetMultiple - 1) / targetMultiple;
BlockParam& param = this->blockParam[j];
param.N = blocksLeft ? std::min(leftover, ((roundedN / blocksLeft) * targetMultiple)) : 0;
param.src = this->srcMem + assigned + initOffset;
param.dst = this->dstMem + assigned + initOffset;
param.startCycle = 0;
param.stopCycle = 0;
assigned += param.N;
SubExecParam& p = this->subExecParam[i];
p.N = subExecLeft ? std::min(leftover, ((roundedN / subExecLeft) * targetMultiple)) : 0;
p.numSrcs = this->numSrcs;
p.numDsts = this->numDsts;
for (int iSrc = 0; iSrc < this->numSrcs; ++iSrc)
p.src[iSrc] = this->srcMem[iSrc] + assigned + initOffset;
for (int iDst = 0; iDst < this->numDsts; ++iDst)
p.dst[iDst] = this->dstMem[iDst] + assigned + initOffset;
if (ev.enableDebug)
{
printf("Transfer %02d SE:%02d: %10lu floats: %10lu to %10lu\n",
this->transferIndex, i, p.N, assigned, assigned + p.N);
}
p.startCycle = 0;
p.stopCycle = 0;
assigned += p.N;
}
this->transferTime = 0.0;
}
void Transfer::PrepareReference(EnvVars const& ev, std::vector<float>& buffer, int bufferIdx)
{
size_t N = buffer.size();
if (bufferIdx >= 0)
{
size_t patternLen = ev.fillPattern.size();
if (patternLen > 0)
{
for (size_t i = 0; i < N; ++i)
buffer[i] = ev.fillPattern[i % patternLen];
}
else
{
for (size_t i = 0; i < N; ++i)
buffer[i] = (i % 383 + 31) * (bufferIdx + 1);
}
}
else // Destination buffer
{
if (this->numSrcs == 0)
{
// Note: a buffer memset to byte value 75 (0x4B) reads back as floats equal to 13323083.0
memset(buffer.data(), MEMSET_CHAR, N * sizeof(float));
}
else
{
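// Destination reference is the element-wise sum of all source references; e.g. with the default
// pseudo-random pattern and 2 sources, element i is expected to be (i % 383 + 31) * (1 + 2)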
PrepareReference(ev, buffer, 0);
if (this->numSrcs > 1)
{
std::vector<float> temp(N);
for (int srcIdx = 1; srcIdx < this->numSrcs; ++srcIdx)
{
PrepareReference(ev, temp, srcIdx);
for (int i = 0; i < N; ++i)
{
buffer[i] += temp[i];
}
}
}
}
}
}
void Transfer::PrepareSrc(EnvVars const& ev)
{
if (this->numSrcs == 0) return;
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
std::vector<float> reference(N);
for (int srcIdx = 0; srcIdx < this->numSrcs; ++srcIdx)
{
PrepareReference(ev, reference, srcIdx);
HIP_CALL(hipMemcpy(this->srcMem[srcIdx] + initOffset, reference.data(), this->numBytesActual, hipMemcpyDefault));
}
}
void Transfer::ValidateDst(EnvVars const& ev)
{
if (this->numDsts == 0) return;
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
std::vector<float> reference(N);
PrepareReference(ev, reference, -1);
std::vector<float> hostBuffer(N);
for (int dstIdx = 0; dstIdx < this->numDsts; ++dstIdx)
{
float* output;
if (IsCpuType(this->dstType[dstIdx]))
{
output = this->dstMem[dstIdx] + initOffset;
}
else
{
HIP_CALL(hipMemcpy(hostBuffer.data(), this->dstMem[dstIdx] + initOffset, this->numBytesActual, hipMemcpyDefault));
output = hostBuffer.data();
}
for (size_t i = 0; i < N; ++i)
{
if (reference[i] != output[i])
{
printf("\n[ERROR] Destination array %d value at index %lu (%.3f) does not match expected value (%.3f)\n",
dstIdx, i, output[i], reference[i]);
printf("[ERROR] Failed Transfer details: #%d: %s -> [%c%d:%d] -> %s\n",
this->transferIndex,
this->SrcToStr().c_str(),
ExeTypeStr[this->exeType], this->exeIndex,
this->numSubExecs,
this->DstToStr().c_str());
exit(1);
}
}
}
}
std::string Transfer::SrcToStr() const
{
if (numSrcs == 0) return "N";
std::stringstream ss;
for (int i = 0; i < numSrcs; ++i)
ss << MemTypeStr[srcType[i]] << srcIndex[i];
return ss.str();
}
std::string Transfer::DstToStr() const
{
if (numDsts == 0) return "N";
std::stringstream ss;
for (int i = 0; i < numDsts; ++i)
ss << MemTypeStr[dstType[i]] << dstIndex[i];
return ss.str();
}
// NOTE: This is a stop-gap solution until HIP provides wallclock values
int GetWallClockRate(int deviceId)
{
@@ -1385,27 +1472,27 @@ int GetWallClockRate(int deviceId)
return wallClockPerDeviceMhz[deviceId];
}
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numBlocksToUse, bool const isRandom)
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numGpuSubExecs, int const numCpuSubExecs, bool const isRandom)
{
ev.DisplaySweepEnvVars();
// Compute how many possible Transfers are permitted (unique SRC/EXE/DST triplets)
std::vector<std::pair<MemType, int>> exeList;
std::vector<std::pair<ExeType, int>> exeList;
for (auto exe : ev.sweepExe)
{
MemType const exeMemType = CharToMemType(exe);
if (IsGpuType(exeMemType))
ExeType const exeType = CharToExeType(exe);
if (IsGpuType(exeType))
{
for (int exeIndex = 0; exeIndex < ev.numGpuDevices; ++exeIndex)
exeList.push_back(std::make_pair(exeMemType, exeIndex));
exeList.push_back(std::make_pair(exeType, exeIndex));
}
else
else if (IsCpuType(exeType))
{
for (int exeIndex = 0; exeIndex < ev.numCpuDevices; ++exeIndex)
{
// Skip NUMA nodes that have no CPUs (e.g. CXL)
if (ev.numCpusPerNuma[exeIndex] == 0) continue;
exeList.push_back(std::make_pair(exeMemType, exeIndex));
exeList.push_back(std::make_pair(exeType, exeIndex));
}
}
}
@@ -1414,11 +1501,11 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
std::vector<std::pair<MemType, int>> srcList;
for (auto src : ev.sweepSrc)
{
MemType const srcMemType = CharToMemType(src);
int const numDevices = IsGpuType(srcMemType) ? ev.numGpuDevices : ev.numCpuDevices;
MemType const srcType = CharToMemType(src);
int const numDevices = IsGpuType(srcType) ? ev.numGpuDevices : ev.numCpuDevices;
for (int srcIndex = 0; srcIndex < numDevices; ++srcIndex)
srcList.push_back(std::make_pair(srcMemType, srcIndex));
srcList.push_back(std::make_pair(srcType, srcIndex));
}
int numSrcs = srcList.size();
@@ -1426,20 +1513,20 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
std::vector<std::pair<MemType, int>> dstList;
for (auto dst : ev.sweepDst)
{
MemType const dstMemType = CharToMemType(dst);
int const numDevices = IsGpuType(dstMemType) ? ev.numGpuDevices : ev.numCpuDevices;
MemType const dstType = CharToMemType(dst);
int const numDevices = IsGpuType(dstType) ? ev.numGpuDevices : ev.numCpuDevices;
for (int dstIndex = 0; dstIndex < numDevices; ++dstIndex)
dstList.push_back(std::make_pair(dstMemType, dstIndex));
dstList.push_back(std::make_pair(dstType, dstIndex));
}
int numDsts = dstList.size();
// Build array of possibilities, respecting any additional restrictions (e.g. XGMI hop count)
struct TransferInfo
{
MemType srcMemType; int srcIndex;
MemType exeMemType; int exeIndex;
MemType dstMemType; int dstIndex;
MemType srcType; int srcIndex;
ExeType exeType; int exeIndex;
MemType dstType; int dstIndex;
};
// If either XGMI minimum is non-zero, or XGMI maximum is specified and non-zero then both links must be XGMI
@@ -1451,7 +1538,7 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
{
// Skip CPU executors if XGMI link must be used
if (useXgmiOnly && !IsGpuType(exeList[i].first)) continue;
tinfo.exeMemType = exeList[i].first;
tinfo.exeType = exeList[i].first;
tinfo.exeIndex = exeList[i].second;
bool isXgmiSrc = false;
@@ -1463,8 +1550,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
if (exeList[i].second != srcList[j].second)
{
uint32_t exeToSrcLinkType, exeToSrcHopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, MEM_GPU),
RemappedIndex(srcList[j].second, MEM_GPU),
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, false),
RemappedIndex(srcList[j].second, false),
&exeToSrcLinkType,
&exeToSrcHopCount));
isXgmiSrc = (exeToSrcLinkType == HSA_AMD_LINK_INFO_TYPE_XGMI);
@@ -1484,7 +1571,7 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
}
else if (useXgmiOnly) continue;
tinfo.srcMemType = srcList[j].first;
tinfo.srcType = srcList[j].first;
tinfo.srcIndex = srcList[j].second;
bool isXgmiDst = false;
@@ -1496,8 +1583,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
if (exeList[i].second != dstList[k].second)
{
uint32_t exeToDstLinkType, exeToDstHopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, MEM_GPU),
RemappedIndex(dstList[k].second, MEM_GPU),
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, false),
RemappedIndex(dstList[k].second, false),
&exeToDstLinkType,
&exeToDstHopCount));
isXgmiDst = (exeToDstLinkType == HSA_AMD_LINK_INFO_TYPE_XGMI);
@@ -1519,7 +1606,7 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
// Skip this DST if total XGMI distance (SRC + DST) is greater than max limit
if (ev.sweepXgmiMax >= 0 && (numHopsSrc + numHopsDst) > ev.sweepXgmiMax) continue;
tinfo.dstMemType = dstList[k].first;
tinfo.dstType = dstList[k].first;
tinfo.dstIndex = dstList[k].second;
possibleTransfers.push_back(tinfo);
@@ -1580,13 +1667,15 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
{
// Convert integer value to (SRC->EXE->DST) triplet
Transfer transfer;
transfer.srcMemType = possibleTransfers[value].srcMemType;
transfer.srcIndex = possibleTransfers[value].srcIndex;
transfer.exeMemType = possibleTransfers[value].exeMemType;
transfer.numSrcs = 1;
transfer.numDsts = 1;
transfer.srcType = {possibleTransfers[value].srcType};
transfer.srcIndex = {possibleTransfers[value].srcIndex};
transfer.exeType = possibleTransfers[value].exeType;
transfer.exeIndex = possibleTransfers[value].exeIndex;
transfer.dstMemType = possibleTransfers[value].dstMemType;
transfer.dstIndex = possibleTransfers[value].dstIndex;
transfer.numBlocksToUse = IsGpuType(transfer.exeMemType) ? numBlocksToUse : ev.numCpuPerTransfer;
transfer.dstType = {possibleTransfers[value].dstType};
transfer.dstIndex = {possibleTransfers[value].dstIndex};
transfer.numSubExecs = IsGpuType(transfer.exeType) ? numGpuSubExecs : numCpuSubExecs;
transfer.transferIndex = transfers.size();
transfer.numBytes = ev.sweepRandBytes ? randSize(*ev.generator) * sizeof(float) : 0;
transfers.push_back(transfer);
@@ -1636,12 +1725,23 @@ void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& tran
for (auto const& transfer : transfers)
{
fprintf(fp, " (%c%d->%c%d->%c%d %d %lu)",
MemTypeStr[transfer.srcMemType], transfer.srcIndex,
MemTypeStr[transfer.exeMemType], transfer.exeIndex,
MemTypeStr[transfer.dstMemType], transfer.dstIndex,
transfer.numBlocksToUse,
MemTypeStr[transfer.srcType[0]], transfer.srcIndex[0],
ExeTypeStr[transfer.exeType], transfer.exeIndex,
MemTypeStr[transfer.dstType[0]], transfer.dstIndex[0],
transfer.numSubExecs,
transfer.numBytes);
}
fprintf(fp, "\n");
fflush(fp);
}
std::string PtrVectorToStr(std::vector<float*> const& strVector, int const initOffset)
{
std::stringstream ss;
for (int i = 0; i < strVector.size(); ++i)
{
if (i) ss << " ";
ss << (strVector[i] + initOffset);
}
return ss.str();
}
/*
Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -35,20 +35,20 @@ THE SOFTWARE.
#include <hip/hip_ext.h>
#include <hsa/hsa_ext_amd.h>
#include "EnvVars.hpp"
// Helper macro for catching HIP errors
#define HIP_CALL(cmd) \
do { \
hipError_t error = (cmd); \
if (error != hipSuccess) \
{ \
std::cerr << "Encountered HIP error (" << hipGetErrorString(error) << ") at line " \
<< __LINE__ << " in file " << __FILE__ << "\n"; \
std::cerr << "Encountered HIP error (" << hipGetErrorString(error) \
<< ") at line " << __LINE__ << " in file " << __FILE__ << "\n"; \
exit(-1); \
} \
} while (0)
#include "EnvVars.hpp"
// Simple configuration parameters
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<26); // Amount of data transferred per Transfer
@@ -59,92 +59,92 @@ typedef enum
MEM_GPU = 1, // Coarse-grained global GPU memory
MEM_CPU_FINE = 2, // Fine-grained pinned CPU memory
MEM_GPU_FINE = 3, // Fine-grained global GPU memory
MEM_CPU_UNPINNED = 4 // Unpinned CPU memory
MEM_CPU_UNPINNED = 4, // Unpinned CPU memory
MEM_NULL = 5, // NULL memory - used for empty
} MemType;
bool IsGpuType(MemType m)
{
return (m == MEM_GPU || m == MEM_GPU_FINE);
}
bool IsCpuType(MemType m)
typedef enum
{
return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED);
}
EXE_CPU = 0, // CPU executor (subExecutor = CPU thread)
EXE_GPU_GFX = 1, // GPU kernel-based executor (subExecutor = threadblock/CU)
EXE_GPU_DMA = 2, // GPU SDMA-based executor (subExecutor = streams)
} ExeType;
bool IsGpuType(MemType m) { return (m == MEM_GPU || m == MEM_GPU_FINE); }
bool IsCpuType(MemType m) { return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED); };
bool IsGpuType(ExeType e) { return (e == EXE_GPU_GFX || e == EXE_GPU_DMA); };
bool IsCpuType(ExeType e) { return (e == EXE_CPU); };
char const MemTypeStr[6] = "CGBFU";
char const MemTypeStr[7] = "CGBFUN";
char const ExeTypeStr[4] = "CGD";
char const ExeTypeName[3][4] = {"CPU", "GPU", "DMA"};
MemType inline CharToMemType(char const c)
{
switch (c)
{
case 'C': return MEM_CPU;
case 'G': return MEM_GPU;
case 'B': return MEM_CPU_FINE;
case 'F': return MEM_GPU_FINE;
case 'U': return MEM_CPU_UNPINNED;
default:
printf("[ERROR] Unexpected mem type (%c)\n", c);
char const* val = strchr(MemTypeStr, toupper(c));
if (*val) return (MemType)(val - MemTypeStr);
printf("[ERROR] Unexpected memory type (%c)\n", c);
exit(1);
}
}
typedef enum
{
MODE_FILL = 0, // Fill data with pattern
MODE_CHECK = 1 // Check data against pattern
} ModeType;
// Each threadblock copies N floats from src to dst
struct BlockParam
ExeType inline CharToExeType(char const c)
{
int N;
float* src;
float* dst;
long long startCycle;
long long stopCycle;
};
char const* val = strchr(ExeTypeStr, toupper(c));
if (val && *val) return (ExeType)(val - ExeTypeStr);
printf("[ERROR] Unexpected executor type (%c)\n", c);
exit(1);
}
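// For example, CharToMemType('F') returns MEM_GPU_FINE (index 3 in MemTypeStr) and
// CharToExeType('d') returns EXE_GPU_DMA (index 2 in ExeTypeStr)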
// Each Transfer is a uni-direction operation from a src memory to dst memory
// Each Transfer performs reads from source memory location(s), sums them (if multiple sources are specified)
// then writes the summation to each of the specified destination memory location(s)
struct Transfer
{
int transferIndex; // Transfer identifier
// Transfer config
MemType exeMemType; // Transfer executor type (CPU or GPU)
int transferIndex; // Transfer identifier (within a Test)
ExeType exeType; // Transfer executor type
int exeIndex; // Executor index (NUMA node for CPU / device ID for GPU)
MemType srcMemType; // Source memory type
int srcIndex; // Source device index
MemType dstMemType; // Destination memory type
int dstIndex; // Destination device index
int numBlocksToUse; // Number of threadblocks to use for this Transfer
size_t numBytes; // Number of bytes to Transfer
size_t numBytesToCopy; // Number of bytes to copy
// Memory
float* srcMem; // Source memory
float* dstMem; // Destination memory
// How memory is split across threadblocks / CPU cores
std::vector<BlockParam> blockParam;
BlockParam* blockParamGpuPtr;
int numSubExecs; // Number of subExecutors to use for this Transfer
size_t numBytes; // # of bytes requested to Transfer (may be 0 to fallback to default)
size_t numBytesActual; // Actual number of bytes to copy
double transferTime; // Time taken in milliseconds
// Results
double transferTime;
int numSrcs; // Number of sources
std::vector<MemType> srcType; // Source memory types
std::vector<int> srcIndex; // Source device indices
std::vector<float*> srcMem; // Source memory
// Prepares src memory and how to divide N elements across threadblocks/threads
void PrepareBlockParams(EnvVars const& ev, size_t const N);
};
int numDsts; // Number of destinations
std::vector<MemType> dstType; // Destination memory types
std::vector<int> dstIndex; // Destination device indices
std::vector<float*> dstMem; // Destination memory
std::vector<SubExecParam> subExecParam; // Defines subarrays assigned to each threadblock
SubExecParam* subExecParamGpuPtr; // Pointer to GPU copy of subExecParam
typedef std::pair<MemType, int> Executor;
// Prepares src/dst subarray pointers for each SubExecutor
void PrepareSubExecParams(EnvVars const& ev);
// Prepare source arrays with input data
void PrepareSrc(EnvVars const& ev);
// Validate that destination data contains expected results
void ValidateDst(EnvVars const& ev);
// Prepare reference buffers
void PrepareReference(EnvVars const& ev, std::vector<float>& buffer, int bufferIdx);
// String representation functions
std::string SrcToStr() const;
std::string DstToStr() const;
};
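// Minimal sketch (not part of the original source) of the element-wise work each SubExecutor
// performs over its assigned sub-array, matching the reduce-and-broadcast behavior described above.
// The field names (N, numSrcs, numDsts, src[], dst[]) follow SubExecParam as populated in
// PrepareSubExecParams; the helper name ReduceSketch is hypothetical.
inline void ReduceSketch(SubExecParam const& p)
{
for (size_t i = 0; i < p.N; ++i)
{
float sum = 0.0f;
for (int s = 0; s < p.numSrcs; ++s) sum += p.src[s][i]; // sum across all sources
for (int d = 0; d < p.numDsts; ++d) p.dst[d][i] = sum; // broadcast the sum to all destinations
}
}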
struct ExecutorInfo
{
std::vector<Transfer*> transfers; // Transfers to execute
size_t totalBytes; // Total bytes this executor transfers
int totalSubExecs; // Total number of subExecutors to use
// For GPU-Executors
int totalBlocks; // Total number of CUs/CPU threads to use
BlockParam* blockParamGpu; // Copy of block parameters in GPU device memory
SubExecParam* subExecParamGpu; // GPU copy of subExecutor parameters
std::vector<hipStream_t> streams;
std::vector<hipEvent_t> startEvents;
std::vector<hipEvent_t> stopEvents;
@@ -153,6 +153,7 @@ struct ExecutorInfo
double totalTime;
};
typedef std::pair<ExeType, int> Executor;
typedef std::map<Executor, ExecutorInfo> TransferMap;
// Display usage instructions
@@ -166,7 +167,9 @@ void PopulateTestSizes(size_t const numBytesPerTransfer, int const samplingFacto
std::vector<size_t>& valuesofN);
void ParseMemType(std::string const& token, int const numCpus, int const numGpus,
MemType* memType, int* memIndex);
std::vector<MemType>& memType, std::vector<int>& memIndex);
void ParseExeType(std::string const& token, int const numCpus, int const numGpus,
ExeType& exeType, int& exeIndex);
void ParseTransfers(char* line, int numCpus, int numGpus,
std::vector<Transfer>& transfers);
@@ -178,26 +181,19 @@ void EnablePeerAccess(int const deviceId, int const peerDeviceId);
void AllocateMemory(MemType memType, int devIndex, size_t numBytes, void** memPtr);
void DeallocateMemory(MemType memType, void* memPtr, size_t const size = 0);
void CheckPages(char* byteArray, size_t numBytes, int targetId);
void CheckOrFill(ModeType mode, int N, bool isMemset, bool isHipCall, std::vector<float> const& fillPattern, float* ptr);
void RunTransfer(EnvVars const& ev, int const iteration, ExecutorInfo& exeInfo, int const transferIdx);
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, int readMode, int skipCpu);
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numBlocksToUse, bool const isRandom);
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N);
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numGpuSubExec, int const numCpuSubExec, bool const isRandom);
// Return the maximum bandwidth measured for given (src/dst) pair
double GetPeakBandwidth(EnvVars const& ev,
size_t const N,
double GetPeakBandwidth(EnvVars const& ev, size_t const N,
int const isBidirectional,
int const readMode,
int const numBlocksToUse,
MemType const srcMemType,
int const srcIndex,
MemType const dstMemType,
int const dstIndex);
MemType const srcType, int const srcIndex,
MemType const dstType, int const dstIndex);
std::string GetLinkTypeDesc(uint32_t linkType, uint32_t hopCount);
std::string GetDesc(MemType srcMemType, int srcIndex,
MemType dstMemType, int dstIndex);
std::string GetTransferDesc(Transfer const& transfer);
int RemappedIndex(int const origIdx, MemType const memType);
int RemappedIndex(int const origIdx, bool const isCpuType);
int GetWallClockRate(int deviceId);
void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& transfers);
std::string PtrVectorToStr(std::vector<float*> const& strVector, int const initOffset);
# ConfigFile Format:
# ==================
# A Transfer is defined as a uni-directional copy from src memory location to dst memory location
# executed by either CPU or GPU
# A Transfer is defined as a single operation where an Executor reads and adds together
# values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
# This reduces to a simple copy operation when there is a single SRC and a single DST.
#
# SRC 0 DST 0
# SRC 1 -> Executor -> DST 1
# SRC X DST Y
# Three Executors are supported by TransferBench
# Executor: SubExecutor:
# 1) CPU CPU thread
# 2) GPU GPU threadblock/Compute Unit (CU)
# 3) DMA N/A (may only be used for copies, i.e. a single SRC and a single DST)
# Each line in the configuration file defines a set of Transfers (a Test) to be run in parallel
# There are two ways to specify a Test:
# 1) Basic
# The basic specification assumes the same number of threadblocks/CUs used per GPU-executed Transfer
# The basic specification assumes the same number of SubExecutors (SE) used per Transfer
# A positive number of Transfers is specified followed by that number of triplets describing each Transfer
# #Transfers #CUs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# #Transfers #SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# 2) Advanced
# A negative number of Transfers is specified, followed by quintuplets describing each Transfer
# A non-zero number of bytes specified will override any provided value
# -#Transfers (srcMem1->Executor1->dstMem1 #CUs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #CUsL BytesL)
# -#Transfers (srcMem1->Executor1->dstMem1 #SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #SEsL BytesL)
# Argument Details:
# #Transfers: Number of Transfers to be run in parallel
# #CUs : Number of threadblocks/CUs to use for a GPU-executed Transfer
# srcMemL : Source memory location (Where the data is to be read from). Ignored in memset mode
# #SEs : Number of SubExecutors to use (CPU threads / GPU threadblocks)
# srcMemL : Source memory locations (Where the data is to be read from)
# Executor : Executor is specified by a character indicating type, followed by device index (0-indexed)
# - C: CPU-executed (Indexed from 0 to # NUMA nodes - 1)
# - G: GPU-executed (Indexed from 0 to # GPUs - 1)
# dstMemL : Destination memory location (Where the data is to be written to)
# - D: DMA-executor (Indexed from 0 to # GPUs - 1)
# dstMemL : Destination memory locations (Where the data is to be written to)
# bytesL : Number of bytes to copy (0 means use command-line specified size)
# Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
#
# Memory locations are specified by a character indicating memory type,
# followed by device index (0-indexed)
# Memory locations are specified by one or more (device character / device index) pairs
# Character indicating memory type followed by device index (0-indexed)
# Supported memory locations are:
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# Examples:
# 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
# 1 4 (C1->G2->G0) Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
# 2 4 G0->G0->G1 G1->G1->G0 Copes from GPU0 to GPU1, and GPU1 to GPU0, each with 4 CUs
# -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 2 CUs
# 2 4 G0->G0->G1 G1->G1->G0 Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
# -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
# Round brackets and arrows '->' may be included for human clarity, but will be ignored and are unnecessary
# Lines starting with # will be ignored. Lines starting with ## will be echoed to output
# Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
## Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
1 4 (G0->G0->G1)
# Copies 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs
## Single DMA executed Transfer between GPUs 0 and 1
1 1 (G0->D0->G1)
## Copy 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
## "Memset" by GPU 0 to GPU 0 memory
1 32 (N0->G0->G0)
## "Read-only" by CPU 0
1 4 (C0->C0->N0)
## Broadcast from GPU 0 to GPU 0 and GPU 1
1 16 (G0->G0->G0G1)
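## Reduce (illustrative example following the multi-source format above): sum inputs from GPU 0 and GPU 1 into CPU 0
1 8 (G0G1->G0->C0)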