Unverified commit cc0e9cb4 authored by gilbertlee-amd, committed by GitHub

TransferBench v1.11 (#9)

* Adding MIMO support, DMA executor, Null memory type
parent 3b47b874
# Changelog for TransferBench
## v1.11
### Added
- New multi-input / multi-output (MIMO) support. Transfers can now reduce (element-wise summation) multiple input memory arrays
  and write the sums to multiple outputs (see the sketch after this list)
- New GPU-DMA executor 'D' (uses hipMemcpy for SDMA copies). Previously this behavior required USE_HIP_CALL, which applied
  to all GPU executors globally; the new executor lets GPU-GFX kernels run in parallel with GPU-DMA copies
  - The GPU-DMA executor can only be used for single-input/single-output Transfers
  - The GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only Transfers
- Added new GPU_KERNEL environment variable that allows for switching between various GPU-GFX reduction kernels
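The reduction semantics in a nutshell (an illustrative sketch only, mirroring the new `CpuReduceKernel` in Kernels.hpp): every output element receives the element-wise sum of all input elements, so a 1-input/1-output Transfer degenerates to a plain copy.

```cpp
#include <cstddef>

// Sketch: dst[k][j] = sum over i of src[i][j], for every output k
void MimoReduce(float const* const* src, int numSrcs,
                float* const* dst, int numDsts, size_t N)
{
  for (size_t j = 0; j < N; ++j)
  {
    float sum = 0.0f;
    for (int i = 0; i < numSrcs; ++i) sum += src[i][j];
    for (int k = 0; k < numDsts; ++k) dst[k][j] = sum;
  }
}
```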
### Optimized
- Slightly improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs
### Changed
- Updated the example.cfg file to cover the new features
- Updated output to support MIMO
- Renamed CU / CPU-thread terminology to SubExecutors for consistency
- Sweep Preset:
  - Default sweep preset executors now include DMA
- P2P Benchmarks:
  - Now run only via "p2p"; the "p2p_rr", "g2g" and "g2g_rr" presets have been removed
  - Setting NUM_CPU_DEVICES=0 can be used to benchmark only GPU devices (like the old "g2g")
  - New environment variable USE_REMOTE_READ replaces the "_rr" presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with the GPU-DMA executor
  - The number of GPU SubExecutors for the benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - The number of CPU SubExecutors for the benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted so that each input array within the same Transfer gets a different pattern
### Removed
- USE_HIP_CALL has been removed. Use the GPU-DMA executor 'D', or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently a warning is issued and the program terminates if USE_HIP_CALL is set to 1
- Removed NUM_CPU_PER_TRANSFER - the number of CPU SubExecutors is now whatever is specified for each Transfer
- Removed the USE_MEMSET environment variable. The same effect is achieved via a Transfer using the null memory type (see the sketch below)
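A rough sketch of the replacement behavior: a Transfer with a null-memory source has zero inputs, so executors fill each destination with a fixed value instead of copying (the constant below matches `MEMSET_VAL` in Kernels.hpp; the CPU path instead memsets each byte to `MEMSET_CHAR`).

```cpp
#include <cstddef>

// Sketch of a write-only (memset-style) Transfer: zero sources, fill all destinations
void FillOnly(float* const* dst, int numDsts, size_t N)
{
  for (int k = 0; k < numDsts; ++k)
    for (size_t j = 0; j < N; ++j)
      dst[k][j] = 13323083.0f; // MEMSET_VAL from Kernels.hpp
}
```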
## v1.10
### Fixed
- Fix incorrect bandwidth calculation when using single stream mode and per-Transfer data sizes
...
/*
Copyright (c) 2021-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -26,9 +26,12 @@ THE SOFTWARE.
#include <algorithm>
#include <random>
#include <time.h>
#include "Kernels.hpp"

#define TB_VERSION "1.11"

extern char const MemTypeStr[];
extern char const ExeTypeStr[];

enum ConfigModeEnum
{
...@@ -45,10 +48,13 @@ public:
int const DEFAULT_NUM_WARMUPS = 1;
int const DEFAULT_NUM_ITERATIONS = 10;
int const DEFAULT_SAMPLING_FACTOR = 1;

// Peer-to-peer Benchmark preset defaults
int const DEFAULT_P2P_NUM_CPU_SE = 4;

// Sweep-preset defaults
std::string const DEFAULT_SWEEP_SRC = "CG";
std::string const DEFAULT_SWEEP_EXE = "CDG";
std::string const DEFAULT_SWEEP_DST = "CG";
int const DEFAULT_SWEEP_MIN = 1;
int const DEFAULT_SWEEP_MAX = 24;
...@@ -59,21 +65,24 @@ public:
int blockBytes;      // Each CU, except the last, gets a multiple of this many bytes to copy
int byteOffset;      // Byte-offset for memory allocations
int numCpuDevices;   // Number of CPU devices to use (defaults to # NUMA nodes detected)
int numGpuDevices;   // Number of GPU devices to use (defaults to # HIP devices detected)
int numIterations;   // Number of timed iterations to perform. If negative, run for -numIterations seconds instead
int numWarmups;      // Number of un-timed warmup iterations to perform
int outputToCsv;     // Output in CSV format
int samplingFactor;  // Affects how many different values of N are generated (when N set to 0)
int sharedMemBytes;  // Amount of shared memory to use per threadblock
int useInteractive;  // Pause for user-input before starting transfer loop
int usePcieIndexing; // Base GPU indexing on PCIe address instead of HIP device
int useSingleStream; // Use a single stream per GPU GFX executor instead of stream per Transfer

std::vector<float> fillPattern; // Pattern of floats used to fill source data

// Environment variables only for Benchmark-preset
int useRemoteRead;   // Use destination memory type as executor instead of source memory type
int useDmaCopy;      // Use DMA copy instead of GPU copy
int numGpuSubExecs;  // Number of GPU subexecutors to use
int numCpuSubExecs;  // Number of CPU subexecutors to use

// Environment variables only for Sweep-preset
int sweepMin;        // Min number of simultaneous Transfers to be executed per test
int sweepMax;        // Max number of simultaneous Transfers to be executed per test
...@@ -87,6 +96,10 @@ public:
std::string sweepExe; // Set of executors to be swept
std::string sweepDst; // Set of dst memory types to be swept

// Developer features
int enableDebug;      // Enable debug output
int gpuKernel;        // Which GPU kernel to use

// Used to track current configuration mode
ConfigModeEnum configMode;
...@@ -100,29 +113,48 @@ public:
EnvVars()
{
int maxSharedMemBytes = 0;
HIP_CALL(hipDeviceGetAttribute(&maxSharedMemBytes,
                               hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, 0));
int numDeviceCUs = 0;
HIP_CALL(hipDeviceGetAttribute(&numDeviceCUs, hipDeviceAttributeMultiprocessorCount, 0));

int numDetectedCpus = numa_num_configured_nodes();
int numDetectedGpus;
HIP_CALL(hipGetDeviceCount(&numDetectedGpus));

hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, 0));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));

// Different hardware picks different default GPU kernels
// This performance difference is generally only noticeable when executing with fewer CUs
int defaultGpuKernel = 0;
if      (archName == "gfx906") defaultGpuKernel = 13;
else if (archName == "gfx90a") defaultGpuKernel = 9;

blockBytes      = GetEnvVar("BLOCK_BYTES"     , 256);
byteOffset      = GetEnvVar("BYTE_OFFSET"     , 0);
numCpuDevices   = GetEnvVar("NUM_CPU_DEVICES" , numDetectedCpus);
numGpuDevices   = GetEnvVar("NUM_GPU_DEVICES" , numDetectedGpus);
numIterations   = GetEnvVar("NUM_ITERATIONS"  , DEFAULT_NUM_ITERATIONS);
numWarmups      = GetEnvVar("NUM_WARMUPS"     , DEFAULT_NUM_WARMUPS);
outputToCsv     = GetEnvVar("OUTPUT_TO_CSV"   , 0);
samplingFactor  = GetEnvVar("SAMPLING_FACTOR" , DEFAULT_SAMPLING_FACTOR);
sharedMemBytes  = GetEnvVar("SHARED_MEM_BYTES", maxSharedMemBytes / 2 + 1);
useInteractive  = GetEnvVar("USE_INTERACTIVE" , 0);
usePcieIndexing = GetEnvVar("USE_PCIE_INDEX"  , 0);
useSingleStream = GetEnvVar("USE_SINGLE_STREAM", 0);
enableDebug = GetEnvVar("DEBUG" , 0);
gpuKernel = GetEnvVar("GPU_KERNEL" , defaultGpuKernel);
// P2P Benchmark related
useRemoteRead = GetEnvVar("USE_REMOTE_READ" , 0);
useDmaCopy = GetEnvVar("USE_GPU_DMA" , 0);
numGpuSubExecs = GetEnvVar("NUM_GPU_SE" , useDmaCopy ? 1 : numDeviceCUs);
numCpuSubExecs = GetEnvVar("NUM_CPU_SE" , DEFAULT_P2P_NUM_CPU_SE);
// Sweep related
sweepMin = GetEnvVar("SWEEP_MIN", DEFAULT_SWEEP_MIN);
sweepMax = GetEnvVar("SWEEP_MAX", DEFAULT_SWEEP_MAX);
sweepSrc = GetEnvVar("SWEEP_SRC", DEFAULT_SWEEP_SRC);
...@@ -135,7 +167,6 @@ public:
sweepRandBytes = GetEnvVar("SWEEP_RAND_BYTES", 0);

// Determine random seed
char *sweepSeedStr = getenv("SWEEP_SEED");
sweepSeed = (sweepSeedStr != NULL ? atoi(sweepSeedStr) : time(NULL));
generator = new std::default_random_engine(sweepSeed);
...@@ -224,11 +255,6 @@ public:
printf("[ERROR] SAMPLING_FACTOR must be greater or equal to 1\n");
exit(1);
}
if (sharedMemBytes < 0 || sharedMemBytes > maxSharedMemBytes)
{
printf("[ERROR] SHARED_MEM_BYTES must be between 0 and %d\n", maxSharedMemBytes);
...@@ -239,9 +265,16 @@ public:
printf("[ERROR] BLOCK_BYTES must be a positive multiple of 4\n");
exit(1);
}

if (numGpuSubExecs <= 0)
{
printf("[ERROR] NUM_GPU_SE must be greater than 0\n");
exit(1);
}
if (numCpuSubExecs <= 0)
{
printf("[ERROR] NUM_CPU_SE must be greater than 0\n");
exit(1);
}
...@@ -273,10 +306,9 @@ public:
}
}

for (auto ch : sweepExe)
{
if (!strchr(ExeTypeStr, ch))
{
printf("[ERROR] Unrecognized executor type '%c' specified for sweep executor\n", ch);
exit(1);
...@@ -287,12 +319,30 @@ public:
exit(1);
}
}
if (gpuKernel < 0 || gpuKernel >= NUM_GPU_KERNELS)
{
printf("[ERROR] GPU kernel must be between 0 and %d\n", NUM_GPU_KERNELS - 1);
exit(1);
}
// Determine how many CPUs exist per NUMA node (to avoid executing on NUMA without CPUs)
numCpusPerNuma.resize(numDetectedCpus);
int const totalCpus = numa_num_configured_cpus();
for (int i = 0; i < totalCpus; i++)
numCpusPerNuma[numa_node_of_cpu(i)]++;
// Check for deprecated env vars
if (getenv("USE_HIP_CALL"))
{
printf("[WARN] USE_HIP_CALL has been deprecated. Please use DMA executor 'D' or set USE_GPU_DMA for P2P-Benchmark preset\n");
exit(1);
}
char* enableSdma = getenv("HSA_ENABLE_SDMA");
if (enableSdma && !strcmp(enableSdma, "0"))
{
printf("[WARN] DMA functionality disabled due to environment variable HSA_ENABLE_SDMA=0. Copies will fallback to blit kernels\n");
}
}
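// Sketch only (an assumption, not shown in this diff): GetEnvVar, used throughout the
// constructor above, presumably reads an integer environment variable with a fallback
// default; the std::string overload used for SWEEP_SRC/SWEEP_EXE/SWEEP_DST is analogous.
static int GetEnvVarSketch(std::string const& varname, int defaultValue)
{
char* value = getenv(varname.c_str());
return (value ? atoi(value) : defaultValue);
}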
// Display info on the env vars that can be used
...@@ -304,18 +354,15 @@ public:
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4. Defaults to 0\n");
printf(" FILL_PATTERN=STR - Fill input buffer with pattern specified in hex digits (0-9,a-f,A-F). Must be even number of digits, (byte-level big-endian)\n");
printf(" NUM_CPU_DEVICES=X - Restrict number of CPUs to X. May not be greater than # detected NUMA nodes\n");
printf(" NUM_CPU_PER_TRANSFER=C - Use C threads per Transfer for CPU-executed copies\n"); printf(" NUM_GPU_DEVICES=X - Restrict number of GPUs to X. May not be greater than # detected HIP devices\n");
printf(" NUM_GPU_DEVICES=X - Restrict number of GCPUs to X. May not be greater than # detected HIP devices\n");
printf(" NUM_ITERATIONS=I - Perform I timed iteration(s) per test\n"); printf(" NUM_ITERATIONS=I - Perform I timed iteration(s) per test\n");
printf(" NUM_WARMUPS=W - Perform W untimed warmup iteration(s) per test\n"); printf(" NUM_WARMUPS=W - Perform W untimed warmup iteration(s) per test\n");
printf(" OUTPUT_TO_CSV - Outputs to CSV format if set\n"); printf(" OUTPUT_TO_CSV - Outputs to CSV format if set\n");
printf(" SAMPLING_FACTOR=F - Add F samples (when possible) between powers of 2 when auto-generating data sizes\n"); printf(" SAMPLING_FACTOR=F - Add F samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHARED_MEM_BYTES=X - Use X shared mem bytes per threadblock, potentially to avoid multiple threadblocks per CU\n"); printf(" SHARED_MEM_BYTES=X - Use X shared mem bytes per threadblock, potentially to avoid multiple threadblocks per CU\n");
printf(" USE_HIP_CALL - Use hipMemcpy/hipMemset instead of custom shader kernels for GPU-executed copies\n");
printf(" USE_INTERACTIVE - Pause for user-input before starting transfer loop\n"); printf(" USE_INTERACTIVE - Pause for user-input before starting transfer loop\n");
printf(" USE_MEMSET - Perform a memset instead of a copy (ignores source memory)\n");
printf(" USE_PCIE_INDEX - Index GPUs by PCIe address-ordering instead of HIP-provided indexing\n"); printf(" USE_PCIE_INDEX - Index GPUs by PCIe address-ordering instead of HIP-provided indexing\n");
printf(" USE_SINGLE_STREAM - Use single stream per device instead of per Transfer. Cannot be used with USE_HIP_CALL\n"); printf(" USE_SINGLE_STREAM - Use a single stream per GPU GFX executor instead of stream per Transfer\n");
} }
// Display env var settings
...@@ -331,10 +378,10 @@ public:
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("%-20s = %12d : Using GPU kernel %d [%s]\n", "GPU_KERNEL", gpuKernel, gpuKernel, GpuKernelNames[gpuKernel].c_str());
printf("%-20s = %12d : Using %d CPU devices\n", "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Running %d %s per Test\n", "NUM_ITERATIONS", numIterations,
numIterations > 0 ? numIterations : -numIterations,
...@@ -344,18 +391,8 @@ public:
outputToCsv ? "CSV" : "console");
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Running in %s mode\n", "USE_INTERACTIVE", useInteractive,
useInteractive ? "interactive" : "non-interactive");
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("%-20s = %12d : Using single stream per %s\n", "USE_SINGLE_STREAM",
...@@ -371,23 +408,82 @@ public:
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n", numCpuDevices, numCpuDevices);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("NUM_ITERATIONS,%d,Running %d %s per Test\n", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
}
};
// Display env var for P2P Benchmark preset
void DisplayP2PBenchmarkEnvVars() const
{
if (!outputToCsv)
{
printf("Peer-to-peer Benchmark configuration (TransferBench v%s)\n", TB_VERSION);
printf("=====================================================\n");
printf("%-20s = %12d : Using %s as executor\n", "USE_REMOTE_READ", useRemoteRead , useRemoteRead ? "DST" : "SRC");
printf("%-20s = %12d : Using GPU-%s as GPU executor\n", "USE_GPU_DMA" , useDmaCopy , useDmaCopy ? "DMA" : "GFX");
printf("%-20s = %12d : Using %d CPU subexecutors\n", "NUM_CPU_SE" , numCpuSubExecs, numCpuSubExecs);
printf("%-20s = %12d : Using %d GPU subexecutors\n", "NUM_GPU_SE" , numGpuSubExecs, numGpuSubExecs);
printf("%-20s = %12d : Each CU gets a multiple of %d bytes to copy\n", "BLOCK_BYTES", blockBytes, blockBytes);
printf("%-20s = %12d : Using byte offset of %d\n", "BYTE_OFFSET", byteOffset, byteOffset);
printf("%-20s = %12s : ", "FILL_PATTERN", getenv("FILL_PATTERN") ? "(specified)" : "(unset)");
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Running %d %s per Test\n", "NUM_ITERATIONS", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("%-20s = %12d : Running %d warmup iteration(s) per Test\n", "NUM_WARMUPS", numWarmups, numWarmups);
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Running in %s mode\n", "USE_INTERACTIVE", useInteractive,
useInteractive ? "interactive" : "non-interactive");
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("\n");
}
else
{
printf("EnvVar,Value,Description,(TransferBench v%s)\n", TB_VERSION);
printf("USE_REMOTE_READ,%d,Using %s as executor\n", useRemoteRead, useRemoteRead ? "DST" : "SRC");
printf("USE_GPU_DMA,%d,Using GPU-%s as GPU executor\n", useDmaCopy , useDmaCopy ? "DMA" : "GFX");
printf("NUM_CPU_SE,%d,Using %d CPU subexecutors\n", numCpuSubExecs, numCpuSubExecs);
printf("NUM_GPU_SE,%d,Using %d GPU subexecutors\n", numGpuSubExecs, numGpuSubExecs);
printf("BLOCK_BYTES,%d,Each CU gets a multiple of %d bytes to copy\n", blockBytes, blockBytes);
printf("BYTE_OFFSET,%d,Using byte offset of %d\n", byteOffset, byteOffset);
printf("FILL_PATTERN,%s,", getenv("FILL_PATTERN") ? "(specified)" : "(unset)");
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("NUM_ITERATIONS,%d,Running %d %s per Test\n", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
printf("\n");
}
}
// Display env var settings
void DisplaySweepEnvVars() const
{
...@@ -407,7 +503,6 @@ public:
printf("%-20s = %12d : Max number of XGMI hops for Transfers (-1 = no limit)\n", "SWEEP_XGMI_MAX", sweepXgmiMax);
printf("%-20s = %12d : Using %s number of bytes per Transfer\n", "SWEEP_RAND_BYTES", sweepRandBytes, sweepRandBytes ? "random" : "constant");
printf("%-20s = %12d : Using %d CPU devices\n", "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Each CU gets a multiple of %d bytes to copy\n", "BLOCK_BYTES", blockBytes, blockBytes);
printf("%-20s = %12d : Using byte offset of %d\n", "BYTE_OFFSET", byteOffset, byteOffset);
...@@ -425,14 +520,6 @@ public:
outputToCsv ? "CSV" : "console");
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("%-20s = %12d : Using single stream per %s\n", "USE_SINGLE_STREAM",
...@@ -454,7 +541,6 @@ public:
printf("SWEEP_XGMI_MAX,%d,Max number of XGMI hops for Transfers (-1 = no limit)\n", sweepXgmiMax);
printf("SWEEP_RAND_BYTES,%d,Using %s number of bytes per Transfer\n", sweepRandBytes, sweepRandBytes ? "random" : "constant");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n", numCpuDevices, numCpuDevices);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("BLOCK_BYTES,%d,Each CU gets a multiple of %d bytes to copy\n", blockBytes, blockBytes);
printf("BYTE_OFFSET,%d,Using byte offset of %d\n", byteOffset, byteOffset);
...@@ -469,7 +555,6 @@ public:
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
}
...
/*
Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -22,66 +22,145 @@ THE SOFTWARE.
#pragma once
#define PackedFloat_t float4
#define WARP_SIZE 64
#define BLOCKSIZE 256
#define FLOATS_PER_PACK (sizeof(PackedFloat_t) / sizeof(float))
#define MEMSET_CHAR 75
#define MEMSET_VAL 13323083.0f
// Each subExecutor is provided with subarrays to work on
#define MAX_SRCS 16
#define MAX_DSTS 16
struct SubExecParam
{
size_t N;             // Number of floats this subExecutor works on
int numSrcs;          // Number of source arrays
int numDsts;          // Number of destination arrays
float* src[MAX_SRCS]; // Source array pointers
float* dst[MAX_DSTS]; // Destination array pointers
long long startCycle; // Start timestamp for in-kernel timing (GPU-GFX executor)
long long stopCycle;  // Stop timestamp for in-kernel timing (GPU-GFX executor)
};
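// Illustrative sketch (not part of this file): one plausible way a Transfer of totalN
// floats could be divided among its SubExecutors. Names here are hypothetical; the
// actual partitioning in TransferBench also honors the BLOCK_BYTES granularity.
inline void SplitAcrossSubExecsSketch(SubExecParam* params, int numSubExecs, size_t totalN)
{
size_t const chunk = (totalN + numSubExecs - 1) / numSubExecs;
size_t offset = 0;
for (int i = 0; i < numSubExecs; ++i)
{
params[i].N = (offset + chunk <= totalN) ? chunk : totalN - offset;
offset += params[i].N; // src/dst pointers would be advanced by the same offset
}
}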
void CpuReduceKernel(SubExecParam const& p)
{
int const& numSrcs = p.numSrcs;
int const& numDsts = p.numDsts;
if (numSrcs == 0)
{
for (int i = 0; i < numDsts; ++i)
memset((float* __restrict__)p.dst[i], MEMSET_CHAR, p.N * sizeof(float));
}
else if (numSrcs == 1)
{
float const* __restrict__ src = p.src[0];
for (int i = 0; i < numDsts; ++i)
{
memcpy((float* __restrict__)p.dst[i], src, p.N * sizeof(float));
}
}
else
{
for (int j = 0; j < p.N; j++)
{
float sum = p.src[0][j];
for (int i = 1; i < numSrcs; i++) sum += p.src[i][j];
for (int i = 0; i < numDsts; i++) p.dst[i][j] = sum;
}
}
}
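// Usage sketch (hypothetical): a 2-input / 1-output reduction run by a single CPU
// SubExecutor; in TransferBench each CPU SubExecutor receives one such SubExecParam.
inline void ExampleCpuReduceSketch(float* a, float* b, float* out, size_t N)
{
SubExecParam p = {};
p.N = N;
p.numSrcs = 2; p.src[0] = a; p.src[1] = b;
p.numDsts = 1; p.dst[0] = out;
CpuReduceKernel(p); // out[j] = a[j] + b[j] for all j < N
}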
// Helper function for memset
template <typename T> __device__ __forceinline__ T MemsetVal();
template <> __device__ __forceinline__ float MemsetVal(){ return MEMSET_VAL; };
template <> __device__ __forceinline__ float4 MemsetVal(){ return make_float4(MEMSET_VAL, MEMSET_VAL, MEMSET_VAL, MEMSET_VAL); }
// GPU kernel 0: 3 loops: unrolled float4s, then float4s, then individual floats
template <int LOOP1_UNROLL>
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel(SubExecParam* params)
{
int64_t startCycle = __builtin_amdgcn_s_memrealtime();
// Operate on wavefront granularity // Operate on wavefront granularity
int numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock SubExecParam& p = params[blockIdx.x];
int waveId = threadIdx.x / WARP_SIZE; // Wavefront number int const numSrcs = p.numSrcs;
int threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront int const numDsts = p.numDsts;
int const numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
int const waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int const threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
#define LOOP1_UNROLL 8
// 1st loop - each wavefront operates on LOOP1_UNROLL x FLOATS_PER_PACK per thread per iteration
// Determine the number of packed floats processed by the first loop
size_t Nrem = p.N;
size_t const loop1Npack = (Nrem / (FLOATS_PER_PACK * LOOP1_UNROLL * WARP_SIZE)) * (LOOP1_UNROLL * WARP_SIZE);
size_t const loop1Nelem = loop1Npack * FLOATS_PER_PACK;
size_t const loop1Inc = BLOCKSIZE * LOOP1_UNROLL;
size_t loop1Offset = waveId * LOOP1_UNROLL * WARP_SIZE + threadId;
while (loop1Offset < loop1Npack)
{
PackedFloat_t vals[LOOP1_UNROLL] = {};

if (numSrcs == 0)
{
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u) vals[u] = MemsetVal<float4>();
}
else
{
for (int i = 0; i < numSrcs; ++i)
{
PackedFloat_t const* __restrict__ packedSrc = (PackedFloat_t const*)(p.src[i]) + loop1Offset;
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u)
vals[u] += *(packedSrc + u * WARP_SIZE);
}
}

for (int i = 0; i < numDsts; ++i)
{
PackedFloat_t* __restrict__ packedDst = (PackedFloat_t*)(p.dst[i]) + loop1Offset;
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u) *(packedDst + u * WARP_SIZE) = vals[u];
}
loop1Offset += loop1Inc;
}
Nrem -= loop1Nelem;

if (Nrem > 0)
{
// 2nd loop - Each thread operates on FLOATS_PER_PACK per iteration
// NOTE: Using int32_t due to smaller size requirements
int32_t const loop2Npack = Nrem / FLOATS_PER_PACK;
int32_t const loop2Nelem = loop2Npack * FLOATS_PER_PACK;
int32_t const loop2Inc = BLOCKSIZE;
int32_t loop2Offset = threadIdx.x;
while (loop2Offset < loop2Npack)
{
PackedFloat_t val;
if (numSrcs == 0)
{
val = MemsetVal<float4>();
}
else
{
val = {};
for (int i = 0; i < numSrcs; ++i)
{
PackedFloat_t const* __restrict__ packedSrc = (PackedFloat_t const*)(p.src[i] + loop1Nelem) + loop2Offset;
val += *packedSrc;
}
}
for (int i = 0; i < numDsts; ++i)
{
PackedFloat_t* __restrict__ packedDst = (PackedFloat_t*)(p.dst[i] + loop1Nelem) + loop2Offset;
*packedDst = val;
}
loop2Offset += loop2Inc;
}
Nrem -= loop2Nelem;
...@@ -90,40 +169,221 @@ GpuCopyKernel(BlockParam* blockParams)
if (threadIdx.x < Nrem)
{
int offset = loop1Nelem + loop2Nelem + threadIdx.x;
float val = 0;
if (numSrcs == 0)
{
val = MEMSET_VAL;
}
else
{
for (int i = 0; i < numSrcs; ++i)
val += ((float const* __restrict__)p.src[i])[offset];
}
for (int i = 0; i < numDsts; ++i)
((float* __restrict__)p.dst[i])[offset] = val;
}
}

__syncthreads();
if (threadIdx.x == 0)
{
p.startCycle = startCycle;
p.stopCycle = __builtin_amdgcn_s_memrealtime();
}
}
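// Host-side sketch (an assumption, not shown in this diff): startCycle/stopCycle come
// from __builtin_amdgcn_s_memrealtime(), which ticks at the device's constant wall-clock
// rate, so a per-threadblock duration in milliseconds could be estimated like this:
inline double SubExecMillisecondsSketch(SubExecParam const& p, int deviceId)
{
int wallClockKHz = 0; // wall-clock rate in kHz, i.e. ticks per millisecond
HIP_CALL(hipDeviceGetAttribute(&wallClockKHz, hipDeviceAttributeWallClockRate, deviceId));
return (p.stopCycle - p.startCycle) / (double)wallClockKHz;
}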
template <typename FLOAT_TYPE, int UNROLL_FACTOR>
__device__ size_t GpuReduceFuncImpl2(SubExecParam const &p, size_t const offset, size_t const N)
{
int constexpr numFloatsPerPack = sizeof(FLOAT_TYPE) / sizeof(float); // Number of floats handled at a time per thread
int constexpr numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
size_t constexpr loopPackInc = BLOCKSIZE * UNROLL_FACTOR;
size_t constexpr numPacksPerWave = WARP_SIZE * UNROLL_FACTOR;
int const waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int const threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
size_t const numPacksDone = (numFloatsPerPack == 1 && UNROLL_FACTOR == 1) ? N : (N / (numFloatsPerPack * numPacksPerWave)) * numPacksPerWave;
size_t const numFloatsLeft = N - numPacksDone * numFloatsPerPack;
size_t loopPackOffset = waveId * numPacksPerWave + threadId;
while (loopPackOffset < numPacksDone)
{
FLOAT_TYPE vals[UNROLL_FACTOR];
if (numSrcs == 0)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u) vals[u] = MemsetVal<FLOAT_TYPE>();
}
else
{
FLOAT_TYPE const* __restrict__ src0Ptr = ((FLOAT_TYPE const*)(p.src[0] + offset)) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
vals[u] = *(src0Ptr + u * WARP_SIZE);
for (int i = 1; i < numSrcs; ++i)
{
FLOAT_TYPE const* __restrict__ srcPtr = ((FLOAT_TYPE const*)(p.src[i] + offset)) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
vals[u] += *(srcPtr + u * WARP_SIZE);
}
}
for (int i = 0; i < numDsts; ++i)
{
FLOAT_TYPE* __restrict__ dstPtr = (FLOAT_TYPE*)(p.dst[i] + offset) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
*(dstPtr + u * WARP_SIZE) = vals[u];
}
loopPackOffset += loopPackInc;
}
return numFloatsLeft;
}
template <typename FLOAT_TYPE, int UNROLL_FACTOR>
__device__ size_t GpuReduceFuncImpl(SubExecParam const &p, size_t const offset, size_t const N)
{ {
// Each thread in the block works on UNROLL_FACTOR FLOAT_TYPEs during each iteration of the loop
int constexpr numFloatsPerRead = sizeof(FLOAT_TYPE) / sizeof(float);
size_t constexpr numFloatsPerInnerLoop = BLOCKSIZE * numFloatsPerRead;
size_t constexpr numFloatsPerOuterLoop = numFloatsPerInnerLoop * UNROLL_FACTOR;
size_t const numFloatsLeft = (numFloatsPerRead == 1 && UNROLL_FACTOR == 1) ? 0 : N % numFloatsPerOuterLoop;
size_t const numFloatsDone = N - numFloatsLeft;
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;

for (size_t idx = threadIdx.x * numFloatsPerRead; idx < numFloatsDone; idx += numFloatsPerOuterLoop)
{
FLOAT_TYPE tmp[UNROLL_FACTOR];
if (numSrcs == 0)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] = MemsetVal<FLOAT_TYPE>();
}
else
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] = *((FLOAT_TYPE*)(&p.src[0][offset + idx + u * numFloatsPerInnerLoop]));
for (int i = 1; i < numSrcs; ++i)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] += *((FLOAT_TYPE*)(&p.src[i][offset + idx + u * numFloatsPerInnerLoop]));
}
}
for (int i = 0; i < numDsts; ++i)
{
for (int u = 0; u < UNROLL_FACTOR; ++u)
{
*((FLOAT_TYPE*)(&p.dst[i][offset + idx + u * numFloatsPerInnerLoop])) = tmp[u];
}
}
}
return numFloatsLeft;
}
template <typename FLOAT_TYPE>
__device__ size_t GpuReduceFunc(SubExecParam const &p, size_t const offset, size_t const N, int const unroll)
{
switch (unroll)
{
case 1: return GpuReduceFuncImpl<FLOAT_TYPE, 1>(p, offset, N);
case 2: return GpuReduceFuncImpl<FLOAT_TYPE, 2>(p, offset, N);
case 3: return GpuReduceFuncImpl<FLOAT_TYPE, 3>(p, offset, N);
case 4: return GpuReduceFuncImpl<FLOAT_TYPE, 4>(p, offset, N);
case 5: return GpuReduceFuncImpl<FLOAT_TYPE, 5>(p, offset, N);
case 6: return GpuReduceFuncImpl<FLOAT_TYPE, 6>(p, offset, N);
case 7: return GpuReduceFuncImpl<FLOAT_TYPE, 7>(p, offset, N);
case 8: return GpuReduceFuncImpl<FLOAT_TYPE, 8>(p, offset, N);
case 9: return GpuReduceFuncImpl<FLOAT_TYPE, 9>(p, offset, N);
case 10: return GpuReduceFuncImpl<FLOAT_TYPE, 10>(p, offset, N);
case 11: return GpuReduceFuncImpl<FLOAT_TYPE, 11>(p, offset, N);
case 12: return GpuReduceFuncImpl<FLOAT_TYPE, 12>(p, offset, N);
case 13: return GpuReduceFuncImpl<FLOAT_TYPE, 13>(p, offset, N);
case 14: return GpuReduceFuncImpl<FLOAT_TYPE, 14>(p, offset, N);
case 15: return GpuReduceFuncImpl<FLOAT_TYPE, 15>(p, offset, N);
case 16: return GpuReduceFuncImpl<FLOAT_TYPE, 16>(p, offset, N);
default: return GpuReduceFuncImpl<FLOAT_TYPE, 1>(p, offset, N);
}
} }
// GPU reduction kernel 17 ("8xUnrollB"): dispatches through GpuReduceFunc
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel2(SubExecParam* params)
{
int64_t startCycle = __builtin_amdgcn_s_memrealtime();
SubExecParam& p = params[blockIdx.x];
size_t numFloatsLeft = GpuReduceFunc<float4>(p, 0, p.N, 8);
if (numFloatsLeft)
numFloatsLeft = GpuReduceFunc<float4>(p, p.N - numFloatsLeft, numFloatsLeft, 1);
if (numFloatsLeft)
GpuReduceFunc<float>(p, p.N - numFloatsLeft, numFloatsLeft, 1);
__threadfence_system();
if (threadIdx.x == 0)
{
p.startCycle = startCycle;
p.stopCycle = __builtin_amdgcn_s_memrealtime();
}
} }
#define NUM_GPU_KERNELS 18
typedef void (*GpuKernelFuncPtr)(SubExecParam*);
GpuKernelFuncPtr GpuKernelTable[NUM_GPU_KERNELS] =
{
GpuReduceKernel<8>,
GpuReduceKernel<1>,
GpuReduceKernel<2>,
GpuReduceKernel<3>,
GpuReduceKernel<4>,
GpuReduceKernel<5>,
GpuReduceKernel<6>,
GpuReduceKernel<7>,
GpuReduceKernel<8>,
GpuReduceKernel<9>,
GpuReduceKernel<10>,
GpuReduceKernel<11>,
GpuReduceKernel<12>,
GpuReduceKernel<13>,
GpuReduceKernel<14>,
GpuReduceKernel<15>,
GpuReduceKernel<16>,
GpuReduceKernel2
};
std::string GpuKernelNames[NUM_GPU_KERNELS] =
{
"Default - 8xUnroll",
"Unroll x1",
"Unroll x2",
"Unroll x3",
"Unroll x4",
"Unroll x5",
"Unroll x6",
"Unroll x7",
"Unroll x8",
"Unroll x9",
"Unroll x10",
"Unroll x11",
"Unroll x12",
"Unroll x13",
"Unroll x14",
"Unroll x15",
"Unroll x16",
"8xUnrollB",
};
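// Launch sketch (an assumption): the GPU_KERNEL environment variable indexes
// GpuKernelTable, and each SubExecutor maps to one threadblock. 'numSubExecs',
// 'stream' and 'subExecParamGpu' are hypothetical names for host-prepared values.
inline void LaunchGpuKernelSketch(int gpuKernelIdx, int numSubExecs, int sharedMemBytes,
                                  hipStream_t stream, SubExecParam* subExecParamGpu)
{
hipLaunchKernelGGL(GpuKernelTable[gpuKernelIdx],
                   dim3(numSubExecs), dim3(BLOCKSIZE), sharedMemBytes, stream,
                   subExecParamGpu);
}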
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...
# Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
ROCM_PATH ?= /opt/rocm
HIPCC=$(ROCM_PATH)/bin/hipcc
EXE=TransferBench
CXXFLAGS = -O3 -I. -lnuma -L$(ROCM_PATH)/hsa/lib -lhsa-runtime64 -ferror-limit=5
all: $(EXE)
...
/*
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -30,7 +30,6 @@ THE SOFTWARE.
#include "TransferBench.hpp"
#include "GetClosestNumaNode.hpp"

int main(int argc, char **argv)
{
...@@ -76,30 +75,18 @@ int main(int argc, char **argv)
// - Tests that sweep across possible sets of Transfers
if (!strcmp(argv[1], "sweep") || !strcmp(argv[1], "rsweep"))
{
int numGpuSubExecs = (argc > 3 ? atoi(argv[3]) : 4);
int numCpuSubExecs = (argc > 4 ? atoi(argv[4]) : 4);
ev.configMode = CFG_SWEEP;
RunSweepPreset(ev, numBytesPerTransfer, numGpuSubExecs, numCpuSubExecs, !strcmp(argv[1], "rsweep"));
exit(0);
}
// - Tests that benchmark peer-to-peer performance
else if (!strcmp(argv[1], "p2p"))
{
ev.configMode = CFG_P2P;
RunPeerToPeerBenchmarks(ev, numBytesPerTransfer / sizeof(float));
exit(0);
}
...@@ -116,8 +103,7 @@ int main(int argc, char **argv)
ev.DisplayEnvVars();

if (ev.outputToCsv)
{
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
}

int testNum = 0;
...@@ -170,71 +156,70 @@ void ExecuteTransfers(EnvVars const& ev,
TransferMap transferMap;
for (Transfer& transfer : transfers)
{
Executor executor(transfer.exeType, transfer.exeIndex);
ExecutorInfo& executorInfo = transferMap[executor];
executorInfo.transfers.push_back(&transfer);
}

// Loop over each executor and prepare sub-executors
std::map<int, Transfer*> transferList;
for (auto& exeInfoPair : transferMap)
{
Executor const& executor = exeInfoPair.first;
ExecutorInfo& exeInfo = exeInfoPair.second;
ExeType const exeType = executor.first;
int const exeIndex = RemappedIndex(executor.second, IsCpuType(exeType));

exeInfo.totalTime = 0.0;
exeInfo.totalSubExecs = 0;
// Loop over each transfer this executor is involved in
for (Transfer* transfer : exeInfo.transfers)
{
// Determine how many bytes to copy for this Transfer (use custom if pre-specified)
transfer->numBytesActual = (transfer->numBytes ? transfer->numBytes : N * sizeof(float));

// Allocate source memory
transfer->srcMem.resize(transfer->numSrcs);
for (int iSrc = 0; iSrc < transfer->numSrcs; ++iSrc)
{
MemType const& srcType = transfer->srcType[iSrc];
int const srcIndex = RemappedIndex(transfer->srcIndex[iSrc], IsCpuType(srcType));

// Ensure executing GPU can access source memory
if (IsGpuType(exeType) && IsGpuType(srcType) && srcIndex != exeIndex)
EnablePeerAccess(exeIndex, srcIndex);

AllocateMemory(srcType, srcIndex, transfer->numBytesActual + ev.byteOffset, (void**)&transfer->srcMem[iSrc]);
}

// Allocate destination memory
transfer->dstMem.resize(transfer->numDsts);
for (int iDst = 0; iDst < transfer->numDsts; ++iDst)
{
MemType const& dstType = transfer->dstType[iDst];
int const dstIndex = RemappedIndex(transfer->dstIndex[iDst], IsCpuType(dstType));

// Ensure executing GPU can access destination memory
if (IsGpuType(exeType) && IsGpuType(dstType) && dstIndex != exeIndex)
EnablePeerAccess(exeIndex, dstIndex);

AllocateMemory(dstType, dstIndex, transfer->numBytesActual + ev.byteOffset, (void**)&transfer->dstMem[iDst]);
}

exeInfo.totalSubExecs += transfer->numSubExecs;
transferList[transfer->transferIndex] = transfer;
}
// Prepare additional requirements for GPU-based executors
if (IsGpuType(exeType))
{
// Single-stream is only supported for GFX-based executors
int const numStreamsToUse = (exeType == EXE_GPU_DMA || !ev.useSingleStream) ? exeInfo.transfers.size() : 1;
exeInfo.streams.resize(numStreamsToUse);
exeInfo.startEvents.resize(numStreamsToUse);
exeInfo.stopEvents.resize(numStreamsToUse);
for (int i = 0; i < numStreamsToUse; ++i)
{
HIP_CALL(hipSetDevice(exeIndex));
HIP_CALL(hipStreamCreate(&exeInfo.streams[i]));
...@@ -242,12 +227,12 @@ void ExecuteTransfers(EnvVars const& ev,
HIP_CALL(hipEventCreate(&exeInfo.stopEvents[i]));
}

if (exeType == EXE_GPU_GFX)
{
// Allocate one contiguous chunk of GPU memory for threadblock parameters
// This allows support for executing one transfer per stream, or all transfers in a single stream
AllocateMemory(MEM_GPU, exeIndex, exeInfo.totalSubExecs * sizeof(SubExecParam),
               (void**)&exeInfo.subExecParamGpu);
}
}
}
...@@ -265,17 +250,20 @@ void ExecuteTransfers(EnvVars const& ev,
{
// Prepare subarrays each threadblock works on and fill src memory with patterned data
Transfer* transfer = exeInfo.transfers[i];
transfer->PrepareSubExecParams(ev);
transfer->PrepareSrc(ev);
exeInfo.totalBytes += transfer->numBytesActual;

// Copy block parameters to GPU for GPU executors
if (transfer->exeType == EXE_GPU_GFX)
{
exeInfo.transfers[i]->subExecParamGpuPtr = exeInfo.subExecParamGpu + transferOffset;
HIP_CALL(hipMemcpy(&exeInfo.subExecParamGpu[transferOffset],
                   transfer->subExecParam.data(),
                   transfer->subExecParam.size() * sizeof(SubExecParam),
                   hipMemcpyHostToDevice));
transferOffset += transfer->subExecParam.size();
}
}
}
...@@ -286,7 +274,7 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -286,7 +274,7 @@ void ExecuteTransfers(EnvVars const& ev,
std::stack<std::thread> threads; std::stack<std::thread> threads;
for (int iteration = -ev.numWarmups; ; iteration++) for (int iteration = -ev.numWarmups; ; iteration++)
{ {
if (ev.numIterations > 0 && iteration >= ev.numIterations) break; if (ev.numIterations > 0 && iteration >= ev.numIterations) break;
if (ev.numIterations < 0 && totalCpuTime > -ev.numIterations) break; if (ev.numIterations < 0 && totalCpuTime > -ev.numIterations) break;
// Pause before starting first timed iteration in interactive mode // Pause before starting first timed iteration in interactive mode
...@@ -296,7 +284,11 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -296,7 +284,11 @@ void ExecuteTransfers(EnvVars const& ev,
for (Transfer& transfer : transfers) for (Transfer& transfer : transfers)
{ {
printf("Transfer %03d: SRC: %p DST: %p\n", transfer.transferIndex, transfer.srcMem, transfer.dstMem); printf("Transfer %03d:\n", transfer.transferIndex);
for (int iSrc = 0; iSrc < transfer.numSrcs; ++iSrc)
printf(" SRC %0d: %p\n", iSrc, transfer.srcMem[iSrc]);
for (int iDst = 0; iDst < transfer.numDsts; ++iDst)
printf(" DST %0d: %p\n", iDst, transfer.dstMem[iDst]);
} }
printf("Hit <Enter> to continue: "); printf("Hit <Enter> to continue: ");
scanf("%*c"); scanf("%*c");
...@@ -310,8 +302,9 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -310,8 +302,9 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto& exeInfoPair : transferMap) for (auto& exeInfoPair : transferMap)
{ {
ExecutorInfo& exeInfo = exeInfoPair.second; ExecutorInfo& exeInfo = exeInfoPair.second;
int const numTransfersToRun = (IsGpuType(exeInfoPair.first.first) && ev.useSingleStream) ? ExeType exeType = exeInfoPair.first.first;
1 : exeInfo.transfers.size(); int const numTransfersToRun = (exeType == EXE_GPU_GFX && ev.useSingleStream) ? 1 : exeInfo.transfers.size();
for (int i = 0; i < numTransfersToRun; ++i) for (int i = 0; i < numTransfersToRun; ++i)
threads.push(std::thread(RunTransfer, std::ref(ev), iteration, std::ref(exeInfo), i)); threads.push(std::thread(RunTransfer, std::ref(ev), iteration, std::ref(exeInfo), i));
} }
...@@ -349,8 +342,8 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -349,8 +342,8 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto transferPair : transferList) for (auto transferPair : transferList)
{ {
Transfer* transfer = transferPair.second; Transfer* transfer = transferPair.second;
CheckOrFill(MODE_CHECK, transfer->numBytesToCopy / sizeof(float), ev.useMemset, ev.useHipCall, ev.fillPattern, transfer->dstMem + initOffset); transfer->ValidateDst(ev);
totalBytesTransferred += transfer->numBytesToCopy; totalBytesTransferred += transfer->numBytesActual;
} }
// Report timings // Report timings
...@@ -362,12 +355,12 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -362,12 +355,12 @@ void ExecuteTransfers(EnvVars const& ev,
{ {
for (auto& exeInfoPair : transferMap) for (auto& exeInfoPair : transferMap)
{ {
ExecutorInfo exeInfo = exeInfoPair.second; ExecutorInfo exeInfo = exeInfoPair.second;
MemType const exeMemType = exeInfoPair.first.first; ExeType const exeType = exeInfoPair.first.first;
int const exeIndex = exeInfoPair.first.second; int const exeIndex = exeInfoPair.first.second;
// Compute total time for CPU executors // Compute total time for non GPU executors
if (!IsGpuType(exeMemType)) if (exeType != EXE_GPU_GFX)
{ {
exeInfo.totalTime = 0; exeInfo.totalTime = 0;
for (auto const& transfer : exeInfo.transfers) for (auto const& transfer : exeInfo.transfers)
...@@ -380,51 +373,49 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -380,51 +373,49 @@ void ExecuteTransfers(EnvVars const& ev,
if (verbose && !ev.outputToCsv) if (verbose && !ev.outputToCsv)
{ {
printf(" Executor: %cPU %02d (# Transfers %02lu)| %9.3f GB/s | %8.3f ms | %12lu bytes\n", printf(" Executor: %3s %02d | %7.3f GB/s | %8.3f ms | %12lu bytes\n",
MemTypeStr[exeMemType], exeIndex, exeInfo.transfers.size(), exeBandwidthGbs, exeDurationMsec, exeInfo.totalBytes); ExeTypeName[exeType], exeIndex, exeBandwidthGbs, exeDurationMsec, exeInfo.totalBytes);
} }
int totalCUs = 0; int totalCUs = 0;
for (auto const& transfer : exeInfo.transfers) for (auto const& transfer : exeInfo.transfers)
{ {
double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations); double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations);
double transferBandwidthGbs = (transfer->numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f; double transferBandwidthGbs = (transfer->numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
totalCUs += transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse; totalCUs += transfer->numSubExecs;
if (!verbose) continue; if (!verbose) continue;
if (!ev.outputToCsv) if (!ev.outputToCsv)
{ {
printf(" Transfer %02d | %9.3f GB/s | %8.3f ms | %12lu bytes | %c%02d -> %c%02d:(%03d) -> %c%02d\n", printf(" Transfer %02d | %7.3f GB/s | %8.3f ms | %12lu bytes | %s -> %s%02d:%03d -> %s\n",
transfer->transferIndex, transfer->transferIndex,
transferBandwidthGbs, transferBandwidthGbs,
transferDurationMsec, transferDurationMsec,
transfer->numBytesToCopy, transfer->numBytesActual,
MemTypeStr[transfer->srcMemType], transfer->srcIndex, transfer->SrcToStr().c_str(),
MemTypeStr[transfer->exeMemType], transfer->exeIndex, ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse, transfer->numSubExecs,
MemTypeStr[transfer->dstMemType], transfer->dstIndex); transfer->DstToStr().c_str());
} }
else else
{ {
printf("%d,%d,%lu,%c%02d,%c%02d,%c%02d,%d,%.3f,%.3f,%s,%s,%p,%p\n", printf("%d,%d,%lu,%s,%c%02d,%s,%d,%.3f,%.3f,%s,%s\n",
testNum, transfer->transferIndex, transfer->numBytesToCopy, testNum, transfer->transferIndex, transfer->numBytesActual,
MemTypeStr[transfer->srcMemType], transfer->srcIndex, transfer->SrcToStr().c_str(),
MemTypeStr[transfer->exeMemType], transfer->exeIndex, MemTypeStr[transfer->exeType], transfer->exeIndex,
MemTypeStr[transfer->dstMemType], transfer->dstIndex, transfer->DstToStr().c_str(),
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse, transfer->numSubExecs,
transferBandwidthGbs, transferDurationMsec, transferBandwidthGbs, transferDurationMsec,
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->srcMemType, transfer->srcIndex).c_str(), PtrVectorToStr(transfer->srcMem, initOffset).c_str(),
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->dstMemType, transfer->dstIndex).c_str(), PtrVectorToStr(transfer->dstMem, initOffset).c_str());
transfer->srcMem + initOffset, transfer->dstMem + initOffset);
} }
} }
if (verbose && ev.outputToCsv) if (verbose && ev.outputToCsv)
{ {
printf("%d,ALL,%lu,ALL,%c%02d,ALL,%d,%.3f,%.3f,ALL,ALL,ALL,ALL\n", printf("%d,ALL,%lu,ALL,%c%02d,ALL,%d,%.3f,%.3f,ALL,ALL\n",
testNum, totalBytesTransferred, testNum, totalBytesTransferred,
MemTypeStr[exeMemType], exeIndex, totalCUs, MemTypeStr[exeType], exeIndex, totalCUs,
exeBandwidthGbs, exeDurationMsec); exeBandwidthGbs, exeDurationMsec);
} }
} }
...@@ -435,33 +426,31 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -435,33 +426,31 @@ void ExecuteTransfers(EnvVars const& ev,
{ {
Transfer* transfer = transferPair.second; Transfer* transfer = transferPair.second;
double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations); double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations);
double transferBandwidthGbs = (transfer->numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f; double transferBandwidthGbs = (transfer->numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
maxGpuTime = std::max(maxGpuTime, transferDurationMsec); maxGpuTime = std::max(maxGpuTime, transferDurationMsec);
if (!verbose) continue; if (!verbose) continue;
if (!ev.outputToCsv) if (!ev.outputToCsv)
{ {
printf(" Transfer %02d: %c%02d -> [%cPU %02d:%03d] -> %c%02d | %9.3f GB/s | %8.3f ms | %12lu bytes | %-16s\n", printf(" Transfer %02d | %7.3f GB/s | %8.3f ms | %12lu bytes | %s -> %s%02d:%03d -> %s\n",
transfer->transferIndex, transfer->transferIndex,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transferBandwidthGbs, transferDurationMsec, transferBandwidthGbs, transferDurationMsec,
transfer->numBytesToCopy, transfer->numBytesActual,
GetTransferDesc(*transfer).c_str()); transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->numSubExecs,
transfer->DstToStr().c_str());
} }
else else
{ {
printf("%d,%d,%lu,%c%02d,%c%02d,%c%02d,%d,%.3f,%.3f,%s,%s,%p,%p\n", printf("%d,%d,%lu,%s,%s%02d,%s,%d,%.3f,%.3f,%s,%s\n",
testNum, transfer->transferIndex, transfer->numBytesToCopy, testNum, transfer->transferIndex, transfer->numBytesActual,
MemTypeStr[transfer->srcMemType], transfer->srcIndex, transfer->SrcToStr().c_str(),
MemTypeStr[transfer->exeMemType], transfer->exeIndex, ExeTypeName[transfer->exeType], transfer->exeIndex,
MemTypeStr[transfer->dstMemType], transfer->dstIndex, transfer->DstToStr().c_str(),
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse, transfer->numSubExecs,
transferBandwidthGbs, transferDurationMsec, transferBandwidthGbs, transferDurationMsec,
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->srcMemType, transfer->srcIndex).c_str(), PtrVectorToStr(transfer->srcMem, initOffset).c_str(),
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->dstMemType, transfer->dstIndex).c_str(), PtrVectorToStr(transfer->dstMem, initOffset).c_str());
transfer->srcMem + initOffset, transfer->dstMem + initOffset);
} }
} }
} }
...@@ -471,12 +460,12 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -471,12 +460,12 @@ void ExecuteTransfers(EnvVars const& ev,
{ {
if (!ev.outputToCsv) if (!ev.outputToCsv)
{ {
printf(" Aggregate Bandwidth (CPU timed) | %9.3f GB/s | %8.3f ms | %12lu bytes | Overhead: %.3f ms\n", printf(" Aggregate (CPU) | %7.3f GB/s | %8.3f ms | %12lu bytes | Overhead: %.3f ms\n",
totalBandwidthGbs, totalCpuTime, totalBytesTransferred, totalCpuTime - maxGpuTime); totalBandwidthGbs, totalCpuTime, totalBytesTransferred, totalCpuTime - maxGpuTime);
} }
else else
{ {
printf("%d,ALL,%lu,ALL,ALL,ALL,ALL,%.3f,%.3f,ALL,ALL,ALL,ALL\n", printf("%d,ALL,%lu,ALL,ALL,ALL,ALL,%.3f,%.3f,ALL,ALL\n",
testNum, totalBytesTransferred, totalBandwidthGbs, totalCpuTime); testNum, totalBytesTransferred, totalBandwidthGbs, totalCpuTime);
} }
} }
...@@ -484,32 +473,39 @@ void ExecuteTransfers(EnvVars const& ev, ...@@ -484,32 +473,39 @@ void ExecuteTransfers(EnvVars const& ev,
// Release GPU memory // Release GPU memory
for (auto exeInfoPair : transferMap) for (auto exeInfoPair : transferMap)
{ {
ExecutorInfo& exeInfo = exeInfoPair.second; ExecutorInfo& exeInfo = exeInfoPair.second;
ExeType const exeType = exeInfoPair.first.first;
int const exeIndex = RemappedIndex(exeInfoPair.first.second, IsCpuType(exeType));
for (auto& transfer : exeInfo.transfers) for (auto& transfer : exeInfo.transfers)
{ {
// Get some aliases to Transfer variables for (int iSrc = 0; iSrc < transfer->numSrcs; ++iSrc)
MemType const& exeMemType = transfer->exeMemType; {
MemType const& srcMemType = transfer->srcMemType; MemType const& srcType = transfer->srcType[iSrc];
MemType const& dstMemType = transfer->dstMemType; DeallocateMemory(srcType, transfer->srcMem[iSrc], transfer->numBytesActual + ev.byteOffset);
}
// Allocate (maximum) source / destination memory based on type / device index for (int iDst = 0; iDst < transfer->numDsts; ++iDst)
DeallocateMemory(srcMemType, transfer->srcMem, N * sizeof(float) + ev.byteOffset); {
DeallocateMemory(dstMemType, transfer->dstMem, N * sizeof(float) + ev.byteOffset); MemType const& dstType = transfer->dstType[iDst];
transfer->blockParam.clear(); DeallocateMemory(dstType, transfer->dstMem[iDst], transfer->numBytesActual + ev.byteOffset);
}
transfer->subExecParam.clear();
} }
MemType const exeMemType = exeInfoPair.first.first; if (IsGpuType(exeType))
int const exeIndex = RemappedIndex(exeInfoPair.first.second, exeMemType);
if (exeMemType == MEM_GPU)
{ {
DeallocateMemory(exeMemType, exeInfo.blockParamGpu); int const numStreams = (int)exeInfo.streams.size();
int const numTransfersToRun = ev.useSingleStream ? 1 : exeInfo.transfers.size(); for (int i = 0; i < numStreams; ++i)
for (int i = 0; i < numTransfersToRun; ++i)
{ {
HIP_CALL(hipEventDestroy(exeInfo.startEvents[i])); HIP_CALL(hipEventDestroy(exeInfo.startEvents[i]));
HIP_CALL(hipEventDestroy(exeInfo.stopEvents[i])); HIP_CALL(hipEventDestroy(exeInfo.stopEvents[i]));
HIP_CALL(hipStreamDestroy(exeInfo.streams[i])); HIP_CALL(hipStreamDestroy(exeInfo.streams[i]));
} }
if (exeType == EXE_GPU_GFX)
{
DeallocateMemory(MEM_GPU, exeInfo.subExecParamGpu);
}
} }
} }
} }
...@@ -531,12 +527,10 @@ void DisplayUsage(char const* cmdName) ...@@ -531,12 +527,10 @@ void DisplayUsage(char const* cmdName)
printf("Usage: %s config <N>\n", cmdName); printf("Usage: %s config <N>\n", cmdName);
printf(" config: Either:\n"); printf(" config: Either:\n");
printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n"); printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
printf(" - Name of preset benchmark:\n"); printf(" - Name of preset config:\n");
printf(" p2p{_rr} - All CPU/GPU pairs benchmark {with remote reads}\n"); printf(" p2p - Peer-to-peer benchmark tests\n");
printf(" g2g{_rr} - All GPU/GPU pairs benchmark {with remote reads}\n"); printf(" sweep/rsweep - Sweep/random sweep across possible sets of Transfers\n");
printf(" sweep - Sweep across possible sets of Transfers\n"); printf(" - 3rd/4th optional args for # GPU SubExecs / # CPU SubExecs per Transfer\n");
printf(" rsweep - Randomly sweep across possible sets of Transfers\n");
printf(" - 3rd optional argument used as # of CUs to use (all by default for p2p / 4 for sweep)\n");
printf(" N : (Optional) Number of bytes to copy per Transfer.\n"); printf(" N : (Optional) Number of bytes to copy per Transfer.\n");
printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n", printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER); DEFAULT_BYTES_PER_TRANSFER);
...@@ -547,7 +541,7 @@ void DisplayUsage(char const* cmdName) ...@@ -547,7 +541,7 @@ void DisplayUsage(char const* cmdName)
EnvVars::DisplayUsage(); EnvVars::DisplayUsage();
} }
int RemappedIndex(int const origIdx, MemType const memType) int RemappedIndex(int const origIdx, bool const isCpuType)
{ {
static std::vector<int> remappingCpu; static std::vector<int> remappingCpu;
static std::vector<int> remappingGpu; static std::vector<int> remappingGpu;
...@@ -591,7 +585,7 @@ int RemappedIndex(int const origIdx, MemType const memType) ...@@ -591,7 +585,7 @@ int RemappedIndex(int const origIdx, MemType const memType)
remappingGpu[i] = mapping[i].second; remappingGpu[i] = mapping[i].second;
} }
} }
return IsCpuType(memType) ? remappingCpu[origIdx] : remappingGpu[origIdx]; return isCpuType ? remappingCpu[origIdx] : remappingGpu[origIdx];
} }
void DisplayTopology(bool const outputToCsv) void DisplayTopology(bool const outputToCsv)
...@@ -634,11 +628,11 @@ void DisplayTopology(bool const outputToCsv) ...@@ -634,11 +628,11 @@ void DisplayTopology(bool const outputToCsv)
for (int i = 0; i < numCpuDevices; i++) for (int i = 0; i < numCpuDevices; i++)
{ {
int nodeI = RemappedIndex(i, MEM_CPU); int nodeI = RemappedIndex(i, true);
printf("NUMA %02d (%02d)%s", i, nodeI, outputToCsv ? "," : "|"); printf("NUMA %02d (%02d)%s", i, nodeI, outputToCsv ? "," : "|");
for (int j = 0; j < numCpuDevices; j++) for (int j = 0; j < numCpuDevices; j++)
{ {
int nodeJ = RemappedIndex(j, MEM_CPU); int nodeJ = RemappedIndex(j, true);
int numaDist = numa_distance(nodeI, nodeJ); int numaDist = numa_distance(nodeI, nodeJ);
if (outputToCsv) if (outputToCsv)
printf("%d,", numaDist); printf("%d,", numaDist);
...@@ -657,7 +651,7 @@ void DisplayTopology(bool const outputToCsv) ...@@ -657,7 +651,7 @@ void DisplayTopology(bool const outputToCsv)
bool isFirst = true; bool isFirst = true;
for (int j = 0; j < numGpuDevices; j++) for (int j = 0; j < numGpuDevices; j++)
{ {
if (GetClosestNumaNode(RemappedIndex(j, MEM_GPU)) == i) if (GetClosestNumaNode(RemappedIndex(j, false)) == i)
{ {
if (isFirst) isFirst = false; if (isFirst) isFirst = false;
else printf(","); else printf(",");
...@@ -678,19 +672,30 @@ void DisplayTopology(bool const outputToCsv) ...@@ -678,19 +672,30 @@ void DisplayTopology(bool const outputToCsv)
} }
else else
{ {
printf(" |");
for (int j = 0; j < numGpuDevices; j++)
{
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, j));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
printf(" %6s |", archName.c_str());
}
printf("\n");
printf(" |"); printf(" |");
for (int j = 0; j < numGpuDevices; j++) for (int j = 0; j < numGpuDevices; j++)
printf(" GPU %02d |", j); printf(" GPU %02d |", j);
printf(" PCIe Bus ID | Closest NUMA\n"); printf(" PCIe Bus ID | #CUs | Closest NUMA\n");
for (int j = 0; j <= numGpuDevices; j++) for (int j = 0; j <= numGpuDevices; j++)
printf("--------+"); printf("--------+");
printf("--------------+-------------\n"); printf("--------------+------+-------------\n");
} }
char pciBusId[20]; char pciBusId[20];
for (int i = 0; i < numGpuDevices; i++) for (int i = 0; i < numGpuDevices; i++)
{ {
int const deviceIdx = RemappedIndex(i, false);
printf("%sGPU %02d%s", outputToCsv ? "" : " ", i, outputToCsv ? "," : " |"); printf("%sGPU %02d%s", outputToCsv ? "" : " ", i, outputToCsv ? "," : " |");
for (int j = 0; j < numGpuDevices; j++) for (int j = 0; j < numGpuDevices; j++)
{ {
...@@ -704,8 +709,8 @@ void DisplayTopology(bool const outputToCsv) ...@@ -704,8 +709,8 @@ void DisplayTopology(bool const outputToCsv)
else else
{ {
uint32_t linkType, hopCount; uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(i, MEM_GPU), HIP_CALL(hipExtGetLinkTypeAndHopCount(deviceIdx,
RemappedIndex(j, MEM_GPU), RemappedIndex(j, false),
&linkType, &hopCount)); &linkType, &hopCount));
printf("%s%s-%d%s", printf("%s%s-%d%s",
outputToCsv ? "" : " ", outputToCsv ? "" : " ",
...@@ -717,44 +722,78 @@ void DisplayTopology(bool const outputToCsv) ...@@ -717,44 +722,78 @@ void DisplayTopology(bool const outputToCsv)
hopCount, outputToCsv ? "," : " |"); hopCount, outputToCsv ? "," : " |");
} }
} }
HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, RemappedIndex(i, MEM_GPU))); HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, deviceIdx));
int numDeviceCUs = 0;
HIP_CALL(hipDeviceGetAttribute(&numDeviceCUs, hipDeviceAttributeMultiprocessorCount, deviceIdx));
if (outputToCsv) if (outputToCsv)
printf("%s,%d\n", pciBusId, GetClosestNumaNode(RemappedIndex(i, MEM_GPU))); printf("%s,%d,%d\n", pciBusId, numDeviceCUs, GetClosestNumaNode(deviceIdx));
else else
printf(" %11s | %d \n", pciBusId, GetClosestNumaNode(RemappedIndex(i, MEM_GPU))); printf(" %11s | %4d | %d\n", pciBusId, numDeviceCUs, GetClosestNumaNode(deviceIdx));
} }
} }
void ParseMemType(std::string const& token, int const numCpus, int const numGpus, MemType* memType, int* memIndex) void ParseMemType(std::string const& token, int const numCpus, int const numGpus,
std::vector<MemType>& memTypes, std::vector<int>& memIndices)
{ {
char typeChar; char typeChar;
if (sscanf(token.c_str(), " %c %d", &typeChar, memIndex) != 2) int offset = 0, devIndex, inc;
{ bool found = false;
printf("[ERROR] Unable to parse memory type token %s - expecting either 'B,C,G or F' followed by an index\n",
token.c_str());
exit(1);
}
switch (typeChar) memTypes.clear();
memIndices.clear();
while (sscanf(token.c_str() + offset, " %c %d%n", &typeChar, &devIndex, &inc) == 2)
{ {
case 'C': case 'c': case 'B': case 'b': case 'U': case 'u': offset += inc;
*memType = (typeChar == 'C' || typeChar == 'c') ? MEM_CPU : ((typeChar == 'B' || typeChar == 'b') ? MEM_CPU_FINE : MEM_CPU_UNPINNED); MemType memType = CharToMemType(typeChar);
if (*memIndex < 0 || *memIndex >= numCpus)
if (IsCpuType(memType) && (devIndex < 0 || devIndex >= numCpus))
{ {
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, *memIndex); printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, devIndex);
exit(1); exit(1);
} }
break; if (IsGpuType(memType) && (devIndex < 0 || devIndex >= numGpus))
case 'G': case 'g': case 'F': case 'f':
*memType = (typeChar == 'G' || typeChar == 'g') ? MEM_GPU : MEM_GPU_FINE;
if (*memIndex < 0 || *memIndex >= numGpus)
{ {
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, *memIndex); printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, devIndex);
exit(1); exit(1);
} }
break;
default: found = true;
printf("[ERROR] Unrecognized memory type %s. Expecting either 'B','C','U','G' or 'F'\n", token.c_str()); if (memType != MEM_NULL)
{
memTypes.push_back(memType);
memIndices.push_back(devIndex);
}
}
if (!found)
{
printf("[ERROR] Unable to parse memory type token %s. Expected one of %s followed by an index\n",
token.c_str(), MemTypeStr);
exit(1);
}
}
void ParseExeType(std::string const& token, int const numCpus, int const numGpus,
ExeType &exeType, int& exeIndex)
{
char typeChar;
if (sscanf(token.c_str(), " %c%d", &typeChar, &exeIndex) != 2)
{
printf("[ERROR] Unable to parse valid executor token (%s). Exepected one of %s followed by an index\n",
token.c_str(), ExeTypeStr);
exit(1);
}
exeType = CharToExeType(typeChar);
if (IsCpuType(exeType) && (exeIndex < 0 || exeIndex >= numCpus))
{
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, exeIndex);
exit(1);
}
if (IsGpuType(exeType) && (exeIndex < 0 || exeIndex >= numGpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, exeIndex);
exit(1); exit(1);
} }
} }
...@@ -777,18 +816,18 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>& ...@@ -777,18 +816,18 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
std::string srcMem; std::string srcMem;
std::string dstMem; std::string dstMem;
// If numTransfers < 0, read quads (srcMem, exeMem, dstMem, #CUs) // If numTransfers < 0, read 5-tuple (srcMem, exeMem, dstMem, #CUs, #Bytes)
// otherwise read triples (srcMem, exeMem, dstMem) // otherwise read triples (srcMem, exeMem, dstMem)
bool const advancedMode = (numTransfers < 0); bool const advancedMode = (numTransfers < 0);
numTransfers = abs(numTransfers); numTransfers = abs(numTransfers);
int numBlocksToUse; int numSubExecs;
if (!advancedMode) if (!advancedMode)
{ {
iss >> numBlocksToUse; iss >> numSubExecs;
if (numBlocksToUse <= 0 || iss.fail()) if (numSubExecs <= 0 || iss.fail())
{ {
printf("Parsing error: Number of blocks to use (%d) must be greater than 0\n", numBlocksToUse); printf("Parsing error: Number of blocks to use (%d) must be greater than 0\n", numSubExecs);
exit(1); exit(1);
} }
} }
...@@ -799,7 +838,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>& ...@@ -799,7 +838,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
Transfer transfer; Transfer transfer;
transfer.transferIndex = i; transfer.transferIndex = i;
transfer.numBytes = 0; transfer.numBytes = 0;
transfer.numBytesToCopy = 0; transfer.numBytesActual = 0;
if (!advancedMode) if (!advancedMode)
{ {
iss >> srcMem >> exeMem >> dstMem; iss >> srcMem >> exeMem >> dstMem;
...@@ -812,7 +851,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>& ...@@ -812,7 +851,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
else else
{ {
std::string numBytesToken; std::string numBytesToken;
iss >> srcMem >> exeMem >> dstMem >> numBlocksToUse >> numBytesToken; iss >> srcMem >> exeMem >> dstMem >> numSubExecs >> numBytesToken;
if (iss.fail()) if (iss.fail())
{ {
printf("Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple\n", i+1); printf("Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple\n", i+1);
...@@ -824,18 +863,33 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>& ...@@ -824,18 +863,33 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
exit(1); exit(1);
} }
char units = numBytesToken.back(); char units = numBytesToken.back();
switch (units) switch (toupper(units))
{ {
case 'K': case 'k': numBytes *= 1024; break; case 'K': numBytes *= 1024; break;
case 'M': case 'm': numBytes *= 1024*1024; break; case 'M': numBytes *= 1024*1024; break;
case 'G': case 'g': numBytes *= 1024*1024*1024; break; case 'G': numBytes *= 1024*1024*1024; break;
} }
} }
ParseMemType(srcMem, numCpus, numGpus, &transfer.srcMemType, &transfer.srcIndex); ParseMemType(srcMem, numCpus, numGpus, transfer.srcType, transfer.srcIndex);
ParseMemType(exeMem, numCpus, numGpus, &transfer.exeMemType, &transfer.exeIndex); ParseMemType(dstMem, numCpus, numGpus, transfer.dstType, transfer.dstIndex);
ParseMemType(dstMem, numCpus, numGpus, &transfer.dstMemType, &transfer.dstIndex); ParseExeType(exeMem, numCpus, numGpus, transfer.exeType, transfer.exeIndex);
transfer.numBlocksToUse = numBlocksToUse;
transfer.numSrcs = (int)transfer.srcType.size();
transfer.numDsts = (int)transfer.dstType.size();
if (transfer.numSrcs == 0 && transfer.numDsts == 0)
{
printf("[ERROR] Transfer must have at least one src or dst\n");
exit(1);
}
if (transfer.exeType == EXE_GPU_DMA && (transfer.numSrcs > 1 || transfer.numDsts > 1))
{
printf("[ERROR] GPU DMA executor can only be used for single source / single dst Transfers\n");
exit(1);
}
transfer.numSubExecs = numSubExecs;
transfer.numBytes = numBytes; transfer.numBytes = numBytes;
transfers.push_back(transfer); transfers.push_back(transfer);
} }
...@@ -971,158 +1025,31 @@ void CheckPages(char* array, size_t numBytes, int targetId) ...@@ -971,158 +1025,31 @@ void CheckPages(char* array, size_t numBytes, int targetId)
} }
} }
// Helper function to either fill a device pointer with pseudo-random data, or to check to see if it matches
void CheckOrFill(ModeType mode, int N, bool isMemset, bool isHipCall, std::vector<float>const& fillPattern, float* ptr)
{
// Prepare reference resultx
float* refBuffer = (float*)malloc(N * sizeof(float));
if (isMemset)
{
if (isHipCall)
{
memset(refBuffer, 42, N * sizeof(float));
}
else
{
for (int i = 0; i < N; i++)
refBuffer[i] = 1234.0f;
}
}
else
{
// Fill with repeated pattern if specified
size_t patternLen = fillPattern.size();
if (patternLen > 0)
{
for (int i = 0; i < N; i++)
refBuffer[i] = fillPattern[i % patternLen];
}
else // Otherwise fill with pseudo-random values
{
for (int i = 0; i < N; i++)
refBuffer[i] = (i % 383 + 31);
}
}
// Either fill the memory with the reference buffer, or compare against it
if (mode == MODE_FILL)
{
HIP_CALL(hipMemcpy(ptr, refBuffer, N * sizeof(float), hipMemcpyDefault));
}
else if (mode == MODE_CHECK)
{
float* hostBuffer = (float*) malloc(N * sizeof(float));
HIP_CALL(hipMemcpy(hostBuffer, ptr, N * sizeof(float), hipMemcpyDefault));
for (int i = 0; i < N; i++)
{
if (refBuffer[i] != hostBuffer[i])
{
printf("[ERROR] Mismatch at element %d Ref: %f Actual: %f\n", i, refBuffer[i], hostBuffer[i]);
exit(1);
}
}
free(hostBuffer);
}
free(refBuffer);
}
std::string GetLinkTypeDesc(uint32_t linkType, uint32_t hopCount)
{
char result[10];
switch (linkType)
{
case HSA_AMD_LINK_INFO_TYPE_HYPERTRANSPORT: sprintf(result, " HT-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_QPI : sprintf(result, " QPI-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_PCIE : sprintf(result, "PCIE-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_INFINBAND : sprintf(result, "INFB-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_XGMI : sprintf(result, "XGMI-%d", hopCount); break;
default: sprintf(result, "??????");
}
return result;
}
std::string GetDesc(MemType srcMemType, int srcIndex,
MemType dstMemType, int dstIndex)
{
if (IsCpuType(srcMemType))
{
if (IsCpuType(dstMemType)) return (srcIndex == dstIndex) ? "LOCAL" : "NUMA";
if (IsGpuType(dstMemType)) return "PCIE";
goto error;
}
if (IsGpuType(srcMemType))
{
if (IsCpuType(dstMemType)) return "PCIE";
if (IsGpuType(dstMemType))
{
if (srcIndex == dstIndex) return "LOCAL";
else
{
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(srcIndex, MEM_GPU),
RemappedIndex(dstIndex, MEM_GPU),
&linkType, &hopCount));
return GetLinkTypeDesc(linkType, hopCount);
}
}
}
error:
printf("[ERROR] Unrecognized memory type\n");
exit(1);
}
std::string GetTransferDesc(Transfer const& transfer)
{
return GetDesc(transfer.srcMemType, transfer.srcIndex, transfer.exeMemType, transfer.exeIndex) + "-"
+ GetDesc(transfer.exeMemType, transfer.exeIndex, transfer.dstMemType, transfer.dstIndex);
}
void RunTransfer(EnvVars const& ev, int const iteration, void RunTransfer(EnvVars const& ev, int const iteration,
ExecutorInfo& exeInfo, int const transferIdx) ExecutorInfo& exeInfo, int const transferIdx)
{ {
Transfer* transfer = exeInfo.transfers[transferIdx]; Transfer* transfer = exeInfo.transfers[transferIdx];
// GPU execution agent if (transfer->exeType == EXE_GPU_GFX)
if (transfer->exeMemType == MEM_GPU)
{ {
// Switch to executing GPU // Switch to executing GPU
int const exeIndex = RemappedIndex(transfer->exeIndex, MEM_GPU); int const exeIndex = RemappedIndex(transfer->exeIndex, false);
HIP_CALL(hipSetDevice(exeIndex)); HIP_CALL(hipSetDevice(exeIndex));
hipStream_t& stream = exeInfo.streams[transferIdx]; hipStream_t& stream = exeInfo.streams[transferIdx];
hipEvent_t& startEvent = exeInfo.startEvents[transferIdx]; hipEvent_t& startEvent = exeInfo.startEvents[transferIdx];
hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx]; hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx];
int const initOffset = ev.byteOffset / sizeof(float); // Figure out how many threadblocks to use.
// In single stream mode, all the threadblocks for this GPU are launched
if (ev.useHipCall) // Otherwise, just launch the threadblocks associated with this single Transfer
{ int const numBlocksToRun = ev.useSingleStream ? exeInfo.totalSubExecs : transfer->numSubExecs;
// Record start event hipExtLaunchKernelGGL(GpuKernelTable[ev.gpuKernel],
HIP_CALL(hipEventRecord(startEvent, stream)); dim3(numBlocksToRun, 1, 1),
dim3(BLOCKSIZE, 1, 1),
// Execute hipMemset / hipMemcpy ev.sharedMemBytes, stream,
if (ev.useMemset) startEvent, stopEvent,
HIP_CALL(hipMemsetAsync(transfer->dstMem + initOffset, 42, transfer->numBytesToCopy, stream)); 0, transfer->subExecParamGpuPtr);
else
HIP_CALL(hipMemcpyAsync(transfer->dstMem + initOffset,
transfer->srcMem + initOffset,
transfer->numBytesToCopy, hipMemcpyDefault,
stream));
// Record stop event
HIP_CALL(hipEventRecord(stopEvent, stream));
}
else
{
int const numBlocksToRun = ev.useSingleStream ? exeInfo.totalBlocks : transfer->numBlocksToUse;
hipExtLaunchKernelGGL(ev.useMemset ? GpuMemsetKernel : GpuCopyKernel,
dim3(numBlocksToRun, 1, 1),
dim3(BLOCKSIZE, 1, 1),
ev.sharedMemBytes, stream,
startEvent, stopEvent,
0, transfer->blockParamGpuPtr);
}
// Synchronize per iteration, unless in single sync mode, in which case // Synchronize per iteration, unless in single sync mode, in which case
// synchronize during last warmup / last actual iteration // synchronize during last warmup / last actual iteration
...@@ -1136,14 +1063,15 @@ void RunTransfer(EnvVars const& ev, int const iteration, ...@@ -1136,14 +1063,15 @@ void RunTransfer(EnvVars const& ev, int const iteration,
if (ev.useSingleStream) if (ev.useSingleStream)
{ {
// Figure out individual timings for Transfers that were all launched together
for (Transfer* currTransfer : exeInfo.transfers) for (Transfer* currTransfer : exeInfo.transfers)
{ {
long long minStartCycle = currTransfer->blockParamGpuPtr[0].startCycle; long long minStartCycle = currTransfer->subExecParamGpuPtr[0].startCycle;
long long maxStopCycle = currTransfer->blockParamGpuPtr[0].stopCycle; long long maxStopCycle = currTransfer->subExecParamGpuPtr[0].stopCycle;
for (int i = 1; i < currTransfer->numBlocksToUse; i++) for (int i = 1; i < currTransfer->numSubExecs; i++)
{ {
minStartCycle = std::min(minStartCycle, currTransfer->blockParamGpuPtr[i].startCycle); minStartCycle = std::min(minStartCycle, currTransfer->subExecParamGpuPtr[i].startCycle);
maxStopCycle = std::max(maxStopCycle, currTransfer->blockParamGpuPtr[i].stopCycle); maxStopCycle = std::max(maxStopCycle, currTransfer->subExecParamGpuPtr[i].stopCycle);
} }
int const wallClockRate = GetWallClockRate(exeIndex); int const wallClockRate = GetWallClockRate(exeIndex);
double iterationTimeMs = (maxStopCycle - minStartCycle) / (double)(wallClockRate); double iterationTimeMs = (maxStopCycle - minStartCycle) / (double)(wallClockRate);
...@@ -1157,10 +1085,43 @@ void RunTransfer(EnvVars const& ev, int const iteration, ...@@ -1157,10 +1085,43 @@ void RunTransfer(EnvVars const& ev, int const iteration,
} }
} }
} }
else if (transfer->exeMemType == MEM_CPU) // CPU execution agent else if (transfer->exeType == EXE_GPU_DMA)
{
// Switch to executing GPU
int const exeIndex = RemappedIndex(transfer->exeIndex, false);
HIP_CALL(hipSetDevice(exeIndex));
hipStream_t& stream = exeInfo.streams[transferIdx];
hipEvent_t& startEvent = exeInfo.startEvents[transferIdx];
hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx];
HIP_CALL(hipEventRecord(startEvent, stream));
if (transfer->numSrcs == 0 && transfer->numDsts == 1)
{
HIP_CALL(hipMemsetAsync(transfer->dstMem[0],
MEMSET_CHAR, transfer->numBytesActual, stream));
}
else if (transfer->numSrcs == 1 && transfer->numDsts == 1)
{
HIP_CALL(hipMemcpyAsync(transfer->dstMem[0], transfer->srcMem[0],
transfer->numBytesActual, hipMemcpyDefault,
stream));
}
HIP_CALL(hipEventRecord(stopEvent, stream));
HIP_CALL(hipStreamSynchronize(stream));
if (iteration >= 0)
{
// Record GPU timing
float gpuDeltaMsec;
HIP_CALL(hipEventElapsedTime(&gpuDeltaMsec, startEvent, stopEvent));
transfer->transferTime += gpuDeltaMsec;
}
}
else if (transfer->exeType == EXE_CPU) // CPU execution agent
{ {
// Force this thread and all child threads onto correct NUMA node // Force this thread and all child threads onto correct NUMA node
int const exeIndex = RemappedIndex(transfer->exeIndex, MEM_CPU); int const exeIndex = RemappedIndex(transfer->exeIndex, true);
if (numa_run_on_node(exeIndex)) if (numa_run_on_node(exeIndex))
{ {
printf("[ERROR] Unable to set CPU to NUMA node %d\n", exeIndex); printf("[ERROR] Unable to set CPU to NUMA node %d\n", exeIndex);
...@@ -1171,12 +1132,12 @@ void RunTransfer(EnvVars const& ev, int const iteration, ...@@ -1171,12 +1132,12 @@ void RunTransfer(EnvVars const& ev, int const iteration,
auto cpuStart = std::chrono::high_resolution_clock::now(); auto cpuStart = std::chrono::high_resolution_clock::now();
// Launch child-threads to perform memcopies // Launch each subExecutor in child-threads to perform memcopies
for (int i = 0; i < ev.numCpuPerTransfer; i++) for (int i = 0; i < transfer->numSubExecs; ++i)
childThreads.push_back(std::thread(ev.useMemset ? CpuMemsetKernel : CpuCopyKernel, std::ref(transfer->blockParam[i]))); childThreads.push_back(std::thread(CpuReduceKernel, std::ref(transfer->subExecParam[i])));
// Wait for child-threads to finish // Wait for child-threads to finish
for (int i = 0; i < ev.numCpuPerTransfer; i++) for (int i = 0; i < transfer->numSubExecs; ++i)
childThreads[i].join(); childThreads[i].join();
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart; auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
...@@ -1187,11 +1148,13 @@ void RunTransfer(EnvVars const& ev, int const iteration, ...@@ -1187,11 +1148,13 @@ void RunTransfer(EnvVars const& ev, int const iteration,
} }
} }
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, int readMode, int skipCpu) void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N)
{ {
ev.DisplayP2PBenchmarkEnvVars();
// Collect the number of available CPUs/GPUs on this machine // Collect the number of available CPUs/GPUs on this machine
int const numGpus = ev.numGpuDevices; int const numCpus = ev.numCpuDevices;
int const numCpus = ev.numCpuDevices; int const numGpus = ev.numGpuDevices;
int const numDevices = numCpus + numGpus; int const numDevices = numCpus + numGpus;
// Enable peer to peer for each GPU // Enable peer to peer for each GPU
...@@ -1199,52 +1162,38 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in ...@@ -1199,52 +1162,38 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
for (int j = 0; j < numGpus; j++) for (int j = 0; j < numGpus; j++)
if (i != j) EnablePeerAccess(i, j); if (i != j) EnablePeerAccess(i, j);
if (!ev.outputToCsv)
{
printf("Performing copies in each direction of %lu bytes\n", N * sizeof(float));
printf("Using %d threads per NUMA node for CPU copies\n", ev.numCpuPerTransfer);
printf("Using %d CUs per transfer\n", numBlocksToUse);
}
else
{
printf("SRC,DST,Direction,ReadMode,BW(GB/s),Bytes\n");
}
// Perform unidirectional / bidirectional // Perform unidirectional / bidirectional
for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++) for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++)
{ {
// Print header // Print header
if (!ev.outputToCsv) if (!ev.outputToCsv)
{ {
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write]\n", isBidirectional ? "Bi" : "Uni", printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write] (GPU-Executor: %s)\n", isBidirectional ? "Bi" : "Uni",
readMode == 0 ? "Local" : "Remote", ev.useRemoteRead ? "Remote" : "Local",
readMode == 0 ? "Remote" : "Local"); ev.useRemoteRead ? "Local" : "Remote",
printf("%10s", "D/D"); ev.useDmaCopy ? "DMA" : "GFX");
if (!skipCpu)
{ printf("%10s", "SRC\\DST");
for (int i = 0; i < numCpus; i++) for (int i = 0; i < numCpus; i++) printf("%7s %02d", "CPU", i);
printf("%7s %02d", "CPU", i); for (int i = 0; i < numGpus; i++) printf("%7s %02d", "GPU", i);
}
for (int i = 0; i < numGpus; i++)
printf("%7s %02d", "GPU", i);
printf("\n"); printf("\n");
} }
// Loop over all possible src/dst pairs // Loop over all possible src/dst pairs
for (int src = 0; src < numDevices; src++) for (int src = 0; src < numDevices; src++)
{ {
MemType const& srcMemType = (src < numCpus ? MEM_CPU : MEM_GPU); MemType const srcType = (src < numCpus ? MEM_CPU : MEM_GPU);
if (skipCpu && srcMemType == MEM_CPU) continue; int const srcIndex = (srcType == MEM_CPU ? src : src - numCpus);
int srcIndex = (srcMemType == MEM_CPU ? src : src - numCpus);
if (!ev.outputToCsv) if (!ev.outputToCsv)
printf("%7s %02d", (srcMemType == MEM_CPU) ? "CPU" : "GPU", srcIndex); printf("%7s %02d", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex);
for (int dst = 0; dst < numDevices; dst++) for (int dst = 0; dst < numDevices; dst++)
{ {
MemType const& dstMemType = (dst < numCpus ? MEM_CPU : MEM_GPU); MemType const dstType = (dst < numCpus ? MEM_CPU : MEM_GPU);
if (skipCpu && dstMemType == MEM_CPU) continue; int const dstIndex = (dstType == MEM_CPU ? dst : dst - numCpus);
int dstIndex = (dstMemType == MEM_CPU ? dst : dst - numCpus);
double bandwidth = GetPeakBandwidth(ev, N, isBidirectional, readMode, numBlocksToUse, double bandwidth = GetPeakBandwidth(ev, N, isBidirectional, srcType, srcIndex, dstType, dstIndex);
srcMemType, srcIndex, dstMemType, dstIndex);
if (!ev.outputToCsv) if (!ev.outputToCsv)
{ {
if (bandwidth == 0) if (bandwidth == 0)
...@@ -1254,13 +1203,12 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in ...@@ -1254,13 +1203,12 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
} }
else else
{ {
printf("%s %02d,%s %02d,%s,%s,%.2f,%lu\n", printf("%s %02d,%s %02d,%s,%s,%s,%.2f,%lu\n",
srcMemType == MEM_CPU ? "CPU" : "GPU", srcType == MEM_CPU ? "CPU" : "GPU", srcIndex,
srcIndex, dstType == MEM_CPU ? "CPU" : "GPU", dstIndex,
dstMemType == MEM_CPU ? "CPU" : "GPU",
dstIndex,
isBidirectional ? "bidirectional" : "unidirectional", isBidirectional ? "bidirectional" : "unidirectional",
readMode == 0 ? "Local" : "Remote", ev.useRemoteRead ? "Remote" : "Local",
ev.useDmaCopy ? "DMA" : "GFX",
bandwidth, bandwidth,
N * sizeof(float)); N * sizeof(float));
} }
...@@ -1272,42 +1220,50 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in ...@@ -1272,42 +1220,50 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
} }
} }
double GetPeakBandwidth(EnvVars const& ev, double GetPeakBandwidth(EnvVars const& ev, size_t const N,
size_t const N,
int const isBidirectional, int const isBidirectional,
int const readMode, MemType const srcType, int const srcIndex,
int const numBlocksToUse, MemType const dstType, int const dstIndex)
MemType const srcMemType,
int const srcIndex,
MemType const dstMemType,
int const dstIndex)
{ {
// Skip bidirectional on same device // Skip bidirectional on same device
if (isBidirectional && srcMemType == dstMemType && srcIndex == dstIndex) return 0.0f; if (isBidirectional && srcType == dstType && srcIndex == dstIndex) return 0.0f;
int const initOffset = ev.byteOffset / sizeof(float); int const initOffset = ev.byteOffset / sizeof(float);
// Prepare Transfers // Prepare Transfers
std::vector<Transfer> transfers(2); std::vector<Transfer> transfers(2);
transfers[0].srcMemType = transfers[1].dstMemType = srcMemType; transfers[0].numBytes = transfers[1].numBytes = N * sizeof(float);
transfers[0].dstMemType = transfers[1].srcMemType = dstMemType;
transfers[0].srcIndex = transfers[1].dstIndex = srcIndex; // SRC -> DST
transfers[0].dstIndex = transfers[1].srcIndex = dstIndex; transfers[0].numSrcs = transfers[0].numDsts = 1;
transfers[0].numBytes = transfers[1].numBytes = N * sizeof(float); transfers[0].srcType.push_back(srcType);
transfers[0].numBlocksToUse = transfers[1].numBlocksToUse = numBlocksToUse; transfers[0].dstType.push_back(dstType);
transfers[0].srcIndex.push_back(srcIndex);
transfers[0].dstIndex.push_back(dstIndex);
// DST -> SRC
transfers[1].numSrcs = transfers[1].numDsts = 1;
transfers[1].srcType.push_back(dstType);
transfers[1].dstType.push_back(srcType);
transfers[1].srcIndex.push_back(dstIndex);
transfers[1].dstIndex.push_back(srcIndex);
// Either perform (local read + remote write), or (remote read + local write) // Either perform (local read + remote write), or (remote read + local write)
transfers[0].exeMemType = (readMode == 0 ? srcMemType : dstMemType); ExeType gpuExeType = ev.useDmaCopy ? EXE_GPU_DMA : EXE_GPU_GFX;
transfers[1].exeMemType = (readMode == 0 ? dstMemType : srcMemType); transfers[0].exeType = IsGpuType(ev.useRemoteRead ? dstType : srcType) ? gpuExeType : EXE_CPU;
transfers[0].exeIndex = (readMode == 0 ? srcIndex : dstIndex); transfers[1].exeType = IsGpuType(ev.useRemoteRead ? srcType : dstType) ? gpuExeType : EXE_CPU;
transfers[1].exeIndex = (readMode == 0 ? dstIndex : srcIndex); transfers[0].exeIndex = (ev.useRemoteRead ? dstIndex : srcIndex);
transfers[1].exeIndex = (ev.useRemoteRead ? srcIndex : dstIndex);
transfers[0].numSubExecs = IsGpuType(transfers[0].exeType) ? ev.numGpuSubExecs : ev.numCpuSubExecs;
transfers[1].numSubExecs = IsGpuType(transfers[0].exeType) ? ev.numGpuSubExecs : ev.numCpuSubExecs;
// Remove (DST->SRC) if not bidirectional
transfers.resize(isBidirectional + 1); transfers.resize(isBidirectional + 1);
// Abort if executing on NUMA node with no CPUs // Abort if executing on NUMA node with no CPUs
for (int i = 0; i <= isBidirectional; i++) for (int i = 0; i <= isBidirectional; i++)
{ {
if (transfers[i].exeMemType == MEM_CPU && ev.numCpusPerNuma[transfers[i].exeIndex] == 0) if (transfers[i].exeType == EXE_CPU && ev.numCpusPerNuma[transfers[i].exeIndex] == 0)
return 0; return 0;
} }
...@@ -1318,45 +1274,176 @@ double GetPeakBandwidth(EnvVars const& ev, ...@@ -1318,45 +1274,176 @@ double GetPeakBandwidth(EnvVars const& ev,
for (int i = 0; i <= isBidirectional; i++) for (int i = 0; i <= isBidirectional; i++)
{ {
double transferDurationMsec = transfers[i].transferTime / (1.0 * ev.numIterations); double transferDurationMsec = transfers[i].transferTime / (1.0 * ev.numIterations);
double transferBandwidthGbs = (transfers[i].numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f; double transferBandwidthGbs = (transfers[i].numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
totalBandwidth += transferBandwidthGbs; totalBandwidth += transferBandwidthGbs;
} }
return totalBandwidth; return totalBandwidth;
} }
void Transfer::PrepareBlockParams(EnvVars const& ev, size_t const N) void Transfer::PrepareSubExecParams(EnvVars const& ev)
{ {
int const initOffset = ev.byteOffset / sizeof(float); // Each subExecutor needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each subExecutor works on for this Transfer
// - Partition N as evenly as possible, but try to keep subarray sizes as multiples of BLOCK_BYTES bytes,
// except the very last one, for alignment reasons
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
int const targetMultiple = ev.blockBytes / sizeof(float);
// Initialize source memory with patterned data // In some cases, there may not be enough data for all subExectors
CheckOrFill(MODE_FILL, N, ev.useMemset, ev.useHipCall, ev.fillPattern, this->srcMem + initOffset); int const maxSubExecToUse = std::min((int)(N + targetMultiple - 1) / targetMultiple, this->numSubExecs);
this->subExecParam.clear();
this->subExecParam.resize(this->numSubExecs);
// Each block needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each block does for this Transfer
// - Partition N as evenly as possible, but try to keep blocks as multiples of BLOCK_BYTES bytes,
// except the very last one, for alignment reasons
int const targetMultiple = ev.blockBytes / sizeof(float);
int const maxNumBlocksToUse = std::min((N + targetMultiple - 1) / targetMultiple, this->blockParam.size());
size_t assigned = 0; size_t assigned = 0;
for (int j = 0; j < this->blockParam.size(); j++) for (int i = 0; i < this->numSubExecs; ++i)
{ {
int const blocksLeft = std::max(0, maxNumBlocksToUse - j); int const subExecLeft = std::max(0, maxSubExecToUse - i);
size_t const leftover = N - assigned; size_t const leftover = N - assigned;
size_t const roundedN = (leftover + targetMultiple - 1) / targetMultiple; size_t const roundedN = (leftover + targetMultiple - 1) / targetMultiple;
SubExecParam& p = this->subExecParam[i];
p.N = subExecLeft ? std::min(leftover, ((roundedN / subExecLeft) * targetMultiple)) : 0;
p.numSrcs = this->numSrcs;
p.numDsts = this->numDsts;
for (int iSrc = 0; iSrc < this->numSrcs; ++iSrc)
p.src[iSrc] = this->srcMem[iSrc] + assigned + initOffset;
for (int iDst = 0; iDst < this->numDsts; ++iDst)
p.dst[iDst] = this->dstMem[iDst] + assigned + initOffset;
if (ev.enableDebug)
{
printf("Transfer %02d SE:%02d: %10lu floats: %10lu to %10lu\n",
this->transferIndex, i, p.N, assigned, assigned + p.N);
}
BlockParam& param = this->blockParam[j]; p.startCycle = 0;
param.N = blocksLeft ? std::min(leftover, ((roundedN / blocksLeft) * targetMultiple)) : 0; p.stopCycle = 0;
param.src = this->srcMem + assigned + initOffset; assigned += p.N;
param.dst = this->dstMem + assigned + initOffset;
param.startCycle = 0;
param.stopCycle = 0;
assigned += param.N;
} }
this->transferTime = 0.0; this->transferTime = 0.0;
} }
void Transfer::PrepareReference(EnvVars const& ev, std::vector<float>& buffer, int bufferIdx)
{
size_t N = buffer.size();
if (bufferIdx >= 0)
{
size_t patternLen = ev.fillPattern.size();
if (patternLen > 0)
{
for (size_t i = 0; i < N; ++i)
buffer[i] = ev.fillPattern[i % patternLen];
}
else
{
for (size_t i = 0; i < N; ++i)
buffer[i] = (i % 383 + 31) * (bufferIdx + 1);
}
}
else // Destination buffer
{
if (this->numSrcs == 0)
{
// Note: 0x75757575 = 13323083.0
memset(buffer.data(), MEMSET_CHAR, N * sizeof(float));
}
else
{
PrepareReference(ev, buffer, 0);
if (this->numSrcs > 1)
{
std::vector<float> temp(N);
for (int srcIdx = 1; srcIdx < this->numSrcs; ++srcIdx)
{
PrepareReference(ev, temp, srcIdx);
for (int i = 0; i < N; ++i)
{
buffer[i] += temp[i];
}
}
}
}
}
}
void Transfer::PrepareSrc(EnvVars const& ev)
{
if (this->numSrcs == 0) return;
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
std::vector<float> reference(N);
for (int srcIdx = 0; srcIdx < this->numSrcs; ++srcIdx)
{
//PrepareReference(ev, reference, srcIdx);
PrepareReference(ev, reference, srcIdx);
HIP_CALL(hipMemcpy(this->srcMem[srcIdx] + initOffset, reference.data(), this->numBytesActual, hipMemcpyDefault));
}
}
void Transfer::ValidateDst(EnvVars const& ev)
{
if (this->numDsts == 0) return;
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
std::vector<float> reference(N);
PrepareReference(ev, reference, -1);
std::vector<float> hostBuffer(N);
for (int dstIdx = 0; dstIdx < this->numDsts; ++dstIdx)
{
float* output;
if (IsCpuType(this->dstType[dstIdx]))
{
output = this->dstMem[dstIdx] + initOffset;
}
else
{
HIP_CALL(hipMemcpy(hostBuffer.data(), this->dstMem[dstIdx] + initOffset, this->numBytesActual, hipMemcpyDefault));
output = hostBuffer.data();
}
for (size_t i = 0; i < N; ++i)
{
if (reference[i] != output[i])
{
printf("\n[ERROR] Destination array %d value at index %lu (%.3f) does not match expected value (%.3f)\n",
dstIdx, i, output[i], reference[i]);
printf("[ERROR] Failed Transfer details: #%d: %s -> [%c%d:%d] -> %s\n",
this->transferIndex,
this->SrcToStr().c_str(),
ExeTypeStr[this->exeType], this->exeIndex,
this->numSubExecs,
this->DstToStr().c_str());
exit(1);
}
}
}
}
std::string Transfer::SrcToStr() const
{
if (numSrcs == 0) return "N";
std::stringstream ss;
for (int i = 0; i < numSrcs; ++i)
ss << MemTypeStr[srcType[i]] << srcIndex[i];
return ss.str();
}
std::string Transfer::DstToStr() const
{
if (numDsts == 0) return "N";
std::stringstream ss;
for (int i = 0; i < numDsts; ++i)
ss << MemTypeStr[dstType[i]] << dstIndex[i];
return ss.str();
}
// NOTE: This is a stop-gap solution until HIP provides wallclock values // NOTE: This is a stop-gap solution until HIP provides wallclock values
int GetWallClockRate(int deviceId) int GetWallClockRate(int deviceId)
{ {
...@@ -1385,27 +1472,27 @@ int GetWallClockRate(int deviceId) ...@@ -1385,27 +1472,27 @@ int GetWallClockRate(int deviceId)
return wallClockPerDeviceMhz[deviceId]; return wallClockPerDeviceMhz[deviceId];
} }
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numBlocksToUse, bool const isRandom) void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numGpuSubExecs, int const numCpuSubExecs, bool const isRandom)
{ {
ev.DisplaySweepEnvVars(); ev.DisplaySweepEnvVars();
// Compute how many possible Transfers are permitted (unique SRC/EXE/DST triplets) // Compute how many possible Transfers are permitted (unique SRC/EXE/DST triplets)
std::vector<std::pair<MemType, int>> exeList; std::vector<std::pair<ExeType, int>> exeList;
for (auto exe : ev.sweepExe) for (auto exe : ev.sweepExe)
{ {
MemType const exeMemType = CharToMemType(exe); ExeType const exeType = CharToExeType(exe);
if (IsGpuType(exeMemType)) if (IsGpuType(exeType))
{ {
for (int exeIndex = 0; exeIndex < ev.numGpuDevices; ++exeIndex) for (int exeIndex = 0; exeIndex < ev.numGpuDevices; ++exeIndex)
exeList.push_back(std::make_pair(exeMemType, exeIndex)); exeList.push_back(std::make_pair(exeType, exeIndex));
} }
else else if (IsCpuType(exeType))
{ {
for (int exeIndex = 0; exeIndex < ev.numCpuDevices; ++exeIndex) for (int exeIndex = 0; exeIndex < ev.numCpuDevices; ++exeIndex)
{ {
// Skip NUMA nodes that have no CPUs (e.g. CXL) // Skip NUMA nodes that have no CPUs (e.g. CXL)
if (ev.numCpusPerNuma[exeIndex] == 0) continue; if (ev.numCpusPerNuma[exeIndex] == 0) continue;
exeList.push_back(std::make_pair(exeMemType, exeIndex)); exeList.push_back(std::make_pair(exeType, exeIndex));
} }
} }
} }
...@@ -1414,11 +1501,11 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con ...@@ -1414,11 +1501,11 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
std::vector<std::pair<MemType, int>> srcList; std::vector<std::pair<MemType, int>> srcList;
for (auto src : ev.sweepSrc) for (auto src : ev.sweepSrc)
{ {
MemType const srcMemType = CharToMemType(src); MemType const srcType = CharToMemType(src);
int const numDevices = IsGpuType(srcMemType) ? ev.numGpuDevices : ev.numCpuDevices; int const numDevices = IsGpuType(srcType) ? ev.numGpuDevices : ev.numCpuDevices;
for (int srcIndex = 0; srcIndex < numDevices; ++srcIndex) for (int srcIndex = 0; srcIndex < numDevices; ++srcIndex)
srcList.push_back(std::make_pair(srcMemType, srcIndex)); srcList.push_back(std::make_pair(srcType, srcIndex));
} }
int numSrcs = srcList.size(); int numSrcs = srcList.size();
...@@ -1426,20 +1513,20 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con ...@@ -1426,20 +1513,20 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
std::vector<std::pair<MemType, int>> dstList; std::vector<std::pair<MemType, int>> dstList;
for (auto dst : ev.sweepDst) for (auto dst : ev.sweepDst)
{ {
MemType const dstMemType = CharToMemType(dst); MemType const dstType = CharToMemType(dst);
int const numDevices = IsGpuType(dstMemType) ? ev.numGpuDevices : ev.numCpuDevices; int const numDevices = IsGpuType(dstType) ? ev.numGpuDevices : ev.numCpuDevices;
for (int dstIndex = 0; dstIndex < numDevices; ++dstIndex) for (int dstIndex = 0; dstIndex < numDevices; ++dstIndex)
dstList.push_back(std::make_pair(dstMemType, dstIndex)); dstList.push_back(std::make_pair(dstType, dstIndex));
} }
int numDsts = dstList.size(); int numDsts = dstList.size();
// Build array of possibilities, respecting any additional restrictions (e.g. XGMI hop count) // Build array of possibilities, respecting any additional restrictions (e.g. XGMI hop count)
struct TransferInfo struct TransferInfo
{ {
MemType srcMemType; int srcIndex; MemType srcType; int srcIndex;
MemType exeMemType; int exeIndex; ExeType exeType; int exeIndex;
MemType dstMemType; int dstIndex; MemType dstType; int dstIndex;
}; };
// If either XGMI minimum is non-zero, or XGMI maximum is specified and non-zero then both links must be XGMI // If either XGMI minimum is non-zero, or XGMI maximum is specified and non-zero then both links must be XGMI
...@@ -1451,10 +1538,10 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con ...@@ -1451,10 +1538,10 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
  {
    // Skip CPU executors if XGMI link must be used
    if (useXgmiOnly && !IsGpuType(exeList[i].first)) continue;

    tinfo.exeType  = exeList[i].first;
    tinfo.exeIndex = exeList[i].second;

    bool isXgmiSrc = false;
    int numHopsSrc = 0;
    for (int j = 0; j < numSrcs; ++j)
    {
@@ -1463,8 +1550,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
      if (exeList[i].second != srcList[j].second)
      {
        uint32_t exeToSrcLinkType, exeToSrcHopCount;
        HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, false),
                                              RemappedIndex(srcList[j].second, false),
                                              &exeToSrcLinkType,
                                              &exeToSrcHopCount));
        isXgmiSrc = (exeToSrcLinkType == HSA_AMD_LINK_INFO_TYPE_XGMI);
@@ -1484,8 +1571,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
      }
      else if (useXgmiOnly) continue;

      tinfo.srcType  = srcList[j].first;
      tinfo.srcIndex = srcList[j].second;

      bool isXgmiDst = false;
      int numHopsDst = 0;
@@ -1496,8 +1583,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
      if (exeList[i].second != dstList[k].second)
      {
        uint32_t exeToDstLinkType, exeToDstHopCount;
        HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, false),
                                              RemappedIndex(dstList[k].second, false),
                                              &exeToDstLinkType,
                                              &exeToDstHopCount));
        isXgmiDst = (exeToDstLinkType == HSA_AMD_LINK_INFO_TYPE_XGMI);
@@ -1519,8 +1606,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
        // Skip this DST if total XGMI distance (SRC + DST) is greater than max limit
        if (ev.sweepXgmiMax >= 0 && (numHopsSrc + numHopsDst) > ev.sweepXgmiMax) continue;

        tinfo.dstType  = dstList[k].first;
        tinfo.dstIndex = dstList[k].second;

        possibleTransfers.push_back(tinfo);
      }
@@ -1580,13 +1667,15 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
  {
    // Convert integer value to (SRC->EXE->DST) triplet
    Transfer transfer;
    transfer.numSrcs  = 1;
    transfer.numDsts  = 1;
    transfer.srcType  = {possibleTransfers[value].srcType};
    transfer.srcIndex = {possibleTransfers[value].srcIndex};
    transfer.exeType  = possibleTransfers[value].exeType;
    transfer.exeIndex = possibleTransfers[value].exeIndex;
    transfer.dstType  = {possibleTransfers[value].dstType};
    transfer.dstIndex = {possibleTransfers[value].dstIndex};
    transfer.numSubExecs = IsGpuType(transfer.exeType) ? numGpuSubExecs : numCpuSubExecs;
    transfer.transferIndex = transfers.size();
    transfer.numBytes = ev.sweepRandBytes ? randSize(*ev.generator) * sizeof(float) : 0;
    transfers.push_back(transfer);
@@ -1636,12 +1725,23 @@ void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& tran
  for (auto const& transfer : transfers)
  {
    fprintf(fp, " (%c%d->%c%d->%c%d %d %lu)",
            MemTypeStr[transfer.srcType[0]], transfer.srcIndex[0],
            ExeTypeStr[transfer.exeType],    transfer.exeIndex,
            MemTypeStr[transfer.dstType[0]], transfer.dstIndex[0],
            transfer.numSubExecs,
            transfer.numBytes);
  }
  fprintf(fp, "\n");
  fflush(fp);
}
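
// For reference, a single Transfer printed with the format above might look like
// " (G0->G0->G1 4 67108864)": source G0, GFX executor on GPU 0, destination G1,
// 4 SubExecutors, 64MB (the values here are illustrative).

// Returns a space-separated list of the given pointers, each advanced by
// initOffset floats and streamed as an address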
std::string PtrVectorToStr(std::vector<float*> const& ptrVector, int const initOffset)
{
  std::stringstream ss;
  for (size_t i = 0; i < ptrVector.size(); ++i)
  {
    if (i) ss << " ";
    ss << (ptrVector[i] + initOffset);
  }
  return ss.str();
}
/*
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -35,20 +35,20 @@ THE SOFTWARE.
#include <hip/hip_ext.h>
#include <hsa/hsa_ext_amd.h>

// Helper macro for catching HIP errors
#define HIP_CALL(cmd)                                                     \
  do {                                                                    \
    hipError_t error = (cmd);                                             \
    if (error != hipSuccess)                                              \
    {                                                                     \
      std::cerr << "Encountered HIP error (" << hipGetErrorString(error)  \
                << ") at line " << __LINE__ << " in file " << __FILE__ << "\n"; \
      exit(-1);                                                           \
    }                                                                     \
  } while (0)
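
// Usage (illustrative): wrap any HIP runtime call that returns hipError_t,
// e.g. HIP_CALL(hipSetDevice(0));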
#include "EnvVars.hpp"
// Simple configuration parameters // Simple configuration parameters
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<26); // Amount of data transferred per Transfer size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<26); // Amount of data transferred per Transfer
...@@ -59,92 +59,92 @@ typedef enum ...@@ -59,92 +59,92 @@ typedef enum
  MEM_GPU          = 1,  // Coarse-grained global GPU memory
  MEM_CPU_FINE     = 2,  // Fine-grained pinned CPU memory
  MEM_GPU_FINE     = 3,  // Fine-grained global GPU memory
  MEM_CPU_UNPINNED = 4,  // Unpinned CPU memory
  MEM_NULL         = 5,  // NULL memory - empty placeholder enabling read-only or write-only Transfers
} MemType;
typedef enum
{
  EXE_CPU     = 0,  // CPU executor              (subExecutor = CPU thread)
  EXE_GPU_GFX = 1,  // GPU kernel-based executor (subExecutor = threadblock/CU)
  EXE_GPU_DMA = 2,  // GPU SDMA-based executor   (subExecutor = stream)
} ExeType;

bool IsGpuType(MemType m) { return (m == MEM_GPU || m == MEM_GPU_FINE); }
bool IsCpuType(MemType m) { return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED); }
bool IsGpuType(ExeType e) { return (e == EXE_GPU_GFX || e == EXE_GPU_DMA); }
bool IsCpuType(ExeType e) { return (e == EXE_CPU); }

char const MemTypeStr[7]     = "CGBFUN";
char const ExeTypeStr[4]     = "CGD";
char const ExeTypeName[3][4] = {"CPU", "GPU", "DMA"};
MemType inline CharToMemType(char const c)
{
  // strchr returns nullptr when the character is not found, so test the pointer itself
  char const* val = strchr(MemTypeStr, toupper(c));
  if (val) return (MemType)(val - MemTypeStr);
  printf("[ERROR] Unexpected memory type (%c)\n", c);
  exit(1);
}

ExeType inline CharToExeType(char const c)
{
  char const* val = strchr(ExeTypeStr, toupper(c));
  if (val) return (ExeType)(val - ExeTypeStr);
  printf("[ERROR] Unexpected executor type (%c)\n", c);
  exit(1);
}
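
// For example (illustrative): CharToMemType('g') yields MEM_GPU (index 1 in "CGBFUN"),
// and CharToExeType('d') yields EXE_GPU_DMA (index 2 in "CGD"); lowercase input is
// accepted via toupper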

// Each Transfer performs reads from source memory location(s), sums them (if multiple
// sources are specified), then writes the sum to each of the specified destination
// memory location(s)
struct Transfer
{
  int transferIndex;                   // Transfer identifier (within a Test)
  ExeType exeType;                     // Transfer executor type
  int exeIndex;                        // Executor index (NUMA node for CPU / device ID for GPU)
  int numSubExecs;                     // Number of subExecutors to use for this Transfer
  size_t numBytes;                     // # of bytes requested to Transfer (may be 0 to fall back to default)
  size_t numBytesActual;               // Actual number of bytes to copy
  double transferTime;                 // Time taken in milliseconds

  int numSrcs;                         // Number of sources
  std::vector<MemType> srcType;        // Source memory types
  std::vector<int>     srcIndex;       // Source device indices
  std::vector<float*>  srcMem;         // Source memory

  int numDsts;                         // Number of destinations
  std::vector<MemType> dstType;        // Destination memory types
  std::vector<int>     dstIndex;       // Destination device indices
  std::vector<float*>  dstMem;         // Destination memory

  std::vector<SubExecParam> subExecParam;  // Defines subarrays assigned to each SubExecutor
  SubExecParam* subExecParamGpuPtr;        // Pointer to GPU copy of subExecParam

  // Prepares src/dst subarray pointers for each SubExecutor
  void PrepareSubExecParams(EnvVars const& ev);

  // Prepare source arrays with input data
  void PrepareSrc(EnvVars const& ev);

  // Validate that destination data contains expected results
  void ValidateDst(EnvVars const& ev);

  // Prepare reference buffers
  void PrepareReference(EnvVars const& ev, std::vector<float>& buffer, int bufferIdx);

  // String representation functions
  std::string SrcToStr() const;
  std::string DstToStr() const;
};
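
// As a minimal sketch (assuming equal-length float arrays; ReferenceTransfer is a
// hypothetical helper, not part of this header), the operation a Transfer performs
// is conceptually:
//
//   void ReferenceTransfer(Transfer const& t, size_t numFloats)
//   {
//     for (size_t i = 0; i < numFloats; ++i)
//     {
//       float sum = 0.0f;                                         // MEM_NULL sources contribute nothing
//       for (int s = 0; s < t.numSrcs; ++s) sum += t.srcMem[s][i];
//       for (int d = 0; d < t.numDsts; ++d) t.dstMem[d][i] = sum; // single SRC/DST degenerates to a copy
//     }
//   }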

struct ExecutorInfo
{
  std::vector<Transfer*> transfers;  // Transfers to execute
  size_t totalBytes;                 // Total bytes this executor transfers
  int totalSubExecs;                 // Total number of subExecutors to use

  // For GPU-Executors
  SubExecParam* subExecParamGpu;     // GPU copy of subExecutor parameters
  std::vector<hipStream_t> streams;
  std::vector<hipEvent_t>  startEvents;
  std::vector<hipEvent_t>  stopEvents;
@@ -153,6 +153,7 @@ struct ExecutorInfo
  double totalTime;
};

typedef std::pair<ExeType, int> Executor;
typedef std::map<Executor, ExecutorInfo> TransferMap;

// Display usage instructions
@@ -166,7 +167,9 @@ void PopulateTestSizes(size_t const numBytesPerTransfer, int const samplingFacto
                       std::vector<size_t>& valuesofN);

void ParseMemType(std::string const& token, int const numCpus, int const numGpus,
                  std::vector<MemType>& memType, std::vector<int>& memIndex);
void ParseExeType(std::string const& token, int const numCpus, int const numGpus,
                  ExeType& exeType, int& exeIndex);

void ParseTransfers(char* line, int numCpus, int numGpus,
                    std::vector<Transfer>& transfers);
@@ -178,26 +181,19 @@ void EnablePeerAccess(int const deviceId, int const peerDeviceId);
void AllocateMemory(MemType memType, int devIndex, size_t numBytes, void** memPtr);
void DeallocateMemory(MemType memType, void* memPtr, size_t const size = 0);
void CheckPages(char* byteArray, size_t numBytes, int targetId);
void RunTransfer(EnvVars const& ev, int const iteration, ExecutorInfo& exeInfo, int const transferIdx);
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N);
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numGpuSubExec, int const numCpuSubExec, bool const isRandom);

// Return the maximum bandwidth measured for given (src/dst) pair
double GetPeakBandwidth(EnvVars const& ev, size_t const N,
                        int const isBidirectional,
                        MemType const srcType, int const srcIndex,
                        MemType const dstType, int const dstIndex);

std::string GetLinkTypeDesc(uint32_t linkType, uint32_t hopCount);

int RemappedIndex(int const origIdx, bool const isCpuType);
int GetWallClockRate(int deviceId);
void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& transfers);
std::string PtrVectorToStr(std::vector<float*> const& ptrVector, int const initOffset);
# ConfigFile Format:
# ==================
# A Transfer is defined as a single operation where an Executor reads and adds together
# values from source (SRC) memory locations, then writes the sum to destination (DST)
# memory locations. This simplifies to a simple copy operation when dealing with a
# single SRC/DST.
#
#   SRC 0                 DST 0
#   SRC 1 -> Executor ->  DST 1
#   SRC X                 DST Y
#
# Three Executors are supported by TransferBench:
#   Executor:   SubExecutor:
#   1) CPU      CPU thread
#   2) GPU      GPU threadblock / Compute Unit (CU)
#   3) DMA      N/A (may only be used for copies, i.e. a single SRC/DST)
#
# Each single line in the configuration file defines a set of Transfers (a Test) to run in parallel
#
# There are two ways to specify a Test:
# 1) Basic
#    The basic specification assumes the same number of SubExecutors (SE) used per Transfer
#    A positive number of Transfers is specified, followed by that number of triplets describing each Transfer
#      #Transfers #SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# 2) Advanced
#    A negative number of Transfers is specified, followed by quintuplets describing each Transfer
#    A non-zero byte count overrides the command-line specified size for that Transfer
#      -#Transfers (srcMem1->Executor1->dstMem1 #SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #SEsL BytesL)
#
# Argument Details:
#   #Transfers: Number of Transfers to be run in parallel
#   #SEs      : Number of SubExecutors to use (CPU threads / GPU threadblocks)
#   srcMemL   : Source memory locations (where the data is to be read from)
#   Executor  : Executor is specified by a character indicating type, followed by device index (0-indexed)
#               - C: CPU-executed (indexed from 0 to # NUMA nodes - 1)
#               - G: GPU-executed (indexed from 0 to # GPUs - 1)
#               - D: DMA-executed (indexed from 0 to # GPUs - 1)
#   dstMemL   : Destination memory locations (where the data is to be written to)
#   bytesL    : Number of bytes to copy (0 means use command-line specified size)
#               Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
#
# Memory locations are specified by one or more (memory type / device index) pairs:
# a character indicating the memory type, followed by the device index (0-indexed)
# Supported memory locations are:
#   - C: Pinned host memory       (on NUMA node, indexed from 0 to [# NUMA nodes-1])
#   - U: Unpinned host memory     (on NUMA node, indexed from 0 to [# NUMA nodes-1])
#   - B: Fine-grain host memory   (on NUMA node, indexed from 0 to [# NUMA nodes-1])
#   - G: Global device memory     (on GPU device indexed from 0 to [# GPUs - 1])
#   - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
#   - N: Null memory (index ignored)
#
# Examples:
#   1 4 (G0->G0->G1)                   Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
#   1 4 (C1->G2->G0)                   Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
#   2 4 G0->G0->G1 G1->G1->G0          Copies from GPU0 to GPU1 and from GPU1 to GPU0, each with 4 SEs
#   -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
#
# Round brackets and arrows '->' may be included for human clarity, but are ignored and unnecessary
# Lines starting with # will be ignored. Lines starting with ## will be echoed to output
## Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
1 4 (G0->G0->G1)

## Single DMA-executed Transfer between GPUs 0 and 1
1 1 (G0->D0->G1)

## Copies 1MB from GPU0 to GPU1 with 4 CUs, and 2MB from GPU1 to GPU0 with 8 CUs
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)

## "Memset" by GPU 0 to GPU 0 memory
1 32 (N0->G0->G0)

## "Read-only" by CPU 0
1 4 (C0->C0->N0)

## Broadcast from GPU 0 to GPU 0 and GPU 1
1 16 (G0->G0->G0G1)
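
## Reduction (an illustrative use of the multi-location syntax above): sum the inputs
## from GPU 0 and GPU 1 using 8 CUs on GPU 0, writing the result to CPU 0
1 8 (G0G1->G0->C0)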