Unverified Commit bbd72a6c authored by gilbertlee-amd's avatar gilbertlee-amd Committed by GitHub

TransferBench v1.66 - Multi-Rank support (#224)

* Adding System singleton to support multi-node (communication and topology)
* Adding multi-node parsing, rank and device wildcard expansion
* Adding multi-node topology, and various support functions
* Adding multi-node consistency validation of Config and Transfers
* Introducing SINGLE_KERNEL=1 to Makefile to speed up compilation during development
* Updating CHANGELOG.  Overhauling wildcard parsing.  Adding dryrun
* Client refactoring.  Introduction of tabular formatted results and a2a multi-rank preset
* Adding MPI support into CMakeFiles
* Cleaning up multi-node topology using TableHelper
* Reducing compile time by removing some kernel variants
* Updating documentation.  Adding nicrings preset
* Adding NIC_FILTER to allow NIC device filtering via regex
* Updating supported memory types
* Fixing P2P preset, and adding some extra memIndex utility functions
parent 26717d50
......@@ -3,6 +3,79 @@
Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).
## v1.66.00
### Added
- Adding multi-node support
- TransferBench now supports multiple nodes through the use of MPI or sockets
- To use MPI, TransferBench must be compiled with MPI support (set MPI_PATH to the location of
an MPI installation). MPI support can be explicitly disabled by setting DISABLE_MPI_COMM=1
- TransferBench can be executed with an MPI launcher, such as mpirun
- To use sockets, the following environment variables must be provided to each process
* TB_RANK: Rank of this process (0-based)
* TB_NUM_RANKS: Total number of processes
* TB_MASTER_ADDR: IP address of rank 0 (Other ranks will connect to rank 0)
* TB_MASTER_PORT: Port for communication (default: 29500)
- Additional debug messages can be enabled by setting TB_VERBOSE=1
- NOTE: It is recommended that one process be launched per node to avoid aliasing of devices
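- To make the socket mode concrete, a hypothetical two-node launch could look like the following (the rank-0 address and choice of preset are placeholders):

```
# On node 0 (rank 0), reachable by the other rank at 10.0.0.1:
TB_RANK=0 TB_NUM_RANKS=2 TB_MASTER_ADDR=10.0.0.1 TB_MASTER_PORT=29500 ./TransferBench a2a

# On node 1 (rank 1):
TB_RANK=1 TB_NUM_RANKS=2 TB_MASTER_ADDR=10.0.0.1 TB_MASTER_PORT=29500 ./TransferBench a2a
```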
- Adding multi-node topology detection
- When running in multi-node mode, TransferBench will try to collect topology information from each
rank, then group ranks into homogeneous configurations.
- This is done by running TransferBench with no arguments (e.g. mpirun -np 2 ./TransferBench)
- Adding multi-node Transfer parsing and wildcard support
- Memory locations have now been extended to support a rank index
* R(memRank)?(memIndex) (where ? is one of the supported memory type characters "CPBDKHGFUN")
(e.g. R2G3 is GPU memory on GPU 3 of rank 2)
- Rank is optional; if not specified, it falls back to the local rank
- Executor locations have been extended to support rank indices as well
* R(exeRank)?(exeIndex){exeSlot}.{exeSubIndex}{exeSubSlot} (where ? is one of the supported executor-types characters "CGDIN")
- exeSlots are only relevant for the EXE_NIC_NEAREST executor, and allow distinguishing between multiple NICs that are equally close to a GPU
- exeSlots are defined by upper-case letters: 'A' for the closest NIC, 'B' for the 2nd closest NIC, etc.
- For example: N0B.4C would execute using the 2nd closest NIC to GPU 0, communicating with the 3rd closest NIC to GPU 4
- Wildcard support:
- To help quickly define sets of transfers, Transfers can now be specified using wildcards
- All of the fields above may be specified:
* Directly with a single value: e.g. R34 -> Rank 34
* Full wildcard: e.g. R* -> Will be replaced by all available ranks
* Ranged wildcard: e.g. R[1,5..7] -> Will be replaced by Rank 1, Rank 5, Rank 6, Rank 7
- Nearest-NIC wildcard
- To simplify nearest-NIC execution, it is not necessary to specify exeIndex/exeSubIndex for the "N" executor
- If exeRank/exeIndex/exeSlot/exeSubIndex/exeSubSlot are all not specified, the Transfer will be expanded to
choose the correct values such that a remote write operation will occur based on SRC/DST mem locations
- For example: (R2G4->N->R4G5) will expand to (R2G4->R2N4.5->R4G5)
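- Collecting the syntax above, a few illustrative Transfer specifications (the first two are the example from this section, the third is a hypothetical use of the ranged wildcard):

```
(R2G4->R2N4.5->R4G5)   # Explicit: rank 2 GPU 4 memory, via NICs, to rank 4 GPU 5 memory
(R2G4->N->R4G5)        # The same transfer written using the nearest-NIC wildcard
(R[1,5..7]G0->G0->C0)  # Ranged rank wildcard: expands over ranks 1, 5, 6 and 7
```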
- Adding dry-run preset
- This new preset is similar to cmdline, except that it only lists the Transfers that would be executed
- The dryrun preset may be useful when using the new wildcard expressions, to verify that the Test
contains the expected set of Transfers
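- For instance, a dry run over a wildcard expression might be invoked like this (the transfer string and size are illustrative):

```
mpirun -np 2 ./TransferBench dryrun 64M "(R*G0->G0->G1)"
```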
- Adding nicrings preset
- This new preset runs parallel transfers forming rings that connect identical NICs across ranks
- Adding NIC_FILTER to allow filtering which NICs are detected; NIC_FILTER accepts regular-expression syntax
- Added new memory types based on latest HIP memory allocation flags
Supported memory locations are:
- C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- P: Pinned host memory (on the NUMA node closest to the indexed GPU, indexed from 0 to [# GPUs - 1])
- B: Coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- K: Uncached pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- H: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
- F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
- U: Uncached device memory (on GPU device indexed from 0 to [# GPUs - 1])
- N: Null memory (index ignored)
- As a result, the a2a preset deprecates USE_FINE_GRAIN in favor of MEM_TYPE, which allows selecting between various GPU memory types
- A warning message is issued if USE_FINE_GRAIN is used; however, the previous matching functionality remains for now
- The p2p preset similarly deprecates USE_FINE_GRAIN in favor of CPU_MEM_TYPE and GPU_MEM_TYPE
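- As an illustrative configuration-file line using the new host-memory types, following the example.cfg format (the indices are arbitrary):

```
# 1 Transfer using 4 CUs on GPU 0:
# coherent pinned host memory (NUMA 0) -> GPU 0 -> uncached device memory (GPU 1)
1 4 (B0->G0->U1)
```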
### Modified
- Refactored front-end client code to facilitate simpler and more consistent presets.
- Refactored tabular data display to simplify code. Output result tables now use ASCII box-drawing
characters for borders, which helps group data visually. Borders may be disabled by setting SHOW_BORDERS=0
- The All-to-all preset is now multi-rank compatible. When executed on multiple ranks, it runs an
inter-rank all-to-all and then reports the min/max across all ranks. The number of extrema
results shown can be adjusted by setting NUM_RESULTS
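- For instance, with an MPI-enabled build, a two-rank all-to-all run might be launched as:

```
mpirun -np 2 ./TransferBench a2a
```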
### Fixed
- Added guard for ROCm version when using __syncwarp()
- Exiting with non-zero code on fatal errors
## v1.65.00
### Added
- Added warp-level dispatch support via GFX_SE_TYPE environment variable
......
# Copyright (c) 2023-2025 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
......@@ -9,7 +9,7 @@ if (NOT CMAKE_TOOLCHAIN_FILE)
message(STATUS "CMAKE_TOOLCHAIN_FILE: ${CMAKE_TOOLCHAIN_FILE}")
endif()
set(VERSION_STRING "1.65.00")
set(VERSION_STRING "1.66.00")
project(TransferBench VERSION ${VERSION_STRING} LANGUAGES CXX)
## Load CMake modules
......@@ -24,6 +24,7 @@ list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
#==================================================================================================
option(BUILD_LOCAL_GPU_TARGET_ONLY "Build only for GPUs detected on this machine" OFF)
option(ENABLE_NIC_EXEC "Enable RDMA NIC Executor in TransferBench" OFF)
option(ENABLE_MPI_COMM "Enable MPI Communicator support" OFF)
# Default GPU architectures to build
#==================================================================================================
......@@ -129,7 +130,7 @@ endif()
if(DEFINED ENV{DISABLE_NIC_EXEC} AND "$ENV{DISABLE_NIC_EXEC}" STREQUAL "1")
message(STATUS "Disabling NIC Executor support as env. flag DISABLE_NIC_EXEC was enabled")
elseif(NOT ENABLE_NIC_EXEC)
message(STATUS "For CMake builds, NIC executor so requires explicit opt-in by setting CMake flag -DENABLE_NIC_EXEC=1")
message(STATUS "For CMake builds, NIC executor support requires explicit opt-in by setting CMake flag -DENABLE_NIC_EXEC=ON")
message(STATUS "Disabling NIC Executor support")
else()
find_library(IBVERBS_LIBRARY ibverbs)
......@@ -149,6 +150,40 @@ else()
endif()
endif()
## Check for MPI support
set(MPI_PATH "" CACHE PATH "Path to MPI installation (takes priority over system MPI)")
if(NOT ENABLE_MPI_COMM)
message(STATUS "For CMake builds, MPI Communicator support requires explicit opt-in by setting CMake flag -DENABLE_MPI_COMM=ON")
message(STATUS "Disabling MPI Communicator support")
else()
# First check user-specified MPI_PATH (similar to Makefile)
if(MPI_PATH AND EXISTS "${MPI_PATH}/include/mpi.h")
find_library(MPI_LIBRARY NAMES mpi PATHS ${MPI_PATH}/lib NO_DEFAULT_PATH)
if(MPI_LIBRARY)
set(MPI_COMM_FOUND 1)
set(MPI_INCLUDE_DIR "${MPI_PATH}/include")
set(MPI_LINK_DIR "${MPI_PATH}/lib")
message(STATUS "Building with MPI Communicator support (found at MPI_PATH: ${MPI_PATH})")
else()
message(WARNING "Found mpi.h at ${MPI_PATH}/include but could not find MPI library at ${MPI_PATH}/lib")
endif()
else()
# Fall back to find_package
if(MPI_PATH)
message(STATUS "Unable to find mpi.h at ${MPI_PATH}/include, trying find_package")
endif()
find_package(MPI QUIET)
if(MPI_CXX_FOUND)
set(MPI_COMM_FOUND 1)
message(STATUS "Building with MPI Communicator support (found via find_package)")
message(STATUS "- Using MPI include path: ${MPI_CXX_INCLUDE_PATH}")
message(STATUS "- Using MPI library: ${MPI_CXX_LIBRARIES}")
else()
message(WARNING "MPI not found. Please specify appropriate MPI_PATH or install MPI libraries (e.g., OpenMPI or MPICH)")
endif()
endif()
endif()
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY .)
add_executable(TransferBench src/client/Client.cpp)
......@@ -163,6 +198,22 @@ if(IBVERBS_FOUND)
target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY})
target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED)
endif()
if(MPI_COMM_FOUND)
if(TARGET MPI::MPI_CXX)
# Found via find_package
target_include_directories(TransferBench PRIVATE ${MPI_CXX_INCLUDE_DIRS})
target_link_libraries(TransferBench PRIVATE MPI::MPI_CXX)
else()
# Found via MPI_PATH fallback
target_include_directories(TransferBench PRIVATE ${MPI_INCLUDE_DIR})
target_link_libraries(TransferBench PRIVATE ${MPI_LIBRARY})
endif()
target_compile_definitions(TransferBench PRIVATE MPI_COMM_ENABLED)
endif()
if (HAVE_PARALLEL_JOBS)
target_compile_options(TransferBench PRIVATE -parallel-jobs=12)
endif()
target_link_libraries(TransferBench PRIVATE -fgpu-rdc) # Required when linking relocatable device code
target_link_libraries(TransferBench PRIVATE Threads::Threads)
......
#
# Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
#
# Configuration options
ROCM_PATH ?= /opt/rocm
CUDA_PATH ?= /usr/local/cuda
MPI_PATH ?= /usr/local/openmpi
HIPCC ?= $(ROCM_PATH)/bin/amdclang++
NVCC ?= $(CUDA_PATH)/bin/nvcc
# Option to compile with single GFX kernel to drop compilation time
SINGLE_KERNEL ?= 0
# This can be a space separated string of multiple GPU targets
# Default is the native GPU target
GPU_TARGETS ?= native
......@@ -35,9 +39,13 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),)
CXXFLAGS = -I$(ROCM_PATH)/include -I$(ROCM_PATH)/include/hip -I$(ROCM_PATH)/include/hsa
HIPLDFLAGS= -lnuma -L$(ROCM_PATH)/lib -lhsa-runtime64 -lamdhip64
HIPFLAGS = -x hip -D__HIP_PLATFORM_AMD__ -D__HIPCC__ $(GPU_TARGETS_FLAGS)
HIPFLAGS = -Wall -x hip -D__HIP_PLATFORM_AMD__ -D__HIPCC__ $(GPU_TARGETS_FLAGS)
NVFLAGS = -x cu -lnuma -arch=native
ifeq ($(SINGLE_KERNEL), 1)
CXXFLAGS += -DSINGLE_KERNEL
endif
ifeq ($(DEBUG), 0)
COMMON_FLAGS += -O3
else
......@@ -70,8 +78,34 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),)
$(info Building with NIC executor support. Can set DISABLE_NIC_EXEC=1 to disable)
endif
endif
MPI_ENABLED = 0
# Compile with MPI communicator support if
# 1) DISABLE_MPI_COMM is not set to 1
# 2) mpi.h is found in the MPI_PATH
DISABLE_MPI_COMM ?= 0
ifneq ($(DISABLE_MPI_COMM), 1)
ifeq ($(wildcard $(MPI_PATH)/include/mpi.h),)
$(info Unable to find mpi.h at $(MPI_PATH)/include. Please specify appropriate MPI_PATH)
else
MPI_ENABLED = 1
CXXFLAGS += -DMPI_COMM_ENABLED -I$(MPI_PATH)/include
LDFLAGS += -L$(MPI_PATH)/lib -lmpi
ifeq ($(DEBUG), 1)
LDFLAGS += -lmpi_cxx
endif
endif
ifeq ($(MPI_ENABLED), 0)
$(info Building without MPI communicator support)
$(info To use TransferBench with MPI support, install MPI libraries and specify appropriate MPI_PATH)
else
$(info Building with MPI communicator support. Can set DISABLE_MPI_COMM=1 to disable)
endif
endif
endif
.PHONY : all clean
all: $(EXE)
......
# TransferBench
TransferBench is a utility for benchmarking simultaneous copies between user-specified
CPU and GPU devices.
CPU and GPU memory locations using CPUs/GPU kernels/DMA engines/NIC devices.
> [!NOTE]
> The published documentation is available at [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html) in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `TransferBench/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see [Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
......@@ -18,7 +18,7 @@ left_nav_title = f"TransferBench {version_number} Documentation"
# for PDF output on Read the Docs
project = "TransferBench Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved."
version = version_number
release = version_number
......
......@@ -177,9 +177,16 @@ Here is the list of preset configurations that can be used instead of configurat
* - ``cmdline``
- Allows transfers to run from the command line instead of a configuration file
* - ``dryrun``
- Lists the set of transfers to be executed as provided from the command line. This is useful when using wildcards to ensure correctness
* - ``healthcheck``
- Simple health check (supported on AMD Instinct MI300 series only)
* - ``nic_rings``
- Measure performance of NICs set up in a ring across ranks
* - ``p2p``
- Peer-to-peer benchmark test
......
......@@ -50,6 +50,12 @@ Here are the steps to build TransferBench:
If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
NIC executor support will be enabled if IBVerbs is detected and if ``infiniband/verbs.h`` is found in the default include path.
NIC executor support can be disabled explicitly by setting ``DISABLE_NIC_EXEC=1``
MPI support will be enabled if ``mpi.h`` is found in ``MPI_PATH/include/``.
MPI support can be disabled explicitly by setting ``DISABLE_MPI_COMM=1``.
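For example, a Makefile build with MPI support might look like this (the path is a placeholder and should point at your MPI installation)::

    make MPI_PATH=/usr/local/openmpi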
Building documentation
-----------------------
......
......@@ -47,13 +47,17 @@
# Memory locations are specified by one or more (device character / device index) pairs
# Character indicating memory type followed by device index (0-indexed)
# Supported memory locations are:
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# - P: Pinned host memory (on NUMA node, but indexed by closest GPU [#GPUs -1])
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - P: Pinned host memory (on the NUMA node closest to the indexed GPU, indexed from 0 to [# GPUs - 1])
# - B: Coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - K: Uncached pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - H: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - U: Uncached device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# Examples:
# 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
......
/*
Copyright (c) 2019-2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -20,23 +20,33 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#include "Client.hpp"
#include "Presets.hpp"
#include "Topology.hpp"
#include <fstream>
int main(int argc, char **argv) {
void DisplayVersion();
void DisplayUsage(char const* cmdName);
using namespace TransferBench;
using namespace TransferBench::Utils;
size_t constexpr DEFAULT_BYTES_PER_TRANSFER = (1<<28);
int main(int argc, char **argv)
{
// Collect environment variables
EnvVars ev;
// Display usage instructions and detected topology
if (argc <= 1) {
if (!ev.outputToCsv) {
DisplayUsage(argv[0]);
DisplayPresets();
if (RankDoesOutput()) {
if (!ev.outputToCsv) {
DisplayVersion();
DisplayUsage(argv[0]);
DisplayPresets();
}
DisplayTopology(ev.outputToCsv, ev.showBorders);
}
DisplayTopology(ev.outputToCsv);
exit(0);
}
......@@ -52,42 +62,73 @@ int main(int argc, char **argv) {
}
}
if (numBytesPerTransfer % 4) {
printf("[ERROR] numBytesPerTransfer (%lu) must be a multiple of 4\n", numBytesPerTransfer);
Print("[ERROR] numBytesPerTransfer (%lu) must be a multiple of 4\n", numBytesPerTransfer);
exit(1);
}
// Display TransferBench version and build configuration
DisplayVersion();
// Run preset benchmark if requested
if (RunPreset(ev, numBytesPerTransfer, argc, argv)) exit(0);
int retCode = 0;
if (RunPreset(ev, numBytesPerTransfer, argc, argv, retCode)) return retCode;
// Read input from command line or configuration file
bool isDryRun = !strcmp(argv[1], "dryrun");
std::vector<std::string> lines;
{
std::string line;
if (!strcmp(argv[1], "cmdline")) {
if (!strcmp(argv[1], "cmdline") || isDryRun) {
for (int i = 3; i < argc; i++)
line += std::string(argv[i]) + " ";
lines.push_back(line);
} else {
std::ifstream cfgFile(argv[1]);
if (!cfgFile.is_open()) {
printf("[ERROR] Unable to open transfer configuration file: [%s]\n", argv[1]);
Print("[ERROR] Unable to open transfer configuration file: [%s]\n", argv[1]);
exit(1);
}
}
while (std::getline(cfgFile, line))
lines.push_back(line);
cfgFile.close();
}
}
// Print environment variables and CSV header
ev.DisplayEnvVars();
if (ev.outputToCsv)
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
TransferBench::ConfigOptions cfgOptions = ev.ToConfigOptions();
TransferBench::TestResults results;
ConfigOptions cfgOptions = ev.ToConfigOptions();
TestResults results;
std::vector<ErrResult> errors;
// Dry run prints off transfers (and errors)
if (isDryRun) {
Print("Transfers to be executed (dry-run):\n");
Print("================================================================================\n");
std::vector<Transfer> transfers;
CheckForError(ParseTransfers(lines[0], transfers));
if (transfers.empty()) {
Print("<none>\n");
} else {
bool isMultiNode = GetNumRanks() > 1;
for (size_t i = 0; i < transfers.size(); i++) {
Transfer const& t = transfers[i];
Print("Transfer %5lu: (%s->", i, MemDevicesToStr(t.srcs).c_str());
if (isMultiNode) Print("R%d", t.exeDevice.exeRank);
Print("%c%d", ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex);
if (t.exeDevice.exeSlot) Print("%c", 'A' + t.exeDevice.exeSlot);
if (t.exeSubIndex != -1) Print(".%d", t.exeSubIndex);
if (t.exeSubSlot != 0) Print("%c", 'A' + t.exeSubSlot);
Print("->%s)\n", MemDevicesToStr(t.dsts).c_str());
}
}
return 0;
}
// Print environment variables and CSV header
if (RankDoesOutput()) {
ev.DisplayEnvVars();
if (ev.outputToCsv)
Print("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
}
// Process each line as a Test
int testNum = 0;
for (std::string const &line : lines) {
......@@ -96,7 +137,7 @@ int main(int argc, char **argv) {
// Parse set of parallel Transfers to execute
std::vector<Transfer> transfers;
CheckForError(TransferBench::ParseTransfers(line, transfers));
CheckForError(ParseTransfers(line, transfers));
if (transfers.empty()) continue;
// Check for variable sub-executors Transfers
......@@ -107,7 +148,7 @@ int main(int argc, char **argv) {
for (auto const& t : transfers) {
if (t.numSubExecs == 0) {
if (t.exeDevice.exeType != EXE_GPU_GFX) {
printf("[ERROR] Variable number of subexecutors is only supported on GFX executors\n");
Print("[ERROR] Variable number of subexecutors is only supported on GFX executors\n");
exit(1);
}
numVariableTransfers++;
......@@ -116,7 +157,7 @@ int main(int argc, char **argv) {
}
}
if (numVariableTransfers > 0 && numVariableTransfers != transfers.size()) {
printf("[ERROR] All or none of the Transfers in the Test must use variable number of Subexecutors\n");
Print("[ERROR] All or none of the Transfers in the Test must use variable number of Subexecutors\n");
exit(1);
}
}
......@@ -140,18 +181,20 @@ int main(int argc, char **argv) {
}
if (maxVarCount == 0) {
if (TransferBench::RunTransfers(cfgOptions, transfers, results)) {
if (RunTransfers(cfgOptions, transfers, results)) {
PrintResults(ev, ++testNum, transfers, results);
}
PrintErrors(results.errResults);
if (RankDoesOutput()) {
PrintErrors(results.errResults);
}
} else {
// Variable subexecutors - Determine how many subexecutors to sweep up to
int maxNumVarSubExec = ev.maxNumVarSubExec;
if (maxNumVarSubExec == 0) {
maxNumVarSubExec = TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0}) / maxVarCount;
maxNumVarSubExec = GetNumSubExecutors({EXE_GPU_GFX, 0}) / maxVarCount;
}
TransferBench::TestResults bestResults;
TestResults bestResults;
std::vector<Transfer> bestTransfers;
for (int numSubExecs = ev.minNumVarSubExec; numSubExecs <= maxNumVarSubExec; numSubExecs++) {
std::vector<Transfer> tempTransfers = transfers;
......@@ -159,8 +202,8 @@ int main(int argc, char **argv) {
if (t.numSubExecs == 0) t.numSubExecs = numSubExecs;
}
TransferBench::TestResults tempResults;
if (!TransferBench::RunTransfers(cfgOptions, tempTransfers, tempResults)) {
TestResults tempResults;
if (!RunTransfers(cfgOptions, tempTransfers, tempResults)) {
PrintErrors(tempResults.errResults);
} else {
if (tempResults.avgTotalBandwidthGbPerSec > bestResults.avgTotalBandwidthGbPerSec) {
......@@ -180,158 +223,49 @@ int main(int argc, char **argv) {
}
}
void DisplayUsage(char const* cmdName)
void DisplayVersion()
{
std::string nicSupport = "";
bool nicSupport = false, mpiSupport = false;
#if NIC_EXEC_ENABLED
nicSupport = " (with NIC support)";
nicSupport = true;
#endif
#if MPI_COMM_ENABLED
mpiSupport = true;
#endif
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("========================================\n");
if (numa_available() == -1) {
printf("[ERROR] NUMA library not supported. Check to see if libnuma has been installed on this system\n");
exit(1);
}
printf("Usage: %s config <N>\n", cmdName);
printf(" config: Either:\n");
printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
printf(" - Name of preset config:\n");
printf(" N : (Optional) Number of bytes to copy per Transfer.\n");
printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
printf(" If 0 is specified, a range of Ns will be benchmarked\n");
printf(" May append a suffix ('K', 'M', 'G') for kilobytes / megabytes / gigabytes\n");
printf("\n");
EnvVars::DisplayUsage();
}
std::string MemDevicesToStr(std::vector<MemDevice> const& memDevices) {
if (memDevices.empty()) return "N";
std::stringstream ss;
for (auto const& m : memDevices)
ss << TransferBench::MemTypeStr[m.memType] << m.memIndex;
return ss.str();
}
void PrintResults(EnvVars const& ev, int const testNum,
std::vector<Transfer> const& transfers,
TransferBench::TestResults const& results)
{
char sep = ev.outputToCsv ? ',' : '|';
size_t numTimedIterations = results.numTimedIterations;
if (!ev.outputToCsv) printf("Test %d:\n", testNum);
// Loop over each executor
for (auto exeInfoPair : results.exeResults) {
ExeDevice const& exeDevice = exeInfoPair.first;
ExeResult const& exeResult = exeInfoPair.second;
ExeType const exeType = exeDevice.exeType;
int32_t const exeIndex = exeDevice.exeIndex;
printf(" Executor: %3s %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
ExeTypeName[exeType], exeIndex, sep, exeResult.avgBandwidthGbPerSec, sep,
exeResult.avgDurationMsec, sep, exeResult.numBytes, sep, exeResult.sumBandwidthGbPerSec);
// Loop over each executor
for (int idx : exeResult.transferIdx) {
Transfer const& t = transfers[idx];
TransferResult const& r = results.tfrResults[idx];
char exeSubIndexStr[32] = "";
if (t.exeSubIndex != -1)
sprintf(exeSubIndexStr, ".%d", t.exeSubIndex);
printf(" Transfer %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %c%03d%s:%03d -> %s\n",
idx, sep,
r.avgBandwidthGbPerSec, sep,
r.avgDurationMsec, sep,
r.numBytes, sep,
MemDevicesToStr(t.srcs).c_str(),
TransferBench::ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
exeSubIndexStr, t.numSubExecs,
MemDevicesToStr(t.dsts).c_str());
// Show per-iteration timing information
if (ev.showIterations) {
// Check that per-iteration information exists
if (r.perIterMsec.size() != numTimedIterations) {
printf("[ERROR] Per iteration timing data unavailable: Expected %lu data points, but have %lu\n",
numTimedIterations, r.perIterMsec.size());
exit(1);
}
// Compute standard deviation and track iterations by speed
std::set<std::pair<double, int>> times;
double stdDevTime = 0;
double stdDevBw = 0;
for (int i = 0; i < numTimedIterations; i++) {
times.insert(std::make_pair(r.perIterMsec[i], i+1));
double const varTime = fabs(r.avgDurationMsec - r.perIterMsec[i]);
stdDevTime += varTime * varTime;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / r.perIterMsec[i] * 1000.0f;
double const varBw = fabs(iterBandwidthGbs - r.avgBandwidthGbPerSec);
stdDevBw += varBw * varBw;
}
stdDevTime = sqrt(stdDevTime / numTimedIterations);
stdDevBw = sqrt(stdDevBw / numTimedIterations);
// Loop over iterations (fastest to slowest)
for (auto& time : times) {
double iterDurationMsec = time.first;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / iterDurationMsec * 1000.0f;
printf(" Iter %03d %c %8.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);
std::string support = "";
if (mpiSupport && nicSupport) support = " (with MPI+NIC support)";
else if (mpiSupport) support = " (with MPI support)";
else if (nicSupport) support = " (with NIC support)";
std::set<int> usedXccs;
if (time.second - 1 < r.perIterCUs.size()) {
printf(" CUs:");
for (auto x : r.perIterCUs[time.second - 1]) {
printf(" %02d:%02d", x.first, x.second);
usedXccs.insert(x.first);
}
}
printf(" XCCs:");
for (auto x : usedXccs)
printf(" %02d", x);
printf("\n");
}
printf(" StandardDev %c %8.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
}
}
std::string multiNodeMode = "";
switch (GetCommMode()) {
case COMM_NONE: multiNodeMode = " (Single-node mode)"; break;
case COMM_SOCKET: multiNodeMode = " (Multi-node via sockets)"; break;
case COMM_MPI: multiNodeMode = " (Multi-node via MPI)"; break;
}
printf(" Aggregate (CPU) %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
sep, results.avgTotalBandwidthGbPerSec,
sep, results.avgTotalDurationMsec,
sep, results.totalBytesTransferred,
sep, results.overheadMsec);
Print("TransferBench v%s.%s%s%s\n", VERSION, CLIENT_VERSION, support.c_str(), multiNodeMode.c_str());
Print("=============================================================================================================\n");
}
void CheckForError(ErrResult const& error)
void DisplayUsage(char const* cmdName)
{
switch (error.errType) {
case ERR_NONE: return;
case ERR_WARN:
printf("[WARN] %s\n", error.errMsg.c_str());
return;
case ERR_FATAL:
printf("[ERROR] %s\n", error.errMsg.c_str());
if (numa_available() == -1) {
Print("[ERROR] NUMA library not supported. Check to see if libnuma has been installed on this system\n");
exit(1);
default:
break;
}
}
void PrintErrors(std::vector<ErrResult> const& errors)
{
bool isFatal = false;
for (auto const& err : errors) {
printf("[%s] %s\n", err.errType == ERR_FATAL ? "ERROR" : "WARN", err.errMsg.c_str());
isFatal |= (err.errType == ERR_FATAL);
}
if (isFatal) exit(1);
}
Print("Usage: %s config <N>\n", cmdName);
Print(" config: Either:\n");
Print(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
Print(" - Name of preset config:\n");
Print(" N : (Optional) Number of bytes to copy per Transfer.\n");
Print(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
Print(" If 0 is specified, a range of Ns will be benchmarked\n");
Print(" May append a suffix ('K', 'M', 'G') for kilobytes / megabytes / gigabytes\n");
Print("\n");
EnvVars::DisplayUsage();
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#pragma once
// TransferBench client version
#define CLIENT_VERSION "00"
#include "TransferBench.hpp"
#include "EnvVars.hpp"
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<28);
char const ExeTypeName[5][4] = {"CPU", "GPU", "DMA", "NIC", "NIC"};
// Display detected hardware
void DisplayTopology(bool outputToCsv);
// Display usage instructions
void DisplayUsage(char const* cmdName);
// Print TransferBench test results
void PrintResults(EnvVars const& ev, int const testNum,
std::vector<Transfer> const& transfers,
TransferBench::TestResults const& results);
// Helper function that converts MemDevices to a string
std::string MemDevicesToStr(std::vector<MemDevice> const& memDevices);
// Helper function to print warning / exit on fatal error
void CheckForError(ErrResult const& error);
// Helper function to print list of errors
void PrintErrors(std::vector<ErrResult> const& errors);
/*
Copyright (c) 2021-2025 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -39,7 +39,8 @@ THE SOFTWARE.
#include <numa.h>
#include <random>
#include <time.h>
#include "Client.hpp"
#define CLIENT_VERSION "00"
#include "TransferBench.hpp"
using namespace TransferBench;
......@@ -69,6 +70,7 @@ public:
int numIterations; // Number of timed iterations to perform. If negative, run for -numIterations seconds instead
int numSubIterations; // Number of subiterations to perform
int numWarmups; // Number of un-timed warmup iterations to perform
int showBorders; // Show ASCII box-drawing characters in tables
int showIterations; // Show per-iteration timing info
int useInteractive; // Pause for user-input before starting transfer loop
......@@ -107,11 +109,11 @@ public:
// NIC options
int ibGidIndex; // GID Index for RoCE NICs
int roceVersion; // RoCE version number
int ipAddressFamily; // IP Address Family
uint8_t ibPort; // NIC port number to be used
int ipAddressFamily; // IP Address Family
int nicChunkBytes; // Number of bytes to send per chunk for RDMA operations
int nicRelaxedOrder; // Use relaxed ordering for RDMA
std::string closestNicStr; // Holds the user-specified list of closest NICs
int roceVersion; // RoCE version number
// Developer features
int gpuMaxHwQueues; // Tracks GPU_MAX_HW_QUEUES environment variable
......@@ -119,14 +121,16 @@ public:
// Constructor that collects values
EnvVars()
{
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
int numDeviceCUs = TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0});
// Try to detect the GPU
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, 0));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
std::string fullName = "";
std::string archName = "";
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
if (numDetectedGpus > 0) {
HIP_CALL(hipGetDeviceProperties(&prop, 0));
fullName = prop.gcnArchName;
archName = fullName.substr(0, fullName.find(':'));
}
// Different hardware pick different GPU kernels
// This performance difference is generally only noticeable when executing fewer CUs
......@@ -156,6 +160,7 @@ public:
numWarmups = GetEnvVar("NUM_WARMUPS" , 3);
outputToCsv = GetEnvVar("OUTPUT_TO_CSV" , 0);
samplingFactor = GetEnvVar("SAMPLING_FACTOR" , 1);
showBorders = GetEnvVar("SHOW_BORDERS" , 1);
showIterations = GetEnvVar("SHOW_ITERATIONS" , 0);
useHipEvents = GetEnvVar("USE_HIP_EVENTS" , 1);
useHsaDma = GetEnvVar("USE_HSA_DMA" , 0);
......@@ -168,8 +173,8 @@ public:
ibPort = GetEnvVar("IB_PORT_NUMBER" , 1);
roceVersion = GetEnvVar("ROCE_VERSION" , 2);
ipAddressFamily = GetEnvVar("IP_ADDRESS_FAMILY" , 4);
nicChunkBytes = GetEnvVar("NIC_CHUNK_BYTES" , 1073741824);
nicRelaxedOrder = GetEnvVar("NIC_RELAX_ORDER" , 1);
closestNicStr = GetEnvVar("CLOSEST_NIC" , "");
gpuMaxHwQueues = GetEnvVar("GPU_MAX_HW_QUEUES" , 4);
......@@ -314,9 +319,6 @@ public:
printf(" ALWAYS_VALIDATE - Validate after each iteration instead of once after all iterations\n");
printf(" BLOCK_BYTES - Controls granularity of how work is divided across subExecutors\n");
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4\n");
#if NIC_EXEC_ENABLED
printf(" CLOSEST_NIC - Comma-separated list of per-GPU closest NIC (default=auto)\n");
#endif
printf(" CU_MASK - CU mask for streams. Can specify ranges e.g '5,10-12,14'\n");
printf(" FILL_COMPRESS - Percentages of 64B lines to be filled by random/1B0/2B0/4B0/32B0\n");
printf(" FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits\n");
......@@ -337,6 +339,7 @@ public:
printf(" MIN_VAR_SUBEXEC - Minimum # of subexecutors to use for variable subExec Transfers\n");
printf(" MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)\n");
#if NIC_EXEC_ENABLED
printf(" NIC_CHUNK_BYTES - Number of bytes to send at a time using NIC (default = 1GB)\n");
printf(" NIC_RELAX_ORDER - Set to non-zero to use relaxed ordering\n");
#endif
printf(" NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead\n");
......@@ -347,6 +350,7 @@ public:
printf(" ROCE_VERSION - RoCE version (default=2)\n");
#endif
printf(" SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHOW_BORDERS - Show ASCII box-drawing characters in tables\n");
printf(" SHOW_ITERATIONS - Show per-iteration timing info\n");
printf(" USE_HIP_EVENTS - Use HIP events for GFX executor timing\n");
printf(" USE_HSA_DMA - Use hsa_amd_async_copy instead of hipMemcpy for non-targeted DMA execution\n");
......@@ -386,8 +390,6 @@ public:
nicSupport = " (with NIC support)";
#endif
if (!outputToCsv) {
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("===============================================================\n");
if (!hideEnv) printf("[Common] (Suppress by setting HIDE_ENV=1)\n");
}
else if (!hideEnv)
......@@ -400,10 +402,6 @@ public:
"Each CU gets a multiple of %d bytes to copy", blockBytes);
Print("BYTE_OFFSET", byteOffset,
"Using byte offset of %d", byteOffset);
#if NIC_EXEC_ENABLED
Print("CLOSEST_NIC", (closestNicStr == "" ? "auto" : "user-input"),
"Per-GPU closest NIC is set as %s", (closestNicStr == "" ? "auto" : closestNicStr.c_str()));
#endif
Print("CU_MASK", getenv("CU_MASK") ? 1 : 0,
"%s", (cuMask.size() ? GetCuMaskDesc().c_str() : "All"));
Print("FILL_COMPRESS", getenv("FILL_COMPRESS") ? 1 : 0,
......@@ -452,6 +450,8 @@ public:
"Using up to %s subexecutors for variable subExec transfers",
maxNumVarSubExec ? std::to_string(maxNumVarSubExec).c_str() : "all available");
#if NIC_EXEC_ENABLED
Print("NIC_CHUNK_BYTES", nicChunkBytes,
"Sending %d bytes at a time for NIC RDMA", nicChunkBytes);
Print("NIC_RELAX_ORDER", nicRelaxedOrder,
"Using %s ordering for NIC RDMA", nicRelaxedOrder ? "relaxed" : "strict");
#endif
......@@ -466,6 +466,7 @@ public:
Print("ROCE_VERSION", roceVersion,
"RoCE version is set to %d", roceVersion);
#endif
Print("SHOW_BORDERS", showBorders, "%s ASCII box-drawing characters in tables", showBorders ? "Showing" : "Hiding");
Print("SHOW_ITERATIONS", showIterations,
"%s per-iteration timing", showIterations ? "Showing" : "Hiding");
Print("USE_HIP_EVENTS", useHipEvents,
......@@ -497,8 +498,17 @@ public:
// Helper function that parses an environment variable or falls back to a default value
static int GetEnvVar(std::string const& varname, int defaultValue)
{
if (getenv(varname.c_str()))
return atoi(getenv(varname.c_str()));
char const* varStr = getenv(varname.c_str());
if (varStr) {
int val = atoi(varStr);
char units = varStr[strlen(varStr)-1];
switch (units) {
case 'G': case 'g': val *= 1024; // fall through
case 'M': case 'm': val *= 1024; // fall through
case 'K': case 'k': val *= 1024;
}
return val;
}
return defaultValue;
}
......@@ -633,27 +643,13 @@ public:
cfg.gfx.waveOrder = gfxWaveOrder;
cfg.gfx.wordSize = gfxWordSize;
cfg.nic.chunkBytes = nicChunkBytes;
cfg.nic.ibGidIndex = ibGidIndex;
cfg.nic.ibPort = ibPort;
cfg.nic.ipAddressFamily = ipAddressFamily;
cfg.nic.useRelaxedOrder = nicRelaxedOrder;
cfg.nic.roceVersion = roceVersion;
std::vector<int> closestNics;
if(closestNicStr != "") {
std::stringstream ss(closestNicStr);
std::string item;
while (std::getline(ss, item, ',')) {
try {
int nic = std::stoi(item);
closestNics.push_back(nic);
} catch (const std::invalid_argument& e) {
printf("[ERROR] Invalid NIC index (%s) by user in %s\n", item.c_str(), closestNicStr.c_str());
exit(1);
}
}
cfg.nic.closestNics = closestNics;
}
return cfg;
}
};
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -22,36 +22,49 @@ THE SOFTWARE.
#include "EnvVars.hpp"
void AllToAllRdmaPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int AllToAllRdmaPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] a2an preset currently not supported for multi-node\n");
return 1;
}
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , 1);
int memTypeIdx = EnvVars::GetEnvVar("MEM_TYPE" , 2);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , -999); // Deprecated
// Deprecated env var check
if (useFineGrain != -999) {
memTypeIdx = useFineGrain ? 2 : 0;
}
MemType memType = Utils::GetGpuMemType(memTypeIdx);
std::string memTypeStr = Utils::GetGpuMemTypeStr(memTypeIdx);
// Print off environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Network Related]\n");
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("USE_FINE_GRAIN" , useFineGrain , "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
printf("\n");
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Network Related]\n");
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("MEM_TYPE" , memTypeIdx , "Using %s memory (%s)", memTypeStr.c_str(), Utils::GetAllGpuMemTypeStr().c_str());
printf("\n");
}
}
// Validate env vars
if (numGpus < 0 || numGpus > numDetectedGpus) {
printf("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
exit(1);
Utils::Print("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
return 1;
}
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::map<std::pair<int, int>, int> reIndex;
std::vector<Transfer> transfers;
......@@ -71,31 +84,31 @@ void AllToAllRdmaPreset(EnvVars& ev,
}
}
printf("GPU-RDMA All-To-All benchmark:\n");
printf("==========================\n");
printf("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n",
Utils::Print("GPU-RDMA All-To-All benchmark:\n");
Utils::Print("==========================\n");
Utils::Print("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n",
numBytesPerTransfer, numQueuePairs, transfers.size());
if (transfers.size() == 0) return;
if (transfers.size() == 0) return 0;
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str());
exit(0);
Utils::Print("%s\n", err.errMsg.c_str());
return 1;
} else {
PrintResults(ev, 1, transfers, results);
Utils::PrintResults(ev, 1, transfers, results);
}
// Print results
char separator = (ev.outputToCsv ? ',' : ' ');
printf("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
printf("==========================================================\n");
printf("SRC\\DST ");
Utils::Print("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
Utils::Print("==========================================================\n");
Utils::Print("SRC\\DST ");
for (int dst = 0; dst < numGpus; dst++)
printf("%cGPU %02d ", separator, dst);
printf(" %cSTotal %cActual\n", separator, separator);
Utils::Print("%cGPU %02d ", separator, dst);
Utils::Print(" %cSTotal %cActual\n", separator, separator);
double totalBandwidthGpu = 0.0;
double minActualBandwidth = std::numeric_limits<double>::max();
......@@ -105,7 +118,7 @@ void AllToAllRdmaPreset(EnvVars& ev,
double rowTotalBandwidth = 0;
int transferCount = 0;
double minBandwidth = std::numeric_limits<double>::max();
printf("GPU %02d", src);
Utils::Print("GPU %02d", src);
for (int dst = 0; dst < numGpus; dst++) {
if (reIndex.count(std::make_pair(src, dst))) {
int const transferIdx = reIndex[std::make_pair(src,dst)];
......@@ -115,28 +128,30 @@ void AllToAllRdmaPreset(EnvVars& ev,
totalBandwidthGpu += r.avgBandwidthGbPerSec;
minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec);
transferCount++;
printf("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
Utils::Print("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
} else {
printf("%c%8s ", separator, "N/A");
Utils::Print("%c%8s ", separator, "N/A");
}
}
double actualBandwidth = minBandwidth * transferCount;
printf(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
Utils::Print(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
minActualBandwidth = std::min(minActualBandwidth, actualBandwidth);
maxActualBandwidth = std::max(maxActualBandwidth, actualBandwidth);
colTotalBandwidth[numGpus+1] += rowTotalBandwidth;
}
printf("\nRTotal");
Utils::Print("\nRTotal");
for (int dst = 0; dst < numGpus; dst++) {
printf("%c%8.3f ", separator, colTotalBandwidth[dst]);
Utils::Print("%c%8.3f ", separator, colTotalBandwidth[dst]);
}
printf(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
Utils::Print(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
separator, minActualBandwidth, separator, maxActualBandwidth);
printf("\n");
Utils::Print("\n");
Utils::Print("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
Utils::Print("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
Utils::Print("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
printf("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
printf("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
printf("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
Utils::PrintErrors(results.errResults);
PrintErrors(results.errResults);
return 0;
}
/*
Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -22,10 +22,15 @@ THE SOFTWARE.
#include "EnvVars.hpp"
void AllToAllSweepPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int AllToAllSweepPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] All to All Sweep preset currently not supported for multi-node\n");
return 1;
}
enum
{
A2A_COPY = 0,
......@@ -172,7 +177,7 @@ void AllToAllSweepPreset(EnvVars& ev,
printf("- Copying %lu bytes between %s pairs of GPUs\n", numBytesPerTransfer, a2aDirect ? "directly connected" : "all");
if (transfers.size() == 0) {
printf("[WARN] No transfers requested. Try adjusting A2A_DIRECT or A2A_LOCAL\n");
return;
return 0;
}
// Execute Transfers
......@@ -227,9 +232,10 @@ void AllToAllSweepPreset(EnvVars& ev,
for (int c : numCusList) {
for (int u : unrollList) {
printf("CUs: %d Unroll %d\n", c, u);
PrintResults(ev, ++testNum, transfers, results[std::make_pair(c,u)]);
Utils::PrintResults(ev, ++testNum, transfers, results[std::make_pair(c,u)]);
}
}
}
}
return 0;
}
......@@ -186,7 +186,7 @@ int TestUnidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
}
......@@ -232,7 +232,7 @@ int TestUnidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
}
......@@ -298,7 +298,7 @@ int TestBidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
}
......@@ -423,7 +423,7 @@ int TestHbmPerformance(int modelId, bool verbose)
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
}
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
if (fails.size() == 0) {
......@@ -439,14 +439,19 @@ int TestHbmPerformance(int modelId, bool verbose)
return hasFail;
}
void HealthCheckPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int HealthCheckPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Healthcheck preset currently not supported for multi-node\n");
return 1;
}
// Check for supported platforms
#if defined(__NVCC__)
printf("[WARN] healthcheck preset not supported on NVIDIA hardware\n");
return;
return 0;
#endif
printf("Disclaimer:\n");
......@@ -468,5 +473,5 @@ void HealthCheckPreset(EnvVars& ev,
numFails += TestUnidir(modelId, verbose);
numFails += TestBidir(modelId, verbose);
numFails += TestAllToAll(modelId, verbose);
exit(numFails ? 1 : 0);
return numFails ? 1 : 0;
}
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
int NicRingsPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
// Check for single homogenous group
if (Utils::GetNumRankGroups() > 1) {
Utils::Print("[ERROR] NIC-rings preset can only be run across ranks that are homogeneous\n");
Utils::Print("[ERROR] Run ./TransferBench without any args to display topology information\n");
Utils::Print("[ERROR] NIC_FILTER may also be used to limit NIC visibility\n");
return 1;
}
// Collect topology
int numRanks = TransferBench::GetNumRanks();
// Read in environment variables
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int showDetails = EnvVars::GetEnvVar("SHOW_DETAILS" , 0);
int useCpuMem = EnvVars::GetEnvVar("USE_CPU_MEM" , 0);
int memTypeIdx = EnvVars::GetEnvVar("MEM_TYPE" , 0);
int useRdmaRead = EnvVars::GetEnvVar("USE_RDMA_READ" , 0);
// Print off environment variables
MemType memType = Utils::GetMemType(memTypeIdx, useCpuMem);
std::string memTypeStr = Utils::GetMemTypeStr(memTypeIdx, useCpuMem);
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[NIC-Rings Related]\n");
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("SHOW_DETAILS" , showDetails , "%s full Test details", showDetails ? "Showing" : "Hiding");
ev.Print("USE_CPU_MEM" , useCpuMem , "Using closest %s memory", useCpuMem ? "CPU" : "GPU");
ev.Print("MEM_TYPE" , memTypeIdx , "Using %s memory (%s)", memTypeStr.c_str(), Utils::GetAllMemTypeStr(useCpuMem).c_str());
if (numRanks > 1)
ev.Print("USE_RDMA_READ", useRdmaRead , "Performing RDMA %s", useRdmaRead ? "reads" : "writes");
printf("\n");
}
}
// Prepare list of transfers
int numDevices = TransferBench::GetNumExecutors(useCpuMem ? EXE_CPU : EXE_GPU_GFX);
std::vector<Transfer> transfers;
int numRings = 0;
for (int memIndex = 0; memIndex < numDevices; memIndex++) {
std::vector<int> nicIndices;
if (useCpuMem) {
TransferBench::GetClosestNicsToCpu(nicIndices, memIndex);
} else {
TransferBench::GetClosestNicsToGpu(nicIndices, memIndex);
}
for (int nicIndex : nicIndices) {
numRings++;
for (int currRank = 0; currRank < numRanks; currRank++) {
int nextRank = (currRank + 1) % numRanks;
TransferBench::Transfer transfer;
transfer.srcs.push_back({memType, memIndex, currRank});
transfer.dsts.push_back({memType, memIndex, nextRank});
transfer.exeDevice = {EXE_NIC, nicIndex, useRdmaRead ? nextRank : currRank};
transfer.exeSubIndex = nicIndex;
transfer.numSubExecs = numQueuePairs;
transfer.numBytes = numBytesPerTransfer;
transfers.push_back(transfer);
}
}
}
Utils::Print("NIC Rings benchmark\n");
Utils::Print("==============================\n");
Utils::Print("%d parallel RDMA-%s ring(s) using %s memory across %d ranks\n",
numRings, useRdmaRead ? "read" : "write", memTypeStr.c_str(), numRanks);
Utils::Print("%d queue pairs per NIC. %lu bytes per Transfer. All numbers are GB/s\n",
numQueuePairs, numBytesPerTransfer);
Utils::Print("\n");
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
Utils::Print("%s\n", err.errMsg.c_str());
return 1;
} else if (showDetails) {
Utils::PrintResults(ev, 1, transfers, results);
Utils::Print("\n");
}
// Only ranks that actually do output will compile results
if (!Utils::RankDoesOutput()) return 0;
// Prepare table of results
int numRows = 6 + numRanks;
int numCols = 3 + numRings;
Utils::TableHelper table(numRows, numCols);
// Prepare headers
table.Set(2, 0, " Rank ");
table.Set(2, 1, " Name ");
table.Set(1, numCols-1, " TOTAL ");
table.Set(2, numCols-1, " (GB/s) ");
table.SetColAlignment(1, Utils::TableHelper::ALIGN_LEFT);
for (int rank = 0; rank < numRanks; rank++) {
table.Set(3 + rank, 0, " %d ", rank);
table.Set(3 + rank, 1, " %s ", TransferBench::GetHostname(rank).c_str());
}
table.Set(numRows-3, 1, " MAX (GB/s) ");
table.Set(numRows-2, 1, " AVG (GB/s) ");
table.Set(numRows-1, 1, " MIN (GB/s) ");
for (int row = numRows-3; row < numRows; row++)
table.SetCellAlignment(row, 1, Utils::TableHelper::ALIGN_RIGHT);
table.DrawRowBorder(3);
table.DrawRowBorder(numRows-3);
int colIdx = 2;
int transferIdx = 0;
std::vector<double> rankTotal(numRanks, 0.0);
for (int memIndex = 0; memIndex < numDevices; memIndex++) {
std::vector<int> nicIndices;
if (useCpuMem) {
TransferBench::GetClosestNicsToCpu(nicIndices, memIndex);
table.Set(0, colIdx, " CPU %02d ", memIndex);
} else {
TransferBench::GetClosestNicsToGpu(nicIndices, memIndex);
table.Set(0, colIdx, " GPU %02d ", memIndex);
}
bool isFirst = true;
for (int nicIndex : nicIndices) {
if (isFirst) {
isFirst = false;
table.DrawColBorder(colIdx);
}
table.Set(1, colIdx, " NIC %02d ", nicIndex);
table.Set(2, colIdx, " %s ", TransferBench::GetExecutorName({EXE_NIC, nicIndex}).c_str());
double ringMin = std::numeric_limits<double>::max();
double ringAvg = 0.0;
double ringMax = std::numeric_limits<double>::lowest();
for (int rank = 0; rank < numRanks; rank++) {
double avgBw = results.tfrResults[transferIdx].avgBandwidthGbPerSec;
table.Set(3 + rank, colIdx, " %.2f ", avgBw);
ringMin = std::min(ringMin, avgBw);
ringAvg += avgBw;
ringMax = std::max(ringMax, avgBw);
rankTotal[rank] += avgBw;
transferIdx++;
}
table.Set(numRows-3, colIdx, " %.2f ", ringMax);
table.Set(numRows-2, colIdx, " %.2f ", ringAvg / numRanks);
table.Set(numRows-1, colIdx, " %.2f ", ringMin);
colIdx++;
}
if (!isFirst) {
table.DrawColBorder(colIdx);
}
}
double rankMin = std::numeric_limits<double>::max();
double rankAvg = 0.0;
double rankMax = std::numeric_limits<double>::lowest();
for (int rank = 0; rank < numRanks; rank++) {
table.Set(3 + rank, numCols - 1, " %.2f ", rankTotal[rank]);
rankMin = std::min(rankMin, rankTotal[rank]);
rankAvg += rankTotal[rank];
rankMax = std::max(rankMax, rankTotal[rank]);
}
table.Set(numRows - 3, numCols - 1, " %.2f ", rankMax);
table.Set(numRows - 2, numCols - 1, " %.2f ", rankAvg / numRanks);
table.Set(numRows - 1, numCols - 1, " %.2f ", rankMin);
table.PrintTable(ev.outputToCsv, ev.showBorders);
Utils::Print("\n");
Utils::Print("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
Utils::PrintErrors(results.errResults);
if (Utils::HasDuplicateHostname()) {
printf("[WARN] It is recommended to run TransferBench with one rank per host to avoid potential aliasing of executors\n");
}
return 0;
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -20,14 +20,19 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
void OneToAllPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int OneToAllPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] One-to-All preset currently not supported for multi-node\n");
return 1;
}
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
if (numDetectedGpus < 2) {
printf("[ERROR] One-to-all benchmark requires machine with at least 2 GPUs\n");
exit(1);
return 1;
}
// Collect env vars for this preset
......@@ -61,7 +66,7 @@ void OneToAllPreset(EnvVars& ev,
for (auto ch : sweepExe) {
if (ch != 'G' && ch != 'D') {
printf("[ERROR] Unrecognized executor type '%c' specified\n", ch);
exit(1);
return 1;
}
}
......@@ -98,7 +103,7 @@ void OneToAllPreset(EnvVars& ev,
for (int i = 0; i < numGpuDevices; i++) {
if (bitmask & (1<<i)) {
Transfer t;
CheckForError(TransferBench::CharToExeType(exe, t.exeDevice.exeType));
Utils::CheckForError(TransferBench::CharToExeType(exe, t.exeDevice.exeType));
t.exeDevice.exeIndex = exeIndex;
t.exeSubIndex = -1;
t.numSubExecs = numSubExecs;
......@@ -108,7 +113,7 @@ void OneToAllPreset(EnvVars& ev,
t.srcs.clear();
} else {
t.srcs.resize(1);
CheckForError(TransferBench::CharToMemType(src, t.srcs[0].memType));
Utils::CheckForError(TransferBench::CharToMemType(src, t.srcs[0].memType));
t.srcs[0].memIndex = sweepDir == 0 ? exeIndex : i;
}
......@@ -116,15 +121,15 @@ void OneToAllPreset(EnvVars& ev,
t.dsts.clear();
} else {
t.dsts.resize(1);
CheckForError(TransferBench::CharToMemType(dst, t.dsts[0].memType));
Utils::CheckForError(TransferBench::CharToMemType(dst, t.dsts[0].memType));
t.dsts[0].memIndex = sweepDir == 0 ? i : exeIndex;
}
transfers.push_back(t);
}
}
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
PrintErrors(results.errResults);
exit(1);
Utils::PrintErrors(results.errResults);
return 1;
}
int counter = 0;
......@@ -138,12 +143,13 @@ void OneToAllPreset(EnvVars& ev,
printf(" %d %d", p, numSubExecs);
for (auto i = 0; i < transfers.size(); i++) {
printf(" (%s %c%d %s)",
MemDevicesToStr(transfers[i].srcs).c_str(),
Utils::MemDevicesToStr(transfers[i].srcs).c_str(),
ExeTypeStr[transfers[i].exeDevice.exeType], transfers[i].exeDevice.exeIndex,
MemDevicesToStr(transfers[i].dsts).c_str());
Utils::MemDevicesToStr(transfers[i].dsts).c_str());
}
printf("\n");
}
}
}
return 0;
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -20,39 +20,60 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
void PeerToPeerPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int PeerToPeerPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Peer-to-peer preset currently not supported for multi-node\n");
return 1;
}
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int useDmaCopy = EnvVars::GetEnvVar("USE_GPU_DMA", 0);
int cpuMemTypeIdx = EnvVars::GetEnvVar("CPU_MEM_TYPE", 0);
int gpuMemTypeIdx = EnvVars::GetEnvVar("GPU_MEM_TYPE", 0);
int numCpuDevices = EnvVars::GetEnvVar("NUM_CPU_DEVICES", numDetectedCpus);
int numCpuSubExecs = EnvVars::GetEnvVar("NUM_CPU_SE", 4);
int numGpuDevices = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numGpuSubExecs = EnvVars::GetEnvVar("NUM_GPU_SE", useDmaCopy ? 1 : TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0}));
int p2pMode = EnvVars::GetEnvVar("P2P_MODE", 0);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN", 0);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN", -999); // Deprecated
int useRemoteRead = EnvVars::GetEnvVar("USE_REMOTE_READ", 0);
MemType cpuMemType = Utils::GetCpuMemType(cpuMemTypeIdx);
MemType gpuMemType = Utils::GetGpuMemType(gpuMemTypeIdx);
// Display environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
int outputToCsv = ev.outputToCsv;
if (!outputToCsv) printf("[P2P Related]\n");
ev.Print("NUM_CPU_DEVICES", numCpuDevices, "Using %d CPUs", numCpuDevices);
ev.Print("NUM_CPU_SE", numCpuSubExecs, "Using %d CPU threads per Transfer", numCpuSubExecs);
ev.Print("NUM_GPU_DEVICES", numGpuDevices, "Using %d GPUs", numGpuDevices);
ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d GPU subexecutors/CUs per Transfer", numGpuSubExecs);
ev.Print("P2P_MODE", p2pMode, "Running %s transfers", p2pMode == 0 ? "Uni + Bi" :
p2pMode == 1 ? "Unidirectional"
: "Bidirectional");
ev.Print("USE_FINE_GRAIN", useFineGrain, "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
ev.Print("USE_GPU_DMA", useDmaCopy, "Using GPU-%s as GPU executor", useDmaCopy ? "DMA" : "GFX");
ev.Print("USE_REMOTE_READ", useRemoteRead, "Using %s as executor", useRemoteRead ? "DST" : "SRC");
printf("\n");
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
int outputToCsv = ev.outputToCsv;
if (!outputToCsv) printf("[P2P Related]\n");
ev.Print("CPU_MEM_TYPE" , cpuMemTypeIdx, "Using %s (%s)", Utils::GetCpuMemTypeStr(cpuMemTypeIdx).c_str(), Utils::GetAllCpuMemTypeStr().c_str());
ev.Print("GPU_MEM_TYPE" , gpuMemTypeIdx, "Using %s (%s)", Utils::GetGpuMemTypeStr(gpuMemTypeIdx).c_str(), Utils::GetAllGpuMemTypeStr().c_str());
ev.Print("NUM_CPU_DEVICES", numCpuDevices, "Using %d CPUs", numCpuDevices);
ev.Print("NUM_CPU_SE", numCpuSubExecs, "Using %d CPU threads per Transfer", numCpuSubExecs);
ev.Print("NUM_GPU_DEVICES", numGpuDevices, "Using %d GPUs", numGpuDevices);
ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d GPU subexecutors/CUs per Transfer", numGpuSubExecs);
ev.Print("P2P_MODE", p2pMode, "Running %s transfers", p2pMode == 0 ? "Uni + Bi" :
p2pMode == 1 ? "Unidirectional"
: "Bidirectional");
ev.Print("USE_GPU_DMA", useDmaCopy, "Using GPU-%s as GPU executor", useDmaCopy ? "DMA" : "GFX");
ev.Print("USE_REMOTE_READ", useRemoteRead, "Using %s as executor", useRemoteRead ? "DST" : "SRC");
printf("\n");
}
}
// Check for deprecated env vars
if (useFineGrain != -999) {
Utils::Print("[ERROR] USE_FINE_GRAIN has been deprecated and replaced by CPU_MEM_TYPE and GPU_MEM_TYPE\n");
return 1;
}
char const separator = ev.outputToCsv ? ',' : ' ';
......@@ -66,8 +87,8 @@ void PeerToPeerPreset(EnvVars& ev,
// Perform unidirectional / bidirectional
for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++) {
if (p2pMode == 1 && isBidirectional == 1 ||
p2pMode == 2 && isBidirectional == 0) continue;
if ((p2pMode == 1 && isBidirectional == 1) ||
(p2pMode == 2 && isBidirectional == 0)) continue;
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write] (GPU-Executor: %s)\n", isBidirectional ? "Bi" : "Uni",
useRemoteRead ? "Remote" : "Local",
@@ -102,11 +123,10 @@ void PeerToPeerPreset(EnvVars& ev,
// Loop over all possible src/dst pairs
for (int src = 0; src < numDevices; src++) {
MemType const srcType = (src < numCpuDevices ? MEM_CPU : MEM_GPU);
int const srcIndex = (srcType == MEM_CPU ? src : src - numCpuDevices);
MemType const srcTypeActual = ((useFineGrain && srcType == MEM_CPU) ? MEM_CPU_FINE :
(useFineGrain && srcType == MEM_GPU) ? MEM_GPU_FINE :
srcType);
int const srcIdx = (src < numCpuDevices ? 0 : 1);
MemType const srcType = (src < numCpuDevices ? cpuMemType : gpuMemType);
int const srcIndex = (src < numCpuDevices ? src : src - numCpuDevices);
std::vector<std::vector<double>> avgBandwidth(isBidirectional + 1);
std::vector<std::vector<double>> minBandwidth(isBidirectional + 1);
std::vector<std::vector<double>> maxBandwidth(isBidirectional + 1);
@@ -114,18 +134,17 @@ void PeerToPeerPreset(EnvVars& ev,
if (src == numCpuDevices && src != 0) printf("\n");
for (int dst = 0; dst < numDevices; dst++) {
MemType const dstType = (dst < numCpuDevices ? MEM_CPU : MEM_GPU);
int const dstIndex = (dstType == MEM_CPU ? dst : dst - numCpuDevices);
MemType const dstTypeActual = ((useFineGrain && dstType == MEM_CPU) ? MEM_CPU_FINE :
(useFineGrain && dstType == MEM_GPU) ? MEM_GPU_FINE :
dstType);
int const dstIdx = (dst < numCpuDevices ? 0 : 1);
MemType const dstType = (dst < numCpuDevices ? cpuMemType : gpuMemType);
int const dstIndex = (dst < numCpuDevices ? dst : dst - numCpuDevices);
// Prepare Transfers
std::vector<Transfer> transfers(isBidirectional + 1);
// SRC -> DST
transfers[0].numBytes = numBytesPerTransfer;
transfers[0].srcs.push_back({srcTypeActual, srcIndex});
transfers[0].dsts.push_back({dstTypeActual, dstIndex});
transfers[0].srcs.push_back({srcType, srcIndex});
transfers[0].dsts.push_back({dstType, dstIndex});
transfers[0].exeDevice = {IsGpuMemType(useRemoteRead ? dstType : srcType) ? gpuExeType : EXE_CPU,
(useRemoteRead ? dstIndex : srcIndex)};
transfers[0].exeSubIndex = -1;
@@ -134,8 +153,8 @@ void PeerToPeerPreset(EnvVars& ev,
// DST -> SRC
if (isBidirectional) {
transfers[1].numBytes = numBytesPerTransfer;
transfers[1].srcs.push_back({dstTypeActual, dstIndex});
transfers[1].dsts.push_back({srcTypeActual, srcIndex});
transfers[1].srcs.push_back({dstType, dstIndex});
transfers[1].dsts.push_back({srcType, srcIndex});
transfers[1].exeDevice = {IsGpuMemType(useRemoteRead ? srcType : dstType) ? gpuExeType : EXE_CPU,
(useRemoteRead ? srcIndex : dstIndex)};
transfers[1].exeSubIndex = -1;
@@ -167,7 +186,7 @@ void PeerToPeerPreset(EnvVars& ev,
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str());
exit(1);
return 1;
}
for (int dir = 0; dir <= isBidirectional; dir++) {
@@ -175,8 +194,8 @@ void PeerToPeerPreset(EnvVars& ev,
avgBandwidth[dir].push_back(avgBw);
if (!(srcType == dstType && srcIndex == dstIndex)) {
avgBwSum[srcType][dstType] += avgBw;
avgCount[srcType][dstType]++;
avgBwSum[srcIdx][dstIdx] += avgBw;
avgCount[srcIdx][dstIdx]++;
}
if (ev.showIterations) {
@@ -209,7 +228,7 @@ void PeerToPeerPreset(EnvVars& ev,
}
for (int dir = 0; dir <= isBidirectional; dir++) {
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, dir ? "<- " : " ->");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, dir ? "<- " : " ->");
if (ev.outputToCsv) printf(",");
for (int dst = 0; dst < numDevices; dst++) {
@@ -226,7 +245,7 @@ void PeerToPeerPreset(EnvVars& ev,
if (ev.showIterations) {
// minBw
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, "min");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "min");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const minBw = minBandwidth[dir][i];
@@ -240,7 +259,7 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n");
// maxBw
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, "max");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "max");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const maxBw = maxBandwidth[dir][i];
@@ -254,7 +273,7 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n");
// stddev
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, " sd");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, " sd");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const sd = stdDev[dir][i];
@@ -271,7 +290,7 @@ void PeerToPeerPreset(EnvVars& ev,
}
if (isBidirectional) {
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, "<->");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "<->");
if (ev.outputToCsv) printf(",");
for (int dst = 0; dst < numDevices; dst++) {
double const sumBw = avgBandwidth[0][dst] + avgBandwidth[1][dst];
@@ -289,14 +308,14 @@ void PeerToPeerPreset(EnvVars& ev,
if (!ev.outputToCsv) {
printf(" ");
for (int srcType : {MEM_CPU, MEM_GPU})
for (int dstType : {MEM_CPU, MEM_GPU})
printf(" %cPU->%cPU", srcType == MEM_CPU ? 'C' : 'G', dstType == MEM_CPU ? 'C' : 'G');
for (int srcType = 0; srcType <= 1; srcType++)
for (int dstType = 0; dstType <= 1; dstType++)
printf(" %cPU->%cPU", srcType == 0 ? 'C' : 'G', dstType == 0 ? 'C' : 'G');
printf("\n");
printf("Averages (During %s):", isBidirectional ? " BiDir" : "UniDir");
for (int srcType : {MEM_CPU, MEM_GPU})
for (int dstType : {MEM_CPU, MEM_GPU}) {
for (int srcType = 0; srcType <= 1; srcType++)
for (int dstType = 0; dstType <= 1; dstType++) {
if (avgCount[srcType][dstType])
printf("%10.2f", avgBwSum[srcType][dstType] / avgCount[srcType][dstType]);
else
@@ -305,4 +324,5 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n\n");
}
}
return 0;
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -21,22 +21,26 @@ THE SOFTWARE.
*/
#pragma once
#include <map>
// EnvVars is available to all presets
#include "EnvVars.hpp"
#include "Utilities.hpp"
// Included after EnvVars and Executors
#include "AllToAll.hpp"
#include "AllToAllN.hpp"
#include "AllToAllSweep.hpp"
#include "HealthCheck.hpp"
#include "NicRings.hpp"
#include "OneToAll.hpp"
#include "PeerToPeer.hpp"
#include "Scaling.hpp"
#include "Schmoo.hpp"
#include "Sweep.hpp"
#include <map>
typedef void (*PresetFunc)(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName);
typedef int (*PresetFunc)(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName);
std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{
@@ -44,8 +48,9 @@ std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{"a2a_n", {AllToAllRdmaPreset, "Tests parallel transfers between all pairs of GPU devices using Nearest NIC RDMA transfers"}},
{"a2asweep", {AllToAllSweepPreset, "Test GFX-based all-to-all transfers swept across different CU and GFX unroll counts"}},
{"healthcheck", {HealthCheckPreset, "Simple bandwidth health check (MI300X series only)"}},
{"nicrings", {NicRingsPreset, "Tests NIC rings created across identical NIC indices across ranks"}},
{"one2all", {OneToAllPreset, "Test all subsets of parallel transfers from one GPU to all others"}},
{"p2p" , {PeerToPeerPreset, " Peer-to-peer device memory bandwidth test"}},
{"p2p" , {PeerToPeerPreset, "Peer-to-peer device memory bandwidth test"}},
{"rsweep", {SweepPreset, "Randomly sweep through sets of Transfers"}},
{"scaling", {ScalingPreset, "Run scaling test from one GPU to other devices"}},
{"schmoo", {SchmooPreset, "Scaling tests for local/remote read/write/copy"}},
@@ -63,11 +68,12 @@ void DisplayPresets()
int RunPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
int const argc,
char** const argv)
char** const argv,
int& retCode)
{
std::string preset = (argc > 1 ? argv[1] : "");
if (presetFuncMap.count(preset)) {
(presetFuncMap[preset].first)(ev, numBytesPerTransfer, preset);
retCode = (presetFuncMap[preset].first)(ev, numBytesPerTransfer, preset);
return 1;
}
return 0;
......
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -20,10 +20,15 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
void ScalingPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int ScalingPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Scaling preset currently not supported for multi-node\n");
return 1;
}
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
@@ -49,7 +54,7 @@ void ScalingPreset(EnvVars& ev,
// Validate env vars
if (localIdx >= numDetectedGpus) {
printf("[ERROR] Cannot execute scaling test with local GPU device %d\n", localIdx);
exit(1);
return 1;
}
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
@@ -69,12 +74,14 @@ void ScalingPreset(EnvVars& ev,
std::vector<std::pair<double, int>> bestResult(numDevices);
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::vector<Transfer> transfers(1);
Transfer& t = transfers[0];
t.exeDevice = {EXE_GPU_GFX, localIdx};
t.exeSubIndex = -1;
t.numBytes = numBytesPerTransfer;
t.srcs = {{MEM_GPU, localIdx}};
t.srcs = {{memType, localIdx}};
for (int numSubExec = sweepMin; numSubExec <= sweepMax; numSubExec++) {
t.numSubExecs = numSubExec;
@@ -84,8 +91,8 @@ void ScalingPreset(EnvVars& ev,
t.dsts = {{i < numCpuDevices ? MEM_CPU : MEM_GPU,
i < numCpuDevices ? i : i - numCpuDevices}};
if (!RunTransfers(cfg, transfers, results)) {
PrintErrors(results.errResults);
exit(1);
Utils::PrintErrors(results.errResults);
return 1;
}
double bw = results.tfrResults[0].avgBandwidthGbPerSec;
printf("%c%7.2f ", separator, bw);
@@ -102,4 +109,5 @@ void ScalingPreset(EnvVars& ev,
for (int i = 0; i < numDevices; i++)
printf("%c%7.2f(%3d)", separator, bestResult[i].first, bestResult[i].second);
printf("\n");
return 0;
}