Unverified Commit bbd72a6c authored by gilbertlee-amd, committed by GitHub

TransferBench v1.66 - Multi-Rank support (#224)

* Adding System singleton to support multi-node (communication and topology)
* Adding multi-node parsing, rank and device wildcard expansion
* Adding multi-node topology, and various support functions
* Adding multi-node consistency validation of Config and Transfers
* Introducing SINGLE_KERNEL=1 to Makefile to speed up compilation during development
* Updating CHANGELOG.  Overhauling wildcard parsing.  Adding dryrun
* Client refactoring.  Introduction of tabular formatted results and a2a multi-rank preset
* Adding MPI support into CMakeFiles
* Cleaning up multi-node topology using TableHelper
* Reducing compile time by removing some kernel variants
* Updating documentation.  Adding nicrings preset
* Adding NIC_FILTER to allow NIC device filtering via regex
* Updating supported memory types
* Fixing P2P preset, and adding some extra memIndex utility functions
parent 26717d50
...@@ -3,6 +3,79 @@ ...@@ -3,6 +3,79 @@
Documentation for TransferBench is available at Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench). [https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).
## v1.66.00
### Added
- Adding multi-node support
- TransferBench now supports multiple nodes through the use of MPI or sockets
- In order to utilize MPI, TransferBench must be compiled with MPI support (setting MPI_PATH to where
an MPI implementation is located). MPI support can be explicitly disabled by setting DISABLE_MPI_COMM=1
- TransferBench can be executed with an MPI launcher, such as mpirun
- In order to utilize sockets, several environment variables must be provided to each process:
* TB_RANK: Rank of this process (0-based)
* TB_NUM_RANKS: Total number of processes
* TB_MASTER_ADDR: IP address of rank 0 (Other ranks will connect to rank 0)
* TB_MASTER_PORT: Port for communication (default: 29500)
- Additional debug messages can be enabled by setting TB_VERBOSE=1
- NOTE: It is recommended that one process be launched per node to avoid aliasing of devices
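- Example launches (illustrative; the master address is a placeholder):
  * MPI:     mpirun -np 2 ./TransferBench
  * Sockets: TB_RANK=0 TB_NUM_RANKS=2 TB_MASTER_ADDR=10.0.0.1 TB_MASTER_PORT=29500 ./TransferBench (run on the first node)
             TB_RANK=1 TB_NUM_RANKS=2 TB_MASTER_ADDR=10.0.0.1 TB_MASTER_PORT=29500 ./TransferBench (run on the second node)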
- Adding multi-node topology detection
- When running in multi-node mode, TransferBench will try to collect topology information about each
rank, then group ranks into homogeneous configurations.
- This is done by running TransferBench with no arguments (e.g. mpirun -np 2 ./TransferBench)
- Adding multi-node Transfer parsing and wildcard support
- Memory locations have now been extended to support a rank index
* R(memRank)?(memIndex) (where ? is one of the supported memory type characters "CGBFUNMP")
(e.g. R2G3 is a GPU memory location on GPU 3 of rank 2)
- Rank is optional; if not specified, it falls back to the "local" rank
- Executor locations have been extended to support rank indices as well
* R(exeRank)?(exeIndex){exeSlot}.{exeSubIndex}{exeSubSlot} (where ? is one of the supported executor-types characters "CGDIN")
- exeSlots are only relevant for the EXE_NIC_NEAREST executor, and allow distinguishing between NICs when multiple NICs are closest to a GPU
- exeSlots are defined by upper-case letters: 'A' for the closest NIC, 'B' for the 2nd closest NIC, etc.
- For example: N0B.4C would execute using the 2nd closest NIC to GPU 0, communicating with the 3rd closest NIC to GPU 4
- Wildcard support:
- To help quickly define sets of transfers, Transfers can now be specified using wildcards
- All the fields above may be specified in one of three ways:
* Directly with a single value: e.g. R34 -> Rank 34
* Full wildcard: e.g. R* -> replaced by all available ranks
* Ranged wildcard: e.g. R[1,5..7] -> replaced by Rank 1, Rank 5, Rank 6, Rank 7
- Nearest-NIC wildcard
- To simplify nearest NIC execution, it is not necessary to specify exeIndex/exeSubIndex for the "N" executor
- If none of exeRank/exeIndex/exeSlot/exeSubIndex/exeSubSlot are specified, the Transfer will be expanded to
choose values such that a remote write operation occurs based on the SRC/DST memory locations
- For example: (R2G4->N->R4G5) will expand to (R2G4->R2N4.5->R4G5)
- Adding dry-run preset
- This new preset is similar to cmdline, however it only shows the list of Transfers that would be executed
- The dryrun preset is useful when using the new wildcard expressions to ensure that the Test
contains the correct set of Transfers (see the example below)
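- Example (illustrative, using the same argument order as the cmdline preset):
  * ./TransferBench dryrun 64M "1 4 (G0->G0->G1)" lists the single Transfer that would be executed
  * Wildcard fields such as R* are expanded in the listing before anything is run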
- Adding nicrings preset
- This new preset runs parallel transfers forming rings that connect identical NICs across ranks
- Adding NIC_FILTER to allow for filtering which NICs to detect. NIC_FILTER accepts regular-expression syntax
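- Example (illustrative; the NIC names are placeholders): NIC_FILTER="mlx5_[0-3]" limits detection to NIC devices whose names match that pattern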
- Added new memory types based on latest HIP memory allocation flags
Supported memory locations are:
- C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- P: Pinned host memory (on NUMA node, indexed by closest GPU, from 0 to [# GPUs - 1])
- B: Coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- K: Uncached pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- H: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
- F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
- U: Uncached device memory (on GPU device indexed from 0 to [# GPUs - 1])
- N: Null memory (index ignored)
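- For example, K1 refers to uncached pinned host memory on NUMA node 1, while U0 refers to uncached device memory on GPU 0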
- As a result, the a2a preset has deprecated USE_FINE_GRAIN in favor of MEM_TYPE, which allows selecting between the various GPU memory types
- A warning message is issued if USE_FINE_GRAIN is used, however the previous behavior is still honored for now
- The p2p preset has also deprecated USE_FINE_GRAIN in favor of CPU_MEM_TYPE and GPU_MEM_TYPE
### Modified
- Refactored front-end client code to facilitate simpler and more consistent presets.
- Refactored tabular data display to simplify code. Output result tables now use ASCII box-drawing
characters for borders, which helps group data visually. Borders may be disabled by setting SHOW_BORDERS=0
- The All-to-all preset is now multi-rank compatible. When executed on multiple ranks, it runs
inter-rank all-to-all and then reports the min/max across all ranks. The number of extrema
results shown can be adjusted by NUM_RESULTS
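- Example (illustrative invocation): NUM_RESULTS=5 mpirun -np 4 ./TransferBench a2a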
### Fixed
- Added a ROCm version guard around the use of __syncwarp()
- Exiting with non-zero code on fatal errors
## v1.65.00 ## v1.65.00
### Added ### Added
- Added warp-level dispatch support via GFX_SE_TYPE environment variable - Added warp-level dispatch support via GFX_SE_TYPE environment variable
......
# Copyright (c) 2023-2025 Advanced Micro Devices, Inc. All rights reserved. # Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
cmake_minimum_required(VERSION 3.5 FATAL_ERROR) cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
...@@ -9,7 +9,7 @@ if (NOT CMAKE_TOOLCHAIN_FILE) ...@@ -9,7 +9,7 @@ if (NOT CMAKE_TOOLCHAIN_FILE)
message(STATUS "CMAKE_TOOLCHAIN_FILE: ${CMAKE_TOOLCHAIN_FILE}") message(STATUS "CMAKE_TOOLCHAIN_FILE: ${CMAKE_TOOLCHAIN_FILE}")
endif() endif()
set(VERSION_STRING "1.65.00") set(VERSION_STRING "1.66.00")
project(TransferBench VERSION ${VERSION_STRING} LANGUAGES CXX) project(TransferBench VERSION ${VERSION_STRING} LANGUAGES CXX)
## Load CMake modules ## Load CMake modules
...@@ -24,6 +24,7 @@ list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake") ...@@ -24,6 +24,7 @@ list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
#================================================================================================== #==================================================================================================
option(BUILD_LOCAL_GPU_TARGET_ONLY "Build only for GPUs detected on this machine" OFF) option(BUILD_LOCAL_GPU_TARGET_ONLY "Build only for GPUs detected on this machine" OFF)
option(ENABLE_NIC_EXEC "Enable RDMA NIC Executor in TransferBench" OFF) option(ENABLE_NIC_EXEC "Enable RDMA NIC Executor in TransferBench" OFF)
option(ENABLE_MPI_COMM "Enable MPI Communicator support" OFF)
# Default GPU architectures to build # Default GPU architectures to build
#================================================================================================== #==================================================================================================
...@@ -129,7 +130,7 @@ endif() ...@@ -129,7 +130,7 @@ endif()
if(DEFINED ENV{DISABLE_NIC_EXEC} AND "$ENV{DISABLE_NIC_EXEC}" STREQUAL "1") if(DEFINED ENV{DISABLE_NIC_EXEC} AND "$ENV{DISABLE_NIC_EXEC}" STREQUAL "1")
message(STATUS "Disabling NIC Executor support as env. flag DISABLE_NIC_EXEC was enabled") message(STATUS "Disabling NIC Executor support as env. flag DISABLE_NIC_EXEC was enabled")
elseif(NOT ENABLE_NIC_EXEC) elseif(NOT ENABLE_NIC_EXEC)
message(STATUS "For CMake builds, NIC executor so requires explicit opt-in by setting CMake flag -DENABLE_NIC_EXEC=1") message(STATUS "For CMake builds, NIC executor so requires explicit opt-in by setting CMake flag -DENABLE_NIC_EXEC=ON")
message(STATUS "Disabling NIC Executor support") message(STATUS "Disabling NIC Executor support")
else() else()
find_library(IBVERBS_LIBRARY ibverbs) find_library(IBVERBS_LIBRARY ibverbs)
...@@ -149,6 +150,40 @@ else() ...@@ -149,6 +150,40 @@ else()
endif() endif()
endif() endif()
## Check for MPI support
set(MPI_PATH "" CACHE PATH "Path to MPI installation (takes priority over system MPI)")
if(NOT ENABLE_MPI_COMM)
message(STATUS "For CMake builds, MPI Communicator requires explicit opt-in by setting CMake flag -DENABLE_MPI_COMM=ON")
message(STATUS "Disabling MPI Communicator support")
else()
# First check user-specified MPI_PATH (similar to Makefile)
if(MPI_PATH AND EXISTS "${MPI_PATH}/include/mpi.h")
find_library(MPI_LIBRARY NAMES mpi PATHS ${MPI_PATH}/lib NO_DEFAULT_PATH)
if(MPI_LIBRARY)
set(MPI_COMM_FOUND 1)
set(MPI_INCLUDE_DIR "${MPI_PATH}/include")
set(MPI_LINK_DIR "${MPI_PATH}/lib")
message(STATUS "Building with MPI Communicator support (found at MPI_PATH: ${MPI_PATH})")
else()
message(WARNING "Found mpi.h at ${MPI_PATH}/include but could not find MPI library at ${MPI_PATH}/lib")
endif()
else()
# Fall back to find_package
if(MPI_PATH)
message(STATUS "Unable to find mpi.h at ${MPI_PATH}/include, trying find_package")
endif()
find_package(MPI QUIET)
if(MPI_CXX_FOUND)
set(MPI_COMM_FOUND 1)
message(STATUS "Building with MPI Communicator support (found via find_package)")
message(STATUS "- Using MPI include path: ${MPI_CXX_INCLUDE_PATH}")
message(STATUS "- Using MPI library:: ${MPI_CXX_LIBRARIES}")
else()
message(WARNING "MPI not found. Please specify appropriate MPI_PATH or install MPI libraries (e.g., OpenMPI or MPICH)")
endif()
endif()
endif()
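# Example (illustrative): configure with MPI Communicator support via
#   cmake -DENABLE_MPI_COMM=ON -DMPI_PATH=/path/to/mpi ..
# where /path/to/mpi is a placeholder for the MPI installation prefix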
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY .) set(CMAKE_RUNTIME_OUTPUT_DIRECTORY .)
add_executable(TransferBench src/client/Client.cpp) add_executable(TransferBench src/client/Client.cpp)
...@@ -163,6 +198,22 @@ if(IBVERBS_FOUND) ...@@ -163,6 +198,22 @@ if(IBVERBS_FOUND)
target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY}) target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY})
target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED) target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED)
endif() endif()
if(MPI_COMM_FOUND)
if(TARGET MPI::MPI_CXX)
# Found via find_package
target_include_directories(TransferBench PRIVATE ${MPI_CXX_INCLUDE_DIRS})
target_link_libraries(TransferBench PRIVATE MPI::MPI_CXX)
else()
# Found via MPI_PATH fallback
target_include_directories(TransferBench PRIVATE ${MPI_INCLUDE_DIR})
target_link_libraries(TransferBench PRIVATE ${MPI_LIBRARY})
endif()
target_compile_definitions(TransferBench PRIVATE MPI_COMM_ENABLED)
endif()
if (HAVE_PARALLEL_JOBS)
target_compile_options(TransferBench PRIVATE -parallel-jobs=12)
endif()
target_link_libraries(TransferBench PRIVATE -fgpu-rdc) # Required when linking relocatable device code target_link_libraries(TransferBench PRIVATE -fgpu-rdc) # Required when linking relocatable device code
target_link_libraries(TransferBench PRIVATE Threads::Threads) target_link_libraries(TransferBench PRIVATE Threads::Threads)
......
# #
# Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved. # Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
# #
# Configuration options # Configuration options
ROCM_PATH ?= /opt/rocm ROCM_PATH ?= /opt/rocm
CUDA_PATH ?= /usr/local/cuda CUDA_PATH ?= /usr/local/cuda
MPI_PATH ?= /usr/local/openmpi
HIPCC ?= $(ROCM_PATH)/bin/amdclang++ HIPCC ?= $(ROCM_PATH)/bin/amdclang++
NVCC ?= $(CUDA_PATH)/bin/nvcc NVCC ?= $(CUDA_PATH)/bin/nvcc
# Option to compile with single GFX kernel to drop compilation time
SINGLE_KERNEL ?= 0
# This can be a space separated string of multiple GPU targets # This can be a space separated string of multiple GPU targets
# Default is the native GPU target # Default is the native GPU target
GPU_TARGETS ?= native GPU_TARGETS ?= native
...@@ -35,9 +39,13 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),) ...@@ -35,9 +39,13 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),)
CXXFLAGS = -I$(ROCM_PATH)/include -I$(ROCM_PATH)/include/hip -I$(ROCM_PATH)/include/hsa CXXFLAGS = -I$(ROCM_PATH)/include -I$(ROCM_PATH)/include/hip -I$(ROCM_PATH)/include/hsa
HIPLDFLAGS= -lnuma -L$(ROCM_PATH)/lib -lhsa-runtime64 -lamdhip64 HIPLDFLAGS= -lnuma -L$(ROCM_PATH)/lib -lhsa-runtime64 -lamdhip64
HIPFLAGS = -x hip -D__HIP_PLATFORM_AMD__ -D__HIPCC__ $(GPU_TARGETS_FLAGS) HIPFLAGS = -Wall -x hip -D__HIP_PLATFORM_AMD__ -D__HIPCC__ $(GPU_TARGETS_FLAGS)
NVFLAGS = -x cu -lnuma -arch=native NVFLAGS = -x cu -lnuma -arch=native
ifeq ($(SINGLE_KERNEL), 1)
CXXFLAGS += -DSINGLE_KERNEL
endif
ifeq ($(DEBUG), 0) ifeq ($(DEBUG), 0)
COMMON_FLAGS += -O3 COMMON_FLAGS += -O3
else else
...@@ -70,8 +78,34 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),) ...@@ -70,8 +78,34 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),)
$(info Building with NIC executor support. Can set DISABLE_NIC_EXEC=1 to disable) $(info Building with NIC executor support. Can set DISABLE_NIC_EXEC=1 to disable)
endif endif
endif endif
MPI_ENABLED = 0
# Compile with MPI communicator support if
# 1) DISABLE_MPI_COMM is not set to 1
# 2) mpi.h is found in the MPI_PATH
DISABLE_MPI_COMM ?= 0
ifneq ($(DISABLE_MPI_COMM), 1)
ifeq ($(wildcard $(MPI_PATH)/include/mpi.h),)
$(info Unable to find mpi.h at $(MPI_PATH)/include. Please specify appropriate MPI_PATH)
else
MPI_ENABLED = 1
CXXFLAGS += -DMPI_COMM_ENABLED -I$(MPI_PATH)/include
LDFLAGS += -L$(MPI_PATH)/lib -lmpi
ifeq ($(DEBUG), 1)
LDFLAGS += -lmpi_cxx
endif
endif
ifeq ($(MPI_ENABLED), 0)
$(info Building without MPI communicator support)
$(info To use TransferBench with MPI support, install MPI libraries and specify appropriate MPI_PATH)
else
$(info Building with MPI communicator support. Can set DISABLE_MPI_COMM=1 to disable)
endif
endif
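# Example (illustrative): build with MPI support and a single GFX kernel variant via
#   make MPI_PATH=/path/to/mpi SINGLE_KERNEL=1
# or disable MPI explicitly via DISABLE_MPI_COMM=1 make
# (/path/to/mpi is a placeholder for the MPI installation prefix)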
endif endif
.PHONY : all clean .PHONY : all clean
all: $(EXE) all: $(EXE)
......
# TransferBench # TransferBench
TransferBench is a utility for benchmarking simultaneous copies between user-specified TransferBench is a utility for benchmarking simultaneous copies between user-specified
CPU and GPU devices. CPU and GPU memory locations using CPUs/GPU kernels/DMA engines/NIC devices.
> [!NOTE] > [!NOTE]
> The published documentation is available at [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html) in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `TransferBench/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see [Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html). > The published documentation is available at [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html) in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `TransferBench/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see [Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
...@@ -18,7 +18,7 @@ left_nav_title = f"TransferBench {version_number} Documentation" ...@@ -18,7 +18,7 @@ left_nav_title = f"TransferBench {version_number} Documentation"
# for PDF output on Read the Docs # for PDF output on Read the Docs
project = "TransferBench Documentation" project = "TransferBench Documentation"
author = "Advanced Micro Devices, Inc." author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved."
version = version_number version = version_number
release = version_number release = version_number
......
...@@ -177,9 +177,16 @@ Here is the list of preset configurations that can be used instead of configurat ...@@ -177,9 +177,16 @@ Here is the list of preset configurations that can be used instead of configurat
* - ``cmdline`` * - ``cmdline``
- Allows transfers to run from the command line instead of a configuration file - Allows transfers to run from the command line instead of a configuration file
* - ``dryrun``
- Lists the set of transfers to be executed as provided from the command line. This is useful when using wildcards to ensure correctness
* - ``healthcheck`` * - ``healthcheck``
- Simple health check (supported on AMD Instinct MI300 series only) - Simple health check (supported on AMD Instinct MI300 series only)
* - ``nic_rings``
- Measure performance of NICs set up in a ring across ranks
* - ``p2p`` * - ``p2p``
- Peer-to-peer benchmark test - Peer-to-peer benchmark test
......
...@@ -50,6 +50,12 @@ Here are the steps to build TransferBench: ...@@ -50,6 +50,12 @@ Here are the steps to build TransferBench:
If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately. If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
NIC executor support will be enabled if IBVerbs is detected and if ``infiniband/verbs.h`` is found in the default include path.
NIC executor support can be disabled explicitly by setting ``DISABLE_NIC_EXEC=1``.
MPI communicator support will be enabled if ``mpi.h`` is found in ``MPI_PATH/include/``.
MPI communicator support can be disabled explicitly by setting ``DISABLE_MPI_COMM=1``.
Building documentation Building documentation
----------------------- -----------------------
......
...@@ -47,13 +47,17 @@ ...@@ -47,13 +47,17 @@
# Memory locations are specified by one or more (device character / device index) pairs # Memory locations are specified by one or more (device character / device index) pairs
# Character indicating memory type followed by device index (0-indexed) # Character indicating memory type followed by device index (0-indexed)
# Supported memory locations are: # Supported memory locations are:
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1]) # - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1]) # - P: Pinned host memory (on NUMA node, indexed by closest GPU [#GPUs -1])
# - B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1]) # - B: Coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1]) # - D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1]) # - K: Uncached pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - N: Null memory (index ignored) # - H: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - P: Pinned host memory (on NUMA node, but indexed by closest GPU [#GPUs -1]) # - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - U: Uncached device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# Examples: # Examples:
# 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1 # 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
......
/* /*
Copyright (c) 2019-2024 Advanced Micro Devices, Inc. All rights reserved. Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal of this software and associated documentation files (the "Software"), to deal
...@@ -20,23 +20,33 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN ...@@ -20,23 +20,33 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE. THE SOFTWARE.
*/ */
#include "Client.hpp"
#include "Presets.hpp" #include "Presets.hpp"
#include "Topology.hpp" #include "Topology.hpp"
#include <fstream> #include <fstream>
int main(int argc, char **argv) { void DisplayVersion();
void DisplayUsage(char const* cmdName);
using namespace TransferBench;
using namespace TransferBench::Utils;
size_t constexpr DEFAULT_BYTES_PER_TRANSFER = (1<<28);
int main(int argc, char **argv)
{
// Collect environment variables // Collect environment variables
EnvVars ev; EnvVars ev;
// Display usage instructions and detected topology // Display usage instructions and detected topology
if (argc <= 1) { if (argc <= 1) {
if (!ev.outputToCsv) { if (RankDoesOutput()) {
DisplayUsage(argv[0]); if (!ev.outputToCsv) {
DisplayPresets(); DisplayVersion();
DisplayUsage(argv[0]);
DisplayPresets();
}
DisplayTopology(ev.outputToCsv, ev.showBorders);
} }
DisplayTopology(ev.outputToCsv);
exit(0); exit(0);
} }
...@@ -52,42 +62,73 @@ int main(int argc, char **argv) { ...@@ -52,42 +62,73 @@ int main(int argc, char **argv) {
} }
} }
if (numBytesPerTransfer % 4) { if (numBytesPerTransfer % 4) {
printf("[ERROR] numBytesPerTransfer (%lu) must be a multiple of 4\n", numBytesPerTransfer); Print("[ERROR] numBytesPerTransfer (%lu) must be a multiple of 4\n", numBytesPerTransfer);
exit(1); exit(1);
} }
// Display TransferBench version and build configuration
DisplayVersion();
// Run preset benchmark if requested // Run preset benchmark if requested
if (RunPreset(ev, numBytesPerTransfer, argc, argv)) exit(0); int retCode = 0;
if (RunPreset(ev, numBytesPerTransfer, argc, argv, retCode)) return retCode;
// Read input from command line or configuration file // Read input from command line or configuration file
bool isDryRun = !strcmp(argv[1], "dryrun");
std::vector<std::string> lines; std::vector<std::string> lines;
{ {
std::string line; std::string line;
if (!strcmp(argv[1], "cmdline")) { if (!strcmp(argv[1], "cmdline") || isDryRun) {
for (int i = 3; i < argc; i++) for (int i = 3; i < argc; i++)
line += std::string(argv[i]) + " "; line += std::string(argv[i]) + " ";
lines.push_back(line); lines.push_back(line);
} else { } else {
std::ifstream cfgFile(argv[1]); std::ifstream cfgFile(argv[1]);
if (!cfgFile.is_open()) { if (!cfgFile.is_open()) {
printf("[ERROR] Unable to open transfer configuration file: [%s]\n", argv[1]); Print("[ERROR] Unable to open transfer configuration file: [%s]\n", argv[1]);
exit(1); exit(1);
} }
while (std::getline(cfgFile, line)) while (std::getline(cfgFile, line))
lines.push_back(line); lines.push_back(line);
cfgFile.close(); cfgFile.close();
} }
} }
// Print environment variables and CSV header ConfigOptions cfgOptions = ev.ToConfigOptions();
ev.DisplayEnvVars(); TestResults results;
if (ev.outputToCsv)
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
TransferBench::ConfigOptions cfgOptions = ev.ToConfigOptions();
TransferBench::TestResults results;
std::vector<ErrResult> errors; std::vector<ErrResult> errors;
// Dry run prints off transfers (and errors)
if (isDryRun) {
Print("Transfers to be executed (dry-run):\n");
Print("================================================================================\n");
std::vector<Transfer> transfers;
CheckForError(ParseTransfers(lines[0], transfers));
if (transfers.empty()) {
Print("<none>\n");
} else {
bool isMultiNode = GetNumRanks() > 1;
for (size_t i = 0; i < transfers.size(); i++) {
Transfer const& t = transfers[i];
Print("Transfer %5lu: (%s->", i, MemDevicesToStr(t.srcs).c_str());
if (isMultiNode) Print("R%d", t.exeDevice.exeRank);
Print("%c%d", ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex);
if (t.exeDevice.exeSlot) Print("%c", 'A' + t.exeDevice.exeSlot);
if (t.exeSubIndex != -1) Print(".%d", t.exeSubIndex);
if (t.exeSubSlot != 0) Print("%c", 'A' + t.exeSubSlot);
Print("->%s)\n", MemDevicesToStr(t.dsts).c_str());
}
}
return 0;
}
// Print environment variables and CSV header
if (RankDoesOutput()) {
ev.DisplayEnvVars();
if (ev.outputToCsv)
Print("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
}
// Process each line as a Test // Process each line as a Test
int testNum = 0; int testNum = 0;
for (std::string const &line : lines) { for (std::string const &line : lines) {
...@@ -96,7 +137,7 @@ int main(int argc, char **argv) { ...@@ -96,7 +137,7 @@ int main(int argc, char **argv) {
// Parse set of parallel Transfers to execute // Parse set of parallel Transfers to execute
std::vector<Transfer> transfers; std::vector<Transfer> transfers;
CheckForError(TransferBench::ParseTransfers(line, transfers)); CheckForError(ParseTransfers(line, transfers));
if (transfers.empty()) continue; if (transfers.empty()) continue;
// Check for variable sub-executors Transfers // Check for variable sub-executors Transfers
...@@ -107,7 +148,7 @@ int main(int argc, char **argv) { ...@@ -107,7 +148,7 @@ int main(int argc, char **argv) {
for (auto const& t : transfers) { for (auto const& t : transfers) {
if (t.numSubExecs == 0) { if (t.numSubExecs == 0) {
if (t.exeDevice.exeType != EXE_GPU_GFX) { if (t.exeDevice.exeType != EXE_GPU_GFX) {
printf("[ERROR] Variable number of subexecutors is only supported on GFX executors\n"); Print("[ERROR] Variable number of subexecutors is only supported on GFX executors\n");
exit(1); exit(1);
} }
numVariableTransfers++; numVariableTransfers++;
...@@ -116,7 +157,7 @@ int main(int argc, char **argv) { ...@@ -116,7 +157,7 @@ int main(int argc, char **argv) {
} }
} }
if (numVariableTransfers > 0 && numVariableTransfers != transfers.size()) { if (numVariableTransfers > 0 && numVariableTransfers != transfers.size()) {
printf("[ERROR] All or none of the Transfers in the Test must use variable number of Subexecutors\n"); Print("[ERROR] All or none of the Transfers in the Test must use variable number of Subexecutors\n");
exit(1); exit(1);
} }
} }
...@@ -140,18 +181,20 @@ int main(int argc, char **argv) { ...@@ -140,18 +181,20 @@ int main(int argc, char **argv) {
} }
if (maxVarCount == 0) { if (maxVarCount == 0) {
if (TransferBench::RunTransfers(cfgOptions, transfers, results)) { if (RunTransfers(cfgOptions, transfers, results)) {
PrintResults(ev, ++testNum, transfers, results); PrintResults(ev, ++testNum, transfers, results);
} }
PrintErrors(results.errResults); if (RankDoesOutput()) {
PrintErrors(results.errResults);
}
} else { } else {
// Variable subexecutors - Determine how many subexecutors to sweep up to // Variable subexecutors - Determine how many subexecutors to sweep up to
int maxNumVarSubExec = ev.maxNumVarSubExec; int maxNumVarSubExec = ev.maxNumVarSubExec;
if (maxNumVarSubExec == 0) { if (maxNumVarSubExec == 0) {
maxNumVarSubExec = TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0}) / maxVarCount; maxNumVarSubExec = GetNumSubExecutors({EXE_GPU_GFX, 0}) / maxVarCount;
} }
TransferBench::TestResults bestResults; TestResults bestResults;
std::vector<Transfer> bestTransfers; std::vector<Transfer> bestTransfers;
for (int numSubExecs = ev.minNumVarSubExec; numSubExecs <= maxNumVarSubExec; numSubExecs++) { for (int numSubExecs = ev.minNumVarSubExec; numSubExecs <= maxNumVarSubExec; numSubExecs++) {
std::vector<Transfer> tempTransfers = transfers; std::vector<Transfer> tempTransfers = transfers;
...@@ -159,8 +202,8 @@ int main(int argc, char **argv) { ...@@ -159,8 +202,8 @@ int main(int argc, char **argv) {
if (t.numSubExecs == 0) t.numSubExecs = numSubExecs; if (t.numSubExecs == 0) t.numSubExecs = numSubExecs;
} }
TransferBench::TestResults tempResults; TestResults tempResults;
if (!TransferBench::RunTransfers(cfgOptions, tempTransfers, tempResults)) { if (!RunTransfers(cfgOptions, tempTransfers, tempResults)) {
PrintErrors(tempResults.errResults); PrintErrors(tempResults.errResults);
} else { } else {
if (tempResults.avgTotalBandwidthGbPerSec > bestResults.avgTotalBandwidthGbPerSec) { if (tempResults.avgTotalBandwidthGbPerSec > bestResults.avgTotalBandwidthGbPerSec) {
...@@ -180,158 +223,49 @@ int main(int argc, char **argv) { ...@@ -180,158 +223,49 @@ int main(int argc, char **argv) {
} }
} }
void DisplayUsage(char const* cmdName) void DisplayVersion()
{ {
std::string nicSupport = ""; bool nicSupport = false, mpiSupport = false;
#if NIC_EXEC_ENABLED #if NIC_EXEC_ENABLED
nicSupport = " (with NIC support)"; nicSupport = true;
#endif
#if MPI_COMM_ENABLED
mpiSupport = true;
#endif #endif
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("========================================\n");
if (numa_available() == -1) {
printf("[ERROR] NUMA library not supported. Check to see if libnuma has been installed on this system\n");
exit(1);
}
printf("Usage: %s config <N>\n", cmdName);
printf(" config: Either:\n");
printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
printf(" - Name of preset config:\n");
printf(" N : (Optional) Number of bytes to copy per Transfer.\n");
printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
printf(" If 0 is specified, a range of Ns will be benchmarked\n");
printf(" May append a suffix ('K', 'M', 'G') for kilobytes / megabytes / gigabytes\n");
printf("\n");
EnvVars::DisplayUsage();
}
std::string MemDevicesToStr(std::vector<MemDevice> const& memDevices) {
if (memDevices.empty()) return "N";
std::stringstream ss;
for (auto const& m : memDevices)
ss << TransferBench::MemTypeStr[m.memType] << m.memIndex;
return ss.str();
}
void PrintResults(EnvVars const& ev, int const testNum,
std::vector<Transfer> const& transfers,
TransferBench::TestResults const& results)
{
char sep = ev.outputToCsv ? ',' : '|';
size_t numTimedIterations = results.numTimedIterations;
if (!ev.outputToCsv) printf("Test %d:\n", testNum);
// Loop over each executor
for (auto exeInfoPair : results.exeResults) {
ExeDevice const& exeDevice = exeInfoPair.first;
ExeResult const& exeResult = exeInfoPair.second;
ExeType const exeType = exeDevice.exeType;
int32_t const exeIndex = exeDevice.exeIndex;
printf(" Executor: %3s %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
ExeTypeName[exeType], exeIndex, sep, exeResult.avgBandwidthGbPerSec, sep,
exeResult.avgDurationMsec, sep, exeResult.numBytes, sep, exeResult.sumBandwidthGbPerSec);
// Loop over each executor
for (int idx : exeResult.transferIdx) {
Transfer const& t = transfers[idx];
TransferResult const& r = results.tfrResults[idx];
char exeSubIndexStr[32] = "";
if (t.exeSubIndex != -1)
sprintf(exeSubIndexStr, ".%d", t.exeSubIndex);
printf(" Transfer %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %c%03d%s:%03d -> %s\n",
idx, sep,
r.avgBandwidthGbPerSec, sep,
r.avgDurationMsec, sep,
r.numBytes, sep,
MemDevicesToStr(t.srcs).c_str(),
TransferBench::ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
exeSubIndexStr, t.numSubExecs,
MemDevicesToStr(t.dsts).c_str());
// Show per-iteration timing information
if (ev.showIterations) {
// Check that per-iteration information exists
if (r.perIterMsec.size() != numTimedIterations) {
printf("[ERROR] Per iteration timing data unavailable: Expected %lu data points, but have %lu\n",
numTimedIterations, r.perIterMsec.size());
exit(1);
}
// Compute standard deviation and track iterations by speed
std::set<std::pair<double, int>> times;
double stdDevTime = 0;
double stdDevBw = 0;
for (int i = 0; i < numTimedIterations; i++) {
times.insert(std::make_pair(r.perIterMsec[i], i+1));
double const varTime = fabs(r.avgDurationMsec - r.perIterMsec[i]);
stdDevTime += varTime * varTime;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / r.perIterMsec[i] * 1000.0f;
double const varBw = fabs(iterBandwidthGbs - r.avgBandwidthGbPerSec);
stdDevBw += varBw * varBw;
}
stdDevTime = sqrt(stdDevTime / numTimedIterations);
stdDevBw = sqrt(stdDevBw / numTimedIterations);
// Loop over iterations (fastest to slowest) std::string support = "";
for (auto& time : times) { if (mpiSupport && nicSupport) support = " (with MPI+NIC support)";
double iterDurationMsec = time.first; else if (mpiSupport) support = " (with MPI support)";
double iterBandwidthGbs = (t.numBytes / 1.0E9) / iterDurationMsec * 1000.0f; else if (nicSupport) support = " (with NIC support)";
printf(" Iter %03d %c %8.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);
std::set<int> usedXccs; std::string multiNodeMode = "";
if (time.second - 1 < r.perIterCUs.size()) { switch (GetCommMode()) {
printf(" CUs:"); case COMM_NONE: multiNodeMode = " (Single-node mode)"; break;
for (auto x : r.perIterCUs[time.second - 1]) { case COMM_SOCKET: multiNodeMode = " (Multi-node via sockets)"; break;
printf(" %02d:%02d", x.first, x.second); case COMM_MPI: multiNodeMode = " (Multi-node via MPI)"; break;
usedXccs.insert(x.first);
}
}
printf(" XCCs:");
for (auto x : usedXccs)
printf(" %02d", x);
printf("\n");
}
printf(" StandardDev %c %8.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
}
}
} }
printf(" Aggregate (CPU) %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
sep, results.avgTotalBandwidthGbPerSec, Print("TransferBench v%s.%s%s%s\n", VERSION, CLIENT_VERSION, support.c_str(), multiNodeMode.c_str());
sep, results.avgTotalDurationMsec, Print("=============================================================================================================\n");
sep, results.totalBytesTransferred,
sep, results.overheadMsec);
} }
void CheckForError(ErrResult const& error) void DisplayUsage(char const* cmdName)
{ {
switch (error.errType) { if (numa_available() == -1) {
case ERR_NONE: return; Print("[ERROR] NUMA library not supported. Check to see if libnuma has been installed on this system\n");
case ERR_WARN:
printf("[WARN] %s\n", error.errMsg.c_str());
return;
case ERR_FATAL:
printf("[ERROR] %s\n", error.errMsg.c_str());
exit(1); exit(1);
default:
break;
} }
}
void PrintErrors(std::vector<ErrResult> const& errors) Print("Usage: %s config <N>\n", cmdName);
{ Print(" config: Either:\n");
bool isFatal = false; Print(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
for (auto const& err : errors) { Print(" - Name of preset config:\n");
printf("[%s] %s\n", err.errType == ERR_FATAL ? "ERROR" : "WARN", err.errMsg.c_str()); Print(" N : (Optional) Number of bytes to copy per Transfer.\n");
isFatal |= (err.errType == ERR_FATAL); Print(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
} DEFAULT_BYTES_PER_TRANSFER);
if (isFatal) exit(1); Print(" If 0 is specified, a range of Ns will be benchmarked\n");
} Print(" May append a suffix ('K', 'M', 'G') for kilobytes / megabytes / gigabytes\n");
Print("\n");
EnvVars::DisplayUsage();
};
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#pragma once
// TransferBench client version
#define CLIENT_VERSION "00"
#include "TransferBench.hpp"
#include "EnvVars.hpp"
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<28);
char const ExeTypeName[5][4] = {"CPU", "GPU", "DMA", "NIC", "NIC"};
// Display detected hardware
void DisplayTopology(bool outputToCsv);
// Display usage instructions
void DisplayUsage(char const* cmdName);
// Print TransferBench test results
void PrintResults(EnvVars const& ev, int const testNum,
std::vector<Transfer> const& transfers,
TransferBench::TestResults const& results);
// Helper function that converts MemDevices to a string
std::string MemDevicesToStr(std::vector<MemDevice> const& memDevices);
// Helper function to print warning / exit on fatal error
void CheckForError(ErrResult const& error);
// Helper function to print list of errors
void PrintErrors(std::vector<ErrResult> const& errors);
/* /*
Copyright (c) 2021-2025 Advanced Micro Devices, Inc. All rights reserved. Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal of this software and associated documentation files (the "Software"), to deal
...@@ -39,7 +39,8 @@ THE SOFTWARE. ...@@ -39,7 +39,8 @@ THE SOFTWARE.
#include <numa.h> #include <numa.h>
#include <random> #include <random>
#include <time.h> #include <time.h>
#include "Client.hpp"
#define CLIENT_VERSION "00"
#include "TransferBench.hpp" #include "TransferBench.hpp"
using namespace TransferBench; using namespace TransferBench;
...@@ -69,6 +70,7 @@ public: ...@@ -69,6 +70,7 @@ public:
int numIterations; // Number of timed iterations to perform. If negative, run for -numIterations seconds instead int numIterations; // Number of timed iterations to perform. If negative, run for -numIterations seconds instead
int numSubIterations; // Number of subiterations to perform int numSubIterations; // Number of subiterations to perform
int numWarmups; // Number of un-timed warmup iterations to perform int numWarmups; // Number of un-timed warmup iterations to perform
int showBorders; // Show ASCII box-drawing characters in tables
int showIterations; // Show per-iteration timing info int showIterations; // Show per-iteration timing info
int useInteractive; // Pause for user-input before starting transfer loop int useInteractive; // Pause for user-input before starting transfer loop
...@@ -107,11 +109,11 @@ public: ...@@ -107,11 +109,11 @@ public:
// NIC options // NIC options
int ibGidIndex; // GID Index for RoCE NICs int ibGidIndex; // GID Index for RoCE NICs
int roceVersion; // RoCE version number
int ipAddressFamily; // IP Address Famliy
uint8_t ibPort; // NIC port number to be used uint8_t ibPort; // NIC port number to be used
int ipAddressFamily; // IP Address Family
int nicChunkBytes; // Number of bytes to send per chunk for RDMA operations
int nicRelaxedOrder; // Use relaxed ordering for RDMA int nicRelaxedOrder; // Use relaxed ordering for RDMA
std::string closestNicStr; // Holds the user-specified list of closest NICs int roceVersion; // RoCE version number
// Developer features // Developer features
int gpuMaxHwQueues; // Tracks GPU_MAX_HW_QUEUES environment variable int gpuMaxHwQueues; // Tracks GPU_MAX_HW_QUEUES environment variable
...@@ -119,14 +121,16 @@ public: ...@@ -119,14 +121,16 @@ public:
// Constructor that collects values // Constructor that collects values
EnvVars() EnvVars()
{ {
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU); // Try to detect the GPU
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
int numDeviceCUs = TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0});
hipDeviceProp_t prop; hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, 0)); std::string fullName = "";
std::string fullName = prop.gcnArchName; std::string archName = "";
std::string archName = fullName.substr(0, fullName.find(':')); int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
if (numDetectedGpus > 0) {
HIP_CALL(hipGetDeviceProperties(&prop, 0));
fullName = prop.gcnArchName;
archName = fullName.substr(0, fullName.find(':'));
}
// Different hardware pick different GPU kernels // Different hardware pick different GPU kernels
// This performance difference is generally only noticable when executing fewer CUs // This performance difference is generally only noticable when executing fewer CUs
...@@ -156,6 +160,7 @@ public: ...@@ -156,6 +160,7 @@ public:
numWarmups = GetEnvVar("NUM_WARMUPS" , 3); numWarmups = GetEnvVar("NUM_WARMUPS" , 3);
outputToCsv = GetEnvVar("OUTPUT_TO_CSV" , 0); outputToCsv = GetEnvVar("OUTPUT_TO_CSV" , 0);
samplingFactor = GetEnvVar("SAMPLING_FACTOR" , 1); samplingFactor = GetEnvVar("SAMPLING_FACTOR" , 1);
showBorders = GetEnvVar("SHOW_BORDERS" , 1);
showIterations = GetEnvVar("SHOW_ITERATIONS" , 0); showIterations = GetEnvVar("SHOW_ITERATIONS" , 0);
useHipEvents = GetEnvVar("USE_HIP_EVENTS" , 1); useHipEvents = GetEnvVar("USE_HIP_EVENTS" , 1);
useHsaDma = GetEnvVar("USE_HSA_DMA" , 0); useHsaDma = GetEnvVar("USE_HSA_DMA" , 0);
...@@ -168,8 +173,8 @@ public: ...@@ -168,8 +173,8 @@ public:
ibPort = GetEnvVar("IB_PORT_NUMBER" , 1); ibPort = GetEnvVar("IB_PORT_NUMBER" , 1);
roceVersion = GetEnvVar("ROCE_VERSION" , 2); roceVersion = GetEnvVar("ROCE_VERSION" , 2);
ipAddressFamily = GetEnvVar("IP_ADDRESS_FAMILY" , 4); ipAddressFamily = GetEnvVar("IP_ADDRESS_FAMILY" , 4);
nicChunkBytes = GetEnvVar("NIC_CHUNK_BYTES" , 1073741824);
nicRelaxedOrder = GetEnvVar("NIC_RELAX_ORDER" , 1); nicRelaxedOrder = GetEnvVar("NIC_RELAX_ORDER" , 1);
closestNicStr = GetEnvVar("CLOSEST_NIC" , "");
gpuMaxHwQueues = GetEnvVar("GPU_MAX_HW_QUEUES" , 4); gpuMaxHwQueues = GetEnvVar("GPU_MAX_HW_QUEUES" , 4);
...@@ -314,9 +319,6 @@ public: ...@@ -314,9 +319,6 @@ public:
printf(" ALWAYS_VALIDATE - Validate after each iteration instead of once after all iterations\n"); printf(" ALWAYS_VALIDATE - Validate after each iteration instead of once after all iterations\n");
printf(" BLOCK_BYTES - Controls granularity of how work is divided across subExecutors\n"); printf(" BLOCK_BYTES - Controls granularity of how work is divided across subExecutors\n");
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4\n"); printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4\n");
#if NIC_EXEC_ENABLED
printf(" CLOSEST_NIC - Comma-separated list of per-GPU closest NIC (default=auto)\n");
#endif
printf(" CU_MASK - CU mask for streams. Can specify ranges e.g '5,10-12,14'\n"); printf(" CU_MASK - CU mask for streams. Can specify ranges e.g '5,10-12,14'\n");
printf(" FILL_COMPRESS - Percentages of 64B lines to be filled by random/1B0/2B0/4B0/32B0\n"); printf(" FILL_COMPRESS - Percentages of 64B lines to be filled by random/1B0/2B0/4B0/32B0\n");
printf(" FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits\n"); printf(" FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits\n");
...@@ -337,6 +339,7 @@ public: ...@@ -337,6 +339,7 @@ public:
printf(" MIN_VAR_SUBEXEC - Minumum # of subexecutors to use for variable subExec Transfers\n"); printf(" MIN_VAR_SUBEXEC - Minumum # of subexecutors to use for variable subExec Transfers\n");
printf(" MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)\n"); printf(" MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)\n");
#if NIC_EXEC_ENABLED #if NIC_EXEC_ENABLED
printf(" NIC_CHUNK_BYTES - Number of bytes to send at a time using NIC (default = 1GB)\n");
printf(" NIC_RELAX_ORDER - Set to non-zero to use relaxed ordering"); printf(" NIC_RELAX_ORDER - Set to non-zero to use relaxed ordering");
#endif #endif
printf(" NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead\n"); printf(" NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead\n");
...@@ -347,6 +350,7 @@ public: ...@@ -347,6 +350,7 @@ public:
printf(" ROCE_VERSION - RoCE version (default=2)\n"); printf(" ROCE_VERSION - RoCE version (default=2)\n");
#endif #endif
printf(" SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes\n"); printf(" SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHOW_BORDERS - Show ASCII box-drawing characaters in tables\n");
printf(" SHOW_ITERATIONS - Show per-iteration timing info\n"); printf(" SHOW_ITERATIONS - Show per-iteration timing info\n");
printf(" USE_HIP_EVENTS - Use HIP events for GFX executor timing\n"); printf(" USE_HIP_EVENTS - Use HIP events for GFX executor timing\n");
printf(" USE_HSA_DMA - Use hsa_amd_async_copy instead of hipMemcpy for non-targeted DMA execution\n"); printf(" USE_HSA_DMA - Use hsa_amd_async_copy instead of hipMemcpy for non-targeted DMA execution\n");
...@@ -386,8 +390,6 @@ public: ...@@ -386,8 +390,6 @@ public:
nicSupport = " (with NIC support)"; nicSupport = " (with NIC support)";
#endif #endif
if (!outputToCsv) { if (!outputToCsv) {
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("===============================================================\n");
if (!hideEnv) printf("[Common] (Suppress by setting HIDE_ENV=1)\n"); if (!hideEnv) printf("[Common] (Suppress by setting HIDE_ENV=1)\n");
} }
else if (!hideEnv) else if (!hideEnv)
...@@ -400,10 +402,6 @@ public: ...@@ -400,10 +402,6 @@ public:
"Each CU gets a mulitple of %d bytes to copy", blockBytes); "Each CU gets a mulitple of %d bytes to copy", blockBytes);
Print("BYTE_OFFSET", byteOffset, Print("BYTE_OFFSET", byteOffset,
"Using byte offset of %d", byteOffset); "Using byte offset of %d", byteOffset);
#if NIC_EXEC_ENABLED
Print("CLOSEST_NIC", (closestNicStr == "" ? "auto" : "user-input"),
"Per-GPU closest NIC is set as %s", (closestNicStr == "" ? "auto" : closestNicStr.c_str()));
#endif
Print("CU_MASK", getenv("CU_MASK") ? 1 : 0, Print("CU_MASK", getenv("CU_MASK") ? 1 : 0,
"%s", (cuMask.size() ? GetCuMaskDesc().c_str() : "All")); "%s", (cuMask.size() ? GetCuMaskDesc().c_str() : "All"));
Print("FILL_COMPRESS", getenv("FILL_COMPRESS") ? 1 : 0, Print("FILL_COMPRESS", getenv("FILL_COMPRESS") ? 1 : 0,
...@@ -452,6 +450,8 @@ public: ...@@ -452,6 +450,8 @@ public:
"Using up to %s subexecutors for variable subExec transfers", "Using up to %s subexecutors for variable subExec transfers",
maxNumVarSubExec ? std::to_string(maxNumVarSubExec).c_str() : "all available"); maxNumVarSubExec ? std::to_string(maxNumVarSubExec).c_str() : "all available");
#if NIC_EXEC_ENABLED #if NIC_EXEC_ENABLED
Print("NIC_CHUNK_BYTES", nicChunkBytes,
"Sending %lu bytes at a time for NIC RDMA", nicChunkBytes);
Print("NIC_RELAX_ORDER", nicRelaxedOrder, Print("NIC_RELAX_ORDER", nicRelaxedOrder,
"Using %s ordering for NIC RDMA", nicRelaxedOrder ? "relaxed" : "strict"); "Using %s ordering for NIC RDMA", nicRelaxedOrder ? "relaxed" : "strict");
#endif #endif
...@@ -466,6 +466,7 @@ public: ...@@ -466,6 +466,7 @@ public:
Print("ROCE_VERSION", roceVersion, Print("ROCE_VERSION", roceVersion,
"RoCE version is set to %d", roceVersion); "RoCE version is set to %d", roceVersion);
#endif #endif
Print("SHOW_BORDERS", showBorders, "%s ASCII box-drawing characaters in tables", showBorders ? "Showing" : "Hiding");
Print("SHOW_ITERATIONS", showIterations, Print("SHOW_ITERATIONS", showIterations,
"%s per-iteration timing", showIterations ? "Showing" : "Hiding"); "%s per-iteration timing", showIterations ? "Showing" : "Hiding");
Print("USE_HIP_EVENTS", useHipEvents, Print("USE_HIP_EVENTS", useHipEvents,
...@@ -497,8 +498,17 @@ public: ...@@ -497,8 +498,17 @@ public:
// Helper function that gets parses environment variable or sets to default value // Helper function that gets parses environment variable or sets to default value
static int GetEnvVar(std::string const& varname, int defaultValue) static int GetEnvVar(std::string const& varname, int defaultValue)
{ {
if (getenv(varname.c_str())) char const* varStr = getenv(varname.c_str());
return atoi(getenv(varname.c_str())); if (varStr) {
int val = atoi(varStr);
char units = varStr[strlen(varStr)-1];
switch (units) {
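// Cases intentionally fall through: a 'G' suffix multiplies by 1024 three times, 'M' twice, 'K' once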
case 'G': case 'g': val *= 1024;
case 'M': case 'm': val *= 1024;
case 'K': case 'k': val *= 1024;
}
return val;
}
return defaultValue; return defaultValue;
} }
...@@ -633,27 +643,13 @@ public: ...@@ -633,27 +643,13 @@ public:
cfg.gfx.waveOrder = gfxWaveOrder; cfg.gfx.waveOrder = gfxWaveOrder;
cfg.gfx.wordSize = gfxWordSize; cfg.gfx.wordSize = gfxWordSize;
cfg.nic.chunkBytes = nicChunkBytes;
cfg.nic.ibGidIndex = ibGidIndex; cfg.nic.ibGidIndex = ibGidIndex;
cfg.nic.ibPort = ibPort; cfg.nic.ibPort = ibPort;
cfg.nic.ipAddressFamily = ipAddressFamily; cfg.nic.ipAddressFamily = ipAddressFamily;
cfg.nic.useRelaxedOrder = nicRelaxedOrder; cfg.nic.useRelaxedOrder = nicRelaxedOrder;
cfg.nic.roceVersion = roceVersion; cfg.nic.roceVersion = roceVersion;
std::vector<int> closestNics;
if(closestNicStr != "") {
std::stringstream ss(closestNicStr);
std::string item;
while (std::getline(ss, item, ',')) {
try {
int nic = std::stoi(item);
closestNics.push_back(nic);
} catch (const std::invalid_argument& e) {
printf("[ERROR] Invalid NIC index (%s) by user in %s\n", item.c_str(), closestNicStr.c_str());
exit(1);
}
}
cfg.nic.closestNics = closestNics;
}
return cfg; return cfg;
} }
}; };
......
/* /*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved. Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal of this software and associated documentation files (the "Software"), to deal
...@@ -22,36 +22,49 @@ THE SOFTWARE. ...@@ -22,36 +22,49 @@ THE SOFTWARE.
#include "EnvVars.hpp" #include "EnvVars.hpp"
void AllToAllRdmaPreset(EnvVars& ev, int AllToAllRdmaPreset(EnvVars& ev,
size_t const numBytesPerTransfer, size_t const numBytesPerTransfer,
std::string const presetName) std::string const presetName)
{ {
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR]a2an preset currently not supported for multi-node\n");
return 1;
}
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX); int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset // Collect env vars for this preset
int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus); int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1); int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , 1); int memTypeIdx = EnvVars::GetEnvVar("MEM_TYPE" , 2);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , -999); // Deprecated
// Deprecated env var check
if (useFineGrain != -999) {
memTypeIdx = useFineGrain ? 2 : 0;
}
MemType memType = Utils::GetGpuMemType(memTypeIdx);
std::string memTypeStr = Utils::GetGpuMemTypeStr(memTypeIdx);
// Print off environment variables // Print off environment variables
ev.DisplayEnvVars(); if (Utils::RankDoesOutput()) {
if (!ev.hideEnv) { ev.DisplayEnvVars();
if (!ev.outputToCsv) printf("[AllToAll Network Related]\n"); if (!ev.hideEnv) {
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus); if (!ev.outputToCsv) printf("[AllToAll Network Related]\n");
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs); ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("USE_FINE_GRAIN" , useFineGrain , "Using %s-grained memory", useFineGrain ? "fine" : "coarse"); ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
printf("\n"); ev.Print("MEM_TYPE" , memTypeIdx , "Using %s memory (%s)", memTypeStr.c_str(), Utils::GetAllGpuMemTypeStr().c_str());
printf("\n");
}
} }
// Validate env vars // Validate env vars
if (numGpus < 0 || numGpus > numDetectedGpus) { if (numGpus < 0 || numGpus > numDetectedGpus) {
printf("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus); Utils::Print("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
exit(1); return 1;
} }
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::map<std::pair<int, int>, int> reIndex; std::map<std::pair<int, int>, int> reIndex;
std::vector<Transfer> transfers; std::vector<Transfer> transfers;
...@@ -71,31 +84,31 @@ void AllToAllRdmaPreset(EnvVars& ev, ...@@ -71,31 +84,31 @@ void AllToAllRdmaPreset(EnvVars& ev,
} }
} }
printf("GPU-RDMA All-To-All benchmark:\n"); Utils::Print("GPU-RDMA All-To-All benchmark:\n");
printf("==========================\n"); Utils::Print("==========================\n");
printf("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n", Utils::Print("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n",
numBytesPerTransfer, numQueuePairs, transfers.size()); numBytesPerTransfer, numQueuePairs, transfers.size());
if (transfers.size() == 0) return; if (transfers.size() == 0) return 0;
// Execute Transfers // Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions(); TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results; TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) { if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults) for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str()); Utils::Print("%s\n", err.errMsg.c_str());
exit(0); return 1;
} else { } else {
PrintResults(ev, 1, transfers, results); Utils::PrintResults(ev, 1, transfers, results);
} }
// Print results // Print results
char separator = (ev.outputToCsv ? ',' : ' '); char separator = (ev.outputToCsv ? ',' : ' ');
printf("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer); Utils::Print("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
printf("==========================================================\n"); Utils::Print("==========================================================\n");
printf("SRC\\DST "); Utils::Print("SRC\\DST ");
for (int dst = 0; dst < numGpus; dst++) for (int dst = 0; dst < numGpus; dst++)
printf("%cGPU %02d ", separator, dst); Utils::Print("%cGPU %02d ", separator, dst);
printf(" %cSTotal %cActual\n", separator, separator); Utils::Print(" %cSTotal %cActual\n", separator, separator);
double totalBandwidthGpu = 0.0; double totalBandwidthGpu = 0.0;
double minActualBandwidth = std::numeric_limits<double>::max(); double minActualBandwidth = std::numeric_limits<double>::max();
...@@ -105,7 +118,7 @@ void AllToAllRdmaPreset(EnvVars& ev, ...@@ -105,7 +118,7 @@ void AllToAllRdmaPreset(EnvVars& ev,
double rowTotalBandwidth = 0; double rowTotalBandwidth = 0;
int transferCount = 0; int transferCount = 0;
double minBandwidth = std::numeric_limits<double>::max(); double minBandwidth = std::numeric_limits<double>::max();
printf("GPU %02d", src); Utils::Print("GPU %02d", src);
for (int dst = 0; dst < numGpus; dst++) { for (int dst = 0; dst < numGpus; dst++) {
if (reIndex.count(std::make_pair(src, dst))) { if (reIndex.count(std::make_pair(src, dst))) {
int const transferIdx = reIndex[std::make_pair(src,dst)]; int const transferIdx = reIndex[std::make_pair(src,dst)];
...@@ -115,28 +128,30 @@ void AllToAllRdmaPreset(EnvVars& ev, ...@@ -115,28 +128,30 @@ void AllToAllRdmaPreset(EnvVars& ev,
totalBandwidthGpu += r.avgBandwidthGbPerSec; totalBandwidthGpu += r.avgBandwidthGbPerSec;
minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec); minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec);
transferCount++; transferCount++;
printf("%c%8.3f ", separator, r.avgBandwidthGbPerSec); Utils::Print("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
} else { } else {
printf("%c%8s ", separator, "N/A"); Utils::Print("%c%8s ", separator, "N/A");
} }
} }
double actualBandwidth = minBandwidth * transferCount; double actualBandwidth = minBandwidth * transferCount;
printf(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth); Utils::Print(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
minActualBandwidth = std::min(minActualBandwidth, actualBandwidth); minActualBandwidth = std::min(minActualBandwidth, actualBandwidth);
maxActualBandwidth = std::max(maxActualBandwidth, actualBandwidth); maxActualBandwidth = std::max(maxActualBandwidth, actualBandwidth);
colTotalBandwidth[numGpus+1] += rowTotalBandwidth; colTotalBandwidth[numGpus+1] += rowTotalBandwidth;
} }
printf("\nRTotal"); Utils::Print("\nRTotal");
for (int dst = 0; dst < numGpus; dst++) { for (int dst = 0; dst < numGpus; dst++) {
printf("%c%8.3f ", separator, colTotalBandwidth[dst]); Utils::Print("%c%8.3f ", separator, colTotalBandwidth[dst]);
} }
printf(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1], Utils::Print(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
separator, minActualBandwidth, separator, maxActualBandwidth); separator, minActualBandwidth, separator, maxActualBandwidth);
printf("\n"); Utils::Print("\n");
Utils::Print("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
Utils::Print("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
Utils::Print("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
printf("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size()); Utils::PrintErrors(results.errResults);
printf("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
printf("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
PrintErrors(results.errResults); return 0;
} }
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -22,10 +22,15 @@ THE SOFTWARE.
#include "EnvVars.hpp"
int AllToAllSweepPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] All to All Sweep preset currently not supported for multi-node\n");
return 1;
}
enum
{
A2A_COPY = 0,
...@@ -172,7 +177,7 @@ void AllToAllSweepPreset(EnvVars& ev,
printf("- Copying %lu bytes between %s pairs of GPUs\n", numBytesPerTransfer, a2aDirect ? "directly connected" : "all");
if (transfers.size() == 0) {
printf("[WARN] No transfers requested. Try adjusting A2A_DIRECT or A2A_LOCAL\n");
return 0;
}
// Execute Transfers
...@@ -227,9 +232,10 @@ void AllToAllSweepPreset(EnvVars& ev,
for (int c : numCusList) {
for (int u : unrollList) {
printf("CUs: %d Unroll %d\n", c, u);
Utils::PrintResults(ev, ++testNum, transfers, results[std::make_pair(c,u)]);
}
}
}
}
return 0;
}
...@@ -186,7 +186,7 @@ int TestUnidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
Utils::PrintErrors(results.errResults);
}
}
...@@ -232,7 +232,7 @@ int TestUnidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
Utils::PrintErrors(results.errResults);
}
}
...@@ -298,7 +298,7 @@ int TestBidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
Utils::PrintErrors(results.errResults);
}
}
...@@ -423,7 +423,7 @@ int TestHbmPerformance(int modelId, bool verbose)
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
}
} else {
Utils::PrintErrors(results.errResults);
}
if (fails.size() == 0) {
...@@ -439,14 +439,19 @@ int TestHbmPerformance(int modelId, bool verbose)
return hasFail;
}
int HealthCheckPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Healthcheck preset currently not supported for multi-node\n");
return 1;
}
// Check for supported platforms
#if defined(__NVCC__)
printf("[WARN] healthcheck preset not supported on NVIDIA hardware\n");
return 0;
#endif
printf("Disclaimer:\n");
...@@ -468,5 +473,5 @@ void HealthCheckPreset(EnvVars& ev,
numFails += TestUnidir(modelId, verbose);
numFails += TestBidir(modelId, verbose);
numFails += TestAllToAll(modelId, verbose);
return numFails ? 1 : 0;
}
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
int NicRingsPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
// Check for single homogenous group
if (Utils::GetNumRankGroups() > 1) {
Utils::Print("[ERROR] NIC-rings preset can only be run across ranks that are homogenous\n");
Utils::Print("[ERROR] Run ./TransferBench without any args to display topology information\n");
Utils::Print("[ERROR] NIC_FILTER may also be used to limit NIC visibility\n");
return 1;
}
// Collect topology
int numRanks = TransferBench::GetNumRanks();
// Read in environment variables
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int showDetails = EnvVars::GetEnvVar("SHOW_DETAILS" , 0);
int useCpuMem = EnvVars::GetEnvVar("USE_CPU_MEM" , 0);
int memTypeIdx = EnvVars::GetEnvVar("MEM_TYPE" , 0);
int useRdmaRead = EnvVars::GetEnvVar("USE_RDMA_READ" , 0);
// Print off environment variables
MemType memType = Utils::GetMemType(memTypeIdx, useCpuMem);
std::string memTypeStr = Utils::GetMemTypeStr(memTypeIdx, useCpuMem);
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[NIC-Rings Related]\n");
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("SHOW_DETAILS" , showDetails , "%s full Test details", showDetails ? "Showing" : "Hiding");
ev.Print("USE_CPU_MEM" , useCpuMem , "Using closest %s memory", useCpuMem ? "CPU" : "GPU");
ev.Print("MEM_TYPE" , memTypeIdx , "Using %s memory (%s)", memTypeStr.c_str(), Utils::GetAllMemTypeStr(useCpuMem).c_str());
if (numRanks > 1)
ev.Print("USE_RDMA_READ", useRdmaRead , "Performing RDMA %s", useRdmaRead ? "reads" : "writes");
printf("\n");
}
}
// Prepare list of transfers
int numDevices = TransferBench::GetNumExecutors(useCpuMem ? EXE_CPU : EXE_GPU_GFX);
std::vector<Transfer> transfers;
int numRings = 0;
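// Build one ring per (memory device, nearby NIC) pair: each rank copies its local
// buffer at memIndex to the same memIndex on the next rank, wrapping back to rank 0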
for (int memIndex = 0; memIndex < numDevices; memIndex++) {
std::vector<int> nicIndices;
if (useCpuMem) {
TransferBench::GetClosestNicsToCpu(nicIndices, memIndex);
} else {
TransferBench::GetClosestNicsToGpu(nicIndices, memIndex);
}
for (int nicIndex : nicIndices) {
numRings++;
for (int currRank = 0; currRank < numRanks; currRank++) {
int nextRank = (currRank + 1) % numRanks;
TransferBench::Transfer transfer;
transfer.srcs.push_back({memType, memIndex, currRank});
transfer.dsts.push_back({memType, memIndex, nextRank});
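// The executing NIC sits on the sending rank for RDMA writes, or on the receiving rank for RDMA reads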
transfer.exeDevice = {EXE_NIC, nicIndex, useRdmaRead ? nextRank : currRank};
transfer.exeSubIndex = nicIndex;
transfer.numSubExecs = numQueuePairs;
transfer.numBytes = numBytesPerTransfer;
transfers.push_back(transfer);
}
}
}
Utils::Print("NIC Rings benchmark\n");
Utils::Print("==============================\n");
Utils::Print("%d parallel RDMA-%s rings(s) using %s memory across %d ranks\n",
numRings, useRdmaRead ? "read" : "write", memTypeStr.c_str(), numRanks);
Utils::Print("%d queue pairs per NIC. %lu bytes per Transfer. All numbers are GB/s\n",
numQueuePairs, numBytesPerTransfer);
Utils::Print("\n");
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
Utils::Print("%s\n", err.errMsg.c_str());
return 1;
} else if (showDetails) {
Utils::PrintResults(ev, 1, transfers, results);
Utils::Print("\n");
}
// Only ranks that actually do output will compile results
if (!Utils::RankDoesOutput()) return 0;
// Prepare table of results
int numRows = 6 + numRanks;
int numCols = 3 + numRings;
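// Layout: 3 header rows + one row per rank + MAX/AVG/MIN summary rows;
// 2 label columns (rank, hostname) + one column per ring + a trailing TOTAL column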
Utils::TableHelper table(numRows, numCols);
// Prepare headers
table.Set(2, 0, " Rank ");
table.Set(2, 1, " Name ");
table.Set(1, numCols-1, " TOTAL ");
table.Set(2, numCols-1, " (GB/s) ");
table.SetColAlignment(1, Utils::TableHelper::ALIGN_LEFT);
for (int rank = 0; rank < numRanks; rank++) {
table.Set(3 + rank, 0, " %d ", rank);
table.Set(3 + rank, 1, " %s ", TransferBench::GetHostname(rank).c_str());
}
table.Set(numRows-3, 1, " MAX (GB/s) ");
table.Set(numRows-2, 1, " AVG (GB/s) ");
table.Set(numRows-1, 1, " MIN (GB/s) ");
for (int row = numRows-3; row < numRows; row++)
table.SetCellAlignment(row, 1, Utils::TableHelper::ALIGN_RIGHT);
table.DrawRowBorder(3);
table.DrawRowBorder(numRows-3);
int colIdx = 2;
int transferIdx = 0;
std::vector<double> rankTotal(numRanks, 0.0);
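// Transfers were appended rank-by-rank within each ring above, so walk tfrResults in the same order here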
for (int memIndex = 0; memIndex < numDevices; memIndex++) {
std::vector<int> nicIndices;
if (useCpuMem) {
TransferBench::GetClosestNicsToCpu(nicIndices, memIndex);
table.Set(0, colIdx, " CPU %02d ", memIndex);
} else {
TransferBench::GetClosestNicsToGpu(nicIndices, memIndex);
table.Set(0, colIdx, " GPU %02d ", memIndex);
}
bool isFirst = true;
for (int nicIndex : nicIndices) {
if (isFirst) {
isFirst = false;
table.DrawColBorder(colIdx);
}
table.Set(1, colIdx, " NIC %02d ", nicIndex);
table.Set(2, colIdx, " %s ", TransferBench::GetExecutorName({EXE_NIC, nicIndex}).c_str());
double ringMin = std::numeric_limits<double>::max();
double ringAvg = 0.0;
double ringMax = std::numeric_limits<double>::lowest();
for (int rank = 0; rank < numRanks; rank++) {
double avgBw = results.tfrResults[transferIdx].avgBandwidthGbPerSec;
table.Set(3 + rank, colIdx, " %.2f ", avgBw);
ringMin = std::min(ringMin, avgBw);
ringAvg += avgBw;
ringMax = std::max(ringMax, avgBw);
rankTotal[rank] += avgBw;
transferIdx++;
}
table.Set(numRows-3, colIdx, " %.2f ", ringMax);
table.Set(numRows-2, colIdx, " %.2f ", ringAvg / numRanks);
table.Set(numRows-1, colIdx, " %.2f ", ringMin);
colIdx++;
}
if (!isFirst) {
table.DrawColBorder(colIdx);
}
}
double rankMin = std::numeric_limits<double>::max();
double rankAvg = 0.0;
double rankMax = std::numeric_limits<double>::lowest();
for (int rank = 0; rank < numRanks; rank++) {
table.Set(3 + rank, numCols - 1, " %.2f ", rankTotal[rank]);
rankMin = std::min(rankMin, rankTotal[rank]);
rankAvg += rankTotal[rank];
rankMax = std::max(rankMax, rankTotal[rank]);
}
table.Set(numRows - 3, numCols - 1, " %.2f ", rankMax);
table.Set(numRows - 2, numCols - 1, " %.2f ", rankAvg / numRanks);
table.Set(numRows - 1, numCols - 1, " %.2f ", rankMin);
table.PrintTable(ev.outputToCsv, ev.showBorders);
Utils::Print("\n");
Utils::Print("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
Utils::PrintErrors(results.errResults);
if (Utils::HasDuplicateHostname()) {
printf("[WARN] It is recommended to run TransferBench with one rank per host to avoid potential aliasing of executors\n");
}
return 0;
}
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -20,14 +20,19 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
int OneToAllPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] One-to-All preset currently not supported for multi-node\n");
return 1;
}
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
if (numDetectedGpus < 2) {
printf("[ERROR] One-to-all benchmark requires machine with at least 2 GPUs\n");
return 1;
}
// Collect env vars for this preset
...@@ -61,7 +66,7 @@ void OneToAllPreset(EnvVars& ev,
for (auto ch : sweepExe) {
if (ch != 'G' && ch != 'D') {
printf("[ERROR] Unrecognized executor type '%c' specified\n", ch);
return 1;
}
}
...@@ -98,7 +103,7 @@ void OneToAllPreset(EnvVars& ev,
for (int i = 0; i < numGpuDevices; i++) {
if (bitmask & (1<<i)) {
Transfer t;
Utils::CheckForError(TransferBench::CharToExeType(exe, t.exeDevice.exeType));
t.exeDevice.exeIndex = exeIndex;
t.exeSubIndex = -1;
t.numSubExecs = numSubExecs;
...@@ -108,7 +113,7 @@ void OneToAllPreset(EnvVars& ev,
t.srcs.clear();
} else {
t.srcs.resize(1);
Utils::CheckForError(TransferBench::CharToMemType(src, t.srcs[0].memType));
t.srcs[0].memIndex = sweepDir == 0 ? exeIndex : i;
}
...@@ -116,15 +121,15 @@ void OneToAllPreset(EnvVars& ev,
t.dsts.clear();
} else {
t.dsts.resize(1);
Utils::CheckForError(TransferBench::CharToMemType(dst, t.dsts[0].memType));
t.dsts[0].memIndex = sweepDir == 0 ? i : exeIndex;
}
transfers.push_back(t);
}
}
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
Utils::PrintErrors(results.errResults);
return 1;
}
int counter = 0;
...@@ -138,12 +143,13 @@ void OneToAllPreset(EnvVars& ev,
printf(" %d %d", p, numSubExecs);
for (auto i = 0; i < transfers.size(); i++) {
printf(" (%s %c%d %s)",
Utils::MemDevicesToStr(transfers[i].srcs).c_str(),
ExeTypeStr[transfers[i].exeDevice.exeType], transfers[i].exeDevice.exeIndex,
Utils::MemDevicesToStr(transfers[i].dsts).c_str());
}
printf("\n");
}
}
}
return 0;
}
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -20,39 +20,60 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
int PeerToPeerPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Peer-to-peer preset currently not supported for multi-node\n");
return 1;
}
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int useDmaCopy = EnvVars::GetEnvVar("USE_GPU_DMA", 0);
int cpuMemTypeIdx = EnvVars::GetEnvVar("CPU_MEM_TYPE", 0);
int gpuMemTypeIdx = EnvVars::GetEnvVar("GPU_MEM_TYPE", 0);
int numCpuDevices = EnvVars::GetEnvVar("NUM_CPU_DEVICES", numDetectedCpus);
int numCpuSubExecs = EnvVars::GetEnvVar("NUM_CPU_SE", 4);
int numGpuDevices = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numGpuSubExecs = EnvVars::GetEnvVar("NUM_GPU_SE", useDmaCopy ? 1 : TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0}));
int p2pMode = EnvVars::GetEnvVar("P2P_MODE", 0);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN", -999); // Deprecated
int useRemoteRead = EnvVars::GetEnvVar("USE_REMOTE_READ", 0);
MemType cpuMemType = Utils::GetCpuMemType(cpuMemTypeIdx);
MemType gpuMemType = Utils::GetGpuMemType(gpuMemTypeIdx);
// Display environment variables
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
int outputToCsv = ev.outputToCsv;
if (!outputToCsv) printf("[P2P Related]\n");
ev.Print("CPU_MEM_TYPE" , cpuMemTypeIdx, "Using %s (%s)", Utils::GetCpuMemTypeStr(cpuMemTypeIdx).c_str(), Utils::GetAllCpuMemTypeStr().c_str());
ev.Print("GPU_MEM_TYPE" , gpuMemTypeIdx, "Using %s (%s)", Utils::GetGpuMemTypeStr(gpuMemTypeIdx).c_str(), Utils::GetAllGpuMemTypeStr().c_str());
ev.Print("NUM_CPU_DEVICES", numCpuDevices, "Using %d CPUs", numCpuDevices);
ev.Print("NUM_CPU_SE", numCpuSubExecs, "Using %d CPU threads per Transfer", numCpuSubExecs);
ev.Print("NUM_GPU_DEVICES", numGpuDevices, "Using %d GPUs", numGpuDevices);
ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d GPU subexecutors/CUs per Transfer", numGpuSubExecs);
ev.Print("P2P_MODE", p2pMode, "Running %s transfers", p2pMode == 0 ? "Uni + Bi" :
p2pMode == 1 ? "Unidirectional"
: "Bidirectional");
ev.Print("USE_GPU_DMA", useDmaCopy, "Using GPU-%s as GPU executor", useDmaCopy ? "DMA" : "GFX");
ev.Print("USE_REMOTE_READ", useRemoteRead, "Using %s as executor", useRemoteRead ? "DST" : "SRC");
printf("\n");
}
}
// Check for deprecated env vars
if (useFineGrain != -999) {
Utils::Print("[ERROR] USE_FINE_GRAIN has been deprecated and replaced by CPU_MEM_TYPE and GPU_MEM_TYPE\n");
return 1;
} }
char const separator = ev.outputToCsv ? ',' : ' ';
...@@ -66,8 +87,8 @@ void PeerToPeerPreset(EnvVars& ev,
// Perform unidirectional / bidirectional
for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++) {
if ((p2pMode == 1 && isBidirectional == 1) ||
(p2pMode == 2 && isBidirectional == 0)) continue;
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write] (GPU-Executor: %s)\n", isBidirectional ? "Bi" : "Uni",
useRemoteRead ? "Remote" : "Local",
...@@ -102,11 +123,10 @@ void PeerToPeerPreset(EnvVars& ev,
// Loop over all possible src/dst pairs
for (int src = 0; src < numDevices; src++) {
int const srcIdx = (src < numCpuDevices ? 0 : 1);
MemType const srcType = (src < numCpuDevices ? cpuMemType : gpuMemType);
int const srcIndex = (src < numCpuDevices ? src : src - numCpuDevices);
std::vector<std::vector<double>> avgBandwidth(isBidirectional + 1);
std::vector<std::vector<double>> minBandwidth(isBidirectional + 1);
std::vector<std::vector<double>> maxBandwidth(isBidirectional + 1);
...@@ -114,18 +134,17 @@ void PeerToPeerPreset(EnvVars& ev,
if (src == numCpuDevices && src != 0) printf("\n");
for (int dst = 0; dst < numDevices; dst++) {
int const dstIdx = (dst < numCpuDevices ? 0 : 1);
MemType const dstType = (dst < numCpuDevices ? cpuMemType : gpuMemType);
int const dstIndex = (dst < numCpuDevices ? dst : dst - numCpuDevices);
// Prepare Transfers
std::vector<Transfer> transfers(isBidirectional + 1);
// SRC -> DST
transfers[0].numBytes = numBytesPerTransfer;
transfers[0].srcs.push_back({srcType, srcIndex});
transfers[0].dsts.push_back({dstType, dstIndex});
transfers[0].exeDevice = {IsGpuMemType(useRemoteRead ? dstType : srcType) ? gpuExeType : EXE_CPU,
(useRemoteRead ? dstIndex : srcIndex)};
transfers[0].exeSubIndex = -1;
...@@ -134,8 +153,8 @@ void PeerToPeerPreset(EnvVars& ev,
// DST -> SRC
if (isBidirectional) {
transfers[1].numBytes = numBytesPerTransfer;
transfers[1].srcs.push_back({dstType, dstIndex});
transfers[1].dsts.push_back({srcType, srcIndex});
transfers[1].exeDevice = {IsGpuMemType(useRemoteRead ? srcType : dstType) ? gpuExeType : EXE_CPU,
(useRemoteRead ? srcIndex : dstIndex)};
transfers[1].exeSubIndex = -1;
...@@ -167,7 +186,7 @@ void PeerToPeerPreset(EnvVars& ev,
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str());
return 1;
}
for (int dir = 0; dir <= isBidirectional; dir++) {
...@@ -175,8 +194,8 @@ void PeerToPeerPreset(EnvVars& ev,
avgBandwidth[dir].push_back(avgBw);
if (!(srcType == dstType && srcIndex == dstIndex)) {
avgBwSum[srcIdx][dstIdx] += avgBw;
avgCount[srcIdx][dstIdx]++;
}
if (ev.showIterations) {
...@@ -209,7 +228,7 @@ void PeerToPeerPreset(EnvVars& ev,
}
for (int dir = 0; dir <= isBidirectional; dir++) {
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, dir ? "<- " : " ->");
if (ev.outputToCsv) printf(",");
for (int dst = 0; dst < numDevices; dst++) {
...@@ -226,7 +245,7 @@ void PeerToPeerPreset(EnvVars& ev,
if (ev.showIterations) {
// minBw
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "min");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const minBw = minBandwidth[dir][i];
...@@ -240,7 +259,7 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n");
// maxBw
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "max");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const maxBw = maxBandwidth[dir][i];
...@@ -254,7 +273,7 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n");
// stddev
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, " sd");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const sd = stdDev[dir][i];
...@@ -271,7 +290,7 @@ void PeerToPeerPreset(EnvVars& ev,
}
if (isBidirectional) {
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "<->");
if (ev.outputToCsv) printf(",");
for (int dst = 0; dst < numDevices; dst++) {
double const sumBw = avgBandwidth[0][dst] + avgBandwidth[1][dst];
...@@ -289,14 +308,14 @@ void PeerToPeerPreset(EnvVars& ev,
if (!ev.outputToCsv) {
printf(" ");
for (int srcType = 0; srcType <= 1; srcType++)
for (int dstType = 0; dstType <= 1; dstType++)
printf(" %cPU->%cPU", srcType == 0 ? 'C' : 'G', dstType == 0 ? 'C' : 'G');
printf("\n");
printf("Averages (During %s):", isBidirectional ? " BiDir" : "UniDir");
for (int srcType = 0; srcType <= 1; srcType++)
for (int dstType = 0; dstType <= 1; dstType++) {
if (avgCount[srcType][dstType])
printf("%10.2f", avgBwSum[srcType][dstType] / avgCount[srcType][dstType]);
else
...@@ -305,4 +324,5 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n\n");
}
}
return 0;
}
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -21,22 +21,26 @@ THE SOFTWARE.
*/
#pragma once
#include <map>
// EnvVars is available to all presets
#include "EnvVars.hpp"
#include "Utilities.hpp"
// Included after EnvVars and Executors
#include "AllToAll.hpp"
#include "AllToAllN.hpp"
#include "AllToAllSweep.hpp"
#include "HealthCheck.hpp"
#include "NicRings.hpp"
#include "OneToAll.hpp"
#include "PeerToPeer.hpp"
#include "Scaling.hpp"
#include "Schmoo.hpp"
#include "Sweep.hpp"
typedef int (*PresetFunc)(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName);
std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{
...@@ -44,8 +48,9 @@ std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{"a2a_n", {AllToAllRdmaPreset, "Tests parallel transfers between all pairs of GPU devices using Nearest NIC RDMA transfers"}},
{"a2asweep", {AllToAllSweepPreset, "Test GFX-based all-to-all transfers swept across different CU and GFX unroll counts"}},
{"healthcheck", {HealthCheckPreset, "Simple bandwidth health check (MI300X series only)"}},
{"nicrings", {NicRingsPreset, "Tests NIC rings created across identical NIC indices across ranks"}},
{"one2all", {OneToAllPreset, "Test all subsets of parallel transfers from one GPU to all others"}},
{"p2p" , {PeerToPeerPreset, "Peer-to-peer device memory bandwidth test"}},
{"rsweep", {SweepPreset, "Randomly sweep through sets of Transfers"}},
{"scaling", {ScalingPreset, "Run scaling test from one GPU to other devices"}},
{"schmoo", {SchmooPreset, "Scaling tests for local/remote read/write/copy"}},
...@@ -63,11 +68,12 @@ void DisplayPresets()
int RunPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
int const argc,
char** const argv,
int& retCode)
{
std::string preset = (argc > 1 ? argv[1] : "");
if (presetFuncMap.count(preset)) {
retCode = (presetFuncMap[preset].first)(ev, numBytesPerTransfer, preset);
return 1;
}
return 0;
...
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
...@@ -20,10 +20,15 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
int ScalingPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Scaling preset currently not supported for multi-node\n");
return 1;
}
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
...@@ -49,7 +54,7 @@ void ScalingPreset(EnvVars& ev,
// Validate env vars
if (localIdx >= numDetectedGpus) {
printf("[ERROR] Cannot execute scaling test with local GPU device %d\n", localIdx);
return 1;
}
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
...@@ -69,12 +74,14 @@ void ScalingPreset(EnvVars& ev,
std::vector<std::pair<double, int>> bestResult(numDevices);
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::vector<Transfer> transfers(1);
Transfer& t = transfers[0];
t.exeDevice = {EXE_GPU_GFX, localIdx};
t.exeSubIndex = -1;
t.numBytes = numBytesPerTransfer;
t.srcs = {{memType, localIdx}};
for (int numSubExec = sweepMin; numSubExec <= sweepMax; numSubExec++) {
t.numSubExecs = numSubExec;
...@@ -84,8 +91,8 @@ void ScalingPreset(EnvVars& ev,
t.dsts = {{i < numCpuDevices ? MEM_CPU : MEM_GPU,
i < numCpuDevices ? i : i - numCpuDevices}};
if (!RunTransfers(cfg, transfers, results)) {
Utils::PrintErrors(results.errResults);
return 1;
}
double bw = results.tfrResults[0].avgBandwidthGbPerSec;
printf("%c%7.2f ", separator, bw);
...@@ -102,4 +109,5 @@ void ScalingPreset(EnvVars& ev,
for (int i = 0; i < numDevices; i++)
printf("%c%7.2f(%3d)", separator, bestResult[i].first, bestResult[i].second);
printf("\n");
return 0;
}