Unverified Commit bbd72a6c authored by gilbertlee-amd's avatar gilbertlee-amd Committed by GitHub

TransferBench v1.66 - Multi-Rank support (#224)

* Adding System singleton to support multi-node (communication and topology)
* Adding multi-node parsing, rank and device wildcard expansion
* Adding multi-node topology, and various support functions
* Adding multi-node consistency validation of Config and Transfers
* Introducing SINGLE_KERNEL=1 to Makefile to speed up compilation during development
* Updating CHANGELOG.  Overhauling wildcard parsing.  Adding dryrun
* Client refactoring.  Introduction of tabular formatted results and a2a multi-rank preset
* Adding MPI support into CMakeFiles
* Cleaning up multi-node topology using TableHelper
* Reducing compile time by removing some kernel variants
* Updating documentation.  Adding nicrings preset
* Adding NIC_FILTER to allow NIC device filtering via regex
* Updating supported memory types
* Fixing P2P preset, and adding some extra memIndex utility functions
parent 26717d50
......@@ -3,6 +3,79 @@
Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).
## v1.66.00
### Added
- Adding multi-node support
- TransferBench now supports multiple nodes through the use of MPI or sockets
- To use MPI, TransferBench must be compiled with MPI support (set MPI_PATH to the location of
an MPI installation). MPI support can be explicitly disabled by setting DISABLE_MPI_COMM=1
- TransferBench can be executed with an MPI launcher, such as mpirun
- To use sockets, the following environment variables must be provided to each process
* TB_RANK: Rank of this process (0-based)
* TB_NUM_RANKS: Total number of processes
* TB_MASTER_ADDR: IP address of rank 0 (Other ranks will connect to rank 0)
* TB_MASTER_PORT: Port for communication (default: 29500)
- Additional debug messages can be enabled by setting TB_VERBOSE=1
- NOTE: It is recommended that one process be launched per node to avoid aliasing of devices
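- To make the socket mode concrete, a hypothetical two-node launch could look like the following (the rank-0 address and choice of preset are placeholders):

```
# On node 0 (rank 0), reachable by the other rank at 10.0.0.1:
TB_RANK=0 TB_NUM_RANKS=2 TB_MASTER_ADDR=10.0.0.1 TB_MASTER_PORT=29500 ./TransferBench a2a

# On node 1 (rank 1):
TB_RANK=1 TB_NUM_RANKS=2 TB_MASTER_ADDR=10.0.0.1 TB_MASTER_PORT=29500 ./TransferBench a2a
```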
- Adding multi-node topology detection
- When running in multi-node mode, TransferBench will try to collect topology information from each
rank, then group ranks into homogeneous configurations.
- This is done by running TransferBench with no arguments (e.g. mpirun -np 2 ./TransferBench)
- Adding multi-node Transfer parsing and wildcard support
- Memory locations have now been extended to support a rank index
* R(memRank)?(memIndex) (where ? is one of the supported memory type characters "CPBDKHGFUN")
(e.g. R2G3 is GPU memory on GPU 3 of rank 2)
- Rank is optional; if not specified, it falls back to the local rank
- Executor locations have been extended to support rank indices as well
* R(exeRank)?(exeIndex){exeSlot}.{exeSubIndex}{exeSubSlot} (where ? is one of the supported executor-types characters "CGDIN")
- exeSlots are only relevant for the EXE_NIC_NEAREST executor, and allow distinguishing between multiple NICs that are equally close to a GPU
- exeSlots are defined by upper-case letters: 'A' for the closest NIC, 'B' for the 2nd closest NIC, etc.
- For example: N0B.4C would execute using the 2nd closest NIC to GPU 0, communicating with the 3rd closest NIC to GPU 4
- Wildcard support:
- To help quickly define sets of transfers, Transfers can now be specified using wildcards
- All of the fields above may be specified:
* Directly with a single value: e.g. R34 -> Rank 34
* Full wildcard: e.g. R* -> Will be replaced by all available ranks
* Ranged wildcard: e.g. R[1,5..7] -> Will be replaced by Rank 1, Rank 5, Rank 6, Rank 7
- Nearest-NIC wildcard
- To simplify nearest-NIC execution, it is not necessary to specify exeIndex/exeSubIndex for the "N" executor
- If exeRank/exeIndex/exeSlot/exeSubIndex/exeSubSlot are all not specified, the Transfer will be expanded to
choose the correct values such that a remote write operation will occur based on SRC/DST mem locations
- For example: (R2G4->N->R4G5) will expand to (R2G4->R2N4.5->R4G5)
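- Collecting the syntax above, a few illustrative Transfer specifications (the first two are the example from this section, the third is a hypothetical use of the ranged wildcard):

```
(R2G4->R2N4.5->R4G5)   # Explicit: rank 2 GPU 4 memory, via NICs, to rank 4 GPU 5 memory
(R2G4->N->R4G5)        # The same transfer written using the nearest-NIC wildcard
(R[1,5..7]G0->G0->C0)  # Ranged rank wildcard: expands over ranks 1, 5, 6 and 7
```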
- Adding dry-run preset
- This new preset is similar to cmdline, except that it only lists the Transfers that would be executed
- The dryrun preset may be useful when using the new wildcard expressions, to verify that the Test
contains the expected set of Transfers
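- For instance, a dry run over a wildcard expression might be invoked like this (the transfer string and size are illustrative):

```
mpirun -np 2 ./TransferBench dryrun 64M "(R*G0->G0->G1)"
```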
- Adding nicrings preset
- This new preset runs parallel transfers forming rings that connect identical NICs across ranks
- Adding NIC_FILTER to allow filtering which NICs are detected; NIC_FILTER accepts regular-expression syntax
- Added new memory types based on latest HIP memory allocation flags
Supported memory locations are:
- C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- P: Pinned host memory (on the NUMA node closest to the indexed GPU, indexed from 0 to [# GPUs - 1])
- B: Coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- K: Uncached pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- H: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
- G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
- F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
- U: Uncached device memory (on GPU device indexed from 0 to [# GPUs - 1])
- N: Null memory (index ignored)
- As a result, the a2a preset deprecates USE_FINE_GRAIN in favor of MEM_TYPE, which allows selecting between various GPU memory types
- A warning message is issued if USE_FINE_GRAIN is used; however, the previous matching functionality remains for now
- The p2p preset similarly deprecates USE_FINE_GRAIN in favor of CPU_MEM_TYPE and GPU_MEM_TYPE
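- As an illustrative configuration-file line using the new host-memory types, following the example.cfg format (the indices are arbitrary):

```
# 1 Transfer using 4 CUs on GPU 0:
# coherent pinned host memory (NUMA 0) -> GPU 0 -> uncached device memory (GPU 1)
1 4 (B0->G0->U1)
```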
### Modified
- Refactored front-end client code to facilitate simpler and more consistent presets.
- Refactored tabular data display to simplify code. Output result tables now use ASCII box-drawing
characters for borders, which helps group data visually. Borders may be disabled by setting SHOW_BORDERS=0
- The All-to-all preset is now multi-rank compatible. When executed on multiple ranks, it runs an
inter-rank all-to-all and then reports the min/max across all ranks. The number of extrema
results shown can be adjusted by setting NUM_RESULTS
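- For instance, with an MPI-enabled build, a two-rank all-to-all run might be launched as:

```
mpirun -np 2 ./TransferBench a2a
```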
### Fixed
- Added guard for ROCm version when using __syncwarp()
- Exiting with non-zero code on fatal errors
## v1.65.00
### Added
- Added warp-level dispatch support via GFX_SE_TYPE environment variable
......
# Copyright (c) 2023-2025 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
......@@ -9,7 +9,7 @@ if (NOT CMAKE_TOOLCHAIN_FILE)
message(STATUS "CMAKE_TOOLCHAIN_FILE: ${CMAKE_TOOLCHAIN_FILE}")
endif()
set(VERSION_STRING "1.65.00")
set(VERSION_STRING "1.66.00")
project(TransferBench VERSION ${VERSION_STRING} LANGUAGES CXX)
## Load CMake modules
......@@ -24,6 +24,7 @@ list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
#==================================================================================================
option(BUILD_LOCAL_GPU_TARGET_ONLY "Build only for GPUs detected on this machine" OFF)
option(ENABLE_NIC_EXEC "Enable RDMA NIC Executor in TransferBench" OFF)
option(ENABLE_MPI_COMM "Enable MPI Communicator support" OFF)
# Default GPU architectures to build
#==================================================================================================
......@@ -129,7 +130,7 @@ endif()
if(DEFINED ENV{DISABLE_NIC_EXEC} AND "$ENV{DISABLE_NIC_EXEC}" STREQUAL "1")
message(STATUS "Disabling NIC Executor support as env. flag DISABLE_NIC_EXEC was enabled")
elseif(NOT ENABLE_NIC_EXEC)
message(STATUS "For CMake builds, NIC executor so requires explicit opt-in by setting CMake flag -DENABLE_NIC_EXEC=1")
message(STATUS "For CMake builds, NIC executor support requires explicit opt-in by setting CMake flag -DENABLE_NIC_EXEC=ON")
message(STATUS "Disabling NIC Executor support")
else()
find_library(IBVERBS_LIBRARY ibverbs)
......@@ -149,6 +150,40 @@ else()
endif()
endif()
## Check for MPI support
set(MPI_PATH "" CACHE PATH "Path to MPI installation (takes priority over system MPI)")
if(NOT ENABLE_MPI_COMM)
message(STATUS "For CMake builds, MPI Communicator support requires explicit opt-in by setting CMake flag -DENABLE_MPI_COMM=ON")
message(STATUS "Disabling MPI Communicator support")
else()
# First check user-specified MPI_PATH (similar to Makefile)
if(MPI_PATH AND EXISTS "${MPI_PATH}/include/mpi.h")
find_library(MPI_LIBRARY NAMES mpi PATHS ${MPI_PATH}/lib NO_DEFAULT_PATH)
if(MPI_LIBRARY)
set(MPI_COMM_FOUND 1)
set(MPI_INCLUDE_DIR "${MPI_PATH}/include")
set(MPI_LINK_DIR "${MPI_PATH}/lib")
message(STATUS "Building with MPI Communicator support (found at MPI_PATH: ${MPI_PATH})")
else()
message(WARNING "Found mpi.h at ${MPI_PATH}/include but could not find MPI library at ${MPI_PATH}/lib")
endif()
else()
# Fall back to find_package
if(MPI_PATH)
message(STATUS "Unable to find mpi.h at ${MPI_PATH}/include, trying find_package")
endif()
find_package(MPI QUIET)
if(MPI_CXX_FOUND)
set(MPI_COMM_FOUND 1)
message(STATUS "Building with MPI Communicator support (found via find_package)")
message(STATUS "- Using MPI include path: ${MPI_CXX_INCLUDE_PATH}")
message(STATUS "- Using MPI library: ${MPI_CXX_LIBRARIES}")
else()
message(WARNING "MPI not found. Please specify appropriate MPI_PATH or install MPI libraries (e.g., OpenMPI or MPICH)")
endif()
endif()
endif()
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY .)
add_executable(TransferBench src/client/Client.cpp)
......@@ -163,6 +198,22 @@ if(IBVERBS_FOUND)
target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY})
target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED)
endif()
if(MPI_COMM_FOUND)
if(TARGET MPI::MPI_CXX)
# Found via find_package
target_include_directories(TransferBench PRIVATE ${MPI_CXX_INCLUDE_DIRS})
target_link_libraries(TransferBench PRIVATE MPI::MPI_CXX)
else()
# Found via MPI_PATH fallback
target_include_directories(TransferBench PRIVATE ${MPI_INCLUDE_DIR})
target_link_libraries(TransferBench PRIVATE ${MPI_LIBRARY})
endif()
target_compile_definitions(TransferBench PRIVATE MPI_COMM_ENABLED)
endif()
if (HAVE_PARALLEL_JOBS)
target_compile_options(TransferBench PRIVATE -parallel-jobs=12)
endif()
target_link_libraries(TransferBench PRIVATE -fgpu-rdc) # Required when linking relocatable device code
target_link_libraries(TransferBench PRIVATE Threads::Threads)
......
#
# Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
#
# Configuration options
ROCM_PATH ?= /opt/rocm
CUDA_PATH ?= /usr/local/cuda
MPI_PATH ?= /usr/local/openmpi
HIPCC ?= $(ROCM_PATH)/bin/amdclang++
NVCC ?= $(CUDA_PATH)/bin/nvcc
# Option to compile with single GFX kernel to drop compilation time
SINGLE_KERNEL ?= 0
# This can be a space separated string of multiple GPU targets
# Default is the native GPU target
GPU_TARGETS ?= native
......@@ -35,9 +39,13 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),)
CXXFLAGS = -I$(ROCM_PATH)/include -I$(ROCM_PATH)/include/hip -I$(ROCM_PATH)/include/hsa
HIPLDFLAGS= -lnuma -L$(ROCM_PATH)/lib -lhsa-runtime64 -lamdhip64
HIPFLAGS = -x hip -D__HIP_PLATFORM_AMD__ -D__HIPCC__ $(GPU_TARGETS_FLAGS)
HIPFLAGS = -Wall -x hip -D__HIP_PLATFORM_AMD__ -D__HIPCC__ $(GPU_TARGETS_FLAGS)
NVFLAGS = -x cu -lnuma -arch=native
ifeq ($(SINGLE_KERNEL), 1)
CXXFLAGS += -DSINGLE_KERNEL
endif
ifeq ($(DEBUG), 0)
COMMON_FLAGS += -O3
else
......@@ -70,8 +78,34 @@ ifeq ($(filter clean,$(MAKECMDGOALS)),)
$(info Building with NIC executor support. Can set DISABLE_NIC_EXEC=1 to disable)
endif
endif
MPI_ENABLED = 0
# Compile with MPI communicator support if
# 1) DISABLE_MPI_COMM is not set to 1
# 2) mpi.h is found in the MPI_PATH
DISABLE_MPI_COMM ?= 0
ifneq ($(DISABLE_MPI_COMM), 1)
ifeq ($(wildcard $(MPI_PATH)/include/mpi.h),)
$(info Unable to find mpi.h at $(MPI_PATH)/include. Please specify appropriate MPI_PATH)
else
MPI_ENABLED = 1
CXXFLAGS += -DMPI_COMM_ENABLED -I$(MPI_PATH)/include
LDFLAGS += -L$(MPI_PATH)/lib -lmpi
ifeq ($(DEBUG), 1)
LDFLAGS += -lmpi_cxx
endif
endif
ifeq ($(MPI_ENABLED), 0)
$(info Building without MPI communicator support)
$(info To use TransferBench with MPI support, install MPI libraries and specify appropriate MPI_PATH)
else
$(info Building with MPI communicator support. Can set DISABLE_MPI_COMM=1 to disable)
endif
endif
endif
.PHONY : all clean
all: $(EXE)
......
# TransferBench
TransferBench is a utility for benchmarking simultaneous copies between user-specified
CPU and GPU devices.
CPU and GPU memory locations using CPUs/GPU kernels/DMA engines/NIC devices.
> [!NOTE]
> The published documentation is available at [TransferBench](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html) in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `TransferBench/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see [Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
......@@ -18,7 +18,7 @@ left_nav_title = f"TransferBench {version_number} Documentation"
# for PDF output on Read the Docs
project = "TransferBench Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved."
version = version_number
release = version_number
......
......@@ -177,9 +177,16 @@ Here is the list of preset configurations that can be used instead of configurat
* - ``cmdline``
- Allows transfers to run from the command line instead of a configuration file
* - ``dryrun``
- Lists the set of transfers to be executed as provided from the command line. This is useful when using wildcards to ensure correctness
* - ``healthcheck``
- Simple health check (supported on AMD Instinct MI300 series only)
* - ``nic_rings``
- Measure performance of NICs set up in a ring across ranks
* - ``p2p``
- Peer-to-peer benchmark test
......
......@@ -50,6 +50,12 @@ Here are the steps to build TransferBench:
If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
NIC executor support will be enabled if IBVerbs is detected and if ``infiniband/verbs.h`` is found in the default include path.
NIC executor support can be disabled explicitly by setting ``DISABLE_NIC_EXEC=1``
MPI support will be enabled if ``mpi.h`` is found in ``MPI_PATH/include/``.
MPI support can be disabled explicitly by setting ``DISABLE_MPI_COMM=1``.
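For example, a Makefile build with MPI support might look like this (the path is a placeholder and should point at your MPI installation)::

    make MPI_PATH=/usr/local/openmpi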
Building documentation
-----------------------
......
......@@ -47,13 +47,17 @@
# Memory locations are specified by one or more (device character / device index) pairs
# Character indicating memory type followed by device index (0-indexed)
# Supported memory locations are:
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# - P: Pinned host memory (on NUMA node, but indexed by closest GPU [#GPUs -1])
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - P: Pinned host memory (on the NUMA node closest to the indexed GPU, indexed from 0 to [# GPUs - 1])
# - B: Coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - D: Non-coherent pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - K: Uncached pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - H: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - U: Uncached device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# Examples:
# 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
......
/*
Copyright (c) 2019-2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -20,23 +20,33 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#include "Client.hpp"
#include "Presets.hpp"
#include "Topology.hpp"
#include <fstream>
int main(int argc, char **argv) {
void DisplayVersion();
void DisplayUsage(char const* cmdName);
using namespace TransferBench;
using namespace TransferBench::Utils;
size_t constexpr DEFAULT_BYTES_PER_TRANSFER = (1<<28);
int main(int argc, char **argv)
{
// Collect environment variables
EnvVars ev;
// Display usage instructions and detected topology
if (argc <= 1) {
if (!ev.outputToCsv) {
DisplayUsage(argv[0]);
DisplayPresets();
if (RankDoesOutput()) {
if (!ev.outputToCsv) {
DisplayVersion();
DisplayUsage(argv[0]);
DisplayPresets();
}
DisplayTopology(ev.outputToCsv, ev.showBorders);
}
DisplayTopology(ev.outputToCsv);
exit(0);
}
......@@ -52,42 +62,73 @@ int main(int argc, char **argv) {
}
}
if (numBytesPerTransfer % 4) {
printf("[ERROR] numBytesPerTransfer (%lu) must be a multiple of 4\n", numBytesPerTransfer);
Print("[ERROR] numBytesPerTransfer (%lu) must be a multiple of 4\n", numBytesPerTransfer);
exit(1);
}
// Display TransferBench version and build configuration
DisplayVersion();
// Run preset benchmark if requested
if (RunPreset(ev, numBytesPerTransfer, argc, argv)) exit(0);
int retCode = 0;
if (RunPreset(ev, numBytesPerTransfer, argc, argv, retCode)) return retCode;
// Read input from command line or configuration file
bool isDryRun = !strcmp(argv[1], "dryrun");
std::vector<std::string> lines;
{
std::string line;
if (!strcmp(argv[1], "cmdline")) {
if (!strcmp(argv[1], "cmdline") || isDryRun) {
for (int i = 3; i < argc; i++)
line += std::string(argv[i]) + " ";
lines.push_back(line);
} else {
std::ifstream cfgFile(argv[1]);
if (!cfgFile.is_open()) {
printf("[ERROR] Unable to open transfer configuration file: [%s]\n", argv[1]);
Print("[ERROR] Unable to open transfer configuration file: [%s]\n", argv[1]);
exit(1);
}
}
while (std::getline(cfgFile, line))
lines.push_back(line);
cfgFile.close();
}
}
// Print environment variables and CSV header
ev.DisplayEnvVars();
if (ev.outputToCsv)
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
TransferBench::ConfigOptions cfgOptions = ev.ToConfigOptions();
TransferBench::TestResults results;
ConfigOptions cfgOptions = ev.ToConfigOptions();
TestResults results;
std::vector<ErrResult> errors;
// Dry run prints off transfers (and errors)
if (isDryRun) {
Print("Transfers to be executed (dry-run):\n");
Print("================================================================================\n");
std::vector<Transfer> transfers;
CheckForError(ParseTransfers(lines[0], transfers));
if (transfers.empty()) {
Print("<none>\n");
} else {
bool isMultiNode = GetNumRanks() > 1;
for (size_t i = 0; i < transfers.size(); i++) {
Transfer const& t = transfers[i];
Print("Transfer %5lu: (%s->", i, MemDevicesToStr(t.srcs).c_str());
if (isMultiNode) Print("R%d", t.exeDevice.exeRank);
Print("%c%d", ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex);
if (t.exeDevice.exeSlot) Print("%c", 'A' + t.exeDevice.exeSlot);
if (t.exeSubIndex != -1) Print(".%d", t.exeSubIndex);
if (t.exeSubSlot != 0) Print("%c", 'A' + t.exeSubSlot);
Print("->%s)\n", MemDevicesToStr(t.dsts).c_str());
}
}
return 0;
}
// Print environment variables and CSV header
if (RankDoesOutput()) {
ev.DisplayEnvVars();
if (ev.outputToCsv)
Print("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
}
// Process each line as a Test
int testNum = 0;
for (std::string const &line : lines) {
......@@ -96,7 +137,7 @@ int main(int argc, char **argv) {
// Parse set of parallel Transfers to execute
std::vector<Transfer> transfers;
CheckForError(TransferBench::ParseTransfers(line, transfers));
CheckForError(ParseTransfers(line, transfers));
if (transfers.empty()) continue;
// Check for variable sub-executors Transfers
......@@ -107,7 +148,7 @@ int main(int argc, char **argv) {
for (auto const& t : transfers) {
if (t.numSubExecs == 0) {
if (t.exeDevice.exeType != EXE_GPU_GFX) {
printf("[ERROR] Variable number of subexecutors is only supported on GFX executors\n");
Print("[ERROR] Variable number of subexecutors is only supported on GFX executors\n");
exit(1);
}
numVariableTransfers++;
......@@ -116,7 +157,7 @@ int main(int argc, char **argv) {
}
}
if (numVariableTransfers > 0 && numVariableTransfers != transfers.size()) {
printf("[ERROR] All or none of the Transfers in the Test must use variable number of Subexecutors\n");
Print("[ERROR] All or none of the Transfers in the Test must use variable number of Subexecutors\n");
exit(1);
}
}
......@@ -140,18 +181,20 @@ int main(int argc, char **argv) {
}
if (maxVarCount == 0) {
if (TransferBench::RunTransfers(cfgOptions, transfers, results)) {
if (RunTransfers(cfgOptions, transfers, results)) {
PrintResults(ev, ++testNum, transfers, results);
}
PrintErrors(results.errResults);
if (RankDoesOutput()) {
PrintErrors(results.errResults);
}
} else {
// Variable subexecutors - Determine how many subexecutors to sweep up to
int maxNumVarSubExec = ev.maxNumVarSubExec;
if (maxNumVarSubExec == 0) {
maxNumVarSubExec = TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0}) / maxVarCount;
maxNumVarSubExec = GetNumSubExecutors({EXE_GPU_GFX, 0}) / maxVarCount;
}
TransferBench::TestResults bestResults;
TestResults bestResults;
std::vector<Transfer> bestTransfers;
for (int numSubExecs = ev.minNumVarSubExec; numSubExecs <= maxNumVarSubExec; numSubExecs++) {
std::vector<Transfer> tempTransfers = transfers;
......@@ -159,8 +202,8 @@ int main(int argc, char **argv) {
if (t.numSubExecs == 0) t.numSubExecs = numSubExecs;
}
TransferBench::TestResults tempResults;
if (!TransferBench::RunTransfers(cfgOptions, tempTransfers, tempResults)) {
TestResults tempResults;
if (!RunTransfers(cfgOptions, tempTransfers, tempResults)) {
PrintErrors(tempResults.errResults);
} else {
if (tempResults.avgTotalBandwidthGbPerSec > bestResults.avgTotalBandwidthGbPerSec) {
......@@ -180,158 +223,49 @@ int main(int argc, char **argv) {
}
}
void DisplayUsage(char const* cmdName)
void DisplayVersion()
{
std::string nicSupport = "";
bool nicSupport = false, mpiSupport = false;
#if NIC_EXEC_ENABLED
nicSupport = " (with NIC support)";
nicSupport = true;
#endif
#if MPI_COMM_ENABLED
mpiSupport = true;
#endif
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("========================================\n");
if (numa_available() == -1) {
printf("[ERROR] NUMA library not supported. Check to see if libnuma has been installed on this system\n");
exit(1);
}
printf("Usage: %s config <N>\n", cmdName);
printf(" config: Either:\n");
printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
printf(" - Name of preset config:\n");
printf(" N : (Optional) Number of bytes to copy per Transfer.\n");
printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
printf(" If 0 is specified, a range of Ns will be benchmarked\n");
printf(" May append a suffix ('K', 'M', 'G') for kilobytes / megabytes / gigabytes\n");
printf("\n");
EnvVars::DisplayUsage();
}
std::string MemDevicesToStr(std::vector<MemDevice> const& memDevices) {
if (memDevices.empty()) return "N";
std::stringstream ss;
for (auto const& m : memDevices)
ss << TransferBench::MemTypeStr[m.memType] << m.memIndex;
return ss.str();
}
void PrintResults(EnvVars const& ev, int const testNum,
std::vector<Transfer> const& transfers,
TransferBench::TestResults const& results)
{
char sep = ev.outputToCsv ? ',' : '|';
size_t numTimedIterations = results.numTimedIterations;
if (!ev.outputToCsv) printf("Test %d:\n", testNum);
// Loop over each executor
for (auto exeInfoPair : results.exeResults) {
ExeDevice const& exeDevice = exeInfoPair.first;
ExeResult const& exeResult = exeInfoPair.second;
ExeType const exeType = exeDevice.exeType;
int32_t const exeIndex = exeDevice.exeIndex;
printf(" Executor: %3s %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
ExeTypeName[exeType], exeIndex, sep, exeResult.avgBandwidthGbPerSec, sep,
exeResult.avgDurationMsec, sep, exeResult.numBytes, sep, exeResult.sumBandwidthGbPerSec);
// Loop over each executor
for (int idx : exeResult.transferIdx) {
Transfer const& t = transfers[idx];
TransferResult const& r = results.tfrResults[idx];
char exeSubIndexStr[32] = "";
if (t.exeSubIndex != -1)
sprintf(exeSubIndexStr, ".%d", t.exeSubIndex);
printf(" Transfer %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %c%03d%s:%03d -> %s\n",
idx, sep,
r.avgBandwidthGbPerSec, sep,
r.avgDurationMsec, sep,
r.numBytes, sep,
MemDevicesToStr(t.srcs).c_str(),
TransferBench::ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
exeSubIndexStr, t.numSubExecs,
MemDevicesToStr(t.dsts).c_str());
// Show per-iteration timing information
if (ev.showIterations) {
// Check that per-iteration information exists
if (r.perIterMsec.size() != numTimedIterations) {
printf("[ERROR] Per iteration timing data unavailable: Expected %lu data points, but have %lu\n",
numTimedIterations, r.perIterMsec.size());
exit(1);
}
// Compute standard deviation and track iterations by speed
std::set<std::pair<double, int>> times;
double stdDevTime = 0;
double stdDevBw = 0;
for (int i = 0; i < numTimedIterations; i++) {
times.insert(std::make_pair(r.perIterMsec[i], i+1));
double const varTime = fabs(r.avgDurationMsec - r.perIterMsec[i]);
stdDevTime += varTime * varTime;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / r.perIterMsec[i] * 1000.0f;
double const varBw = fabs(iterBandwidthGbs - r.avgBandwidthGbPerSec);
stdDevBw += varBw * varBw;
}
stdDevTime = sqrt(stdDevTime / numTimedIterations);
stdDevBw = sqrt(stdDevBw / numTimedIterations);
// Loop over iterations (fastest to slowest)
for (auto& time : times) {
double iterDurationMsec = time.first;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / iterDurationMsec * 1000.0f;
printf(" Iter %03d %c %8.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);
std::string support = "";
if (mpiSupport && nicSupport) support = " (with MPI+NIC support)";
else if (mpiSupport) support = " (with MPI support)";
else if (nicSupport) support = " (with NIC support)";
std::set<int> usedXccs;
if (time.second - 1 < r.perIterCUs.size()) {
printf(" CUs:");
for (auto x : r.perIterCUs[time.second - 1]) {
printf(" %02d:%02d", x.first, x.second);
usedXccs.insert(x.first);
}
}
printf(" XCCs:");
for (auto x : usedXccs)
printf(" %02d", x);
printf("\n");
}
printf(" StandardDev %c %8.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
}
}
std::string multiNodeMode = "";
switch (GetCommMode()) {
case COMM_NONE: multiNodeMode = " (Single-node mode)"; break;
case COMM_SOCKET: multiNodeMode = " (Multi-node via sockets)"; break;
case COMM_MPI: multiNodeMode = " (Multi-node via MPI)"; break;
}
printf(" Aggregate (CPU) %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
sep, results.avgTotalBandwidthGbPerSec,
sep, results.avgTotalDurationMsec,
sep, results.totalBytesTransferred,
sep, results.overheadMsec);
Print("TransferBench v%s.%s%s%s\n", VERSION, CLIENT_VERSION, support.c_str(), multiNodeMode.c_str());
Print("=============================================================================================================\n");
}
void CheckForError(ErrResult const& error)
void DisplayUsage(char const* cmdName)
{
switch (error.errType) {
case ERR_NONE: return;
case ERR_WARN:
printf("[WARN] %s\n", error.errMsg.c_str());
return;
case ERR_FATAL:
printf("[ERROR] %s\n", error.errMsg.c_str());
if (numa_available() == -1) {
Print("[ERROR] NUMA library not supported. Check to see if libnuma has been installed on this system\n");
exit(1);
default:
break;
}
}
void PrintErrors(std::vector<ErrResult> const& errors)
{
bool isFatal = false;
for (auto const& err : errors) {
printf("[%s] %s\n", err.errType == ERR_FATAL ? "ERROR" : "WARN", err.errMsg.c_str());
isFatal |= (err.errType == ERR_FATAL);
}
if (isFatal) exit(1);
}
Print("Usage: %s config <N>\n", cmdName);
Print(" config: Either:\n");
Print(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
Print(" - Name of preset config:\n");
Print(" N : (Optional) Number of bytes to copy per Transfer.\n");
Print(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
Print(" If 0 is specified, a range of Ns will be benchmarked\n");
Print(" May append a suffix ('K', 'M', 'G') for kilobytes / megabytes / gigabytes\n");
Print("\n");
EnvVars::DisplayUsage();
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#pragma once
// TransferBench client version
#define CLIENT_VERSION "00"
#include "TransferBench.hpp"
#include "EnvVars.hpp"
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<28);
char const ExeTypeName[5][4] = {"CPU", "GPU", "DMA", "NIC", "NIC"};
// Display detected hardware
void DisplayTopology(bool outputToCsv);
// Display usage instructions
void DisplayUsage(char const* cmdName);
// Print TransferBench test results
void PrintResults(EnvVars const& ev, int const testNum,
std::vector<Transfer> const& transfers,
TransferBench::TestResults const& results);
// Helper function that converts MemDevices to a string
std::string MemDevicesToStr(std::vector<MemDevice> const& memDevices);
// Helper function to print warning / exit on fatal error
void CheckForError(ErrResult const& error);
// Helper function to print list of errors
void PrintErrors(std::vector<ErrResult> const& errors);
/*
Copyright (c) 2021-2025 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -39,7 +39,8 @@ THE SOFTWARE.
#include <numa.h>
#include <random>
#include <time.h>
#include "Client.hpp"
#define CLIENT_VERSION "00"
#include "TransferBench.hpp"
using namespace TransferBench;
......@@ -69,6 +70,7 @@ public:
int numIterations; // Number of timed iterations to perform. If negative, run for -numIterations seconds instead
int numSubIterations; // Number of subiterations to perform
int numWarmups; // Number of un-timed warmup iterations to perform
int showBorders; // Show ASCII box-drawing characters in tables
int showIterations; // Show per-iteration timing info
int useInteractive; // Pause for user-input before starting transfer loop
......@@ -107,11 +109,11 @@ public:
// NIC options
int ibGidIndex; // GID Index for RoCE NICs
int roceVersion; // RoCE version number
int ipAddressFamily; // IP Address Family
uint8_t ibPort; // NIC port number to be used
int ipAddressFamily; // IP Address Family
int nicChunkBytes; // Number of bytes to send per chunk for RDMA operations
int nicRelaxedOrder; // Use relaxed ordering for RDMA
std::string closestNicStr; // Holds the user-specified list of closest NICs
int roceVersion; // RoCE version number
// Developer features
int gpuMaxHwQueues; // Tracks GPU_MAX_HW_QUEUES environment variable
......@@ -119,14 +121,16 @@ public:
// Constructor that collects values
EnvVars()
{
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
int numDeviceCUs = TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0});
// Try to detect the GPU
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, 0));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
std::string fullName = "";
std::string archName = "";
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
if (numDetectedGpus > 0) {
HIP_CALL(hipGetDeviceProperties(&prop, 0));
fullName = prop.gcnArchName;
archName = fullName.substr(0, fullName.find(':'));
}
// Different hardware pick different GPU kernels
// This performance difference is generally only noticeable when executing fewer CUs
......@@ -156,6 +160,7 @@ public:
numWarmups = GetEnvVar("NUM_WARMUPS" , 3);
outputToCsv = GetEnvVar("OUTPUT_TO_CSV" , 0);
samplingFactor = GetEnvVar("SAMPLING_FACTOR" , 1);
showBorders = GetEnvVar("SHOW_BORDERS" , 1);
showIterations = GetEnvVar("SHOW_ITERATIONS" , 0);
useHipEvents = GetEnvVar("USE_HIP_EVENTS" , 1);
useHsaDma = GetEnvVar("USE_HSA_DMA" , 0);
......@@ -168,8 +173,8 @@ public:
ibPort = GetEnvVar("IB_PORT_NUMBER" , 1);
roceVersion = GetEnvVar("ROCE_VERSION" , 2);
ipAddressFamily = GetEnvVar("IP_ADDRESS_FAMILY" , 4);
nicChunkBytes = GetEnvVar("NIC_CHUNK_BYTES" , 1073741824);
nicRelaxedOrder = GetEnvVar("NIC_RELAX_ORDER" , 1);
closestNicStr = GetEnvVar("CLOSEST_NIC" , "");
gpuMaxHwQueues = GetEnvVar("GPU_MAX_HW_QUEUES" , 4);
......@@ -314,9 +319,6 @@ public:
printf(" ALWAYS_VALIDATE - Validate after each iteration instead of once after all iterations\n");
printf(" BLOCK_BYTES - Controls granularity of how work is divided across subExecutors\n");
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4\n");
#if NIC_EXEC_ENABLED
printf(" CLOSEST_NIC - Comma-separated list of per-GPU closest NIC (default=auto)\n");
#endif
printf(" CU_MASK - CU mask for streams. Can specify ranges e.g '5,10-12,14'\n");
printf(" FILL_COMPRESS - Percentages of 64B lines to be filled by random/1B0/2B0/4B0/32B0\n");
printf(" FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits\n");
......@@ -337,6 +339,7 @@ public:
printf(" MIN_VAR_SUBEXEC - Minimum # of subexecutors to use for variable subExec Transfers\n");
printf(" MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)\n");
#if NIC_EXEC_ENABLED
printf(" NIC_CHUNK_BYTES - Number of bytes to send at a time using NIC (default = 1GB)\n");
printf(" NIC_RELAX_ORDER - Set to non-zero to use relaxed ordering\n");
#endif
printf(" NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead\n");
......@@ -347,6 +350,7 @@ public:
printf(" ROCE_VERSION - RoCE version (default=2)\n");
#endif
printf(" SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHOW_BORDERS - Show ASCII box-drawing characters in tables\n");
printf(" SHOW_ITERATIONS - Show per-iteration timing info\n");
printf(" USE_HIP_EVENTS - Use HIP events for GFX executor timing\n");
printf(" USE_HSA_DMA - Use hsa_amd_async_copy instead of hipMemcpy for non-targeted DMA execution\n");
......@@ -386,8 +390,6 @@ public:
nicSupport = " (with NIC support)";
#endif
if (!outputToCsv) {
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("===============================================================\n");
if (!hideEnv) printf("[Common] (Suppress by setting HIDE_ENV=1)\n");
}
else if (!hideEnv)
......@@ -400,10 +402,6 @@ public:
"Each CU gets a multiple of %d bytes to copy", blockBytes);
Print("BYTE_OFFSET", byteOffset,
"Using byte offset of %d", byteOffset);
#if NIC_EXEC_ENABLED
Print("CLOSEST_NIC", (closestNicStr == "" ? "auto" : "user-input"),
"Per-GPU closest NIC is set as %s", (closestNicStr == "" ? "auto" : closestNicStr.c_str()));
#endif
Print("CU_MASK", getenv("CU_MASK") ? 1 : 0,
"%s", (cuMask.size() ? GetCuMaskDesc().c_str() : "All"));
Print("FILL_COMPRESS", getenv("FILL_COMPRESS") ? 1 : 0,
......@@ -452,6 +450,8 @@ public:
"Using up to %s subexecutors for variable subExec transfers",
maxNumVarSubExec ? std::to_string(maxNumVarSubExec).c_str() : "all available");
#if NIC_EXEC_ENABLED
Print("NIC_CHUNK_BYTES", nicChunkBytes,
"Sending %d bytes at a time for NIC RDMA", nicChunkBytes);
Print("NIC_RELAX_ORDER", nicRelaxedOrder,
"Using %s ordering for NIC RDMA", nicRelaxedOrder ? "relaxed" : "strict");
#endif
......@@ -466,6 +466,7 @@ public:
Print("ROCE_VERSION", roceVersion,
"RoCE version is set to %d", roceVersion);
#endif
Print("SHOW_BORDERS", showBorders, "%s ASCII box-drawing characters in tables", showBorders ? "Showing" : "Hiding");
Print("SHOW_ITERATIONS", showIterations,
"%s per-iteration timing", showIterations ? "Showing" : "Hiding");
Print("USE_HIP_EVENTS", useHipEvents,
......@@ -497,8 +498,17 @@ public:
// Helper function that parses an environment variable or falls back to a default value
static int GetEnvVar(std::string const& varname, int defaultValue)
{
if (getenv(varname.c_str()))
return atoi(getenv(varname.c_str()));
char const* varStr = getenv(varname.c_str());
if (varStr) {
int val = atoi(varStr);
char units = varStr[strlen(varStr)-1];
switch (units) {
case 'G': case 'g': val *= 1024; // fall through
case 'M': case 'm': val *= 1024; // fall through
case 'K': case 'k': val *= 1024;
}
return val;
}
return defaultValue;
}
......@@ -633,27 +643,13 @@ public:
cfg.gfx.waveOrder = gfxWaveOrder;
cfg.gfx.wordSize = gfxWordSize;
cfg.nic.chunkBytes = nicChunkBytes;
cfg.nic.ibGidIndex = ibGidIndex;
cfg.nic.ibPort = ibPort;
cfg.nic.ipAddressFamily = ipAddressFamily;
cfg.nic.useRelaxedOrder = nicRelaxedOrder;
cfg.nic.roceVersion = roceVersion;
std::vector<int> closestNics;
if(closestNicStr != "") {
std::stringstream ss(closestNicStr);
std::string item;
while (std::getline(ss, item, ',')) {
try {
int nic = std::stoi(item);
closestNics.push_back(nic);
} catch (const std::invalid_argument& e) {
printf("[ERROR] Invalid NIC index (%s) by user in %s\n", item.c_str(), closestNicStr.c_str());
exit(1);
}
}
cfg.nic.closestNics = closestNics;
}
return cfg;
}
};
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -22,36 +22,49 @@ THE SOFTWARE.
#include "EnvVars.hpp"
void AllToAllRdmaPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int AllToAllRdmaPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] a2an preset currently not supported for multi-node\n");
return 1;
}
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , 1);
int memTypeIdx = EnvVars::GetEnvVar("MEM_TYPE" , 2);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , -999); // Deprecated
// Deprecated env var check
if (useFineGrain != -999) {
memTypeIdx = useFineGrain ? 2 : 0;
}
MemType memType = Utils::GetGpuMemType(memTypeIdx);
std::string memTypeStr = Utils::GetGpuMemTypeStr(memTypeIdx);
// Print off environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Network Related]\n");
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("USE_FINE_GRAIN" , useFineGrain , "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
printf("\n");
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Network Related]\n");
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("MEM_TYPE" , memTypeIdx , "Using %s memory (%s)", memTypeStr.c_str(), Utils::GetAllGpuMemTypeStr().c_str());
printf("\n");
}
}
// Validate env vars
if (numGpus < 0 || numGpus > numDetectedGpus) {
printf("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
exit(1);
Utils::Print("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
return 1;
}
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::map<std::pair<int, int>, int> reIndex;
std::vector<Transfer> transfers;
......@@ -71,31 +84,31 @@ void AllToAllRdmaPreset(EnvVars& ev,
}
}
printf("GPU-RDMA All-To-All benchmark:\n");
printf("==========================\n");
printf("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n",
Utils::Print("GPU-RDMA All-To-All benchmark:\n");
Utils::Print("==========================\n");
Utils::Print("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n",
numBytesPerTransfer, numQueuePairs, transfers.size());
if (transfers.size() == 0) return;
if (transfers.size() == 0) return 0;
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str());
exit(0);
Utils::Print("%s\n", err.errMsg.c_str());
return 1;
} else {
PrintResults(ev, 1, transfers, results);
Utils::PrintResults(ev, 1, transfers, results);
}
// Print results
char separator = (ev.outputToCsv ? ',' : ' ');
printf("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
printf("==========================================================\n");
printf("SRC\\DST ");
Utils::Print("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
Utils::Print("==========================================================\n");
Utils::Print("SRC\\DST ");
for (int dst = 0; dst < numGpus; dst++)
printf("%cGPU %02d ", separator, dst);
printf(" %cSTotal %cActual\n", separator, separator);
Utils::Print("%cGPU %02d ", separator, dst);
Utils::Print(" %cSTotal %cActual\n", separator, separator);
double totalBandwidthGpu = 0.0;
double minActualBandwidth = std::numeric_limits<double>::max();
......@@ -105,7 +118,7 @@ void AllToAllRdmaPreset(EnvVars& ev,
double rowTotalBandwidth = 0;
int transferCount = 0;
double minBandwidth = std::numeric_limits<double>::max();
printf("GPU %02d", src);
Utils::Print("GPU %02d", src);
for (int dst = 0; dst < numGpus; dst++) {
if (reIndex.count(std::make_pair(src, dst))) {
int const transferIdx = reIndex[std::make_pair(src,dst)];
......@@ -115,28 +128,30 @@ void AllToAllRdmaPreset(EnvVars& ev,
totalBandwidthGpu += r.avgBandwidthGbPerSec;
minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec);
transferCount++;
printf("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
Utils::Print("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
} else {
printf("%c%8s ", separator, "N/A");
Utils::Print("%c%8s ", separator, "N/A");
}
}
double actualBandwidth = minBandwidth * transferCount;
printf(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
Utils::Print(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
minActualBandwidth = std::min(minActualBandwidth, actualBandwidth);
maxActualBandwidth = std::max(maxActualBandwidth, actualBandwidth);
colTotalBandwidth[numGpus+1] += rowTotalBandwidth;
}
printf("\nRTotal");
Utils::Print("\nRTotal");
for (int dst = 0; dst < numGpus; dst++) {
printf("%c%8.3f ", separator, colTotalBandwidth[dst]);
Utils::Print("%c%8.3f ", separator, colTotalBandwidth[dst]);
}
printf(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
Utils::Print(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
separator, minActualBandwidth, separator, maxActualBandwidth);
printf("\n");
Utils::Print("\n");
Utils::Print("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
Utils::Print("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
Utils::Print("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
printf("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
printf("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
printf("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
Utils::PrintErrors(results.errResults);
PrintErrors(results.errResults);
return 0;
}
/*
Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -22,10 +22,15 @@ THE SOFTWARE.
#include "EnvVars.hpp"
void AllToAllSweepPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int AllToAllSweepPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] All to All Sweep preset currently not supported for multi-node\n");
return 1;
}
enum
{
A2A_COPY = 0,
......@@ -172,7 +177,7 @@ void AllToAllSweepPreset(EnvVars& ev,
printf("- Copying %lu bytes between %s pairs of GPUs\n", numBytesPerTransfer, a2aDirect ? "directly connected" : "all");
if (transfers.size() == 0) {
printf("[WARN] No transfers requested. Try adjusting A2A_DIRECT or A2A_LOCAL\n");
return;
return 0;
}
// Execute Transfers
......@@ -227,9 +232,10 @@ void AllToAllSweepPreset(EnvVars& ev,
for (int c : numCusList) {
for (int u : unrollList) {
printf("CUs: %d Unroll %d\n", c, u);
PrintResults(ev, ++testNum, transfers, results[std::make_pair(c,u)]);
Utils::PrintResults(ev, ++testNum, transfers, results[std::make_pair(c,u)]);
}
}
}
}
return 0;
}
......@@ -186,7 +186,7 @@ int TestUnidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
}
......@@ -232,7 +232,7 @@ int TestUnidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
}
......@@ -298,7 +298,7 @@ int TestBidir(int modelId, bool verbose)
}
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
}
......@@ -423,7 +423,7 @@ int TestHbmPerformance(int modelId, bool verbose)
if (verbose) printf(" GPU %02d: Measured %6.2f Limit %6.2f\n", gpuId, measuredBw, limit);
}
} else {
PrintErrors(results.errResults);
Utils::PrintErrors(results.errResults);
}
if (fails.size() == 0) {
......@@ -439,14 +439,19 @@ int TestHbmPerformance(int modelId, bool verbose)
return hasFail;
}
void HealthCheckPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int HealthCheckPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Healthcheck preset currently not supported for multi-node\n");
return 1;
}
// Check for supported platforms
#if defined(__NVCC__)
printf("[WARN] healthcheck preset not supported on NVIDIA hardware\n");
return;
return 0;
#endif
printf("Disclaimer:\n");
......@@ -468,5 +473,5 @@ void HealthCheckPreset(EnvVars& ev,
numFails += TestUnidir(modelId, verbose);
numFails += TestBidir(modelId, verbose);
numFails += TestAllToAll(modelId, verbose);
exit(numFails ? 1 : 0);
return numFails ? 1 : 0;
}
/*
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
int NicRingsPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
// Check for single homogenous group
if (Utils::GetNumRankGroups() > 1) {
Utils::Print("[ERROR] NIC-rings preset can only be run across ranks that are homogeneous\n");
Utils::Print("[ERROR] Run ./TransferBench without any args to display topology information\n");
Utils::Print("[ERROR] NIC_FILTER may also be used to limit NIC visibility\n");
return 1;
}
// Collect topology
int numRanks = TransferBench::GetNumRanks();
// Read in environment variables
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int showDetails = EnvVars::GetEnvVar("SHOW_DETAILS" , 0);
int useCpuMem = EnvVars::GetEnvVar("USE_CPU_MEM" , 0);
int memTypeIdx = EnvVars::GetEnvVar("MEM_TYPE" , 0);
int useRdmaRead = EnvVars::GetEnvVar("USE_RDMA_READ" , 0);
// Print off environment variables
MemType memType = Utils::GetMemType(memTypeIdx, useCpuMem);
std::string memTypeStr = Utils::GetMemTypeStr(memTypeIdx, useCpuMem);
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[NIC-Rings Related]\n");
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("SHOW_DETAILS" , showDetails , "%s full Test details", showDetails ? "Showing" : "Hiding");
ev.Print("USE_CPU_MEM" , useCpuMem , "Using closest %s memory", useCpuMem ? "CPU" : "GPU");
ev.Print("MEM_TYPE" , memTypeIdx , "Using %s memory (%s)", memTypeStr.c_str(), Utils::GetAllMemTypeStr(useCpuMem).c_str());
if (numRanks > 1)
ev.Print("USE_RDMA_READ", useRdmaRead , "Performing RDMA %s", useRdmaRead ? "reads" : "writes");
printf("\n");
}
}
// Prepare list of transfers
int numDevices = TransferBench::GetNumExecutors(useCpuMem ? EXE_CPU : EXE_GPU_GFX);
std::vector<Transfer> transfers;
int numRings = 0;
for (int memIndex = 0; memIndex < numDevices; memIndex++) {
std::vector<int> nicIndices;
if (useCpuMem) {
TransferBench::GetClosestNicsToCpu(nicIndices, memIndex);
} else {
TransferBench::GetClosestNicsToGpu(nicIndices, memIndex);
}
for (int nicIndex : nicIndices) {
numRings++;
for (int currRank = 0; currRank < numRanks; currRank++) {
int nextRank = (currRank + 1) % numRanks;
TransferBench::Transfer transfer;
transfer.srcs.push_back({memType, memIndex, currRank});
transfer.dsts.push_back({memType, memIndex, nextRank});
transfer.exeDevice = {EXE_NIC, nicIndex, useRdmaRead ? nextRank : currRank};
transfer.exeSubIndex = nicIndex;
transfer.numSubExecs = numQueuePairs;
transfer.numBytes = numBytesPerTransfer;
transfers.push_back(transfer);
}
}
}
Utils::Print("NIC Rings benchmark\n");
Utils::Print("==============================\n");
Utils::Print("%d parallel RDMA-%s ring(s) using %s memory across %d ranks\n",
numRings, useRdmaRead ? "read" : "write", memTypeStr.c_str(), numRanks);
Utils::Print("%d queue pairs per NIC. %lu bytes per Transfer. All numbers are GB/s\n",
numQueuePairs, numBytesPerTransfer);
Utils::Print("\n");
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
Utils::Print("%s\n", err.errMsg.c_str());
return 1;
} else if (showDetails) {
Utils::PrintResults(ev, 1, transfers, results);
Utils::Print("\n");
}
// Only ranks that actually do output will compile results
if (!Utils::RankDoesOutput()) return 0;
// Prepare table of results
int numRows = 6 + numRanks;
int numCols = 3 + numRings;
Utils::TableHelper table(numRows, numCols);
// Prepare headers
table.Set(2, 0, " Rank ");
table.Set(2, 1, " Name ");
table.Set(1, numCols-1, " TOTAL ");
table.Set(2, numCols-1, " (GB/s) ");
table.SetColAlignment(1, Utils::TableHelper::ALIGN_LEFT);
for (int rank = 0; rank < numRanks; rank++) {
table.Set(3 + rank, 0, " %d ", rank);
table.Set(3 + rank, 1, " %s ", TransferBench::GetHostname(rank).c_str());
}
table.Set(numRows-3, 1, " MAX (GB/s) ");
table.Set(numRows-2, 1, " AVG (GB/s) ");
table.Set(numRows-1, 1, " MIN (GB/s) ");
for (int row = numRows-3; row < numRows; row++)
table.SetCellAlignment(row, 1, Utils::TableHelper::ALIGN_RIGHT);
table.DrawRowBorder(3);
table.DrawRowBorder(numRows-3);
int colIdx = 2;
int transferIdx = 0;
std::vector<double> rankTotal(numRanks, 0.0);
for (int memIndex = 0; memIndex < numDevices; memIndex++) {
std::vector<int> nicIndices;
if (useCpuMem) {
TransferBench::GetClosestNicsToCpu(nicIndices, memIndex);
table.Set(0, colIdx, " CPU %02d ", memIndex);
} else {
TransferBench::GetClosestNicsToGpu(nicIndices, memIndex);
table.Set(0, colIdx, " GPU %02d ", memIndex);
}
bool isFirst = true;
for (int nicIndex : nicIndices) {
if (isFirst) {
isFirst = false;
table.DrawColBorder(colIdx);
}
table.Set(1, colIdx, " NIC %02d ", nicIndex);
table.Set(2, colIdx, " %s ", TransferBench::GetExecutorName({EXE_NIC, nicIndex}).c_str());
double ringMin = std::numeric_limits<double>::max();
double ringAvg = 0.0;
double ringMax = std::numeric_limits<double>::lowest();
for (int rank = 0; rank < numRanks; rank++) {
double avgBw = results.tfrResults[transferIdx].avgBandwidthGbPerSec;
table.Set(3 + rank, colIdx, " %.2f ", avgBw);
ringMin = std::min(ringMin, avgBw);
ringAvg += avgBw;
ringMax = std::max(ringMax, avgBw);
rankTotal[rank] += avgBw;
transferIdx++;
}
table.Set(numRows-3, colIdx, " %.2f ", ringMax);
table.Set(numRows-2, colIdx, " %.2f ", ringAvg / numRanks);
table.Set(numRows-1, colIdx, " %.2f ", ringMin);
colIdx++;
}
if (!isFirst) {
table.DrawColBorder(colIdx);
}
}
double rankMin = std::numeric_limits<double>::max();
double rankAvg = 0.0;
double rankMax = std::numeric_limits<double>::lowest();
for (int rank = 0; rank < numRanks; rank++) {
table.Set(3 + rank, numCols - 1, " %.2f ", rankTotal[rank]);
rankMin = std::min(rankMin, rankTotal[rank]);
rankAvg += rankTotal[rank];
rankMax = std::max(rankMax, rankTotal[rank]);
}
table.Set(numRows - 3, numCols - 1, " %.2f ", rankMax);
table.Set(numRows - 2, numCols - 1, " %.2f ", rankAvg / numRanks);
table.Set(numRows - 1, numCols - 1, " %.2f ", rankMin);
table.PrintTable(ev.outputToCsv, ev.showBorders);
Utils::Print("\n");
Utils::Print("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
Utils::PrintErrors(results.errResults);
if (Utils::HasDuplicateHostname()) {
printf("[WARN] It is recommended to run TransferBench with one rank per host to avoid potential aliasing of executors\n");
}
return 0;
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -20,14 +20,19 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
void OneToAllPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int OneToAllPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] One-to-All preset currently not supported for multi-node\n");
return 1;
}
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
if (numDetectedGpus < 2) {
printf("[ERROR] One-to-all benchmark requires machine with at least 2 GPUs\n");
exit(1);
return 1;
}
// Collect env vars for this preset
......@@ -61,7 +66,7 @@ void OneToAllPreset(EnvVars& ev,
for (auto ch : sweepExe) {
if (ch != 'G' && ch != 'D') {
printf("[ERROR] Unrecognized executor type '%c' specified\n", ch);
exit(1);
return 1;
}
}
......@@ -98,7 +103,7 @@ void OneToAllPreset(EnvVars& ev,
for (int i = 0; i < numGpuDevices; i++) {
if (bitmask & (1<<i)) {
Transfer t;
CheckForError(TransferBench::CharToExeType(exe, t.exeDevice.exeType));
Utils::CheckForError(TransferBench::CharToExeType(exe, t.exeDevice.exeType));
t.exeDevice.exeIndex = exeIndex;
t.exeSubIndex = -1;
t.numSubExecs = numSubExecs;
......@@ -108,7 +113,7 @@ void OneToAllPreset(EnvVars& ev,
t.srcs.clear();
} else {
t.srcs.resize(1);
CheckForError(TransferBench::CharToMemType(src, t.srcs[0].memType));
Utils::CheckForError(TransferBench::CharToMemType(src, t.srcs[0].memType));
t.srcs[0].memIndex = sweepDir == 0 ? exeIndex : i;
}
......@@ -116,15 +121,15 @@ void OneToAllPreset(EnvVars& ev,
t.dsts.clear();
} else {
t.dsts.resize(1);
CheckForError(TransferBench::CharToMemType(dst, t.dsts[0].memType));
Utils::CheckForError(TransferBench::CharToMemType(dst, t.dsts[0].memType));
t.dsts[0].memIndex = sweepDir == 0 ? i : exeIndex;
}
transfers.push_back(t);
}
}
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
PrintErrors(results.errResults);
exit(1);
Utils::PrintErrors(results.errResults);
return 1;
}
int counter = 0;
......@@ -138,12 +143,13 @@ void OneToAllPreset(EnvVars& ev,
printf(" %d %d", p, numSubExecs);
for (auto i = 0; i < transfers.size(); i++) {
printf(" (%s %c%d %s)",
MemDevicesToStr(transfers[i].srcs).c_str(),
Utils::MemDevicesToStr(transfers[i].srcs).c_str(),
ExeTypeStr[transfers[i].exeDevice.exeType], transfers[i].exeDevice.exeIndex,
MemDevicesToStr(transfers[i].dsts).c_str());
Utils::MemDevicesToStr(transfers[i].dsts).c_str());
}
printf("\n");
}
}
}
return 0;
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -20,39 +20,60 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
void PeerToPeerPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int PeerToPeerPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Peer-to-peer preset currently not supported for multi-node\n");
return 1;
}
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int useDmaCopy = EnvVars::GetEnvVar("USE_GPU_DMA", 0);
int cpuMemTypeIdx = EnvVars::GetEnvVar("CPU_MEM_TYPE", 0);
int gpuMemTypeIdx = EnvVars::GetEnvVar("GPU_MEM_TYPE", 0);
int numCpuDevices = EnvVars::GetEnvVar("NUM_CPU_DEVICES", numDetectedCpus);
int numCpuSubExecs = EnvVars::GetEnvVar("NUM_CPU_SE", 4);
int numGpuDevices = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numGpuSubExecs = EnvVars::GetEnvVar("NUM_GPU_SE", useDmaCopy ? 1 : TransferBench::GetNumSubExecutors({EXE_GPU_GFX, 0}));
int p2pMode = EnvVars::GetEnvVar("P2P_MODE", 0);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN", 0);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN", -999); // Deprecated
int useRemoteRead = EnvVars::GetEnvVar("USE_REMOTE_READ", 0);
MemType cpuMemType = Utils::GetCpuMemType(cpuMemTypeIdx);
MemType gpuMemType = Utils::GetGpuMemType(gpuMemTypeIdx);
// Display environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
int outputToCsv = ev.outputToCsv;
if (!outputToCsv) printf("[P2P Related]\n");
ev.Print("NUM_CPU_DEVICES", numCpuDevices, "Using %d CPUs", numCpuDevices);
ev.Print("NUM_CPU_SE", numCpuSubExecs, "Using %d CPU threads per Transfer", numCpuSubExecs);
ev.Print("NUM_GPU_DEVICES", numGpuDevices, "Using %d GPUs", numGpuDevices);
ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d GPU subexecutors/CUs per Transfer", numGpuSubExecs);
ev.Print("P2P_MODE", p2pMode, "Running %s transfers", p2pMode == 0 ? "Uni + Bi" :
p2pMode == 1 ? "Unidirectional"
: "Bidirectional");
ev.Print("USE_FINE_GRAIN", useFineGrain, "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
ev.Print("USE_GPU_DMA", useDmaCopy, "Using GPU-%s as GPU executor", useDmaCopy ? "DMA" : "GFX");
ev.Print("USE_REMOTE_READ", useRemoteRead, "Using %s as executor", useRemoteRead ? "DST" : "SRC");
printf("\n");
if (Utils::RankDoesOutput()) {
ev.DisplayEnvVars();
if (!ev.hideEnv) {
int outputToCsv = ev.outputToCsv;
if (!outputToCsv) printf("[P2P Related]\n");
ev.Print("CPU_MEM_TYPE" , cpuMemTypeIdx, "Using %s (%s)", Utils::GetCpuMemTypeStr(cpuMemTypeIdx).c_str(), Utils::GetAllCpuMemTypeStr().c_str());
ev.Print("GPU_MEM_TYPE" , gpuMemTypeIdx, "Using %s (%s)", Utils::GetGpuMemTypeStr(gpuMemTypeIdx).c_str(), Utils::GetAllGpuMemTypeStr().c_str());
ev.Print("NUM_CPU_DEVICES", numCpuDevices, "Using %d CPUs", numCpuDevices);
ev.Print("NUM_CPU_SE", numCpuSubExecs, "Using %d CPU threads per Transfer", numCpuSubExecs);
ev.Print("NUM_GPU_DEVICES", numGpuDevices, "Using %d GPUs", numGpuDevices);
ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d GPU subexecutors/CUs per Transfer", numGpuSubExecs);
ev.Print("P2P_MODE", p2pMode, "Running %s transfers", p2pMode == 0 ? "Uni + Bi" :
p2pMode == 1 ? "Unidirectional"
: "Bidirectional");
ev.Print("USE_GPU_DMA", useDmaCopy, "Using GPU-%s as GPU executor", useDmaCopy ? "DMA" : "GFX");
ev.Print("USE_REMOTE_READ", useRemoteRead, "Using %s as executor", useRemoteRead ? "DST" : "SRC");
printf("\n");
}
}
// Check for deprecated env vars
if (useFineGrain != -999) {
Utils::Print("[ERROR] USE_FINE_GRAIN has been deprecated and replaced by CPU_MEM_TYPE and GPU_MEM_TYPE\n");
return 1;
}
char const separator = ev.outputToCsv ? ',' : ' ';
......@@ -66,8 +87,8 @@ void PeerToPeerPreset(EnvVars& ev,
// Perform unidirectional / bidirectional
for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++) {
if (p2pMode == 1 && isBidirectional == 1 ||
p2pMode == 2 && isBidirectional == 0) continue;
if ((p2pMode == 1 && isBidirectional == 1) ||
(p2pMode == 2 && isBidirectional == 0)) continue;
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write] (GPU-Executor: %s)\n", isBidirectional ? "Bi" : "Uni",
useRemoteRead ? "Remote" : "Local",
@@ -102,11 +123,10 @@ void PeerToPeerPreset(EnvVars& ev,
// Loop over all possible src/dst pairs
for (int src = 0; src < numDevices; src++) {
MemType const srcType = (src < numCpuDevices ? MEM_CPU : MEM_GPU);
int const srcIndex = (srcType == MEM_CPU ? src : src - numCpuDevices);
MemType const srcTypeActual = ((useFineGrain && srcType == MEM_CPU) ? MEM_CPU_FINE :
(useFineGrain && srcType == MEM_GPU) ? MEM_GPU_FINE :
srcType);
int const srcIdx = (src < numCpuDevices ? 0 : 1);
MemType const srcType = (src < numCpuDevices ? cpuMemType : gpuMemType);
int const srcIndex = (src < numCpuDevices ? src : src - numCpuDevices);
std::vector<std::vector<double>> avgBandwidth(isBidirectional + 1);
std::vector<std::vector<double>> minBandwidth(isBidirectional + 1);
std::vector<std::vector<double>> maxBandwidth(isBidirectional + 1);
@@ -114,18 +134,17 @@ void PeerToPeerPreset(EnvVars& ev,
if (src == numCpuDevices && src != 0) printf("\n");
for (int dst = 0; dst < numDevices; dst++) {
MemType const dstType = (dst < numCpuDevices ? MEM_CPU : MEM_GPU);
int const dstIndex = (dstType == MEM_CPU ? dst : dst - numCpuDevices);
MemType const dstTypeActual = ((useFineGrain && dstType == MEM_CPU) ? MEM_CPU_FINE :
(useFineGrain && dstType == MEM_GPU) ? MEM_GPU_FINE :
dstType);
int const dstIdx = (dst < numCpuDevices ? 0 : 1);
MemType const dstType = (dst < numCpuDevices ? cpuMemType : gpuMemType);
int const dstIndex = (dst < numCpuDevices ? dst : dst - numCpuDevices);
// Prepare Transfers
std::vector<Transfer> transfers(isBidirectional + 1);
// SRC -> DST
transfers[0].numBytes = numBytesPerTransfer;
transfers[0].srcs.push_back({srcTypeActual, srcIndex});
transfers[0].dsts.push_back({dstTypeActual, dstIndex});
transfers[0].srcs.push_back({srcType, srcIndex});
transfers[0].dsts.push_back({dstType, dstIndex});
transfers[0].exeDevice = {IsGpuMemType(useRemoteRead ? dstType : srcType) ? gpuExeType : EXE_CPU,
(useRemoteRead ? dstIndex : srcIndex)};
transfers[0].exeSubIndex = -1;
@@ -134,8 +153,8 @@ void PeerToPeerPreset(EnvVars& ev,
// DST -> SRC
if (isBidirectional) {
transfers[1].numBytes = numBytesPerTransfer;
transfers[1].srcs.push_back({dstTypeActual, dstIndex});
transfers[1].dsts.push_back({srcTypeActual, srcIndex});
transfers[1].srcs.push_back({dstType, dstIndex});
transfers[1].dsts.push_back({srcType, srcIndex});
transfers[1].exeDevice = {IsGpuMemType(useRemoteRead ? srcType : dstType) ? gpuExeType : EXE_CPU,
(useRemoteRead ? srcIndex : dstIndex)};
transfers[1].exeSubIndex = -1;
@@ -167,7 +186,7 @@ void PeerToPeerPreset(EnvVars& ev,
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str());
exit(1);
return 1;
}
for (int dir = 0; dir <= isBidirectional; dir++) {
@@ -175,8 +194,8 @@ void PeerToPeerPreset(EnvVars& ev,
avgBandwidth[dir].push_back(avgBw);
if (!(srcType == dstType && srcIndex == dstIndex)) {
avgBwSum[srcType][dstType] += avgBw;
avgCount[srcType][dstType]++;
avgBwSum[srcIdx][dstIdx] += avgBw;
avgCount[srcIdx][dstIdx]++;
}
if (ev.showIterations) {
@@ -209,7 +228,7 @@ void PeerToPeerPreset(EnvVars& ev,
}
for (int dir = 0; dir <= isBidirectional; dir++) {
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, dir ? "<- " : " ->");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, dir ? "<- " : " ->");
if (ev.outputToCsv) printf(",");
for (int dst = 0; dst < numDevices; dst++) {
@@ -226,7 +245,7 @@ void PeerToPeerPreset(EnvVars& ev,
if (ev.showIterations) {
// minBw
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, "min");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "min");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const minBw = minBandwidth[dir][i];
@@ -240,7 +259,7 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n");
// maxBw
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, "max");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "max");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const maxBw = maxBandwidth[dir][i];
@@ -254,7 +273,7 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n");
// stddev
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, " sd");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, " sd");
if (ev.outputToCsv) printf(",");
for (int i = 0; i < numDevices; i++) {
double const sd = stdDev[dir][i];
@@ -271,7 +290,7 @@ void PeerToPeerPreset(EnvVars& ev,
}
if (isBidirectional) {
printf("%5s %02d %3s", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex, "<->");
printf("%5s %02d %3s", (srcType == cpuMemType) ? "CPU" : "GPU", srcIndex, "<->");
if (ev.outputToCsv) printf(",");
for (int dst = 0; dst < numDevices; dst++) {
double const sumBw = avgBandwidth[0][dst] + avgBandwidth[1][dst];
@@ -289,14 +308,14 @@ void PeerToPeerPreset(EnvVars& ev,
if (!ev.outputToCsv) {
printf(" ");
for (int srcType : {MEM_CPU, MEM_GPU})
for (int dstType : {MEM_CPU, MEM_GPU})
printf(" %cPU->%cPU", srcType == MEM_CPU ? 'C' : 'G', dstType == MEM_CPU ? 'C' : 'G');
for (int srcType = 0; srcType <= 1; srcType++)
for (int dstType = 0; dstType <= 1; dstType++)
printf(" %cPU->%cPU", srcType == 0 ? 'C' : 'G', dstType == 0 ? 'C' : 'G');
printf("\n");
printf("Averages (During %s):", isBidirectional ? " BiDir" : "UniDir");
for (int srcType : {MEM_CPU, MEM_GPU})
for (int dstType : {MEM_CPU, MEM_GPU}) {
for (int srcType = 0; srcType <= 1; srcType++)
for (int dstType = 0; dstType <= 1; dstType++) {
if (avgCount[srcType][dstType])
printf("%10.2f", avgBwSum[srcType][dstType] / avgCount[srcType][dstType]);
else
@@ -305,4 +324,5 @@ void PeerToPeerPreset(EnvVars& ev,
printf("\n\n");
}
}
return 0;
}
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -21,22 +21,26 @@ THE SOFTWARE.
*/
#pragma once
#include <map>
// EnvVars is available to all presets
#include "EnvVars.hpp"
#include "Utilities.hpp"
// Included after EnvVars and Executors
#include "AllToAll.hpp"
#include "AllToAllN.hpp"
#include "AllToAllSweep.hpp"
#include "HealthCheck.hpp"
#include "NicRings.hpp"
#include "OneToAll.hpp"
#include "PeerToPeer.hpp"
#include "Scaling.hpp"
#include "Schmoo.hpp"
#include "Sweep.hpp"
#include <map>
typedef void (*PresetFunc)(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName);
typedef int (*PresetFunc)(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName);
std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{
@@ -44,8 +48,9 @@ std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{"a2a_n", {AllToAllRdmaPreset, "Tests parallel transfers between all pairs of GPU devices using Nearest NIC RDMA transfers"}},
{"a2asweep", {AllToAllSweepPreset, "Test GFX-based all-to-all transfers swept across different CU and GFX unroll counts"}},
{"healthcheck", {HealthCheckPreset, "Simple bandwidth health check (MI300X series only)"}},
{"nicrings", {NicRingsPreset, "Tests NIC rings created across identical NIC indices across ranks"}},
{"one2all", {OneToAllPreset, "Test all subsets of parallel transfers from one GPU to all others"}},
{"p2p" , {PeerToPeerPreset, " Peer-to-peer device memory bandwidth test"}},
{"p2p" , {PeerToPeerPreset, "Peer-to-peer device memory bandwidth test"}},
{"rsweep", {SweepPreset, "Randomly sweep through sets of Transfers"}},
{"scaling", {ScalingPreset, "Run scaling test from one GPU to other devices"}},
{"schmoo", {SchmooPreset, "Scaling tests for local/remote read/write/copy"}},
@@ -63,11 +68,12 @@ void DisplayPresets()
int RunPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
int const argc,
char** const argv)
char** const argv,
int& retCode)
{
std::string preset = (argc > 1 ? argv[1] : "");
if (presetFuncMap.count(preset)) {
(presetFuncMap[preset].first)(ev, numBytesPerTransfer, preset);
retCode = (presetFuncMap[preset].first)(ev, numBytesPerTransfer, preset);
return 1;
}
return 0;
......
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -20,10 +20,15 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
void ScalingPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
int ScalingPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
if (TransferBench::GetNumRanks() > 1) {
Utils::Print("[ERROR] Scaling preset currently not supported for multi-node\n");
return 1;
}
int numDetectedCpus = TransferBench::GetNumExecutors(EXE_CPU);
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
@@ -49,7 +54,7 @@ void ScalingPreset(EnvVars& ev,
// Validate env vars
if (localIdx >= numDetectedGpus) {
printf("[ERROR] Cannot execute scaling test with local GPU device %d\n", localIdx);
exit(1);
return 1;
}
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
@@ -69,12 +74,14 @@ void ScalingPreset(EnvVars& ev,
std::vector<std::pair<double, int>> bestResult(numDevices);
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::vector<Transfer> transfers(1);
Transfer& t = transfers[0];
t.exeDevice = {EXE_GPU_GFX, localIdx};
t.exeSubIndex = -1;
t.numBytes = numBytesPerTransfer;
t.srcs = {{MEM_GPU, localIdx}};
t.srcs = {{memType, localIdx}};
for (int numSubExec = sweepMin; numSubExec <= sweepMax; numSubExec++) {
t.numSubExecs = numSubExec;
@@ -84,8 +91,8 @@ void ScalingPreset(EnvVars& ev,
t.dsts = {{i < numCpuDevices ? MEM_CPU : MEM_GPU,
i < numCpuDevices ? i : i - numCpuDevices}};
if (!RunTransfers(cfg, transfers, results)) {
PrintErrors(results.errResults);
exit(1);
Utils::PrintErrors(results.errResults);
return 1;
}
double bw = results.tfrResults[0].avgBandwidthGbPerSec;
printf("%c%7.2f ", separator, bw);
@@ -102,4 +109,5 @@ void ScalingPreset(EnvVars& ev,
for (int i = 0; i < numDevices; i++)
printf("%c%7.2f(%3d)", separator, bestResult[i].first, bestResult[i].second);
printf("\n");
return 0;
}