Unverified Commit 4e2be38c authored by gilbertlee-amd, committed by GitHub

Update rocm-rel-6.4 to use TransferBench v1.62.00 (#183)

* updating metadata

* v1.58.00 Fixing DMA copy-on-engine (#152)

* Leo's review

* Update use-transferbench.rst

* Update Doxyfile

* Refining API library

* Update TransferBench.hpp

* Update TransferBench.hpp

* Update TransferBench.hpp

* Bump rocm-docs-core from 1.9.2 to 1.11.0 in /docs/sphinx (#153)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.11.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.11.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#155)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* External CI: enable CI triggers

* Apply suggestions from code review
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Update LICENSE.md

* Bump rocm-docs-core from 1.12.0 to 1.13.0 in /docs/sphinx (#160)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench V1.59 (#162)

Adding NIC execution capabilities and fixing various bugs introduced by the header-only-library refactor
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Adding ability to specify A2A_MODE=numSrcs:numDsts (#164)
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Fixing specific DMA engine transfers, enabling GFX_SINGLE_TEAM=1 by default (#166)

* Bump rocm-docs-core from 1.13.0 to 1.15.0 in /docs/sphinx (#165)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.13.0 to 1.15.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.13.0...v1.15.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx (#171)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.15.0 to 1.17.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench v1.61 (#174)
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx (#176)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx (#177)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.18.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench v1.62.00 (#181)

* Adding non-temporal loads and stores via GFX_TEMPORAL
* Adding additional summary details to a2a preset
* Add SHOW_MIN_ONLY for a2asweep preset
* Adding new P CPU memory type which is indexed by closest GPU

* Bump rocm-docs-core from 1.18.2 to 1.20.0 in /docs/sphinx (#180)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.2 to 1.20.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.20.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: srawat <120587655+SwRaw@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Su <danielsu@amd.com>
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
parent 3ea2f226
resources:
  repositories:
    - repository: pipelines_repo
      type: github
      endpoint: ROCm
      name: ROCm/ROCm
variables:
  - group: common
  - template: /.azuredevops/variables-global.yml@pipelines_repo
trigger:
  batch: true
  branches:
    include:
      - develop
      - mainline
  paths:
    exclude:
      - .github
      - docs
      - '.*.yaml'
      - '.*.yml'
      - '*.md'
      - LICENSE
pr:
  autoCancel: true
  branches:
    include:
      - develop
      - mainline
  paths:
    exclude:
      - .github
      - docs
      - '.*.yaml'
      - '.*.yml'
      - '*.md'
      - LICENSE
  drafts: false
jobs:
  - template: ${{ variables.CI_COMPONENT_PATH }}/TransferBench.yml@pipelines_repo
......@@ -3,6 +3,72 @@
Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).
## v1.62.00
### Added
- Adding GFX_TEMPORAL to allow for use of non-temporal loads/stores
- (0 = none [default], 1 = load, 2 = store, 3 = both)
- Adding "P" memory type which maps to CPU memory but is indexed by closest GPU
- For example, P4 refers to CPU memory on NUMA node closest to GPU 4
### Modified
- Adding some additional summary details to a2a preset
## v1.61.00
### Added
- Added a2a_n preset which conducts alltoall GPU-to-GPU transfers over nearest NIC executors
- Re-implemented GFX_BLOCK_ORDER which allows for control over how threadblocks of multiple transfers are ordered
- 0 = sequential, 1 = interleaved, 2 = random
- Added a2asweep preset which tries various CU/unroll options for GFX-executed all-to-all
- Rewrite main GID index detection logic
- Show the GID index and description in the topology table, which is helpful for debugging
- Added GFX_WORD_SIZE to allow for different packed float sizes to use for GFX kernel. Must be either 4 (default), 2 or 1
### Fixed
- Avoid build errors for CMake and Makefile if the infiniband/verbs.h header is not present, and disable the NIC executor in that case
- Have a priority list of which GID entry to go for instead of hardcoding choices based on underdocumented user input (such as RoCE version and IP address family)
- Use link-local when it is the only choice (i.e. when routing information is not available beyond local link)
## v1.60.00
### Modified
- Reverted GFX_SINGLE_TEAM default back to 1
### Fixed
- Fixed bug where peer memory access was not enabled for DMA transfers, which would break specific DMA engine transfers
## v1.59.01
### Added
- The a2a preset A2A_MODE variable has been enhanced to allow customizing the number of srcs/dsts to use.
  This is specified by setting A2A_MODE to numSrcs:numDsts. Extra destinations past the first will be "local" writes
  (i.e. if one sets A2A_MODE=1:3, then transfers will follow this pattern: Fx Gx FyFxFx)
  to simulate the conditions normally seen in collective algorithms such as ring-based AllReduce
## v1.59.00
### Added
- Adding in support for the NIC executor, which allows RDMA copies on NICs that support IBVerbs.
  By default, the NIC executor will be enabled if IBVerbs is found in the dynamic linker cache
- NIC executor can be indexed in two methods
- "I" Ix.y will use NIC x as the source and NIC y as the destination.
E.g. (G0 I0.5 G4)
- "N" Nx.y will use NIC closest to GPU x as source, and NIC closest to GPU y as destination
E.g. (G0 N0.4 N4)
- The closest NIC can be overridden by the environment variable CLOSEST_NIC, which should be a comma-separated
list of NIC indices to use for the corresponding GPU
- This feature can be explicitly disabled at compile time by specifying DISABLE_NIC_EXEC=1
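
A minimal usage sketch of driving the NIC executor through the TransferBench header API (assumptions: the library was built with NIC support and a default-constructed `ConfigOptions` is usable as-is; the calls mirror the preset sources later in this change):

    // Copies 256M from GPU 0 to GPU 1 over the NICs nearest to each GPU using 2 queue pairs
    #include <cstdio>
    #include <vector>
    #include "TransferBench.hpp"
    using namespace TransferBench;

    int main() {
      Transfer t;
      t.numBytes    = 256 << 20;
      t.srcs.push_back({MEM_GPU_FINE, 0});   // read fine-grained memory on GPU 0
      t.dsts.push_back({MEM_GPU_FINE, 1});   // write fine-grained memory on GPU 1
      t.exeDevice   = {EXE_NIC_NEAREST, 0};  // source NIC: the one closest to GPU 0
      t.exeSubIndex = 1;                     // destination NIC: the one closest to GPU 1
      t.numSubExecs = 2;                     // 2 queue pairs

      ConfigOptions cfg;                     // assumption: default options are acceptable
      TestResults results;
      std::vector<Transfer> transfers{t};
      if (RunTransfers(cfg, transfers, results))
        printf("%.3f GB/s\n", results.tfrResults[0].avgBandwidthGbPerSec);
      return 0;
    }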
### Modified
- Changing default data size to 256M from 64M
- Adding NUM_QUEUE_PAIRS which enables NIC traffic in A2A. Each GPU will talk to the next GPU via the closest NIC
- Sweep preset now saves last sweep run configuration to /tmp/lastSweep.cfg and can be changed via SWEEP_FILE
### Fixed
- Fixed bug with reporting when using subiterations
- Fixed bug with per-Transfer data size specification
- Fixed bug when using XCC preferred table
## v1.58.00
### Fixed
- Fixed broken specific DMA-engine copies
## v1.57.01
### Added
- Re-added "scaling" GPU GFX preset benchmark, which tests copies from GPU to other devices using varying
......
# Copyright (c) 2023-2024 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2023-2025 Advanced Micro Devices, Inc. All rights reserved.
if (DEFINED ENV{ROCM_PATH})
set(ROCM_PATH "$ENV{ROCM_PATH}" CACHE STRING "ROCm install directory")
......@@ -7,7 +7,7 @@ else()
endif()
cmake_minimum_required(VERSION 3.5)
project(TransferBench VERSION 1.57.0 LANGUAGES CXX)
project(TransferBench VERSION 1.62.00 LANGUAGES CXX)
# Default GPU architectures to build
#==================================================================================================
......@@ -56,6 +56,23 @@ set( CMAKE_CXX_FLAGS "${flags_str} ${CMAKE_CXX_FLAGS}")
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -L${ROCM_PATH}/lib")
include_directories(${ROCM_PATH}/include)
find_library(IBVERBS_LIBRARY ibverbs)
find_path(IBVERBS_INCLUDE_DIR infiniband/verbs.h)
if (DEFINED ENV{DISABLE_NIC_EXEC})
message(STATUS "Disabling NIC Executor support")
elseif(IBVERBS_LIBRARY AND IBVERBS_INCLUDE_DIR)
message(STATUS "Found ibverbs: ${IBVERBS_LIBRARY}. Building with NIC executor support. Can set DISABLE_NIC_EXEC=1 to disable")
add_definitions(-DNIC_EXEC_ENABLED)
link_libraries(ibverbs)
else()
if (NOT IBVERBS_LIBRARY)
message(WARNING "IBVerbs library not found")
elseif (NOT IBVERBS_INCLUDE_DIR)
message(WARNING "infiniband/verbs.h not found")
endif()
message(WARNING "Building without NIC executor support. To use the TransferBench RDMA executor, check if your system has NICs, the NIC drivers are installed, and libibverbs-dev is installed")
endif()
link_libraries(numa hsa-runtime64 pthread)
set (CMAKE_RUNTIME_OUTPUT_DIRECTORY .)
add_executable(TransferBench src/client/Client.cpp)
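# Example (illustrative): the NIC executor check above reads the DISABLE_NIC_EXEC
# environment variable, so configuring and building without NIC support looks like:
#   DISABLE_NIC_EXEC=1 cmake -S . -B build && cmake --build build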
......
Copyright (c) 2019-2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
#
# Copyright (c) 2019-2024 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
#
# Configuration options
......@@ -12,8 +12,10 @@ NVCC=$(CUDA_PATH)/bin/nvcc
# Compile TransferBenchCuda if nvcc detected
ifeq ("$(shell test -e $(NVCC) && echo found)", "found")
EXE=TransferBenchCuda
CXX=$(NVCC)
else
EXE=TransferBench
CXX=$(HIPCC)
endif
CXXFLAGS = -I$(ROCM_PATH)/include -lnuma -L$(ROCM_PATH)/lib -lhsa-runtime64
......@@ -21,13 +23,40 @@ NVFLAGS = -x cu -lnuma -arch=native
COMMON_FLAGS = -O3 -I./src/header -I./src/client -I./src/client/Presets
LDFLAGS += -lpthread
# Compile RDMA executor if
# 1) DISABLE_NIC_EXEC is not set to 1
# 2) IBVerbs is found in the Dynamic Linker cache
# 3) infiniband/verbs.h is found in the default include path
NIC_ENABLED = 0
ifneq ($(DISABLE_NIC_EXEC),1)
ifeq ("$(shell ldconfig -p | grep -c ibverbs)", "0")
$(info lib IBVerbs not found)
else ifeq ("$(shell echo '#include <infiniband/verbs.h>' | $(CXX) -E - 2>/dev/null | grep -c 'infiniband/verbs.h')", "0")
$(info infiniband/verbs.h not found)
else
LDFLAGS += -libverbs -DNIC_EXEC_ENABLED
NVFLAGS += -libverbs -DNIC_EXEC_ENABLED
NIC_ENABLED = 1
endif
ifeq ($(NIC_ENABLED), 0)
$(info To use the TransferBench RDMA executor, check if your system has NICs, the NIC drivers are installed, and libibverbs-dev is installed)
endif
endif
all: $(EXE)
TransferBench: ./src/client/Client.cpp $(shell find -regex ".*\.\hpp")
TransferBench: ./src/client/Client.cpp $(shell find -regex ".*\.\hpp") NicStatus
$(HIPCC) $(CXXFLAGS) $(COMMON_FLAGS) $< -o $@ $(LDFLAGS)
TransferBenchCuda: ./src/client/Client.cpp $(shell find -regex ".*\.\hpp")
TransferBenchCuda: ./src/client/Client.cpp $(shell find -regex ".*\.\hpp") NicStatus
$(NVCC) $(NVFLAGS) $(COMMON_FLAGS) $< -o $@ $(LDFLAGS)
clean:
rm -f *.o ./TransferBench ./TransferBenchCuda
NicStatus:
ifeq ($(NIC_ENABLED), 1)
$(info Building with NIC executor support. Can set DISABLE_NIC_EXEC=1 to disable)
else
$(info Building without NIC executor support)
endif
-----
API
-----
.. doxygenindex::
......@@ -775,7 +775,7 @@ WARN_LOGFILE =
# spaces. See also FILE_PATTERNS and EXTENSION_MAPPING
# Note: If this tag is empty the current directory is searched.
INPUT = ../../src/ \
INPUT = ../../src/header \
../../src/include
# This tag can be used to specify the character encoding of the source files
......
.. meta::
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: Using TransferBench, TransferBench Usage, TransferBench How To, API, ROCm, documentation, HIP
:keywords: TransferBench usage, TransferBench how to, TransferBench user guide, TransferBench user manual
.. _using-transferbench:
......
.. meta::
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: TransferBench, API, ROCm, documentation, HIP
:keywords: Benchmarking utility, Memory transfers, Device transfers
****************************
TransferBench documentation
......
.. meta::
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: Build TransferBench, Install TransferBench, API, ROCm, HIP
:keywords: Build TransferBench, Install TransferBench
.. _install-transferbench:
......
.. meta::
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: TransferBench API, TransferBench library, documentation, HIP
:keywords: TransferBench library, TransferBench functions, Transferbench API, Transferbench interface
.. _transferbench-api:
......
rocm-docs-core==1.9.2
rocm-docs-core==1.20.0
......@@ -8,6 +8,13 @@ accessible-pygments==0.0.5
# via pydata-sphinx-theme
alabaster==0.7.16
# via sphinx
asttokens==3.0.0
# via stack-data
attrs==25.1.0
# via
# jsonschema
# jupyter-cache
# referencing
babel==2.15.0
# via
# pydata-sphinx-theme
......@@ -25,9 +32,17 @@ cffi==1.16.0
charset-normalizer==3.3.2
# via requests
click==8.1.7
# via sphinx-external-toc
# via
# jupyter-cache
# sphinx-external-toc
comm==0.2.2
# via ipykernel
cryptography==42.0.7
# via pyjwt
debugpy==1.8.12
# via ipykernel
decorator==5.1.1
# via ipython
deprecated==1.2.14
# via pygithub
docutils==0.21.2
......@@ -36,36 +51,104 @@ docutils==0.21.2
# myst-parser
# pydata-sphinx-theme
# sphinx
exceptiongroup==1.2.2
# via ipython
executing==2.2.0
# via stack-data
fastjsonschema==2.19.1
# via rocm-docs-core
# via
# nbformat
# rocm-docs-core
gitdb==4.0.11
# via gitpython
gitpython==3.1.43
# via rocm-docs-core
greenlet==3.1.1
# via sqlalchemy
idna==3.7
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==8.6.1
# via
# jupyter-cache
# myst-nb
ipykernel==6.29.5
# via myst-nb
ipython==8.31.0
# via
# ipykernel
# myst-nb
jedi==0.19.2
# via ipython
jinja2==3.1.4
# via
# myst-parser
# sphinx
jsonschema==4.23.0
# via nbformat
jsonschema-specifications==2024.10.1
# via jsonschema
jupyter-cache==1.0.1
# via myst-nb
jupyter-client==8.6.3
# via
# ipykernel
# nbclient
jupyter-core==5.7.2
# via
# ipykernel
# jupyter-client
# nbclient
# nbformat
markdown-it-py==3.0.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==2.1.5
# via jinja2
matplotlib-inline==0.1.7
# via
# ipykernel
# ipython
mdit-py-plugins==0.4.1
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
myst-parser==3.0.1
myst-nb==1.1.2
# via rocm-docs-core
myst-parser==3.0.1
# via myst-nb
nbclient==0.10.2
# via
# jupyter-cache
# myst-nb
nbformat==5.10.4
# via
# jupyter-cache
# myst-nb
# nbclient
nest-asyncio==1.6.0
# via ipykernel
packaging==24.0
# via
# ipykernel
# pydata-sphinx-theme
# sphinx
parso==0.8.4
# via jedi
pexpect==4.9.0
# via ipython
platformdirs==4.3.6
# via jupyter-core
prompt-toolkit==3.0.50
# via ipython
psutil==6.1.1
# via ipykernel
ptyprocess==0.7.0
# via pexpect
pure-eval==0.2.3
# via stack-data
pycparser==2.22
# via cffi
pydata-sphinx-theme==0.15.2
......@@ -77,23 +160,42 @@ pygithub==2.3.0
pygments==2.18.0
# via
# accessible-pygments
# ipython
# pydata-sphinx-theme
# sphinx
pyjwt[crypto]==2.8.0
# via pygithub
pynacl==1.5.0
# via pygithub
python-dateutil==2.9.0.post0
# via jupyter-client
pyyaml==6.0.1
# via
# jupyter-cache
# myst-nb
# myst-parser
# rocm-docs-core
# sphinx-external-toc
pyzmq==26.2.1
# via
# ipykernel
# jupyter-client
referencing==0.36.2
# via
# jsonschema
# jsonschema-specifications
requests==2.32.2
# via
# pygithub
# sphinx
rocm-docs-core==1.9.2
rocm-docs-core==1.20.0
# via -r requirements.in
rpds-py==0.22.3
# via
# jsonschema
# referencing
six==1.17.0
# via python-dateutil
smmap==5.0.1
# via gitdb
snowballstemmer==2.2.0
......@@ -103,6 +205,7 @@ soupsieve==2.5
sphinx==7.3.7
# via
# breathe
# myst-nb
# myst-parser
# pydata-sphinx-theme
# rocm-docs-core
......@@ -133,15 +236,43 @@ sphinxcontrib-qthelp==1.0.7
# via sphinx
sphinxcontrib-serializinghtml==1.1.10
# via sphinx
sqlalchemy==2.0.37
# via jupyter-cache
stack-data==0.6.3
# via ipython
tabulate==0.9.0
# via jupyter-cache
tomli==2.0.1
# via sphinx
tornado==6.4.2
# via
# ipykernel
# jupyter-client
traitlets==5.14.3
# via
# comm
# ipykernel
# ipython
# jupyter-client
# jupyter-core
# matplotlib-inline
# nbclient
# nbformat
typing-extensions==4.11.0
# via
# ipython
# myst-nb
# pydata-sphinx-theme
# pygithub
# referencing
# sqlalchemy
urllib3==2.2.1
# via
# pygithub
# requests
wcwidth==0.2.13
# via prompt-toolkit
wrapt==1.16.0
# via deprecated
zipp==3.21.0
# via importlib-metadata
......@@ -13,6 +13,7 @@
# 1) CPU CPU thread
# 2) GPU GPU threadblock/Compute Unit (CU)
# 3) DMA N/A. (May only be used for copies (single SRC/DST)
# 4) NIC Queue Pair
# Each single line in the configuration file defines a set of Transfers (a Test) to run in parallel
......@@ -37,6 +38,8 @@
# - C: CPU-executed (Indexed from 0 to # NUMA nodes - 1)
# - G: GPU-executed (Indexed from 0 to # GPUs - 1)
# - D: DMA-executor (Indexed from 0 to # GPUs - 1)
# - I#.#: NIC executor (Indexed from 0 to # NICs - 1)
# - N#.#: Nearest NIC executor (Indexed from 0 to # GPUs - 1)
# dstMemL : Destination memory locations (Where the data is to be written to)
# bytesL : Number of bytes to copy (0 means use command-line specified size)
# Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
......@@ -50,13 +53,17 @@
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# - P: Pinned host memory (on NUMA node, but indexed by closest GPU [#GPUs -1])
# Examples:
# 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
# 1 4 (C1->G2->G0) Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
# 2 4 G0->G0->G1 G1->G1->G0   Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
# -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 SEs, and 2Mb from GPU1 to GPU0 with 2 SEs
# 1 2 (F0->I0.2->F1) Uses 2 QPs to transfer data from GPU0 via NIC0 to GPU1 via NIC2
# 1 1 (F0->N0.1->F1) Uses 1 QP to transfer data from GPU0 via GPU0's closest NIC to GPU1 via GPU1's closest NIC
# -2 (G0->N0.1->G1 2 128M) (G1->N1.0->G0 1 256M) Uses Nearest NIC executor to copy 128Mb from GPU0 to GPU1 with 2 QPs,
# and 256Mb from GPU1 to GPU0 with 1 QP
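# 1 4 (P1->G1->G1)            Uses 4 CUs on GPU1 to copy from the pinned CPU memory closest to GPU1 into GPU1 (illustrative example of the "P" memory type)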
# Round brackets and arrows '->' may be included for human clarity, but will be ignored and are unnecessary
# Lines starting with # will be ignored. Lines starting with ## will be echoed to output
......
......@@ -121,13 +121,23 @@ int main(int argc, char **argv) {
}
}
// Track which transfers already have numBytes specified
std::vector<bool> bytesSpecified(transfers.size());
bool hasUnspecified = false;
for (int i = 0; i < transfers.size(); i++) {
bytesSpecified[i] = (transfers[i].numBytes != 0);
if (transfers[i].numBytes == 0) hasUnspecified = true;
}
// Run the specified numbers of bytes otherwise generate a range of values
for (size_t bytes = (1<<10); bytes <= (1<<29); bytes *= 2) {
size_t deltaBytes = std::max(1UL, bytes / ev.samplingFactor);
size_t currBytes = (numBytesPerTransfer == 0) ? bytes : numBytesPerTransfer;
do {
for (auto& t : transfers)
t.numBytes = currBytes;
for (int i = 0; i < transfers.size(); i++) {
if (!bytesSpecified[i])
transfers[i].numBytes = currBytes;
}
if (maxVarCount == 0) {
if (TransferBench::RunTransfers(cfgOptions, transfers, results)) {
......@@ -162,17 +172,21 @@ int main(int argc, char **argv) {
PrintResults(ev, ++testNum, bestTransfers, bestResults);
PrintErrors(bestResults.errResults);
}
if (numBytesPerTransfer != 0) break;
if (numBytesPerTransfer != 0 || !hasUnspecified) break;
currBytes += deltaBytes;
} while (currBytes < bytes * 2);
if (numBytesPerTransfer != 0) break;
if (numBytesPerTransfer != 0 || !hasUnspecified) break;
}
}
}
void DisplayUsage(char const* cmdName)
{
printf("TransferBench v%s.%s\n", TransferBench::VERSION, CLIENT_VERSION);
std::string nicSupport = "";
#if NIC_EXEC_ENABLED
nicSupport = " (with NIC support)";
#endif
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("========================================\n");
if (numa_available() == -1) {
......@@ -218,7 +232,7 @@ void PrintResults(EnvVars const& ev, int const testNum,
ExeType const exeType = exeDevice.exeType;
int32_t const exeIndex = exeDevice.exeIndex;
printf(" Executor: %3s %02d %c %7.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
printf(" Executor: %3s %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %-7.3f GB/s (sum)\n",
ExeTypeName[exeType], exeIndex, sep, exeResult.avgBandwidthGbPerSec, sep,
exeResult.avgDurationMsec, sep, exeResult.numBytes, sep, exeResult.sumBandwidthGbPerSec);
......@@ -230,14 +244,15 @@ void PrintResults(EnvVars const& ev, int const testNum,
char exeSubIndexStr[32] = "";
if (t.exeSubIndex != -1)
sprintf(exeSubIndexStr, ".%d", t.exeSubIndex);
printf(" Transfer %02d %c %7.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %s%02d%s:%03d -> %s\n",
printf(" Transfer %02d %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c %s -> %c%03d%s:%03d -> %s\n",
idx, sep,
r.avgBandwidthGbPerSec, sep,
r.avgDurationMsec, sep,
r.numBytes, sep,
MemDevicesToStr(t.srcs).c_str(), ExeTypeName[exeType], exeIndex,
exeSubIndexStr, t.numSubExecs, MemDevicesToStr(t.dsts).c_str());
MemDevicesToStr(t.srcs).c_str(),
TransferBench::ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
exeSubIndexStr, t.numSubExecs,
MemDevicesToStr(t.dsts).c_str());
// Show per-iteration timing information
if (ev.showIterations) {
......@@ -269,7 +284,7 @@ void PrintResults(EnvVars const& ev, int const testNum,
for (auto& time : times) {
double iterDurationMsec = time.first;
double iterBandwidthGbs = (t.numBytes / 1.0E9) / iterDurationMsec * 1000.0f;
printf(" Iter %03d %c %7.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);
printf(" Iter %03d %c %8.3f GB/s %c %8.3f ms %c", time.second, sep, iterBandwidthGbs, sep, iterDurationMsec, sep);
std::set<int> usedXccs;
if (time.second - 1 < r.perIterCUs.size()) {
......@@ -285,11 +300,11 @@ void PrintResults(EnvVars const& ev, int const testNum,
printf(" %02d", x);
printf("\n");
}
printf(" StandardDev %c %7.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
printf(" StandardDev %c %8.3f GB/s %c %8.3f ms %c\n", sep, stdDevBw, sep, stdDevTime, sep);
}
}
}
printf(" Aggregate (CPU) %c %7.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
printf(" Aggregate (CPU) %c %8.3f GB/s %c %8.3f ms %c %12lu bytes %c Overhead: %.3f ms\n",
sep, results.avgTotalBandwidthGbPerSec,
sep, results.avgTotalDurationMsec,
sep, results.totalBytesTransferred,
......
......@@ -23,14 +23,14 @@ THE SOFTWARE.
#pragma once
// TransferBench client version
#define CLIENT_VERSION "01"
#define CLIENT_VERSION "00"
#include "TransferBench.hpp"
#include "EnvVars.hpp"
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<26);
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<28);
char const ExeTypeName[4][4] = {"CPU", "GPU", "DMA", "IBV"};
char const ExeTypeName[5][4] = {"CPU", "GPU", "DMA", "NIC", "NIC"};
// Display detected hardware
void DisplayTopology(bool outputToCsv);
......
/*
Copyright (c) 2021-2024 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2021-2025 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -84,14 +84,17 @@ public:
int useHsaDma; // Use hsa_amd_async_copy instead of hipMemcpy for non-targeted DMA executions
// GFX options
int gfxBlockOrder; // How threadblocks for multiple Transfers are ordered 0=sequential 1=interleaved
int gfxBlockSize; // Size of each threadblock (must be multiple of 64)
vector<uint32_t> cuMask; // Bit-vector representing the CU mask
vector<vector<int>> prefXccTable; // Specifies XCC to use for given exe->dst pair
int gfxTemporal; // Non-temporal load/store mode (0=none, 1=load, 2=store, 3=both)
int gfxUnroll; // GFX-kernel unroll factor
int useHipEvents; // Use HIP events for timing GFX/DMA Executor
int useSingleStream; // Use a single stream per GPU GFX executor instead of stream per Transfer
int gfxSingleTeam; // Team all subExecutors across the data array
int gfxWaveOrder; // GFX-kernel wavefront ordering
int gfxWordSize; // GFX-kernel packed data size (4=DWORDx4, 2=DWORDx2, 1=DWORDx1)
// Client options
int hideEnv; // Skip printing environment variable
......@@ -100,6 +103,14 @@ public:
int outputToCsv; // Output in CSV format
int samplingFactor; // Affects how many different values of N are generated (when N set to 0)
// NIC options
int ibGidIndex; // GID Index for RoCE NICs
int roceVersion; // RoCE version number
int ipAddressFamily; // IP Address Family
uint8_t ibPort; // NIC port number to be used
int nicRelaxedOrder; // Use relaxed ordering for RDMA
std::string closestNicStr; // Holds the user-specified list of closest NICs
// Developer features
int gpuMaxHwQueues; // Tracks GPU_MAX_HW_QUEUES environment variable
......@@ -127,10 +138,13 @@ public:
alwaysValidate = GetEnvVar("ALWAYS_VALIDATE" , 0);
blockBytes = GetEnvVar("BLOCK_BYTES" , 256);
byteOffset = GetEnvVar("BYTE_OFFSET" , 0);
gfxBlockOrder = GetEnvVar("GFX_BLOCK_ORDER" , 0);
gfxBlockSize = GetEnvVar("GFX_BLOCK_SIZE" , 256);
gfxSingleTeam = GetEnvVar("GFX_SINGLE_TEAM" , 0);
gfxSingleTeam = GetEnvVar("GFX_SINGLE_TEAM" , 1);
gfxTemporal = GetEnvVar("GFX_TEMPORAL" , 0);
gfxUnroll = GetEnvVar("GFX_UNROLL" , defaultGfxUnroll);
gfxWaveOrder = GetEnvVar("GFX_WAVE_ORDER" , 0);
gfxWordSize = GetEnvVar("GFX_WORD_SIZE" , 4);
hideEnv = GetEnvVar("HIDE_ENV" , 0);
minNumVarSubExec = GetEnvVar("MIN_VAR_SUBEXEC" , 1);
maxNumVarSubExec = GetEnvVar("MAX_VAR_SUBEXEC" , 0);
......@@ -147,8 +161,16 @@ public:
validateDirect = GetEnvVar("VALIDATE_DIRECT" , 0);
validateSource = GetEnvVar("VALIDATE_SOURCE" , 0);
ibGidIndex = GetEnvVar("IB_GID_INDEX" ,-1);
ibPort = GetEnvVar("IB_PORT_NUMBER" , 1);
roceVersion = GetEnvVar("ROCE_VERSION" , 2);
ipAddressFamily = GetEnvVar("IP_ADDRESS_FAMILY" , 4);
nicRelaxedOrder = GetEnvVar("NIC_RELAX_ORDER" , 1);
closestNicStr = GetEnvVar("CLOSEST_NIC" , "");
gpuMaxHwQueues = GetEnvVar("GPU_MAX_HW_QUEUES" , 4);
// Check for fill pattern
char* pattern = getenv("FILL_PATTERN");
if (pattern != NULL) {
......@@ -270,27 +292,55 @@ public:
}
}
static std::string ToStr(std::vector<int> const& values) {
std::string result = "";
bool isFirst = true;
for (int v : values) {
if (isFirst) isFirst = false;
else result += ",";
result += std::to_string(v);
}
return result;
}
// Display info on the env vars that can be used
static void DisplayUsage()
{
printf("Environment variables:\n");
printf("======================\n");
printf(" ALWAYS_VALIDATE - Validate after each iteration instead of once after all iterations\n");
printf(" BLOCK_SIZE - # of threads per threadblock (Must be multiple of 64)\n");
printf(" BLOCK_BYTES - Controls granularity of how work is divided across subExecutors\n");
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4\n");
#if NIC_EXEC_ENABLED
printf(" CLOSEST_NIC - Comma-separated list of per-GPU closest NIC (default=auto)\n");
#endif
printf(" CU_MASK - CU mask for streams. Can specify ranges e.g '5,10-12,14'\n");
printf(" FILL_PATTERN - Big-endian pattern for source data, specified in hex digits. Must be even # of digits\n");
printf(" GFX_BLOCK_ORDER - How blocks for transfers are ordered. 0=sequential, 1=interleaved\n");
printf(" GFX_BLOCK_SIZE - # of threads per threadblock (Must be multiple of 64)\n");
printf(" GFX_TEMPORAL - Use of non-temporal loads or stores (0=none 1=loads 2=stores 3=both)\n");
printf(" GFX_UNROLL - Unroll factor for GFX kernel (0=auto), must be less than %d\n", TransferBench::GetIntAttribute(ATR_GFX_MAX_UNROLL));
printf(" GFX_SINGLE_TEAM - Have subexecutors work together on full array instead of working on disjoint subarrays\n");
printf(" GFX_WAVE_ORDER - Stride pattern for GFX kernel (0=UWC,1=UCW,2=WUC,3=WCU,4=CUW,5=CWU)\n");
printf(" GFX_WORD_SIZE - GFX kernel packed data size (4=DWORDx4, 2=DWORDx2, 1=DWORDx1)\n");
printf(" HIDE_ENV - Hide environment variable value listing\n");
#if NIC_EXEC_ENABLED
printf(" IB_GID_INDEX - Required for RoCE NICs (default=-1/auto)\n");
printf(" IB_PORT_NUMBER - RDMA port count for RDMA NIC (default=1)\n");
printf(" IP_ADDRESS_FAMILY - IP address family (4=v4, 6=v6, default=v4)\n");
#endif
printf(" MIN_VAR_SUBEXEC - Minumum # of subexecutors to use for variable subExec Transfers\n");
printf(" MAX_VAR_SUBEXEC - Maximum # of subexecutors to use for variable subExec Transfers (0 for device limits)\n");
#if NIC_EXEC_ENABLED
printf(" NIC_RELAX_ORDER - Set to non-zero to use relaxed ordering");
#endif
printf(" NUM_ITERATIONS - # of timed iterations per test. If negative, run for this many seconds instead\n");
printf(" NUM_SUBITERATIONS - # of sub-iterations to run per iteration. Must be non-negative\n");
printf(" NUM_WARMUPS - # of untimed warmup iterations per test\n");
printf(" OUTPUT_TO_CSV - Outputs to CSV format if set\n");
#if NIC_EXEC_ENABLED
printf(" ROCE_VERSION - RoCE version (default=2)\n");
#endif
printf(" SAMPLING_FACTOR - Add this many samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHOW_ITERATIONS - Show per-iteration timing info\n");
printf(" USE_HIP_EVENTS - Use HIP events for GFX executor timing\n");
......@@ -301,6 +351,7 @@ public:
printf(" VALIDATE_SOURCE - Validate GPU src memory immediately after preparation\n");
}
void Print(std::string const& name, int32_t const value, const char* format, ...) const
{
printf("%-20s%s%12d%s", name.c_str(), outputToCsv ? "," : " = ", value, outputToCsv ? "," : " : ");
......@@ -325,9 +376,12 @@ public:
void DisplayEnvVars() const
{
int numGpuDevices = TransferBench::GetNumExecutors(EXE_GPU_GFX);
std::string nicSupport = "";
#if NIC_EXEC_ENABLED
nicSupport = " (with NIC support)";
#endif
if (!outputToCsv) {
printf("TransferBench Client v%s Backend v%s\n", CLIENT_VERSION, TransferBench::VERSION);
printf("TransferBench v%s.%s%s\n", TransferBench::VERSION, CLIENT_VERSION, nicSupport.c_str());
printf("===============================================================\n");
if (!hideEnv) printf("[Common] (Suppress by setting HIDE_ENV=1)\n");
}
......@@ -341,15 +395,27 @@ public:
"Each CU gets a mulitple of %d bytes to copy", blockBytes);
Print("BYTE_OFFSET", byteOffset,
"Using byte offset of %d", byteOffset);
#if NIC_EXEC_ENABLED
Print("CLOSEST_NIC", (closestNicStr == "" ? "auto" : "user-input"),
"Per-GPU closest NIC is set as %s", (closestNicStr == "" ? "auto" : closestNicStr.c_str()));
#endif
Print("CU_MASK", getenv("CU_MASK") ? 1 : 0,
"%s", (cuMask.size() ? GetCuMaskDesc().c_str() : "All"));
Print("FILL_PATTERN", getenv("FILL_PATTERN") ? 1 : 0,
"%s", (fillPattern.size() ? getenv("FILL_PATTERN") : TransferBench::GetStrAttribute(ATR_SRC_PREP_DESCRIPTION).c_str()));
Print("GFX_BLOCK_ORDER", gfxBlockOrder,
"Thread block ordering: %s", gfxBlockOrder == 0 ? "Sequential" : "Interleaved");
Print("GFX_BLOCK_SIZE", gfxBlockSize,
"Threadblock size of %d", gfxBlockSize);
Print("GFX_SINGLE_TEAM", gfxSingleTeam,
"%s", (gfxSingleTeam ? "Combining CUs to work across entire data array" :
"Each CUs operates on its own disjoint subarray"));
Print("GFX_TEMPORAL", gfxTemporal,
"%s", (gfxTemporal == 0 ? "Not using non-temporal loads/stores" :
gfxTemporal == 1 ? "Using non-temporal loads" :
gfxTemporal == 2 ? "Using non-temporal stores" :
"Using non-temporal loads and stores"));
Print("GFX_UNROLL", gfxUnroll,
"Using GFX unroll factor of %d", gfxUnroll);
Print("GFX_WAVE_ORDER", gfxWaveOrder,
......@@ -359,11 +425,27 @@ public:
gfxWaveOrder == 3 ? "Wavefront,CU,Unroll" :
gfxWaveOrder == 4 ? "CU,Unroll,Wavefront" :
"CU,Wavefront,Unroll"));
Print("GFX_WORD_SIZE", gfxWordSize,
"Using GFX word size of %d (DWORDx%d)", gfxWordSize, gfxWordSize);
#if NIC_EXEC_ENABLED
Print("IP_ADDRESS_FAMILY", ipAddressFamily,
"IP address family is set to IPv%d", ipAddressFamily);
Print("IB_GID_INDEX", ibGidIndex,
"RoCE GID index is set to %s", (ibGidIndex < 0 ? "auto" : std::to_string(ibGidIndex).c_str()));
Print("IB_PORT_NUMBER", ibPort,
"IB port number is set to %d", ibPort);
#endif
Print("MIN_VAR_SUBEXEC", minNumVarSubExec,
"Using at least %d subexecutor(s) for variable subExec tranfers", minNumVarSubExec);
Print("MAX_VAR_SUBEXEC", maxNumVarSubExec,
"Using up to %s subexecutors for variable subExec transfers",
maxNumVarSubExec ? std::to_string(maxNumVarSubExec).c_str() : "all available");
#if NIC_EXEC_ENABLED
Print("NIC_RELAX_ORDER", nicRelaxedOrder,
"Using %s ordering for NIC RDMA", nicRelaxedOrder ? "relaxed" : "strict");
#endif
Print("NUM_ITERATIONS", numIterations,
(numIterations == 0) ? "Running infinitely" :
"Running %d %s", abs(numIterations), (numIterations > 0 ? " timed iteration(s)" : "seconds(s) per Test"));
......@@ -371,6 +453,10 @@ public:
"Running %s subiterations", (numSubIterations == 0 ? "infinite" : std::to_string(numSubIterations)).c_str());
Print("NUM_WARMUPS", numWarmups,
"Running %d warmup iteration(s) per Test", numWarmups);
#if NIC_EXEC_ENABLED
Print("ROCE_VERSION", roceVersion,
"RoCE version is set to %d", roceVersion);
#endif
Print("SHOW_ITERATIONS", showIterations,
"%s per-iteration timing", showIterations ? "Showing" : "Hiding");
Print("USE_HIP_EVENTS", useHipEvents,
......@@ -381,7 +467,6 @@ public:
"Running in %s mode", useInteractive ? "interactive" : "non-interactive");
Print("USE_SINGLE_STREAM", useSingleStream,
"Using single stream per GFX %s", useSingleStream ? "device" : "Transfer");
if (getenv("XCC_PREF_TABLE")) {
printf("%36s: Preferred XCC Table (XCC_PREF_TABLE)\n", "");
printf("%36s: ", "");
......@@ -408,6 +493,31 @@ public:
return defaultValue;
}
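// Returns a list of ints parsed from a comma-separated environment variable that may
// contain ranges, e.g. "1,2,4-6" yields {1,2,4,5,6}; used for list-valued variables
// such as UNROLLS and NUM_CUS in the a2asweep preset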
static std::vector<int> GetEnvVarArray(std::string const& varname, std::vector<int> const& defaultValue)
{
if (getenv(varname.c_str())) {
char* rangeStr = getenv(varname.c_str());
std::set<int> values;
char* token = strtok(rangeStr, ",");
while (token) {
int start, end;
if (sscanf(token, "%d-%d", &start, &end) == 2) {
for (int i = start; i <= end; i++) values.insert(i);
} else if (sscanf(token, "%d", &start) == 1) {
values.insert(start);
} else {
printf("[ERROR] Unrecognized token [%s]\n", token);
exit(1);
}
token = strtok(NULL, ",");
}
std::vector<int> result;
for (auto v : values) result.push_back(v);
return result;
}
return defaultValue;
}
static std::string GetEnvVar(std::string const& varname, std::string const& defaultValue)
{
if (getenv(varname.c_str()))
......@@ -470,15 +580,39 @@ public:
cfg.dma.useHipEvents = useHipEvents;
cfg.dma.useHsaCopy = useHsaDma;
cfg.gfx.blockOrder = gfxBlockOrder;
cfg.gfx.blockSize = gfxBlockSize;
cfg.gfx.cuMask = cuMask;
cfg.gfx.prefXccTable = prefXccTable;
cfg.gfx.unrollFactor = gfxUnroll;
cfg.gfx.temporalMode = gfxTemporal;
cfg.gfx.useHipEvents = useHipEvents;
cfg.gfx.useMultiStream = !useSingleStream;
cfg.gfx.useSingleTeam = gfxSingleTeam;
cfg.gfx.waveOrder = gfxWaveOrder;
cfg.gfx.wordSize = gfxWordSize;
cfg.nic.ibGidIndex = ibGidIndex;
cfg.nic.ibPort = ibPort;
cfg.nic.ipAddressFamily = ipAddressFamily;
cfg.nic.useRelaxedOrder = nicRelaxedOrder;
cfg.nic.roceVersion = roceVersion;
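// CLOSEST_NIC, when set, is a comma-separated list of NIC indices (one entry per GPU)
// that overrides the automatically detected closest NIC, e.g. CLOSEST_NIC=0,0,1,1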
std::vector<int> closestNics;
if(closestNicStr != "") {
std::stringstream ss(closestNicStr);
std::string item;
while (std::getline(ss, item, ',')) {
try {
int nic = std::stoi(item);
closestNics.push_back(nic);
} catch (const std::invalid_argument& e) {
printf("[ERROR] Invalid NIC index (%s) by user in %s\n", item.c_str(), closestNicStr.c_str());
exit(1);
}
}
cfg.nic.closestNics = closestNics;
}
return cfg;
}
};
......
......@@ -30,9 +30,10 @@ void AllToAllPreset(EnvVars& ev,
{
A2A_COPY = 0,
A2A_READ_ONLY = 1,
A2A_WRITE_ONLY = 2
A2A_WRITE_ONLY = 2,
A2A_CUSTOM = 3,
};
char a2aModeStr[3][20] = {"Copy", "Read-Only", "Write-Only"};
char a2aModeStr[4][20] = {"Copy", "Read-Only", "Write-Only", "Custom"};
// Force single-stream mode for all-to-all benchmark
ev.useSingleStream = 1;
......@@ -45,21 +46,39 @@ void AllToAllPreset(EnvVars& ev,
// Collect env vars for this preset
int a2aDirect = EnvVars::GetEnvVar("A2A_DIRECT" , 1);
int a2aLocal = EnvVars::GetEnvVar("A2A_LOCAL" , 0);
int a2aMode = EnvVars::GetEnvVar("A2A_MODE" , 0);
int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 0);
int numSubExecs = EnvVars::GetEnvVar("NUM_SUB_EXEC" , 8);
int useDmaExec = EnvVars::GetEnvVar("USE_DMA_EXEC" , 0);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , 1);
int useRemoteRead = EnvVars::GetEnvVar("USE_REMOTE_READ", 0);
// A2A_MODE may be 0,1,2 or else custom numSrcs:numDsts
int numSrcs, numDsts;
int a2aMode = 0;
if (getenv("A2A_MODE") && sscanf(getenv("A2A_MODE"), "%d:%d", &numSrcs, &numDsts) == 2) {
a2aMode = A2A_CUSTOM;
} else {
a2aMode = EnvVars::GetEnvVar("A2A_MODE", 0);
if (a2aMode < 0 || a2aMode > 2) {
printf("[ERROR] a2aMode must be between 0 and 2, or else numSrcs:numDsts\n");
exit(1);
}
numSrcs = (a2aMode == A2A_WRITE_ONLY ? 0 : 1);
numDsts = (a2aMode == A2A_READ_ONLY ? 0 : 1);
}
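// Example (illustrative): A2A_MODE=1:3 yields numSrcs=1, numDsts=3, so each Transfer
// below performs one read from the executing GPU and three writes (one to the remote
// GPU plus two extra "local" writes), mimicking ring-AllReduce-style traffic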
// Print off environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Related]\n");
ev.Print("A2A_DIRECT" , a2aDirect , a2aDirect ? "Only using direct links" : "Full all-to-all");
ev.Print("A2A_LOCAL" , a2aLocal , "%s local transfers", a2aLocal ? "Include" : "Exclude");
ev.Print("A2A_MODE" , a2aMode , a2aModeStr[a2aMode]);
ev.Print("A2A_MODE" , (a2aMode == A2A_CUSTOM) ? std::to_string(numSrcs) + ":" + std::to_string(numDsts) : std::to_string(a2aMode),
(a2aMode == A2A_CUSTOM) ? (std::to_string(numSrcs) + " read(s) " +
std::to_string(numDsts) + " write(s)").c_str(): a2aModeStr[a2aMode]);
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("NUM_SUB_EXEC" , numSubExecs , "Using %d subexecutors/CUs per Transfer", numSubExecs);
ev.Print("USE_DMA_EXEC" , useDmaExec , "Using %s executor", useDmaExec ? "DMA" : "GFX");
ev.Print("USE_FINE_GRAIN" , useFineGrain , "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
......@@ -68,19 +87,16 @@ void AllToAllPreset(EnvVars& ev,
}
// Validate env vars
if (a2aMode < 0 || a2aMode > 2) {
printf("[ERROR] a2aMode must be between 0 and 2\n");
exit(1);
}
if (numGpus < 0 || numGpus > numDetectedGpus) {
printf("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
exit(1);
}
if (useDmaExec && (numSrcs != 1 || numDsts != 1)) {
printf("[ERROR] DMA execution can only be used for copies (A2A_MODE=0)\n");
exit(1);
}
// Collect the number of GPU devices to use
int const numSrcs = (a2aMode == A2A_WRITE_ONLY ? 0 : 1);
int const numDsts = (a2aMode == A2A_READ_ONLY ? 0 : 1);
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
ExeType exeType = useDmaExec ? EXE_GPU_DMA : EXE_GPU_GFX;
......@@ -103,8 +119,11 @@ void AllToAllPreset(EnvVars& ev,
// Build Transfer and add it to list
TransferBench::Transfer transfer;
transfer.numBytes = numBytesPerTransfer;
if (numSrcs) transfer.srcs.push_back({memType, i});
for (int x = 0; x < numSrcs; x++) transfer.srcs.push_back({memType, i});
// When using multiple destinations, the additional destinations are "local"
if (numDsts) transfer.dsts.push_back({memType, j});
for (int x = 1; x < numDsts; x++) transfer.dsts.push_back({memType, i});
transfer.exeDevice = {exeType, (useRemoteRead ? j : i)};
transfer.exeSubIndex = -1;
transfer.numSubExecs = numSubExecs;
......@@ -114,6 +133,23 @@ void AllToAllPreset(EnvVars& ev,
}
}
// Create a ring using NICs
std::vector<int> nicTransferIdx(numGpus);
if (numQueuePairs > 0) {
int numNics = TransferBench::GetNumExecutors(EXE_NIC);
for (int i = 0; i < numGpus; i++) {
TransferBench::Transfer transfer;
transfer.numBytes = numBytesPerTransfer;
transfer.srcs.push_back({memType, i});
transfer.dsts.push_back({memType, (i+1) % numGpus});
transfer.exeDevice = {TransferBench::EXE_NIC_NEAREST, i};
transfer.exeSubIndex = (i+1) % numGpus;
transfer.numSubExecs = numQueuePairs;
nicTransferIdx[i] = transfers.size();
transfers.push_back(transfer);
}
}
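// Illustrative: with NUM_QUEUE_PAIRS=4 on an 8-GPU system, this adds eight extra Transfers
// in which GPU i sends to GPU (i+1) % 8 through the nearest NICs using 4 queue pairs;
// they appear as the extra NIC column in the summary table below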
printf("GPU-GFX All-To-All benchmark:\n");
printf("==========================\n");
printf("- Copying %lu bytes between %s pairs of GPUs using %d CUs (%lu Transfers)\n",
......@@ -133,20 +169,24 @@ void AllToAllPreset(EnvVars& ev,
// Print results
char separator = (ev.outputToCsv ? ',' : ' ');
printf("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
printf("==========================================================\n");
printf("\nSummary: [%lu bytes per Transfer] [%s:%d] [%d Read(s) %d Write(s)]\n",
numBytesPerTransfer, useDmaExec ? "DMA" : "GFX", numSubExecs, numSrcs, numDsts);
printf("===========================================================================\n");
printf("SRC\\DST ");
for (int dst = 0; dst < numGpus; dst++)
printf("%cGPU %02d ", separator, dst);
if (numQueuePairs > 0)
printf("%cNIC(%02d QP)", separator, numQueuePairs);
printf(" %cSTotal %cActual\n", separator, separator);
double totalBandwidthGpu = 0.0;
double minExecutorBandwidth = std::numeric_limits<double>::max();
double maxExecutorBandwidth = 0.0;
std::vector<double> colTotalBandwidth(numGpus+1, 0.0);
double minActualBandwidth = std::numeric_limits<double>::max();
double maxActualBandwidth = 0.0;
std::vector<double> colTotalBandwidth(numGpus+2, 0.0);
for (int src = 0; src < numGpus; src++) {
double rowTotalBandwidth = 0;
double executorBandwidth = 0;
int transferCount = 0;
double minBandwidth = std::numeric_limits<double>::max();
printf("GPU %02d", src);
for (int dst = 0; dst < numGpus; dst++) {
if (reIndex.count(std::make_pair(src, dst))) {
......@@ -155,24 +195,38 @@ void AllToAllPreset(EnvVars& ev,
colTotalBandwidth[dst] += r.avgBandwidthGbPerSec;
rowTotalBandwidth += r.avgBandwidthGbPerSec;
totalBandwidthGpu += r.avgBandwidthGbPerSec;
executorBandwidth = std::max(executorBandwidth,
results.exeResults[transfers[transferIdx].exeDevice].avgBandwidthGbPerSec);
minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec);
transferCount++;
printf("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
} else {
printf("%c%8s ", separator, "N/A");
}
}
printf(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, executorBandwidth);
minExecutorBandwidth = std::min(minExecutorBandwidth, executorBandwidth);
maxExecutorBandwidth = std::max(maxExecutorBandwidth, executorBandwidth);
colTotalBandwidth[numGpus] += rowTotalBandwidth;
if (numQueuePairs > 0) {
TransferBench::TransferResult const& r = results.tfrResults[nicTransferIdx[src]];
colTotalBandwidth[numGpus] += r.avgBandwidthGbPerSec;
rowTotalBandwidth += r.avgBandwidthGbPerSec;
totalBandwidthGpu += r.avgBandwidthGbPerSec;
minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec);
transferCount++;
printf("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
}
double actualBandwidth = minBandwidth * transferCount;
printf(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
minActualBandwidth = std::min(minActualBandwidth, actualBandwidth);
maxActualBandwidth = std::max(maxActualBandwidth, actualBandwidth);
colTotalBandwidth[numGpus+1] += rowTotalBandwidth;
}
printf("\nRTotal");
for (int dst = 0; dst < numGpus; dst++) {
printf("%c%8.3f ", separator, colTotalBandwidth[dst]);
}
printf(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus],
separator, minExecutorBandwidth, separator, maxExecutorBandwidth);
if (numQueuePairs > 0) {
printf("%c%8.3f ", separator, colTotalBandwidth[numGpus]);
}
printf(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
separator, minActualBandwidth, separator, maxActualBandwidth);
printf("\n");
printf("Average bandwidth (GPU Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
......
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#include "EnvVars.hpp"
void AllToAllRdmaPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int numQueuePairs = EnvVars::GetEnvVar("NUM_QUEUE_PAIRS", 1);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , 1);
// Print off environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Network Related]\n");
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("NUM_QUEUE_PAIRS", numQueuePairs, "Using %d queue pairs for NIC transfers", numQueuePairs);
ev.Print("USE_FINE_GRAIN" , useFineGrain , "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
printf("\n");
}
// Validate env vars
if (numGpus < 0 || numGpus > numDetectedGpus) {
printf("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
exit(1);
}
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
std::map<std::pair<int, int>, int> reIndex;
std::vector<Transfer> transfers;
for (int i = 0; i < numGpus; i++) {
for (int j = 0; j < numGpus; j++) {
// Build Transfer and add it to list
TransferBench::Transfer transfer;
transfer.numBytes = numBytesPerTransfer;
transfer.srcs.push_back({memType, i});
transfer.dsts.push_back({memType, j});
transfer.exeDevice = {EXE_NIC_NEAREST, i};
transfer.exeSubIndex = j;
transfer.numSubExecs = numQueuePairs;
reIndex[std::make_pair(i,j)] = transfers.size();
transfers.push_back(transfer);
}
}
printf("GPU-RDMA All-To-All benchmark:\n");
printf("==========================\n");
printf("- Copying %lu bytes between all pairs of GPUs using %d QPs per Transfer (%lu Transfers)\n",
numBytesPerTransfer, numQueuePairs, transfers.size());
if (transfers.size() == 0) return;
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
TransferBench::TestResults results;
if (!TransferBench::RunTransfers(cfg, transfers, results)) {
for (auto const& err : results.errResults)
printf("%s\n", err.errMsg.c_str());
exit(0);
} else {
PrintResults(ev, 1, transfers, results);
}
// Print results
char separator = (ev.outputToCsv ? ',' : ' ');
printf("\nSummary: [%lu bytes per Transfer]\n", numBytesPerTransfer);
printf("==========================================================\n");
printf("SRC\\DST ");
for (int dst = 0; dst < numGpus; dst++)
printf("%cGPU %02d ", separator, dst);
printf(" %cSTotal %cActual\n", separator, separator);
double totalBandwidthGpu = 0.0;
double minActualBandwidth = std::numeric_limits<double>::max();
double maxActualBandwidth = 0.0;
std::vector<double> colTotalBandwidth(numGpus+2, 0.0);
for (int src = 0; src < numGpus; src++) {
double rowTotalBandwidth = 0;
int transferCount = 0;
double minBandwidth = std::numeric_limits<double>::max();
printf("GPU %02d", src);
for (int dst = 0; dst < numGpus; dst++) {
if (reIndex.count(std::make_pair(src, dst))) {
int const transferIdx = reIndex[std::make_pair(src,dst)];
TransferBench::TransferResult const& r = results.tfrResults[transferIdx];
colTotalBandwidth[dst] += r.avgBandwidthGbPerSec;
rowTotalBandwidth += r.avgBandwidthGbPerSec;
totalBandwidthGpu += r.avgBandwidthGbPerSec;
minBandwidth = std::min(minBandwidth, r.avgBandwidthGbPerSec);
transferCount++;
printf("%c%8.3f ", separator, r.avgBandwidthGbPerSec);
} else {
printf("%c%8s ", separator, "N/A");
}
}
double actualBandwidth = minBandwidth * transferCount;
printf(" %c%8.3f %c%8.3f\n", separator, rowTotalBandwidth, separator, actualBandwidth);
minActualBandwidth = std::min(minActualBandwidth, actualBandwidth);
maxActualBandwidth = std::max(maxActualBandwidth, actualBandwidth);
colTotalBandwidth[numGpus+1] += rowTotalBandwidth;
}
printf("\nRTotal");
for (int dst = 0; dst < numGpus; dst++) {
printf("%c%8.3f ", separator, colTotalBandwidth[dst]);
}
printf(" %c%8.3f %c%8.3f %c%8.3f\n", separator, colTotalBandwidth[numGpus+1],
separator, minActualBandwidth, separator, maxActualBandwidth);
printf("\n");
printf("Average bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu / transfers.size());
printf("Aggregate bandwidth (Tx Thread Timed): %8.3f GB/s\n", totalBandwidthGpu);
printf("Aggregate bandwidth (CPU Timed): %8.3f GB/s\n", results.avgTotalBandwidthGbPerSec);
PrintErrors(results.errResults);
}
/*
Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#include "EnvVars.hpp"
void AllToAllSweepPreset(EnvVars& ev,
size_t const numBytesPerTransfer,
std::string const presetName)
{
enum
{
A2A_COPY = 0,
A2A_READ_ONLY = 1,
A2A_WRITE_ONLY = 2,
A2A_CUSTOM = 3,
};
char a2aModeStr[4][20] = {"Copy", "Read-Only", "Write-Only", "Custom"};
// Force single-stream mode for all-to-all benchmark
ev.useSingleStream = 1;
int numDetectedGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
// Collect env vars for this preset
int a2aDirect = EnvVars::GetEnvVar("A2A_DIRECT" , 1);
int a2aLocal = EnvVars::GetEnvVar("A2A_LOCAL" , 0);
int numGpus = EnvVars::GetEnvVar("NUM_GPU_DEVICES", numDetectedGpus);
int showMinOnly = EnvVars::GetEnvVar("SHOW_MIN_ONLY", 1);
int useFineGrain = EnvVars::GetEnvVar("USE_FINE_GRAIN" , 1);
int useRemoteRead = EnvVars::GetEnvVar("USE_REMOTE_READ", 0);
int useSpray = EnvVars::GetEnvVar("USE_SPRAY", 0);
int verbose = EnvVars::GetEnvVar("VERBOSE", 0);
std::vector<int> unrollList = EnvVars::GetEnvVarArray("UNROLLS", {1,2,3,4,6,8});
std::vector<int> numCusList = EnvVars::GetEnvVarArray("NUM_CUS", {4,8,12,16,24,32});
// A2A_MODE may be 0,1,2 or else custom numSrcs:numDsts
int numSrcs, numDsts;
int a2aMode = 0;
if (getenv("A2A_MODE") && sscanf(getenv("A2A_MODE"), "%d:%d", &numSrcs, &numDsts) == 2) {
a2aMode = A2A_CUSTOM;
} else {
a2aMode = EnvVars::GetEnvVar("A2A_MODE", 0);
if (a2aMode < 0 || a2aMode > 2) {
printf("[ERROR] a2aMode must be between 0 and 2, or else numSrcs:numDsts\n");
exit(1);
}
numSrcs = (a2aMode == A2A_WRITE_ONLY ? 0 : 1);
numDsts = (a2aMode == A2A_READ_ONLY ? 0 : 1);
}
// Print off environment variables
ev.DisplayEnvVars();
if (!ev.hideEnv) {
if (!ev.outputToCsv) printf("[AllToAll Related]\n");
ev.Print("A2A_DIRECT" , a2aDirect , a2aDirect ? "Only using direct links" : "Full all-to-all");
ev.Print("A2A_LOCAL" , a2aLocal , "%s local transfers", a2aLocal ? "Include" : "Exclude");
ev.Print("A2A_MODE" , (a2aMode == A2A_CUSTOM) ? std::to_string(numSrcs) + ":" + std::to_string(numDsts) : std::to_string(a2aMode),
(a2aMode == A2A_CUSTOM) ? (std::to_string(numSrcs) + " read(s) " +
std::to_string(numDsts) + " write(s)").c_str(): a2aModeStr[a2aMode]);
ev.Print("SHOW_MIN_ONLY" , showMinOnly , showMinOnly ? "Showing only slowest GPU results" : "Showing slowest and fastest GPU results");
ev.Print("NUM_CUS" , numCusList.size(), EnvVars::ToStr(numCusList).c_str());
ev.Print("NUM_GPU_DEVICES", numGpus , "Using %d GPUs", numGpus);
ev.Print("UNROLLS" , unrollList.size(), EnvVars::ToStr(unrollList).c_str());
ev.Print("USE_FINE_GRAIN" , useFineGrain , "Using %s-grained memory", useFineGrain ? "fine" : "coarse");
ev.Print("USE_REMOTE_READ", useRemoteRead , "Using %s as executor", useRemoteRead ? "DST" : "SRC");
ev.Print("USE_SPRAY" , useSpray , "%s per CU", useSpray ? "All targets" : "One target");
ev.Print("VERBOSE" , verbose , verbose ? "Display test results" : "Display summary only");
printf("\n");
}
// Validate env vars
if (numGpus < 0 || numGpus > numDetectedGpus) {
printf("[ERROR] Cannot use %d GPUs. Detected %d GPUs\n", numGpus, numDetectedGpus);
exit(1);
}
if (useSpray && numDsts > 1) {
printf("[ERROR] Cannot use USE_SPRAY with multiple destination buffers\n");
exit(1);
}
// Collect the number of GPU devices to use
MemType memType = useFineGrain ? MEM_GPU_FINE : MEM_GPU;
ExeType exeType = EXE_GPU_GFX;
std::vector<Transfer> transfers;
int targetCount = 0;
if (!useSpray) {
// Each CU will work on just one target
for (int i = 0; i < numGpus; i++) {
targetCount = 0;
for (int j = 0; j < numGpus; j++) {
// Check whether or not to execute this pair
if (i == j) {
if (!a2aLocal) continue;
} else if (a2aDirect) {
#if !defined(__NVCC__)
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(i, j, &linkType, &hopCount));
if (hopCount != 1) continue;
#endif
}
// Build Transfer and add it to list
TransferBench::Transfer transfer;
targetCount++;
transfer.numBytes = numBytesPerTransfer;
for (int x = 0; x < numSrcs; x++) transfer.srcs.push_back({memType, i});
// When using multiple destinations, the additional destinations are "local"
if (numDsts) transfer.dsts.push_back({memType, j});
for (int x = 1; x < numDsts; x++) transfer.dsts.push_back({memType, i});
transfer.exeDevice = {exeType, (useRemoteRead ? j : i)};
transfer.exeSubIndex = -1;
transfers.push_back(transfer);
}
}
} else {
// Each CU will work on all targets
for (int i = 0; i < numGpus; i++) {
TransferBench::Transfer transfer;
transfer.numBytes = numBytesPerTransfer;
transfer.exeDevice = {exeType, i};
transfer.exeSubIndex = -1;
targetCount = 0;
for (int j = 0; j < numGpus; j++) {
// Check whether or not to transfer to this GPU
if (i == j) {
if (!a2aLocal) continue;
} else if (a2aDirect) {
#if !defined(__NVCC__)
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(i, j, &linkType, &hopCount));
if (hopCount != 1) continue;
#endif
}
targetCount++;
for (int x = 0; x < numSrcs; x++) transfer.srcs.push_back({memType, useRemoteRead ? j : i});
if (numDsts) transfer.dsts.push_back({memType, j});
for (int x = 1; x < numDsts; x++) transfer.dsts.push_back({memType, i});
}
transfers.push_back(transfer);
}
}
printf("GPU-GFX All-To-All Sweep benchmark:\n");
printf("==========================\n");
printf("- Copying %lu bytes between %s pairs of GPUs\n", numBytesPerTransfer, a2aDirect ? "directly connected" : "all");
if (transfers.size() == 0) {
printf("[WARN} No transfers requested. Try adjusting A2A_DIRECT or A2A_LOCAL\n");
return;
}
// Execute Transfers
TransferBench::ConfigOptions cfg = ev.ToConfigOptions();
// Run tests
std::map<std::pair<int, int>, TransferBench::TestResults> results;
// Display summary
printf("#CUs\\Unroll");
for (int u : unrollList) {
printf(" %d(Min) ", u);
if (!showMinOnly) printf(" %d(Max) ", u);
}
printf("\n");
for (int c : numCusList) {
printf(" %5d ", c); fflush(stdout);
for (int u : unrollList) {
ev.gfxUnroll = cfg.gfx.unrollFactor = u;
for (auto& transfer : transfers)
transfer.numSubExecs = useSpray ? (c * targetCount) : c;
double minBandwidth = std::numeric_limits<double>::max();
double maxBandwidth = std::numeric_limits<double>::min();
TransferBench::TestResults result;
if (TransferBench::RunTransfers(cfg, transfers, result)) {
for (auto const& exeResult : result.exeResults) {
minBandwidth = std::min(minBandwidth, exeResult.second.avgBandwidthGbPerSec);
maxBandwidth = std::max(maxBandwidth, exeResult.second.avgBandwidthGbPerSec);
}
if (useSpray) {
minBandwidth *= targetCount;
maxBandwidth *= targetCount;
}
results[std::make_pair(c,u)] = result;
} else {
minBandwidth = 0.0;
}
printf(" %7.2f ", minBandwidth);
if (!showMinOnly) printf(" %7.2f ", maxBandwidth);
fflush(stdout);
}
printf("\n"); fflush(stdout);
}
if (verbose) {
int testNum = 0;
for (int c : numCusList) {
for (int u : unrollList) {
printf("CUs: %d Unroll %d\n", c, u);
PrintResults(ev, ++testNum, transfers, results[std::make_pair(c,u)]);
}
}
}
}