Unverified Commit 4e2be38c authored by gilbertlee-amd, committed by GitHub

Update rocm-rel-6.4 to use TransferBench v1.62.00 (#183)

* updating metadata

* v1.58.00 Fixing DMA copy-on-engine (#152)

* Leo's review

* Update use-transferbench.rst

* Update Doxyfile

* Refining API library

* Update TransferBench.hpp

* Update TransferBench.hpp

* Update TransferBench.hpp

* Bump rocm-docs-core from 1.9.2 to 1.11.0 in /docs/sphinx (#153)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.11.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.11.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#155)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* External CI: enable CI triggers

* Apply suggestions from code review
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Update LICENSE.md

* Bump rocm-docs-core from 1.12.0 to 1.13.0 in /docs/sphinx (#160)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench V1.59 (#162)

Adding NIC execution capabilities and fixing various bugs introduced by the header-only-library refactor
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Adding ability to specify A2A_MODE=numSrcs:numDsts (#164)
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Fixing specific DMA engine transfers, enabling GFX_SINGLE_TEAM=1 by default (#166)

* Bump rocm-docs-core from 1.13.0 to 1.15.0 in /docs/sphinx (#165)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.13.0 to 1.15.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.13.0...v1.15.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx (#171)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.15.0 to 1.17.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench v1.61 (#174)
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx (#176)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx (#177)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.18.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench v1.62.00 (#181)

* Adding non-temporal loads and stores via GFX_TEMPORAL
* Adding additional summary details to a2a preset
* Add SHOW_MIN_ONLY for a2asweep preset
* Adding new P CPU memory type, which is indexed by the closest GPU (see the usage sketch below)
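
A minimal usage sketch of the new v1.62 knobs, based on the `ConfigOptions` and `Transfer` definitions in the diff below. The aggregate-initialization order of `MemDevice` and `ExeDevice` ({type, index}) is assumed here, and the call that actually executes the Transfers is omitted because it is not part of this change:

```cpp
#include "TransferBench.hpp"

int main()
{
  using namespace TransferBench;

  ConfigOptions cfg;
  cfg.gfx.temporalMode = 3;          // GFX_TEMPORAL: 0=none, 1=load, 2=store, 3=both
  cfg.gfx.wordSize     = 4;          // packed data size (dwordx4)

  // One GFX-executed copy from GPU 0 into 'P' (MEM_CPU_CLOSEST) memory,
  // i.e. pinned CPU memory on the NUMA node closest to GPU 0.
  Transfer t;
  t.numBytes    = size_t(64) << 20;          // 64 MiB
  t.srcs        = { {MEM_GPU, 0} };          // read from GPU 0 device memory
  t.dsts        = { {MEM_CPU_CLOSEST, 0} };  // write to closest pinned CPU memory
  t.exeDevice   = { EXE_GPU_GFX, 0 };        // execute with GPU 0 GFX kernel
  t.numSubExecs = 4;                         // threadblocks/CUs for this Transfer

  (void)cfg; (void)t;   // cfg and {t} would be handed to the library's run call
  return 0;
}
```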

* Bump rocm-docs-core from 1.18.2 to 1.20.0 in /docs/sphinx (#180)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.2 to 1.20.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.20.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: srawat <120587655+SwRaw@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Su <danielsu@amd.com>
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
parent 3ea2f226
@@ -24,6 +24,8 @@ THE SOFTWARE.
// Included after EnvVars and Executors
#include "AllToAll.hpp"
#include "AllToAllN.hpp"
#include "AllToAllSweep.hpp"
#include "HealthCheck.hpp"
#include "OneToAll.hpp"
#include "PeerToPeer.hpp"
@@ -38,14 +40,16 @@ typedef void (*PresetFunc)(EnvVars& ev,
std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{
  {"a2a",         {AllToAllPreset,      "Tests parallel transfers between all pairs of GPU devices"}},
  {"a2a_n",       {AllToAllRdmaPreset,  "Tests parallel transfers between all pairs of GPU devices using Nearest NIC RDMA transfers"}},
  {"a2asweep",    {AllToAllSweepPreset, "Test GFX-based all-to-all transfers swept across different CU and GFX unroll counts"}},
  {"healthcheck", {HealthCheckPreset,   "Simple bandwidth health check (MI300X series only)"}},
  {"one2all",     {OneToAllPreset,      "Test all subsets of parallel transfers from one GPU to all others"}},
  {"p2p",         {PeerToPeerPreset,    "Peer-to-peer device memory bandwidth test"}},
  {"rsweep",      {SweepPreset,         "Randomly sweep through sets of Transfers"}},
  {"scaling",     {ScalingPreset,       "Run scaling test from one GPU to other devices"}},
  {"schmoo",      {SchmooPreset,        "Scaling tests for local/remote read/write/copy"}},
  {"sweep",       {SweepPreset,         "Ordered sweep through sets of Transfers"}},
};

void DisplayPresets()
...
@@ -22,19 +22,21 @@ THE SOFTWARE.
void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& transfers)
{
  if (fp) {
    fprintf(fp, "# Test %d\n", testNum);
    fprintf(fp, "%d", -1 * (int)transfers.size());
    for (auto const& transfer : transfers)
    {
      fprintf(fp, " (%s->%c%d->%s %d %lu)",
              MemDevicesToStr(transfer.srcs).c_str(),
              ExeTypeStr[transfer.exeDevice.exeType], transfer.exeDevice.exeIndex,
              MemDevicesToStr(transfer.dsts).c_str(),
              transfer.numSubExecs,
              transfer.numBytes);
    }
    fprintf(fp, "\n");
    fflush(fp);
  }
}

void SweepPreset(EnvVars& ev,
@@ -54,6 +56,7 @@ void SweepPreset(EnvVars& ev,
  int numGpuSubExecs    = EnvVars::GetEnvVar("NUM_GPU_SE"       , 4);
  std::string sweepDst  = EnvVars::GetEnvVar("SWEEP_DST"        , "CG");
  std::string sweepExe  = EnvVars::GetEnvVar("SWEEP_EXE"        , "CDG");
  std::string sweepFile = EnvVars::GetEnvVar("SWEEP_FILE"       , "/tmp/lastSweep.cfg");
  int sweepMax          = EnvVars::GetEnvVar("SWEEP_MAX"        , 24);
  int sweepMin          = EnvVars::GetEnvVar("SWEEP_MIN"        , 1);
  int sweepRandBytes    = EnvVars::GetEnvVar("SWEEP_RAND_BYTES" , 0);
@@ -78,6 +81,7 @@ void SweepPreset(EnvVars& ev,
  ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d subExecutors/CUs per GPU executed Transfer", numGpuSubExecs);
  ev.Print("SWEEP_DST", sweepDst.c_str(), "Destination Memory Types to sweep");
  ev.Print("SWEEP_EXE", sweepExe.c_str(), "Executor Types to sweep");
  ev.Print("SWEEP_FILE", sweepFile.c_str(), "File to store the executing sweep configuration");
  ev.Print("SWEEP_MAX", sweepMax, "Max simultaneous transfers (0 = no limit)");
  ev.Print("SWEEP_MIN", sweepMin, "Min simultaenous transfers");
  ev.Print("SWEEP_RAND_BYTES", sweepRandBytes, "Using %s number of bytes per Transfer", (sweepRandBytes ? "random" : "constant"));
@@ -283,10 +287,14 @@ void SweepPreset(EnvVars& ev,
  std::uniform_int_distribution<int> distribution(sweepMin, maxParallelTransfers);

  // Log sweep to configuration file
  char absPath[1024];
  auto const res = realpath(sweepFile.c_str(), absPath);
  FILE *fp = fopen(sweepFile.c_str(), "w");
  if (!fp) {
    printf("[WARN] Unable to open %s. Skipping output of sweep configuration file\n", res ? absPath : sweepFile.c_str());
  } else {
    printf("Sweep configuration saved to: %s\n", res ? absPath : sweepFile.c_str());
  }

  // Create bitmask of numPossible triplets, of which M will be chosen
@@ -333,7 +341,7 @@ void SweepPreset(EnvVars& ev,
    // Check for test limit
    if (numTestsRun == sweepTestLimit) {
      printf("Sweep Test limit reached\n");
      break;
    }
@@ -341,7 +349,7 @@ void SweepPreset(EnvVars& ev,
    auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
    double totalCpuTime = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count();
    if (sweepTimeLimit && totalCpuTime > sweepTimeLimit) {
      printf("Sweep Time limit exceeded\n");
      break;
    }
@@ -357,5 +365,5 @@ void SweepPreset(EnvVars& ev,
      bitmask[i] = (i < M) ? 1 : 0;
    }
  }
  if (fp) fclose(fp);
}
@@ -38,21 +38,56 @@ static int RemappedCpuIndex(int origIdx)
  return remappingCpu[origIdx];
}
static void PrintNicToGPUTopo(bool outputToCsv)
{
#ifdef NIC_EXEC_ENABLED
printf(" NIC | Device Name | Active | PCIe Bus ID | NUMA | Closest GPU(s) | GID Index | GID Descriptor\n");
if(!outputToCsv)
printf("-----+-------------+--------+--------------+------+----------------+-----------+-------------------\n");
int numGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
auto const& ibvDeviceList = GetIbvDeviceList();
for (int i = 0; i < ibvDeviceList.size(); i++) {
std::string closestGpusStr = "";
for (int j = 0; j < numGpus; j++) {
if (TransferBench::GetClosestNicToGpu(j) == i) {
if (closestGpusStr != "") closestGpusStr += ",";
closestGpusStr += std::to_string(j);
}
}
printf(" %-3d | %-11s | %-6s | %-12s | %-4d | %-14s | %-9s | %-20s\n",
i, ibvDeviceList[i].name.c_str(),
ibvDeviceList[i].hasActivePort ? "Yes" : "No",
ibvDeviceList[i].busId.c_str(),
ibvDeviceList[i].numaNode,
closestGpusStr.c_str(),
ibvDeviceList[i].isRoce && ibvDeviceList[i].hasActivePort? std::to_string(ibvDeviceList[i].gidIndex).c_str() : "N/A",
ibvDeviceList[i].isRoce && ibvDeviceList[i].hasActivePort? ibvDeviceList[i].gidDescriptor.c_str() : "N/A"
);
}
printf("\n");
#endif
}
void DisplayTopology(bool outputToCsv)
{
  int numCpus = TransferBench::GetNumExecutors(EXE_CPU);
  int numGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
  int numNics = TransferBench::GetNumExecutors(EXE_NIC);
  char sep = (outputToCsv ? ',' : '|');

  if (outputToCsv) {
    printf("NumCpus,%d\n", numCpus);
    printf("NumGpus,%d\n", numGpus);
    printf("NumNics,%d\n", numNics);
  } else {
    printf("\nDetected Topology:\n");
    printf("==================\n");
    printf(" %d configured CPU NUMA node(s) [%d total]\n", numCpus, numa_max_node() + 1);
    printf(" %d GPU device(s)\n", numGpus);
    printf(" %d Supported NIC device(s)\n", numNics);
  }

  // Print out detected CPU topology
@@ -91,8 +126,10 @@ void DisplayTopology(bool outputToCsv)
  }
  printf("\n");

  // Print out detected NIC topology
  PrintNicToGPUTopo(outputToCsv);

  // Print out detected GPU topology
#if defined(__NVCC__)
  for (int i = 0; i < numGpus; i++) {
    hipDeviceProp_t prop;
@@ -118,12 +155,12 @@ void DisplayTopology(bool outputToCsv)
  printf(" %c", sep);
  for (int j = 0; j < numGpus; j++)
    printf(" GPU %02d %c", j, sep);
  printf(" PCIe Bus ID %c #CUs %c NUMA %c #DMA %c #XCC %c NIC\n", sep, sep, sep, sep, sep);
  if (!outputToCsv) {
    for (int j = 0; j <= numGpus; j++)
      printf("--------+");
    printf("--------------+------+------+------+------+------\n");
  }

  // Loop over each GPU device
@@ -149,12 +186,13 @@ void DisplayTopology(bool outputToCsv)
    char pciBusId[20];
    HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, i));
    printf(" %-11s %c %-4d %c %-4d %c %-4d %c %-4d %c %-4d\n",
           pciBusId, sep,
           TransferBench::GetNumSubExecutors({EXE_GPU_GFX, i}), sep,
           TransferBench::GetClosestCpuNumaToGpu(i), sep,
           TransferBench::GetNumExecutorSubIndices({EXE_GPU_DMA, i}), sep,
           TransferBench::GetNumExecutorSubIndices({EXE_GPU_GFX, i}), sep,
           TransferBench::GetClosestNicToGpu(i));
  }
#endif
}
@@ -20,12 +20,15 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/

/// @cond
#pragma once
#include <algorithm>
#include <cstring>
#include <future>
#include <map>
#include <numa.h>    // If not found, try installing libnuma-dev (e.g apt-get install libnuma-dev)
#include <numaif.h>
#include <random>
#include <set>
#include <sstream>
#include <stdarg.h>
@@ -33,6 +36,19 @@ THE SOFTWARE.
#include <unistd.h>
#include <vector>
#ifdef NIC_EXEC_ENABLED
#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>
#include <filesystem>
#include <fstream>
#endif
#if defined(__NVCC__)
#include <cuda_runtime.h>
#else
@@ -41,6 +57,7 @@ THE SOFTWARE.
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#endif
/// @endcond

namespace TransferBench
{
@@ -49,24 +66,25 @@ namespace TransferBench
  using std::set;
  using std::vector;

  constexpr char VERSION[] = "1.62";

  /**
   * Enumeration of supported Executor types
   *
   * @note The Executor is the device used to perform a Transfer
   */
  enum ExeType
  {
    EXE_CPU         = 0,  ///< CPU executor             (subExecutor = CPU thread)
    EXE_GPU_GFX     = 1,  ///< GPU kernel-based executor (subExecutor = threadblock/CU)
    EXE_GPU_DMA     = 2,  ///< GPU SDMA executor         (subExecutor = not supported)
    EXE_NIC         = 3,  ///< NIC RDMA executor         (subExecutor = queue pair)
    EXE_NIC_NEAREST = 4   ///< NIC RDMA nearest executor (subExecutor = queue pair)
  };
  char const ExeTypeStr[6] = "CGDIN";
  inline bool IsCpuExeType(ExeType e){ return e == EXE_CPU; }
  inline bool IsGpuExeType(ExeType e){ return e == EXE_GPU_GFX || e == EXE_GPU_DMA; }
  inline bool IsNicExeType(ExeType e){ return e == EXE_NIC || e == EXE_NIC_NEAREST; }

  /**
   * A ExeDevice defines a specific Executor
@@ -94,10 +112,11 @@ namespace TransferBench
    MEM_GPU_FINE     = 3,  ///< Fine-grained global GPU memory
    MEM_CPU_UNPINNED = 4,  ///< Unpinned CPU memory
    MEM_NULL         = 5,  ///< NULL memory - used for empty
    MEM_MANAGED      = 6,  ///< Managed memory
    MEM_CPU_CLOSEST  = 7,  ///< Coarse-grained pinned CPU memory indexed by closest GPU
  };
  char const MemTypeStr[9] = "CGBFUNMP";
  inline bool IsCpuMemType(MemType m) { return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED || m == MEM_CPU_CLOSEST); }
  inline bool IsGpuMemType(MemType m) { return (m == MEM_GPU || m == MEM_GPU_FINE || m == MEM_MANAGED); }

  /**
@@ -118,11 +137,10 @@ namespace TransferBench
   */
  struct Transfer
  {
    size_t            numBytes    = 0;    ///< Number of bytes to Transfer
    vector<MemDevice> srcs        = {};   ///< List of source memory devices
    vector<MemDevice> dsts        = {};   ///< List of destination memory devices
    ExeDevice         exeDevice   = {};   ///< Executor to use
    int32_t           exeSubIndex = -1;   ///< Executor subindex
    int               numSubExecs = 0;    ///< Number of subExecutors to use for this Transfer
  };
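
  // Illustrative sketch only, not part of this diff: building a Transfer for the new
  // NIC executor. Per the validation code later in this header, for EXE_NIC the
  // executor index selects the source NIC and exeSubIndex selects the destination NIC.
  // Aggregate-initialization order of MemDevice/ExeDevice ({type, index}) is assumed.
  inline Transfer ExampleNicTransferSketch()
  {
    Transfer t;
    t.numBytes    = 1 << 26;            // 64 MiB moved per Transfer
    t.srcs        = { {MEM_GPU, 0} };   // read from GPU 0 device memory
    t.dsts        = { {MEM_GPU, 1} };   // write to GPU 1 device memory
    t.exeDevice   = { EXE_NIC, 0 };     // source NIC index 0
    t.exeSubIndex = 2;                  // destination NIC index 2
    t.numSubExecs = 4;                  // number of queue pairs (subExecutors)
    return t;
  }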
@@ -132,8 +150,8 @@ namespace TransferBench
   */
  struct GeneralOptions
  {
    int numIterations      = 10;   ///< Number of timed iterations to perform. If negative, run for -numIterations seconds instead
    int numSubIterations   = 1;    ///< Number of sub-iterations per iteration
    int numWarmups         = 3;    ///< Number of un-timed warmup iterations to perform
    int recordPerIteration = 0;    ///< Record per-iteration timing information
    int useInteractive     = 0;    ///< Pause for user-input before starting transfer loop
@@ -152,30 +170,51 @@ namespace TransferBench
    int validateSource     = 0;    ///< Validate src GPU memory immediately after preparation
  };

  /**
   * GFX Executor options
   */
  struct GfxOptions
  {
    int blockOrder = 0;                     ///< Determines how threadblocks are ordered (0=sequential, 1=interleaved, 2=random)
    int blockSize  = 256;                   ///< Size of each threadblock (must be multiple of 64)
    vector<uint32_t> cuMask = {};           ///< Bit-vector representing the CU mask
    vector<vector<int>> prefXccTable = {};  ///< 2D table with preferred XCD to use for a specific [src][dst] GPU device
    int temporalMode   = 0;                 ///< Non-temporal load/store mode 0=none, 1=load, 2=store, 3=both
    int unrollFactor   = 4;                 ///< GFX-kernel unroll factor
    int useHipEvents   = 1;                 ///< Use HIP events for timing GFX Executor
    int useMultiStream = 0;                 ///< Use multiple streams for GFX
    int useSingleTeam  = 0;                 ///< Team all subExecutors across the data array
    int waveOrder      = 0;                 ///< GFX-kernel wavefront ordering
    int wordSize       = 4;                 ///< GFX-kernel packed data size (4=dwordx4, 2=dwordx2, 1=dwordx1)
  };
/**
* DMA Executor options
*/
struct DmaOptions
{
int useHipEvents = 1; ///< Use HIP events for timing DMA Executor
int useHsaCopy = 0; ///< Use HSA copy instead of HIP copy to perform DMA
};
/**
* NIC Executor options
*/
struct NicOptions
{
vector<int> closestNics = {}; ///< Overrides the auto-detected closest NIC per GPU
int ibGidIndex = -1; ///< GID Index for RoCE NICs (-1 is auto)
uint8_t ibPort = 1; ///< NIC port number to be used
int ipAddressFamily = 4; ///< 4=IPv4, 6=IPv6 (used for auto GID detection)
int maxRecvWorkReq = 16; ///< Maximum number of recv work requests per queue pair
int maxSendWorkReq = 16; ///< Maximum number of send work requests per queue pair
int queueSize = 100; ///< Completion queue size
int roceVersion = 2; ///< RoCE version (used for auto GID detection)
int useRelaxedOrder = 1; ///< Use relaxed ordering
int useNuma = 0; ///< Switch to closest numa thread for execution
};
  /**
   * Configuration options for performing Transfers
   */
@@ -186,6 +225,7 @@ namespace TransferBench
    GfxOptions gfx;   ///< GFX executor options
    DmaOptions dma;   ///< DMA executor options
    NicOptions nic;   ///< NIC executor options
  };
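
  // Illustrative sketch only, not part of this diff: composing ConfigOptions with the
  // new GFX and NIC fields defined above. Values shown are the defaults documented in
  // the struct definitions or plausible overrides; nothing here alters library behavior.
  inline ConfigOptions ExampleConfigSketch()
  {
    ConfigOptions cfg;
    cfg.gfx.blockOrder     = 1;    // interleaved threadblock ordering
    cfg.gfx.temporalMode   = 1;    // non-temporal loads only
    cfg.nic.ibGidIndex     = -1;   // auto-detect GID index for RoCE NICs
    cfg.nic.ibPort         = 1;    // NIC port number to use
    cfg.nic.maxSendWorkReq = 16;   // send work requests per queue pair
    return cfg;
  }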
  /**
@@ -198,6 +238,31 @@ namespace TransferBench
    ERR_FATAL = 2,   ///< Fatal error - results are invalid
  };
/**
* Enumeration of GID priority
*
* @note These are the GID types ordered in priority from lowest (0) to highest
*/
enum GidPriority
{
UNKNOWN = -1, ///< Default
ROCEV1_LINK_LOCAL = 0, ///< RoCEv1 Link-local
ROCEV2_LINK_LOCAL = 1, ///< RoCEv2 Link-local fe80::/10
ROCEV1_IPV6 = 2, ///< RoCEv1 IPv6
ROCEV2_IPV6 = 3, ///< RoCEv2 IPv6
ROCEV1_IPV4 = 4, ///< RoCEv1 IPv4-mapped IPv6
ROCEV2_IPV4 = 5, ///< RoCEv2 IPv4-mapped IPv6 ::ffff:192.168.x.x
};
const char* GidPriorityStr[] = {
"RoCEv1 Link-local",
"RoCEv2 Link-local",
"RoCEv1 IPv6",
"RoCEv2 IPv6",
"RoCEv1 IPv4-mapped IPv6",
"RoCEv2 IPv4-mapped IPv6"
};
  /**
   * ErrResult consists of error type and error message
   */
@@ -241,6 +306,9 @@ namespace TransferBench
    // Only filled in if recordPerIteration = 1
    vector<double> perIterMsec;              ///< Duration for each individual iteration
    vector<set<pair<int,int>>> perIterCUs;   ///< GFX-Executor only. XCC:CU used per iteration
ExeDevice exeDevice; ///< Tracks which executor performed this Transfer (e.g. for EXE_NIC_NEAREST)
ExeDevice exeDstDevice; ///< Tracks actual destination executor (only valid for EXE_NIC/EXE_NIC_NEAREST)
  };

  /**
@@ -342,6 +410,23 @@ namespace TransferBench
   */
  int GetClosestCpuNumaToGpu(int gpuIndex);
/**
* Returns the index of the NUMA node closest to the given NIC
*
* @param[in] nicIndex Index of the NIC to query
* @returns NUMA node index closest to the NIC nicIndex, or -1 if unable to detect
*/
int GetClosestCpuNumaToNic(int nicIndex);
/**
* Returns the index of the NIC closest to the given GPU
*
* @param[in] gpuIndex Index of the GPU to query
* @note This function is applicable when the IBV/RDMA executor is available
* @returns IB Verbs capable NIC index closest to GPU gpuIndex, or -1 if unable to detect
*/
int GetClosestNicToGpu(int gpuIndex);
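
  // Illustrative sketch only, not part of this diff: pairing each GPU with its
  // closest NIC and NUMA node using the helpers declared above (both return -1
  // when detection fails). Assumes <cstdio> is available for printf.
  inline void ExampleTopologySketch()
  {
    int numGpus = GetNumExecutors(EXE_GPU_GFX);
    for (int gpu = 0; gpu < numGpus; ++gpu) {
      int nic  = GetClosestNicToGpu(gpu);
      int numa = GetClosestCpuNumaToGpu(gpu);
      printf("GPU %d: closest NIC %d, closest NUMA node %d\n", gpu, nic, numa);
    }
  }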
  /**
   * Helper function to parse a line containing Transfers into a vector of Transfers
   *
@@ -351,7 +436,6 @@ namespace TransferBench
   */
  ErrResult ParseTransfers(std::string str,
                           std::vector<Transfer>& transfers);
};
//==========================================================================================
// End of TransferBench API
@@ -409,6 +493,14 @@ namespace TransferBench
  #define hipStreamDestroy     cudaStreamDestroy
  #define hipStreamSynchronize cudaStreamSynchronize
// Define float2 addition operator for NVIDIA platform
__device__ inline float2& operator +=(float2& a, const float2& b)
{
a.x += b.x;
a.y += b.y;
return a;
}
  // Define float4 addition operator for NVIDIA platform
  __device__ inline float4& operator +=(float4& a, const float4& b)
  {
@@ -433,7 +525,7 @@ namespace TransferBench
#endif

// Macro for collecting XCC GFX kernel is running on
#if defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__) || defined(__gfx950__)
#define GetXccId(val) asm volatile ("s_getreg_b32 %0, hwreg(HW_REG_XCC_ID)" : "=s" (val));
#else
#define GetXccId(val) val = 0
@@ -457,21 +549,63 @@ namespace TransferBench
    return false;                                               \
  } while (0)
// Helper macros for calling RDMA functions and reporting errors
#ifdef VERBS_DEBUG
#define IBV_CALL(__func__, ...) \
do { \
int error = __func__(__VA_ARGS__); \
if (error != 0) { \
return {ERR_FATAL, "Encountered IbVerbs error (%d) at line (%d) " \
"and function (%s)", (error), __LINE__, #__func__}; \
} \
} while (0)
#define IBV_PTR_CALL(__ptr__, __func__, ...) \
do { \
__ptr__ = __func__(__VA_ARGS__); \
if (__ptr__ == nullptr) { \
return {ERR_FATAL, "Encountered IbVerbs nullptr error at line (%d) " \
"and function (%s)", __LINE__, #__func__}; \
} \
} while (0)
#else
#define IBV_CALL(__func__, ...) \
do { \
int error = __func__(__VA_ARGS__); \
if (error != 0) { \
return {ERR_FATAL, "Encountered IbVerbs error (%d) in func (%s) " \
, error, #__func__}; \
} \
} while (0)
#define IBV_PTR_CALL(__ptr__, __func__, ...) \
do { \
__ptr__ = __func__(__VA_ARGS__); \
if (__ptr__ == nullptr) { \
return {ERR_FATAL, "Encountered IbVerbs nullptr error in func (%s) " \
, #__func__}; \
} \
} while (0)
#endif
namespace TransferBench
{
/// @cond
// Helper functions ('hidden' in anonymous namespace)
//========================================================================================
namespace {

// Constants
//========================================================================================
  int constexpr MAX_BLOCKSIZE  = 512;                // Max threadblock size
  int constexpr MAX_WAVEGROUPS = MAX_BLOCKSIZE / 64; // Max wavegroups/warps
  int constexpr MAX_UNROLL     = 8;                  // Max unroll factor
  int constexpr MAX_SRCS       = 8;                  // Max srcs per Transfer
  int constexpr MAX_DSTS       = 8;                  // Max dsts per Transfer
  int constexpr MEMSET_CHAR    = 75;                 // Value to memset (char)
  float constexpr MEMSET_VAL   = 13323083.0f;        // Value to memset (double)

// Parsing-related functions
//========================================================================================
@@ -590,7 +724,7 @@ namespace {
    if (mistakeCount > 0) {
      return {ERR_FATAL,
              "%lu out of %lu pages for memory allocation were not on NUMA node %d."
              " This could be due to hardware memory issues, or the use of numa-rebalancing daemons such as numad",
              mistakeCount, numPages, targetId};
    }
    return ERR_NONE;
@@ -607,8 +741,14 @@ namespace {
    MemType const& memType = memDevice.memType;

    if (IsCpuMemType(memType)) {
      // Determine which NUMA device to use
      int numaIdx = memDevice.memIndex;
if (memType == MEM_CPU_CLOSEST) {
numaIdx = GetClosestCpuNumaToGpu(memDevice.memIndex);
}
// Set NUMA policy prior to call to hipHostMalloc
numa_set_preferred(numaIdx);
      // Allocate host-pinned memory (should respect NUMA mem policy)
      if (memType == MEM_CPU_FINE) {
@@ -617,19 +757,19 @@ namespace {
#else
        ERR_CHECK(hipHostMalloc((void **)memPtr, numBytes, hipHostMallocNumaUser));
#endif
      } else if (memType == MEM_CPU || memType == MEM_CPU_CLOSEST) {
#if defined (__NVCC__)
        ERR_CHECK(hipHostMalloc((void **)memPtr, numBytes, 0));
#else
        ERR_CHECK(hipHostMalloc((void **)memPtr, numBytes, hipHostMallocNumaUser | hipHostMallocNonCoherent));
#endif
      } else if (memType == MEM_CPU_UNPINNED) {
        *memPtr = numa_alloc_onnode(numBytes, numaIdx);
      }

      // Check that the allocated pages are actually on the correct NUMA node
      memset(*memPtr, 0, numBytes);
      ERR_CHECK(CheckPages((char*)*memPtr, numBytes, numaIdx));

      // Reset to default numa mem policy
      numa_set_preferred(-1);
@@ -668,7 +808,7 @@ namespace {
      return {ERR_FATAL, "Attempted to free null pointer for %lu bytes", bytes};

    switch (memType) {
    case MEM_CPU: case MEM_CPU_FINE: case MEM_CPU_CLOSEST:
    {
      ERR_CHECK(hipHostFree(memPtr));
      break;
@@ -740,6 +880,7 @@ namespace {
      break;
    case EXE_GPU_GFX: case EXE_GPU_DMA:
      if (exeIndex < 0 || exeIndex >= numGpus)
        return {ERR_FATAL, "GPU index must be between 0 and %d inclusively", numGpus - 1};
      agent = gpuAgents[exeIndex];
      break;
@@ -765,13 +906,36 @@ namespace {
// Setup validation-related functions
//========================================================================================
static ErrResult GetActualExecutor(ConfigOptions const& cfg,
ExeDevice const& origExeDevice,
ExeDevice& actualExeDevice)
{
// By default, nothing needs to change
actualExeDevice = origExeDevice;
// When using NIC_NEAREST, remap to the closest NIC to the GPU
if (origExeDevice.exeType == EXE_NIC_NEAREST) {
actualExeDevice.exeType = EXE_NIC;
if (cfg.nic.closestNics.size() > 0) {
if (origExeDevice.exeIndex < 0 || origExeDevice.exeIndex >= cfg.nic.closestNics.size())
return {ERR_FATAL, "NIC index is out of range (%d)", origExeDevice.exeIndex};
actualExeDevice.exeIndex = cfg.nic.closestNics[origExeDevice.exeIndex];
} else {
actualExeDevice.exeIndex = GetClosestNicToGpu(origExeDevice.exeIndex);
}
}
return ERR_NONE;
}
  // Validate that MemDevice exists
  static ErrResult CheckMemDevice(MemDevice const& memDevice)
  {
    if (memDevice.memType == MEM_NULL)
      return ERR_NONE;

    if (IsCpuMemType(memDevice.memType) && memDevice.memType != MEM_CPU_CLOSEST) {
      int numCpus = GetNumExecutors(EXE_CPU);
      if (memDevice.memIndex < 0 || memDevice.memIndex >= numCpus)
        return {ERR_FATAL,
@@ -779,11 +943,16 @@ namespace {
      return ERR_NONE;
    }

    if (IsGpuMemType(memDevice.memType) || memDevice.memType == MEM_CPU_CLOSEST) {
      int numGpus = GetNumExecutors(EXE_GPU_GFX);
      if (memDevice.memIndex < 0 || memDevice.memIndex >= numGpus)
        return {ERR_FATAL,
                "GPU index must be between 0 and %d (instead of %d)", numGpus - 1, memDevice.memIndex};
if (memDevice.memType == MEM_CPU_CLOSEST) {
if (GetClosestCpuNumaToGpu(memDevice.memIndex) == -1) {
return {ERR_FATAL, "Unable to determine closest NUMA node for GPU %d", memDevice.memIndex};
}
}
      return ERR_NONE;
    }
    return {ERR_FATAL, "Unsupported memory type (%d)", memDevice.memType};
@@ -804,12 +973,29 @@ namespace {
      errors.push_back({ERR_FATAL, "[data.byteOffset] must be positive multiple of %lu", sizeof(float)});

    // Check GFX options
if (cfg.gfx.blockOrder < 0 || cfg.gfx.blockOrder > 2)
errors.push_back({ERR_FATAL,
"[gfx.blockOrder] must be 0 for sequential, 1 for interleaved, or 2 for random"});
if (cfg.gfx.useMultiStream && cfg.gfx.blockOrder > 0)
errors.push_back({ERR_WARN, "[gfx.blockOrder] will be ignored when running in multi-stream mode"});
    int gfxMaxBlockSize = GetIntAttribute(ATR_GFX_MAX_BLOCKSIZE);
    if (cfg.gfx.blockSize < 0 || cfg.gfx.blockSize % 64 || cfg.gfx.blockSize > gfxMaxBlockSize)
      errors.push_back({ERR_FATAL,
                        "[gfx.blockSize] must be positive multiple of 64 less than or equal to %d",
                        gfxMaxBlockSize});
if (cfg.gfx.temporalMode < 0 || cfg.gfx.temporalMode > 3)
errors.push_back({ERR_FATAL,
"[gfx.temporalMode] must be non-negative and less than or equal to 3"});
#if defined(__NVCC__)
if (cfg.gfx.temporalMode > 0)
errors.push_back({ERR_FATAL,
"[gfx.temporalMode] is not supported on NVIDIA hardware"});
#endif
    int gfxMaxUnroll = GetIntAttribute(ATR_GFX_MAX_UNROLL);
    if (cfg.gfx.unrollFactor < 0 || cfg.gfx.unrollFactor > gfxMaxUnroll)
      errors.push_back({ERR_FATAL,
@@ -819,6 +1005,9 @@ namespace {
      errors.push_back({ERR_FATAL,
                        "[gfx.waveOrder] must be non-negative and less than 6"});
if (!(cfg.gfx.wordSize == 1 || cfg.gfx.wordSize == 2 || cfg.gfx.wordSize == 4))
errors.push_back({ERR_FATAL, "[gfx.wordSize] must be either 1, 2 or 4"});
    int numGpus = GetNumExecutors(EXE_GPU_GFX);
    int numXccs = GetNumExecutorSubIndices({EXE_GPU_GFX, 0});
    vector<vector<int>> const& table = cfg.gfx.prefXccTable;
@@ -844,6 +1033,20 @@ namespace {
      }
    }
// Check NIC options
#ifdef NIC_EXEC_ENABLED
int numNics = GetNumExecutors(EXE_NIC);
for (auto const& nic : cfg.nic.closestNics)
if (nic < 0 || nic >= numNics)
errors.push_back({ERR_FATAL, "NIC index (%d) in user-specified closest NIC list must be between 0 and %d",
nic, numNics - 1});
size_t closetNicsSize = cfg.nic.closestNics.size();
if (closetNicsSize > 0 && closetNicsSize < numGpus)
errors.push_back({ERR_FATAL, "User-specified closest NIC list must match GPU count of %d",
numGpus});
#endif
    // NVIDIA specific
#if defined(__NVCC__)
    if (cfg.data.validateDirect)
@@ -877,6 +1080,7 @@ namespace {
  {
    int numCpus = GetNumExecutors(EXE_CPU);
    int numGpus = GetNumExecutors(EXE_GPU_GFX);
int numNics = GetNumExecutors(EXE_NIC);
    std::set<ExeDevice> executors;
    std::map<ExeDevice, int> transferCount;
@@ -1002,7 +1206,7 @@ namespace {
#endif
      }

      if (!IsGpuMemType(t.srcs[0].memType) && !IsGpuMemType(t.dsts[0].memType)) {
        errors.push_back({ERR_WARN,
                          "Transfer %d: No GPU memory for source or destination. Copy might not execute on DMA %d",
                          i, t.exeDevice.exeIndex});
@@ -1021,8 +1225,33 @@ namespace {
        }
      }
      break;
    case EXE_NIC:
#ifdef NIC_EXEC_ENABLED
{
int srcIndex = t.exeDevice.exeIndex;
int dstIndex = t.exeSubIndex;
if (srcIndex < 0 || srcIndex >= numNics)
errors.push_back({ERR_FATAL, "Transfer %d: src NIC executor indexes an out-of-range NIC (%d)", i, srcIndex});
if (dstIndex < 0 || dstIndex >= numNics)
errors.push_back({ERR_FATAL, "Transfer %d: dst NIC executor indexes an out-of-range NIC (%d)", i, dstIndex});
}
#else
errors.push_back({ERR_FATAL, "Transfer %d: NIC executor is requested but is not available", i});
#endif
break;
case EXE_NIC_NEAREST:
#ifdef NIC_EXEC_ENABLED
{
ExeDevice srcExeDevice;
ErrResult errSrc = GetActualExecutor(cfg, t.exeDevice, srcExeDevice);
if (errSrc.errType != ERR_NONE) errors.push_back(errSrc);
ExeDevice dstExeDevice;
ErrResult errDst = GetActualExecutor(cfg, {t.exeDevice.exeType, t.exeSubIndex}, dstExeDevice);
if (errDst.errType != ERR_NONE) errors.push_back(errDst);
}
#else
errors.push_back({ERR_FATAL, "Transfer %d: NIC executor is requested but is not available", i});
#endif
      break;
    }
@@ -1102,7 +1331,6 @@ namespace {
    }
  }

  // Check for fatal errors
  for (auto const& err : errors)
    if (err.errType == ERR_FATAL) return true;
@@ -1143,6 +1371,7 @@ namespace {
    vector<float*> dstMem;                  ///< Destination memory
    vector<SubExecParam> subExecParamCpu;   ///< Defines subarrays for each subexecutor
    vector<int> subExecIdx;                 ///< Indices into subExecParamGpu
    int numaNode;                           ///< NUMA node to use for this Transfer

    // For GFX executor
    SubExecParam* subExecParamGpuPtr;
@@ -1155,6 +1384,29 @@ namespace {
    hsa_amd_sdma_engine_id_t sdmaEngineId;  ///< DMA engine ID
#endif
// For IBV executor
#ifdef NIC_EXEC_ENABLED
int srcNicIndex; ///< SRC NIC index
int dstNicIndex; ///< DST NIC index
ibv_context* srcContext; ///< Device context for SRC NIC
ibv_context* dstContext; ///< Device context for DST NIC
ibv_pd* srcProtect; ///< Protection domain for SRC NIC
ibv_pd* dstProtect; ///< Protection domain for DST NIC
ibv_cq* srcCompQueue; ///< Completion queue for SRC NIC
ibv_cq* dstCompQueue; ///< Completion queue for DST NIC
ibv_port_attr srcPortAttr; ///< Port attributes for SRC NIC
ibv_port_attr dstPortAttr; ///< Port attributes for DST NIC
ibv_gid srcGid; ///< GID handle for SRC NIC
ibv_gid dstGid; ///< GID handle for DST NIC
vector<ibv_qp*> srcQueuePairs; ///< Queue pairs for SRC NIC
vector<ibv_qp*> dstQueuePairs; ///< Queue pairs for DST NIC
ibv_mr* srcMemRegion; ///< Memory region for SRC
ibv_mr* dstMemRegion; ///< Memory region for DST
uint8_t qpCount; ///< Number of QPs to be used for transferring data
vector<ibv_sge> sgePerQueuePair; ///< Scatter-gather elements per queue pair
vector<ibv_send_wr> sendWorkRequests; ///< Send work requests per queue pair
#endif
    // Counters
    double totalDurationMsec;               ///< Total duration for all iterations for this Transfer
    vector<double> perIterMsec;             ///< Duration for each individual iteration
@@ -1169,7 +1421,6 @@ namespace {
    int totalSubExecs;                      ///< Total number of subExecutors to use
    bool useSubIndices;                     ///< Use subexecutor indicies
    int numSubIndices;                      ///< Number of subindices this ExeDevice has
    vector<SubExecParam> subExecParamCpu;   ///< Subexecutor parameters for this executor
    vector<TransferResources> resources;    ///< Per-Transfer resources
@@ -1178,7 +1429,718 @@ namespace {
    vector<hipStream_t> streams;            ///< HIP streams to launch on
    vector<hipEvent_t> startEvents;         ///< HIP start timing event
    vector<hipEvent_t> stopEvents;          ///< HIP stop timing event
    int wallClockRate;                      ///< (GFX-only) Device wall clock rate
  };
// Structure to track PCIe topology
struct PCIeNode
{
std::string address; ///< PCIe address for this PCIe node
std::string description; ///< Description for this PCIe node
std::set<PCIeNode> children; ///< Children PCIe nodes
// Default constructor
PCIeNode() : address(""), description("") {}
// Constructor
PCIeNode(std::string const& addr) : address(addr) {}
// Constructor
PCIeNode(std::string const& addr, std::string const& desc)
:address(addr), description(desc) {}
// Comparison operator for std::set
bool operator<(PCIeNode const& other) const {
return address < other.address;
}
};
#ifdef NIC_EXEC_ENABLED
// Structure to track information about IBV devices
struct IbvDevice
{
ibv_device* devicePtr;
std::string name;
std::string busId;
bool hasActivePort;
int numaNode;
int gidIndex;
std::string gidDescriptor;
bool isRoce;
  };
#endif
#ifdef NIC_EXEC_ENABLED
// Function to collect information about IBV devices
//========================================================================================
static bool IsConfiguredGid(union ibv_gid const& gid)
{
const struct in6_addr *a = (struct in6_addr *) gid.raw;
int trailer = (a->s6_addr32[1] | a->s6_addr32[2] | a->s6_addr32[3]);
if (((a->s6_addr32[0] | trailer) == 0UL) ||
((a->s6_addr32[0] == htonl(0xfe800000)) && (trailer == 0UL))) {
return false;
}
return true;
}
static bool LinkLocalGid(union ibv_gid const& gid)
{
const struct in6_addr *a = (struct in6_addr *) gid.raw;
if (a->s6_addr32[0] == htonl(0xfe800000) && a->s6_addr32[1] == 0UL) {
return true;
}
return false;
}
static ErrResult GetRoceVersionNumber(struct ibv_context* const& context,
int const& portNum,
int const& gidIndex,
int& version)
{
char const* deviceName = ibv_get_device_name(context->device);
char gidRoceVerStr[16] = {};
char roceTypePath[PATH_MAX] = {};
sprintf(roceTypePath, "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d",
deviceName, portNum, gidIndex);
int fd = open(roceTypePath, O_RDONLY);
if (fd == -1)
return {ERR_FATAL, "Failed while opening RoCE file path (%s)", roceTypePath};
int ret = read(fd, gidRoceVerStr, 15);
close(fd);
if (ret == -1)
return {ERR_FATAL, "Failed while reading RoCE version"};
if (strlen(gidRoceVerStr)) {
if (strncmp(gidRoceVerStr, "IB/RoCE v1", strlen("IB/RoCE v1")) == 0
|| strncmp(gidRoceVerStr, "RoCE v1", strlen("RoCE v1")) == 0) {
version = 1;
}
else if (strncmp(gidRoceVerStr, "RoCE v2", strlen("RoCE v2")) == 0) {
version = 2;
}
}
return ERR_NONE;
}
static bool IsIPv4MappedIPv6(const union ibv_gid &gid)
{
// look for ::ffff:x.x.x.x format
// From Broadcom documentation
// https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/frequently-asked-questions1.html
// "The IPv4 address is really an IPv4 address mapped into the IPv6 address space.
// This can be identified by 80 “0” bits, followed by 16 “1” bits (“FFFF” in hexadecimal)
// followed by the original 32-bit IPv4 address."
return (gid.global.subnet_prefix == 0 &&
gid.raw[8] == 0 &&
gid.raw[9] == 0 &&
gid.raw[10] == 0xff &&
gid.raw[11] == 0xff);
}
static ErrResult GetGidIndex(struct ibv_context* context,
int const& gidTblLen,
int const& portNum,
std::pair<int, std::string>& gidInfo)
{
if(gidInfo.first >= 0) return ERR_NONE; // honor user choice
union ibv_gid gid;
GidPriority highestPriority = GidPriority::UNKNOWN;
int gidIndex = -1;
for (int i = 0; i < gidTblLen; ++i) {
IBV_CALL(ibv_query_gid, context, portNum, i, &gid);
if (!IsConfiguredGid(gid)) continue;
int gidCurrRoceVersion;
if(GetRoceVersionNumber(context, portNum, i, gidCurrRoceVersion).errType != ERR_NONE) continue;
GidPriority currPriority;
if (IsIPv4MappedIPv6(gid)) {
currPriority = (gidCurrRoceVersion == 2) ? GidPriority::ROCEV2_IPV4 : GidPriority::ROCEV1_IPV4;
} else if (!LinkLocalGid(gid)) {
currPriority = (gidCurrRoceVersion == 2) ? GidPriority::ROCEV2_IPV6 : GidPriority::ROCEV1_IPV6;
} else {
currPriority = (gidCurrRoceVersion == 2) ? GidPriority::ROCEV2_LINK_LOCAL : GidPriority::ROCEV1_LINK_LOCAL;
}
if(currPriority > highestPriority) {
highestPriority = currPriority;
gidIndex = i;
}
}
if (highestPriority == GidPriority::UNKNOWN) {
gidInfo.first = -1;
return {ERR_FATAL, "Failed to auto-detect a valid GID index. Try setting it manually through IB_GID_INDEX"};
}
gidInfo.first = gidIndex;
gidInfo.second = GidPriorityStr[highestPriority];
return ERR_NONE;
}
static vector<IbvDevice>& GetIbvDeviceList()
{
static bool isInitialized = false;
static vector<IbvDevice> ibvDeviceList = {};
// Build list on first use
if (!isInitialized) {
// Query the number of IBV devices
int numIbvDevices = 0;
ibv_device** deviceList = ibv_get_device_list(&numIbvDevices);
if (deviceList && numIbvDevices > 0) {
// Loop over each device to collect information
for (int i = 0; i < numIbvDevices; i++) {
IbvDevice ibvDevice;
ibvDevice.devicePtr = deviceList[i];
ibvDevice.name = deviceList[i]->name;
ibvDevice.hasActivePort = false;
{
struct ibv_context *context = ibv_open_device(ibvDevice.devicePtr);
if (context) {
struct ibv_device_attr deviceAttr;
if (!ibv_query_device(context, &deviceAttr)) {
int activePort;
ibvDevice.gidIndex = -1;
for (int port = 1; port <= deviceAttr.phys_port_cnt; ++port) {
struct ibv_port_attr portAttr;
if (ibv_query_port(context, port, &portAttr)) continue;
if (portAttr.state == IBV_PORT_ACTIVE) {
activePort = port;
ibvDevice.hasActivePort = true;
if(portAttr.link_layer == IBV_LINK_LAYER_ETHERNET) {
ibvDevice.isRoce = true;
std::pair<int, std::string> gidInfo (-1, "");
auto res = GetGidIndex(context, portAttr.gid_tbl_len, activePort, gidInfo);
if (res.errType == ERR_NONE) {
ibvDevice.gidIndex = gidInfo.first;
ibvDevice.gidDescriptor = gidInfo.second;
}
}
break;
}
}
}
ibv_close_device(context);
}
}
ibvDevice.busId = "";
{
std::string device_path(ibvDevice.devicePtr->dev_path);
if (std::filesystem::exists(device_path)) {
std::string pciPath = std::filesystem::canonical(device_path + "/device").string();
std::size_t pos = pciPath.find_last_of('/');
if (pos != std::string::npos) {
ibvDevice.busId = pciPath.substr(pos + 1);
}
}
}
// Get nearest numa node for this device
ibvDevice.numaNode = -1;
std::filesystem::path devicePath = "/sys/bus/pci/devices/" + ibvDevice.busId + "/numa_node";
if (!ibvDevice.busId.empty() && std::filesystem::exists(devicePath)) {
std::string canonicalPath = std::filesystem::canonical(devicePath).string();
std::ifstream file(canonicalPath);
if (file.is_open()) {
std::string numaNodeStr;
std::getline(file, numaNodeStr);
int numaNodeVal;
if (sscanf(numaNodeStr.c_str(), "%d", &numaNodeVal) == 1)
ibvDevice.numaNode = numaNodeVal;
file.close();
}
}
ibvDeviceList.push_back(ibvDevice);
}
}
ibv_free_device_list(deviceList);
isInitialized = true;
}
return ibvDeviceList;
}
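// Illustrative sketch (not part of the library): enumerating the cached IBV device list.
//
//   for (IbvDevice const& dev : GetIbvDeviceList()) {
//     if (!dev.hasActivePort) continue;
//     printf("%s busId=%s numaNode=%d gidIndex=%d (%s)\n", dev.name.c_str(),
//            dev.busId.c_str(), dev.numaNode, dev.gidIndex, dev.gidDescriptor.c_str());
//   }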
#endif // NIC_EXEC_ENABLED
#ifdef NIC_EXEC_ENABLED
// PCIe-related functions
//========================================================================================
// Prints off PCIe tree
static void PrintPCIeTree(PCIeNode const& node,
std::string const& prefix = "",
bool isLast = true)
{
if (!node.address.empty()) {
printf("%s%s%s", prefix.c_str(), (isLast ? "└── " : "├── "), node.address.c_str());
if (!node.description.empty()) {
printf("(%s)", node.description.c_str());
}
printf("\n");
}
auto const& children = node.children;
for (auto it = children.begin(); it != children.end(); ++it) {
PrintPCIeTree(*it, prefix + (isLast ? " " : "│ "), std::next(it) == children.end());
}
}
// Inserts nodes along pcieAddress down a tree starting from root
static ErrResult InsertPCIePathToTree(std::string const& pcieAddress,
std::string const& description,
PCIeNode& root)
{
std::filesystem::path devicePath = "/sys/bus/pci/devices/" + pcieAddress;
if (!std::filesystem::exists(devicePath)) {
return {ERR_FATAL, "Device path %s does not exist", devicePath.c_str()};
}
std::string canonicalPath = std::filesystem::canonical(devicePath).string();
std::istringstream iss(canonicalPath);
std::string token;
PCIeNode* currNode = &root;
while (std::getline(iss, token, '/')) {
auto it = (currNode->children.insert(PCIeNode(token))).first;
currNode = const_cast<PCIeNode*>(&(*it));
}
currNode->description = description;
return ERR_NONE;
}
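// Illustrative example (not part of the library, hypothetical topology): for a NIC at
// "0000:c1:00.0" the canonical sysfs path might be
//   /sys/devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0
// so this routine walks/creates the node chain
//   "sys" -> "devices" -> "pci0000:c0" -> "0000:c0:01.1" -> "0000:c1:00.0"
// and attaches the description to the leaf node.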
// Returns root node for PCIe tree. Constructed on first use
static PCIeNode* GetPCIeTreeRoot()
{
static bool isInitialized = false;
static PCIeNode pcieRoot;
// Build PCIe tree on first use
if (!isInitialized) {
// Add NICs to the tree
int numNics = GetNumExecutors(EXE_NIC);
auto const& ibvDeviceList = GetIbvDeviceList();
for (IbvDevice const& ibvDevice : ibvDeviceList) {
if (!ibvDevice.hasActivePort || ibvDevice.busId == "") continue;
InsertPCIePathToTree(ibvDevice.busId, ibvDevice.name, pcieRoot);
}
// Add GPUs to the tree
int numGpus = GetNumExecutors(EXE_GPU_GFX);
for (int i = 0; i < numGpus; ++i) {
char hipPciBusId[64];
if (hipDeviceGetPCIBusId(hipPciBusId, sizeof(hipPciBusId), i) == hipSuccess) {
InsertPCIePathToTree(hipPciBusId, "GPU " + std::to_string(i), pcieRoot);
}
}
#ifdef VERBS_DEBUG
PrintPCIeTree(pcieRoot);
#endif
isInitialized = true;
}
return &pcieRoot;
}
// Finds the lowest common ancestor in PCIe tree between two nodes
static PCIeNode const* GetLcaBetweenNodes(PCIeNode const* root,
std::string const& node1Address,
std::string const& node2Address)
{
if (!root || root->address == node1Address || root->address == node2Address)
return root;
PCIeNode const* lcaFound1 = nullptr;
PCIeNode const* lcaFound2 = nullptr;
// Recursively iterate over children
for (auto const& child : root->children) {
PCIeNode const* lca = GetLcaBetweenNodes(&child, node1Address, node2Address);
if (!lca) continue;
if (!lcaFound1) {
// First time found
lcaFound1 = lca;
} else {
// Second time found
lcaFound2 = lca;
break;
}
}
// If two children were found, then current node is the lowest common ancestor
return (lcaFound1 && lcaFound2) ? root : lcaFound1;
}
// Gets the depth of a node in the PCIe tree
static int GetLcaDepth(std::string const& targetBusID,
PCIeNode const* const& node,
int depth = 0)
{
if (!node) return -1;
if (targetBusID == node->address) return depth;
for (auto const& child : node->children) {
int distance = GetLcaDepth(targetBusID, &child, depth + 1);
if (distance != -1)
return distance;
}
return -1;
}
// Function to extract the bus number from a PCIe address (domain:bus:device.function)
static int ExtractBusNumber(std::string const& pcieAddress)
{
int domain, bus, device, function;
char delimiter;
std::istringstream iss(pcieAddress);
iss >> std::hex >> domain >> delimiter >> bus >> delimiter >> device >> delimiter >> function;
if (iss.fail()) {
#ifdef VERBS_DEBUG
printf("Invalid PCIe address format: %s\n", pcieAddress.c_str());
#endif
return -1;
}
return bus;
}
// Function to compute the distance between two bus IDs
static int GetBusIdDistance(std::string const& pcieAddress1,
std::string const& pcieAddress2)
{
int bus1 = ExtractBusNumber(pcieAddress1);
int bus2 = ExtractBusNumber(pcieAddress2);
return (bus1 < 0 || bus2 < 0) ? -1 : std::abs(bus1 - bus2);
}
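// Illustrative example (not part of the library): for "0000:c1:00.0" ExtractBusNumber parses
// domain=0x0000, bus=0xc1, device=0x00, function=0x0 and returns 0xc1 (193), so e.g.
//   GetBusIdDistance("0000:c1:00.0", "0000:c4:00.0") == 3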
// Given a target busID and a set of candidate devices, returns a set of indices
// that is "closest" to the target
static std::set<int> GetNearestDevicesInTree(std::string const& targetBusId,
std::vector<std::string> const& candidateBusIdList)
{
int maxDepth = -1;
int minDistance = std::numeric_limits<int>::max();
std::set<int> matches = {};
// Loop over the candidates to find the ones with the lowest common ancestor (LCA)
for (int i = 0; i < candidateBusIdList.size(); i++) {
std::string const& candidateBusId = candidateBusIdList[i];
if (candidateBusId == "") continue;
PCIeNode const* lca = GetLcaBetweenNodes(GetPCIeTreeRoot(), targetBusId, candidateBusId);
if (!lca) continue;
int depth = GetLcaDepth(lca->address, GetPCIeTreeRoot());
int currDistance = GetBusIdDistance(targetBusId, candidateBusId);
// When more than one LCA match is found, choose the one with the smallest busId difference
// NOTE: currDistance could be -1, which signals a parsing problem; such a candidate is
// still treated as a valid "closest" match, so it is included
if (depth > maxDepth || (depth == maxDepth && depth >= 0 && currDistance < minDistance)) {
maxDepth = depth;
matches.clear();
matches.insert(i);
minDistance = currDistance;
} else if (depth == maxDepth && depth >= 0 && currDistance == minDistance) {
matches.insert(i);
}
}
return matches;
}
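// Illustrative sketch (not part of the library): finding the NIC(s) topologically closest to
// GPU 0; variable names here are hypothetical.
//
//   char gpuBusId[64];
//   if (hipDeviceGetPCIBusId(gpuBusId, sizeof(gpuBusId), 0) == hipSuccess) {
//     std::vector<std::string> nicBusIds;
//     for (auto const& dev : GetIbvDeviceList()) nicBusIds.push_back(dev.busId);
//     std::set<int> nearest = GetNearestDevicesInTree(gpuBusId, nicBusIds);
//     // 'nearest' holds indices into nicBusIds that share the deepest common PCIe ancestor
//   }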
#endif // NIC_EXEC_ENABLED
#ifdef NIC_EXEC_ENABLED
// IB Verbs-related functions
//========================================================================================
// Create a queue pair
static ErrResult CreateQueuePair(ConfigOptions const& cfg,
struct ibv_pd* pd,
struct ibv_cq* cq,
struct ibv_qp*& qp)
{
// Set queue pair attributes
struct ibv_qp_init_attr attr = {};
attr.qp_type = IBV_QPT_RC; // Set type to reliable connection
attr.send_cq = cq; // Send completion queue
attr.recv_cq = cq; // Recv completion queue
attr.cap.max_send_wr = cfg.nic.maxSendWorkReq; // Max send work requests
attr.cap.max_recv_wr = cfg.nic.maxRecvWorkReq; // Max recv work requests
attr.cap.max_send_sge = 1; // Max send scatter-gather entries
attr.cap.max_recv_sge = 1; // Max recv scatter-gather entries
qp = ibv_create_qp(pd, &attr);
if (qp == NULL)
return {ERR_FATAL, "Error while creating QP"};
return ERR_NONE;
}
// Initialize a queue pair
static ErrResult InitQueuePair(struct ibv_qp* qp,
uint8_t port,
unsigned flags)
{
struct ibv_qp_attr attr = {}; // Clear all attributes
attr.qp_state = IBV_QPS_INIT; // Set the QP state to INIT
attr.pkey_index = 0; // Set the partition key index to 0
attr.port_num = port; // Set the port number to the defined IB_PORT
attr.qp_access_flags = flags; // Set the QP access flags to the provided flags
int ret = ibv_modify_qp(qp, &attr,
IBV_QP_STATE | // Modify the QP state
IBV_QP_PKEY_INDEX | // Modify the partition key index
IBV_QP_PORT | // Modify the port number
IBV_QP_ACCESS_FLAGS); // Modify the access flags
if (ret != 0)
return {ERR_FATAL, "Error during QP Init. IB Verbs Error code: %d", ret};
return ERR_NONE;
}
// Transition QueuePair to Ready to Receive State
static ErrResult TransitionQpToRtr(ibv_qp* qp,
uint16_t const& dlid,
uint32_t const& dqpn,
ibv_gid const& gid,
uint8_t const& gidIndex,
uint8_t const& port,
bool const& isRoCE,
ibv_mtu const& mtu)
{
// Prepare QP attributes
struct ibv_qp_attr attr = {};
attr.qp_state = IBV_QPS_RTR;
attr.path_mtu = mtu;
attr.rq_psn = 0;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer = 12;
if (isRoCE) {
attr.ah_attr.is_global = 1;
attr.ah_attr.grh.dgid.global.subnet_prefix = gid.global.subnet_prefix;
attr.ah_attr.grh.dgid.global.interface_id = gid.global.interface_id;
attr.ah_attr.grh.flow_label = 0;
attr.ah_attr.grh.sgid_index = gidIndex;
attr.ah_attr.grh.hop_limit = 255;
} else {
attr.ah_attr.is_global = 0;
attr.ah_attr.dlid = dlid;
}
attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num = port;
attr.dest_qp_num = dqpn;
// Modify the QP
int ret = ibv_modify_qp(qp, &attr,
IBV_QP_STATE |
IBV_QP_AV |
IBV_QP_PATH_MTU |
IBV_QP_DEST_QPN |
IBV_QP_RQ_PSN |
IBV_QP_MAX_DEST_RD_ATOMIC |
IBV_QP_MIN_RNR_TIMER);
if (ret != 0)
return {ERR_FATAL, "Error during QP RTR. IB Verbs Error code: %d", ret};
return ERR_NONE;
}
// Transition QueuePair to Ready to Send state
static ErrResult TransitionQpToRts(struct ibv_qp *qp)
{
struct ibv_qp_attr attr = {};
attr.qp_state = IBV_QPS_RTS;
attr.sq_psn = 0;
attr.timeout = 14;
attr.retry_cnt = 7;
attr.rnr_retry = 7;
attr.max_rd_atomic = 1;
int ret = ibv_modify_qp(qp, &attr,
IBV_QP_STATE |
IBV_QP_TIMEOUT |
IBV_QP_RETRY_CNT |
IBV_QP_RNR_RETRY |
IBV_QP_SQ_PSN |
IBV_QP_MAX_QP_RD_ATOMIC);
if (ret != 0)
return {ERR_FATAL, "Error during QP RTS. IB Verbs Error code: %d", ret};
return ERR_NONE;
}
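// Illustrative sketch (not part of the library): the RC queue-pair bring-up sequence used by
// PrepareNicTransferResources below. Each endpoint walks RESET -> INIT -> RTR -> RTS using the
// peer's LID/GID and QP number:
//
//   ERR_CHECK(CreateQueuePair(cfg, pd, cq, qp));                 // created in RESET state
//   ERR_CHECK(InitQueuePair(qp, port, accessFlags));             // -> INIT
//   ERR_CHECK(TransitionQpToRtr(qp, remoteLid, remoteQpNum,      // -> RTR (needs remote info)
//                               remoteGid, gidIndex, port, isRoCE, mtu));
//   ERR_CHECK(TransitionQpToRts(qp));                            // -> RTS (ready to post sends)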
static ErrResult PrepareNicTransferResources(ConfigOptions const& cfg,
ExeDevice const& srcExeDevice,
Transfer const& t,
TransferResources& rss)
{
// Switch to the closest NUMA node to this NIC
int numaNode = GetIbvDeviceList()[srcExeDevice.exeIndex].numaNode;
if (numaNode != -1)
numa_run_on_node(numaNode);
int const port = cfg.nic.ibPort;
// Figure out destination NIC (Accounts for possible remap due to use of EXE_NIC_NEAREST)
ExeDevice dstExeDevice;
ERR_CHECK(GetActualExecutor(cfg, {t.exeDevice.exeType, t.exeSubIndex}, dstExeDevice));
rss.srcNicIndex = srcExeDevice.exeIndex;
rss.dstNicIndex = dstExeDevice.exeIndex;
rss.qpCount = t.numSubExecs;
// Check for valid NICs and active ports
int numNics = GetNumExecutors(EXE_NIC);
if (rss.srcNicIndex < 0 || rss.srcNicIndex >= numNics)
return {ERR_FATAL, "SRC NIC index is out of range (%d)", rss.srcNicIndex};
if (rss.dstNicIndex < 0 || rss.dstNicIndex >= numNics)
return {ERR_FATAL, "DST NIC index is out of range (%d)", rss.dstNicIndex};
if (!GetIbvDeviceList()[rss.srcNicIndex].hasActivePort)
return {ERR_FATAL, "SRC NIC %d is not active\n", rss.srcNicIndex};
if (!GetIbvDeviceList()[rss.dstNicIndex].hasActivePort)
return {ERR_FATAL, "DST NIC %d is not active\n", rss.dstNicIndex};
// Queue pair flags
unsigned int rdmaAccessFlags = (IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_READ |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_ATOMIC);
unsigned int rdmaMemRegFlags = rdmaAccessFlags;
if (cfg.nic.useRelaxedOrder) rdmaMemRegFlags |= IBV_ACCESS_RELAXED_ORDERING;
// Open NIC contexts
IBV_PTR_CALL(rss.srcContext, ibv_open_device, GetIbvDeviceList()[rss.srcNicIndex].devicePtr);
IBV_PTR_CALL(rss.dstContext, ibv_open_device, GetIbvDeviceList()[rss.dstNicIndex].devicePtr);
// Open protection domains
IBV_PTR_CALL(rss.srcProtect, ibv_alloc_pd, rss.srcContext);
IBV_PTR_CALL(rss.dstProtect, ibv_alloc_pd, rss.dstContext);
// Register memory region
IBV_PTR_CALL(rss.srcMemRegion, ibv_reg_mr, rss.srcProtect, rss.srcMem[0], rss.numBytes, rdmaMemRegFlags);
IBV_PTR_CALL(rss.dstMemRegion, ibv_reg_mr, rss.dstProtect, rss.dstMem[0], rss.numBytes, rdmaMemRegFlags);
// Create completion queues
IBV_PTR_CALL(rss.srcCompQueue, ibv_create_cq, rss.srcContext, cfg.nic.queueSize, NULL, NULL, 0);
IBV_PTR_CALL(rss.dstCompQueue, ibv_create_cq, rss.dstContext, cfg.nic.queueSize, NULL, NULL, 0);
// Get port attributes
IBV_CALL(ibv_query_port, rss.srcContext, port, &rss.srcPortAttr);
IBV_CALL(ibv_query_port, rss.dstContext, port, &rss.dstPortAttr);
if (rss.srcPortAttr.link_layer != rss.dstPortAttr.link_layer)
return {ERR_FATAL, "SRC NIC (%d) and DST NIC (%d) do not have the same link layer", rss.srcNicIndex, rss.dstNicIndex};
// Prepare GID index
int srcGidIndex = cfg.nic.ibGidIndex;
int dstGidIndex = cfg.nic.ibGidIndex;
// Check for RDMA over Converged Ethernet (RoCE) and update GID index appropriately
bool isRoCE = (rss.srcPortAttr.link_layer == IBV_LINK_LAYER_ETHERNET);
if (isRoCE) {
// Try to auto-detect the GID index
std::pair<int, std::string> srcGidInfo (srcGidIndex, "");
std::pair<int, std::string> dstGidInfo (dstGidIndex, "");
ERR_CHECK(GetGidIndex(rss.srcContext, rss.srcPortAttr.gid_tbl_len, cfg.nic.ibPort, srcGidInfo));
ERR_CHECK(GetGidIndex(rss.dstContext, rss.dstPortAttr.gid_tbl_len, cfg.nic.ibPort, dstGidInfo));
srcGidIndex = srcGidInfo.first;
dstGidIndex = dstGidInfo.first;
IBV_CALL(ibv_query_gid, rss.srcContext, port, srcGidIndex, &rss.srcGid);
IBV_CALL(ibv_query_gid, rss.dstContext, port, dstGidIndex, &rss.dstGid);
}
// Prepare queue pairs and send elements
rss.srcQueuePairs.resize(rss.qpCount);
rss.dstQueuePairs.resize(rss.qpCount);
rss.sgePerQueuePair.resize(rss.qpCount);
rss.sendWorkRequests.resize(rss.qpCount);
for (int i = 0; i < rss.qpCount; ++i) {
// Create scatter-gather element for the portion of memory assigned to this queue pair
ibv_sge sg = {};
sg.addr = (uint64_t)rss.subExecParamCpu[i].src[0];
sg.length = rss.subExecParamCpu[i].N * sizeof(float);
sg.lkey = rss.srcMemRegion->lkey;
rss.sgePerQueuePair[i] = sg;
// Create send work request
ibv_send_wr wr = {};
wr.wr_id = i;
wr.sg_list = &rss.sgePerQueuePair[i];
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uint64_t)rss.subExecParamCpu[i].dst[0];
wr.wr.rdma.rkey = rss.dstMemRegion->rkey;
rss.sendWorkRequests[i] = wr;
// Create SRC/DST queue pairs
ERR_CHECK(CreateQueuePair(cfg, rss.srcProtect, rss.srcCompQueue, rss.srcQueuePairs[i]));
ERR_CHECK(CreateQueuePair(cfg, rss.dstProtect, rss.dstCompQueue, rss.dstQueuePairs[i]));
// Initialize SRC/DST queue pairs
ERR_CHECK(InitQueuePair(rss.srcQueuePairs[i], port, rdmaAccessFlags));
ERR_CHECK(InitQueuePair(rss.dstQueuePairs[i], port, rdmaAccessFlags));
// Transition the SRC queue pair to ready to receive
ERR_CHECK(TransitionQpToRtr(rss.srcQueuePairs[i], rss.dstPortAttr.lid,
rss.dstQueuePairs[i]->qp_num, rss.dstGid,
dstGidIndex, port, isRoCE,
rss.srcPortAttr.active_mtu));
// Transition the SRC queue pair to ready to send
ERR_CHECK(TransitionQpToRts(rss.srcQueuePairs[i]));
// Transition the DST queue pair to ready to receive
ERR_CHECK(TransitionQpToRtr(rss.dstQueuePairs[i], rss.srcPortAttr.lid,
rss.srcQueuePairs[i]->qp_num, rss.srcGid,
srcGidIndex, port, isRoCE,
rss.dstPortAttr.active_mtu));
// Transition the DST queue pair to ready to send
ERR_CHECK(TransitionQpToRts(rss.dstQueuePairs[i]));
}
return ERR_NONE;
}
static ErrResult TeardownNicTransferResources(TransferResources& rss)
{
// Deregister memory regions
IBV_CALL(ibv_dereg_mr, rss.srcMemRegion);
IBV_CALL(ibv_dereg_mr, rss.dstMemRegion);
// Destroy queue pairs
for (auto srcQueuePair : rss.srcQueuePairs)
IBV_CALL(ibv_destroy_qp, srcQueuePair);
rss.srcQueuePairs.clear();
for (auto dstQueuePair : rss.dstQueuePairs)
IBV_CALL(ibv_destroy_qp, dstQueuePair);
rss.dstQueuePairs.clear();
// Destroy completion queues
IBV_CALL(ibv_destroy_cq, rss.srcCompQueue);
IBV_CALL(ibv_destroy_cq, rss.dstCompQueue);
// Deallocate protection domains
IBV_CALL(ibv_dealloc_pd, rss.srcProtect);
IBV_CALL(ibv_dealloc_pd, rss.dstProtect);
// Destroy context
IBV_CALL(ibv_close_device, rss.srcContext);
IBV_CALL(ibv_close_device, rss.dstContext);
return ERR_NONE;
}
#endif // NIC_EXEC_ENABLED
// Data validation-related functions
//========================================================================================
@@ -1244,17 +2206,17 @@ namespace {
float* output;
size_t initOffset = cfg.data.byteOffset / sizeof(float);
for (auto rss : transferResources) {
int transferIdx = rss->transferIdx;
Transfer const& t = transfers[transferIdx];
size_t N = t.numBytes / sizeof(float);
float const* expected = dstReference[t.srcs.size()].data();
for (int dstIdx = 0; dstIdx < rss->dstMem.size(); dstIdx++) {
if (IsCpuMemType(t.dsts[dstIdx].memType) || cfg.data.validateDirect) {
output = (rss->dstMem[dstIdx]) + initOffset;
} else {
ERR_CHECK(hipMemcpy(outputBuffer.data(), (rss->dstMem[dstIdx]) + initOffset, t.numBytes, hipMemcpyDefault));
ERR_CHECK(hipDeviceSynchronize());
output = outputBuffer.data();
}
@@ -1282,7 +2244,7 @@ namespace {
// Initializes counters
static ErrResult PrepareSubExecParams(ConfigOptions const& cfg,
Transfer const& transfer,
TransferResources& rss)
{
// Each subExecutor needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each subExecutor works on for this Transfer
@@ -1296,15 +2258,15 @@ namespace {
int const maxSubExecToUse = std::min((size_t)(N + targetMultiple - 1) / targetMultiple,
(size_t)transfer.numSubExecs);
vector<SubExecParam>& subExecParam = rss.subExecParamCpu;
subExecParam.clear();
subExecParam.resize(transfer.numSubExecs);
size_t assigned = 0;
for (int i = 0; i < transfer.numSubExecs; ++i) {
SubExecParam& p = subExecParam[i];
p.numSrcs = rss.srcMem.size();
p.numDsts = rss.dstMem.size();
p.startCycle = 0;
p.stopCycle = 0;
p.hwId = 0;
@@ -1315,8 +2277,8 @@ namespace {
p.N = N;
p.teamSize = transfer.numSubExecs;
p.teamIdx = i;
for (int iSrc = 0; iSrc < p.numSrcs; ++iSrc) p.src[iSrc] = rss.srcMem[iSrc] + initOffset;
for (int iDst = 0; iDst < p.numDsts; ++iDst) p.dst[iDst] = rss.dstMem[iDst] + initOffset;
} else {
// Otherwise, each subexecutor works on separate subarrays
int const subExecLeft = std::max(0, maxSubExecToUse - i);
@@ -1326,8 +2288,8 @@ namespace {
p.N = subExecLeft ? std::min(leftover, ((roundedN / subExecLeft) * targetMultiple)) : 0;
p.teamSize = 1;
p.teamIdx = 0;
for (int iSrc = 0; iSrc < p.numSrcs; ++iSrc) p.src[iSrc] = rss.srcMem[iSrc] + initOffset + assigned;
for (int iDst = 0; iDst < p.numDsts; ++iDst) p.dst[iDst] = rss.dstMem[iDst] + initOffset + assigned;
assigned += p.N;
}
@@ -1348,7 +2310,7 @@ namespace {
}
// Clear counters
rss.totalDurationMsec = 0.0;
return ERR_NONE;
}
@@ -1363,34 +2325,34 @@ namespace {
exeInfo.totalDurationMsec = 0.0;
// Loop over each transfer this executor is involved in
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
rss.numBytes = t.numBytes;
// Allocate source memory
rss.srcMem.resize(t.srcs.size());
for (int iSrc = 0; iSrc < t.srcs.size(); ++iSrc) {
MemDevice const& srcMemDevice = t.srcs[iSrc];
// Ensure executing GPU can access source memory
if (IsGpuExeType(exeDevice.exeType) && IsGpuMemType(srcMemDevice.memType) &&
srcMemDevice.memIndex != exeDevice.exeIndex) {
ERR_CHECK(EnablePeerAccess(exeDevice.exeIndex, srcMemDevice.memIndex));
}
ERR_CHECK(AllocateMemory(srcMemDevice, t.numBytes + cfg.data.byteOffset, (void**)&rss.srcMem[iSrc]));
}
// Allocate destination memory
rss.dstMem.resize(t.dsts.size());
for (int iDst = 0; iDst < t.dsts.size(); ++iDst) {
MemDevice const& dstMemDevice = t.dsts[iDst];
// Ensure executing GPU can access destination memory
if (IsGpuExeType(exeDevice.exeType) && IsGpuMemType(dstMemDevice.memType) &&
dstMemDevice.memIndex != exeDevice.exeIndex) {
ERR_CHECK(EnablePeerAccess(exeDevice.exeIndex, dstMemDevice.memIndex));
}
ERR_CHECK(AllocateMemory(dstMemDevice, t.numBytes + cfg.data.byteOffset, (void**)&rss.dstMem[iDst]));
}
if (exeDevice.exeType == EXE_GPU_DMA && (t.exeSubIndex != -1 || cfg.dma.useHsaCopy)) {
@@ -1398,19 +2360,22 @@ namespace {
// Collect HSA agent information
hsa_amd_pointer_info_t info;
info.size = sizeof(info);
ERR_CHECK(hsa_amd_pointer_info(rss.dstMem[0], &info, NULL, NULL, NULL));
rss.dstAgent = info.agentOwner;
ERR_CHECK(hsa_amd_pointer_info(rss.srcMem[0], &info, NULL, NULL, NULL));
rss.srcAgent = info.agentOwner;
// Create HSA completion signal
ERR_CHECK(hsa_signal_create(1, 0, NULL, &rss.signal));
if (t.exeSubIndex != -1)
rss.sdmaEngineId = (hsa_amd_sdma_engine_id_t)(1U << t.exeSubIndex);
#endif
}
// Prepare subexecutor parameters
ERR_CHECK(PrepareSubExecParams(cfg, t, rss));
}
// Prepare additional requirements for GPU-based executors
@@ -1469,13 +2434,47 @@ namespace {
exeDevice.exeIndex));
#endif
int transferOffset = 0;
if (cfg.gfx.useMultiStream || cfg.gfx.blockOrder == 0) {
// Threadblocks are ordered sequentially one transfer at a time
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
rss.subExecParamGpuPtr = exeInfo.subExecParamGpu + transferOffset;
for (auto p : rss.subExecParamCpu) {
rss.subExecIdx.push_back(exeInfo.subExecParamCpu.size());
exeInfo.subExecParamCpu.push_back(p);
transferOffset++;
}
}
} else if (cfg.gfx.blockOrder == 1) {
// Interleave threadblocks of different Transfers
for (int subExecIdx = 0; exeInfo.subExecParamCpu.size() < exeInfo.totalSubExecs; ++subExecIdx) {
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
if (subExecIdx < t.numSubExecs) {
rss.subExecIdx.push_back(exeInfo.subExecParamCpu.size());
exeInfo.subExecParamCpu.push_back(rss.subExecParamCpu[subExecIdx]);
}
}
}
} else if (cfg.gfx.blockOrder == 2) {
// Build randomized threadblock list
std::vector<std::pair<int,int>> indices;
for (int i = 0; i < exeInfo.resources.size(); i++) {
auto const& rss = exeInfo.resources[i];
Transfer const& t = transfers[rss.transferIdx];
for (int j = 0; j < t.numSubExecs; j++)
indices.push_back(std::make_pair(i,j));
}
std::random_device rd;
std::default_random_engine gen(rd());
std::shuffle(indices.begin(), indices.end(), gen);
// Build randomized threadblock list
for (auto p : indices) {
auto& rss = exeInfo.resources[p.first];
rss.subExecIdx.push_back(exeInfo.subExecParamCpu.size());
exeInfo.subExecParamCpu.push_back(rss.subExecParamCpu[p.second]);
}
}
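// Illustrative example (not part of the library, hypothetical sizes): with two Transfers using
// 3 and 2 subexecutors, the threadblock launch order (transferIdx, subExecIdx) above becomes
//   blockOrder == 0 (sequential) : (0,0)(0,1)(0,2)(1,0)(1,1)
//   blockOrder == 1 (interleaved): (0,0)(1,0)(0,1)(1,1)(0,2)
//   blockOrder == 2 (random)     : a shuffled permutation of the five pairs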
@@ -1488,6 +2487,17 @@ namespace {
ERR_CHECK(hipDeviceSynchronize());
}
// Prepare for NIC-based executors
if (IsNicExeType(exeDevice.exeType)) {
#ifdef NIC_EXEC_ENABLED
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
ERR_CHECK(PrepareNicTransferResources(cfg, exeDevice, t, rss));
}
#else
return {ERR_FATAL, "RDMA executor is not supported"};
#endif
}
return ERR_NONE;
}
@@ -1501,23 +2511,30 @@ namespace {
ExeInfo& exeInfo)
{
// Loop over each transfer this executor is involved in
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
// Deallocate source memory
for (int iSrc = 0; iSrc < t.srcs.size(); ++iSrc) {
ERR_CHECK(DeallocateMemory(t.srcs[iSrc].memType, rss.srcMem[iSrc], t.numBytes + cfg.data.byteOffset));
}
// Deallocate destination memory
for (int iDst = 0; iDst < t.dsts.size(); ++iDst) {
ERR_CHECK(DeallocateMemory(t.dsts[iDst].memType, rss.dstMem[iDst], t.numBytes + cfg.data.byteOffset));
}
// Destroy HSA signal for DMA executor
#if !defined(__NVCC__)
if (exeDevice.exeType == EXE_GPU_DMA && (t.exeSubIndex != -1 || cfg.dma.useHsaCopy)) {
ERR_CHECK(hsa_signal_destroy(rss.signal));
}
#endif
// Destroy NIC related resources
#ifdef NIC_EXEC_ENABLED
if (IsNicExeType(exeDevice.exeType)) {
ERR_CHECK(TeardownNicTransferResources(rss));
}
#endif
}
@@ -1550,68 +2567,69 @@ namespace {
//========================================================================================
// Kernel for CPU execution (run by a single subexecutor)
static void CpuReduceKernel(SubExecParam const& p, int numSubIterations)
{
if (p.N == 0) return;
int subIteration = 0;
do {
int const& numSrcs = p.numSrcs;
int const& numDsts = p.numDsts;
if (numSrcs == 0) {
for (int i = 0; i < numDsts; ++i) {
memset(p.dst[i], MEMSET_CHAR, p.N * sizeof(float));
//for (int j = 0; j < p.N; j++) p.dst[i][j] = MEMSET_VAL;
}
} else if (numSrcs == 1) {
float const* __restrict__ src = p.src[0];
if (numDsts == 0) {
float sum = 0.0;
for (int j = 0; j < p.N; j++)
sum += p.src[0][j];
// Add a dummy check to ensure the read is not optimized out
if (sum != sum) {
printf("[ERROR] Nan detected\n");
}
} else {
for (int i = 0; i < numDsts; ++i)
memcpy(p.dst[i], src, p.N * sizeof(float));
}
} else {
float sum = 0.0f;
for (int j = 0; j < p.N; j++) {
sum = p.src[0][j];
for (int i = 1; i < numSrcs; i++) sum += p.src[i][j];
for (int i = 0; i < numDsts; i++) p.dst[i][j] = sum;
}
}
} while (++subIteration != numSubIterations);
}
// Execution of a single CPU Transfer
static void ExecuteCpuTransfer(int const iteration,
ConfigOptions const& cfg,
int const exeIndex,
TransferResources& rss)
{
auto cpuStart = std::chrono::high_resolution_clock::now();
vector<std::thread> childThreads;
for (auto const& subExecParam : rss.subExecParamCpu)
childThreads.emplace_back(std::thread(CpuReduceKernel, std::cref(subExecParam), cfg.general.numSubIterations));
for (auto& subExecThread : childThreads)
subExecThread.join();
childThreads.clear();
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
double deltaMsec = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count() * 1000.0;
if (iteration >= 0) {
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration)
rss.perIterMsec.push_back(deltaMsec);
}
}
@@ -1625,12 +2643,12 @@ namespace {
auto cpuStart = std::chrono::high_resolution_clock::now();
vector<std::thread> asyncTransfers;
for (auto& rss : exeInfo.resources) {
asyncTransfers.emplace_back(std::thread(ExecuteCpuTransfer,
iteration,
std::cref(cfg),
exeIndex,
std::ref(rss)));
}
for (auto& asyncTransfer : asyncTransfers)
asyncTransfer.join();
@@ -1642,6 +2660,88 @@ namespace {
return ERR_NONE;
}
#ifdef NIC_EXEC_ENABLED
// Execution of a single NIC Transfer
static ErrResult ExecuteNicTransfer(int const iteration,
ConfigOptions const& cfg,
int const exeIndex,
TransferResources& rss)
{
// Loop over each of the queue pairs and post the send
ibv_send_wr* badWorkReq;
for (int qpIndex = 0; qpIndex < rss.qpCount; qpIndex++) {
int error = ibv_post_send(rss.srcQueuePairs[qpIndex], &rss.sendWorkRequests[qpIndex], &badWorkReq);
if (error)
return {ERR_FATAL, "Transfer %d: Error when calling ibv_post_send for QP %d Error code %d\n",
rss.transferIdx, qpIndex, error};
}
return ERR_NONE;
}
// Execution of a single NIC executor
static ErrResult RunNicExecutor(int const iteration,
ConfigOptions const& cfg,
int const exeIndex,
ExeInfo& exeInfo)
{
// Switch to the closest NUMA node to this NIC
if (cfg.nic.useNuma) {
int numaNode = GetIbvDeviceList()[exeIndex].numaNode;
if (numaNode != -1)
numa_run_on_node(numaNode);
}
int subIterations = 0;
do {
auto cpuStart = std::chrono::high_resolution_clock::now();
size_t completedTransfers = 0;
auto transferCount = exeInfo.resources.size();
std::vector<uint8_t> receivedQPs(transferCount);
std::vector<std::chrono::high_resolution_clock::time_point> transferTimers(transferCount);
// post the sends
for (auto i = 0; i < transferCount; i++) {
transferTimers[i] = std::chrono::high_resolution_clock::now();
ERR_CHECK(ExecuteNicTransfer(iteration, cfg, exeIndex, exeInfo.resources[i]));
}
// poll for completions
do {
for (auto i = 0; i < transferCount; i++) {
if(receivedQPs[i] < exeInfo.resources[i].qpCount) {
auto& rss = exeInfo.resources[i];
// Poll the completion queue until all queue pairs are complete
// The order of completion doesn't matter because this completion queue is dedicated to this Transfer
ibv_wc wc;
int nc = ibv_poll_cq(rss.srcCompQueue, 1, &wc);
if (nc > 0) {
receivedQPs[i]++;
if (wc.status != IBV_WC_SUCCESS) {
return {ERR_FATAL, "Transfer %d: Received unsuccessful work completion", rss.transferIdx};
}
} else if (nc < 0) {
return {ERR_FATAL, "Transfer %d: Received negative work completion", rss.transferIdx};
}
if(receivedQPs[i] == rss.qpCount) {
auto cpuDelta = std::chrono::high_resolution_clock::now() - transferTimers[i];
double deltaMsec = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count() * 1000.0;
if (iteration >= 0) {
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration)
rss.perIterMsec.push_back(deltaMsec);
}
completedTransfers++;
}
}
}
} while(completedTransfers < transferCount);
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
double deltaMsec = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count() * 1000.0;
if (iteration >= 0)
exeInfo.totalDurationMsec += deltaMsec;
} while(++subIterations < cfg.general.numSubIterations);
return ERR_NONE;
}
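// Illustrative example (not part of the library, hypothetical sizes): with 2 Transfers of 4
// queue pairs each, RunNicExecutor posts 8 RDMA_WRITE work requests up front, then polls each
// Transfer's completion queue until receivedQPs[i] reaches 4. A Transfer's per-iteration time
// is taken when its last completion arrives, while exeInfo.totalDurationMsec covers the batch.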
#endif
// GFX Executor-related functions
//========================================================================================
@@ -1674,13 +2774,97 @@ namespace {
// Helper function for memset
template <typename T> __device__ __forceinline__ T MemsetVal();
template <> __device__ __forceinline__ float MemsetVal(){ return MEMSET_VAL; };
template <> __device__ __forceinline__ float2 MemsetVal(){ return make_float2(MEMSET_VAL,
MEMSET_VAL); };
template <> __device__ __forceinline__ float4 MemsetVal(){ return make_float4(MEMSET_VAL,
MEMSET_VAL,
MEMSET_VAL,
MEMSET_VAL); }
// Helper function for temporal/non-temporal reads / writes
#define TEMPORAL_NONE 0
#define TEMPORAL_LOAD 1
#define TEMPORAL_STORE 2
#define TEMPORAL_BOTH 3
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Load(float const* src, float& dst) {
if (TEMPORAL_MODE & TEMPORAL_LOAD) {
#if !defined(__NVCC__)
dst = __builtin_nontemporal_load(src);
#endif
} else {
dst = *src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Load(float2 const* src, float2& dst) {
if (TEMPORAL_MODE & TEMPORAL_LOAD) {
#if !defined(__NVCC__)
dst.x = __builtin_nontemporal_load(&(src->x));
dst.y = __builtin_nontemporal_load(&(src->y));
#endif
} else {
dst = *src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Load(float4 const* src, float4& dst) {
if (TEMPORAL_MODE & TEMPORAL_LOAD) {
#if !defined(__NVCC__)
dst.x = __builtin_nontemporal_load(&(src->x));
dst.y = __builtin_nontemporal_load(&(src->y));
dst.z = __builtin_nontemporal_load(&(src->z));
dst.w = __builtin_nontemporal_load(&(src->w));
#endif
} else {
dst = *src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Store(float const& src, float* dst) {
if (TEMPORAL_MODE & TEMPORAL_STORE) {
#if !defined(__NVCC__)
__builtin_nontemporal_store(src, dst);
#endif
} else {
*dst = src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Store(float2 const& src, float2* dst) {
if (TEMPORAL_MODE & TEMPORAL_STORE) {
#if !defined(__NVCC__)
__builtin_nontemporal_store(src.x, &(dst->x));
__builtin_nontemporal_store(src.y, &(dst->y));
#endif
} else {
*dst = src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Store(float4 const& src, float4* dst) {
if (TEMPORAL_MODE & TEMPORAL_STORE) {
#if !defined(__NVCC__)
__builtin_nontemporal_store(src.x, &(dst->x));
__builtin_nontemporal_store(src.y, &(dst->y));
__builtin_nontemporal_store(src.z, &(dst->z));
__builtin_nontemporal_store(src.w, &(dst->w));
#endif
} else {
*dst = src;
}
}
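// Illustrative note (not part of the library): TEMPORAL_MODE is a 2-bit mask, so
// TEMPORAL_BOTH (== TEMPORAL_LOAD | TEMPORAL_STORE) routes both Load<> and Store<> through the
// __builtin_nontemporal_* builtins on AMD compilers, while TEMPORAL_NONE keeps ordinary memory
// accesses; under __NVCC__ the nontemporal builtins are unavailable and the guarded calls
// compile to nothing, so non-temporal modes are only meaningful on AMD builds.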
// Kernel for GFX execution
template <typename PACKED_FLOAT, int BLOCKSIZE, int UNROLL, int TEMPORAL_MODE>
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel(SubExecParam* params, int waveOrder, int numSubIterations)
{
@@ -1699,10 +2883,10 @@ namespace {
// Collect data information
int32_t const numSrcs = p.numSrcs;
int32_t const numDsts = p.numDsts;
PACKED_FLOAT const* __restrict__ srcFloatPacked[MAX_SRCS];
PACKED_FLOAT* __restrict__ dstFloatPacked[MAX_DSTS];
for (int i = 0; i < numSrcs; i++) srcFloatPacked[i] = (PACKED_FLOAT const*)p.src[i];
for (int i = 0; i < numDsts; i++) dstFloatPacked[i] = (PACKED_FLOAT*)p.dst[i];
// Operate on wavefront granularity
int32_t const nTeams = p.teamSize; // Number of threadblocks working together on this subarray
@@ -1711,7 +2895,7 @@ namespace {
int32_t const waveIdx = threadIdx.x / warpSize; // Index of this wavefront within the threadblock
int32_t const tIdx = threadIdx.x % warpSize; // Thread index within wavefront
size_t const numPackedFloat = p.N / (sizeof(PACKED_FLOAT)/sizeof(float));
int32_t teamStride, waveStride, unrlStride, teamStride2, waveStride2;
switch (waveOrder) {
@@ -1725,72 +2909,84 @@ namespace {
int subIterations = 0;
while (1) {
// First loop: Each wavefront in the team works on UNROLL PACKED_FLOAT per thread
size_t const loop1Stride = nTeams * nWaves * UNROLL * warpSize;
size_t const loop1Limit = numPackedFloat / loop1Stride * loop1Stride;
{
PACKED_FLOAT val[UNROLL];
PACKED_FLOAT tmp[UNROLL];
if (numSrcs == 0) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
val[u] = MemsetVal<PACKED_FLOAT>();
}
for (size_t idx = (teamIdx * teamStride + waveIdx * waveStride) * warpSize + tIdx; idx < loop1Limit; idx += loop1Stride) {
// Read sources into memory and accumulate in registers
if (numSrcs) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
Load<TEMPORAL_MODE>(&srcFloatPacked[0][idx + u * unrlStride * warpSize], val[u]);
for (int s = 1; s < numSrcs; s++) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
Load<TEMPORAL_MODE>(&srcFloatPacked[s][idx + u * unrlStride * warpSize], tmp[u]);
#pragma unroll
for (int u = 0; u < UNROLL; u++)
val[u] += tmp[u];
}
}
// Write accumulation to all outputs
for (int d = 0; d < numDsts; d++) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
Store<TEMPORAL_MODE>(val[u], &dstFloatPacked[d][idx + u * unrlStride * warpSize]);
}
}
}
// Second loop: Deal with remaining PACKED_FLOAT
{
if (loop1Limit < numPackedFloat) {
PACKED_FLOAT val, tmp;
if (numSrcs == 0) val = MemsetVal<PACKED_FLOAT>();
size_t const loop2Stride = nTeams * nWaves * warpSize;
for (size_t idx = loop1Limit + (teamIdx * teamStride2 + waveIdx * waveStride2) * warpSize + tIdx;
idx < numPackedFloat; idx += loop2Stride) {
if (numSrcs) {
Load<TEMPORAL_MODE>(&srcFloatPacked[0][idx], val);
for (int s = 1; s < numSrcs; s++) {
Load<TEMPORAL_MODE>(&srcFloatPacked[s][idx], tmp);
val += tmp;
}
}
for (int d = 0; d < numDsts; d++)
Store<TEMPORAL_MODE>(val, &dstFloatPacked[d][idx]);
}
}
}
// Third loop; Deal with remaining floats
{
if (numPackedFloat * (sizeof(PACKED_FLOAT)/sizeof(float)) < p.N) {
float val, tmp;
if (numSrcs == 0) val = MemsetVal<float>();
size_t const loop3Stride = nTeams * nWaves * warpSize;
for (size_t idx = numPackedFloat * (sizeof(PACKED_FLOAT)/sizeof(float)) + (teamIdx * teamStride2 + waveIdx * waveStride2) * warpSize + tIdx; idx < p.N; idx += loop3Stride) {
if (numSrcs) {
Load<TEMPORAL_MODE>(&p.src[0][idx], val);
for (int s = 1; s < numSrcs; s++) {
Load<TEMPORAL_MODE>(&p.src[s][idx], tmp);
val += tmp;
}
}
for (int d = 0; d < numDsts; d++)
Store<TEMPORAL_MODE>(val, &p.dst[d][idx]);
}
}
}
@@ -1809,19 +3005,30 @@ namespace {
}
}
#define GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, DWORD) \
{GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_NONE>, \
GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_LOAD>, \
GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_STORE>, \
GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_BOTH>}
#define GPU_KERNEL_DWORD_DECL(BLOCKSIZE, UNROLL) \
{GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, float), \
GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, float2), \
GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, float4)}
#define GPU_KERNEL_UNROLL_DECL(BLOCKSIZE) \
{GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 1), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 2), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 3), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 4), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 5), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 6), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 7), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 8)}
// Table of all GPU Reduction kernel functions (templated blocksize / unroll / dword size)
typedef void (*GpuKernelFuncPtr)(SubExecParam*, int, int);
GpuKernelFuncPtr GpuKernelTable[MAX_WAVEGROUPS][MAX_UNROLL][3][4] =
{
GPU_KERNEL_UNROLL_DECL(64),
GPU_KERNEL_UNROLL_DECL(128),
@@ -1833,6 +3040,8 @@ namespace {
GPU_KERNEL_UNROLL_DECL(512)
};
#undef GPU_KERNEL_UNROLL_DECL
#undef GPU_KERNEL_DWORD_DECL
#undef GPU_KERNEL_TEMPORAL_DECL
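// Illustrative example (not part of the library): the table is indexed as
//   GpuKernelTable[blockSize/64 - 1][unrollFactor - 1][wordSizeIdx][temporalMode]
// so a 256-thread block with unroll 4, float4 words (wordSizeIdx == 2) and non-temporal
// loads+stores selects GpuReduceKernel<float4, 256, 4, TEMPORAL_BOTH>.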
// Execute a single GPU Transfer (when using 1 stream per Transfer)
static ErrResult ExecuteGpuTransfer(int const iteration,
@@ -1841,27 +3050,28 @@ namespace {
hipEvent_t const stopEvent,
int const xccDim,
ConfigOptions const& cfg,
TransferResources& rss)
{
auto cpuStart = std::chrono::high_resolution_clock::now();
int numSubExecs = rss.subExecParamCpu.size();
dim3 const gridSize(xccDim, numSubExecs, 1);
dim3 const blockSize(cfg.gfx.blockSize, 1);
int wordSizeIdx = cfg.gfx.wordSize == 1 ? 0 :
cfg.gfx.wordSize == 2 ? 1 :
2;
auto gpuKernel = GpuKernelTable[cfg.gfx.blockSize/64 - 1][cfg.gfx.unrollFactor - 1][wordSizeIdx][cfg.gfx.temporalMode];
#if defined(__NVCC__)
if (startEvent != NULL)
ERR_CHECK(hipEventRecord(startEvent, stream));
gpuKernel<<<gridSize, blockSize, 0, stream>>>(rss.subExecParamGpuPtr, cfg.gfx.waveOrder, cfg.general.numSubIterations);
if (stopEvent != NULL)
ERR_CHECK(hipEventRecord(stopEvent, stream));
#else
hipExtLaunchKernelGGL(gpuKernel, gridSize, blockSize, 0, stream, startEvent, stopEvent,
0, rss.subExecParamGpuPtr, cfg.gfx.waveOrder, cfg.general.numSubIterations);
#endif
ERR_CHECK(hipStreamSynchronize(stream));
@@ -1876,15 +3086,15 @@ namespace {
ERR_CHECK(hipEventElapsedTime(&gpuDeltaMsec, startEvent, stopEvent));
deltaMsec = gpuDeltaMsec;
}
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration) {
rss.perIterMsec.push_back(deltaMsec);
std::set<std::pair<int,int>> CUs;
for (int i = 0; i < numSubExecs; i++) {
CUs.insert(std::make_pair(rss.subExecParamGpuPtr[i].xccId,
GetId(rss.subExecParamGpuPtr[i].hwId)));
}
rss.perIterCUs.push_back(CUs);
}
}
return ERR_NONE;
@@ -1924,19 +3134,19 @@ namespace {
dim3 const blockSize(cfg.gfx.blockSize, 1);
hipStream_t stream = exeInfo.streams[0];
int wordSizeIdx = cfg.gfx.wordSize == 1 ? 0 :
cfg.gfx.wordSize == 2 ? 1 :
2;
auto gpuKernel = GpuKernelTable[cfg.gfx.blockSize/64 - 1][cfg.gfx.unrollFactor - 1][wordSizeIdx][cfg.gfx.temporalMode];
#if defined(__NVCC__)
if (cfg.gfx.useHipEvents)
ERR_CHECK(hipEventRecord(exeInfo.startEvents[0], stream));
gpuKernel<<<gridSize, blockSize, 0 , stream>>>(exeInfo.subExecParamGpu, cfg.gfx.waveOrder, cfg.general.numSubIterations);
if (cfg.gfx.useHipEvents)
ERR_CHECK(hipEventRecord(exeInfo.stopEvents[0], stream));
#else
hipExtLaunchKernelGGL(gpuKernel, gridSize, blockSize, 0, stream,
cfg.gfx.useHipEvents ? exeInfo.startEvents[0] : NULL,
cfg.gfx.useHipEvents ? exeInfo.stopEvents[0] : NULL, 0,
exeInfo.subExecParamGpu, cfg.gfx.waveOrder, cfg.general.numSubIterations);
@@ -1958,12 +3168,12 @@ namespace {
// Determine timing for each of the individual transfers that were part of this launch
if (!cfg.gfx.useMultiStream) {
for (int i = 0; i < exeInfo.resources.size(); i++) {
TransferResources& rss = exeInfo.resources[i];
long long minStartCycle = std::numeric_limits<long long>::max();
long long maxStopCycle = std::numeric_limits<long long>::min();
std::set<std::pair<int, int>> CUs;
for (auto subExecIdx : rss.subExecIdx) {
minStartCycle = std::min(minStartCycle, exeInfo.subExecParamGpu[subExecIdx].startCycle);
maxStopCycle = std::max(maxStopCycle, exeInfo.subExecParamGpu[subExecIdx].stopCycle);
if (cfg.general.recordPerIteration) {
@@ -1973,10 +3183,10 @@ namespace {
}
double deltaMsec = (maxStopCycle - minStartCycle) / (double)(exeInfo.wallClockRate);
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration) {
rss.perIterMsec.push_back(deltaMsec);
rss.perIterCUs.push_back(CUs);
}
}
}
@@ -1984,7 +3194,6 @@ namespace {
return ERR_NONE;
}
// DMA Executor-related functions // DMA Executor-related functions
//======================================================================================== //========================================================================================
@@ -2020,7 +3229,7 @@ namespace {
// Use HSA async copy
do {
hsa_signal_store_screlease(resources.signal, 1);
if (!useSubIndices) {
ERR_CHECK(hsa_amd_memory_async_copy(resources.dstMem[0], resources.dstAgent,
resources.srcMem[0], resources.srcAgent,
resources.numBytes, 0, NULL,
@@ -2099,12 +3308,16 @@ namespace {
case EXE_CPU: return RunCpuExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
case EXE_GPU_GFX: return RunGpuExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
case EXE_GPU_DMA: return RunDmaExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
#ifdef NIC_EXEC_ENABLED
case EXE_NIC: return RunNicExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
#endif
default: return {ERR_FATAL, "Unsupported executor (%d)", exeDevice.exeType};
}
}
} // End of anonymous namespace
//========================================================================================
/// @endcond
ErrResult::ErrResult(ErrType err) : errType(err), errMsg("") {};
@@ -2174,16 +3387,17 @@ namespace {
std::map<ExeDevice, ExeInfo> executorMap;
for (int i = 0; i < transfers.size(); i++) {
Transfer const& t = transfers[i];
ExeDevice exeDevice;
ERR_APPEND(GetActualExecutor(cfg, t.exeDevice, exeDevice), errResults);
TransferResources resource = {};
resource.transferIdx = i;
ExeInfo& exeInfo = executorMap[exeDevice];
exeInfo.totalBytes += t.numBytes;
exeInfo.totalSubExecs += t.numSubExecs;
exeInfo.useSubIndices |= (t.exeSubIndex != -1 || (t.exeDevice.exeType == EXE_GPU_GFX && !cfg.gfx.prefXccTable.empty()));
exeInfo.resources.push_back(resource);
minNumSrcs = std::min(minNumSrcs, (int)t.srcs.size());
maxNumSrcs = std::max(maxNumSrcs, (int)t.srcs.size());
maxNumBytes = std::max(maxNumBytes, t.numBytes);
@@ -2314,8 +3528,8 @@ namespace {
results.tfrResults.resize(transfers.size());
results.numTimedIterations = numTimedIterations;
results.totalBytesTransferred = 0;
results.avgTotalDurationMsec = (totalCpuTimeSec * 1000.0) / (numTimedIterations * cfg.general.numSubIterations);
results.overheadMsec = results.avgTotalDurationMsec;
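// Clarifying note (added commentary): overheadMsec starts at the full average wall-clock time;
// the std::min in the loop below reduces it to avgTotalDurationMsec minus the largest
// per-executor avgDurationMsec, i.e. the time not accounted for by the slowest executor.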
for (auto& exeInfoPair : executorMap) {
ExeDevice const& exeDevice = exeInfoPair.first;
ExeInfo& exeInfo = exeInfoPair.second;
@@ -2323,25 +3537,32 @@ namespace {
// Copy over executor results
ExeResult& exeResult = results.exeResults[exeDevice];
exeResult.numBytes = exeInfo.totalBytes;
exeResult.avgDurationMsec = exeInfo.totalDurationMsec / (numTimedIterations * cfg.general.numSubIterations);
exeResult.avgBandwidthGbPerSec = (exeResult.numBytes / 1.0e6) / exeResult.avgDurationMsec;
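// Illustrative arithmetic (added commentary): 1.0e9 bytes completed in 10 ms gives
// (1.0e9 / 1.0e6) / 10 = 100 GB/s with the formula above.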
exeResult.sumBandwidthGbPerSec = 0.0;
exeResult.transferIdx.clear();
results.totalBytesTransferred += exeInfo.totalBytes;
results.overheadMsec = std::min(results.overheadMsec, (results.avgTotalDurationMsec -
exeResult.avgDurationMsec));
// Copy over transfer results
for (auto const& rss : exeInfo.resources) {
int const transferIdx = rss.transferIdx;
exeResult.transferIdx.push_back(transferIdx);
TransferResult& tfrResult = results.tfrResults[transferIdx];
tfrResult.exeDevice = exeDevice;
#ifdef NIC_EXEC_ENABLED
tfrResult.exeDstDevice = {exeDevice.exeType, rss.dstNicIndex};
#else
tfrResult.exeDstDevice = exeDevice;
#endif
tfrResult.numBytes = rss.numBytes;
tfrResult.avgDurationMsec = rss.totalDurationMsec / numTimedIterations;
tfrResult.avgBandwidthGbPerSec = (rss.numBytes / 1.0e6) / tfrResult.avgDurationMsec;
if (cfg.general.recordPerIteration) {
tfrResult.perIterMsec = rss.perIterMsec;
tfrResult.perIterCUs = rss.perIterCUs;
}
exeResult.sumBandwidthGbPerSec += tfrResult.avgBandwidthGbPerSec;
}
@@ -2382,7 +3603,7 @@ namespace {
{
// Replace any round brackets, '->' arrows, or ':' separators with spaces
for (int i = 1; line[i]; i++)
if (line[i] == '(' || line[i] == ')' || line[i] == '-' || line[i] == ':' || line[i] == '>') line[i] = ' ';
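// Illustrative example (token format assumed from the parsing below): a line such as
// "1 (G0->D1->G1)" becomes "1  G0  D1  G1 ", which the stream extraction below then
// reads as the SRC EXE DST fields.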
transfers.clear();
@@ -2416,17 +3637,18 @@ namespace {
transfer.numSubExecs = numSubExecs;
if (iss.fail()) {
return {ERR_FATAL,
"Parsing error: Unable to read valid Transfer %d (SRC EXE DST) triplet", i+1};
}
transfer.numBytes = 0;
} else {
iss >> srcStr >> exeStr >> dstStr >> transfer.numSubExecs >> numBytesToken;
if (iss.fail()) {
return {ERR_FATAL,
"Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple", i+1};
}
if (sscanf(numBytesToken.c_str(), "%lu", &transfer.numBytes) != 1) {
return {ERR_FATAL,
"Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple", i+1};
}
char units = numBytesToken.back();
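// Illustrative example (added commentary): a token such as "64M" is read by sscanf as
// numBytes = 64 and units becomes 'M'; the size multiplier is applied by code outside
// this hunk.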
@@ -2440,7 +3662,6 @@ namespace {
ERR_CHECK(ParseMemType(srcStr, transfer.srcs));
ERR_CHECK(ParseMemType(dstStr, transfer.dsts));
ERR_CHECK(ParseExeType(exeStr, transfer.exeDevice, transfer.exeSubIndex));
transfers.push_back(transfer);
}
return ERR_NONE;
@@ -2458,6 +3679,12 @@ namespace {
if (status != hipSuccess) numDetectedGpus = 0;
return numDetectedGpus;
}
#ifdef NIC_EXEC_ENABLED
case EXE_NIC: case EXE_NIC_NEAREST:
{
return GetIbvDeviceList().size();
}
#endif
default:
return 0;
}
@@ -2564,6 +3791,102 @@ namespace {
#endif
}
int GetClosestCpuNumaToNic(int nicIndex)
{
#ifdef NIC_EXEC_ENABLED
int numNics = GetNumExecutors(EXE_NIC);
if (nicIndex < 0 || nicIndex >= numNics) return -1;
return GetIbvDeviceList()[nicIndex].numaNode;
#else
return -1;
#endif
}
int GetClosestNicToGpu(int gpuIndex)
{
#ifdef NIC_EXEC_ENABLED
static bool isInitialized = false;
static std::vector<int> closestNicId;
int numGpus = GetNumExecutors(EXE_GPU_GFX);
if (gpuIndex < 0 || gpuIndex >= numGpus) return -1;
// Build closest NICs per GPU on first use
if (!isInitialized) {
closestNicId.resize(numGpus, -1);
// Build up list of NIC bus addresses
std::vector<std::string> ibvAddressList;
auto const& ibvDeviceList = GetIbvDeviceList();
for (auto const& ibvDevice : ibvDeviceList)
ibvAddressList.push_back(ibvDevice.hasActivePort ? ibvDevice.busId : "");
// Track how many times a device has been assigned as "closest"
// This allows distributing work across devices that use multiple ports (sharing the same busID)
// NOTE: This isn't necessarily optimal, but likely to work in most cases involving multi-port
// Counter example:
//
// G0 prefers (N0,N1), picks N0
// G1 prefers (N1,N2), picks N1
// G2 prefers N0, picks N0
//
// instead of G0->N1, G1->N2, G2->N0
std::vector<int> assignedCount(ibvDeviceList.size(), 0);
// Loop over each GPU to find the closest NIC(s) based on PCIe address
for (int i = 0; i < numGpus; i++) {
// Collect PCIe address for the GPU
char hipPciBusId[64];
hipError_t err = hipDeviceGetPCIBusId(hipPciBusId, sizeof(hipPciBusId), i);
if (err != hipSuccess) {
#ifdef VERBS_DEBUG
printf("Failed to get PCI Bus ID for HIP device %d: %s\n", i, hipGetErrorString(err));
#endif
closestNicId[i] = -1;
continue;
}
// Find closest NICs
std::set<int> closestNicIdxs = GetNearestDevicesInTree(hipPciBusId, ibvAddressList);
// Pick the least-used NIC to assign as closest
int closestIdx = -1;
for (auto idx : closestNicIdxs) {
if (closestIdx == -1 || assignedCount[idx] < assignedCount[closestIdx])
closestIdx = idx;
}
// Fall back to using the distance between bus IDs to determine the closest
// NIC to the GPU only if the PCIe tree approach fails
if (closestIdx < 0) {
#ifdef VERBS_DEBUG
printf("[WARN] Falling back to PCIe bus ID distance to determine proximity\n");
#endif
int minDistance = std::numeric_limits<int>::max();
for (int j = 0; j < ibvDeviceList.size(); j++) {
if (ibvDeviceList[j].busId != "") {
int distance = GetBusIdDistance(hipPciBusId, ibvDeviceList[j].busId);
if (distance < minDistance && distance >= 0) {
minDistance = distance;
closestIdx = j;
}
}
}
}
closestNicId[i] = closestIdx;
if (closestIdx != -1) assignedCount[closestIdx]++;
}
isInitialized = true;
}
return closestNicId[gpuIndex];
#else
return -1;
#endif
}
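// Illustrative usage sketch (not part of the header itself): a caller wanting to pair each GPU
// with nearby hardware could combine the two helpers above, e.g.
//
//   int nicIndex  = GetClosestNicToGpu(gpuIndex);                         // -1 if unavailable
//   int numaIndex = (nicIndex >= 0) ? GetClosestCpuNumaToNic(nicIndex) : -1;
//
// and use nicIndex / numaIndex when selecting an EXE_NIC executor or a CPU NUMA node.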
// Undefine CUDA compatibility macros
#if defined(__NVCC__)