Unverified Commit cc0e9cb4 authored by gilbertlee-amd, committed by GitHub

TransferBench v1.11 (#9)

* Adding MIMO support, DMA executor, Null memory type
parent 3b47b874
# Changelog for TransferBench
## v1.11
### Added
- New multi-input / multi-output (MIMO) support. Transfers can now reduce (element-wise summation) multiple input memory arrays and write the sums to multiple outputs (see the sketch after this list)
- New GPU-DMA executor 'D' (uses hipMemcpy for SDMA copies). Previously this was done via USE_HIP_CALL, which applied to all GPU executors globally; GPU-GFX kernels can now run in parallel with GPU-DMA copies
  - The GPU-DMA executor can only be used for single-input / single-output Transfers
  - The GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only Transfers
- Added new GPU_KERNEL environment variable that allows for switching between various GPU-GFX reduction kernels
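
The MIMO reduction semantics described above can be illustrated with a minimal sketch (the function below is illustrative only and is not part of TransferBench):

```cpp
#include <cstddef>

// Element-wise reduce numSrcs input arrays and write the sums to numDsts outputs.
// With the Null memory type 'N', either side may be empty, giving read-only or
// write-only Transfers.
void ReduceMIMO(float* const* src, int numSrcs,
                float* const* dst, int numDsts, std::size_t N)
{
  for (std::size_t j = 0; j < N; ++j)
  {
    float sum = 0.0f;
    for (int i = 0; i < numSrcs; ++i) sum += src[i][j];  // reduce all inputs
    for (int i = 0; i < numDsts; ++i) dst[i][j] = sum;   // broadcast to all outputs
  }
}
```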
### Optimized
- Slightly improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs
### Changed
- Updated the example.cfg file to cover the new features
- Updated output to support MIMO
- Changed CU / CPU-thread naming to SubExecutors for consistency
- Sweep Preset:
  - Default sweep preset executors now include the GPU-DMA executor
- P2P Benchmarks:
  - Now run only via "p2p". Removed "p2p_rr", "g2g" and "g2g_rr"
  - Setting NUM_CPU_DEVICES=0 can be used to benchmark only GPU devices (like "g2g")
  - New environment variable USE_REMOTE_READ replaces the "_rr" presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with the GPU-DMA executor
  - The number of GPU SubExecutors for the benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - The number of CPU SubExecutors for the benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted so that each input array within the same Transfer gets a different pattern (see the sketch after this list)
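
For reference, the default (no FILL_PATTERN) initialization reported by the env-var summaries in this change follows `(Element i = i modulo 383 + 31) * (InputIdx + 1)`; a hedged sketch of that formula (the function name is illustrative only):

```cpp
#include <cstddef>

// Default pseudo-random fill: each input array of a Transfer gets its own
// variant of the pattern, so inputs stay distinguishable after reduction.
void FillDefaultPattern(float* src, std::size_t N, int inputIdx)
{
  for (std::size_t i = 0; i < N; ++i)
    src[i] = float(i % 383 + 31) * float(inputIdx + 1);
}
```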
### Removed
- USE_HIP_CALL has been removed. Use the GPU-DMA executor 'D', or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently a warning is issued if USE_HIP_CALL is set and the program terminates
- Removed NUM_CPU_PER_TRANSFER - The number of CPU SubExecutors will be whatever is specified for the Transfer
- Removed USE_MEMSET environment variable. This can now be done via a Transfer using the null memory type
## v1.10
### Fixed
- Fix incorrect bandwidth calculation when using single stream mode and per-Transfer data sizes
......
/*
Copyright (c) 2021-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2021-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -26,9 +26,12 @@ THE SOFTWARE.
#include <algorithm>
#include <random>
#include <time.h>
#define TB_VERSION "1.10"
#include "Kernels.hpp"
#define TB_VERSION "1.11"
extern char const MemTypeStr[];
extern char const ExeTypeStr[];
enum ConfigModeEnum
{
......@@ -45,10 +48,13 @@ public:
int const DEFAULT_NUM_WARMUPS = 1;
int const DEFAULT_NUM_ITERATIONS = 10;
int const DEFAULT_SAMPLING_FACTOR = 1;
int const DEFAULT_NUM_CPU_PER_TRANSFER = 4;
// Peer-to-peer Benchmark preset defaults
int const DEFAULT_P2P_NUM_CPU_SE = 4;
// Sweep-preset defaults
std::string const DEFAULT_SWEEP_SRC = "CG";
std::string const DEFAULT_SWEEP_EXE = "CG";
std::string const DEFAULT_SWEEP_EXE = "CDG";
std::string const DEFAULT_SWEEP_DST = "CG";
int const DEFAULT_SWEEP_MIN = 1;
int const DEFAULT_SWEEP_MAX = 24;
......@@ -59,21 +65,24 @@ public:
int blockBytes; // Each CU, except the last, gets a multiple of this many bytes to copy
int byteOffset; // Byte-offset for memory allocations
int numCpuDevices; // Number of CPU devices to use (defaults to # NUMA nodes detected)
int numCpuPerTransfer; // Number of CPU child threads to use per CPU Transfer
int numGpuDevices; // Number of GPU devices to use (defaults to # HIP devices detected)
int numIterations; // Number of timed iterations to perform. If negative, run for -numIterations seconds instead
int numWarmups; // Number of un-timed warmup iterations to perform
int outputToCsv; // Output in CSV format
int samplingFactor; // Affects how many different values of N are generated (when N set to 0)
int sharedMemBytes; // Amount of shared memory to use per threadblock
int useHipCall; // Use hipMemcpy/hipMemset instead of custom shader kernels
int useInteractive; // Pause for user-input before starting transfer loop
int useMemset; // Perform a memset instead of a copy (ignores source memory)
int usePcieIndexing; // Base GPU indexing on PCIe address instead of HIP device
int useSingleStream; // Use a single stream per device instead of per Tink. Can not be used with USE_HIP_CALL
int useSingleStream; // Use a single stream per GPU GFX executor instead of stream per Transfer
std::vector<float> fillPattern; // Pattern of floats used to fill source data
// Environment variables only for Benchmark-preset
int useRemoteRead; // Use destination memory type as executor instead of source memory type
int useDmaCopy; // Use DMA copy instead of GPU copy
int numGpuSubExecs; // Number of GPU subexecutors to use
int numCpuSubExecs; // Number of CPU subexecutors to use
// Environment variables only for Sweep-preset
int sweepMin; // Min number of simultaneous Transfers to be executed per test
int sweepMax; // Max number of simultaneous Transfers to be executed per test
......@@ -87,6 +96,10 @@ public:
std::string sweepExe; // Set of executors to be swept
std::string sweepDst; // Set of dst memory types to be swept
// Developer features
int enableDebug; // Enable debug output
int gpuKernel; // Which GPU kernel to use
// Used to track current configuration mode
ConfigModeEnum configMode;
......@@ -100,29 +113,48 @@ public:
EnvVars()
{
int maxSharedMemBytes = 0;
hipDeviceGetAttribute(&maxSharedMemBytes,
hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, 0);
HIP_CALL(hipDeviceGetAttribute(&maxSharedMemBytes,
hipDeviceAttributeMaxSharedMemoryPerMultiprocessor, 0));
int numDeviceCUs = 0;
HIP_CALL(hipDeviceGetAttribute(&numDeviceCUs, hipDeviceAttributeMultiprocessorCount, 0));
int numDetectedCpus = numa_num_configured_nodes();
int numDetectedGpus;
hipGetDeviceCount(&numDetectedGpus);
HIP_CALL(hipGetDeviceCount(&numDetectedGpus));
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, 0));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
// Different hardware picks different default GPU kernels
// This performance difference is generally only noticeable when executing on fewer CUs
int defaultGpuKernel = 0;
if (archName == "gfx906") defaultGpuKernel = 13;
else if (archName == "gfx90a") defaultGpuKernel = 9;
blockBytes = GetEnvVar("BLOCK_BYTES" , 256);
byteOffset = GetEnvVar("BYTE_OFFSET" , 0);
numCpuDevices = GetEnvVar("NUM_CPU_DEVICES" , numDetectedCpus);
numCpuPerTransfer = GetEnvVar("NUM_CPU_PER_TRANSFER", DEFAULT_NUM_CPU_PER_TRANSFER);
numGpuDevices = GetEnvVar("NUM_GPU_DEVICES" , numDetectedGpus);
numIterations = GetEnvVar("NUM_ITERATIONS" , DEFAULT_NUM_ITERATIONS);
numWarmups = GetEnvVar("NUM_WARMUPS" , DEFAULT_NUM_WARMUPS);
outputToCsv = GetEnvVar("OUTPUT_TO_CSV" , 0);
samplingFactor = GetEnvVar("SAMPLING_FACTOR" , DEFAULT_SAMPLING_FACTOR);
sharedMemBytes = GetEnvVar("SHARED_MEM_BYTES" , maxSharedMemBytes / 2 + 1);
useHipCall = GetEnvVar("USE_HIP_CALL" , 0);
useInteractive = GetEnvVar("USE_INTERACTIVE" , 0);
useMemset = GetEnvVar("USE_MEMSET" , 0);
usePcieIndexing = GetEnvVar("USE_PCIE_INDEX" , 0);
useSingleStream = GetEnvVar("USE_SINGLE_STREAM" , 0);
enableDebug = GetEnvVar("DEBUG" , 0);
gpuKernel = GetEnvVar("GPU_KERNEL" , defaultGpuKernel);
// P2P Benchmark related
useRemoteRead = GetEnvVar("USE_REMOTE_READ" , 0);
useDmaCopy = GetEnvVar("USE_GPU_DMA" , 0);
numGpuSubExecs = GetEnvVar("NUM_GPU_SE" , useDmaCopy ? 1 : numDeviceCUs);
numCpuSubExecs = GetEnvVar("NUM_CPU_SE" , DEFAULT_P2P_NUM_CPU_SE);
// Sweep related
sweepMin = GetEnvVar("SWEEP_MIN" , DEFAULT_SWEEP_MIN);
sweepMax = GetEnvVar("SWEEP_MAX" , DEFAULT_SWEEP_MAX);
sweepSrc = GetEnvVar("SWEEP_SRC" , DEFAULT_SWEEP_SRC);
......@@ -135,7 +167,6 @@ public:
sweepRandBytes = GetEnvVar("SWEEP_RAND_BYTES" , 0);
// Determine random seed
char *sweepSeedStr = getenv("SWEEP_SEED");
sweepSeed = (sweepSeedStr != NULL ? atoi(sweepSeedStr) : time(NULL));
generator = new std::default_random_engine(sweepSeed);
......@@ -224,11 +255,6 @@ public:
printf("[ERROR] SAMPLING_FACTOR must be greater or equal to 1\n");
exit(1);
}
if (numCpuPerTransfer < 1)
{
printf("[ERROR] NUM_CPU_PER_TRANSFER must be greater or equal to 1\n");
exit(1);
}
if (sharedMemBytes < 0 || sharedMemBytes > maxSharedMemBytes)
{
printf("[ERROR] SHARED_MEM_BYTES must be between 0 and %d\n", maxSharedMemBytes);
......@@ -239,9 +265,16 @@ public:
printf("[ERROR] BLOCK_BYTES must be a positive multiple of 4\n");
exit(1);
}
if (useSingleStream && useHipCall)
if (numGpuSubExecs <= 0)
{
printf("[ERROR] NUM_GPU_SE must be greater than 0\n");
exit(1);
}
if (numCpuSubExecs <= 0)
{
printf("[ERROR] Single stream mode cannot be used with HIP calls\n");
printf("[ERROR] NUM_CPU_SE must be greater than 0\n");
exit(1);
}
......@@ -273,10 +306,9 @@ public:
}
}
char const* permittedExecutors = "CG";
for (auto ch : sweepExe)
{
if (!strchr(permittedExecutors, ch))
if (!strchr(ExeTypeStr, ch))
{
printf("[ERROR] Unrecognized executor type '%c' specified for sweep executor\n", ch);
exit(1);
......@@ -287,12 +319,30 @@ public:
exit(1);
}
}
if (gpuKernel < 0 || gpuKernel > NUM_GPU_KERNELS)
{
printf("[ERROR] GPU kernel must be between 0 and %d\n", NUM_GPU_KERNELS);
exit(1);
}
// Determine how many CPUs exist per NUMA node (to avoid executing on NUMA nodes without CPUs)
numCpusPerNuma.resize(numDetectedCpus);
int const totalCpus = numa_num_configured_cpus();
for (int i = 0; i < totalCpus; i++)
numCpusPerNuma[numa_node_of_cpu(i)]++;
// Check for deprecated env vars
if (getenv("USE_HIP_CALL"))
{
printf("[WARN] USE_HIP_CALL has been deprecated. Please use DMA executor 'D' or set USE_GPU_DMA for P2P-Benchmark preset\n");
exit(1);
}
char* enableSdma = getenv("HSA_ENABLE_SDMA");
if (enableSdma && !strcmp(enableSdma, "0"))
{
printf("[WARN] DMA functionality disabled due to environment variable HSA_ENABLE_SDMA=0. Copies will fallback to blit kernels\n");
}
}
// Display info on the env vars that can be used
......@@ -304,18 +354,15 @@ public:
printf(" BYTE_OFFSET - Initial byte-offset for memory allocations. Must be multiple of 4. Defaults to 0\n");
printf(" FILL_PATTERN=STR - Fill input buffer with pattern specified in hex digits (0-9,a-f,A-F). Must be even number of digits, (byte-level big-endian)\n");
printf(" NUM_CPU_DEVICES=X - Restrict number of CPUs to X. May not be greater than # detected NUMA nodes\n");
printf(" NUM_CPU_PER_TRANSFER=C - Use C threads per Transfer for CPU-executed copies\n");
printf(" NUM_GPU_DEVICES=X - Restrict number of GCPUs to X. May not be greater than # detected HIP devices\n");
printf(" NUM_GPU_DEVICES=X - Restrict number of GPUs to X. May not be greater than # detected HIP devices\n");
printf(" NUM_ITERATIONS=I - Perform I timed iteration(s) per test\n");
printf(" NUM_WARMUPS=W - Perform W untimed warmup iteration(s) per test\n");
printf(" OUTPUT_TO_CSV - Outputs to CSV format if set\n");
printf(" SAMPLING_FACTOR=F - Add F samples (when possible) between powers of 2 when auto-generating data sizes\n");
printf(" SHARED_MEM_BYTES=X - Use X shared mem bytes per threadblock, potentially to avoid multiple threadblocks per CU\n");
printf(" USE_HIP_CALL - Use hipMemcpy/hipMemset instead of custom shader kernels for GPU-executed copies\n");
printf(" USE_INTERACTIVE - Pause for user-input before starting transfer loop\n");
printf(" USE_MEMSET - Perform a memset instead of a copy (ignores source memory)\n");
printf(" USE_PCIE_INDEX - Index GPUs by PCIe address-ordering instead of HIP-provided indexing\n");
printf(" USE_SINGLE_STREAM - Use single stream per device instead of per Transfer. Cannot be used with USE_HIP_CALL\n");
printf(" USE_SINGLE_STREAM - Use a single stream per GPU GFX executor instead of stream per Transfer\n");
}
// Display env var settings
......@@ -331,10 +378,10 @@ public:
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31)");
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("%-20s = %12d : Using GPU kernel %d [%s]\n" , "GPU_KERNEL", gpuKernel, gpuKernel, GpuKernelNames[gpuKernel].c_str());
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d CPU thread(s) per CPU-executed Transfer\n", "NUM_CPU_PER_TRANSFER", numCpuPerTransfer, numCpuPerTransfer);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Running %d %s per Test\n", "NUM_ITERATIONS", numIterations,
numIterations > 0 ? numIterations : -numIterations,
......@@ -344,18 +391,8 @@ public:
outputToCsv ? "CSV" : "console");
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Using %s for GPU-executed copies\n", "USE_HIP_CALL", useHipCall,
useHipCall ? "HIP functions" : "custom kernels");
if (useHipCall && !useMemset)
{
char* env = getenv("HSA_ENABLE_SDMA");
printf("%-20s = %12s : %s\n", "HSA_ENABLE_SDMA", env,
(env && !strcmp(env, "0")) ? "Using blit kernels for hipMemcpy" : "Using DMA copy engines");
}
printf("%-20s = %12d : Running in %s mode\n", "USE_INTERACTIVE", useInteractive,
useInteractive ? "interactive" : "non-interactive");
printf("%-20s = %12d : Performing %s\n", "USE_MEMSET", useMemset,
useMemset ? "memset" : "memcopy");
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("%-20s = %12d : Using single stream per %s\n", "USE_SINGLE_STREAM",
......@@ -371,23 +408,82 @@ public:
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31)");
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_CPU_PER_TRANSFER,%d,Using %d CPU thread(s) per CPU-executed Transfer\n", numCpuPerTransfer, numCpuPerTransfer);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("NUM_ITERATIONS,%d,Running %d %s per Test\n", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_HIP_CALL,%d,Using %s for GPU-executed copies\n", useHipCall, useHipCall ? "HIP functions" : "custom kernels");
printf("USE_MEMSET,%d,Performing %s\n", useMemset, useMemset ? "memset" : "memcopy");
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
}
};
// Display env var for P2P Benchmark preset
void DisplayP2PBenchmarkEnvVars() const
{
if (!outputToCsv)
{
printf("Peer-to-peer Benchmark configuration (TransferBench v%s)\n", TB_VERSION);
printf("=====================================================\n");
printf("%-20s = %12d : Using %s as executor\n", "USE_REMOTE_READ", useRemoteRead , useRemoteRead ? "DST" : "SRC");
printf("%-20s = %12d : Using GPU-%s as GPU executor\n", "USE_GPU_DMA" , useDmaCopy , useDmaCopy ? "DMA" : "GFX");
printf("%-20s = %12d : Using %d CPU subexecutors\n", "NUM_CPU_SE" , numCpuSubExecs, numCpuSubExecs);
printf("%-20s = %12d : Using %d GPU subexecutors\n", "NUM_GPU_SE" , numGpuSubExecs, numGpuSubExecs);
printf("%-20s = %12d : Each CU gets a multiple of %d bytes to copy\n", "BLOCK_BYTES", blockBytes, blockBytes);
printf("%-20s = %12d : Using byte offset of %d\n", "BYTE_OFFSET", byteOffset, byteOffset);
printf("%-20s = %12s : ", "FILL_PATTERN", getenv("FILL_PATTERN") ? "(specified)" : "(unset)");
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Running %d %s per Test\n", "NUM_ITERATIONS", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("%-20s = %12d : Running %d warmup iteration(s) per Test\n", "NUM_WARMUPS", numWarmups, numWarmups);
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Running in %s mode\n", "USE_INTERACTIVE", useInteractive,
useInteractive ? "interactive" : "non-interactive");
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("\n");
}
else
{
printf("EnvVar,Value,Description,(TransferBench v%s)\n", TB_VERSION);
printf("USE_REMOTE_READ,%d,Using %s as executor\n", useRemoteRead, useRemoteRead ? "DST" : "SRC");
printf("USE_GPU_DMA,%d,Using GPU-%s as GPU executor\n", useDmaCopy , useDmaCopy ? "DMA" : "GFX");
printf("NUM_CPU_SE,%d,Using %d CPU subexecutors\n", numCpuSubExecs, numCpuSubExecs);
printf("NUM_GPU_SE,%d,Using %d GPU subexecutors\n", numGpuSubExecs, numGpuSubExecs);
printf("BLOCK_BYTES,%d,Each CU gets a multiple of %d bytes to copy\n", blockBytes, blockBytes);
printf("BYTE_OFFSET,%d,Using byte offset of %d\n", byteOffset, byteOffset);
printf("FILL_PATTERN,%s,", getenv("FILL_PATTERN") ? "(specified)" : "(unset)");
if (fillPattern.size())
printf("Pattern: %s", getenv("FILL_PATTERN"));
else
printf("Pseudo-random: (Element i = i modulo 383 + 31) * (InputIdx + 1)");
printf("\n");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("NUM_ITERATIONS,%d,Running %d %s per Test\n", numIterations,
numIterations > 0 ? numIterations : -numIterations,
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
printf("\n");
}
}
// Display env var settings
void DisplaySweepEnvVars() const
{
......@@ -407,7 +503,6 @@ public:
printf("%-20s = %12d : Max number of XGMI hops for Transfers (-1 = no limit)\n", "SWEEP_XGMI_MAX", sweepXgmiMax);
printf("%-20s = %12d : Using %s number of bytes per Transfer\n", "SWEEP_RAND_BYTES", sweepRandBytes, sweepRandBytes ? "random" : "constant");
printf("%-20s = %12d : Using %d CPU devices\n" , "NUM_CPU_DEVICES", numCpuDevices, numCpuDevices);
printf("%-20s = %12d : Using %d CPU thread(s) per CPU-executed Transfer\n", "NUM_CPU_PER_TRANSFER", numCpuPerTransfer, numCpuPerTransfer);
printf("%-20s = %12d : Using %d GPU devices\n", "NUM_GPU_DEVICES", numGpuDevices, numGpuDevices);
printf("%-20s = %12d : Each CU gets a multiple of %d bytes to copy\n", "BLOCK_BYTES", blockBytes, blockBytes);
printf("%-20s = %12d : Using byte offset of %d\n", "BYTE_OFFSET", byteOffset, byteOffset);
......@@ -425,14 +520,6 @@ public:
outputToCsv ? "CSV" : "console");
printf("%-20s = %12s : Using %d shared mem per threadblock\n", "SHARED_MEM_BYTES",
getenv("SHARED_MEM_BYTES") ? "(specified)" : "(unset)", sharedMemBytes);
printf("%-20s = %12d : Using %s for GPU-executed copies\n", "USE_HIP_CALL", useHipCall,
useHipCall ? "HIP functions" : "custom kernels");
if (useHipCall && !useMemset)
{
char* env = getenv("HSA_ENABLE_SDMA");
printf("%-20s = %12s : %s\n", "HSA_ENABLE_SDMA", env,
(env && !strcmp(env, "0")) ? "Using blit kernels for hipMemcpy" : "Using DMA copy engines");
}
printf("%-20s = %12d : Using %s-based GPU indexing\n", "USE_PCIE_INDEX",
usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("%-20s = %12d : Using single stream per %s\n", "USE_SINGLE_STREAM",
......@@ -454,7 +541,6 @@ public:
printf("SWEEP_XGMI_MAX,%d,Max number of XGMI hops for Transfers (-1 = no limit)\n", sweepXgmiMax);
printf("SWEEP_RAND_BYTES,%d,Using %s number of bytes per Transfer\n", sweepRandBytes, sweepRandBytes ? "random" : "constant");
printf("NUM_CPU_DEVICES,%d,Using %d CPU devices\n" , numCpuDevices, numCpuDevices);
printf("NUM_CPU_PER_TRANSFER,%d,Using %d CPU thread(s) per CPU-executed Transfer\n", numCpuPerTransfer, numCpuPerTransfer);
printf("NUM_GPU_DEVICES,%d,Using %d GPU devices\n", numGpuDevices, numGpuDevices);
printf("BLOCK_BYTES,%d,Each CU gets a multiple of %d bytes to copy\n", blockBytes, blockBytes);
printf("BYTE_OFFSET,%d,Using byte offset of %d\n", byteOffset, byteOffset);
......@@ -469,7 +555,6 @@ public:
numIterations > 0 ? "timed iteration(s)" : "second(s)");
printf("NUM_WARMUPS,%d,Running %d warmup iteration(s) per Test\n", numWarmups, numWarmups);
printf("SHARED_MEM_BYTES,%d,Using %d shared mem per threadblock\n", sharedMemBytes, sharedMemBytes);
printf("USE_HIP_CALL,%d,Using %s for GPU-executed copies\n", useHipCall, useHipCall ? "HIP functions" : "custom kernels");
printf("USE_PCIE_INDEX,%d,Using %s-based GPU indexing\n", usePcieIndexing, (usePcieIndexing ? "PCIe" : "HIP"));
printf("USE_SINGLE_STREAM,%d,Using single stream per %s\n", useSingleStream, (useSingleStream ? "device" : "Transfer"));
}
......
/*
Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -22,66 +22,145 @@ THE SOFTWARE.
#pragma once
#define PackedFloat_t float4
#define WARP_SIZE 64
#define BLOCKSIZE 256
#define FLOATS_PER_PACK (sizeof(PackedFloat_t) / sizeof(float))
#define MEMSET_CHAR 75
#define MEMSET_VAL 13323083.0f
// GPU copy kernel
__global__ void __launch_bounds__(BLOCKSIZE)
GpuCopyKernel(BlockParam* blockParams)
// Each subExecutor is provided with subarrays to work on
#define MAX_SRCS 16
#define MAX_DSTS 16
struct SubExecParam
{
#define PackedFloat_t float4
#define FLOATS_PER_PACK (sizeof(PackedFloat_t) / sizeof(float))
size_t N; // Number of floats this subExecutor works on
int numSrcs; // Number of source arrays
int numDsts; // Number of destination arrays
float* src[MAX_SRCS]; // Source array pointers
float* dst[MAX_DSTS]; // Destination array pointers
long long startCycle; // Start timestamp for in-kernel timing (GPU-GFX executor)
long long stopCycle; // Stop timestamp for in-kernel timing (GPU-GFX executor)
};
// Collect the arguments for this threadblock
int Nrem = blockParams[blockIdx.x].N;
float const* src = blockParams[blockIdx.x].src;
float* dst = blockParams[blockIdx.x].dst;
if (threadIdx.x == 0) blockParams[blockIdx.x].startCycle = __builtin_amdgcn_s_memrealtime();
void CpuReduceKernel(SubExecParam const& p)
{
int const& numSrcs = p.numSrcs;
int const& numDsts = p.numDsts;
if (numSrcs == 0)
{
for (int i = 0; i < numDsts; ++i)
memset((float* __restrict__)p.dst[i], MEMSET_CHAR, p.N * sizeof(float));
}
else if (numSrcs == 1)
{
float const* __restrict__ src = p.src[0];
for (int i = 0; i < numDsts; ++i)
{
memcpy((float* __restrict__)p.dst[i], src, p.N * sizeof(float));
}
}
else
{
for (int j = 0; j < p.N; j++)
{
float sum = p.src[0][j];
for (int i = 1; i < numSrcs; i++) sum += p.src[i][j];
for (int i = 0; i < numDsts; i++) p.dst[i][j] = sum;
}
}
}
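// --- Illustrative only; not part of this commit ---
// A minimal sketch of filling a SubExecParam for a 2-input / 1-output reduction
// and running it through the CPU executor path above. The buffers and the
// function name are hypothetical.
inline void ExampleCpuReduce()
{
  constexpr size_t N = 8;
  static float in0[N], in1[N], out[N];
  for (size_t j = 0; j < N; ++j) { in0[j] = float(j); in1[j] = 2.0f * float(j); }
  SubExecParam p = {};
  p.N       = N;
  p.numSrcs = 2;  p.src[0] = in0;  p.src[1] = in1;
  p.numDsts = 1;  p.dst[0] = out;
  CpuReduceKernel(p);  // afterwards out[j] == in0[j] + in1[j] == 3 * j
}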
// Helper function for memset
template <typename T> __device__ __forceinline__ T MemsetVal();
template <> __device__ __forceinline__ float MemsetVal(){ return MEMSET_VAL; };
template <> __device__ __forceinline__ float4 MemsetVal(){ return make_float4(MEMSET_VAL, MEMSET_VAL, MEMSET_VAL, MEMSET_VAL); }
// GPU copy kernel 0: 3 loops: unrolled float4s, then float4s, then individual floats
template <int LOOP1_UNROLL>
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel(SubExecParam* params)
{
int64_t startCycle = __builtin_amdgcn_s_memrealtime();
// Operate on wavefront granularity
int numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
int waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
SubExecParam& p = params[blockIdx.x];
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
int const numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
int const waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int const threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
#define LOOP1_UNROLL 8
// 1st loop - each wavefront operates on LOOP1_UNROLL x FLOATS_PER_PACK per thread per iteration
// Determine the number of packed floats processed by the first loop
int const loop1Npack = (Nrem / (FLOATS_PER_PACK * LOOP1_UNROLL * WARP_SIZE)) * (LOOP1_UNROLL * WARP_SIZE);
int const loop1Nelem = loop1Npack * FLOATS_PER_PACK;
int const loop1Inc = BLOCKSIZE * LOOP1_UNROLL;
int loop1Offset = waveId * LOOP1_UNROLL * WARP_SIZE + threadId;
size_t Nrem = p.N;
size_t const loop1Npack = (Nrem / (FLOATS_PER_PACK * LOOP1_UNROLL * WARP_SIZE)) * (LOOP1_UNROLL * WARP_SIZE);
size_t const loop1Nelem = loop1Npack * FLOATS_PER_PACK;
size_t const loop1Inc = BLOCKSIZE * LOOP1_UNROLL;
size_t loop1Offset = waveId * LOOP1_UNROLL * WARP_SIZE + threadId;
PackedFloat_t const* packedSrc = (PackedFloat_t const*)(src) + loop1Offset;
PackedFloat_t* packedDst = (PackedFloat_t *)(dst) + loop1Offset;
while (loop1Offset < loop1Npack)
{
PackedFloat_t vals[LOOP1_UNROLL];
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u)
vals[u] = *(packedSrc + u * WARP_SIZE);
PackedFloat_t vals[LOOP1_UNROLL] = {};
if (numSrcs == 0)
{
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u) vals[u] = MemsetVal<float4>();
}
else
{
for (int i = 0; i < numSrcs; ++i)
{
PackedFloat_t const* __restrict__ packedSrc = (PackedFloat_t const*)(p.src[i]) + loop1Offset;
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u)
*(packedDst + u * WARP_SIZE) = vals[u];
vals[u] += *(packedSrc + u * WARP_SIZE);
}
}
packedSrc += loop1Inc;
packedDst += loop1Inc;
for (int i = 0; i < numDsts; ++i)
{
PackedFloat_t* __restrict__ packedDst = (PackedFloat_t*)(p.dst[i]) + loop1Offset;
#pragma unroll
for (int u = 0; u < LOOP1_UNROLL; ++u) *(packedDst + u * WARP_SIZE) = vals[u];
}
loop1Offset += loop1Inc;
}
Nrem -= loop1Nelem;
if (Nrem > 0)
{
// 2nd loop - Each thread operates on FLOATS_PER_PACK per iteration
int const loop2Npack = Nrem / FLOATS_PER_PACK;
int const loop2Nelem = loop2Npack * FLOATS_PER_PACK;
int const loop2Inc = BLOCKSIZE;
int loop2Offset = threadIdx.x;
// NOTE: Using int32_t due to smaller size requirements
int32_t const loop2Npack = Nrem / FLOATS_PER_PACK;
int32_t const loop2Nelem = loop2Npack * FLOATS_PER_PACK;
int32_t const loop2Inc = BLOCKSIZE;
int32_t loop2Offset = threadIdx.x;
packedSrc = (PackedFloat_t const*)(src + loop1Nelem);
packedDst = (PackedFloat_t *)(dst + loop1Nelem);
while (loop2Offset < loop2Npack)
{
packedDst[loop2Offset] = packedSrc[loop2Offset];
PackedFloat_t val;
if (numSrcs == 0)
{
val = MemsetVal<float4>();
}
else
{
val = {};
for (int i = 0; i < numSrcs; ++i)
{
PackedFloat_t const* __restrict__ packedSrc = (PackedFloat_t const*)(p.src[i] + loop1Nelem) + loop2Offset;
val += *packedSrc;
}
}
for (int i = 0; i < numDsts; ++i)
{
PackedFloat_t* __restrict__ packedDst = (PackedFloat_t*)(p.dst[i] + loop1Nelem) + loop2Offset;
*packedDst = val;
}
loop2Offset += loop2Inc;
}
Nrem -= loop2Nelem;
......@@ -90,40 +169,221 @@ GpuCopyKernel(BlockParam* blockParams)
if (threadIdx.x < Nrem)
{
int offset = loop1Nelem + loop2Nelem + threadIdx.x;
dst[offset] = src[offset];
float val = 0;
if (numSrcs == 0)
{
val = MEMSET_VAL;
}
else
{
for (int i = 0; i < numSrcs; ++i)
val += ((float const* __restrict__)p.src[i])[offset];
}
for (int i = 0; i < numDsts; ++i)
((float* __restrict__)p.dst[i])[offset] = val;
}
}
__threadfence_system();
__syncthreads();
if (threadIdx.x == 0)
blockParams[blockIdx.x].stopCycle = __builtin_amdgcn_s_memrealtime();
{
p.startCycle = startCycle;
p.stopCycle = __builtin_amdgcn_s_memrealtime();
}
}
#define MEMSET_UNROLL 8
__global__ void __launch_bounds__(BLOCKSIZE)
GpuMemsetKernel(BlockParam* blockParams)
template <typename FLOAT_TYPE, int UNROLL_FACTOR>
__device__ size_t GpuReduceFuncImpl2(SubExecParam const &p, size_t const offset, size_t const N)
{
// Collect the arguments for this block
int N = blockParams[blockIdx.x].N;
float* __restrict__ dst = (float*)blockParams[blockIdx.x].dst;
int constexpr numFloatsPerPack = sizeof(FLOAT_TYPE) / sizeof(float); // Number of floats handled at a time per thread
int constexpr numWaves = BLOCKSIZE / WARP_SIZE; // Number of wavefronts per threadblock
size_t constexpr loopPackInc = BLOCKSIZE * UNROLL_FACTOR;
size_t constexpr numPacksPerWave = WARP_SIZE * UNROLL_FACTOR;
int const waveId = threadIdx.x / WARP_SIZE; // Wavefront number
int const threadId = threadIdx.x % WARP_SIZE; // Thread index within wavefront
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
size_t const numPacksDone = (numFloatsPerPack == 1 && UNROLL_FACTOR == 1) ? N : (N / (FLOATS_PER_PACK * numPacksPerWave)) * numPacksPerWave;
size_t const numFloatsLeft = N - numPacksDone * numFloatsPerPack;
size_t loopPackOffset = waveId * numPacksPerWave + threadId;
while (loopPackOffset < numPacksDone)
{
FLOAT_TYPE vals[UNROLL_FACTOR];
if (numSrcs == 0)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u) vals[u] = MemsetVal<FLOAT_TYPE>();
}
else
{
FLOAT_TYPE const* __restrict__ src0Ptr = ((FLOAT_TYPE const*)(p.src[0] + offset)) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
vals[u] = *(src0Ptr + u * WARP_SIZE);
// Use non-zero value
#pragma unroll MEMSET_UNROLL
for (int tid = threadIdx.x; tid < N; tid += BLOCKSIZE)
for (int i = 1; i < numSrcs; ++i)
{
dst[tid] = 1234.0;
FLOAT_TYPE const* __restrict__ srcPtr = ((FLOAT_TYPE const*)(p.src[i] + offset)) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
vals[u] += *(srcPtr + u * WARP_SIZE);
}
}
for (int i = 0; i < numDsts; ++i)
{
FLOAT_TYPE* __restrict__ dstPtr = (FLOAT_TYPE*)(p.dst[i + offset]) + loopPackOffset;
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
*(dstPtr + u * WARP_SIZE) = vals[u];
}
loopPackOffset += loopPackInc;
}
return numFloatsLeft;
}
// CPU copy kernel
void CpuCopyKernel(BlockParam const& blockParams)
template <typename FLOAT_TYPE, int UNROLL_FACTOR>
__device__ size_t GpuReduceFuncImpl(SubExecParam const &p, size_t const offset, size_t const N)
{
memcpy(blockParams.dst, blockParams.src, blockParams.N * sizeof(float));
// Each thread in the block works on UNROLL_FACTOR FLOAT_TYPEs during each iteration of the loop
int constexpr numFloatsPerRead = sizeof(FLOAT_TYPE) / sizeof(float);
size_t constexpr numFloatsPerInnerLoop = BLOCKSIZE * numFloatsPerRead;
size_t constexpr numFloatsPerOuterLoop = numFloatsPerInnerLoop * UNROLL_FACTOR;
size_t const numFloatsLeft = (numFloatsPerRead == 1 && UNROLL_FACTOR == 1) ? 0 : N % numFloatsPerOuterLoop;
size_t const numFloatsDone = N - numFloatsLeft;
int const numSrcs = p.numSrcs;
int const numDsts = p.numDsts;
for (size_t idx = threadIdx.x * numFloatsPerRead; idx < numFloatsDone; idx += numFloatsPerOuterLoop)
{
FLOAT_TYPE tmp[UNROLL_FACTOR];
if (numSrcs == 0)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] = MemsetVal<FLOAT_TYPE>();
}
else
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] = *((FLOAT_TYPE*)(&p.src[0][offset + idx + u * numFloatsPerInnerLoop]));
for (int i = 1; i < numSrcs; ++i)
{
#pragma unroll UNROLL_FACTOR
for (int u = 0; u < UNROLL_FACTOR; ++u)
tmp[u] += *((FLOAT_TYPE*)(&p.src[i][offset + idx + u * numFloatsPerInnerLoop]));
}
}
for (int i = 0; i < numDsts; ++i)
{
for (int u = 0; u < UNROLL_FACTOR; ++u)
{
*((FLOAT_TYPE*)(&p.dst[i][offset + idx + u * numFloatsPerInnerLoop])) = tmp[u];
}
}
}
return numFloatsLeft;
}
template <typename FLOAT_TYPE>
__device__ size_t GpuReduceFunc(SubExecParam const &p, size_t const offset, size_t const N, int const unroll)
{
switch (unroll)
{
case 1: return GpuReduceFuncImpl<FLOAT_TYPE, 1>(p, offset, N);
case 2: return GpuReduceFuncImpl<FLOAT_TYPE, 2>(p, offset, N);
case 3: return GpuReduceFuncImpl<FLOAT_TYPE, 3>(p, offset, N);
case 4: return GpuReduceFuncImpl<FLOAT_TYPE, 4>(p, offset, N);
case 5: return GpuReduceFuncImpl<FLOAT_TYPE, 5>(p, offset, N);
case 6: return GpuReduceFuncImpl<FLOAT_TYPE, 6>(p, offset, N);
case 7: return GpuReduceFuncImpl<FLOAT_TYPE, 7>(p, offset, N);
case 8: return GpuReduceFuncImpl<FLOAT_TYPE, 8>(p, offset, N);
case 9: return GpuReduceFuncImpl<FLOAT_TYPE, 9>(p, offset, N);
case 10: return GpuReduceFuncImpl<FLOAT_TYPE, 10>(p, offset, N);
case 11: return GpuReduceFuncImpl<FLOAT_TYPE, 11>(p, offset, N);
case 12: return GpuReduceFuncImpl<FLOAT_TYPE, 12>(p, offset, N);
case 13: return GpuReduceFuncImpl<FLOAT_TYPE, 13>(p, offset, N);
case 14: return GpuReduceFuncImpl<FLOAT_TYPE, 14>(p, offset, N);
case 15: return GpuReduceFuncImpl<FLOAT_TYPE, 15>(p, offset, N);
case 16: return GpuReduceFuncImpl<FLOAT_TYPE, 16>(p, offset, N);
default: return GpuReduceFuncImpl<FLOAT_TYPE, 1>(p, offset, N);
}
}
// CPU memset kernel
void CpuMemsetKernel(BlockParam const& blockParams)
// GPU copy kernel
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel2(SubExecParam* params)
{
for (int i = 0; i < blockParams.N; i++)
blockParams.dst[i] = 1234.0;
int64_t startCycle = __builtin_amdgcn_s_memrealtime();
SubExecParam& p = params[blockIdx.x];
size_t numFloatsLeft = GpuReduceFunc<float4>(p, 0, p.N, 8);
if (numFloatsLeft)
numFloatsLeft = GpuReduceFunc<float4>(p, p.N - numFloatsLeft, numFloatsLeft, 1);
if (numFloatsLeft)
GpuReduceFunc<float>(p, p.N - numFloatsLeft, numFloatsLeft, 1);
__threadfence_system();
if (threadIdx.x == 0)
{
p.startCycle = startCycle;
p.stopCycle = __builtin_amdgcn_s_memrealtime();
}
}
#define NUM_GPU_KERNELS 18
typedef void (*GpuKernelFuncPtr)(SubExecParam*);
GpuKernelFuncPtr GpuKernelTable[NUM_GPU_KERNELS] =
{
GpuReduceKernel<8>,
GpuReduceKernel<1>,
GpuReduceKernel<2>,
GpuReduceKernel<3>,
GpuReduceKernel<4>,
GpuReduceKernel<5>,
GpuReduceKernel<6>,
GpuReduceKernel<7>,
GpuReduceKernel<8>,
GpuReduceKernel<9>,
GpuReduceKernel<10>,
GpuReduceKernel<11>,
GpuReduceKernel<12>,
GpuReduceKernel<13>,
GpuReduceKernel<14>,
GpuReduceKernel<15>,
GpuReduceKernel<16>,
GpuReduceKernel2
};
std::string GpuKernelNames[NUM_GPU_KERNELS] =
{
"Default - 8xUnroll",
"Unroll x1",
"Unroll x2",
"Unroll x3",
"Unroll x4",
"Unroll x5",
"Unroll x6",
"Unroll x7",
"Unroll x8",
"Unroll x9",
"Unroll x10",
"Unroll x11",
"Unroll x12",
"Unroll x13",
"Unroll x14",
"Unroll x15",
"Unroll x16",
"8xUnrollB",
};
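// --- Illustrative only; not part of this commit ---
// A minimal sketch of how a GPU-GFX executor might launch the kernel selected by
// the GPU_KERNEL environment variable. Stream, grid sizing and the device-side
// parameter buffer are assumed to be prepared elsewhere; the function name and
// signature are hypothetical.
inline void LaunchSelectedGpuKernel(int gpuKernelIdx, int numSubExecs, int sharedMemBytes,
                                    hipStream_t stream, SubExecParam* subExecParamGpu)
{
  // One threadblock of BLOCKSIZE threads per SubExecutor; each block reads its
  // own SubExecParam entry via blockIdx.x inside the kernel.
  hipLaunchKernelGGL(GpuKernelTable[gpuKernelIdx],
                     dim3(numSubExecs), dim3(BLOCKSIZE),
                     sharedMemBytes, stream,
                     subExecParamGpu);
}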
Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
# Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
ROCM_PATH ?= /opt/rocm
HIPCC=$(ROCM_PATH)/bin/hipcc
EXE=TransferBench
CXXFLAGS = -O3 -I. -lnuma -L$(ROCM_PATH)/hsa/lib -lhsa-runtime64
CXXFLAGS = -O3 -I. -lnuma -L$(ROCM_PATH)/hsa/lib -lhsa-runtime64 -ferror-limit=5
all: $(EXE)
......
/*
Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......@@ -30,7 +30,6 @@ THE SOFTWARE.
#include "TransferBench.hpp"
#include "GetClosestNumaNode.hpp"
#include "Kernels.hpp"
int main(int argc, char **argv)
{
......@@ -76,30 +75,18 @@ int main(int argc, char **argv)
// - Tests that sweep across possible sets of Transfers
if (!strcmp(argv[1], "sweep") || !strcmp(argv[1], "rsweep"))
{
int numBlocksToUse = (argc > 3 ? atoi(argv[3]) : 4);
int numGpuSubExecs = (argc > 3 ? atoi(argv[3]) : 4);
int numCpuSubExecs = (argc > 4 ? atoi(argv[4]) : 4);
ev.configMode = CFG_SWEEP;
RunSweepPreset(ev, numBytesPerTransfer, numBlocksToUse, !strcmp(argv[1], "rsweep"));
RunSweepPreset(ev, numBytesPerTransfer, numGpuSubExecs, numCpuSubExecs, !strcmp(argv[1], "rsweep"));
exit(0);
}
// - Tests that benchmark peer-to-peer performance
else if (!strcmp(argv[1], "p2p") || !strcmp(argv[1], "p2p_rr") ||
!strcmp(argv[1], "g2g") || !strcmp(argv[1], "g2g_rr"))
else if (!strcmp(argv[1], "p2p"))
{
int numBlocksToUse = 0;
if (argc > 3)
numBlocksToUse = atoi(argv[3]);
else
HIP_CALL(hipDeviceGetAttribute(&numBlocksToUse, hipDeviceAttributeMultiprocessorCount, 0));
// Perform either local read (+remote write) [EXE = SRC] or
// remote read (+local write) [EXE = DST]
int readMode = (!strcmp(argv[1], "p2p_rr") || !strcmp(argv[1], "g2g_rr") ? 1 : 0);
int skipCpu = (!strcmp(argv[1], "g2g" ) || !strcmp(argv[1], "g2g_rr") ? 1 : 0);
// Execute peer to peer benchmark mode
ev.configMode = CFG_P2P;
RunPeerToPeerBenchmarks(ev, numBytesPerTransfer / sizeof(float), numBlocksToUse, readMode, skipCpu);
RunPeerToPeerBenchmarks(ev, numBytesPerTransfer / sizeof(float));
exit(0);
}
......@@ -116,8 +103,7 @@ int main(int argc, char **argv)
ev.DisplayEnvVars();
if (ev.outputToCsv)
{
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),"
"ExeToSrcLinkType,ExeToDstLinkType,SrcAddr,DstAddr\n");
printf("Test#,Transfer#,NumBytes,Src,Exe,Dst,CUs,BW(GB/s),Time(ms),SrcAddr,DstAddr\n");
}
int testNum = 0;
......@@ -170,71 +156,70 @@ void ExecuteTransfers(EnvVars const& ev,
TransferMap transferMap;
for (Transfer& transfer : transfers)
{
Executor executor(transfer.exeMemType, transfer.exeIndex);
Executor executor(transfer.exeType, transfer.exeIndex);
ExecutorInfo& executorInfo = transferMap[executor];
executorInfo.transfers.push_back(&transfer);
}
// Loop over each executor and prepare GPU resources
// Loop over each executor and prepare sub-executors
std::map<int, Transfer*> transferList;
for (auto& exeInfoPair : transferMap)
{
Executor const& executor = exeInfoPair.first;
ExecutorInfo& exeInfo = exeInfoPair.second;
ExeType const exeType = executor.first;
int const exeIndex = RemappedIndex(executor.second, IsCpuType(exeType));
exeInfo.totalTime = 0.0;
exeInfo.totalBlocks = 0;
exeInfo.totalSubExecs = 0;
// Loop over each transfer this executor is involved in
for (Transfer* transfer : exeInfo.transfers)
{
// Get some aliases to transfer variables
MemType const& exeMemType = transfer->exeMemType;
MemType const& srcMemType = transfer->srcMemType;
MemType const& dstMemType = transfer->dstMemType;
int const& blocksToUse = transfer->numBlocksToUse;
// Get potentially remapped device indices
int const srcIndex = RemappedIndex(transfer->srcIndex, srcMemType);
int const exeIndex = RemappedIndex(transfer->exeIndex, exeMemType);
int const dstIndex = RemappedIndex(transfer->dstIndex, dstMemType);
// Determine how many bytes to copy for this Transfer (use custom if pre-specified)
transfer->numBytesActual = (transfer->numBytes ? transfer->numBytes : N * sizeof(float));
// Enable peer-to-peer access if necessary (can only be called once per unique pair)
if (exeMemType == MEM_GPU)
// Allocate source memory
transfer->srcMem.resize(transfer->numSrcs);
for (int iSrc = 0; iSrc < transfer->numSrcs; ++iSrc)
{
MemType const& srcType = transfer->srcType[iSrc];
int const srcIndex = RemappedIndex(transfer->srcIndex[iSrc], IsCpuType(srcType));
// Ensure executing GPU can access source memory
if ((srcMemType == MEM_GPU || srcMemType == MEM_GPU_FINE) && srcIndex != exeIndex)
if (IsGpuType(exeType) && IsGpuType(srcType) && srcIndex != exeIndex)
EnablePeerAccess(exeIndex, srcIndex);
AllocateMemory(srcType, srcIndex, transfer->numBytesActual + ev.byteOffset, (void**)&transfer->srcMem[iSrc]);
}
// Allocate destination memory
transfer->dstMem.resize(transfer->numDsts);
for (int iDst = 0; iDst < transfer->numDsts; ++iDst)
{
MemType const& dstType = transfer->dstType[iDst];
int const dstIndex = RemappedIndex(transfer->dstIndex[iDst], IsCpuType(dstType));
// Ensure executing GPU can access destination memory
if ((dstMemType == MEM_GPU || dstMemType == MEM_GPU_FINE) && dstIndex != exeIndex)
if (IsGpuType(exeType) && IsGpuType(dstType) && dstIndex != exeIndex)
EnablePeerAccess(exeIndex, dstIndex);
}
// Allocate (maximum) source / destination memory based on type / device index
transfer->numBytesToCopy = (transfer->numBytes ? transfer->numBytes : N * sizeof(float));
AllocateMemory(srcMemType, srcIndex, transfer->numBytesToCopy + ev.byteOffset, (void**)&transfer->srcMem);
AllocateMemory(dstMemType, dstIndex, transfer->numBytesToCopy + ev.byteOffset, (void**)&transfer->dstMem);
AllocateMemory(dstType, dstIndex, transfer->numBytesActual + ev.byteOffset, (void**)&transfer->dstMem[iDst]);
}
transfer->blockParam.resize(exeMemType == MEM_CPU ? ev.numCpuPerTransfer : blocksToUse);
exeInfo.totalBlocks += transfer->blockParam.size();
exeInfo.totalSubExecs += transfer->numSubExecs;
transferList[transfer->transferIndex] = transfer;
}
// Prepare per-threadblock parameters for GPU executors
MemType const exeMemType = executor.first;
int const exeIndex = RemappedIndex(executor.second, exeMemType);
if (exeMemType == MEM_GPU)
// Prepare additional requirement for GPU-based executors
if (IsGpuType(exeType))
{
// Allocate one contiguous chunk of GPU memory for threadblock parameters
// This allows support for executing one transfer per stream, or all transfers in a single stream
AllocateMemory(exeMemType, exeIndex, exeInfo.totalBlocks * sizeof(BlockParam),
(void**)&exeInfo.blockParamGpu);
int const numTransfersToRun = ev.useSingleStream ? 1 : exeInfo.transfers.size();
exeInfo.streams.resize(numTransfersToRun);
exeInfo.startEvents.resize(numTransfersToRun);
exeInfo.stopEvents.resize(numTransfersToRun);
for (int i = 0; i < numTransfersToRun; ++i)
// Single-stream is only supported for GFX-based executors
int const numStreamsToUse = (exeType == EXE_GPU_DMA || !ev.useSingleStream) ? exeInfo.transfers.size() : 1;
exeInfo.streams.resize(numStreamsToUse);
exeInfo.startEvents.resize(numStreamsToUse);
exeInfo.stopEvents.resize(numStreamsToUse);
for (int i = 0; i < numStreamsToUse; ++i)
{
HIP_CALL(hipSetDevice(exeIndex));
HIP_CALL(hipStreamCreate(&exeInfo.streams[i]));
......@@ -242,12 +227,12 @@ void ExecuteTransfers(EnvVars const& ev,
HIP_CALL(hipEventCreate(&exeInfo.stopEvents[i]));
}
// Assign each transfer its portion of threadblock parameters
int transferOffset = 0;
for (int i = 0; i < exeInfo.transfers.size(); i++)
if (exeType == EXE_GPU_GFX)
{
exeInfo.transfers[i]->blockParamGpuPtr = exeInfo.blockParamGpu + transferOffset;
transferOffset += exeInfo.transfers[i]->blockParam.size();
// Allocate one contiguous chunk of GPU memory for threadblock parameters
// This allows support for executing one transfer per stream, or all transfers in a single stream
AllocateMemory(MEM_GPU, exeIndex, exeInfo.totalSubExecs * sizeof(SubExecParam),
(void**)&exeInfo.subExecParamGpu);
}
}
}
......@@ -265,17 +250,20 @@ void ExecuteTransfers(EnvVars const& ev,
{
// Prepare subarrays each threadblock works on and fill src memory with patterned data
Transfer* transfer = exeInfo.transfers[i];
transfer->PrepareBlockParams(ev, transfer->numBytesToCopy / sizeof(float));
exeInfo.totalBytes += transfer->numBytesToCopy;
transfer->PrepareSubExecParams(ev);
transfer->PrepareSrc(ev);
exeInfo.totalBytes += transfer->numBytesActual;
// Copy block parameters to GPU for GPU executors
if (transfer->exeMemType == MEM_GPU)
if (transfer->exeType == EXE_GPU_GFX)
{
HIP_CALL(hipMemcpy(&exeInfo.blockParamGpu[transferOffset],
transfer->blockParam.data(),
transfer->blockParam.size() * sizeof(BlockParam),
exeInfo.transfers[i]->subExecParamGpuPtr = exeInfo.subExecParamGpu + transferOffset;
HIP_CALL(hipMemcpy(&exeInfo.subExecParamGpu[transferOffset],
transfer->subExecParam.data(),
transfer->subExecParam.size() * sizeof(SubExecParam),
hipMemcpyHostToDevice));
transferOffset += transfer->blockParam.size();
transferOffset += transfer->subExecParam.size();
}
}
}
......@@ -296,7 +284,11 @@ void ExecuteTransfers(EnvVars const& ev,
for (Transfer& transfer : transfers)
{
printf("Transfer %03d: SRC: %p DST: %p\n", transfer.transferIndex, transfer.srcMem, transfer.dstMem);
printf("Transfer %03d:\n", transfer.transferIndex);
for (int iSrc = 0; iSrc < transfer.numSrcs; ++iSrc)
printf(" SRC %0d: %p\n", iSrc, transfer.srcMem[iSrc]);
for (int iDst = 0; iDst < transfer.numDsts; ++iDst)
printf(" DST %0d: %p\n", iDst, transfer.dstMem[iDst]);
}
printf("Hit <Enter> to continue: ");
scanf("%*c");
......@@ -310,8 +302,9 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto& exeInfoPair : transferMap)
{
ExecutorInfo& exeInfo = exeInfoPair.second;
int const numTransfersToRun = (IsGpuType(exeInfoPair.first.first) && ev.useSingleStream) ?
1 : exeInfo.transfers.size();
ExeType exeType = exeInfoPair.first.first;
int const numTransfersToRun = (exeType == EXE_GPU_GFX && ev.useSingleStream) ? 1 : exeInfo.transfers.size();
for (int i = 0; i < numTransfersToRun; ++i)
threads.push(std::thread(RunTransfer, std::ref(ev), iteration, std::ref(exeInfo), i));
}
......@@ -349,8 +342,8 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto transferPair : transferList)
{
Transfer* transfer = transferPair.second;
CheckOrFill(MODE_CHECK, transfer->numBytesToCopy / sizeof(float), ev.useMemset, ev.useHipCall, ev.fillPattern, transfer->dstMem + initOffset);
totalBytesTransferred += transfer->numBytesToCopy;
transfer->ValidateDst(ev);
totalBytesTransferred += transfer->numBytesActual;
}
// Report timings
......@@ -363,11 +356,11 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto& exeInfoPair : transferMap)
{
ExecutorInfo exeInfo = exeInfoPair.second;
MemType const exeMemType = exeInfoPair.first.first;
ExeType const exeType = exeInfoPair.first.first;
int const exeIndex = exeInfoPair.first.second;
// Compute total time for CPU executors
if (!IsGpuType(exeMemType))
// Compute total time for non-GFX executors
if (exeType != EXE_GPU_GFX)
{
exeInfo.totalTime = 0;
for (auto const& transfer : exeInfo.transfers)
......@@ -380,51 +373,49 @@ void ExecuteTransfers(EnvVars const& ev,
if (verbose && !ev.outputToCsv)
{
printf(" Executor: %cPU %02d (# Transfers %02lu)| %9.3f GB/s | %8.3f ms | %12lu bytes\n",
MemTypeStr[exeMemType], exeIndex, exeInfo.transfers.size(), exeBandwidthGbs, exeDurationMsec, exeInfo.totalBytes);
printf(" Executor: %3s %02d | %7.3f GB/s | %8.3f ms | %12lu bytes\n",
ExeTypeName[exeType], exeIndex, exeBandwidthGbs, exeDurationMsec, exeInfo.totalBytes);
}
int totalCUs = 0;
for (auto const& transfer : exeInfo.transfers)
{
double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations);
double transferBandwidthGbs = (transfer->numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f;
totalCUs += transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse;
double transferBandwidthGbs = (transfer->numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
totalCUs += transfer->numSubExecs;
if (!verbose) continue;
if (!ev.outputToCsv)
{
printf(" Transfer %02d | %9.3f GB/s | %8.3f ms | %12lu bytes | %c%02d -> %c%02d:(%03d) -> %c%02d\n",
printf(" Transfer %02d | %7.3f GB/s | %8.3f ms | %12lu bytes | %s -> %s%02d:%03d -> %s\n",
transfer->transferIndex,
transferBandwidthGbs,
transferDurationMsec,
transfer->numBytesToCopy,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
MemTypeStr[transfer->dstMemType], transfer->dstIndex);
transfer->numBytesActual,
transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->numSubExecs,
transfer->DstToStr().c_str());
}
else
{
printf("%d,%d,%lu,%c%02d,%c%02d,%c%02d,%d,%.3f,%.3f,%s,%s,%p,%p\n",
testNum, transfer->transferIndex, transfer->numBytesToCopy,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
printf("%d,%d,%lu,%s,%c%02d,%s,%d,%.3f,%.3f,%s,%s\n",
testNum, transfer->transferIndex, transfer->numBytesActual,
transfer->SrcToStr().c_str(),
MemTypeStr[transfer->exeType], transfer->exeIndex,
transfer->DstToStr().c_str(),
transfer->numSubExecs,
transferBandwidthGbs, transferDurationMsec,
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->srcMemType, transfer->srcIndex).c_str(),
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->dstMemType, transfer->dstIndex).c_str(),
transfer->srcMem + initOffset, transfer->dstMem + initOffset);
PtrVectorToStr(transfer->srcMem, initOffset).c_str(),
PtrVectorToStr(transfer->dstMem, initOffset).c_str());
}
}
if (verbose && ev.outputToCsv)
{
printf("%d,ALL,%lu,ALL,%c%02d,ALL,%d,%.3f,%.3f,ALL,ALL,ALL,ALL\n",
printf("%d,ALL,%lu,ALL,%c%02d,ALL,%d,%.3f,%.3f,ALL,ALL\n",
testNum, totalBytesTransferred,
MemTypeStr[exeMemType], exeIndex, totalCUs,
MemTypeStr[exeType], exeIndex, totalCUs,
exeBandwidthGbs, exeDurationMsec);
}
}
......@@ -435,33 +426,31 @@ void ExecuteTransfers(EnvVars const& ev,
{
Transfer* transfer = transferPair.second;
double transferDurationMsec = transfer->transferTime / (1.0 * numTimedIterations);
double transferBandwidthGbs = (transfer->numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f;
double transferBandwidthGbs = (transfer->numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
maxGpuTime = std::max(maxGpuTime, transferDurationMsec);
if (!verbose) continue;
if (!ev.outputToCsv)
{
printf(" Transfer %02d: %c%02d -> [%cPU %02d:%03d] -> %c%02d | %9.3f GB/s | %8.3f ms | %12lu bytes | %-16s\n",
printf(" Transfer %02d | %7.3f GB/s | %8.3f ms | %12lu bytes | %s -> %s%02d:%03d -> %s\n",
transfer->transferIndex,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transferBandwidthGbs, transferDurationMsec,
transfer->numBytesToCopy,
GetTransferDesc(*transfer).c_str());
transfer->numBytesActual,
transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->numSubExecs,
transfer->DstToStr().c_str());
}
else
{
printf("%d,%d,%lu,%c%02d,%c%02d,%c%02d,%d,%.3f,%.3f,%s,%s,%p,%p\n",
testNum, transfer->transferIndex, transfer->numBytesToCopy,
MemTypeStr[transfer->srcMemType], transfer->srcIndex,
MemTypeStr[transfer->exeMemType], transfer->exeIndex,
MemTypeStr[transfer->dstMemType], transfer->dstIndex,
transfer->exeMemType == MEM_CPU ? ev.numCpuPerTransfer : transfer->numBlocksToUse,
printf("%d,%d,%lu,%s,%s%02d,%s,%d,%.3f,%.3f,%s,%s\n",
testNum, transfer->transferIndex, transfer->numBytesActual,
transfer->SrcToStr().c_str(),
ExeTypeName[transfer->exeType], transfer->exeIndex,
transfer->DstToStr().c_str(),
transfer->numSubExecs,
transferBandwidthGbs, transferDurationMsec,
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->srcMemType, transfer->srcIndex).c_str(),
GetDesc(transfer->exeMemType, transfer->exeIndex, transfer->dstMemType, transfer->dstIndex).c_str(),
transfer->srcMem + initOffset, transfer->dstMem + initOffset);
PtrVectorToStr(transfer->srcMem, initOffset).c_str(),
PtrVectorToStr(transfer->dstMem, initOffset).c_str());
}
}
}
......@@ -471,12 +460,12 @@ void ExecuteTransfers(EnvVars const& ev,
{
if (!ev.outputToCsv)
{
printf(" Aggregate Bandwidth (CPU timed) | %9.3f GB/s | %8.3f ms | %12lu bytes | Overhead: %.3f ms\n",
printf(" Aggregate (CPU) | %7.3f GB/s | %8.3f ms | %12lu bytes | Overhead: %.3f ms\n",
totalBandwidthGbs, totalCpuTime, totalBytesTransferred, totalCpuTime - maxGpuTime);
}
else
{
printf("%d,ALL,%lu,ALL,ALL,ALL,ALL,%.3f,%.3f,ALL,ALL,ALL,ALL\n",
printf("%d,ALL,%lu,ALL,ALL,ALL,ALL,%.3f,%.3f,ALL,ALL\n",
testNum, totalBytesTransferred, totalBandwidthGbs, totalCpuTime);
}
}
......@@ -485,31 +474,38 @@ void ExecuteTransfers(EnvVars const& ev,
for (auto exeInfoPair : transferMap)
{
ExecutorInfo& exeInfo = exeInfoPair.second;
ExeType const exeType = exeInfoPair.first.first;
int const exeIndex = RemappedIndex(exeInfoPair.first.second, IsCpuType(exeType));
for (auto& transfer : exeInfo.transfers)
{
// Get some aliases to Transfer variables
MemType const& exeMemType = transfer->exeMemType;
MemType const& srcMemType = transfer->srcMemType;
MemType const& dstMemType = transfer->dstMemType;
// Allocate (maximum) source / destination memory based on type / device index
DeallocateMemory(srcMemType, transfer->srcMem, N * sizeof(float) + ev.byteOffset);
DeallocateMemory(dstMemType, transfer->dstMem, N * sizeof(float) + ev.byteOffset);
transfer->blockParam.clear();
for (int iSrc = 0; iSrc < transfer->numSrcs; ++iSrc)
{
MemType const& srcType = transfer->srcType[iSrc];
DeallocateMemory(srcType, transfer->srcMem[iSrc], transfer->numBytesActual + ev.byteOffset);
}
for (int iDst = 0; iDst < transfer->numDsts; ++iDst)
{
MemType const& dstType = transfer->dstType[iDst];
DeallocateMemory(dstType, transfer->dstMem[iDst], transfer->numBytesActual + ev.byteOffset);
}
transfer->subExecParam.clear();
}
MemType const exeMemType = exeInfoPair.first.first;
int const exeIndex = RemappedIndex(exeInfoPair.first.second, exeMemType);
if (exeMemType == MEM_GPU)
if (IsGpuType(exeType))
{
DeallocateMemory(exeMemType, exeInfo.blockParamGpu);
int const numTransfersToRun = ev.useSingleStream ? 1 : exeInfo.transfers.size();
for (int i = 0; i < numTransfersToRun; ++i)
int const numStreams = (int)exeInfo.streams.size();
for (int i = 0; i < numStreams; ++i)
{
HIP_CALL(hipEventDestroy(exeInfo.startEvents[i]));
HIP_CALL(hipEventDestroy(exeInfo.stopEvents[i]));
HIP_CALL(hipStreamDestroy(exeInfo.streams[i]));
}
if (exeType == EXE_GPU_GFX)
{
DeallocateMemory(MEM_GPU, exeInfo.subExecParamGpu);
}
}
}
}
......@@ -531,12 +527,10 @@ void DisplayUsage(char const* cmdName)
printf("Usage: %s config <N>\n", cmdName);
printf(" config: Either:\n");
printf(" - Filename of configFile containing Transfers to execute (see example.cfg for format)\n");
printf(" - Name of preset benchmark:\n");
printf(" p2p{_rr} - All CPU/GPU pairs benchmark {with remote reads}\n");
printf(" g2g{_rr} - All GPU/GPU pairs benchmark {with remote reads}\n");
printf(" sweep - Sweep across possible sets of Transfers\n");
printf(" rsweep - Randomly sweep across possible sets of Transfers\n");
printf(" - 3rd optional argument used as # of CUs to use (all by default for p2p / 4 for sweep)\n");
printf(" - Name of preset config:\n");
printf(" p2p - Peer-to-peer benchmark tests\n");
printf(" sweep/rsweep - Sweep/random sweep across possible sets of Transfers\n");
printf(" - 3rd/4th optional args for # GPU SubExecs / # CPU SubExecs per Transfer\n");
printf(" N : (Optional) Number of bytes to copy per Transfer.\n");
printf(" If not specified, defaults to %lu bytes. Must be a multiple of 4 bytes\n",
DEFAULT_BYTES_PER_TRANSFER);
......@@ -547,7 +541,7 @@ void DisplayUsage(char const* cmdName)
EnvVars::DisplayUsage();
}
int RemappedIndex(int const origIdx, MemType const memType)
int RemappedIndex(int const origIdx, bool const isCpuType)
{
static std::vector<int> remappingCpu;
static std::vector<int> remappingGpu;
......@@ -591,7 +585,7 @@ int RemappedIndex(int const origIdx, MemType const memType)
remappingGpu[i] = mapping[i].second;
}
}
return IsCpuType(memType) ? remappingCpu[origIdx] : remappingGpu[origIdx];
return isCpuType ? remappingCpu[origIdx] : remappingGpu[origIdx];
}
void DisplayTopology(bool const outputToCsv)
......@@ -634,11 +628,11 @@ void DisplayTopology(bool const outputToCsv)
for (int i = 0; i < numCpuDevices; i++)
{
int nodeI = RemappedIndex(i, MEM_CPU);
int nodeI = RemappedIndex(i, true);
printf("NUMA %02d (%02d)%s", i, nodeI, outputToCsv ? "," : "|");
for (int j = 0; j < numCpuDevices; j++)
{
int nodeJ = RemappedIndex(j, MEM_CPU);
int nodeJ = RemappedIndex(j, true);
int numaDist = numa_distance(nodeI, nodeJ);
if (outputToCsv)
printf("%d,", numaDist);
@@ -657,7 +651,7 @@ void DisplayTopology(bool const outputToCsv)
bool isFirst = true;
for (int j = 0; j < numGpuDevices; j++)
{
if (GetClosestNumaNode(RemappedIndex(j, MEM_GPU)) == i)
if (GetClosestNumaNode(RemappedIndex(j, false)) == i)
{
if (isFirst) isFirst = false;
else printf(",");
@@ -678,19 +672,30 @@ void DisplayTopology(bool const outputToCsv)
}
else
{
printf(" |");
for (int j = 0; j < numGpuDevices; j++)
{
hipDeviceProp_t prop;
HIP_CALL(hipGetDeviceProperties(&prop, j));
std::string fullName = prop.gcnArchName;
std::string archName = fullName.substr(0, fullName.find(':'));
printf(" %6s |", archName.c_str());
}
printf("\n");
printf(" |");
for (int j = 0; j < numGpuDevices; j++)
printf(" GPU %02d |", j);
printf(" PCIe Bus ID | Closest NUMA\n");
printf(" PCIe Bus ID | #CUs | Closest NUMA\n");
for (int j = 0; j <= numGpuDevices; j++)
printf("--------+");
printf("--------------+-------------\n");
printf("--------------+------+-------------\n");
}
char pciBusId[20];
for (int i = 0; i < numGpuDevices; i++)
{
int const deviceIdx = RemappedIndex(i, false);
printf("%sGPU %02d%s", outputToCsv ? "" : " ", i, outputToCsv ? "," : " |");
for (int j = 0; j < numGpuDevices; j++)
{
@@ -704,8 +709,8 @@ void DisplayTopology(bool const outputToCsv)
else
{
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(i, MEM_GPU),
RemappedIndex(j, MEM_GPU),
HIP_CALL(hipExtGetLinkTypeAndHopCount(deviceIdx,
RemappedIndex(j, false),
&linkType, &hopCount));
printf("%s%s-%d%s",
outputToCsv ? "" : " ",
@@ -717,44 +722,78 @@ void DisplayTopology(bool const outputToCsv)
hopCount, outputToCsv ? "," : " |");
}
}
HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, RemappedIndex(i, MEM_GPU)));
HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, deviceIdx));
int numDeviceCUs = 0;
HIP_CALL(hipDeviceGetAttribute(&numDeviceCUs, hipDeviceAttributeMultiprocessorCount, deviceIdx));
if (outputToCsv)
printf("%s,%d\n", pciBusId, GetClosestNumaNode(RemappedIndex(i, MEM_GPU)));
printf("%s,%d,%d\n", pciBusId, numDeviceCUs, GetClosestNumaNode(deviceIdx));
else
printf(" %11s | %d \n", pciBusId, GetClosestNumaNode(RemappedIndex(i, MEM_GPU)));
printf(" %11s | %4d | %d\n", pciBusId, numDeviceCUs, GetClosestNumaNode(deviceIdx));
}
}
void ParseMemType(std::string const& token, int const numCpus, int const numGpus, MemType* memType, int* memIndex)
void ParseMemType(std::string const& token, int const numCpus, int const numGpus,
std::vector<MemType>& memTypes, std::vector<int>& memIndices)
{
char typeChar;
if (sscanf(token.c_str(), " %c %d", &typeChar, memIndex) != 2)
int offset = 0, devIndex, inc;
bool found = false;
memTypes.clear();
memIndices.clear();
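// A token may contain several (type, index) pairs, e.g. "G0G1" yields {MEM_GPU:0, MEM_GPU:1};
// MEM_NULL entries are dropped, so a token such as "N0" produces empty type/index lists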
while (sscanf(token.c_str() + offset, " %c %d%n", &typeChar, &devIndex, &inc) == 2)
{
offset += inc;
MemType memType = CharToMemType(typeChar);
if (IsCpuType(memType) && (devIndex < 0 || devIndex >= numCpus))
{
printf("[ERROR] Unable to parse memory type token %s - expecting either 'B,C,G or F' followed by an index\n",
token.c_str());
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, devIndex);
exit(1);
}
if (IsGpuType(memType) && (devIndex < 0 || devIndex >= numGpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, devIndex);
exit(1);
}
switch (typeChar)
found = true;
if (memType != MEM_NULL)
{
memTypes.push_back(memType);
memIndices.push_back(devIndex);
}
}
if (!found)
{
case 'C': case 'c': case 'B': case 'b': case 'U': case 'u':
*memType = (typeChar == 'C' || typeChar == 'c') ? MEM_CPU : ((typeChar == 'B' || typeChar == 'b') ? MEM_CPU_FINE : MEM_CPU_UNPINNED);
if (*memIndex < 0 || *memIndex >= numCpus)
printf("[ERROR] Unable to parse memory type token %s. Expected one of %s followed by an index\n",
token.c_str(), MemTypeStr);
exit(1);
}
}
void ParseExeType(std::string const& token, int const numCpus, int const numGpus,
ExeType &exeType, int& exeIndex)
{
char typeChar;
if (sscanf(token.c_str(), " %c%d", &typeChar, &exeIndex) != 2)
{
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, *memIndex);
printf("[ERROR] Unable to parse valid executor token (%s). Expected one of %s followed by an index\n",
token.c_str(), ExeTypeStr);
exit(1);
}
break;
case 'G': case 'g': case 'F': case 'f':
*memType = (typeChar == 'G' || typeChar == 'g') ? MEM_GPU : MEM_GPU_FINE;
if (*memIndex < 0 || *memIndex >= numGpus)
exeType = CharToExeType(typeChar);
if (IsCpuType(exeType) && (exeIndex < 0 || exeIndex >= numCpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, *memIndex);
printf("[ERROR] CPU index must be between 0 and %d (instead of %d)\n", numCpus-1, exeIndex);
exit(1);
}
break;
default:
printf("[ERROR] Unrecognized memory type %s. Expecting either 'B','C','U','G' or 'F'\n", token.c_str());
if (IsGpuType(exeType) && (exeIndex < 0 || exeIndex >= numGpus))
{
printf("[ERROR] GPU index must be between 0 and %d (instead of %d)\n", numGpus-1, exeIndex);
exit(1);
}
}
@@ -777,18 +816,18 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
std::string srcMem;
std::string dstMem;
// If numTransfers < 0, read quads (srcMem, exeMem, dstMem, #CUs)
// If numTransfers < 0, read 5-tuples (srcMem, exeMem, dstMem, #SubExecs, #Bytes)
// otherwise read triples (srcMem, exeMem, dstMem)
bool const advancedMode = (numTransfers < 0);
numTransfers = abs(numTransfers);
int numBlocksToUse;
int numSubExecs;
if (!advancedMode)
{
iss >> numBlocksToUse;
if (numBlocksToUse <= 0 || iss.fail())
iss >> numSubExecs;
if (numSubExecs <= 0 || iss.fail())
{
printf("Parsing error: Number of blocks to use (%d) must be greater than 0\n", numBlocksToUse);
printf("Parsing error: Number of SubExecutors to use (%d) must be greater than 0\n", numSubExecs);
exit(1);
}
}
@@ -799,7 +838,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
Transfer transfer;
transfer.transferIndex = i;
transfer.numBytes = 0;
transfer.numBytesToCopy = 0;
transfer.numBytesActual = 0;
if (!advancedMode)
{
iss >> srcMem >> exeMem >> dstMem;
@@ -812,7 +851,7 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
else
{
std::string numBytesToken;
iss >> srcMem >> exeMem >> dstMem >> numBlocksToUse >> numBytesToken;
iss >> srcMem >> exeMem >> dstMem >> numSubExecs >> numBytesToken;
if (iss.fail())
{
printf("Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple\n", i+1);
@@ -824,18 +863,33 @@ void ParseTransfers(char* line, int numCpus, int numGpus, std::vector<Transfer>&
exit(1);
}
char units = numBytesToken.back();
switch (units)
switch (toupper(units))
{
case 'K': numBytes *= 1024; break;
case 'M': numBytes *= 1024*1024; break;
case 'G': numBytes *= 1024*1024*1024; break;
}
}
ParseMemType(srcMem, numCpus, numGpus, transfer.srcType, transfer.srcIndex);
ParseMemType(dstMem, numCpus, numGpus, transfer.dstType, transfer.dstIndex);
ParseExeType(exeMem, numCpus, numGpus, transfer.exeType, transfer.exeIndex);
transfer.numSrcs = (int)transfer.srcType.size();
transfer.numDsts = (int)transfer.dstType.size();
if (transfer.numSrcs == 0 && transfer.numDsts == 0)
{
case 'K': case 'k': numBytes *= 1024; break;
case 'M': case 'm': numBytes *= 1024*1024; break;
case 'G': case 'g': numBytes *= 1024*1024*1024; break;
printf("[ERROR] Transfer must have at least one src or dst\n");
exit(1);
}
if (transfer.exeType == EXE_GPU_DMA && (transfer.numSrcs > 1 || transfer.numDsts > 1))
{
printf("[ERROR] GPU DMA executor can only be used for single source / single dst Transfers\n");
exit(1);
}
ParseMemType(srcMem, numCpus, numGpus, &transfer.srcMemType, &transfer.srcIndex);
ParseMemType(exeMem, numCpus, numGpus, &transfer.exeMemType, &transfer.exeIndex);
ParseMemType(dstMem, numCpus, numGpus, &transfer.dstMemType, &transfer.dstIndex);
transfer.numBlocksToUse = numBlocksToUse;
transfer.numSubExecs = numSubExecs;
transfer.numBytes = numBytes;
transfers.push_back(transfer);
}
@@ -971,158 +1025,31 @@ void CheckPages(char* array, size_t numBytes, int targetId)
}
}
// Helper function to either fill a device pointer with pseudo-random data, or to check to see if it matches
void CheckOrFill(ModeType mode, int N, bool isMemset, bool isHipCall, std::vector<float>const& fillPattern, float* ptr)
{
// Prepare reference result
float* refBuffer = (float*)malloc(N * sizeof(float));
if (isMemset)
{
if (isHipCall)
{
memset(refBuffer, 42, N * sizeof(float));
}
else
{
for (int i = 0; i < N; i++)
refBuffer[i] = 1234.0f;
}
}
else
{
// Fill with repeated pattern if specified
size_t patternLen = fillPattern.size();
if (patternLen > 0)
{
for (int i = 0; i < N; i++)
refBuffer[i] = fillPattern[i % patternLen];
}
else // Otherwise fill with pseudo-random values
{
for (int i = 0; i < N; i++)
refBuffer[i] = (i % 383 + 31);
}
}
// Either fill the memory with the reference buffer, or compare against it
if (mode == MODE_FILL)
{
HIP_CALL(hipMemcpy(ptr, refBuffer, N * sizeof(float), hipMemcpyDefault));
}
else if (mode == MODE_CHECK)
{
float* hostBuffer = (float*) malloc(N * sizeof(float));
HIP_CALL(hipMemcpy(hostBuffer, ptr, N * sizeof(float), hipMemcpyDefault));
for (int i = 0; i < N; i++)
{
if (refBuffer[i] != hostBuffer[i])
{
printf("[ERROR] Mismatch at element %d Ref: %f Actual: %f\n", i, refBuffer[i], hostBuffer[i]);
exit(1);
}
}
free(hostBuffer);
}
free(refBuffer);
}
std::string GetLinkTypeDesc(uint32_t linkType, uint32_t hopCount)
{
char result[10];
switch (linkType)
{
case HSA_AMD_LINK_INFO_TYPE_HYPERTRANSPORT: sprintf(result, " HT-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_QPI : sprintf(result, " QPI-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_PCIE : sprintf(result, "PCIE-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_INFINBAND : sprintf(result, "INFB-%d", hopCount); break;
case HSA_AMD_LINK_INFO_TYPE_XGMI : sprintf(result, "XGMI-%d", hopCount); break;
default: sprintf(result, "??????");
}
return result;
}
std::string GetDesc(MemType srcMemType, int srcIndex,
MemType dstMemType, int dstIndex)
{
if (IsCpuType(srcMemType))
{
if (IsCpuType(dstMemType)) return (srcIndex == dstIndex) ? "LOCAL" : "NUMA";
if (IsGpuType(dstMemType)) return "PCIE";
goto error;
}
if (IsGpuType(srcMemType))
{
if (IsCpuType(dstMemType)) return "PCIE";
if (IsGpuType(dstMemType))
{
if (srcIndex == dstIndex) return "LOCAL";
else
{
uint32_t linkType, hopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(srcIndex, MEM_GPU),
RemappedIndex(dstIndex, MEM_GPU),
&linkType, &hopCount));
return GetLinkTypeDesc(linkType, hopCount);
}
}
}
error:
printf("[ERROR] Unrecognized memory type\n");
exit(1);
}
std::string GetTransferDesc(Transfer const& transfer)
{
return GetDesc(transfer.srcMemType, transfer.srcIndex, transfer.exeMemType, transfer.exeIndex) + "-"
+ GetDesc(transfer.exeMemType, transfer.exeIndex, transfer.dstMemType, transfer.dstIndex);
}
void RunTransfer(EnvVars const& ev, int const iteration,
ExecutorInfo& exeInfo, int const transferIdx)
{
Transfer* transfer = exeInfo.transfers[transferIdx];
// GPU execution agent
if (transfer->exeMemType == MEM_GPU)
if (transfer->exeType == EXE_GPU_GFX)
{
// Switch to executing GPU
int const exeIndex = RemappedIndex(transfer->exeIndex, MEM_GPU);
int const exeIndex = RemappedIndex(transfer->exeIndex, false);
HIP_CALL(hipSetDevice(exeIndex));
hipStream_t& stream = exeInfo.streams[transferIdx];
hipEvent_t& startEvent = exeInfo.startEvents[transferIdx];
hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx];
int const initOffset = ev.byteOffset / sizeof(float);
if (ev.useHipCall)
{
// Record start event
HIP_CALL(hipEventRecord(startEvent, stream));
// Execute hipMemset / hipMemcpy
if (ev.useMemset)
HIP_CALL(hipMemsetAsync(transfer->dstMem + initOffset, 42, transfer->numBytesToCopy, stream));
else
HIP_CALL(hipMemcpyAsync(transfer->dstMem + initOffset,
transfer->srcMem + initOffset,
transfer->numBytesToCopy, hipMemcpyDefault,
stream));
// Record stop event
HIP_CALL(hipEventRecord(stopEvent, stream));
}
else
{
int const numBlocksToRun = ev.useSingleStream ? exeInfo.totalBlocks : transfer->numBlocksToUse;
hipExtLaunchKernelGGL(ev.useMemset ? GpuMemsetKernel : GpuCopyKernel,
// Figure out how many threadblocks to use.
// In single stream mode, all the threadblocks for this GPU are launched
// Otherwise, just launch the threadblocks associated with this single Transfer
int const numBlocksToRun = ev.useSingleStream ? exeInfo.totalSubExecs : transfer->numSubExecs;
hipExtLaunchKernelGGL(GpuKernelTable[ev.gpuKernel],
dim3(numBlocksToRun, 1, 1),
dim3(BLOCKSIZE, 1, 1),
ev.sharedMemBytes, stream,
startEvent, stopEvent,
0, transfer->blockParamGpuPtr);
}
0, transfer->subExecParamGpuPtr);
// Synchronize per iteration, unless in single sync mode, in which case
// synchronize during last warmup / last actual iteration
@@ -1136,14 +1063,15 @@ void RunTransfer(EnvVars const& ev, int const iteration,
if (ev.useSingleStream)
{
// Figure out individual timings for Transfers that were all launched together
for (Transfer* currTransfer : exeInfo.transfers)
{
long long minStartCycle = currTransfer->blockParamGpuPtr[0].startCycle;
long long maxStopCycle = currTransfer->blockParamGpuPtr[0].stopCycle;
for (int i = 1; i < currTransfer->numBlocksToUse; i++)
long long minStartCycle = currTransfer->subExecParamGpuPtr[0].startCycle;
long long maxStopCycle = currTransfer->subExecParamGpuPtr[0].stopCycle;
for (int i = 1; i < currTransfer->numSubExecs; i++)
{
minStartCycle = std::min(minStartCycle, currTransfer->blockParamGpuPtr[i].startCycle);
maxStopCycle = std::max(maxStopCycle, currTransfer->blockParamGpuPtr[i].stopCycle);
minStartCycle = std::min(minStartCycle, currTransfer->subExecParamGpuPtr[i].startCycle);
maxStopCycle = std::max(maxStopCycle, currTransfer->subExecParamGpuPtr[i].stopCycle);
}
int const wallClockRate = GetWallClockRate(exeIndex);
double iterationTimeMs = (maxStopCycle - minStartCycle) / (double)(wallClockRate);
@@ -1157,10 +1085,43 @@ void RunTransfer(EnvVars const& ev, int const iteration,
}
}
}
else if (transfer->exeMemType == MEM_CPU) // CPU execution agent
else if (transfer->exeType == EXE_GPU_DMA)
{
// Switch to executing GPU
int const exeIndex = RemappedIndex(transfer->exeIndex, false);
HIP_CALL(hipSetDevice(exeIndex));
hipStream_t& stream = exeInfo.streams[transferIdx];
hipEvent_t& startEvent = exeInfo.startEvents[transferIdx];
hipEvent_t& stopEvent = exeInfo.stopEvents[transferIdx];
HIP_CALL(hipEventRecord(startEvent, stream));
if (transfer->numSrcs == 0 && transfer->numDsts == 1)
{
HIP_CALL(hipMemsetAsync(transfer->dstMem[0],
MEMSET_CHAR, transfer->numBytesActual, stream));
}
else if (transfer->numSrcs == 1 && transfer->numDsts == 1)
{
HIP_CALL(hipMemcpyAsync(transfer->dstMem[0], transfer->srcMem[0],
transfer->numBytesActual, hipMemcpyDefault,
stream));
}
HIP_CALL(hipEventRecord(stopEvent, stream));
HIP_CALL(hipStreamSynchronize(stream));
if (iteration >= 0)
{
// Record GPU timing
float gpuDeltaMsec;
HIP_CALL(hipEventElapsedTime(&gpuDeltaMsec, startEvent, stopEvent));
transfer->transferTime += gpuDeltaMsec;
}
}
else if (transfer->exeType == EXE_CPU) // CPU execution agent
{
// Force this thread and all child threads onto correct NUMA node
int const exeIndex = RemappedIndex(transfer->exeIndex, MEM_CPU);
int const exeIndex = RemappedIndex(transfer->exeIndex, true);
if (numa_run_on_node(exeIndex))
{
printf("[ERROR] Unable to set CPU to NUMA node %d\n", exeIndex);
@@ -1171,12 +1132,12 @@ void RunTransfer(EnvVars const& ev, int const iteration,
auto cpuStart = std::chrono::high_resolution_clock::now();
// Launch child-threads to perform memcopies
for (int i = 0; i < ev.numCpuPerTransfer; i++)
childThreads.push_back(std::thread(ev.useMemset ? CpuMemsetKernel : CpuCopyKernel, std::ref(transfer->blockParam[i])));
// Launch each subExecutor in child-threads to perform memcopies
for (int i = 0; i < transfer->numSubExecs; ++i)
childThreads.push_back(std::thread(CpuReduceKernel, std::ref(transfer->subExecParam[i])));
// Wait for child-threads to finish
for (int i = 0; i < ev.numCpuPerTransfer; i++)
for (int i = 0; i < transfer->numSubExecs; ++i)
childThreads[i].join();
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
@@ -1187,11 +1148,13 @@ void RunTransfer(EnvVars const& ev, int const iteration,
}
}
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, int readMode, int skipCpu)
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N)
{
ev.DisplayP2PBenchmarkEnvVars();
// Collect the number of available CPUs/GPUs on this machine
int const numGpus = ev.numGpuDevices;
int const numCpus = ev.numCpuDevices;
int const numGpus = ev.numGpuDevices;
int const numDevices = numCpus + numGpus;
// Enable peer to peer for each GPU
@@ -1199,52 +1162,38 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
for (int j = 0; j < numGpus; j++)
if (i != j) EnablePeerAccess(i, j);
if (!ev.outputToCsv)
{
printf("Performing copies in each direction of %lu bytes\n", N * sizeof(float));
printf("Using %d threads per NUMA node for CPU copies\n", ev.numCpuPerTransfer);
printf("Using %d CUs per transfer\n", numBlocksToUse);
}
else
{
printf("SRC,DST,Direction,ReadMode,BW(GB/s),Bytes\n");
}
// Perform unidirectional / bidirectional
for (int isBidirectional = 0; isBidirectional <= 1; isBidirectional++)
{
// Print header
if (!ev.outputToCsv)
{
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write]\n", isBidirectional ? "Bi" : "Uni",
readMode == 0 ? "Local" : "Remote",
readMode == 0 ? "Remote" : "Local");
printf("%10s", "D/D");
if (!skipCpu)
{
for (int i = 0; i < numCpus; i++)
printf("%7s %02d", "CPU", i);
}
for (int i = 0; i < numGpus; i++)
printf("%7s %02d", "GPU", i);
printf("%sdirectional copy peak bandwidth GB/s [%s read / %s write] (GPU-Executor: %s)\n", isBidirectional ? "Bi" : "Uni",
ev.useRemoteRead ? "Remote" : "Local",
ev.useRemoteRead ? "Local" : "Remote",
ev.useDmaCopy ? "DMA" : "GFX");
printf("%10s", "SRC\\DST");
for (int i = 0; i < numCpus; i++) printf("%7s %02d", "CPU", i);
for (int i = 0; i < numGpus; i++) printf("%7s %02d", "GPU", i);
printf("\n");
}
// Loop over all possible src/dst pairs
for (int src = 0; src < numDevices; src++)
{
MemType const& srcMemType = (src < numCpus ? MEM_CPU : MEM_GPU);
if (skipCpu && srcMemType == MEM_CPU) continue;
int srcIndex = (srcMemType == MEM_CPU ? src : src - numCpus);
MemType const srcType = (src < numCpus ? MEM_CPU : MEM_GPU);
int const srcIndex = (srcType == MEM_CPU ? src : src - numCpus);
if (!ev.outputToCsv)
printf("%7s %02d", (srcMemType == MEM_CPU) ? "CPU" : "GPU", srcIndex);
printf("%7s %02d", (srcType == MEM_CPU) ? "CPU" : "GPU", srcIndex);
for (int dst = 0; dst < numDevices; dst++)
{
MemType const& dstMemType = (dst < numCpus ? MEM_CPU : MEM_GPU);
if (skipCpu && dstMemType == MEM_CPU) continue;
int dstIndex = (dstMemType == MEM_CPU ? dst : dst - numCpus);
double bandwidth = GetPeakBandwidth(ev, N, isBidirectional, readMode, numBlocksToUse,
srcMemType, srcIndex, dstMemType, dstIndex);
MemType const dstType = (dst < numCpus ? MEM_CPU : MEM_GPU);
int const dstIndex = (dstType == MEM_CPU ? dst : dst - numCpus);
double bandwidth = GetPeakBandwidth(ev, N, isBidirectional, srcType, srcIndex, dstType, dstIndex);
if (!ev.outputToCsv)
{
if (bandwidth == 0)
@@ -1254,13 +1203,12 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
}
else
{
printf("%s %02d,%s %02d,%s,%s,%.2f,%lu\n",
srcMemType == MEM_CPU ? "CPU" : "GPU",
srcIndex,
dstMemType == MEM_CPU ? "CPU" : "GPU",
dstIndex,
printf("%s %02d,%s %02d,%s,%s,%s,%.2f,%lu\n",
srcType == MEM_CPU ? "CPU" : "GPU", srcIndex,
dstType == MEM_CPU ? "CPU" : "GPU", dstIndex,
isBidirectional ? "bidirectional" : "unidirectional",
readMode == 0 ? "Local" : "Remote",
ev.useRemoteRead ? "Remote" : "Local",
ev.useDmaCopy ? "DMA" : "GFX",
bandwidth,
N * sizeof(float));
}
@@ -1272,42 +1220,50 @@ void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, in
}
}
double GetPeakBandwidth(EnvVars const& ev,
size_t const N,
double GetPeakBandwidth(EnvVars const& ev, size_t const N,
int const isBidirectional,
int const readMode,
int const numBlocksToUse,
MemType const srcMemType,
int const srcIndex,
MemType const dstMemType,
int const dstIndex)
MemType const srcType, int const srcIndex,
MemType const dstType, int const dstIndex)
{
// Skip bidirectional on same device
if (isBidirectional && srcMemType == dstMemType && srcIndex == dstIndex) return 0.0f;
if (isBidirectional && srcType == dstType && srcIndex == dstIndex) return 0.0f;
int const initOffset = ev.byteOffset / sizeof(float);
// Prepare Transfers
std::vector<Transfer> transfers(2);
transfers[0].srcMemType = transfers[1].dstMemType = srcMemType;
transfers[0].dstMemType = transfers[1].srcMemType = dstMemType;
transfers[0].srcIndex = transfers[1].dstIndex = srcIndex;
transfers[0].dstIndex = transfers[1].srcIndex = dstIndex;
transfers[0].numBytes = transfers[1].numBytes = N * sizeof(float);
transfers[0].numBlocksToUse = transfers[1].numBlocksToUse = numBlocksToUse;
// Either perform (local read + remote write), or (remote read + local write)
transfers[0].exeMemType = (readMode == 0 ? srcMemType : dstMemType);
transfers[1].exeMemType = (readMode == 0 ? dstMemType : srcMemType);
transfers[0].exeIndex = (readMode == 0 ? srcIndex : dstIndex);
transfers[1].exeIndex = (readMode == 0 ? dstIndex : srcIndex);
// SRC -> DST
transfers[0].numSrcs = transfers[0].numDsts = 1;
transfers[0].srcType.push_back(srcType);
transfers[0].dstType.push_back(dstType);
transfers[0].srcIndex.push_back(srcIndex);
transfers[0].dstIndex.push_back(dstIndex);
// DST -> SRC
transfers[1].numSrcs = transfers[1].numDsts = 1;
transfers[1].srcType.push_back(dstType);
transfers[1].dstType.push_back(srcType);
transfers[1].srcIndex.push_back(dstIndex);
transfers[1].dstIndex.push_back(srcIndex);
// Either perform (local read + remote write), or (remote read + local write)
ExeType gpuExeType = ev.useDmaCopy ? EXE_GPU_DMA : EXE_GPU_GFX;
transfers[0].exeType = IsGpuType(ev.useRemoteRead ? dstType : srcType) ? gpuExeType : EXE_CPU;
transfers[1].exeType = IsGpuType(ev.useRemoteRead ? srcType : dstType) ? gpuExeType : EXE_CPU;
transfers[0].exeIndex = (ev.useRemoteRead ? dstIndex : srcIndex);
transfers[1].exeIndex = (ev.useRemoteRead ? srcIndex : dstIndex);
transfers[0].numSubExecs = IsGpuType(transfers[0].exeType) ? ev.numGpuSubExecs : ev.numCpuSubExecs;
transfers[1].numSubExecs = IsGpuType(transfers[1].exeType) ? ev.numGpuSubExecs : ev.numCpuSubExecs;
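// Example: for a GPU0->GPU1 pair with USE_REMOTE_READ unset, transfers[0] executes on GPU0
// (local read, remote write) and transfers[1] executes on GPU1 for the reverse direction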
// Remove (DST->SRC) if not bidirectional
transfers.resize(isBidirectional + 1);
// Abort if executing on NUMA node with no CPUs
for (int i = 0; i <= isBidirectional; i++)
{
if (transfers[i].exeMemType == MEM_CPU && ev.numCpusPerNuma[transfers[i].exeIndex] == 0)
if (transfers[i].exeType == EXE_CPU && ev.numCpusPerNuma[transfers[i].exeIndex] == 0)
return 0;
}
@@ -1318,45 +1274,176 @@ double GetPeakBandwidth(EnvVars const& ev,
for (int i = 0; i <= isBidirectional; i++)
{
double transferDurationMsec = transfers[i].transferTime / (1.0 * ev.numIterations);
double transferBandwidthGbs = (transfers[i].numBytesToCopy / 1.0E9) / transferDurationMsec * 1000.0f;
double transferBandwidthGbs = (transfers[i].numBytesActual / 1.0E9) / transferDurationMsec * 1000.0f;
totalBandwidth += transferBandwidthGbs;
}
return totalBandwidth;
}
void Transfer::PrepareBlockParams(EnvVars const& ev, size_t const N)
void Transfer::PrepareSubExecParams(EnvVars const& ev)
{
// Each subExecutor needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each subExecutor works on for this Transfer
// - Partition N as evenly as possible, but try to keep subarray sizes as multiples of BLOCK_BYTES bytes,
// except the very last one, for alignment reasons
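// Illustrative split (example values only, assuming blockBytes = 256, i.e. 64 floats): 262144 floats
// divided across 3 SubExecutors works out to 87360 / 87360 / 87424 floats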
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
int const targetMultiple = ev.blockBytes / sizeof(float);
// Initialize source memory with patterned data
CheckOrFill(MODE_FILL, N, ev.useMemset, ev.useHipCall, ev.fillPattern, this->srcMem + initOffset);
// In some cases, there may not be enough data for all subExecutors
int const maxSubExecToUse = std::min((int)(N + targetMultiple - 1) / targetMultiple, this->numSubExecs);
this->subExecParam.clear();
this->subExecParam.resize(this->numSubExecs);
// Each block needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each block does for this Transfer
// - Partition N as evenly as possible, but try to keep blocks as multiples of BLOCK_BYTES bytes,
// except the very last one, for alignment reasons
int const targetMultiple = ev.blockBytes / sizeof(float);
int const maxNumBlocksToUse = std::min((N + targetMultiple - 1) / targetMultiple, this->blockParam.size());
size_t assigned = 0;
for (int j = 0; j < this->blockParam.size(); j++)
for (int i = 0; i < this->numSubExecs; ++i)
{
int const blocksLeft = std::max(0, maxNumBlocksToUse - j);
int const subExecLeft = std::max(0, maxSubExecToUse - i);
size_t const leftover = N - assigned;
size_t const roundedN = (leftover + targetMultiple - 1) / targetMultiple;
BlockParam& param = this->blockParam[j];
param.N = blocksLeft ? std::min(leftover, ((roundedN / blocksLeft) * targetMultiple)) : 0;
param.src = this->srcMem + assigned + initOffset;
param.dst = this->dstMem + assigned + initOffset;
param.startCycle = 0;
param.stopCycle = 0;
assigned += param.N;
SubExecParam& p = this->subExecParam[i];
p.N = subExecLeft ? std::min(leftover, ((roundedN / subExecLeft) * targetMultiple)) : 0;
p.numSrcs = this->numSrcs;
p.numDsts = this->numDsts;
for (int iSrc = 0; iSrc < this->numSrcs; ++iSrc)
p.src[iSrc] = this->srcMem[iSrc] + assigned + initOffset;
for (int iDst = 0; iDst < this->numDsts; ++iDst)
p.dst[iDst] = this->dstMem[iDst] + assigned + initOffset;
if (ev.enableDebug)
{
printf("Transfer %02d SE:%02d: %10lu floats: %10lu to %10lu\n",
this->transferIndex, i, p.N, assigned, assigned + p.N);
}
p.startCycle = 0;
p.stopCycle = 0;
assigned += p.N;
}
this->transferTime = 0.0;
}
void Transfer::PrepareReference(EnvVars const& ev, std::vector<float>& buffer, int bufferIdx)
{
size_t N = buffer.size();
if (bufferIdx >= 0)
{
size_t patternLen = ev.fillPattern.size();
if (patternLen > 0)
{
for (size_t i = 0; i < N; ++i)
buffer[i] = ev.fillPattern[i % patternLen];
}
else
{
for (size_t i = 0; i < N; ++i)
buffer[i] = (i % 383 + 31) * (bufferIdx + 1);
}
}
else // Destination buffer
{
if (this->numSrcs == 0)
{
// Note: a buffer memset to byte value 75 (0x4B) reads back as floats equal to 13323083.0
memset(buffer.data(), MEMSET_CHAR, N * sizeof(float));
}
else
{
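// Destination reference is the element-wise sum of all source references; e.g. with the default
// pseudo-random pattern and 2 sources, element i is expected to be (i % 383 + 31) * (1 + 2)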
PrepareReference(ev, buffer, 0);
if (this->numSrcs > 1)
{
std::vector<float> temp(N);
for (int srcIdx = 1; srcIdx < this->numSrcs; ++srcIdx)
{
PrepareReference(ev, temp, srcIdx);
for (int i = 0; i < N; ++i)
{
buffer[i] += temp[i];
}
}
}
}
}
}
void Transfer::PrepareSrc(EnvVars const& ev)
{
if (this->numSrcs == 0) return;
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
std::vector<float> reference(N);
for (int srcIdx = 0; srcIdx < this->numSrcs; ++srcIdx)
{
PrepareReference(ev, reference, srcIdx);
HIP_CALL(hipMemcpy(this->srcMem[srcIdx] + initOffset, reference.data(), this->numBytesActual, hipMemcpyDefault));
}
}
void Transfer::ValidateDst(EnvVars const& ev)
{
if (this->numDsts == 0) return;
size_t const N = this->numBytesActual / sizeof(float);
int const initOffset = ev.byteOffset / sizeof(float);
std::vector<float> reference(N);
PrepareReference(ev, reference, -1);
std::vector<float> hostBuffer(N);
for (int dstIdx = 0; dstIdx < this->numDsts; ++dstIdx)
{
float* output;
if (IsCpuType(this->dstType[dstIdx]))
{
output = this->dstMem[dstIdx] + initOffset;
}
else
{
HIP_CALL(hipMemcpy(hostBuffer.data(), this->dstMem[dstIdx] + initOffset, this->numBytesActual, hipMemcpyDefault));
output = hostBuffer.data();
}
for (size_t i = 0; i < N; ++i)
{
if (reference[i] != output[i])
{
printf("\n[ERROR] Destination array %d value at index %lu (%.3f) does not match expected value (%.3f)\n",
dstIdx, i, output[i], reference[i]);
printf("[ERROR] Failed Transfer details: #%d: %s -> [%c%d:%d] -> %s\n",
this->transferIndex,
this->SrcToStr().c_str(),
ExeTypeStr[this->exeType], this->exeIndex,
this->numSubExecs,
this->DstToStr().c_str());
exit(1);
}
}
}
}
std::string Transfer::SrcToStr() const
{
if (numSrcs == 0) return "N";
std::stringstream ss;
for (int i = 0; i < numSrcs; ++i)
ss << MemTypeStr[srcType[i]] << srcIndex[i];
return ss.str();
}
std::string Transfer::DstToStr() const
{
if (numDsts == 0) return "N";
std::stringstream ss;
for (int i = 0; i < numDsts; ++i)
ss << MemTypeStr[dstType[i]] << dstIndex[i];
return ss.str();
}
// NOTE: This is a stop-gap solution until HIP provides wallclock values
int GetWallClockRate(int deviceId)
{
@@ -1385,27 +1472,27 @@ int GetWallClockRate(int deviceId)
return wallClockPerDeviceMhz[deviceId];
}
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numBlocksToUse, bool const isRandom)
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numGpuSubExecs, int const numCpuSubExecs, bool const isRandom)
{
ev.DisplaySweepEnvVars();
// Compute how many possible Transfers are permitted (unique SRC/EXE/DST triplets)
std::vector<std::pair<MemType, int>> exeList;
std::vector<std::pair<ExeType, int>> exeList;
for (auto exe : ev.sweepExe)
{
MemType const exeMemType = CharToMemType(exe);
if (IsGpuType(exeMemType))
ExeType const exeType = CharToExeType(exe);
if (IsGpuType(exeType))
{
for (int exeIndex = 0; exeIndex < ev.numGpuDevices; ++exeIndex)
exeList.push_back(std::make_pair(exeMemType, exeIndex));
exeList.push_back(std::make_pair(exeType, exeIndex));
}
else
else if (IsCpuType(exeType))
{
for (int exeIndex = 0; exeIndex < ev.numCpuDevices; ++exeIndex)
{
// Skip NUMA nodes that have no CPUs (e.g. CXL)
if (ev.numCpusPerNuma[exeIndex] == 0) continue;
exeList.push_back(std::make_pair(exeMemType, exeIndex));
exeList.push_back(std::make_pair(exeType, exeIndex));
}
}
}
@@ -1414,11 +1501,11 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
std::vector<std::pair<MemType, int>> srcList;
for (auto src : ev.sweepSrc)
{
MemType const srcMemType = CharToMemType(src);
int const numDevices = IsGpuType(srcMemType) ? ev.numGpuDevices : ev.numCpuDevices;
MemType const srcType = CharToMemType(src);
int const numDevices = IsGpuType(srcType) ? ev.numGpuDevices : ev.numCpuDevices;
for (int srcIndex = 0; srcIndex < numDevices; ++srcIndex)
srcList.push_back(std::make_pair(srcMemType, srcIndex));
srcList.push_back(std::make_pair(srcType, srcIndex));
}
int numSrcs = srcList.size();
@@ -1426,20 +1513,20 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
std::vector<std::pair<MemType, int>> dstList;
for (auto dst : ev.sweepDst)
{
MemType const dstMemType = CharToMemType(dst);
int const numDevices = IsGpuType(dstMemType) ? ev.numGpuDevices : ev.numCpuDevices;
MemType const dstType = CharToMemType(dst);
int const numDevices = IsGpuType(dstType) ? ev.numGpuDevices : ev.numCpuDevices;
for (int dstIndex = 0; dstIndex < numDevices; ++dstIndex)
dstList.push_back(std::make_pair(dstMemType, dstIndex));
dstList.push_back(std::make_pair(dstType, dstIndex));
}
int numDsts = dstList.size();
// Build array of possibilities, respecting any additional restrictions (e.g. XGMI hop count)
struct TransferInfo
{
MemType srcMemType; int srcIndex;
MemType exeMemType; int exeIndex;
MemType dstMemType; int dstIndex;
MemType srcType; int srcIndex;
ExeType exeType; int exeIndex;
MemType dstType; int dstIndex;
};
// If either XGMI minimum is non-zero, or XGMI maximum is specified and non-zero then both links must be XGMI
@@ -1451,7 +1538,7 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
{
// Skip CPU executors if XGMI link must be used
if (useXgmiOnly && !IsGpuType(exeList[i].first)) continue;
tinfo.exeMemType = exeList[i].first;
tinfo.exeType = exeList[i].first;
tinfo.exeIndex = exeList[i].second;
bool isXgmiSrc = false;
@@ -1463,8 +1550,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
if (exeList[i].second != srcList[j].second)
{
uint32_t exeToSrcLinkType, exeToSrcHopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, MEM_GPU),
RemappedIndex(srcList[j].second, MEM_GPU),
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, false),
RemappedIndex(srcList[j].second, false),
&exeToSrcLinkType,
&exeToSrcHopCount));
isXgmiSrc = (exeToSrcLinkType == HSA_AMD_LINK_INFO_TYPE_XGMI);
@@ -1484,7 +1571,7 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
}
else if (useXgmiOnly) continue;
tinfo.srcMemType = srcList[j].first;
tinfo.srcType = srcList[j].first;
tinfo.srcIndex = srcList[j].second;
bool isXgmiDst = false;
@@ -1496,8 +1583,8 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
if (exeList[i].second != dstList[k].second)
{
uint32_t exeToDstLinkType, exeToDstHopCount;
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, MEM_GPU),
RemappedIndex(dstList[k].second, MEM_GPU),
HIP_CALL(hipExtGetLinkTypeAndHopCount(RemappedIndex(exeList[i].second, false),
RemappedIndex(dstList[k].second, false),
&exeToDstLinkType,
&exeToDstHopCount));
isXgmiDst = (exeToDstLinkType == HSA_AMD_LINK_INFO_TYPE_XGMI);
@@ -1519,7 +1606,7 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
// Skip this DST if total XGMI distance (SRC + DST) is greater than max limit
if (ev.sweepXgmiMax >= 0 && (numHopsSrc + numHopsDst) > ev.sweepXgmiMax) continue;
tinfo.dstMemType = dstList[k].first;
tinfo.dstType = dstList[k].first;
tinfo.dstIndex = dstList[k].second;
possibleTransfers.push_back(tinfo);
@@ -1580,13 +1667,15 @@ void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int con
{
// Convert integer value to (SRC->EXE->DST) triplet
Transfer transfer;
transfer.srcMemType = possibleTransfers[value].srcMemType;
transfer.srcIndex = possibleTransfers[value].srcIndex;
transfer.exeMemType = possibleTransfers[value].exeMemType;
transfer.numSrcs = 1;
transfer.numDsts = 1;
transfer.srcType = {possibleTransfers[value].srcType};
transfer.srcIndex = {possibleTransfers[value].srcIndex};
transfer.exeType = possibleTransfers[value].exeType;
transfer.exeIndex = possibleTransfers[value].exeIndex;
transfer.dstMemType = possibleTransfers[value].dstMemType;
transfer.dstIndex = possibleTransfers[value].dstIndex;
transfer.numBlocksToUse = IsGpuType(transfer.exeMemType) ? numBlocksToUse : ev.numCpuPerTransfer;
transfer.dstType = {possibleTransfers[value].dstType};
transfer.dstIndex = {possibleTransfers[value].dstIndex};
transfer.numSubExecs = IsGpuType(transfer.exeType) ? numGpuSubExecs : numCpuSubExecs;
transfer.transferIndex = transfers.size();
transfer.numBytes = ev.sweepRandBytes ? randSize(*ev.generator) * sizeof(float) : 0;
transfers.push_back(transfer);
@@ -1636,12 +1725,23 @@ void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& tran
for (auto const& transfer : transfers)
{
fprintf(fp, " (%c%d->%c%d->%c%d %d %lu)",
MemTypeStr[transfer.srcMemType], transfer.srcIndex,
MemTypeStr[transfer.exeMemType], transfer.exeIndex,
MemTypeStr[transfer.dstMemType], transfer.dstIndex,
transfer.numBlocksToUse,
MemTypeStr[transfer.srcType[0]], transfer.srcIndex[0],
ExeTypeStr[transfer.exeType], transfer.exeIndex,
MemTypeStr[transfer.dstType[0]], transfer.dstIndex[0],
transfer.numSubExecs,
transfer.numBytes);
}
fprintf(fp, "\n");
fflush(fp);
}
std::string PtrVectorToStr(std::vector<float*> const& strVector, int const initOffset)
{
std::stringstream ss;
for (int i = 0; i < strVector.size(); ++i)
{
if (i) ss << " ";
ss << (strVector[i] + initOffset);
}
return ss.str();
}
/*
Copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
Copyright (c) 2019-2023 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -35,20 +35,20 @@ THE SOFTWARE.
#include <hip/hip_ext.h>
#include <hsa/hsa_ext_amd.h>
#include "EnvVars.hpp"
// Helper macro for catching HIP errors
#define HIP_CALL(cmd) \
do { \
hipError_t error = (cmd); \
if (error != hipSuccess) \
{ \
std::cerr << "Encountered HIP error (" << hipGetErrorString(error) << ") at line " \
<< __LINE__ << " in file " << __FILE__ << "\n"; \
std::cerr << "Encountered HIP error (" << hipGetErrorString(error) \
<< ") at line " << __LINE__ << " in file " << __FILE__ << "\n"; \
exit(-1); \
} \
} while (0)
#include "EnvVars.hpp"
// Simple configuration parameters
size_t const DEFAULT_BYTES_PER_TRANSFER = (1<<26); // Amount of data transferred per Transfer
@@ -59,92 +59,92 @@ typedef enum
MEM_GPU = 1, // Coarse-grained global GPU memory
MEM_CPU_FINE = 2, // Fine-grained pinned CPU memory
MEM_GPU_FINE = 3, // Fine-grained global GPU memory
MEM_CPU_UNPINNED = 4 // Unpinned CPU memory
MEM_CPU_UNPINNED = 4, // Unpinned CPU memory
MEM_NULL = 5, // NULL memory - used for empty
} MemType;
bool IsGpuType(MemType m)
{
return (m == MEM_GPU || m == MEM_GPU_FINE);
}
bool IsCpuType(MemType m)
typedef enum
{
return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED);
}
EXE_CPU = 0, // CPU executor (subExecutor = CPU thread)
EXE_GPU_GFX = 1, // GPU kernel-based executor (subExecutor = threadblock/CU)
EXE_GPU_DMA = 2, // GPU SDMA-based executor (subExecutor = streams)
} ExeType;
bool IsGpuType(MemType m) { return (m == MEM_GPU || m == MEM_GPU_FINE); }
bool IsCpuType(MemType m) { return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED); };
bool IsGpuType(ExeType e) { return (e == EXE_GPU_GFX || e == EXE_GPU_DMA); };
bool IsCpuType(ExeType e) { return (e == EXE_CPU); };
char const MemTypeStr[6] = "CGBFU";
char const MemTypeStr[7] = "CGBFUN";
char const ExeTypeStr[4] = "CGD";
char const ExeTypeName[3][4] = {"CPU", "GPU", "DMA"};
MemType inline CharToMemType(char const c)
{
switch (c)
{
case 'C': return MEM_CPU;
case 'G': return MEM_GPU;
case 'B': return MEM_CPU_FINE;
case 'F': return MEM_GPU_FINE;
case 'U': return MEM_CPU_UNPINNED;
default:
printf("[ERROR] Unexpected mem type (%c)\n", c);
char const* val = strchr(MemTypeStr, toupper(c));
if (*val) return (MemType)(val - MemTypeStr);
printf("[ERROR] Unexpected memory type (%c)\n", c);
exit(1);
}
}
typedef enum
{
MODE_FILL = 0, // Fill data with pattern
MODE_CHECK = 1 // Check data against pattern
} ModeType;
// Each threadblock copies N floats from src to dst
struct BlockParam
ExeType inline CharToExeType(char const c)
{
int N;
float* src;
float* dst;
long long startCycle;
long long stopCycle;
};
char const* val = strchr(ExeTypeStr, toupper(c));
if (val && *val) return (ExeType)(val - ExeTypeStr);
printf("[ERROR] Unexpected executor type (%c)\n", c);
exit(1);
}
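// For example, CharToMemType('F') returns MEM_GPU_FINE (index 3 in MemTypeStr) and
// CharToExeType('d') returns EXE_GPU_DMA (index 2 in ExeTypeStr)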
// Each Transfer is a uni-direction operation from a src memory to dst memory
// Each Transfer performs reads from source memory location(s), sums them (if multiple sources are specified)
// then writes the summation to each of the specified destination memory location(s)
struct Transfer
{
int transferIndex; // Transfer identifier
// Transfer config
MemType exeMemType; // Transfer executor type (CPU or GPU)
int transferIndex; // Transfer identifier (within a Test)
ExeType exeType; // Transfer executor type
int exeIndex; // Executor index (NUMA node for CPU / device ID for GPU)
MemType srcMemType; // Source memory type
int srcIndex; // Source device index
MemType dstMemType; // Destination memory type
int dstIndex; // Destination device index
int numBlocksToUse; // Number of threadblocks to use for this Transfer
size_t numBytes; // Number of bytes to Transfer
size_t numBytesToCopy; // Number of bytes to copy
// Memory
float* srcMem; // Source memory
float* dstMem; // Destination memory
// How memory is split across threadblocks / CPU cores
std::vector<BlockParam> blockParam;
BlockParam* blockParamGpuPtr;
int numSubExecs; // Number of subExecutors to use for this Transfer
size_t numBytes; // # of bytes requested to Transfer (may be 0 to fallback to default)
size_t numBytesActual; // Actual number of bytes to copy
double transferTime; // Time taken in milliseconds
// Results
double transferTime;
int numSrcs; // Number of sources
std::vector<MemType> srcType; // Source memory types
std::vector<int> srcIndex; // Source device indices
std::vector<float*> srcMem; // Source memory
// Prepares src memory and how to divide N elements across threadblocks/threads
void PrepareBlockParams(EnvVars const& ev, size_t const N);
};
int numDsts; // Number of destinations
std::vector<MemType> dstType; // Destination memory types
std::vector<int> dstIndex; // Destination device indices
std::vector<float*> dstMem; // Destination memory
std::vector<SubExecParam> subExecParam; // Defines subarrays assigned to each threadblock
SubExecParam* subExecParamGpuPtr; // Pointer to GPU copy of subExecParam
typedef std::pair<MemType, int> Executor;
// Prepares src/dst subarray pointers for each SubExecutor
void PrepareSubExecParams(EnvVars const& ev);
// Prepare source arrays with input data
void PrepareSrc(EnvVars const& ev);
// Validate that destination data contains expected results
void ValidateDst(EnvVars const& ev);
// Prepare reference buffers
void PrepareReference(EnvVars const& ev, std::vector<float>& buffer, int bufferIdx);
// String representation functions
std::string SrcToStr() const;
std::string DstToStr() const;
};
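// Minimal sketch (not part of the original source) of the element-wise work each SubExecutor
// performs over its assigned sub-array, matching the reduce-and-broadcast behavior described above.
// The field names (N, numSrcs, numDsts, src[], dst[]) follow SubExecParam as populated in
// PrepareSubExecParams; the helper name ReduceSketch is hypothetical.
inline void ReduceSketch(SubExecParam const& p)
{
for (size_t i = 0; i < p.N; ++i)
{
float sum = 0.0f;
for (int s = 0; s < p.numSrcs; ++s) sum += p.src[s][i]; // sum across all sources
for (int d = 0; d < p.numDsts; ++d) p.dst[d][i] = sum; // broadcast the sum to all destinations
}
}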
struct ExecutorInfo
{
std::vector<Transfer*> transfers; // Transfers to execute
size_t totalBytes; // Total bytes this executor transfers
int totalSubExecs; // Total number of subExecutors to use
// For GPU-Executors
int totalBlocks; // Total number of CUs/CPU threads to use
BlockParam* blockParamGpu; // Copy of block parameters in GPU device memory
SubExecParam* subExecParamGpu; // GPU copy of subExecutor parameters
std::vector<hipStream_t> streams;
std::vector<hipEvent_t> startEvents;
std::vector<hipEvent_t> stopEvents;
@@ -153,6 +153,7 @@ struct ExecutorInfo
double totalTime;
};
typedef std::pair<ExeType, int> Executor;
typedef std::map<Executor, ExecutorInfo> TransferMap;
// Display usage instructions
@@ -166,7 +167,9 @@ void PopulateTestSizes(size_t const numBytesPerTransfer, int const samplingFacto
std::vector<size_t>& valuesofN);
void ParseMemType(std::string const& token, int const numCpus, int const numGpus,
MemType* memType, int* memIndex);
std::vector<MemType>& memType, std::vector<int>& memIndex);
void ParseExeType(std::string const& token, int const numCpus, int const numGpus,
ExeType& exeType, int& exeIndex);
void ParseTransfers(char* line, int numCpus, int numGpus,
std::vector<Transfer>& transfers);
@@ -178,26 +181,19 @@ void EnablePeerAccess(int const deviceId, int const peerDeviceId);
void AllocateMemory(MemType memType, int devIndex, size_t numBytes, void** memPtr);
void DeallocateMemory(MemType memType, void* memPtr, size_t const size = 0);
void CheckPages(char* byteArray, size_t numBytes, int targetId);
void CheckOrFill(ModeType mode, int N, bool isMemset, bool isHipCall, std::vector<float> const& fillPattern, float* ptr);
void RunTransfer(EnvVars const& ev, int const iteration, ExecutorInfo& exeInfo, int const transferIdx);
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N, int numBlocksToUse, int readMode, int skipCpu);
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numBlocksToUse, bool const isRandom);
void RunPeerToPeerBenchmarks(EnvVars const& ev, size_t N);
void RunSweepPreset(EnvVars const& ev, size_t const numBytesPerTransfer, int const numGpuSubExec, int const numCpuSubExec, bool const isRandom);
// Return the maximum bandwidth measured for given (src/dst) pair
double GetPeakBandwidth(EnvVars const& ev,
size_t const N,
double GetPeakBandwidth(EnvVars const& ev, size_t const N,
int const isBidirectional,
int const readMode,
int const numBlocksToUse,
MemType const srcMemType,
int const srcIndex,
MemType const dstMemType,
int const dstIndex);
MemType const srcType, int const srcIndex,
MemType const dstType, int const dstIndex);
std::string GetLinkTypeDesc(uint32_t linkType, uint32_t hopCount);
std::string GetDesc(MemType srcMemType, int srcIndex,
MemType dstMemType, int dstIndex);
std::string GetTransferDesc(Transfer const& transfer);
int RemappedIndex(int const origIdx, MemType const memType);
int RemappedIndex(int const origIdx, bool const isCpuType);
int GetWallClockRate(int deviceId);
void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& transfers);
std::string PtrVectorToStr(std::vector<float*> const& strVector, int const initOffset);
# ConfigFile Format:
# ==================
# A Transfer is defined as a uni-directional copy from src memory location to dst memory location
# executed by either CPU or GPU
# A Transfer is defined as a single operation where an Executor reads and adds together
# values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
# This reduces to a simple copy operation when there is a single SRC and a single DST.
#
# SRC 0 DST 0
# SRC 1 -> Executor -> DST 1
# SRC X DST Y
# Three Executors are supported by TransferBench
# Executor: SubExecutor:
# 1) CPU CPU thread
# 2) GPU GPU threadblock/Compute Unit (CU)
# 3) DMA N/A (may only be used for copies, i.e. a single SRC and a single DST)
# Each line in the configuration file defines a set of Transfers (a Test) to be run in parallel
# There are two ways to specify a Test:
# 1) Basic
# The basic specification assumes the same number of threadblocks/CUs used per GPU-executed Transfer
# The basic specification assumes the same number of SubExecutors (SE) used per Transfer
# A positive number of Transfers is specified followed by that number of triplets describing each Transfer
# #Transfers #CUs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# #Transfers #SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# 2) Advanced
# A negative number of Transfers is specified, followed by quintuplets describing each Transfer
# A non-zero number of bytes specified will override any provided value
# -#Transfers (srcMem1->Executor1->dstMem1 #CUs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #CUsL BytesL)
# -#Transfers (srcMem1->Executor1->dstMem1 #SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #SEsL BytesL)
# Argument Details:
# #Transfers: Number of Transfers to be run in parallel
# #CUs : Number of threadblocks/CUs to use for a GPU-executed Transfer
# srcMemL : Source memory location (Where the data is to be read from). Ignored in memset mode
# #SEs : Number of SubExecutors to use (CPU threads / GPU threadblocks)
# srcMemL : Source memory locations (Where the data is to be read from)
# Executor : Executor is specified by a character indicating type, followed by device index (0-indexed)
# - C: CPU-executed (Indexed from 0 to # NUMA nodes - 1)
# - G: GPU-executed (Indexed from 0 to # GPUs - 1)
# dstMemL : Destination memory location (Where the data is to be written to)
# - D: DMA-executor (Indexed from 0 to # GPUs - 1)
# dstMemL : Destination memory locations (Where the data is to be written to)
# bytesL : Number of bytes to copy (0 means use command-line specified size)
# Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
#
# Memory locations are specified by a character indicating memory type,
# followed by device index (0-indexed)
# Memory locations are specified by one or more (device character / device index) pairs
# Character indicating memory type followed by device index (0-indexed)
# Supported memory locations are:
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# Examples:
# 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
# 1 4 (C1->G2->G0) Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
# 2 4 G0->G0->G1 G1->G1->G0 Copes from GPU0 to GPU1, and GPU1 to GPU0, each with 4 CUs
# -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 2 CUs
# 2 4 G0->G0->G1 G1->G1->G0 Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
# -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
# Round brackets and arrows '->' may be included for human clarity, but will be ignored and are unnecessary
# Lines starting with # will be ignored. Lines starting with ## will be echoed to output
# Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
## Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
1 4 (G0->G0->G1)
# Copies 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs
## Single DMA executed Transfer between GPUs 0 and 1
1 1 (G0->D0->G1)
## Copy 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
## "Memset" by GPU 0 to GPU 0 memory
1 32 (N0->G0->G0)
## "Read-only" by CPU 0
1 4 (C0->C0->N0)
## Broadcast from GPU 0 to GPU 0 and GPU 1
1 16 (G0->G0->G0G1)
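## Reduce (illustrative example following the multi-source format above): sum inputs from GPU 0 and GPU 1 into CPU 0
1 8 (G0G1->G0->C0)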