Unverified Commit 4e2be38c authored by gilbertlee-amd, committed by GitHub

Update rocm-rel-6.4 to use TransferBench v1.62.00 (#183)

* updating metadata

* v1.58.00 Fixing DMA copy-on-engine (#152)

* Leo's review

* Update use-transferbench.rst

* Update Doxyfile

* Refining API library

* Update TransferBench.hpp

* Update TransferBench.hpp

* Update TransferBench.hpp

* Bump rocm-docs-core from 1.9.2 to 1.11.0 in /docs/sphinx (#153)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.11.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.11.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#155)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* External CI: enable CI triggers

* Apply suggestions from code review
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Update LICENSE.md

* Bump rocm-docs-core from 1.12.0 to 1.13.0 in /docs/sphinx (#160)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench V1.59 (#162)

Adding NIC execution capabilities and fixing various bugs introduced by the header-only-library refactor
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Adding ability to specify A2A_MODE=numSrcs:numDsts (#164)
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Fixing specific DMA engine transfers, enabling GFX_SINGLE_TEAM=1 by default (#166)

* Bump rocm-docs-core from 1.13.0 to 1.15.0 in /docs/sphinx (#165)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.13.0 to 1.15.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.13.0...v1.15.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx (#171)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.15.0 to 1.17.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench v1.61 (#174)
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

* Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx (#176)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx (#177)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.18.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* TransferBench v1.62.00 (#181)

* Adding non-temporal loads and stores via GFX_TEMPORAL
* Adding additional summary details to a2a preset
* Add SHOW_MIN_ONLY for a2asweep preset
* Adding new P CPU memory type, which is indexed by the closest GPU (see the usage sketch below)
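
A minimal usage sketch of the new v1.62 knobs, based on the `ConfigOptions` and `Transfer` definitions in the diff below. The aggregate-initialization order of `MemDevice` and `ExeDevice` ({type, index}) is assumed here, and the call that actually executes the Transfers is omitted because it is not part of this change:

```cpp
#include "TransferBench.hpp"

int main()
{
  using namespace TransferBench;

  ConfigOptions cfg;
  cfg.gfx.temporalMode = 3;          // GFX_TEMPORAL: 0=none, 1=load, 2=store, 3=both
  cfg.gfx.wordSize     = 4;          // packed data size (dwordx4)

  // One GFX-executed copy from GPU 0 into 'P' (MEM_CPU_CLOSEST) memory,
  // i.e. pinned CPU memory on the NUMA node closest to GPU 0.
  Transfer t;
  t.numBytes    = size_t(64) << 20;          // 64 MiB
  t.srcs        = { {MEM_GPU, 0} };          // read from GPU 0 device memory
  t.dsts        = { {MEM_CPU_CLOSEST, 0} };  // write to closest pinned CPU memory
  t.exeDevice   = { EXE_GPU_GFX, 0 };        // execute with GPU 0 GFX kernel
  t.numSubExecs = 4;                         // threadblocks/CUs for this Transfer

  (void)cfg; (void)t;   // cfg and {t} would be handed to the library's run call
  return 0;
}
```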

* Bump rocm-docs-core from 1.18.2 to 1.20.0 in /docs/sphinx (#180)

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.2 to 1.20.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.20.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: srawat <120587655+SwRaw@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Daniel Su <danielsu@amd.com>
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
parent 3ea2f226
@@ -24,6 +24,8 @@ THE SOFTWARE.
// Included after EnvVars and Executors
#include "AllToAll.hpp"
#include "AllToAllN.hpp"
#include "AllToAllSweep.hpp"
#include "HealthCheck.hpp"
#include "OneToAll.hpp"
#include "PeerToPeer.hpp"
@@ -38,14 +40,16 @@ typedef void (*PresetFunc)(EnvVars& ev,
std::map<std::string, std::pair<PresetFunc, std::string>> presetFuncMap =
{
  {"a2a",         {AllToAllPreset,      "Tests parallel transfers between all pairs of GPU devices"}},
  {"a2a_n",       {AllToAllRdmaPreset,  "Tests parallel transfers between all pairs of GPU devices using Nearest NIC RDMA transfers"}},
  {"a2asweep",    {AllToAllSweepPreset, "Test GFX-based all-to-all transfers swept across different CU and GFX unroll counts"}},
  {"healthcheck", {HealthCheckPreset,   "Simple bandwidth health check (MI300X series only)"}},
  {"one2all",     {OneToAllPreset,      "Test all subsets of parallel transfers from one GPU to all others"}},
  {"p2p",         {PeerToPeerPreset,    "Peer-to-peer device memory bandwidth test"}},
  {"rsweep",      {SweepPreset,         "Randomly sweep through sets of Transfers"}},
  {"scaling",     {ScalingPreset,       "Run scaling test from one GPU to other devices"}},
  {"schmoo",      {SchmooPreset,        "Scaling tests for local/remote read/write/copy"}},
  {"sweep",       {SweepPreset,         "Ordered sweep through sets of Transfers"}},
};

void DisplayPresets()
...
@@ -22,19 +22,21 @@ THE SOFTWARE.
void LogTransfers(FILE *fp, int const testNum, std::vector<Transfer> const& transfers)
{
  if (fp) {
    fprintf(fp, "# Test %d\n", testNum);
    fprintf(fp, "%d", -1 * (int)transfers.size());
    for (auto const& transfer : transfers)
    {
      fprintf(fp, " (%s->%c%d->%s %d %lu)",
              MemDevicesToStr(transfer.srcs).c_str(),
              ExeTypeStr[transfer.exeDevice.exeType], transfer.exeDevice.exeIndex,
              MemDevicesToStr(transfer.dsts).c_str(),
              transfer.numSubExecs,
              transfer.numBytes);
    }
    fprintf(fp, "\n");
    fflush(fp);
  }
}

void SweepPreset(EnvVars& ev,
@@ -54,6 +56,7 @@ void SweepPreset(EnvVars& ev,
  int numGpuSubExecs    = EnvVars::GetEnvVar("NUM_GPU_SE"       , 4);
  std::string sweepDst  = EnvVars::GetEnvVar("SWEEP_DST"        , "CG");
  std::string sweepExe  = EnvVars::GetEnvVar("SWEEP_EXE"        , "CDG");
  std::string sweepFile = EnvVars::GetEnvVar("SWEEP_FILE"       , "/tmp/lastSweep.cfg");
  int sweepMax          = EnvVars::GetEnvVar("SWEEP_MAX"        , 24);
  int sweepMin          = EnvVars::GetEnvVar("SWEEP_MIN"        , 1);
  int sweepRandBytes    = EnvVars::GetEnvVar("SWEEP_RAND_BYTES" , 0);
@@ -78,6 +81,7 @@ void SweepPreset(EnvVars& ev,
  ev.Print("NUM_GPU_SE", numGpuSubExecs, "Using %d subExecutors/CUs per GPU executed Transfer", numGpuSubExecs);
  ev.Print("SWEEP_DST", sweepDst.c_str(), "Destination Memory Types to sweep");
  ev.Print("SWEEP_EXE", sweepExe.c_str(), "Executor Types to sweep");
  ev.Print("SWEEP_FILE", sweepFile.c_str(), "File to store the executing sweep configuration");
  ev.Print("SWEEP_MAX", sweepMax, "Max simultaneous transfers (0 = no limit)");
  ev.Print("SWEEP_MIN", sweepMin, "Min simultaenous transfers");
  ev.Print("SWEEP_RAND_BYTES", sweepRandBytes, "Using %s number of bytes per Transfer", (sweepRandBytes ? "random" : "constant"));
@@ -283,10 +287,14 @@ void SweepPreset(EnvVars& ev,
  std::uniform_int_distribution<int> distribution(sweepMin, maxParallelTransfers);

  // Log sweep to configuration file
  char absPath[1024];
  auto const res = realpath(sweepFile.c_str(), absPath);
  FILE *fp = fopen(sweepFile.c_str(), "w");
  if (!fp) {
    printf("[WARN] Unable to open %s. Skipping output of sweep configuration file\n", res ? absPath : sweepFile.c_str());
  } else {
    printf("Sweep configuration saved to: %s\n", res ? absPath : sweepFile.c_str());
  }

  // Create bitmask of numPossible triplets, of which M will be chosen
@@ -333,7 +341,7 @@ void SweepPreset(EnvVars& ev,
    // Check for test limit
    if (numTestsRun == sweepTestLimit) {
      printf("Sweep Test limit reached\n");
      break;
    }
@@ -341,7 +349,7 @@ void SweepPreset(EnvVars& ev,
    auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
    double totalCpuTime = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count();
    if (sweepTimeLimit && totalCpuTime > sweepTimeLimit) {
      printf("Sweep Time limit exceeded\n");
      break;
    }
@@ -357,5 +365,5 @@ void SweepPreset(EnvVars& ev,
      bitmask[i] = (i < M) ? 1 : 0;
    }
  }
  if (fp) fclose(fp);
}
@@ -38,21 +38,56 @@ static int RemappedCpuIndex(int origIdx)
  return remappingCpu[origIdx];
}
static void PrintNicToGPUTopo(bool outputToCsv)
{
#ifdef NIC_EXEC_ENABLED
printf(" NIC | Device Name | Active | PCIe Bus ID | NUMA | Closest GPU(s) | GID Index | GID Descriptor\n");
if(!outputToCsv)
printf("-----+-------------+--------+--------------+------+----------------+-----------+-------------------\n");
int numGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
auto const& ibvDeviceList = GetIbvDeviceList();
for (int i = 0; i < ibvDeviceList.size(); i++) {
std::string closestGpusStr = "";
for (int j = 0; j < numGpus; j++) {
if (TransferBench::GetClosestNicToGpu(j) == i) {
if (closestGpusStr != "") closestGpusStr += ",";
closestGpusStr += std::to_string(j);
}
}
printf(" %-3d | %-11s | %-6s | %-12s | %-4d | %-14s | %-9s | %-20s\n",
i, ibvDeviceList[i].name.c_str(),
ibvDeviceList[i].hasActivePort ? "Yes" : "No",
ibvDeviceList[i].busId.c_str(),
ibvDeviceList[i].numaNode,
closestGpusStr.c_str(),
ibvDeviceList[i].isRoce && ibvDeviceList[i].hasActivePort? std::to_string(ibvDeviceList[i].gidIndex).c_str() : "N/A",
ibvDeviceList[i].isRoce && ibvDeviceList[i].hasActivePort? ibvDeviceList[i].gidDescriptor.c_str() : "N/A"
);
}
printf("\n");
#endif
}
void DisplayTopology(bool outputToCsv)
{
  int numCpus = TransferBench::GetNumExecutors(EXE_CPU);
  int numGpus = TransferBench::GetNumExecutors(EXE_GPU_GFX);
  int numNics = TransferBench::GetNumExecutors(EXE_NIC);
  char sep = (outputToCsv ? ',' : '|');

  if (outputToCsv) {
    printf("NumCpus,%d\n", numCpus);
    printf("NumGpus,%d\n", numGpus);
    printf("NumNics,%d\n", numNics);
  } else {
    printf("\nDetected Topology:\n");
    printf("==================\n");
    printf(" %d configured CPU NUMA node(s) [%d total]\n", numCpus, numa_max_node() + 1);
    printf(" %d GPU device(s)\n", numGpus);
    printf(" %d Supported NIC device(s)\n", numNics);
  }

  // Print out detected CPU topology
@@ -91,8 +126,10 @@ void DisplayTopology(bool outputToCsv)
  }
  printf("\n");

  // Print out detected NIC topology
  PrintNicToGPUTopo(outputToCsv);

  // Print out detected GPU topology
#if defined(__NVCC__)
  for (int i = 0; i < numGpus; i++) {
    hipDeviceProp_t prop;
@@ -118,12 +155,12 @@ void DisplayTopology(bool outputToCsv)
  printf(" %c", sep);
  for (int j = 0; j < numGpus; j++)
    printf(" GPU %02d %c", j, sep);
  printf(" PCIe Bus ID %c #CUs %c NUMA %c #DMA %c #XCC %c NIC\n", sep, sep, sep, sep, sep);
  if (!outputToCsv) {
    for (int j = 0; j <= numGpus; j++)
      printf("--------+");
    printf("--------------+------+------+------+------+------\n");
  }

  // Loop over each GPU device
@@ -149,12 +186,13 @@ void DisplayTopology(bool outputToCsv)
    char pciBusId[20];
    HIP_CALL(hipDeviceGetPCIBusId(pciBusId, 20, i));
    printf(" %-11s %c %-4d %c %-4d %c %-4d %c %-4d %c %-4d\n",
           pciBusId, sep,
           TransferBench::GetNumSubExecutors({EXE_GPU_GFX, i}), sep,
           TransferBench::GetClosestCpuNumaToGpu(i), sep,
           TransferBench::GetNumExecutorSubIndices({EXE_GPU_DMA, i}), sep,
           TransferBench::GetNumExecutorSubIndices({EXE_GPU_GFX, i}), sep,
           TransferBench::GetClosestNicToGpu(i));
  }
#endif
}
@@ -20,12 +20,15 @@ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/

/// @cond
#pragma once
#include <algorithm>
#include <cstring>
#include <future>
#include <map>
#include <numa.h>    // If not found, try installing libnuma-dev (e.g apt-get install libnuma-dev)
#include <numaif.h>
#include <random>
#include <set>
#include <sstream>
#include <stdarg.h>
@@ -33,6 +36,19 @@ THE SOFTWARE.
#include <unistd.h>
#include <vector>
#ifdef NIC_EXEC_ENABLED
#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>
#include <filesystem>
#include <fstream>
#endif
#if defined(__NVCC__)
#include <cuda_runtime.h>
#else
@@ -41,6 +57,7 @@ THE SOFTWARE.
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#endif
/// @endcond

namespace TransferBench
{
@@ -49,24 +66,25 @@ namespace TransferBench
  using std::set;
  using std::vector;

  constexpr char VERSION[] = "1.62";

  /**
   * Enumeration of supported Executor types
   *
   * @note The Executor is the device used to perform a Transfer
   */
  enum ExeType
  {
    EXE_CPU         = 0,  ///< CPU executor             (subExecutor = CPU thread)
    EXE_GPU_GFX     = 1,  ///< GPU kernel-based executor (subExecutor = threadblock/CU)
    EXE_GPU_DMA     = 2,  ///< GPU SDMA executor         (subExecutor = not supported)
    EXE_NIC         = 3,  ///< NIC RDMA executor         (subExecutor = queue pair)
    EXE_NIC_NEAREST = 4   ///< NIC RDMA nearest executor (subExecutor = queue pair)
  };
  char const ExeTypeStr[6] = "CGDIN";
  inline bool IsCpuExeType(ExeType e){ return e == EXE_CPU; }
  inline bool IsGpuExeType(ExeType e){ return e == EXE_GPU_GFX || e == EXE_GPU_DMA; }
  inline bool IsNicExeType(ExeType e){ return e == EXE_NIC || e == EXE_NIC_NEAREST; }

  /**
   * A ExeDevice defines a specific Executor
@@ -94,10 +112,11 @@ namespace TransferBench
    MEM_GPU_FINE     = 3,  ///< Fine-grained global GPU memory
    MEM_CPU_UNPINNED = 4,  ///< Unpinned CPU memory
    MEM_NULL         = 5,  ///< NULL memory - used for empty
    MEM_MANAGED      = 6,  ///< Managed memory
    MEM_CPU_CLOSEST  = 7,  ///< Coarse-grained pinned CPU memory indexed by closest GPU
  };
  char const MemTypeStr[9] = "CGBFUNMP";
  inline bool IsCpuMemType(MemType m) { return (m == MEM_CPU || m == MEM_CPU_FINE || m == MEM_CPU_UNPINNED || m == MEM_CPU_CLOSEST); }
  inline bool IsGpuMemType(MemType m) { return (m == MEM_GPU || m == MEM_GPU_FINE || m == MEM_MANAGED); }

  /**
@@ -118,11 +137,10 @@ namespace TransferBench
   */
  struct Transfer
  {
    size_t            numBytes    = 0;    ///< Number of bytes to Transfer
    vector<MemDevice> srcs        = {};   ///< List of source memory devices
    vector<MemDevice> dsts        = {};   ///< List of destination memory devices
    ExeDevice         exeDevice   = {};   ///< Executor to use
    int32_t           exeSubIndex = -1;   ///< Executor subindex
    int               numSubExecs = 0;    ///< Number of subExecutors to use for this Transfer
  };
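
  // Illustrative sketch only, not part of this diff: building a Transfer for the new
  // NIC executor. Per the validation code later in this header, for EXE_NIC the
  // executor index selects the source NIC and exeSubIndex selects the destination NIC.
  // Aggregate-initialization order of MemDevice/ExeDevice ({type, index}) is assumed.
  inline Transfer ExampleNicTransferSketch()
  {
    Transfer t;
    t.numBytes    = 1 << 26;            // 64 MiB moved per Transfer
    t.srcs        = { {MEM_GPU, 0} };   // read from GPU 0 device memory
    t.dsts        = { {MEM_GPU, 1} };   // write to GPU 1 device memory
    t.exeDevice   = { EXE_NIC, 0 };     // source NIC index 0
    t.exeSubIndex = 2;                  // destination NIC index 2
    t.numSubExecs = 4;                  // number of queue pairs (subExecutors)
    return t;
  }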
@@ -132,8 +150,8 @@ namespace TransferBench
   */
  struct GeneralOptions
  {
    int numIterations      = 10;   ///< Number of timed iterations to perform. If negative, run for -numIterations seconds instead
    int numSubIterations   = 1;    ///< Number of sub-iterations per iteration
    int numWarmups         = 3;    ///< Number of un-timed warmup iterations to perform
    int recordPerIteration = 0;    ///< Record per-iteration timing information
    int useInteractive     = 0;    ///< Pause for user-input before starting transfer loop
@@ -152,30 +170,51 @@ namespace TransferBench
    int validateSource     = 0;    ///< Validate src GPU memory immediately after preparation
  };

  /**
   * GFX Executor options
   */
  struct GfxOptions
  {
    int blockOrder = 0;                     ///< Determines how threadblocks are ordered (0=sequential, 1=interleaved, 2=random)
    int blockSize  = 256;                   ///< Size of each threadblock (must be multiple of 64)
    vector<uint32_t> cuMask = {};           ///< Bit-vector representing the CU mask
    vector<vector<int>> prefXccTable = {};  ///< 2D table with preferred XCD to use for a specific [src][dst] GPU device
    int temporalMode   = 0;                 ///< Non-temporal load/store mode 0=none, 1=load, 2=store, 3=both
    int unrollFactor   = 4;                 ///< GFX-kernel unroll factor
    int useHipEvents   = 1;                 ///< Use HIP events for timing GFX Executor
    int useMultiStream = 0;                 ///< Use multiple streams for GFX
    int useSingleTeam  = 0;                 ///< Team all subExecutors across the data array
    int waveOrder      = 0;                 ///< GFX-kernel wavefront ordering
    int wordSize       = 4;                 ///< GFX-kernel packed data size (4=dwordx4, 2=dwordx2, 1=dwordx1)
  };
/**
* DMA Executor options
*/
struct DmaOptions
{
int useHipEvents = 1; ///< Use HIP events for timing DMA Executor
int useHsaCopy = 0; ///< Use HSA copy instead of HIP copy to perform DMA
};
/**
* NIC Executor options
*/
struct NicOptions
{
vector<int> closestNics = {}; ///< Overrides the auto-detected closest NIC per GPU
int ibGidIndex = -1; ///< GID Index for RoCE NICs (-1 is auto)
uint8_t ibPort = 1; ///< NIC port number to be used
int ipAddressFamily = 4; ///< 4=IPv4, 6=IPv6 (used for auto GID detection)
int maxRecvWorkReq = 16; ///< Maximum number of recv work requests per queue pair
int maxSendWorkReq = 16; ///< Maximum number of send work requests per queue pair
int queueSize = 100; ///< Completion queue size
int roceVersion = 2; ///< RoCE version (used for auto GID detection)
int useRelaxedOrder = 1; ///< Use relaxed ordering
int useNuma = 0; ///< Switch to closest numa thread for execution
};
  /**
   * Configuration options for performing Transfers
   */
@@ -186,6 +225,7 @@ namespace TransferBench
    GfxOptions gfx;   ///< GFX executor options
    DmaOptions dma;   ///< DMA executor options
    NicOptions nic;   ///< NIC executor options
  };
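
  // Illustrative sketch only, not part of this diff: composing ConfigOptions with the
  // new GFX and NIC fields defined above. Values shown are the defaults documented in
  // the struct definitions or plausible overrides; nothing here alters library behavior.
  inline ConfigOptions ExampleConfigSketch()
  {
    ConfigOptions cfg;
    cfg.gfx.blockOrder     = 1;    // interleaved threadblock ordering
    cfg.gfx.temporalMode   = 1;    // non-temporal loads only
    cfg.nic.ibGidIndex     = -1;   // auto-detect GID index for RoCE NICs
    cfg.nic.ibPort         = 1;    // NIC port number to use
    cfg.nic.maxSendWorkReq = 16;   // send work requests per queue pair
    return cfg;
  }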
  /**
@@ -198,6 +238,31 @@ namespace TransferBench
    ERR_FATAL = 2,   ///< Fatal error - results are invalid
  };
/**
* Enumeration of GID priority
*
* @note These are the GID types ordered in priority from lowest (0) to highest
*/
enum GidPriority
{
UNKNOWN = -1, ///< Default
ROCEV1_LINK_LOCAL = 0, ///< RoCEv1 Link-local
ROCEV2_LINK_LOCAL = 1, ///< RoCEv2 Link-local fe80::/10
ROCEV1_IPV6 = 2, ///< RoCEv1 IPv6
ROCEV2_IPV6 = 3, ///< RoCEv2 IPv6
ROCEV1_IPV4 = 4, ///< RoCEv1 IPv4-mapped IPv6
ROCEV2_IPV4 = 5, ///< RoCEv2 IPv4-mapped IPv6 ::ffff:192.168.x.x
};
const char* GidPriorityStr[] = {
"RoCEv1 Link-local",
"RoCEv2 Link-local",
"RoCEv1 IPv6",
"RoCEv2 IPv6",
"RoCEv1 IPv4-mapped IPv6",
"RoCEv2 IPv4-mapped IPv6"
};
  /**
   * ErrResult consists of error type and error message
   */
@@ -241,6 +306,9 @@ namespace TransferBench
    // Only filled in if recordPerIteration = 1
    vector<double> perIterMsec;              ///< Duration for each individual iteration
    vector<set<pair<int,int>>> perIterCUs;   ///< GFX-Executor only. XCC:CU used per iteration
ExeDevice exeDevice; ///< Tracks which executor performed this Transfer (e.g. for EXE_NIC_NEAREST)
ExeDevice exeDstDevice; ///< Tracks actual destination executor (only valid for EXE_NIC/EXE_NIC_NEAREST)
  };

  /**
@@ -342,6 +410,23 @@ namespace TransferBench
   */
  int GetClosestCpuNumaToGpu(int gpuIndex);
/**
* Returns the index of the NUMA node closest to the given NIC
*
* @param[in] nicIndex Index of the NIC to query
* @returns NUMA node index closest to the NIC nicIndex, or -1 if unable to detect
*/
int GetClosestCpuNumaToNic(int nicIndex);
/**
* Returns the index of the NIC closest to the given GPU
*
* @param[in] gpuIndex Index of the GPU to query
* @note This function is applicable when the IBV/RDMA executor is available
* @returns IB Verbs capable NIC index closest to GPU gpuIndex, or -1 if unable to detect
*/
int GetClosestNicToGpu(int gpuIndex);
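
  // Illustrative sketch only, not part of this diff: pairing each GPU with its
  // closest NIC and NUMA node using the helpers declared above (both return -1
  // when detection fails). Assumes <cstdio> is available for printf.
  inline void ExampleTopologySketch()
  {
    int numGpus = GetNumExecutors(EXE_GPU_GFX);
    for (int gpu = 0; gpu < numGpus; ++gpu) {
      int nic  = GetClosestNicToGpu(gpu);
      int numa = GetClosestCpuNumaToGpu(gpu);
      printf("GPU %d: closest NIC %d, closest NUMA node %d\n", gpu, nic, numa);
    }
  }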
  /**
   * Helper function to parse a line containing Transfers into a vector of Transfers
   *
@@ -351,7 +436,6 @@ namespace TransferBench
   */
  ErrResult ParseTransfers(std::string str,
                           std::vector<Transfer>& transfers);
};
//==========================================================================================
// End of TransferBench API
@@ -409,6 +493,14 @@ namespace TransferBench
  #define hipStreamDestroy     cudaStreamDestroy
  #define hipStreamSynchronize cudaStreamSynchronize
// Define float2 addition operator for NVIDIA platform
__device__ inline float2& operator +=(float2& a, const float2& b)
{
a.x += b.x;
a.y += b.y;
return a;
}
  // Define float4 addition operator for NVIDIA platform
  __device__ inline float4& operator +=(float4& a, const float4& b)
  {
@@ -433,7 +525,7 @@ namespace TransferBench
#endif

// Macro for collecting XCC GFX kernel is running on
#if defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__) || defined(__gfx950__)
#define GetXccId(val) asm volatile ("s_getreg_b32 %0, hwreg(HW_REG_XCC_ID)" : "=s" (val));
#else
#define GetXccId(val) val = 0
@@ -457,21 +549,63 @@ namespace TransferBench
    return false;                                               \
  } while (0)
// Helper macros for calling RDMA functions and reporting errors
#ifdef VERBS_DEBUG
#define IBV_CALL(__func__, ...) \
do { \
int error = __func__(__VA_ARGS__); \
if (error != 0) { \
return {ERR_FATAL, "Encountered IbVerbs error (%d) at line (%d) " \
"and function (%s)", (error), __LINE__, #__func__}; \
} \
} while (0)
#define IBV_PTR_CALL(__ptr__, __func__, ...) \
do { \
__ptr__ = __func__(__VA_ARGS__); \
if (__ptr__ == nullptr) { \
return {ERR_FATAL, "Encountered IbVerbs nullptr error at line (%d) " \
"and function (%s)", __LINE__, #__func__}; \
} \
} while (0)
#else
#define IBV_CALL(__func__, ...) \
do { \
int error = __func__(__VA_ARGS__); \
if (error != 0) { \
return {ERR_FATAL, "Encountered IbVerbs error (%d) in func (%s) " \
, error, #__func__}; \
} \
} while (0)
#define IBV_PTR_CALL(__ptr__, __func__, ...) \
do { \
__ptr__ = __func__(__VA_ARGS__); \
if (__ptr__ == nullptr) { \
return {ERR_FATAL, "Encountered IbVerbs nullptr error in func (%s) " \
, #__func__}; \
} \
} while (0)
#endif
namespace TransferBench
{
/// @cond
// Helper functions ('hidden' in anonymous namespace)
//========================================================================================
namespace {

// Constants
//========================================================================================
  int constexpr MAX_BLOCKSIZE  = 512;                // Max threadblock size
  int constexpr MAX_WAVEGROUPS = MAX_BLOCKSIZE / 64; // Max wavegroups/warps
  int constexpr MAX_UNROLL     = 8;                  // Max unroll factor
  int constexpr MAX_SRCS       = 8;                  // Max srcs per Transfer
  int constexpr MAX_DSTS       = 8;                  // Max dsts per Transfer
  int constexpr MEMSET_CHAR    = 75;                 // Value to memset (char)
  float constexpr MEMSET_VAL   = 13323083.0f;        // Value to memset (double)

// Parsing-related functions
//========================================================================================
@@ -590,7 +724,7 @@ namespace {
    if (mistakeCount > 0) {
      return {ERR_FATAL,
              "%lu out of %lu pages for memory allocation were not on NUMA node %d."
              " This could be due to hardware memory issues, or the use of numa-rebalancing daemons such as numad",
              mistakeCount, numPages, targetId};
    }
    return ERR_NONE;
@@ -607,8 +741,14 @@ namespace {
    MemType const& memType = memDevice.memType;

    if (IsCpuMemType(memType)) {
      // Determine which NUMA device to use
      int numaIdx = memDevice.memIndex;
if (memType == MEM_CPU_CLOSEST) {
numaIdx = GetClosestCpuNumaToGpu(memDevice.memIndex);
}
// Set NUMA policy prior to call to hipHostMalloc
numa_set_preferred(numaIdx);
      // Allocate host-pinned memory (should respect NUMA mem policy)
      if (memType == MEM_CPU_FINE) {
@@ -617,19 +757,19 @@ namespace {
#else
        ERR_CHECK(hipHostMalloc((void **)memPtr, numBytes, hipHostMallocNumaUser));
#endif
      } else if (memType == MEM_CPU || memType == MEM_CPU_CLOSEST) {
#if defined (__NVCC__)
        ERR_CHECK(hipHostMalloc((void **)memPtr, numBytes, 0));
#else
        ERR_CHECK(hipHostMalloc((void **)memPtr, numBytes, hipHostMallocNumaUser | hipHostMallocNonCoherent));
#endif
      } else if (memType == MEM_CPU_UNPINNED) {
        *memPtr = numa_alloc_onnode(numBytes, numaIdx);
      }

      // Check that the allocated pages are actually on the correct NUMA node
      memset(*memPtr, 0, numBytes);
      ERR_CHECK(CheckPages((char*)*memPtr, numBytes, numaIdx));

      // Reset to default numa mem policy
      numa_set_preferred(-1);
@@ -668,7 +808,7 @@ namespace {
      return {ERR_FATAL, "Attempted to free null pointer for %lu bytes", bytes};

    switch (memType) {
    case MEM_CPU: case MEM_CPU_FINE: case MEM_CPU_CLOSEST:
    {
      ERR_CHECK(hipHostFree(memPtr));
      break;
@@ -740,6 +880,7 @@ namespace {
      break;
    case EXE_GPU_GFX: case EXE_GPU_DMA:
      if (exeIndex < 0 || exeIndex >= numGpus)
        return {ERR_FATAL, "GPU index must be between 0 and %d inclusively", numGpus - 1};
      agent = gpuAgents[exeIndex];
      break;
@@ -765,13 +906,36 @@ namespace {
// Setup validation-related functions
//========================================================================================
static ErrResult GetActualExecutor(ConfigOptions const& cfg,
ExeDevice const& origExeDevice,
ExeDevice& actualExeDevice)
{
// By default, nothing needs to change
actualExeDevice = origExeDevice;
// When using NIC_NEAREST, remap to the closest NIC to the GPU
if (origExeDevice.exeType == EXE_NIC_NEAREST) {
actualExeDevice.exeType = EXE_NIC;
if (cfg.nic.closestNics.size() > 0) {
if (origExeDevice.exeIndex < 0 || origExeDevice.exeIndex >= cfg.nic.closestNics.size())
return {ERR_FATAL, "NIC index is out of range (%d)", origExeDevice.exeIndex};
actualExeDevice.exeIndex = cfg.nic.closestNics[origExeDevice.exeIndex];
} else {
actualExeDevice.exeIndex = GetClosestNicToGpu(origExeDevice.exeIndex);
}
}
return ERR_NONE;
}
  // Validate that MemDevice exists
  static ErrResult CheckMemDevice(MemDevice const& memDevice)
  {
    if (memDevice.memType == MEM_NULL)
      return ERR_NONE;

    if (IsCpuMemType(memDevice.memType) && memDevice.memType != MEM_CPU_CLOSEST) {
      int numCpus = GetNumExecutors(EXE_CPU);
      if (memDevice.memIndex < 0 || memDevice.memIndex >= numCpus)
        return {ERR_FATAL,
@@ -779,11 +943,16 @@ namespace {
      return ERR_NONE;
    }

    if (IsGpuMemType(memDevice.memType) || memDevice.memType == MEM_CPU_CLOSEST) {
      int numGpus = GetNumExecutors(EXE_GPU_GFX);
      if (memDevice.memIndex < 0 || memDevice.memIndex >= numGpus)
        return {ERR_FATAL,
                "GPU index must be between 0 and %d (instead of %d)", numGpus - 1, memDevice.memIndex};
if (memDevice.memType == MEM_CPU_CLOSEST) {
if (GetClosestCpuNumaToGpu(memDevice.memIndex) == -1) {
return {ERR_FATAL, "Unable to determine closest NUMA node for GPU %d", memDevice.memIndex};
}
}
      return ERR_NONE;
    }
    return {ERR_FATAL, "Unsupported memory type (%d)", memDevice.memType};
@@ -804,12 +973,29 @@ namespace {
      errors.push_back({ERR_FATAL, "[data.byteOffset] must be positive multiple of %lu", sizeof(float)});

    // Check GFX options
if (cfg.gfx.blockOrder < 0 || cfg.gfx.blockOrder > 2)
errors.push_back({ERR_FATAL,
"[gfx.blockOrder] must be 0 for sequential, 1 for interleaved, or 2 for random"});
if (cfg.gfx.useMultiStream && cfg.gfx.blockOrder > 0)
errors.push_back({ERR_WARN, "[gfx.blockOrder] will be ignored when running in multi-stream mode"});
    int gfxMaxBlockSize = GetIntAttribute(ATR_GFX_MAX_BLOCKSIZE);
    if (cfg.gfx.blockSize < 0 || cfg.gfx.blockSize % 64 || cfg.gfx.blockSize > gfxMaxBlockSize)
      errors.push_back({ERR_FATAL,
                        "[gfx.blockSize] must be positive multiple of 64 less than or equal to %d",
                        gfxMaxBlockSize});
if (cfg.gfx.temporalMode < 0 || cfg.gfx.temporalMode > 3)
errors.push_back({ERR_FATAL,
"[gfx.temporalMode] must be non-negative and less than or equal to 3"});
#if defined(__NVCC__)
if (cfg.gfx.temporalMode > 0)
errors.push_back({ERR_FATAL,
"[gfx.temporalMode] is not supported on NVIDIA hardware"});
#endif
    int gfxMaxUnroll = GetIntAttribute(ATR_GFX_MAX_UNROLL);
    if (cfg.gfx.unrollFactor < 0 || cfg.gfx.unrollFactor > gfxMaxUnroll)
      errors.push_back({ERR_FATAL,
@@ -819,6 +1005,9 @@ namespace {
      errors.push_back({ERR_FATAL,
                        "[gfx.waveOrder] must be non-negative and less than 6"});
if (!(cfg.gfx.wordSize == 1 || cfg.gfx.wordSize == 2 || cfg.gfx.wordSize == 4))
errors.push_back({ERR_FATAL, "[gfx.wordSize] must be either 1, 2 or 4"});
    int numGpus = GetNumExecutors(EXE_GPU_GFX);
    int numXccs = GetNumExecutorSubIndices({EXE_GPU_GFX, 0});
    vector<vector<int>> const& table = cfg.gfx.prefXccTable;
@@ -844,6 +1033,20 @@ namespace {
      }
    }
// Check NIC options
#ifdef NIC_EXEC_ENABLED
int numNics = GetNumExecutors(EXE_NIC);
for (auto const& nic : cfg.nic.closestNics)
if (nic < 0 || nic >= numNics)
errors.push_back({ERR_FATAL, "NIC index (%d) in user-specified closest NIC list must be between 0 and %d",
nic, numNics - 1});
size_t closetNicsSize = cfg.nic.closestNics.size();
if (closetNicsSize > 0 && closetNicsSize < numGpus)
errors.push_back({ERR_FATAL, "User-specified closest NIC list must match GPU count of %d",
numGpus});
#endif
    // NVIDIA specific
#if defined(__NVCC__)
    if (cfg.data.validateDirect)
@@ -877,6 +1080,7 @@ namespace {
  {
    int numCpus = GetNumExecutors(EXE_CPU);
    int numGpus = GetNumExecutors(EXE_GPU_GFX);
int numNics = GetNumExecutors(EXE_NIC);
    std::set<ExeDevice> executors;
    std::map<ExeDevice, int> transferCount;
@@ -1002,7 +1206,7 @@ namespace {
#endif
      }

      if (!IsGpuMemType(t.srcs[0].memType) && !IsGpuMemType(t.dsts[0].memType)) {
        errors.push_back({ERR_WARN,
                          "Transfer %d: No GPU memory for source or destination. Copy might not execute on DMA %d",
                          i, t.exeDevice.exeIndex});
@@ -1021,8 +1225,33 @@ namespace {
        }
      }
      break;
    case EXE_NIC:
#ifdef NIC_EXEC_ENABLED
{
int srcIndex = t.exeDevice.exeIndex;
int dstIndex = t.exeSubIndex;
if (srcIndex < 0 || srcIndex >= numNics)
errors.push_back({ERR_FATAL, "Transfer %d: src NIC executor indexes an out-of-range NIC (%d)", i, srcIndex});
if (dstIndex < 0 || dstIndex >= numNics)
errors.push_back({ERR_FATAL, "Transfer %d: dst NIC executor indexes an out-of-range NIC (%d)", i, dstIndex});
}
#else
errors.push_back({ERR_FATAL, "Transfer %d: NIC executor is requested but is not available", i});
#endif
break;
case EXE_NIC_NEAREST:
#ifdef NIC_EXEC_ENABLED
{
ExeDevice srcExeDevice;
ErrResult errSrc = GetActualExecutor(cfg, t.exeDevice, srcExeDevice);
if (errSrc.errType != ERR_NONE) errors.push_back(errSrc);
ExeDevice dstExeDevice;
ErrResult errDst = GetActualExecutor(cfg, {t.exeDevice.exeType, t.exeSubIndex}, dstExeDevice);
if (errDst.errType != ERR_NONE) errors.push_back(errDst);
}
#else
errors.push_back({ERR_FATAL, "Transfer %d: NIC executor is requested but is not available", i});
#endif
      break;
    }
@@ -1102,7 +1331,6 @@ namespace {
    }
  }

  // Check for fatal errors
  for (auto const& err : errors)
    if (err.errType == ERR_FATAL) return true;
@@ -1143,6 +1371,7 @@ namespace {
    vector<float*> dstMem;                  ///< Destination memory
    vector<SubExecParam> subExecParamCpu;   ///< Defines subarrays for each subexecutor
    vector<int> subExecIdx;                 ///< Indices into subExecParamGpu
    int numaNode;                           ///< NUMA node to use for this Transfer

    // For GFX executor
    SubExecParam* subExecParamGpuPtr;
@@ -1155,6 +1384,29 @@ namespace {
    hsa_amd_sdma_engine_id_t sdmaEngineId;  ///< DMA engine ID
#endif
// For IBV executor
#ifdef NIC_EXEC_ENABLED
int srcNicIndex; ///< SRC NIC index
int dstNicIndex; ///< DST NIC index
ibv_context* srcContext; ///< Device context for SRC NIC
ibv_context* dstContext; ///< Device context for DST NIC
ibv_pd* srcProtect; ///< Protection domain for SRC NIC
ibv_pd* dstProtect; ///< Protection domain for DST NIC
ibv_cq* srcCompQueue; ///< Completion queue for SRC NIC
ibv_cq* dstCompQueue; ///< Completion queue for DST NIC
ibv_port_attr srcPortAttr; ///< Port attributes for SRC NIC
ibv_port_attr dstPortAttr; ///< Port attributes for DST NIC
ibv_gid srcGid; ///< GID handle for SRC NIC
ibv_gid dstGid; ///< GID handle for DST NIC
vector<ibv_qp*> srcQueuePairs; ///< Queue pairs for SRC NIC
vector<ibv_qp*> dstQueuePairs; ///< Queue pairs for DST NIC
ibv_mr* srcMemRegion; ///< Memory region for SRC
ibv_mr* dstMemRegion; ///< Memory region for DST
uint8_t qpCount; ///< Number of QPs to be used for transferring data
vector<ibv_sge> sgePerQueuePair; ///< Scatter-gather elements per queue pair
vector<ibv_send_wr> sendWorkRequests; ///< Send work requests per queue pair
#endif
    // Counters
    double totalDurationMsec;               ///< Total duration for all iterations for this Transfer
    vector<double> perIterMsec;             ///< Duration for each individual iteration
@@ -1169,7 +1421,6 @@ namespace {
    int totalSubExecs;                      ///< Total number of subExecutors to use
    bool useSubIndices;                     ///< Use subexecutor indicies
    int numSubIndices;                      ///< Number of subindices this ExeDevice has
    vector<SubExecParam> subExecParamCpu;   ///< Subexecutor parameters for this executor
    vector<TransferResources> resources;    ///< Per-Transfer resources
@@ -1178,7 +1429,718 @@ namespace {
    vector<hipStream_t> streams;            ///< HIP streams to launch on
    vector<hipEvent_t> startEvents;         ///< HIP start timing event
    vector<hipEvent_t> stopEvents;          ///< HIP stop timing event
    int wallClockRate;                      ///< (GFX-only) Device wall clock rate
  };
// Structure to track PCIe topology
struct PCIeNode
{
std::string address; ///< PCIe address for this PCIe node
std::string description; ///< Description for this PCIe node
std::set<PCIeNode> children; ///< Children PCIe nodes
// Default constructor
PCIeNode() : address(""), description("") {}
// Constructor
PCIeNode(std::string const& addr) : address(addr) {}
// Constructor
PCIeNode(std::string const& addr, std::string const& desc)
:address(addr), description(desc) {}
// Comparison operator for std::set
bool operator<(PCIeNode const& other) const {
return address < other.address;
}
};
#ifdef NIC_EXEC_ENABLED
// Structure to track information about IBV devices
struct IbvDevice
{
ibv_device* devicePtr;
std::string name;
std::string busId;
bool hasActivePort;
int numaNode;
int gidIndex;
std::string gidDescriptor;
bool isRoce;
  };
#endif
#ifdef NIC_EXEC_ENABLED
// Function to collect information about IBV devices
//========================================================================================
static bool IsConfiguredGid(union ibv_gid const& gid)
{
const struct in6_addr *a = (struct in6_addr *) gid.raw;
int trailer = (a->s6_addr32[1] | a->s6_addr32[2] | a->s6_addr32[3]);
if (((a->s6_addr32[0] | trailer) == 0UL) ||
((a->s6_addr32[0] == htonl(0xfe800000)) && (trailer == 0UL))) {
return false;
}
return true;
}
static bool LinkLocalGid(union ibv_gid const& gid)
{
const struct in6_addr *a = (struct in6_addr *) gid.raw;
if (a->s6_addr32[0] == htonl(0xfe800000) && a->s6_addr32[1] == 0UL) {
return true;
}
return false;
}
static ErrResult GetRoceVersionNumber(struct ibv_context* const& context,
int const& portNum,
int const& gidIndex,
int& version)
{
char const* deviceName = ibv_get_device_name(context->device);
char gidRoceVerStr[16] = {};
char roceTypePath[PATH_MAX] = {};
sprintf(roceTypePath, "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d",
deviceName, portNum, gidIndex);
int fd = open(roceTypePath, O_RDONLY);
if (fd == -1)
return {ERR_FATAL, "Failed while opening RoCE file path (%s)", roceTypePath};
int ret = read(fd, gidRoceVerStr, 15);
close(fd);
if (ret == -1)
return {ERR_FATAL, "Failed while reading RoCE version"};
if (strlen(gidRoceVerStr)) {
if (strncmp(gidRoceVerStr, "IB/RoCE v1", strlen("IB/RoCE v1")) == 0
|| strncmp(gidRoceVerStr, "RoCE v1", strlen("RoCE v1")) == 0) {
version = 1;
}
else if (strncmp(gidRoceVerStr, "RoCE v2", strlen("RoCE v2")) == 0) {
version = 2;
}
}
return ERR_NONE;
}
static bool IsIPv4MappedIPv6(const union ibv_gid &gid)
{
// look for ::ffff:x.x.x.x format
// From Broadcom documentation
// https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/frequently-asked-questions1.html
// "The IPv4 address is really an IPv4 address mapped into the IPv6 address space.
// This can be identified by 80 “0” bits, followed by 16 “1” bits (“FFFF” in hexadecimal)
// followed by the original 32-bit IPv4 address."
return (gid.global.subnet_prefix == 0 &&
gid.raw[8] == 0 &&
gid.raw[9] == 0 &&
gid.raw[10] == 0xff &&
gid.raw[11] == 0xff);
}
static ErrResult GetGidIndex(struct ibv_context* context,
int const& gidTblLen,
int const& portNum,
std::pair<int, std::string>& gidInfo)
{
if(gidInfo.first >= 0) return ERR_NONE; // honor user choice
union ibv_gid gid;
GidPriority highestPriority = GidPriority::UNKNOWN;
int gidIndex = -1;
for (int i = 0; i < gidTblLen; ++i) {
IBV_CALL(ibv_query_gid, context, portNum, i, &gid);
if (!IsConfiguredGid(gid)) continue;
int gidCurrRoceVersion;
if(GetRoceVersionNumber(context, portNum, i, gidCurrRoceVersion).errType != ERR_NONE) continue;
GidPriority currPriority;
if (IsIPv4MappedIPv6(gid)) {
currPriority = (gidCurrRoceVersion == 2) ? GidPriority::ROCEV2_IPV4 : GidPriority::ROCEV1_IPV4;
} else if (!LinkLocalGid(gid)) {
currPriority = (gidCurrRoceVersion == 2) ? GidPriority::ROCEV2_IPV6 : GidPriority::ROCEV1_IPV6;
} else {
currPriority = (gidCurrRoceVersion == 2) ? GidPriority::ROCEV2_LINK_LOCAL : GidPriority::ROCEV1_LINK_LOCAL;
}
if(currPriority > highestPriority) {
highestPriority = currPriority;
gidIndex = i;
}
}
if (highestPriority == GidPriority::UNKNOWN) {
gidInfo.first = -1;
return {ERR_FATAL, "Failed to auto-detect a valid GID index. Try setting it manually through IB_GID_INDEX"};
}
gidInfo.first = gidIndex;
gidInfo.second = GidPriorityStr[highestPriority];
return ERR_NONE;
}
static vector<IbvDevice>& GetIbvDeviceList()
{
static bool isInitialized = false;
static vector<IbvDevice> ibvDeviceList = {};
// Build list on first use
if (!isInitialized) {
// Query the number of IBV devices
int numIbvDevices = 0;
ibv_device** deviceList = ibv_get_device_list(&numIbvDevices);
if (deviceList && numIbvDevices > 0) {
// Loop over each device to collect information
for (int i = 0; i < numIbvDevices; i++) {
IbvDevice ibvDevice;
ibvDevice.devicePtr = deviceList[i];
ibvDevice.name = deviceList[i]->name;
ibvDevice.hasActivePort = false;
{
struct ibv_context *context = ibv_open_device(ibvDevice.devicePtr);
if (context) {
struct ibv_device_attr deviceAttr;
if (!ibv_query_device(context, &deviceAttr)) {
int activePort;
ibvDevice.gidIndex = -1;
for (int port = 1; port <= deviceAttr.phys_port_cnt; ++port) {
struct ibv_port_attr portAttr;
if (ibv_query_port(context, port, &portAttr)) continue;
if (portAttr.state == IBV_PORT_ACTIVE) {
activePort = port;
ibvDevice.hasActivePort = true;
if(portAttr.link_layer == IBV_LINK_LAYER_ETHERNET) {
ibvDevice.isRoce = true;
std::pair<int, std::string> gidInfo (-1, "");
auto res = GetGidIndex(context, portAttr.gid_tbl_len, activePort, gidInfo);
if (res.errType == ERR_NONE) {
ibvDevice.gidIndex = gidInfo.first;
ibvDevice.gidDescriptor = gidInfo.second;
}
}
break;
}
}
}
ibv_close_device(context);
}
}
ibvDevice.busId = "";
{
std::string device_path(ibvDevice.devicePtr->dev_path);
if (std::filesystem::exists(device_path)) {
std::string pciPath = std::filesystem::canonical(device_path + "/device").string();
std::size_t pos = pciPath.find_last_of('/');
if (pos != std::string::npos) {
ibvDevice.busId = pciPath.substr(pos + 1);
}
}
}
// Get nearest numa node for this device
ibvDevice.numaNode = -1;
std::filesystem::path devicePath = "/sys/bus/pci/devices/" + ibvDevice.busId + "/numa_node";
if (!ibvDevice.busId.empty() && std::filesystem::exists(devicePath)) {
std::string canonicalPath = std::filesystem::canonical(devicePath).string();
std::ifstream file(canonicalPath);
if (file.is_open()) {
std::string numaNodeStr;
std::getline(file, numaNodeStr);
int numaNodeVal;
if (sscanf(numaNodeStr.c_str(), "%d", &numaNodeVal) == 1)
ibvDevice.numaNode = numaNodeVal;
file.close();
}
}
ibvDeviceList.push_back(ibvDevice);
}
}
ibv_free_device_list(deviceList);
isInitialized = true;
}
return ibvDeviceList;
}
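// Illustrative sketch (not part of the library): enumerating the cached IBV device list.
//
//   for (IbvDevice const& dev : GetIbvDeviceList()) {
//     if (!dev.hasActivePort) continue;
//     printf("%s busId=%s numaNode=%d gidIndex=%d (%s)\n", dev.name.c_str(),
//            dev.busId.c_str(), dev.numaNode, dev.gidIndex, dev.gidDescriptor.c_str());
//   }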
#endif // NIC_EXEC_ENABLED
#ifdef NIC_EXEC_ENABLED
// PCIe-related functions
//========================================================================================
// Prints off PCIe tree
static void PrintPCIeTree(PCIeNode const& node,
std::string const& prefix = "",
bool isLast = true)
{
if (!node.address.empty()) {
printf("%s%s%s", prefix.c_str(), (isLast ? "└── " : "├── "), node.address.c_str());
if (!node.description.empty()) {
printf("(%s)", node.description.c_str());
}
printf("\n");
}
auto const& children = node.children;
for (auto it = children.begin(); it != children.end(); ++it) {
PrintPCIeTree(*it, prefix + (isLast ? " " : "│ "), std::next(it) == children.end());
}
}
// Inserts nodes along pcieAddress down a tree starting from root
static ErrResult InsertPCIePathToTree(std::string const& pcieAddress,
std::string const& description,
PCIeNode& root)
{
std::filesystem::path devicePath = "/sys/bus/pci/devices/" + pcieAddress;
if (!std::filesystem::exists(devicePath)) {
return {ERR_FATAL, "Device path %s does not exist", devicePath.c_str()};
}
std::string canonicalPath = std::filesystem::canonical(devicePath).string();
std::istringstream iss(canonicalPath);
std::string token;
PCIeNode* currNode = &root;
while (std::getline(iss, token, '/')) {
auto it = (currNode->children.insert(PCIeNode(token))).first;
currNode = const_cast<PCIeNode*>(&(*it));
}
currNode->description = description;
return ERR_NONE;
}
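// Illustrative example (not part of the library, hypothetical topology): for a NIC at
// "0000:c1:00.0" the canonical sysfs path might be
//   /sys/devices/pci0000:c0/0000:c0:01.1/0000:c1:00.0
// so this routine walks/creates the node chain
//   "sys" -> "devices" -> "pci0000:c0" -> "0000:c0:01.1" -> "0000:c1:00.0"
// and attaches the description to the leaf node.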
// Returns root node for PCIe tree. Constructed on first use
static PCIeNode* GetPCIeTreeRoot()
{
static bool isInitialized = false;
static PCIeNode pcieRoot;
// Build PCIe tree on first use
if (!isInitialized) {
// Add NICs to the tree
int numNics = GetNumExecutors(EXE_NIC);
auto const& ibvDeviceList = GetIbvDeviceList();
for (IbvDevice const& ibvDevice : ibvDeviceList) {
if (!ibvDevice.hasActivePort || ibvDevice.busId == "") continue;
InsertPCIePathToTree(ibvDevice.busId, ibvDevice.name, pcieRoot);
}
// Add GPUs to the tree
int numGpus = GetNumExecutors(EXE_GPU_GFX);
for (int i = 0; i < numGpus; ++i) {
char hipPciBusId[64];
if (hipDeviceGetPCIBusId(hipPciBusId, sizeof(hipPciBusId), i) == hipSuccess) {
InsertPCIePathToTree(hipPciBusId, "GPU " + std::to_string(i), pcieRoot);
}
}
#ifdef VERBS_DEBUG
PrintPCIeTree(pcieRoot);
#endif
isInitialized = true;
}
return &pcieRoot;
}
// Finds the lowest common ancestor in PCIe tree between two nodes
static PCIeNode const* GetLcaBetweenNodes(PCIeNode const* root,
std::string const& node1Address,
std::string const& node2Address)
{
if (!root || root->address == node1Address || root->address == node2Address)
return root;
PCIeNode const* lcaFound1 = nullptr;
PCIeNode const* lcaFound2 = nullptr;
// Recursively iterate over children
for (auto const& child : root->children) {
PCIeNode const* lca = GetLcaBetweenNodes(&child, node1Address, node2Address);
if (!lca) continue;
if (!lcaFound1) {
// First time found
lcaFound1 = lca;
} else {
// Second time found
lcaFound2 = lca;
break;
}
}
// If two children were found, then current node is the lowest common ancestor
return (lcaFound1 && lcaFound2) ? root : lcaFound1;
}
// Gets the depth of a node in the PCIe tree
static int GetLcaDepth(std::string const& targetBusID,
PCIeNode const* const& node,
int depth = 0)
{
if (!node) return -1;
if (targetBusID == node->address) return depth;
for (auto const& child : node->children) {
int distance = GetLcaDepth(targetBusID, &child, depth + 1);
if (distance != -1)
return distance;
}
return -1;
}
// Function to extract the bus number from a PCIe address (domain:bus:device.function)
static int ExtractBusNumber(std::string const& pcieAddress)
{
int domain, bus, device, function;
char delimiter;
std::istringstream iss(pcieAddress);
iss >> std::hex >> domain >> delimiter >> bus >> delimiter >> device >> delimiter >> function;
if (iss.fail()) {
#ifdef VERBS_DEBUG
printf("Invalid PCIe address format: %s\n", pcieAddress.c_str());
#endif
return -1;
}
return bus;
}
// Function to compute the distance between two bus IDs
static int GetBusIdDistance(std::string const& pcieAddress1,
std::string const& pcieAddress2)
{
int bus1 = ExtractBusNumber(pcieAddress1);
int bus2 = ExtractBusNumber(pcieAddress2);
return (bus1 < 0 || bus2 < 0) ? -1 : std::abs(bus1 - bus2);
}
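// Illustrative example (not part of the library): for "0000:c1:00.0" ExtractBusNumber parses
// domain=0x0000, bus=0xc1, device=0x00, function=0x0 and returns 0xc1 (193), so e.g.
//   GetBusIdDistance("0000:c1:00.0", "0000:c4:00.0") == 3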
// Given a target busID and a set of candidate devices, returns a set of indices
// that is "closest" to the target
static std::set<int> GetNearestDevicesInTree(std::string const& targetBusId,
std::vector<std::string> const& candidateBusIdList)
{
int maxDepth = -1;
int minDistance = std::numeric_limits<int>::max();
std::set<int> matches = {};
// Loop over the candidates to find the ones with the lowest common ancestor (LCA)
for (int i = 0; i < candidateBusIdList.size(); i++) {
std::string const& candidateBusId = candidateBusIdList[i];
if (candidateBusId == "") continue;
PCIeNode const* lca = GetLcaBetweenNodes(GetPCIeTreeRoot(), targetBusId, candidateBusId);
if (!lca) continue;
int depth = GetLcaDepth(lca->address, GetPCIeTreeRoot());
int currDistance = GetBusIdDistance(targetBusId, candidateBusId);
// When more than one LCA match is found, choose the one with the smallest busId difference
// NOTE: currDistance could be -1, which signals a parsing problem; such a candidate is
// still treated as a valid "closest" match, so it is included
if (depth > maxDepth || (depth == maxDepth && depth >= 0 && currDistance < minDistance)) {
maxDepth = depth;
matches.clear();
matches.insert(i);
minDistance = currDistance;
} else if (depth == maxDepth && depth >= 0 && currDistance == minDistance) {
matches.insert(i);
}
}
return matches;
}
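// Illustrative sketch (not part of the library): finding the NIC(s) topologically closest to
// GPU 0; variable names here are hypothetical.
//
//   char gpuBusId[64];
//   if (hipDeviceGetPCIBusId(gpuBusId, sizeof(gpuBusId), 0) == hipSuccess) {
//     std::vector<std::string> nicBusIds;
//     for (auto const& dev : GetIbvDeviceList()) nicBusIds.push_back(dev.busId);
//     std::set<int> nearest = GetNearestDevicesInTree(gpuBusId, nicBusIds);
//     // 'nearest' holds indices into nicBusIds that share the deepest common PCIe ancestor
//   }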
#endif // NIC_EXEC_ENABLED
#ifdef NIC_EXEC_ENABLED
// IB Verbs-related functions
//========================================================================================
// Create a queue pair
static ErrResult CreateQueuePair(ConfigOptions const& cfg,
struct ibv_pd* pd,
struct ibv_cq* cq,
struct ibv_qp*& qp)
{
// Set queue pair attributes
struct ibv_qp_init_attr attr = {};
attr.qp_type = IBV_QPT_RC; // Set type to reliable connection
attr.send_cq = cq; // Send completion queue
attr.recv_cq = cq; // Recv completion queue
attr.cap.max_send_wr = cfg.nic.maxSendWorkReq; // Max send work requests
attr.cap.max_recv_wr = cfg.nic.maxRecvWorkReq; // Max recv work requests
attr.cap.max_send_sge = 1; // Max send scatter-gather entries
attr.cap.max_recv_sge = 1; // Max recv scatter-gather entries
qp = ibv_create_qp(pd, &attr);
if (qp == NULL)
return {ERR_FATAL, "Error while creating QP"};
return ERR_NONE;
}
// Initialize a queue pair
static ErrResult InitQueuePair(struct ibv_qp* qp,
uint8_t port,
unsigned flags)
{
struct ibv_qp_attr attr = {}; // Clear all attributes
attr.qp_state = IBV_QPS_INIT; // Set the QP state to INIT
attr.pkey_index = 0; // Set the partition key index to 0
attr.port_num = port; // Set the port number to the defined IB_PORT
attr.qp_access_flags = flags; // Set the QP access flags to the provided flags
int ret = ibv_modify_qp(qp, &attr,
IBV_QP_STATE | // Modify the QP state
IBV_QP_PKEY_INDEX | // Modify the partition key index
IBV_QP_PORT | // Modify the port number
IBV_QP_ACCESS_FLAGS); // Modify the access flags
if (ret != 0)
return {ERR_FATAL, "Error during QP Init. IB Verbs Error code: %d", ret};
return ERR_NONE;
}
// Transition QueuePair to Ready to Receive State
static ErrResult TransitionQpToRtr(ibv_qp* qp,
uint16_t const& dlid,
uint32_t const& dqpn,
ibv_gid const& gid,
uint8_t const& gidIndex,
uint8_t const& port,
bool const& isRoCE,
ibv_mtu const& mtu)
{
// Prepare QP attributes
struct ibv_qp_attr attr = {};
attr.qp_state = IBV_QPS_RTR;
attr.path_mtu = mtu;
attr.rq_psn = 0;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer = 12;
if (isRoCE) {
attr.ah_attr.is_global = 1;
attr.ah_attr.grh.dgid.global.subnet_prefix = gid.global.subnet_prefix;
attr.ah_attr.grh.dgid.global.interface_id = gid.global.interface_id;
attr.ah_attr.grh.flow_label = 0;
attr.ah_attr.grh.sgid_index = gidIndex;
attr.ah_attr.grh.hop_limit = 255;
} else {
attr.ah_attr.is_global = 0;
attr.ah_attr.dlid = dlid;
}
attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num = port;
attr.dest_qp_num = dqpn;
// Modify the QP
int ret = ibv_modify_qp(qp, &attr,
IBV_QP_STATE |
IBV_QP_AV |
IBV_QP_PATH_MTU |
IBV_QP_DEST_QPN |
IBV_QP_RQ_PSN |
IBV_QP_MAX_DEST_RD_ATOMIC |
IBV_QP_MIN_RNR_TIMER);
if (ret != 0)
return {ERR_FATAL, "Error during QP RTR. IB Verbs Error code: %d", ret};
return ERR_NONE;
}
// Transition QueuePair to Ready to Send state
static ErrResult TransitionQpToRts(struct ibv_qp *qp)
{
struct ibv_qp_attr attr = {};
attr.qp_state = IBV_QPS_RTS;
attr.sq_psn = 0;
attr.timeout = 14;
attr.retry_cnt = 7;
attr.rnr_retry = 7;
attr.max_rd_atomic = 1;
int ret = ibv_modify_qp(qp, &attr,
IBV_QP_STATE |
IBV_QP_TIMEOUT |
IBV_QP_RETRY_CNT |
IBV_QP_RNR_RETRY |
IBV_QP_SQ_PSN |
IBV_QP_MAX_QP_RD_ATOMIC);
if (ret != 0)
return {ERR_FATAL, "Error during QP RTS. IB Verbs Error code: %d", ret};
return ERR_NONE;
}
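// Illustrative sketch (not part of the library): the RC queue-pair bring-up sequence used by
// PrepareNicTransferResources below. Each endpoint walks RESET -> INIT -> RTR -> RTS using the
// peer's LID/GID and QP number:
//
//   ERR_CHECK(CreateQueuePair(cfg, pd, cq, qp));                 // created in RESET state
//   ERR_CHECK(InitQueuePair(qp, port, accessFlags));             // -> INIT
//   ERR_CHECK(TransitionQpToRtr(qp, remoteLid, remoteQpNum,      // -> RTR (needs remote info)
//                               remoteGid, gidIndex, port, isRoCE, mtu));
//   ERR_CHECK(TransitionQpToRts(qp));                            // -> RTS (ready to post sends)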
static ErrResult PrepareNicTransferResources(ConfigOptions const& cfg,
ExeDevice const& srcExeDevice,
Transfer const& t,
TransferResources& rss)
{
// Switch to the closest NUMA node to this NIC
int numaNode = GetIbvDeviceList()[srcExeDevice.exeIndex].numaNode;
if (numaNode != -1)
numa_run_on_node(numaNode);
int const port = cfg.nic.ibPort;
// Figure out destination NIC (Accounts for possible remap due to use of EXE_NIC_NEAREST)
ExeDevice dstExeDevice;
ERR_CHECK(GetActualExecutor(cfg, {t.exeDevice.exeType, t.exeSubIndex}, dstExeDevice));
rss.srcNicIndex = srcExeDevice.exeIndex;
rss.dstNicIndex = dstExeDevice.exeIndex;
rss.qpCount = t.numSubExecs;
// Check for valid NICs and active ports
int numNics = GetNumExecutors(EXE_NIC);
if (rss.srcNicIndex < 0 || rss.srcNicIndex >= numNics)
return {ERR_FATAL, "SRC NIC index is out of range (%d)", rss.srcNicIndex};
if (rss.dstNicIndex < 0 || rss.dstNicIndex >= numNics)
return {ERR_FATAL, "DST NIC index is out of range (%d)", rss.dstNicIndex};
if (!GetIbvDeviceList()[rss.srcNicIndex].hasActivePort)
return {ERR_FATAL, "SRC NIC %d is not active\n", rss.srcNicIndex};
if (!GetIbvDeviceList()[rss.dstNicIndex].hasActivePort)
return {ERR_FATAL, "DST NIC %d is not active\n", rss.dstNicIndex};
// Queue pair flags
unsigned int rdmaAccessFlags = (IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_READ |
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_REMOTE_ATOMIC);
unsigned int rdmaMemRegFlags = rdmaAccessFlags;
if (cfg.nic.useRelaxedOrder) rdmaMemRegFlags |= IBV_ACCESS_RELAXED_ORDERING;
// Open NIC contexts
IBV_PTR_CALL(rss.srcContext, ibv_open_device, GetIbvDeviceList()[rss.srcNicIndex].devicePtr);
IBV_PTR_CALL(rss.dstContext, ibv_open_device, GetIbvDeviceList()[rss.dstNicIndex].devicePtr);
// Open protection domains
IBV_PTR_CALL(rss.srcProtect, ibv_alloc_pd, rss.srcContext);
IBV_PTR_CALL(rss.dstProtect, ibv_alloc_pd, rss.dstContext);
// Register memory region
IBV_PTR_CALL(rss.srcMemRegion, ibv_reg_mr, rss.srcProtect, rss.srcMem[0], rss.numBytes, rdmaMemRegFlags);
IBV_PTR_CALL(rss.dstMemRegion, ibv_reg_mr, rss.dstProtect, rss.dstMem[0], rss.numBytes, rdmaMemRegFlags);
// Create completion queues
IBV_PTR_CALL(rss.srcCompQueue, ibv_create_cq, rss.srcContext, cfg.nic.queueSize, NULL, NULL, 0);
IBV_PTR_CALL(rss.dstCompQueue, ibv_create_cq, rss.dstContext, cfg.nic.queueSize, NULL, NULL, 0);
// Get port attributes
IBV_CALL(ibv_query_port, rss.srcContext, port, &rss.srcPortAttr);
IBV_CALL(ibv_query_port, rss.dstContext, port, &rss.dstPortAttr);
if (rss.srcPortAttr.link_layer != rss.dstPortAttr.link_layer)
return {ERR_FATAL, "SRC NIC (%d) and DST NIC (%d) do not have the same link layer", rss.srcNicIndex, rss.dstNicIndex};
// Prepare GID index
int srcGidIndex = cfg.nic.ibGidIndex;
int dstGidIndex = cfg.nic.ibGidIndex;
// Check for RDMA over Converged Ethernet (RoCE) and update GID index appropriately
bool isRoCE = (rss.srcPortAttr.link_layer == IBV_LINK_LAYER_ETHERNET);
if (isRoCE) {
// Try to auto-detect the GID index
std::pair<int, std::string> srcGidInfo (srcGidIndex, "");
std::pair<int, std::string> dstGidInfo (dstGidIndex, "");
ERR_CHECK(GetGidIndex(rss.srcContext, rss.srcPortAttr.gid_tbl_len, cfg.nic.ibPort, srcGidInfo));
ERR_CHECK(GetGidIndex(rss.dstContext, rss.dstPortAttr.gid_tbl_len, cfg.nic.ibPort, dstGidInfo));
srcGidIndex = srcGidInfo.first;
dstGidIndex = dstGidInfo.first;
IBV_CALL(ibv_query_gid, rss.srcContext, port, srcGidIndex, &rss.srcGid);
IBV_CALL(ibv_query_gid, rss.dstContext, port, dstGidIndex, &rss.dstGid);
}
// Prepare queue pairs and send elements
rss.srcQueuePairs.resize(rss.qpCount);
rss.dstQueuePairs.resize(rss.qpCount);
rss.sgePerQueuePair.resize(rss.qpCount);
rss.sendWorkRequests.resize(rss.qpCount);
for (int i = 0; i < rss.qpCount; ++i) {
// Create scatter-gather element for the portion of memory assigned to this queue pair
ibv_sge sg = {};
sg.addr = (uint64_t)rss.subExecParamCpu[i].src[0];
sg.length = rss.subExecParamCpu[i].N * sizeof(float);
sg.lkey = rss.srcMemRegion->lkey;
rss.sgePerQueuePair[i] = sg;
// Create send work request
ibv_send_wr wr = {};
wr.wr_id = i;
wr.sg_list = &rss.sgePerQueuePair[i];
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uint64_t)rss.subExecParamCpu[i].dst[0];
wr.wr.rdma.rkey = rss.dstMemRegion->rkey;
rss.sendWorkRequests[i] = wr;
// Create SRC/DST queue pairs
ERR_CHECK(CreateQueuePair(cfg, rss.srcProtect, rss.srcCompQueue, rss.srcQueuePairs[i]));
ERR_CHECK(CreateQueuePair(cfg, rss.dstProtect, rss.dstCompQueue, rss.dstQueuePairs[i]));
// Initialize SRC/DST queue pairs
ERR_CHECK(InitQueuePair(rss.srcQueuePairs[i], port, rdmaAccessFlags));
ERR_CHECK(InitQueuePair(rss.dstQueuePairs[i], port, rdmaAccessFlags));
// Transition the SRC queue pair to ready to receive
ERR_CHECK(TransitionQpToRtr(rss.srcQueuePairs[i], rss.dstPortAttr.lid,
rss.dstQueuePairs[i]->qp_num, rss.dstGid,
dstGidIndex, port, isRoCE,
rss.srcPortAttr.active_mtu));
// Transition the SRC queue pair to ready to send
ERR_CHECK(TransitionQpToRts(rss.srcQueuePairs[i]));
// Transition the DST queue pair to ready to receive
ERR_CHECK(TransitionQpToRtr(rss.dstQueuePairs[i], rss.srcPortAttr.lid,
rss.srcQueuePairs[i]->qp_num, rss.srcGid,
srcGidIndex, port, isRoCE,
rss.dstPortAttr.active_mtu));
// Transition the DST queue pair to ready to send
ERR_CHECK(TransitionQpToRts(rss.dstQueuePairs[i]));
}
return ERR_NONE;
}
static ErrResult TeardownNicTransferResources(TransferResources& rss)
{
// Deregister memory regions
IBV_CALL(ibv_dereg_mr, rss.srcMemRegion);
IBV_CALL(ibv_dereg_mr, rss.dstMemRegion);
// Destroy queue pairs
for (auto srcQueuePair : rss.srcQueuePairs)
IBV_CALL(ibv_destroy_qp, srcQueuePair);
rss.srcQueuePairs.clear();
for (auto dstQueuePair : rss.dstQueuePairs)
IBV_CALL(ibv_destroy_qp, dstQueuePair);
rss.dstQueuePairs.clear();
// Destroy completion queues
IBV_CALL(ibv_destroy_cq, rss.srcCompQueue);
IBV_CALL(ibv_destroy_cq, rss.dstCompQueue);
// Deallocate protection domains
IBV_CALL(ibv_dealloc_pd, rss.srcProtect);
IBV_CALL(ibv_dealloc_pd, rss.dstProtect);
// Destroy context
IBV_CALL(ibv_close_device, rss.srcContext);
IBV_CALL(ibv_close_device, rss.dstContext);
return ERR_NONE;
}
#endif // NIC_EXEC_ENABLED
// Data validation-related functions
//========================================================================================
@@ -1244,17 +2206,17 @@ namespace {
float* output;
size_t initOffset = cfg.data.byteOffset / sizeof(float);
for (auto rss : transferResources) {
int transferIdx = rss->transferIdx;
Transfer const& t = transfers[transferIdx];
size_t N = t.numBytes / sizeof(float);
float const* expected = dstReference[t.srcs.size()].data();
for (int dstIdx = 0; dstIdx < rss->dstMem.size(); dstIdx++) {
if (IsCpuMemType(t.dsts[dstIdx].memType) || cfg.data.validateDirect) {
output = (rss->dstMem[dstIdx]) + initOffset;
} else {
ERR_CHECK(hipMemcpy(outputBuffer.data(), (rss->dstMem[dstIdx]) + initOffset, t.numBytes, hipMemcpyDefault));
ERR_CHECK(hipDeviceSynchronize());
output = outputBuffer.data();
}
@@ -1282,7 +2244,7 @@ namespace {
// Initializes counters
static ErrResult PrepareSubExecParams(ConfigOptions const& cfg,
Transfer const& transfer,
TransferResources& rss)
{
// Each subExecutor needs to know src/dst pointers and how many elements to transfer
// Figure out the sub-array each subExecutor works on for this Transfer
@@ -1296,15 +2258,15 @@ namespace {
int const maxSubExecToUse = std::min((size_t)(N + targetMultiple - 1) / targetMultiple,
(size_t)transfer.numSubExecs);
vector<SubExecParam>& subExecParam = rss.subExecParamCpu;
subExecParam.clear();
subExecParam.resize(transfer.numSubExecs);
size_t assigned = 0;
for (int i = 0; i < transfer.numSubExecs; ++i) {
SubExecParam& p = subExecParam[i];
p.numSrcs = rss.srcMem.size();
p.numDsts = rss.dstMem.size();
p.startCycle = 0;
p.stopCycle = 0;
p.hwId = 0;
@@ -1315,8 +2277,8 @@ namespace {
p.N = N;
p.teamSize = transfer.numSubExecs;
p.teamIdx = i;
for (int iSrc = 0; iSrc < p.numSrcs; ++iSrc) p.src[iSrc] = rss.srcMem[iSrc] + initOffset;
for (int iDst = 0; iDst < p.numDsts; ++iDst) p.dst[iDst] = rss.dstMem[iDst] + initOffset;
} else {
// Otherwise, each subexecutor works on separate subarrays
int const subExecLeft = std::max(0, maxSubExecToUse - i);
@@ -1326,8 +2288,8 @@ namespace {
p.N = subExecLeft ? std::min(leftover, ((roundedN / subExecLeft) * targetMultiple)) : 0;
p.teamSize = 1;
p.teamIdx = 0;
for (int iSrc = 0; iSrc < p.numSrcs; ++iSrc) p.src[iSrc] = rss.srcMem[iSrc] + initOffset + assigned;
for (int iDst = 0; iDst < p.numDsts; ++iDst) p.dst[iDst] = rss.dstMem[iDst] + initOffset + assigned;
assigned += p.N;
}
@@ -1348,7 +2310,7 @@ namespace {
}
// Clear counters
rss.totalDurationMsec = 0.0;
return ERR_NONE;
}
@@ -1363,34 +2325,34 @@ namespace {
exeInfo.totalDurationMsec = 0.0;
// Loop over each transfer this executor is involved in
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
rss.numBytes = t.numBytes;
// Allocate source memory
rss.srcMem.resize(t.srcs.size());
for (int iSrc = 0; iSrc < t.srcs.size(); ++iSrc) {
MemDevice const& srcMemDevice = t.srcs[iSrc];
// Ensure executing GPU can access source memory
if (IsGpuExeType(exeDevice.exeType) && IsGpuMemType(srcMemDevice.memType) &&
srcMemDevice.memIndex != exeDevice.exeIndex) {
ERR_CHECK(EnablePeerAccess(exeDevice.exeIndex, srcMemDevice.memIndex));
}
ERR_CHECK(AllocateMemory(srcMemDevice, t.numBytes + cfg.data.byteOffset, (void**)&rss.srcMem[iSrc]));
}
// Allocate destination memory
rss.dstMem.resize(t.dsts.size());
for (int iDst = 0; iDst < t.dsts.size(); ++iDst) {
MemDevice const& dstMemDevice = t.dsts[iDst];
// Ensure executing GPU can access destination memory
if (IsGpuExeType(exeDevice.exeType) && IsGpuMemType(dstMemDevice.memType) &&
dstMemDevice.memIndex != exeDevice.exeIndex) {
ERR_CHECK(EnablePeerAccess(exeDevice.exeIndex, dstMemDevice.memIndex));
}
ERR_CHECK(AllocateMemory(dstMemDevice, t.numBytes + cfg.data.byteOffset, (void**)&rss.dstMem[iDst]));
}
if (exeDevice.exeType == EXE_GPU_DMA && (t.exeSubIndex != -1 || cfg.dma.useHsaCopy)) {
@@ -1398,19 +2360,22 @@ namespace {
// Collect HSA agent information
hsa_amd_pointer_info_t info;
info.size = sizeof(info);
ERR_CHECK(hsa_amd_pointer_info(rss.dstMem[0], &info, NULL, NULL, NULL));
rss.dstAgent = info.agentOwner;
ERR_CHECK(hsa_amd_pointer_info(rss.srcMem[0], &info, NULL, NULL, NULL));
rss.srcAgent = info.agentOwner;
// Create HSA completion signal
ERR_CHECK(hsa_signal_create(1, 0, NULL, &rss.signal));
if (t.exeSubIndex != -1)
rss.sdmaEngineId = (hsa_amd_sdma_engine_id_t)(1U << t.exeSubIndex);
#endif
}
// Prepare subexecutor parameters
ERR_CHECK(PrepareSubExecParams(cfg, t, rss));
}
// Prepare additional requirements for GPU-based executors
@@ -1469,13 +2434,47 @@ namespace {
exeDevice.exeIndex));
#endif
int transferOffset = 0;
if (cfg.gfx.useMultiStream || cfg.gfx.blockOrder == 0) {
// Threadblocks are ordered sequentially one transfer at a time
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
rss.subExecParamGpuPtr = exeInfo.subExecParamGpu + transferOffset;
for (auto p : rss.subExecParamCpu) {
rss.subExecIdx.push_back(exeInfo.subExecParamCpu.size());
exeInfo.subExecParamCpu.push_back(p);
transferOffset++;
}
}
} else if (cfg.gfx.blockOrder == 1) {
// Interleave threadblocks of different Transfers
for (int subExecIdx = 0; exeInfo.subExecParamCpu.size() < exeInfo.totalSubExecs; ++subExecIdx) {
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
if (subExecIdx < t.numSubExecs) {
rss.subExecIdx.push_back(exeInfo.subExecParamCpu.size());
exeInfo.subExecParamCpu.push_back(rss.subExecParamCpu[subExecIdx]);
}
}
}
} else if (cfg.gfx.blockOrder == 2) {
// Build randomized threadblock list
std::vector<std::pair<int,int>> indices;
for (int i = 0; i < exeInfo.resources.size(); i++) {
auto const& rss = exeInfo.resources[i];
Transfer const& t = transfers[rss.transferIdx];
for (int j = 0; j < t.numSubExecs; j++)
indices.push_back(std::make_pair(i,j));
}
std::random_device rd;
std::default_random_engine gen(rd());
std::shuffle(indices.begin(), indices.end(), gen);
// Build randomized threadblock list
for (auto p : indices) {
auto& rss = exeInfo.resources[p.first];
rss.subExecIdx.push_back(exeInfo.subExecParamCpu.size());
exeInfo.subExecParamCpu.push_back(rss.subExecParamCpu[p.second]);
}
}
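// Illustrative example (not part of the library, hypothetical sizes): with two Transfers using
// 3 and 2 subexecutors, the threadblock launch order (transferIdx, subExecIdx) above becomes
//   blockOrder == 0 (sequential) : (0,0)(0,1)(0,2)(1,0)(1,1)
//   blockOrder == 1 (interleaved): (0,0)(1,0)(0,1)(1,1)(0,2)
//   blockOrder == 2 (random)     : a shuffled permutation of the five pairs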
@@ -1488,6 +2487,17 @@ namespace {
ERR_CHECK(hipDeviceSynchronize());
}
// Prepare for NIC-based executors
if (IsNicExeType(exeDevice.exeType)) {
#ifdef NIC_EXEC_ENABLED
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
ERR_CHECK(PrepareNicTransferResources(cfg, exeDevice, t, rss));
}
#else
return {ERR_FATAL, "RDMA executor is not supported"};
#endif
}
return ERR_NONE;
}
@@ -1501,23 +2511,30 @@ namespace {
ExeInfo& exeInfo)
{
// Loop over each transfer this executor is involved in
for (auto& rss : exeInfo.resources) {
Transfer const& t = transfers[rss.transferIdx];
// Deallocate source memory
for (int iSrc = 0; iSrc < t.srcs.size(); ++iSrc) {
ERR_CHECK(DeallocateMemory(t.srcs[iSrc].memType, rss.srcMem[iSrc], t.numBytes + cfg.data.byteOffset));
}
// Deallocate destination memory
for (int iDst = 0; iDst < t.dsts.size(); ++iDst) {
ERR_CHECK(DeallocateMemory(t.dsts[iDst].memType, rss.dstMem[iDst], t.numBytes + cfg.data.byteOffset));
}
// Destroy HSA signal for DMA executor
#if !defined(__NVCC__)
if (exeDevice.exeType == EXE_GPU_DMA && (t.exeSubIndex != -1 || cfg.dma.useHsaCopy)) {
ERR_CHECK(hsa_signal_destroy(rss.signal));
}
#endif
// Destroy NIC related resources
#ifdef NIC_EXEC_ENABLED
if (IsNicExeType(exeDevice.exeType)) {
ERR_CHECK(TeardownNicTransferResources(rss));
}
#endif
}
@@ -1550,68 +2567,69 @@ namespace {
//========================================================================================
// Kernel for CPU execution (run by a single subexecutor)
static void CpuReduceKernel(SubExecParam const& p, int numSubIterations)
{
if (p.N == 0) return;
int subIteration = 0;
do {
int const& numSrcs = p.numSrcs;
int const& numDsts = p.numDsts;
if (numSrcs == 0) {
for (int i = 0; i < numDsts; ++i) {
memset(p.dst[i], MEMSET_CHAR, p.N * sizeof(float));
//for (int j = 0; j < p.N; j++) p.dst[i][j] = MEMSET_VAL;
}
} else if (numSrcs == 1) {
float const* __restrict__ src = p.src[0];
if (numDsts == 0) {
float sum = 0.0;
for (int j = 0; j < p.N; j++)
sum += p.src[0][j];
// Add a dummy check to ensure the read is not optimized out
if (sum != sum) {
printf("[ERROR] Nan detected\n");
}
} else {
for (int i = 0; i < numDsts; ++i)
memcpy(p.dst[i], src, p.N * sizeof(float));
}
} else {
float sum = 0.0f;
for (int j = 0; j < p.N; j++) {
sum = p.src[0][j];
for (int i = 1; i < numSrcs; i++) sum += p.src[i][j];
for (int i = 0; i < numDsts; i++) p.dst[i][j] = sum;
}
}
} while (++subIteration != numSubIterations);
}
// Execution of a single CPU Transfer
static void ExecuteCpuTransfer(int const iteration,
ConfigOptions const& cfg,
int const exeIndex,
TransferResources& rss)
{
auto cpuStart = std::chrono::high_resolution_clock::now();
vector<std::thread> childThreads;
for (auto const& subExecParam : rss.subExecParamCpu)
childThreads.emplace_back(std::thread(CpuReduceKernel, std::cref(subExecParam), cfg.general.numSubIterations));
for (auto& subExecThread : childThreads)
subExecThread.join();
childThreads.clear();
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
double deltaMsec = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count() * 1000.0;
if (iteration >= 0) {
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration)
rss.perIterMsec.push_back(deltaMsec);
}
}
@@ -1625,12 +2643,12 @@ namespace {
auto cpuStart = std::chrono::high_resolution_clock::now();
vector<std::thread> asyncTransfers;
for (auto& rss : exeInfo.resources) {
asyncTransfers.emplace_back(std::thread(ExecuteCpuTransfer,
iteration,
std::cref(cfg),
exeIndex,
std::ref(rss)));
}
for (auto& asyncTransfer : asyncTransfers)
asyncTransfer.join();
@@ -1642,6 +2660,88 @@ namespace {
return ERR_NONE;
}
#ifdef NIC_EXEC_ENABLED
// Execution of a single NIC Transfer
static ErrResult ExecuteNicTransfer(int const iteration,
ConfigOptions const& cfg,
int const exeIndex,
TransferResources& rss)
{
// Loop over each of the queue pairs and post the send
ibv_send_wr* badWorkReq;
for (int qpIndex = 0; qpIndex < rss.qpCount; qpIndex++) {
int error = ibv_post_send(rss.srcQueuePairs[qpIndex], &rss.sendWorkRequests[qpIndex], &badWorkReq);
if (error)
return {ERR_FATAL, "Transfer %d: Error when calling ibv_post_send for QP %d Error code %d\n",
rss.transferIdx, qpIndex, error};
}
return ERR_NONE;
}
// Execution of a single NIC executor
static ErrResult RunNicExecutor(int const iteration,
ConfigOptions const& cfg,
int const exeIndex,
ExeInfo& exeInfo)
{
// Switch to the closest NUMA node to this NIC
if (cfg.nic.useNuma) {
int numaNode = GetIbvDeviceList()[exeIndex].numaNode;
if (numaNode != -1)
numa_run_on_node(numaNode);
}
int subIterations = 0;
do {
auto cpuStart = std::chrono::high_resolution_clock::now();
size_t completedTransfers = 0;
auto transferCount = exeInfo.resources.size();
std::vector<uint8_t> receivedQPs(transferCount);
std::vector<std::chrono::high_resolution_clock::time_point> transferTimers(transferCount);
// post the sends
for (auto i = 0; i < transferCount; i++) {
transferTimers[i] = std::chrono::high_resolution_clock::now();
ERR_CHECK(ExecuteNicTransfer(iteration, cfg, exeIndex, exeInfo.resources[i]));
}
// poll for completions
do {
for (auto i = 0; i < transferCount; i++) {
if(receivedQPs[i] < exeInfo.resources[i].qpCount) {
auto& rss = exeInfo.resources[i];
// Poll the completion queue until all queue pairs are complete
// The order of completion doesn't matter because this completion queue is dedicated to this Transfer
ibv_wc wc;
int nc = ibv_poll_cq(rss.srcCompQueue, 1, &wc);
if (nc > 0) {
receivedQPs[i]++;
if (wc.status != IBV_WC_SUCCESS) {
return {ERR_FATAL, "Transfer %d: Received unsuccessful work completion", rss.transferIdx};
}
} else if (nc < 0) {
return {ERR_FATAL, "Transfer %d: Received negative work completion", rss.transferIdx};
}
if(receivedQPs[i] == rss.qpCount) {
auto cpuDelta = std::chrono::high_resolution_clock::now() - transferTimers[i];
double deltaMsec = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count() * 1000.0;
if (iteration >= 0) {
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration)
rss.perIterMsec.push_back(deltaMsec);
}
completedTransfers++;
}
}
}
} while(completedTransfers < transferCount);
auto cpuDelta = std::chrono::high_resolution_clock::now() - cpuStart;
double deltaMsec = std::chrono::duration_cast<std::chrono::duration<double>>(cpuDelta).count() * 1000.0;
if (iteration >= 0)
exeInfo.totalDurationMsec += deltaMsec;
} while(++subIterations < cfg.general.numSubIterations);
return ERR_NONE;
}
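// Illustrative example (not part of the library, hypothetical sizes): with 2 Transfers of 4
// queue pairs each, RunNicExecutor posts 8 RDMA_WRITE work requests up front, then polls each
// Transfer's completion queue until receivedQPs[i] reaches 4. A Transfer's per-iteration time
// is taken when its last completion arrives, while exeInfo.totalDurationMsec covers the batch.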
#endif
// GFX Executor-related functions
//========================================================================================
@@ -1674,13 +2774,97 @@ namespace {
// Helper function for memset
template <typename T> __device__ __forceinline__ T MemsetVal();
template <> __device__ __forceinline__ float MemsetVal(){ return MEMSET_VAL; };
template <> __device__ __forceinline__ float2 MemsetVal(){ return make_float2(MEMSET_VAL,
MEMSET_VAL); };
template <> __device__ __forceinline__ float4 MemsetVal(){ return make_float4(MEMSET_VAL,
MEMSET_VAL,
MEMSET_VAL,
MEMSET_VAL); }
// Helper function for temporal/non-temporal reads / writes
#define TEMPORAL_NONE 0
#define TEMPORAL_LOAD 1
#define TEMPORAL_STORE 2
#define TEMPORAL_BOTH 3
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Load(float const* src, float& dst) {
if (TEMPORAL_MODE & TEMPORAL_LOAD) {
#if !defined(__NVCC__)
dst = __builtin_nontemporal_load(src);
#endif
} else {
dst = *src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Load(float2 const* src, float2& dst) {
if (TEMPORAL_MODE & TEMPORAL_LOAD) {
#if !defined(__NVCC__)
dst.x = __builtin_nontemporal_load(&(src->x));
dst.y = __builtin_nontemporal_load(&(src->y));
#endif
} else {
dst = *src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Load(float4 const* src, float4& dst) {
if (TEMPORAL_MODE & TEMPORAL_LOAD) {
#if !defined(__NVCC__)
dst.x = __builtin_nontemporal_load(&(src->x));
dst.y = __builtin_nontemporal_load(&(src->y));
dst.z = __builtin_nontemporal_load(&(src->z));
dst.w = __builtin_nontemporal_load(&(src->w));
#endif
} else {
dst = *src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Store(float const& src, float* dst) {
if (TEMPORAL_MODE & TEMPORAL_STORE) {
#if !defined(__NVCC__)
__builtin_nontemporal_store(src, dst);
#endif
} else {
*dst = src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Store(float2 const& src, float2* dst) {
if (TEMPORAL_MODE & TEMPORAL_STORE) {
#if !defined(__NVCC__)
__builtin_nontemporal_store(src.x, &(dst->x));
__builtin_nontemporal_store(src.y, &(dst->y));
#endif
} else {
*dst = src;
}
}
template <int TEMPORAL_MODE>
__device__ __forceinline__ void Store(float4 const& src, float4* dst) {
if (TEMPORAL_MODE & TEMPORAL_STORE) {
#if !defined(__NVCC__)
__builtin_nontemporal_store(src.x, &(dst->x));
__builtin_nontemporal_store(src.y, &(dst->y));
__builtin_nontemporal_store(src.z, &(dst->z));
__builtin_nontemporal_store(src.w, &(dst->w));
#endif
} else {
*dst = src;
}
}
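// Illustrative note (not part of the library): TEMPORAL_MODE is a 2-bit mask, so
// TEMPORAL_BOTH (== TEMPORAL_LOAD | TEMPORAL_STORE) routes both Load<> and Store<> through the
// __builtin_nontemporal_* builtins on AMD compilers, while TEMPORAL_NONE keeps ordinary memory
// accesses; under __NVCC__ the nontemporal builtins are unavailable and the guarded calls
// compile to nothing, so non-temporal modes are only meaningful on AMD builds.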
// Kernel for GFX execution
template <typename PACKED_FLOAT, int BLOCKSIZE, int UNROLL, int TEMPORAL_MODE>
__global__ void __launch_bounds__(BLOCKSIZE)
GpuReduceKernel(SubExecParam* params, int waveOrder, int numSubIterations)
{
@@ -1699,10 +2883,10 @@ namespace {
// Collect data information
int32_t const numSrcs = p.numSrcs;
int32_t const numDsts = p.numDsts;
PACKED_FLOAT const* __restrict__ srcFloatPacked[MAX_SRCS];
PACKED_FLOAT* __restrict__ dstFloatPacked[MAX_DSTS];
for (int i = 0; i < numSrcs; i++) srcFloatPacked[i] = (PACKED_FLOAT const*)p.src[i];
for (int i = 0; i < numDsts; i++) dstFloatPacked[i] = (PACKED_FLOAT*)p.dst[i];
// Operate on wavefront granularity
int32_t const nTeams = p.teamSize; // Number of threadblocks working together on this subarray
@@ -1711,7 +2895,7 @@ namespace {
int32_t const waveIdx = threadIdx.x / warpSize; // Index of this wavefront within the threadblock
int32_t const tIdx = threadIdx.x % warpSize; // Thread index within wavefront
size_t const numPackedFloat = p.N / (sizeof(PACKED_FLOAT)/sizeof(float));
int32_t teamStride, waveStride, unrlStride, teamStride2, waveStride2;
switch (waveOrder) {
@@ -1725,72 +2909,84 @@ namespace {
int subIterations = 0;
while (1) {
// First loop: Each wavefront in the team works on UNROLL PACKED_FLOAT per thread
size_t const loop1Stride = nTeams * nWaves * UNROLL * warpSize;
size_t const loop1Limit = numPackedFloat / loop1Stride * loop1Stride;
{
PACKED_FLOAT val[UNROLL];
PACKED_FLOAT tmp[UNROLL];
if (numSrcs == 0) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
val[u] = MemsetVal<PACKED_FLOAT>();
}
for (size_t idx = (teamIdx * teamStride + waveIdx * waveStride) * warpSize + tIdx; idx < loop1Limit; idx += loop1Stride) {
// Read sources into memory and accumulate in registers
if (numSrcs) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
Load<TEMPORAL_MODE>(&srcFloatPacked[0][idx + u * unrlStride * warpSize], val[u]);
for (int s = 1; s < numSrcs; s++) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
Load<TEMPORAL_MODE>(&srcFloatPacked[s][idx + u * unrlStride * warpSize], tmp[u]);
#pragma unroll
for (int u = 0; u < UNROLL; u++)
val[u] += tmp[u];
}
}
// Write accumulation to all outputs
for (int d = 0; d < numDsts; d++) {
#pragma unroll
for (int u = 0; u < UNROLL; u++)
Store<TEMPORAL_MODE>(val[u], &dstFloatPacked[d][idx + u * unrlStride * warpSize]);
}
}
}
// Second loop: Deal with remaining PACKED_FLOAT
{
if (loop1Limit < numPackedFloat) {
PACKED_FLOAT val, tmp;
if (numSrcs == 0) val = MemsetVal<PACKED_FLOAT>();
size_t const loop2Stride = nTeams * nWaves * warpSize;
for (size_t idx = loop1Limit + (teamIdx * teamStride2 + waveIdx * waveStride2) * warpSize + tIdx;
idx < numPackedFloat; idx += loop2Stride) {
if (numSrcs) {
Load<TEMPORAL_MODE>(&srcFloatPacked[0][idx], val);
for (int s = 1; s < numSrcs; s++) {
Load<TEMPORAL_MODE>(&srcFloatPacked[s][idx], tmp);
val += tmp;
}
}
for (int d = 0; d < numDsts; d++)
Store<TEMPORAL_MODE>(val, &dstFloatPacked[d][idx]);
}
}
}
// Third loop; Deal with remaining floats
{
if (numPackedFloat * (sizeof(PACKED_FLOAT)/sizeof(float)) < p.N) {
float val, tmp;
if (numSrcs == 0) val = MemsetVal<float>();
size_t const loop3Stride = nTeams * nWaves * warpSize;
for (size_t idx = numPackedFloat * (sizeof(PACKED_FLOAT)/sizeof(float)) + (teamIdx * teamStride2 + waveIdx * waveStride2) * warpSize + tIdx; idx < p.N; idx += loop3Stride) {
if (numSrcs) {
Load<TEMPORAL_MODE>(&p.src[0][idx], val);
for (int s = 1; s < numSrcs; s++) {
Load<TEMPORAL_MODE>(&p.src[s][idx], tmp);
val += tmp;
}
}
for (int d = 0; d < numDsts; d++)
Store<TEMPORAL_MODE>(val, &p.dst[d][idx]);
}
}
}
@@ -1809,19 +3005,30 @@ namespace {
}
}
#define GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, DWORD) \
{GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_NONE>, \
GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_LOAD>, \
GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_STORE>, \
GpuReduceKernel<DWORD, BLOCKSIZE, UNROLL, TEMPORAL_BOTH>}
#define GPU_KERNEL_DWORD_DECL(BLOCKSIZE, UNROLL) \
{GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, float), \
GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, float2), \
GPU_KERNEL_TEMPORAL_DECL(BLOCKSIZE, UNROLL, float4)}
#define GPU_KERNEL_UNROLL_DECL(BLOCKSIZE) \
{GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 1), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 2), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 3), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 4), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 5), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 6), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 7), \
GPU_KERNEL_DWORD_DECL(BLOCKSIZE, 8)}
// Table of all GPU Reduction kernel functions (templated blocksize / unroll / dword size)
typedef void (*GpuKernelFuncPtr)(SubExecParam*, int, int);
GpuKernelFuncPtr GpuKernelTable[MAX_WAVEGROUPS][MAX_UNROLL][3][4] =
{
GPU_KERNEL_UNROLL_DECL(64),
GPU_KERNEL_UNROLL_DECL(128),
@@ -1833,6 +3040,8 @@ namespace {
GPU_KERNEL_UNROLL_DECL(512)
};
#undef GPU_KERNEL_UNROLL_DECL
#undef GPU_KERNEL_DWORD_DECL
#undef GPU_KERNEL_TEMPORAL_DECL
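// Illustrative example (not part of the library): the table is indexed as
//   GpuKernelTable[blockSize/64 - 1][unrollFactor - 1][wordSizeIdx][temporalMode]
// so a 256-thread block with unroll 4, float4 words (wordSizeIdx == 2) and non-temporal
// loads+stores selects GpuReduceKernel<float4, 256, 4, TEMPORAL_BOTH>.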
// Execute a single GPU Transfer (when using 1 stream per Transfer)
static ErrResult ExecuteGpuTransfer(int const iteration,
@@ -1841,27 +3050,28 @@ namespace {
hipEvent_t const stopEvent,
int const xccDim,
ConfigOptions const& cfg,
TransferResources& rss)
{
auto cpuStart = std::chrono::high_resolution_clock::now();
int numSubExecs = rss.subExecParamCpu.size();
dim3 const gridSize(xccDim, numSubExecs, 1);
dim3 const blockSize(cfg.gfx.blockSize, 1);
int wordSizeIdx = cfg.gfx.wordSize == 1 ? 0 :
cfg.gfx.wordSize == 2 ? 1 :
2;
auto gpuKernel = GpuKernelTable[cfg.gfx.blockSize/64 - 1][cfg.gfx.unrollFactor - 1][wordSizeIdx][cfg.gfx.temporalMode];
#if defined(__NVCC__)
if (startEvent != NULL)
ERR_CHECK(hipEventRecord(startEvent, stream));
gpuKernel<<<gridSize, blockSize, 0, stream>>>(rss.subExecParamGpuPtr, cfg.gfx.waveOrder, cfg.general.numSubIterations);
if (stopEvent != NULL)
ERR_CHECK(hipEventRecord(stopEvent, stream));
#else
hipExtLaunchKernelGGL(gpuKernel, gridSize, blockSize, 0, stream, startEvent, stopEvent,
0, rss.subExecParamGpuPtr, cfg.gfx.waveOrder, cfg.general.numSubIterations);
#endif
ERR_CHECK(hipStreamSynchronize(stream));
@@ -1876,15 +3086,15 @@ namespace {
ERR_CHECK(hipEventElapsedTime(&gpuDeltaMsec, startEvent, stopEvent));
deltaMsec = gpuDeltaMsec;
}
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration) {
rss.perIterMsec.push_back(deltaMsec);
std::set<std::pair<int,int>> CUs;
for (int i = 0; i < numSubExecs; i++) {
CUs.insert(std::make_pair(rss.subExecParamGpuPtr[i].xccId,
GetId(rss.subExecParamGpuPtr[i].hwId)));
}
rss.perIterCUs.push_back(CUs);
}
}
return ERR_NONE;
@@ -1924,19 +3134,19 @@ namespace {
dim3 const blockSize(cfg.gfx.blockSize, 1);
hipStream_t stream = exeInfo.streams[0];
int wordSizeIdx = cfg.gfx.wordSize == 1 ? 0 :
cfg.gfx.wordSize == 2 ? 1 :
2;
auto gpuKernel = GpuKernelTable[cfg.gfx.blockSize/64 - 1][cfg.gfx.unrollFactor - 1][wordSizeIdx][cfg.gfx.temporalMode];
#if defined(__NVCC__)
if (cfg.gfx.useHipEvents)
ERR_CHECK(hipEventRecord(exeInfo.startEvents[0], stream));
gpuKernel<<<gridSize, blockSize, 0 , stream>>>(exeInfo.subExecParamGpu, cfg.gfx.waveOrder, cfg.general.numSubIterations);
if (cfg.gfx.useHipEvents)
ERR_CHECK(hipEventRecord(exeInfo.stopEvents[0], stream));
#else
hipExtLaunchKernelGGL(gpuKernel, gridSize, blockSize, 0, stream,
cfg.gfx.useHipEvents ? exeInfo.startEvents[0] : NULL,
cfg.gfx.useHipEvents ? exeInfo.stopEvents[0] : NULL, 0,
exeInfo.subExecParamGpu, cfg.gfx.waveOrder, cfg.general.numSubIterations);
@@ -1958,12 +3168,12 @@ namespace {
// Determine timing for each of the individual transfers that were part of this launch
if (!cfg.gfx.useMultiStream) {
for (int i = 0; i < exeInfo.resources.size(); i++) {
TransferResources& rss = exeInfo.resources[i];
long long minStartCycle = std::numeric_limits<long long>::max();
long long maxStopCycle = std::numeric_limits<long long>::min();
std::set<std::pair<int, int>> CUs;
for (auto subExecIdx : rss.subExecIdx) {
minStartCycle = std::min(minStartCycle, exeInfo.subExecParamGpu[subExecIdx].startCycle);
maxStopCycle = std::max(maxStopCycle, exeInfo.subExecParamGpu[subExecIdx].stopCycle);
if (cfg.general.recordPerIteration) {
@@ -1973,10 +3183,10 @@ namespace {
}
double deltaMsec = (maxStopCycle - minStartCycle) / (double)(exeInfo.wallClockRate);
rss.totalDurationMsec += deltaMsec;
if (cfg.general.recordPerIteration) {
rss.perIterMsec.push_back(deltaMsec);
rss.perIterCUs.push_back(CUs);
}
}
}
@@ -1984,7 +3194,6 @@ namespace {
return ERR_NONE;
}
// DMA Executor-related functions // DMA Executor-related functions
//======================================================================================== //========================================================================================
@@ -2020,7 +3229,7 @@ namespace {
// Use HSA async copy
do {
hsa_signal_store_screlease(resources.signal, 1);
if (!useSubIndices) {
ERR_CHECK(hsa_amd_memory_async_copy(resources.dstMem[0], resources.dstAgent,
resources.srcMem[0], resources.srcAgent,
resources.numBytes, 0, NULL,
@@ -2099,12 +3308,16 @@ namespace {
case EXE_CPU: return RunCpuExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
case EXE_GPU_GFX: return RunGpuExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
case EXE_GPU_DMA: return RunDmaExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
#ifdef NIC_EXEC_ENABLED
case EXE_NIC: return RunNicExecutor(iteration, cfg, exeDevice.exeIndex, exeInfo);
#endif
default: return {ERR_FATAL, "Unsupported executor (%d)", exeDevice.exeType};
}
}
} // End of anonymous namespace
//========================================================================================
/// @endcond
ErrResult::ErrResult(ErrType err) : errType(err), errMsg("") {};
@@ -2174,16 +3387,17 @@ namespace {
std::map<ExeDevice, ExeInfo> executorMap;
for (int i = 0; i < transfers.size(); i++) {
Transfer const& t = transfers[i];
ExeDevice exeDevice;
ERR_APPEND(GetActualExecutor(cfg, t.exeDevice, exeDevice), errResults);
TransferResources resource = {};
resource.transferIdx = i;
ExeInfo& exeInfo = executorMap[exeDevice];
exeInfo.totalBytes += t.numBytes;
exeInfo.totalSubExecs += t.numSubExecs;
exeInfo.useSubIndices |= (t.exeSubIndex != -1 || (t.exeDevice.exeType == EXE_GPU_GFX && !cfg.gfx.prefXccTable.empty()));
exeInfo.resources.push_back(resource);
minNumSrcs = std::min(minNumSrcs, (int)t.srcs.size());
maxNumSrcs = std::max(maxNumSrcs, (int)t.srcs.size());
maxNumBytes = std::max(maxNumBytes, t.numBytes);
@@ -2314,8 +3528,8 @@ namespace {
results.tfrResults.resize(transfers.size());
results.numTimedIterations = numTimedIterations;
results.totalBytesTransferred = 0;
results.avgTotalDurationMsec = (totalCpuTimeSec * 1000.0) / (numTimedIterations * cfg.general.numSubIterations);
results.overheadMsec = results.avgTotalDurationMsec;
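// Clarifying note (added commentary): overheadMsec starts at the full average wall-clock time;
// the std::min in the loop below reduces it to avgTotalDurationMsec minus the largest
// per-executor avgDurationMsec, i.e. the time not accounted for by the slowest executor.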
for (auto& exeInfoPair : executorMap) {
ExeDevice const& exeDevice = exeInfoPair.first;
ExeInfo& exeInfo = exeInfoPair.second;
@@ -2323,25 +3537,32 @@ namespace {
// Copy over executor results
ExeResult& exeResult = results.exeResults[exeDevice];
exeResult.numBytes = exeInfo.totalBytes;
exeResult.avgDurationMsec = exeInfo.totalDurationMsec / (numTimedIterations * cfg.general.numSubIterations);
exeResult.avgBandwidthGbPerSec = (exeResult.numBytes / 1.0e6) / exeResult.avgDurationMsec;
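// Illustrative arithmetic (added commentary): 1.0e9 bytes completed in 10 ms gives
// (1.0e9 / 1.0e6) / 10 = 100 GB/s with the formula above.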
exeResult.sumBandwidthGbPerSec = 0.0;
exeResult.transferIdx.clear();
results.totalBytesTransferred += exeInfo.totalBytes;
results.overheadMsec = std::min(results.overheadMsec, (results.avgTotalDurationMsec -
exeResult.avgDurationMsec));
// Copy over transfer results
for (auto const& rss : exeInfo.resources) {
int const transferIdx = rss.transferIdx;
exeResult.transferIdx.push_back(transferIdx);
TransferResult& tfrResult = results.tfrResults[transferIdx];
tfrResult.exeDevice = exeDevice;
#ifdef NIC_EXEC_ENABLED
tfrResult.exeDstDevice = {exeDevice.exeType, rss.dstNicIndex};
#else
tfrResult.exeDstDevice = exeDevice;
#endif
tfrResult.numBytes = rss.numBytes;
tfrResult.avgDurationMsec = rss.totalDurationMsec / numTimedIterations;
tfrResult.avgBandwidthGbPerSec = (rss.numBytes / 1.0e6) / tfrResult.avgDurationMsec;
if (cfg.general.recordPerIteration) {
tfrResult.perIterMsec = rss.perIterMsec;
tfrResult.perIterCUs = rss.perIterCUs;
}
exeResult.sumBandwidthGbPerSec += tfrResult.avgBandwidthGbPerSec;
}
@@ -2382,7 +3603,7 @@ namespace {
{
// Replace any round brackets, '->' arrows, or ':' separators with spaces
for (int i = 1; line[i]; i++)
if (line[i] == '(' || line[i] == ')' || line[i] == '-' || line[i] == ':' || line[i] == '>') line[i] = ' ';
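// Illustrative example (token format assumed from the parsing below): a line such as
// "1 (G0->D1->G1)" becomes "1  G0  D1  G1 ", which the stream extraction below then
// reads as the SRC EXE DST fields.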
transfers.clear();
@@ -2416,17 +3637,18 @@ namespace {
transfer.numSubExecs = numSubExecs;
if (iss.fail()) {
return {ERR_FATAL,
"Parsing error: Unable to read valid Transfer %d (SRC EXE DST) triplet", i+1};
}
transfer.numBytes = 0;
} else {
iss >> srcStr >> exeStr >> dstStr >> transfer.numSubExecs >> numBytesToken;
if (iss.fail()) {
return {ERR_FATAL,
"Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple", i+1};
}
if (sscanf(numBytesToken.c_str(), "%lu", &transfer.numBytes) != 1) {
return {ERR_FATAL,
"Parsing error: Unable to read valid Transfer %d (SRC EXE DST #CU #Bytes) tuple", i+1};
}
char units = numBytesToken.back();
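// Illustrative example (added commentary): a token such as "64M" is read by sscanf as
// numBytes = 64 and units becomes 'M'; the size multiplier is applied by code outside
// this hunk.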
@@ -2440,7 +3662,6 @@ namespace {
ERR_CHECK(ParseMemType(srcStr, transfer.srcs));
ERR_CHECK(ParseMemType(dstStr, transfer.dsts));
ERR_CHECK(ParseExeType(exeStr, transfer.exeDevice, transfer.exeSubIndex));
transfers.push_back(transfer);
}
return ERR_NONE;
@@ -2458,6 +3679,12 @@ namespace {
if (status != hipSuccess) numDetectedGpus = 0;
return numDetectedGpus;
}
#ifdef NIC_EXEC_ENABLED
case EXE_NIC: case EXE_NIC_NEAREST:
{
return GetIbvDeviceList().size();
}
#endif
default:
return 0;
}
@@ -2564,6 +3791,102 @@ namespace {
#endif
}
int GetClosestCpuNumaToNic(int nicIndex)
{
#ifdef NIC_EXEC_ENABLED
int numNics = GetNumExecutors(EXE_NIC);
if (nicIndex < 0 || nicIndex >= numNics) return -1;
return GetIbvDeviceList()[nicIndex].numaNode;
#else
return -1;
#endif
}
int GetClosestNicToGpu(int gpuIndex)
{
#ifdef NIC_EXEC_ENABLED
static bool isInitialized = false;
static std::vector<int> closestNicId;
int numGpus = GetNumExecutors(EXE_GPU_GFX);
if (gpuIndex < 0 || gpuIndex >= numGpus) return -1;
// Build closest NICs per GPU on first use
if (!isInitialized) {
closestNicId.resize(numGpus, -1);
// Build up list of NIC bus addresses
std::vector<std::string> ibvAddressList;
auto const& ibvDeviceList = GetIbvDeviceList();
for (auto const& ibvDevice : ibvDeviceList)
ibvAddressList.push_back(ibvDevice.hasActivePort ? ibvDevice.busId : "");
// Track how many times a device has been assigned as "closest"
// This allows distributing work across devices that use multiple ports (sharing the same busID)
// NOTE: This isn't necessarily optimal, but likely to work in most cases involving multi-port
// Counter example:
//
// G0 prefers (N0,N1), picks N0
// G1 prefers (N1,N2), picks N1
// G2 prefers N0, picks N0
//
// instead of G0->N1, G1->N2, G2->N0
std::vector<int> assignedCount(ibvDeviceList.size(), 0);
// Loop over each GPU to find the closest NIC(s) based on PCIe address
for (int i = 0; i < numGpus; i++) {
// Collect PCIe address for the GPU
char hipPciBusId[64];
hipError_t err = hipDeviceGetPCIBusId(hipPciBusId, sizeof(hipPciBusId), i);
if (err != hipSuccess) {
#ifdef VERBS_DEBUG
printf("Failed to get PCI Bus ID for HIP device %d: %s\n", i, hipGetErrorString(err));
#endif
closestNicId[i] = -1;
continue;
}
// Find closest NICs
std::set<int> closestNicIdxs = GetNearestDevicesInTree(hipPciBusId, ibvAddressList);
// Pick the least-used NIC to assign as closest
int closestIdx = -1;
for (auto idx : closestNicIdxs) {
if (closestIdx == -1 || assignedCount[idx] < assignedCount[closestIdx])
closestIdx = idx;
}
// Fall back to using the distance between bus IDs to determine the closest
// NIC to the GPU only if the PCIe tree approach fails
if (closestIdx < 0) {
#ifdef VERBS_DEBUG
printf("[WARN] Falling back to PCIe bus ID distance to determine proximity\n");
#endif
int minDistance = std::numeric_limits<int>::max();
for (int j = 0; j < ibvDeviceList.size(); j++) {
if (ibvDeviceList[j].busId != "") {
int distance = GetBusIdDistance(hipPciBusId, ibvDeviceList[j].busId);
if (distance < minDistance && distance >= 0) {
minDistance = distance;
closestIdx = j;
}
}
}
}
closestNicId[i] = closestIdx;
if (closestIdx != -1) assignedCount[closestIdx]++;
}
isInitialized = true;
}
return closestNicId[gpuIndex];
#else
return -1;
#endif
}
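// Illustrative usage sketch (not part of the header itself): a caller wanting to pair each GPU
// with nearby hardware could combine the two helpers above, e.g.
//
//   int nicIndex  = GetClosestNicToGpu(gpuIndex);                         // -1 if unavailable
//   int numaIndex = (nicIndex >= 0) ? GetClosestCpuNumaToNic(nicIndex) : -1;
//
// and use nicIndex / numaIndex when selecting an EXE_NIC executor or a CPU NUMA node.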
// Undefine CUDA compatibility macros
#if defined(__NVCC__)