Use fixed point charge spreading on RDNA4 (#4960)

* Use fixed point spread charge on RDNA4 as it is faster

  Even though RDNA4 (gfx12) has global_atomic_add_f32, micro-benchmarks and OpenMM benchmarks show that it is very slow compared to global_atomic_add_u64.

* Add a workaround for fixed point gridSpreadCharge on RDNA4

  Workaround for rare cases in which a few values of pmeGrid are very large and incorrect. The cause is unknown. It is also unknown why this workaround, or other unrelated changes such as adding a printf, makes the problem go away.

Commit 1ce5d91d, authored Jun 08, 2025 by Anton Gorenko, committed by GitHub Jun 07, 2025

Showing with 14 additions and 5 deletions

platforms/common/src/kernels/pme.cc +6 -0

platforms/hip/src/HipContext.cpp +8 -5

--- a/platforms/common/src/kernels/pme.cc
+++ b/platforms/common/src/kernels/pme.cc
@@ -104,6 +104,12 @@ KERNEL void gridSpreadCharge(GLOBAL const real4* RESTRICT posq,
                real add = dzdx*data[iy].y;
 #ifdef USE_FIXED_POINT_CHARGE_SPREADING
                ATOMIC_ADD(&pmeGrid[index], (mm_ulong) realToFixedPoint(add));
+#if defined(__GFX12__)
+                // Workaround for rare cases when few values of pmeGrid are very large and
+                // incorrect. The cause is unknown. Why this workaround or other irrelevant
+                // changes like printf help is also unknown.
+                asm volatile("s_wait_storecnt 0x0");
+#endif
 #else
                ATOMIC_ADD(&pmeGrid[index], add);
 #endif
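
For readers unfamiliar with the fixed point path taken in the hunk above, the following is a minimal host-side C++ sketch of the idea: each contribution is scaled to a 64-bit integer and added with an unsigned 64-bit atomic, so negative values are handled by two's complement wraparound. The scale factor of 2^32 and the names `realToFixedPointSketch`, `spreadFixedPoint`, and `fixedPointToReal` are assumptions made for illustration; they are not taken from this commit, and the real kernel uses `realToFixedPoint` and `ATOMIC_ADD` on the GPU.

```cpp
#include <atomic>
#include <cstdint>

// Assumed scale: 32 fractional bits. The actual constant used by the
// kernel is not shown in this diff.
constexpr double kFixedPointScale = 4294967296.0; // 2^32

// Hypothetical stand-in for realToFixedPoint(): scale and truncate to
// a signed 64-bit integer.
inline std::int64_t realToFixedPointSketch(double x) {
    return static_cast<std::int64_t>(x * kFixedPointScale);
}

// Add one contribution to a grid cell. The signed value is reinterpreted
// as unsigned, so a negative contribution wraps modulo 2^64 and the sum
// still comes out right when read back as signed.
inline void spreadFixedPoint(std::atomic<std::uint64_t>& cell, double add) {
    cell.fetch_add(static_cast<std::uint64_t>(realToFixedPointSketch(add)),
                   std::memory_order_relaxed);
}

// Convert an accumulated cell back to a real value.
inline double fixedPointToReal(std::uint64_t raw) {
    return static_cast<double>(static_cast<std::int64_t>(raw)) / kFixedPointScale;
}
```

Because integer addition is associative, the accumulated value in this sketch does not depend on the order in which contributions arrive, which is not true of floating point atomic adds.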

--- a/platforms/hip/src/HipContext.cpp
+++ b/platforms/hip/src/HipContext.cpp
@@ -174,11 +174,14 @@ HipContext::HipContext(const System& system, int deviceIndex, bool useBlockingSy

    // GPUs starting from CDNA1 and RDNA3 support atomic add for floats (global_atomic_add_f32),
    // which can be used in PME. Older GPUs use fixed point charge spreading instead.
-    this->supportsHardwareFloatGlobalAtomicAdd = true;
-    if (gpuArchitecture.find("gfx900") == 0 ||
-        gpuArchitecture.find("gfx906") == 0 ||
-        gpuArchitecture.find("gfx10") == 0) {
+    // RDNA4 also has this instruction but benchmarks show that it is very slow compared to
+    // global_atomic_add_u64.
    this->supportsHardwareFloatGlobalAtomicAdd = false;
+    if (gpuArchitecture.find("gfx908") == 0 ||
+        gpuArchitecture.find("gfx90a") == 0 ||
+        gpuArchitecture.find("gfx94") == 0 ||
+        gpuArchitecture.find("gfx11") == 0) {
+        this->supportsHardwareFloatGlobalAtomicAdd = true;
    }

    contextIsValid = true;
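
The change above flips the check from a deny-list of old architectures to an allow-list of architectures where the float atomic is used. As a rough standalone sketch of that selection logic, the helper below tests an architecture string against the same four prefixes; the function name `usesHardwareFloatAtomicAdd` is hypothetical and does not exist in `HipContext.cpp`, where the result is stored in `supportsHardwareFloatGlobalAtomicAdd` instead.

```cpp
#include <string>

// Hypothetical helper mirroring the allow-list in the diff: returns true
// only for architectures whose name starts with one of the listed prefixes.
inline bool usesHardwareFloatAtomicAdd(const std::string& arch) {
    static const char* const allowList[] = {"gfx908", "gfx90a", "gfx94", "gfx11"};
    for (const char* prefix : allowList) {
        // rfind(prefix, 0) == 0 is a starts-with test, equivalent to the
        // find(prefix) == 0 checks in the original code.
        if (arch.rfind(prefix, 0) == 0) {
            return true;
        }
    }
    return false; // everything else, including gfx12 (RDNA4), uses fixed point
}
```

With an allow-list, an architecture that is not named explicitly falls back to fixed point charge spreading, which is how gfx12 ends up on the `global_atomic_add_u64` path without being listed.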