platforms/hip/include/HipContext.h · a96534c1a9af93fc7bd51d7da1ab68d422ee43e9 · tsoc / openmm

Optimize findInteractingBlocks · a96534c1

Anton Gorenko authored Aug 25, 2024

Optimize findBlocksWithInteractions

* Replace volatile shared mem accesses with shuffles;
* Add NUM_TILES_IN_BATCH for processing block1 by multiple warps
  (for small systems);
* Cherry-pick missing changes from .cu;
* Tune MAX_BITS_FOR_PAIRS depending on device and the system size;
* Store single pairs immediately (if there are any), this allows not to
  store flags to shared memory and filter buffer and flagsBuffer after
  saving single pairs;
* Use fma explicitly and sign bit for better device code;
* Use CDNA's MFMA with singe/mixed precision;
* On CDNA the coarse grained stage processes warpSize blocks for
  one block1, the fine grained stage checks atoms of two block2 vs atoms
  of the same block1, singlePairs and interactingAtoms are also stored
  by warps, not half-warps;

Optimize findBlockBounds

* Use shuffles;
* Use executeKernelFlat;
* Process 2 tiles per warp 64 on CDNA;
* Use more uniformly distributed keys when sorting blocks;

Use compareInt2LargeSIMD when tile size < SIMD width

Fix exclusion tiles sorting on AMD CDNA (64 threads per wave)

    The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList)
    so host code also must check it instead of useCutoff.

    See also https://github.com/openmm/openmm/issues/3462

a96534c1

HipContext.h 24.3 KB

Replace HipContext.h