Optimize findInteractingBlocks
Optimize findBlocksWithInteractions
* Replace volatile shared mem accesses with shuffles;
* Add NUM_TILES_IN_BATCH for processing block1 by multiple warps
(for small systems);
* Cherry-pick missing changes from .cu;
* Tune MAX_BITS_FOR_PAIRS depending on device and the system size;
* Store single pairs immediately (if there are any); this avoids
storing flags in shared memory and filtering buffer and flagsBuffer
after the single pairs are saved;
* Use fma explicitly and sign bit for better device code;
* Use CDNA's MFMA with single/mixed precision;
* On CDNA, the coarse-grained stage processes warpSize blocks for
one block1, and the fine-grained stage checks atoms of two block2
against atoms of the same block1; singlePairs and interactingAtoms
are also stored by warps, not half-warps;
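
As an illustration of the "fma explicitly and sign bit" item, here is a minimal host-side C++ sketch. It is not the kernel code from this commit: the helper names `dist2Fma` and `withinCutoff` are hypothetical, and reading the cutoff test off the sign bit of a difference is only one plausible interpretation of that bullet.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: squared length of (dx, dy, dz), accumulated with
// explicit fused multiply-adds instead of separate multiplies and adds.
inline float dist2Fma(float dx, float dy, float dz) {
    return std::fma(dx, dx, std::fma(dy, dy, dz * dz));
}

// Hypothetical helper: returns true when dist2 < cutoff2. The decision is
// read from the sign bit of (dist2 - cutoff2) rather than from a floating
// point comparison. Assumes finite inputs; NaN is not handled.
inline bool withinCutoff(float dx, float dy, float dz, float cutoff2) {
    float diff = dist2Fma(dx, dy, dz) - cutoff2;
    std::uint32_t bits;
    std::memcpy(&bits, &diff, sizeof bits);  // well-defined bit reinterpretation
    return (bits >> 31) != 0;                // sign bit set means diff < 0
}
```

On a GPU the same idea would map to the device `fma` intrinsic and integer operations on the reinterpreted float, which can let several cutoff tests be combined with bitwise operations.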
Optimize findBlockBounds
* Use shuffles;
* Use executeKernelFlat;
* Process 2 tiles per 64-thread warp on CDNA;
* Use more uniformly distributed keys when sorting blocks;
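
The "2 tiles per 64-thread warp" item can be pictured with a small host-side C++ sketch of the index arithmetic. It assumes a tile size of 32 atoms; `TILE_SIZE`, `LaneSlot`, and `laneSlot` are hypothetical names for illustration, not identifiers from the kernels.

```cpp
// Assumed tile size (atoms per block); hypothetical constant.
constexpr int TILE_SIZE = 32;

// Which tile a thread works on, and its position inside that tile.
struct LaneSlot {
    int tile;
    int indexInTile;
};

// Hypothetical mapping: each warp covers warpSize / TILE_SIZE consecutive
// tiles, so a 64-thread warp handles 2 tiles and a 32-thread warp handles 1.
constexpr LaneSlot laneSlot(int globalThread, int warpSize) {
    const int tilesPerWarp = warpSize / TILE_SIZE;
    const int warp = globalThread / warpSize;
    const int lane = globalThread % warpSize;
    return LaneSlot{warp * tilesPerWarp + lane / TILE_SIZE, lane % TILE_SIZE};
}
```

With `warpSize == 64`, lanes 0..31 of a warp map to one tile and lanes 32..63 to the next, so no half of the wave sits idle.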
Use compareInt2LargeSIMD when tile size < SIMD width
Fix exclusion tiles sorting on AMD CDNA (64 threads per wave)
The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList),
so host code must also check it instead of useCutoff.
See also https://github.com/openmm/openmm/issues/3462