-
Anton Gorenko authored
* Compile with -munsafe-fp-atomics to enable fast hardware f32 atomic add on global memory on pre-MI100 GPUs; * Use fixed point charge spreading on other GPUs, otherwise float atomic add will be compiled as a slow CAS loop; * Tune block sizes, use executeKernelFlat; * Tune launch bounds of PME grid-related kernels: force the compiler to use all registers by limiting max waves per EU to 1.
a0acfbc9