Improve latency, stream and event handling, and multi-GPU support
Use a small kernel for copying interactionCounts to host memory
A device-to-host hipMemcpy (hipMemcpyDeviceToHost) has higher latency
than such a kernel.
Do not set blocking/spin-related flags on streams and events
Let the runtime choose the best option, because overriding it does not
improve performance in most cases.
Stop using NULL streams; use explicitly created non-blocking streams
Make HipContext::pushAsCurrent/popAsCurrent thread-safe, as they can be
called simultaneously from different threads via ContextSelector.
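One way to make push/pop safe under concurrent ContextSelector use, sketched here under assumptions, is to keep each thread's saved previous device in thread-local storage so threads cannot overwrite each other's state. `ContextSketch`, `SelectorSketch`, and the simulated `currentDevice` variable are hypothetical stand-ins for HipContext, ContextSelector, and the per-thread current device that the HIP runtime tracks; the actual change may use a different mechanism, such as a mutex.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Simulated per-thread "current device". The real HIP runtime tracks the
// current device per host thread; this variable is a hypothetical stand-in.
thread_local int currentDevice = 0;

// Hypothetical stand-in for HipContext.
class ContextSketch {
public:
    explicit ContextSketch(int device) : deviceIndex(device) {}

    void pushAsCurrent() {
        // Save the caller's device on this thread's own stack.
        savedDevices().push_back(currentDevice);
        currentDevice = deviceIndex;
    }

    void popAsCurrent() {
        std::vector<int>& saved = savedDevices();
        assert(!saved.empty());
        currentDevice = saved.back();
        saved.pop_back();
    }

private:
    // One stack per thread, so concurrent push/pop calls from different
    // threads never read or overwrite each other's saved devices.
    static std::vector<int>& savedDevices() {
        thread_local std::vector<int> stack;
        return stack;
    }

    int deviceIndex;
};

// RAII helper in the spirit of ContextSelector (hypothetical).
class SelectorSketch {
public:
    explicit SelectorSketch(ContextSketch& ctx) : context(ctx) {
        context.pushAsCurrent();
    }
    ~SelectorSketch() { context.popAsCurrent(); }
    SelectorSketch(const SelectorSketch&) = delete;
    SelectorSketch& operator=(const SelectorSketch&) = delete;

private:
    ContextSketch& context;
};
```

Because the saved-device stack is per thread, nested selectors restore devices in LIFO order on each thread independently.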
Allow peer access to be enabled more than once (e.g. when multiple
simulations run one after another, as in benchmark.py).
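A minimal sketch of idempotent enabling, assuming the runtime reports a repeated request with an "already enabled" status (HIP's `hipDeviceEnablePeerAccess` returns `hipErrorPeerAccessAlreadyEnabled` in that case). The runtime here is simulated: `FakeRuntime`, `Status`, and `enablePeerAccessIdempotent` are hypothetical names, not HIP or OpenMM APIs.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <utility>

// Hypothetical status codes standing in for hipSuccess and
// hipErrorPeerAccessAlreadyEnabled.
enum class Status { Success, PeerAccessAlreadyEnabled, OtherError };

// Simulated runtime: remembers which (device, peer) pairs are enabled.
class FakeRuntime {
public:
    Status enablePeerAccess(int device, int peer) {
        if (device == peer) return Status::OtherError;
        bool inserted = enabled.insert({device, peer}).second;
        return inserted ? Status::Success : Status::PeerAccessAlreadyEnabled;
    }

private:
    std::set<std::pair<int, int>> enabled;
};

// "Already enabled" is treated as success, so a second simulation in the
// same process does not fail; any other error is still reported.
void enablePeerAccessIdempotent(FakeRuntime& rt, int device, int peer) {
    Status s = rt.enablePeerAccess(device, peer);
    if (s != Status::Success && s != Status::PeerAccessAlreadyEnabled)
        throw std::runtime_error("failed to enable peer access");
}
```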
Create peerCopyStream on the corresponding device
Use two-speed load balancing for multi-GPU runs
The first 100 steps do coarse balancing; the next 100 steps do fine
tuning. Also, if the slowest device (usually device 0) has its fraction
reduced to 0 (i.e. no more work can be transferred from it to other
devices), ignore it and balance the remaining devices.
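The balancing rule can be sketched as follows, under assumptions. Each step, a slice of work moves from the slowest device that still has a nonzero fraction to the fastest device; the slice is larger during the first 100 steps and smaller afterwards. The step sizes (0.01 and 0.001) and the name `balanceStep` are hypothetical, and since the commit does not say what happens after step 200, this sketch simply keeps using the fine step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical step sizes: the commit only says "coarse" and then "fine".
constexpr double kCoarseStep = 0.01;
constexpr double kFineStep = 0.001;
constexpr int kCoarseSteps = 100;

// completionTimes[i] is how long device i took on the last step;
// fractions[i] is the share of the work assigned to device i.
void balanceStep(int step, const std::vector<double>& completionTimes,
                 std::vector<double>& fractions) {
    int fastest = -1, slowest = -1;
    for (std::size_t i = 0; i < fractions.size(); ++i) {
        int idx = static_cast<int>(i);
        if (fastest < 0 || completionTimes[i] < completionTimes[fastest])
            fastest = idx;
        // Skip devices whose fraction is already 0: nothing can be moved off them.
        if (fractions[i] > 0.0 &&
            (slowest < 0 || completionTimes[i] > completionTimes[slowest]))
            slowest = idx;
    }
    if (fastest < 0 || slowest < 0 || fastest == slowest) return;
    double delta = (step < kCoarseSteps) ? kCoarseStep : kFineStep;
    delta = std::min(delta, fractions[slowest]);  // never go below 0
    fractions[slowest] -= delta;
    fractions[fastest] += delta;
}
```

In the three-device case, a slow device 0 with no work left is passed over, and work moves from device 1 to device 2 instead.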
Do not download interactionCounts in parallel nonbonded tasks
This is not required because updateNeighborListSize has already been
called and the valid flag has been changed.
Initialize tilesAfterReorder properly
Otherwise it may contain a garbage value, and if that value is large,
updateNeighborListSize may fail to force atom reordering after 25 steps
in extreme cases.
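The failure mode can be illustrated with a sketch. The check below is a hypothetical reconstruction of the kind of logic described: record a baseline tile count right after a reorder, then force a reorder if the count grows past a threshold after 25 steps. The 10% growth threshold and every name except `tilesAfterReorder` are assumptions. A default member initializer guarantees that the baseline starts at 0 instead of garbage.

```cpp
#include <cassert>

// Hypothetical neighbor-list bookkeeping; only tilesAfterReorder is a name
// taken from the commit message.
struct NeighborListState {
    int tilesAfterReorder = 0;  // default member initializer: never garbage
};

// Returns true if atoms should be reordered (hypothetical rule).
bool updateNeighborListSizeSketch(NeighborListState& s, int stepsSinceReorder,
                                  int currentTiles) {
    if (stepsSinceReorder == 0 || s.tilesAfterReorder == 0) {
        s.tilesAfterReorder = currentTiles;  // record the baseline
        return false;
    }
    // Force a reorder once the tile count has grown by more than 10%.
    return stepsSinceReorder > 25 && currentTiles > 1.1 * s.tilesAfterReorder;
}
```

With a large garbage value, the baseline is never re-recorded and the growth test never passes, so the reorder is never forced.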