- 06 May, 2026 1 commit
-
-
one authored
* Use bitwise prefix accounting when storing sparse interactions as single pairs in the HIP pair-list kernel. This reduces the number of ballot operations needed to compute per-lane single-pair offsets.
* For HIP CDNA single precision, raise MAX_BITS_FOR_PAIRS to 8 so more sparse interactions are emitted as single pairs instead of full tiles. Keep the existing double precision and RDNA thresholds unchanged.
* Also simplify the HIP LJPME direct correction by computing alpha^2*r2
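The idea of deriving a per-lane offset from a ballot mask can be sketched on the host as below. This is only an illustration, not the kernel code from the commit: `popcount64` and `laneOffset` are hypothetical helper names, a 64-bit mask stands in for the result of a wave-wide ballot, and the sketch assumes each lane contributes a single flag.

```cpp
#include <bitset>
#include <cstdint>

// Portable stand-in for a device popcount on a 64-bit ballot mask.
inline int popcount64(std::uint64_t x) {
    return static_cast<int>(std::bitset<64>(x).count());
}

// Exclusive prefix count for one lane: the number of lanes with an index
// lower than `lane` whose flag bit is set in `mask`. `lane` must be in [0, 63].
// A lane whose own flag is set would write its item at this offset.
inline int laneOffset(std::uint64_t mask, int lane) {
    const std::uint64_t below = mask & ((std::uint64_t{1} << lane) - 1);
    return popcount64(below);
}
```

With a mask of `0b10110` (lanes 1, 2 and 4 set), lanes 1, 2 and 4 get offsets 0, 1 and 2, so one ballot plus one popcount per lane is enough to compact their items without gaps.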
-
- 19 Feb, 2026 1 commit
-
-
Peter Eastman authored
* Fixed issue that caused inefficient sorting when a block contained only one atom
* Add the fix to OpenCL and HIP
-
- 14 Dec, 2025 1 commit
-
-
Anton Gorenko authored
* Remove std::enable_if, warpRotateLeft is always used with TILE_SIZE
* Do not use built-in warpSize in constexpr contexts. Starting from ROCm 7, warpSize is no longer constexpr. findInteractingBlocks.hip uses it for sizes of __shared__ arrays.
* Check if hipHostMallocNumaUser is allowed before using it
-
- 05 Sep, 2024 3 commits
-
-
Anton Gorenko authored
Skip neighbor list for very small systems https://github.com/openmm/openmm/pull/4070
Store bounding box sizes in half precision https://github.com/openmm/openmm/commit/2ae50f9
Use large blocks to optimize building the neighbor list https://github.com/openmm/openmm/commit/3955033
Improved sorting of blocks when building neighbor list https://github.com/openmm/openmm/commit/796ffaa
Fixed bug in large blocks optimization with triclinic boxes https://github.com/openmm/openmm/commit/4c10732
Optimize sorting of non-uniformly distributed data https://github.com/openmm/openmm/commit/71d9bb1
Co-authored-by: bdenhollander <44237618+bdenhollander@users.noreply.github.com>
-
Anton Gorenko authored
Use a small kernel for copying interactionCounts to host memory
hipMemcpy's CopyDeviceToHost operation has higher latency.

Do not set stream and event blocking/spin related flags
Let the runtime choose the best option because overriding does not improve performance in most cases.

Remove NULL streams and use nonblocking streams explicitly

Make HipContext::pushAsCurrent/popAsCurrent thread-safe as they can be called simultaneously from different threads via ContextSelector.

Allow peer access to be enabled more than once (if there are multiple simulations one after another, like in benchmark.py).

Create peerCopyStream on a corresponding device

Use two-speed load balancing for multi GPU runs
The first 100 steps do coarse balancing; the next 100 do fine tuning. Also ignore the slowest device (usually 0) if its fraction has reached 0 (i.e. no work can be transferred to other devices) and balance the other devices.

Do not download interactionCounts in parallel nonbonded tasks
This is not required because updateNeighborListSize has been called and the valid flag changed.

Initialize tilesAfterReorder properly
It may contain a garbage value, and if it is large then updateNeighborListSize does not force a reorder of atoms after 25 steps in extreme cases.
-
Anton Gorenko authored
Optimize findBlocksWithInteractions
* Replace volatile shared mem accesses with shuffles;
* Add NUM_TILES_IN_BATCH for processing block1 by multiple warps (for small systems);
* Cherry-pick missing changes from .cu;
* Tune MAX_BITS_FOR_PAIRS depending on device and the system size;
* Store single pairs immediately (if there are any); this avoids storing flags to shared memory and filtering buffer and flagsBuffer after saving single pairs;
* Use fma explicitly and sign bit for better device code;
* Use CDNA's MFMA with single/mixed precision;
* On CDNA the coarse grained stage processes warpSize blocks for one block1, the fine grained stage checks atoms of two block2 vs atoms of the same block1, singlePairs and interactingAtoms are also stored by warps, not half-warps;

Optimize findBlockBounds
* Use shuffles;
* Use executeKernelFlat;
* Process 2 tiles per 64-thread warp on CDNA;
* Use more uniformly distributed keys when sorting blocks;

Use compareInt2LargeSIMD when tile size < SIMD width

Fix exclusion tiles sorting on AMD CDNA (64 threads per wave)
The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList), so host code must also check it instead of useCutoff.
See also https://github.com/openmm/openmm/issues/3462
-
- 01 Sep, 2024 1 commit
-
-
Anton Gorenko authored
Port changes in CUDA backend to HIP
Fix a warning about arithmetic operations on void* in HipArray::uploadSubArray
Fix "Error Initializing context ROCm 5.3.0" https://github.com/StreamHPC/openmm-hip/issues/3
hipDeviceSetCacheConfig returns hipErrorNotSupported on 5.3
Co-authored-by: Nick Curtis <nicholas.curtis@amd.com>
-
- 14 Dec, 2023 1 commit
-
-
Peter Eastman authored
-
- 11 Dec, 2023 1 commit
-
-
Peter Eastman authored
* Improved sorting of blocks when building neighbor list
* Improved block sorting for OpenCL
* Made sort keys more evenly distributed
-
- 24 Jul, 2023 1 commit
-
-
Peter Eastman authored
* Use large blocks to optimize building the neighbor list
* Large blocks optimization for OpenCL
* Fix test failures
* Select whether to use large blocks based on system size
-
- 14 May, 2023 1 commit
-
-
Peter Eastman authored
* Store bounding box sizes in half precision
* Work correctly in double precision mode
-
- 27 Jan, 2022 1 commit
-
-
Peter Eastman authored
* Fixed potential invalid memory access
* Fixed exception
-
- 11 Mar, 2021 1 commit
-
-
Peter Eastman authored
-
- 18 Feb, 2021 1 commit
-
-
Peter Eastman authored
-
- 28 Jan, 2021 1 commit
-
-
David Clark authored
* Frames distance calculation as matrix multiplication
* Adds comment explaining distance calculation
* Tunes launch bound for cuda11.2
* Simplifies the effective matrix multiplication
Co-authored-by: David Clark <daclark@nvidia.com>
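One common way to frame pairwise distances as a matrix multiplication uses the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a·b, where the a·b terms for all pairs form a matrix product. The host-side sketch below illustrates that identity only; it is not the kernel code from this commit, `Vec3`, `dot` and `pairwiseDist2` are hypothetical names, and periodic boundary conditions are ignored.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

inline double dot(const Vec3& u, const Vec3& v) {
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
}

// Returns squared distances between every a[i] and b[j] as a row-major
// a.size() x b.size() matrix. The dot(a[i], b[j]) terms are the entries of
// the matrix product A * B^T; the squared norms are added afterwards.
inline std::vector<double> pairwiseDist2(const std::vector<Vec3>& a,
                                         const std::vector<Vec3>& b) {
    std::vector<double> normA(a.size()), normB(b.size());
    for (std::size_t i = 0; i < a.size(); ++i) normA[i] = dot(a[i], a[i]);
    for (std::size_t j = 0; j < b.size(); ++j) normB[j] = dot(b[j], b[j]);

    std::vector<double> d2(a.size() * b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            d2[i * b.size() + j] = normA[i] + normB[j] - 2.0 * dot(a[i], b[j]);
    return d2;
}
```

Writing the inner term as a matrix product is what lets a GPU implementation reuse fused multiply-add or matrix hardware for the bulk of the work.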
-
- 10 Dec, 2020 1 commit
-
-
David Clark authored
* Changes name of NVRTC program
* Adds launch bounds for findInteractingBlocks
* Replaces launch bound parameter with named constant
Co-authored-by: David Clark <daclark@nvidia.com>
-
- 25 Sep, 2020 1 commit
-
-
peastman authored
-
- 16 Sep, 2020 1 commit
-
-
peastman authored
-
- 20 Aug, 2020 1 commit
-
-
peastman authored
* Fixed range overflow with very large numbers of atoms
* More fixes to overflow with large numbers of atoms
* Fix test failures
-
- 04 Oct, 2019 1 commit
-
-
Peter Eastman authored
-
- 03 Oct, 2019 1 commit
-
-
Peter Eastman authored
-
- 03 May, 2018 1 commit
-
-
peastman authored
-
- 21 Sep, 2017 1 commit
-
-
peastman authored
-
- 10 Jan, 2017 1 commit
-
-
Peter Eastman authored
-
- 02 Dec, 2016 1 commit
-
-
Peter Eastman authored
-
- 18 Oct, 2016 1 commit
-
-
Peter Eastman authored
-
- 13 Oct, 2016 1 commit
-
-
Peter Eastman authored
-
- 22 Sep, 2016 1 commit
-
-
Peter Eastman authored
-
- 14 Sep, 2016 1 commit
-
-
Peter Eastman authored
-
- 19 Aug, 2016 2 commits
-
-
Peter Eastman authored
-
Peter Eastman authored
-
- 06 Mar, 2015 1 commit
-
-
peastman authored
-
- 05 Jan, 2015 1 commit
-
-
Peter Eastman authored
-
- 10 Nov, 2014 1 commit
-
-
peastman authored
-
- 09 Sep, 2014 1 commit
-
-
peastman authored
-
- 08 Sep, 2014 1 commit
-
-
peastman authored
-
- 04 Jun, 2013 1 commit
-
-
peastman authored
Converted the array containing atom block indices for the neighbor list from ushort2 to int. This removes the hard limit of 2 million atoms.
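The arithmetic behind such a limit can be sketched as follows, assuming 32 atoms per block (an assumption, not stated in the commit message). `maxAtomsWithIndexType` is a hypothetical helper that counts how many atoms can be addressed when block indices are stored in a given integer type.

```cpp
#include <cstdint>
#include <limits>

// Number of atoms addressable when each block index is stored as `Index`
// and every block holds `atomsPerBlock` atoms. Only non-negative index
// values are counted.
template <typename Index>
constexpr std::uint64_t maxAtomsWithIndexType(std::uint64_t atomsPerBlock) {
    return (static_cast<std::uint64_t>(std::numeric_limits<Index>::max()) + 1) *
           atomsPerBlock;
}
```

Under that assumption, a 16-bit unsigned index gives 65536 * 32 = 2097152 atoms, which matches the roughly 2 million figure, while a 32-bit signed index raises the bound to 2147483648 * 32 atoms.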
-
- 16 May, 2013 1 commit
-
-
Yutong Zhao authored
-
- 03 May, 2013 1 commit
-
-
Peter Eastman authored
-
- 24 Apr, 2013 1 commit
-
-
Peter Eastman authored
-