Commits · 939ecf28a4bee6c45e37a9e64ff53df3e2e0fd79 · tsoc / openmm

06 May, 2026 1 commit

Optimize HIP pair-list handling for CDNA LJPME · 939ecf28

one authored May 06, 2026

- Use bitwise prefix accounting when storing sparse interactions as single pairs in the HIP pair-list kernel. This reduces the number of ballot operations needed to compute per-lane single-pair offsets.
- For HIP CDNA single precision, raise MAX_BITS_FOR_PAIRS to 8 so more sparse interactions are emitted as single pairs instead of full tiles. Keep the existing double precision and RDNA thresholds unchanged.
- Also simplify the HIP LJPME direct correction by computing alpha^2*r2

939ecf28
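
The first two points of this commit can be illustrated on the host. What follows is a minimal C++ sketch, not the HIP kernel itself. It assumes that a 64-lane ballot result is held in a `uint64_t`, that each lane's output offset is the number of set bits below it, and that a tile is stored as single pairs when its number of set interaction flags does not exceed the threshold. The helper names `lanePrefixCount` and `storeAsSinglePairs` are hypothetical.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical helper: exclusive prefix count for one lane.
// Counts the set bits of ballotMask that belong to lanes below `lane`
// (valid for lane in 0..63), so one mask yields every lane's offset.
inline int lanePrefixCount(std::uint64_t ballotMask, int lane) {
    const std::uint64_t lowerLanes = ballotMask & ((std::uint64_t{1} << lane) - 1);
    return static_cast<int>(std::bitset<64>(lowerLanes).count());
}

// Hypothetical helper: decide whether a sparse tile is emitted as single pairs.
// Assumes the decision is "at least one flag set, and at most maxBitsForPairs".
inline bool storeAsSinglePairs(std::uint32_t interactionFlags, int maxBitsForPairs) {
    const int bits = static_cast<int>(std::bitset<32>(interactionFlags).count());
    return bits > 0 && bits <= maxBitsForPairs;
}
```

Under these assumptions, a threshold of 8 sends a tile with eight flagged interactions to the single-pair list, while a tile with nine is still stored as a full tile.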

19 Feb, 2026 1 commit
- Fixed issue that caused inefficient sorting when a block contained only one atom (#5215) · 2c287f10
  Peter Eastman authored Feb 19, 2026
```
* Fixed issue that caused inefficient sorting when a block contained only one atom

* Add the fix to OpenCL and HIP
```
  2c287f10
14 Dec, 2025 1 commit

Support ROCm 7 (#5162) · 07b738c5

Anton Gorenko authored Dec 14, 2025

* Remove std::enable_if; warpRotateLeft is always used with TILE_SIZE

* Do not use built-in warpSize in constexpr contexts

Starting with ROCm 7, warpSize is no longer constexpr.
findInteractingBlocks.hip uses it for sizes of __shared__ arrays.

* Check if hipHostMallocNumaUser is allowed before using it

07b738c5
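
The `warpSize` point reflects a general C++ rule: array extents must be constant expressions, so a value that is only known at run time cannot size a static array. Below is a minimal host-side sketch of the usual workaround, assuming the width is supplied at build time through a macro. The macro `SKETCH_WARP_SIZE` and the names `kWarpSize`, `WarpScratch` and `warpScratchSlots` are hypothetical and are not taken from the HIP platform sources.

```cpp
#include <array>
#include <cstddef>

// Assumption: the build system defines the wave width; default to 64 here.
#ifndef SKETCH_WARP_SIZE
#define SKETCH_WARP_SIZE 64
#endif

// A true compile-time constant, usable wherever an array extent is required.
constexpr int kWarpSize = SKETCH_WARP_SIZE;

// One slot per lane; std::array requires a constant-expression extent,
// just as a __shared__ array does in device code.
using WarpScratch = std::array<int, kWarpSize>;

// Number of slots, computed entirely at compile time.
constexpr std::size_t warpScratchSlots() {
    return std::tuple_size<WarpScratch>::value;
}
```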

05 Sep, 2024 3 commits

Port changes from the main repository (mostly related to large blocks) · 7d7490ea

Anton Gorenko authored Aug 25, 2024

Skip neighbor list for very small systems

    https://github.com/openmm/openmm/pull/4070

Store bounding box sizes in half precision

    https://github.com/openmm/openmm/commit/2ae50f9

Use large blocks to optimize building the neighbor list

    https://github.com/openmm/openmm/commit/3955033

Improved sorting of blocks when building neighbor list

    https://github.com/openmm/openmm/commit/796ffaa

Fixed bug in large blocks optimization with triclinic boxes

    https://github.com/openmm/openmm/commit/4c10732

Optimize sorting of non-uniformly distributed data

    https://github.com/openmm/openmm/commit/71d9bb1

Co-authored-by: bdenhollander <44237618+bdenhollander@users.noreply.github.com>

7d7490ea

Improve latencies, handling of streams and events, multi-GPU support · 70771a51

Anton Gorenko authored Aug 25, 2024

Use a small kernel for copying interactionCounts to host memory

    hipMemcpy's CopyDeviceToHost operation has higher latency.

Do not set stream and event blocking/spin related flags

    Let the runtime choose the best option because overriding does not
    improve performance in most cases.

Remove NULL streams and use nonblocking streams explicitly

Make HipContext::pushAsCurrent/popAsCurrent thread-safe as they can be
called simultaneously from different threads via ContextSelector.

Allow peer access to be enabled more than once (if there are multiple
simulations one after another, like in benchmark.py).

Create peerCopyStream on the corresponding device

Use two-speed load balancing for multi GPU runs

    The first 100 steps do coarse balancing, the next 100 do fine tuning.
    Also ignore the slowest device (usually 0) if its fraction has
    reached 0 (i.e. no work can be transferred to other devices) and
    balance the other devices.

Do not download interactionCounts in parallel nonbonded tasks

    This is not required because updateNeighborListSize has been called
    and the valid flag has changed.

Initialize tilesAfterReorder properly

    It may contain a garbage value, and if that value is large then
    updateNeighborListSize does not force atoms to be reordered after
    25 steps in extreme cases.

70771a51
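
The two-speed load balancing described in this commit can be sketched as follows. This is a hedged illustration only: the commit states the 100-step coarse and 100-step fine phases and the rule about a slowest device whose fraction has reached 0, but the step sizes (`0.01` and `0.001`), the function names, and the "move work from slowest to fastest" rule are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical step sizes: coarse for steps 0..99, fine for 100..199, then stop.
inline double balanceStepSize(int stepIndex) {
    if (stepIndex < 100) return 0.01;
    if (stepIndex < 200) return 0.001;
    return 0.0;
}

// Moves a small share of work from the slowest device that still has work
// to the fastest device. `fractions` holds each device's share of the work,
// `stepTimes` the time each device took on the last step.
inline void rebalance(std::vector<double>& fractions,
                      const std::vector<double>& stepTimes,
                      int stepIndex) {
    const double step = balanceStepSize(stepIndex);
    const std::size_t n = fractions.size();
    if (step == 0.0 || n < 2 || stepTimes.size() != n) return;
    std::size_t slowest = n;  // n means "not found yet"
    std::size_t fastest = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // A device whose fraction is already 0 cannot give work away, so it is ignored here.
        if (fractions[i] > 0.0 && (slowest == n || stepTimes[i] > stepTimes[slowest])) slowest = i;
        if (stepTimes[i] < stepTimes[fastest]) fastest = i;
    }
    if (slowest == n || slowest == fastest) return;
    const double moved = std::min(step, fractions[slowest]);
    fractions[slowest] -= moved;
    fractions[fastest] += moved;
}
```

In this sketch a device with a zero fraction is never chosen as the source, so balancing continues among the remaining devices, which is the behaviour the commit message describes.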

Optimize findInteractingBlocks · a96534c1

Anton Gorenko authored Aug 25, 2024

Optimize findBlocksWithInteractions

* Replace volatile shared mem accesses with shuffles;
* Add NUM_TILES_IN_BATCH for processing block1 by multiple warps
  (for small systems);
* Cherry-pick missing changes from .cu;
* Tune MAX_BITS_FOR_PAIRS depending on device and the system size;
* Store single pairs immediately (if there are any); this avoids
  storing flags to shared memory and filtering buffer and flagsBuffer
  after saving single pairs;
* Use fma explicitly and sign bit for better device code;
* Use CDNA's MFMA with single/mixed precision;
* On CDNA the coarse grained stage processes warpSize blocks for
  one block1, the fine grained stage checks atoms of two block2 vs atoms
  of the same block1, singlePairs and interactingAtoms are also stored
  by warps, not half-warps;

Optimize findBlockBounds

* Use shuffles;
* Use executeKernelFlat;
* Process 2 tiles per 64-thread warp on CDNA;
* Use more uniformly distributed keys when sorting blocks;

Use compareInt2LargeSIMD when tile size < SIMD width

Fix exclusion tiles sorting on AMD CDNA (64 threads per wave)

    The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList)
    so host code also must check it instead of useCutoff.

    See also https://github.com/openmm/openmm/issues/3462

a96534c1

01 Sep, 2024 1 commit

Add hipification of CUDA platform · 89d2ff0e

Anton Gorenko authored Aug 25, 2024

Port changes in CUDA backend to HIP

Fix a warning about arithmetic operations on void* in HipArray::uploadSubArray

Fix "Error Initializing context ROCm 5.3.0"

    https://github.com/StreamHPC/openmm-hip/issues/3

    hipDeviceSetCacheConfig returns hipErrorNotSupported on 5.3

Co-authored-by: Nick Curtis <nicholas.curtis@amd.com>

89d2ff0e

14 Dec, 2023 1 commit
- Fixed bug in large blocks optimization with triclinic boxes (#4351) · 4c107329
  Peter Eastman authored Dec 14, 2023
  
  4c107329
11 Dec, 2023 1 commit

Improved sorting of blocks when building neighbor list (#4343) · 796ffaaa

Peter Eastman authored Dec 11, 2023

* Improved sorting of blocks when building neighbor list

* Improved block sorting for OpenCL

* Made sort keys more evenly distributed

796ffaaa

24 Jul, 2023 1 commit

Use large blocks to optimize building the neighbor list (#4147) · 3955033a

Peter Eastman authored Jul 24, 2023

* Use large blocks to optimize building the neighbor list

* Large blocks optimization for OpenCL

* Fix test failures

* Select whether to use large blocks based on system size

3955033a

14 May, 2023 1 commit
- Store bounding box sizes in half precision (#4066) · 2ae50f9d
  Peter Eastman authored May 13, 2023
```
* Store bounding box sizes in half precision

* Work correctly in double precision mode
```
  2ae50f9d
27 Jan, 2022 1 commit
- Fixed potential invalid memory access (#3428) · 995c6318
  Peter Eastman authored Jan 26, 2022
```
* Fixed potential invalid memory access

* Fixed exception
```
  995c6318
11 Mar, 2021 1 commit
- Fixed rare error related to single pairs (#3057) · 2a7fd676
  Peter Eastman authored Mar 11, 2021
  
  2a7fd676
18 Feb, 2021 1 commit
- Reduced padding on cutoff (#3025) · ebef35a4
  Peter Eastman authored Feb 17, 2021
  
  ebef35a4
28 Jan, 2021 1 commit

Implements findBlocksWithInteractions with matrix multiplication (#2989) · ffcabcf6

David Clark authored Jan 27, 2021



* Frames distance calculation as matrix multiplication

* Adds comment explaining distance calculation

* Tunes launch bound for cuda11.2

* Simplifies the effective matrix multiplication

Co-authored-by: David Clark <daclark@nvidia.com>

ffcabcf6
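
One common way to frame a distance calculation as a matrix multiplication is the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a·b, where the cross terms a·b for all pairs form the product of one coordinate matrix with the transpose of the other. The C++ sketch below shows that identity on the host. Whether the commit uses exactly this formulation is an assumption, and the sketch ignores bounding boxes and periodic boundary conditions. The names `Vec3`, `dot3` and `pairwiseDistanceSquared` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

inline double dot3(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Returns a row-major a.size() x b.size() matrix of squared distances,
// built from |a|^2 + |b|^2 - 2 a.b. The a.b terms are the entries of
// the matrix product A * B^T.
inline std::vector<double> pairwiseDistanceSquared(const std::vector<Vec3>& a,
                                                   const std::vector<Vec3>& b) {
    std::vector<double> result(a.size() * b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double aNorm = dot3(a[i], a[i]);
        for (std::size_t j = 0; j < b.size(); ++j) {
            result[i * b.size() + j] = aNorm + dot3(b[j], b[j]) - 2.0 * dot3(a[i], b[j]);
        }
    }
    return result;
}
```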

10 Dec, 2020 1 commit

Minor CUDA Changes (#2947) · d24ce6ed

David Clark authored Dec 10, 2020



* Changes name of NVRTC program

* Adds launch bounds for findInteractingBlocks

* Replaces launch bound parameter with named constant

Co-authored-by: David Clark <daclark@nvidia.com>

d24ce6ed

25 Sep, 2020 1 commit
- Re-enabled single pair list (#2863) · 94d7225b
  peastman authored Sep 25, 2020
  
  94d7225b
16 Sep, 2020 1 commit
- Fixed illegal read in kernel (#2843) · 050a1aa1
  peastman authored Sep 16, 2020
  
  050a1aa1
20 Aug, 2020 1 commit

Fixed range overflow with very large numbers of atoms (#2806) · cdc0789a

peastman authored Aug 20, 2020

* Fixed range overflow with very large numbers of atoms

* More fixes to overflow with large numbers of atoms

* Fix test failures

cdc0789a
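
The commit does not say which quantity overflowed. As an illustration only, the C++ sketch below assumes it is a count of block pairs (tiles), which grows quadratically with the number of atom blocks and so exceeds a 32-bit signed integer once there are 65536 blocks. The names `tileCount` and `fitsInInt32` are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Number of unordered block pairs, including each block paired with itself,
// computed in 64 bits so the intermediate product cannot overflow for the
// block counts shown here.
inline std::int64_t tileCount(std::int64_t numBlocks) {
    return numBlocks * (numBlocks + 1) / 2;
}

// True if the value can be stored in a 32-bit signed integer.
inline bool fitsInInt32(std::int64_t value) {
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}
```

With 65535 blocks the count is 2,147,450,880 and still fits; with 65536 blocks it is 2,147,516,416 and does not.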

04 Oct, 2019 1 commit
- Fixed error in #2416 · f144dee2
  Peter Eastman authored Oct 03, 2019
  
  f144dee2
03 Oct, 2019 1 commit
- Fixed error building neighbor list with triclinic box · 2d7fb355
  Peter Eastman authored Oct 02, 2019
  
  2d7fb355
03 May, 2018 1 commit
- Minor optimizations · f3c55c28
  peastman authored May 03, 2018
  
  f3c55c28
21 Sep, 2017 1 commit
- Replace __ballot() with __ballot_sync() · df95e2d6
  peastman authored Sep 21, 2017
  
  df95e2d6
10 Jan, 2017 1 commit
- Fixed bug in building neighbor list · cfa7ab82
  Peter Eastman authored Jan 10, 2017
  
  cfa7ab82
02 Dec, 2016 1 commit
- Fixed error building neighbor list with triclinic boxes · 5e535679
  Peter Eastman authored Dec 02, 2016
  
  5e535679
18 Oct, 2016 1 commit
- Fixed bug computing neighbor list · 556dddb6
  Peter Eastman authored Oct 18, 2016
  
  556dddb6
13 Oct, 2016 1 commit
- Further improvements to pair list · 7c911459
  Peter Eastman authored Oct 13, 2016
  
  7c911459
22 Sep, 2016 1 commit
- Improvements to fine grained pair list · 9650d66e
  Peter Eastman authored Sep 22, 2016
  
  9650d66e
14 Sep, 2016 1 commit
- Beginning of implementing fine grained pair list · 5fa4345f
  Peter Eastman authored Sep 14, 2016
  
  5fa4345f
19 Aug, 2016 2 commits
- Further optimizations to neighbor list · b07cf776
  Peter Eastman authored Aug 19, 2016
  
  b07cf776
- Minor optimizations · d927ff49
  Peter Eastman authored Aug 19, 2016
  
  d927ff49
06 Mar, 2015 1 commit
- Load balancing between GPUs forces the neighbor list to be rebuilt · 62f7c44c
  peastman authored Mar 06, 2015
  
  62f7c44c
05 Jan, 2015 1 commit
- Beginning of CUDA support for triclinic boxes · 61458a5f
  Peter Eastman authored Jan 05, 2015
  
  61458a5f
10 Nov, 2014 1 commit
- Fixed a bug in neighbor list construction · d5666536
  peastman authored Nov 10, 2014
  
  d5666536
09 Sep, 2014 1 commit
- Cleaned up obsolete code · 68984320
  peastman authored Sep 09, 2014
  
  68984320
08 Sep, 2014 1 commit
- Rewrote findBlocksWithInteractions() in a way that is both faster and simpler. · bfeb8ac7
  peastman authored Sep 08, 2014
  
  bfeb8ac7
04 Jun, 2013 1 commit

Converted the array containing atom block indices for the neighbor list from... · 2ff0b0ae

peastman authored Jun 04, 2013

Converted the array containing atom block indices for the neighbor list from ushort2 to int.  This removes the hard limit of 2 million atoms.

2ff0b0ae
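
The 2 million atom figure follows from simple arithmetic if block indices are 16-bit and each block holds 32 atoms. The block size of 32 is an assumption here, as the commit message does not state it. A short C++ sketch of the calculation, using the hypothetical helper `maxAtomsForIndexType`:

```cpp
#include <cstdint>
#include <limits>

// Largest atom count addressable when block indices run from 0 to maxIndexValue
// and every block holds atomsPerBlock atoms.
constexpr std::int64_t maxAtomsForIndexType(std::int64_t maxIndexValue, int atomsPerBlock) {
    return (maxIndexValue + 1) * atomsPerBlock;
}

// 16-bit block indices, assuming 32 atoms per block.
constexpr std::int64_t kUshortLimit =
    maxAtomsForIndexType(std::numeric_limits<std::uint16_t>::max(), 32);

// 32-bit signed block indices under the same assumption.
constexpr std::int64_t kIntLimit =
    maxAtomsForIndexType(std::numeric_limits<std::int32_t>::max(), 32);
```

Under that assumption the 16-bit limit is 65536 × 32 = 2,097,152 atoms, while a 32-bit index raises it far beyond any practical system size.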

16 May, 2013 1 commit
- Added documentation regarding how we construct neighborlists. · dd1ae5c1
  Yutong Zhao authored May 16, 2013
  
  dd1ae5c1
03 May, 2013 1 commit
- Very minor optimizations and cleanup to building neighborlist · 78f8d078
  Peter Eastman authored May 03, 2013
  
  78f8d078
24 Apr, 2013 1 commit
- Fixed bug caused by missing synchronization in findInteractingBlocks() · 81753737
  Peter Eastman authored Apr 24, 2013
  
  81753737