  1. 05 Sep, 2024 5 commits
    • Use VkFFT in HipFFT3D, remove hipFFT and the builtin FFT · f717ed89
      Anton Gorenko authored
      * VkFFT-based 3D FFT;
      * Caching of compiled VkFFT kernels;
      * Extend FFT tests with more sizes.
      f717ed89
    • Always use hipRTC, support Windows · b9c45d45
      Anton Gorenko authored
      * Unload all loaded modules in HipContext's destructor:
        HIP modules keep file descriptors open, but OpenMM never unloads
        modules, leaking these file descriptors. This can crash some
        scripts, such as test-openmm-platforms from openmmtools.
      * ROCm 6.0 defines operator* for complex types (which are typedefs for
        float2 and double2); these conflict with the operators defined for
        vectors. This is fixed in newer ROCm versions.
      * Revert HIP_DYNAMIC_SHARED back to extern __shared__ (the macro is
        in the headers).
      * Reduce the speed of the HIP platform if there are no HIP devices in
        the system.
      b9c45d45
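The module-unloading fix above amounts to tying module lifetime to the owning context. A minimal C++ sketch of that idea follows; `ModuleOwner` and `ModuleHandle` are hypothetical names invented for illustration, and the unload callback stands in for a real call such as `hipModuleUnload`. It is not the actual HipContext code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a HIP module handle (hipModule_t in real code).
using ModuleHandle = int;

// Hypothetical owner: remembers every module it is given and unloads
// all of them when it is destroyed, so no handle outlives the context.
class ModuleOwner {
public:
    explicit ModuleOwner(std::function<void(ModuleHandle)> unload)
        : unload_(std::move(unload)) {
        assert(unload_ && "an unload callback is required");
    }
    ModuleOwner(const ModuleOwner&) = delete;
    ModuleOwner& operator=(const ModuleOwner&) = delete;

    ~ModuleOwner() {
        // Unload in reverse order of loading.
        for (auto it = modules_.rbegin(); it != modules_.rend(); ++it)
            unload_(*it);
    }

    void track(ModuleHandle module) { modules_.push_back(module); }
    std::size_t size() const { return modules_.size(); }

private:
    std::function<void(ModuleHandle)> unload_;
    std::vector<ModuleHandle> modules_;
};
```

With this shape, a script that creates and destroys many contexts releases the underlying resources each time instead of accumulating them.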
    • Optimize PME kernels · a0acfbc9
      Anton Gorenko authored
      * Compile with -munsafe-fp-atomics to enable fast hardware f32 atomic
        add on global memory on >=MI100 GPUs;
      * Use fixed point charge spreading on other GPUs, otherwise float atomic
        add will be compiled as a slow CAS loop;
      * Tune block sizes, use executeKernelFlat;
      * Tune launch bounds of PME grid-related kernels: force the compiler to
        use all registers by limiting max waves per EU to 1.
      a0acfbc9
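Fixed point charge spreading replaces float atomic adds with 64-bit integer atomic adds. The host-side C++ sketch below shows the general technique only: the 2^32 scale factor and the function names are assumptions made for illustration, and `std::atomic` stands in for a GPU integer atomic.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed scale for illustration: 32 fractional bits.
constexpr double kFixedScale = 4294967296.0;  // 2^32

// Convert a float contribution to 64-bit fixed point.
inline std::int64_t toFixedPoint(float x) {
    assert(std::fabs(x) < 2147483648.0f);  // keep the scaled value inside int64
    return static_cast<std::int64_t>(static_cast<double>(x) * kFixedScale);
}

// Convert an accumulated fixed point value back to floating point.
inline double fromFixedPoint(std::int64_t v) {
    return static_cast<double>(v) / kFixedScale;
}

// Accumulate one contribution into a grid cell with an integer atomic add.
inline void spreadCharge(std::atomic<std::int64_t>& cell, float contribution) {
    cell.fetch_add(toFixedPoint(contribution), std::memory_order_relaxed);
}
```

Because integer addition is associative, the accumulated value does not depend on the order in which threads add their contributions, which is not true of float atomics.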
    • Optimize findInteractingBlocks · a96534c1
      Anton Gorenko authored
      Optimize findBlocksWithInteractions
      
      * Replace volatile shared mem accesses with shuffles;
      * Add NUM_TILES_IN_BATCH for processing block1 by multiple warps
        (for small systems);
      * Cherry-pick missing changes from .cu;
      * Tune MAX_BITS_FOR_PAIRS depending on device and the system size;
      * Store single pairs immediately (if there are any); this avoids
        storing flags to shared memory and filtering buffer and flagsBuffer
        after saving single pairs;
      * Use fma explicitly and sign bit for better device code;
      * Use CDNA's MFMA with single/mixed precision;
      * On CDNA the coarse grained stage processes warpSize blocks for
        one block1, the fine grained stage checks atoms of two block2 vs atoms
        of the same block1, singlePairs and interactingAtoms are also stored
        by warps, not half-warps;
      
      Optimize findBlockBounds
      
      * Use shuffles;
      * Use executeKernelFlat;
      * Process 2 tiles per 64-thread warp on CDNA;
      * Use more uniformly distributed keys when sorting blocks;
      
      Use compareInt2LargeSIMD when tile size < SIMD width
      
      Fix exclusion tiles sorting on AMD CDNA (64 threads per wave)
      
          The nonbonded kernel uses USE_NEIGHBOR_LIST (useNeighborList)
          so host code also must check it instead of useCutoff.
      
          See also https://github.com/openmm/openmm/issues/3462
      a96534c1
    • Optimize computeNonbonded · 67f5644d
      Anton Gorenko authored
      * All AMD GPUs support shuffle, double precision and 64-bit int atomics;
      * Remove unused code: !ENABLE_SHUFFLE code paths in nonbonded.hip;
      * Use intrinsics in single-precision;
      * Use realToFixedPoint (faster float32-to-int64);
      * Remove shared atomIndices, use shuffles;
      * Check early if atoms are in the cutoff range: sometimes all lanes in
        a warp can skip computations, and single pairs can also skip useless
        atomics with zero values;
      * Remove volatile skipTiles access, use shuffles;
      * Distribute work for warps in a strided order;
      * Skip warps that may still be busy in the first loop;
      * Unify conditions for excluded atoms with `includeInteraction`;
      * Move multiprocessors to HipContext;
      * Increase number of warps for computeNonbonded;
      * Disable packed math for >=MI200 (it affects performance of some
        kernels like computeGKForces of amoebagk);
      * Remove defaultOptimizationOptions and createModule's optimizationFlags
        as they are never used;
      * Support -save-temps.
      67f5644d
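One way to read "distribute work for warps in a strided order" is that warp w of W takes tiles w, w + W, w + 2W, and so on, rather than one contiguous chunk. The C++ sketch below illustrates that mapping only; the function name is hypothetical and the kernel's actual indexing may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: list the tile indices a given warp would process
// when tiles are handed out in a strided (round-robin) order.
std::vector<std::size_t> tilesForWarp(std::size_t warp,
                                      std::size_t numWarps,
                                      std::size_t numTiles) {
    assert(numWarps > 0 && warp < numWarps);
    std::vector<std::size_t> tiles;
    for (std::size_t t = warp; t < numTiles; t += numWarps)
        tiles.push_back(t);
    return tiles;
}
```

With 4 warps and 10 tiles, warp 1 gets tiles 1, 5 and 9, so expensive neighbouring tiles are spread over different warps.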
  2. 01 Sep, 2024 3 commits
    • Optimize sorting kernels and tune block sizes · 7279c539
      Anton Gorenko authored
      * Compile kernels with max block size of 256 threads:
        The default hipcc behavior since ROCm 4.2 is to compile kernels
        with 1024 threads unless __launch_bounds__ is specified. This
        significantly increases register pressure especially in heavy kernels
        (double precision, for example), requiring register spilling;
      * Optimize computeRange by using multiple blocks for reduction;
      * Use blocks of 1024 threads for computeBucketPositions - it is executed
        as a single work group so larger block size is faster;
      * Sort up to lenghtNextPow2 instead of blockDim.x (faster for short
        buckets);
      * Optimize sortShortList2;
      * Optimize sortBuckets with bit instructions;
      * Decrease bucket size for non-uniform sorting: too many buckets may
        have sizes too large to sort in shared memory;
      * Add more sizes in tests.
      7279c539
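Sorting up to the next power of two instead of the full block size can be pictured with a serial bitonic sort that pads only to the smallest power of two covering the bucket. This C++ sketch is an illustration under that assumption; `nextPow2` and `bitonicSortShort` are hypothetical names, and the real kernel runs the compare-exchange steps in parallel across threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n == 0).
inline std::size_t nextPow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Bitonic sort over a buffer padded to nextPow2(length), not to a fixed
// block size, so short buckets need fewer compare-exchange passes.
void bitonicSortShort(std::vector<int>& data) {
    const std::size_t length = data.size();
    const std::size_t padded = nextPow2(length);
    assert(padded >= length);
    // Pad with the largest value so padding ends up at the back.
    data.resize(padded, std::numeric_limits<int>::max());
    for (std::size_t k = 2; k <= padded; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < padded; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((data[i] > data[partner]) == ascending)
                        std::swap(data[i], data[partner]);
                }
            }
        }
    }
    data.resize(length);  // drop the padding again
}
```

A bucket of 5 elements is padded to 8 rather than to a 256-thread block, which is where the saving for short buckets comes from.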
    • Cleanup CMake scripts for HIP platform · aca24d5f
      Anton Gorenko authored
      * Remove setting of link libraries, include and link dirs and compile
        flags for each target; instead let CMake deal with them by linking the
        main library to hip::host hiprtc::hiprtc hip::hipfft;
      * Fix: custom command without ADD_CUSTOM_TARGET and ADD_DEPENDENCIES is
        executed for both static and shared targets;
      * Remove IF(APPLE) parts.
      aca24d5f
    • Add hipification of CUDA platform · 89d2ff0e
      Anton Gorenko authored
      Port changes in CUDA backend to HIP
      
      Fix a warning about arithmetic operations on void* in HipArray::uploadSubArray
      
      Fix "Error Initializing context ROCm 5.3.0"
      
          https://github.com/StreamHPC/openmm-hip/issues/3
      
          hipDeviceSetCacheConfig returns hipErrorNotSupported on 5.3
      Co-authored-by: Nick Curtis <nicholas.curtis@amd.com>
      89d2ff0e
  3. 19 Aug, 2024 1 commit
  4. 06 Apr, 2024 1 commit
  5. 24 Feb, 2024 1 commit
  6. 23 Feb, 2024 1 commit
  7. 17 Feb, 2024 1 commit
  8. 02 Feb, 2024 1 commit
  9. 18 Jan, 2024 1 commit
  10. 20 Dec, 2023 2 commits
  11. 14 Dec, 2023 1 commit
  12. 12 Dec, 2023 1 commit
  13. 11 Dec, 2023 1 commit
  14. 02 Nov, 2023 1 commit
  15. 31 Oct, 2023 1 commit
  16. 24 Oct, 2023 2 commits
  17. 16 Oct, 2023 1 commit
    • WIP - looking for a way to optimise performance of creating contexts by... · 03ed8ff2
      Christopher Woods authored
      WIP - looking for a way to optimise performance of creating contexts by removing temporary arrays (and their associated mallocs/frees) (#4261)
      
      * Suggesting a "haveSameParameters" function for CustomNonbondedForce which could be
      used to avoid creating temporary copies of arrays when testing if particles are
      the same.
      
      Also updating "getParticleParameters" so that it re-uses the memory of the
      passed vector argument, rather than deallocating and reallocating it
      via a copy.
      
      * Revert "Suggesting a "haveSameParameters" function for CustomNonbondedForce which could be"
      
      This reverts commit e80ec2d2e9981abb90711636bf3a78d0c49e43fc.
      
      * Moved to `thread_local static` as suggested to prevent new vector allocations on each function call.
      
      Updated `getParameters` and `getBondParameters` to re-use the memory from the argument rather
      than re-allocating via the copy.
      
      * Forgot to reuse the memory for the groups...
      
      * Reverted back the manual copies via memcpy as they aren't needed. Looking at the header
      file and benchmarking shows that std::vector does the right thing.
      
      * Confined `thread_local static` only to ForceInfo methods, and have also put declarations
      for multiple variables back onto a single line
      
      * Removed `thread_local static` from the constructor
      
      * Moved constructor declarations back into the for loop
      03ed8ff2
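The two ideas in this commit, re-using the caller's vector memory in getters and keeping scratch vectors `thread_local static` so they are not reallocated on every call, can be sketched in C++ as follows. `ToyForce` and `areParticlesIdentical` are hypothetical names for illustration and are not the OpenMM classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical force-like class holding per-particle parameter lists.
class ToyForce {
public:
    void addParticle(std::vector<double> parameters) {
        params_.push_back(std::move(parameters));
    }
    // Copy-assigning into the caller's vector re-uses its existing
    // capacity when it is large enough, so repeated calls need not allocate.
    void getParticleParameters(std::size_t index,
                               std::vector<double>& out) const {
        assert(index < params_.size());
        out = params_[index];
    }

private:
    std::vector<std::vector<double>> params_;
};

// Scratch vectors live for the lifetime of the calling thread, so their
// storage is allocated once per thread instead of once per call.
bool areParticlesIdentical(const ToyForce& force,
                           std::size_t a, std::size_t b) {
    thread_local static std::vector<double> pa, pb;
    force.getParticleParameters(a, pa);
    force.getParticleParameters(b, pb);
    return pa == pb;
}
```

Using `thread_local` rather than plain `static` keeps the scratch buffers safe when several threads compare particles at the same time.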
  18. 14 Oct, 2023 2 commits
  19. 28 Sep, 2023 1 commit
  20. 16 Sep, 2023 1 commit
  21. 04 Sep, 2023 1 commit
  22. 02 Sep, 2023 1 commit
  23. 01 Sep, 2023 1 commit
  24. 28 Aug, 2023 1 commit
  25. 18 Aug, 2023 2 commits
  26. 02 Aug, 2023 1 commit
    • Draft integration of the Alchemical Transfer Method (ATM) plugin (#4110) · d8c67699
      Emilio Gallicchio authored
      * Draft integration of the Alchemical Transfer Method (ATM) plugin
      
      * Attempt to store and retrieve forces--does not compile
      
      * Implement addForce()/getForce() methods
      
      * Throw exception when specifying properties without a Platform (#4130)
      
      * Fixed DOF calculation for NoseHooverIntegrator (#4128)
      
      * Fix variance in documentation of VerletIntegrator (#4138)
      
      * Python API for ATMForce
      
      * Fixed compilation error
      
      * Minor cleanup of formatting and documentation
      
      * Files for ATMForce test cases
      
      * More cleanup
      
      * Removed variable groups
      
      * Test ATMForce with two particles
      
      * More tests for ATMForce plus fixes
      
      * Added missing header
      
      * Rework interface to pass displacements as vector of parameters
      
      * Revert "Rework interface to pass displacements as vector of parameters"
      
      This reverts commit 5e092031f31ded1137b677588f007add1c2d6f82.
      
      * Test with nonbonded force
      
      * Allow energy expression to be customized
      
      * Optional displacements at the initial state
      
      * Fixed compilation error when building the C wrapper
      
      * Address edge case of default energy expression
      
      * Consistent naming of the variables of the displacement states
      
      * Test of soft core function of the default energy expression
      
      * Mark addForce() as taking ownership
      
      * Initial Python test for ATMForce
      
      * Test custom expressions
      
      * Expanded C++ API documentation for ATMForce
      
      * Energy parameter derivatives
      
      * Serialization for ATMForce
      
      * Documentation, cleanup, and fixes
      
      * Fixed typos
      
      * getPerturbationEnergy() computes energy
      
      * Another test case
      
      * Minor edits
      
      ---------
      Co-authored-by: Peter Eastman <peastman@stanford.edu>
      Co-authored-by: Michael Plainer <plainer@ymail.com>
      d8c67699
  27. 24 Jul, 2023 1 commit
  28. 21 Jul, 2023 1 commit
  29. 20 Jul, 2023 1 commit
  30. 14 Jul, 2023 1 commit