GPU implementation of L-BFGS (#5198)

* Make reference/CPU minimizer into a kernel
* Add per-platform support for GPU minimization
* Initial implementation of GPU minimization
* Fixes
* Increase robustness when initial gradient is huge
* Handle overflow leading to non-finite values gracefully
* Handle large forces in single precision more robustly
* Optimize kernels
* Fix kernel launch size
* Update banner years
* Don't create MinimizeKernel until first minimization requested
* Make some compile-time constants into kernel arguments
* Consolidate scale calculation kernel
* Condense alpha/beta reduction kernels using atomics
* Condense line search dot kernels with reductions
* Remove a download, and download grad norm separately
* Asynchronously check lbfgs convergence condition
* Restructure line search to avoid download waiting
* Start line search preemptively in case CPU evaluation is not needed
* In rare cases, constraint error might not decrease after one optimization round
* Better handling of unsupported 64-bit atomics, use FLT_MAX
* Pick gradient mode based on GPU vs. CPU evaluation
* Rework getDiff/getScale reduction, remove reduceBuffer
* Older CUDA might not like float hex literals
* Fix error in a comment

Commit 4ab645ea, authored Feb 10, 2026 by Evan Pretti; committed by GitHub Feb 10, 2026
20 changed files
--- a/libraries/lbfgs/src/lbfgs.cpp
+++ b/libraries/lbfgs/src/lbfgs.cpp
@@ -455,10 +455,13 @@ int lbfgs(
            goto lbfgs_exit;
        }

-        /* Compute the initial step:
-            step = 1.0 / sqrt(vecdot(d, d, n))
+        /*
+            Normalize the initial steepest descent direction and set the initial
+            step to 1 to avoid underflow/overflow issues with large gradients.
         */
        vec2norminv(&step, d, n);
+        vecscale(d, step, n);
+        step = 1.0;

        k = 1;
        end = 0;
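
The hunk above replaces "step = 1/|d|" with "normalize d, then step = 1". The first trial displacement step*d is mathematically the same in both versions (a unit-length move along the steepest descent direction); what changes is that the stored direction and the step are both of order one. A minimal sketch of that idea follows. `inverseNorm` and `normalizeInitialDirection` are hypothetical names standing in for `vec2norminv` and the three patched lines; they are not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for vec2norminv(): returns 1 / ||d||.
double inverseNorm(const std::vector<double>& d) {
    double sum = 0.0;
    for (double v : d)
        sum += v * v;
    return 1.0 / std::sqrt(sum);
}

// Mirrors the patched logic: rescale d to unit length and return the
// initial step, which is now always 1.
double normalizeInitialDirection(std::vector<double>& d) {
    assert(!d.empty());
    double scale = inverseNorm(d);
    for (double& v : d)
        v *= scale;
    return 1.0;
}
```

With a direction such as (3e8, 4e8), the function leaves d close to (0.6, 0.8) and returns 1, so later line search arithmetic operates on moderately sized values.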

--- a/olla/include/openmm/kernels.h
+++ b/olla/include/openmm/kernels.h
@@ -7,7 +7,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2025 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -55,6 +55,7 @@
 #include "openmm/HarmonicBondForce.h"
 #include "openmm/KernelImpl.h"
 #include "openmm/LCPOForce.h"
+#include "openmm/LocalEnergyMinimizer.h"
 #include "openmm/MonteCarloBarostat.h"
 #include "openmm/OrientationRestraintForce.h"
 #include "openmm/PeriodicTorsionForce.h"
@@ -300,6 +301,33 @@ public:
    virtual void computePositions(ContextImpl& context) = 0;
 };

+/**
+ * This kernel performs local energy minimization.
+ */
+class MinimizeKernel : public KernelImpl {
+public:
+    static std::string Name() {
+        return "Minimize";
+    }
+    MinimizeKernel(std::string name, const Platform& platform) : KernelImpl(name, platform) {
+    }
+    /**
+     * Initialize the kernel.
+     *
+     * @param system     the System this kernel will be applied to
+     */
+    virtual void initialize(const System& system) = 0;
+    /**
+     * Perform local energy minimization.
+     * 
+     * @param context        the context with which to perform the minimization
+     * @param tolerance      limiting root-mean-square value of all force components in kJ/mol/nm for convergence
+     * @param maxIterations  the maximum number of iterations to perform, or 0 to continue until convergence
+     * @param reporter       an optional reporter to invoke after each iteration of minimization
+     */
+    virtual void execute(ContextImpl& context, double tolerance, int maxIterations, MinimizationReporter* reporter) = 0;
+};
+
 /**
 * This kernel is invoked by HarmonicBondForce to calculate the forces acting on the system and the energy of the system.
 */
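
The new `MinimizeKernel` follows the usual kernel pattern: a static `Name()`, a constructor forwarding to the base class, and pure virtual methods that each platform implements. The sketch below illustrates that pattern with simplified stand-in types so that it compiles on its own. `KernelBase`, `MinimizeKernelSketch`, `DummyMinimizeKernel` and `createKernel` are hypothetical names, and the `execute()` signature here is deliberately reduced (the real one takes a `ContextImpl&` and a reporter and returns void).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for KernelImpl; only the kernel name is modelled.
class KernelBase {
public:
    explicit KernelBase(std::string name) : name(std::move(name)) {}
    virtual ~KernelBase() = default;
    const std::string& getName() const { return name; }
private:
    std::string name;
};

// Simplified analogue of MinimizeKernel.
class MinimizeKernelSketch : public KernelBase {
public:
    static std::string Name() { return "Minimize"; }
    explicit MinimizeKernelSketch(std::string name) : KernelBase(std::move(name)) {}
    // Hypothetical reduced signature: returns the number of iterations run.
    virtual int execute(double tolerance, int maxIterations) = 0;
};

// Hypothetical platform implementation that performs no real minimization.
class DummyMinimizeKernel : public MinimizeKernelSketch {
public:
    explicit DummyMinimizeKernel(std::string name) : MinimizeKernelSketch(std::move(name)) {}
    int execute(double tolerance, int maxIterations) override {
        assert(tolerance > 0.0);
        return maxIterations;
    }
};

// Name-based dispatch, in the spirit of a platform's createKernelImpl().
std::unique_ptr<KernelBase> createKernel(const std::string& name) {
    if (name == MinimizeKernelSketch::Name())
        return std::unique_ptr<KernelBase>(new DummyMinimizeKernel(name));
    return nullptr;
}
```

Keying creation on the string returned by `Name()` is what lets the core library request a kernel without knowing which platform class provides it.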

--- a/openmmapi/include/openmm/Context.h
+++ b/openmmapi/include/openmm/Context.h
@@ -7,7 +7,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2021 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -287,6 +287,7 @@ private:
    friend class Force;
    friend class ForceImpl;
    friend class Platform;
+    friend class LocalEnergyMinimizer;
    Context(const System& system, Integrator& integrator, ContextImpl& linked);
    ContextImpl& getImpl();
    const ContextImpl& getImpl() const;

--- a/openmmapi/include/openmm/internal/ContextImpl.h
+++ b/openmmapi/include/openmm/internal/ContextImpl.h
@@ -7,7 +7,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2022 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -31,6 +31,7 @@
 * -------------------------------------------------------------------------- */

 #include "openmm/Kernel.h"
+#include "openmm/LocalEnergyMinimizer.h"
 #include "openmm/Platform.h"
 #include "openmm/Vec3.h"
 #include <iosfwd>
@@ -295,6 +296,15 @@ public:
     * means you shouldn't.
     */
    Context* createLinkedContext(const System& system, Integrator& integrator);
+    /**
+     * Run local energy minimization on the Context.  See LocalEnergyMinimizer
+     * for details of the parameters.
+     * 
+     * @param tolerance      how precisely the energy minimum must be located
+     * @param maxIterations  the maximum number of iterations to perform
+     * @param reporter       an optional MinimizationReporter to invoke after each iteration
+     */
+    void minimize(double tolerance, int maxIterations, MinimizationReporter* reporter);
 private:
    friend class Context;
    void initialize();
@@ -304,10 +314,10 @@ private:
    std::vector<ForceImpl*> forceImpls;
    std::map<std::string, double> parameters;
    mutable std::vector<std::vector<int> > molecules;
-    bool hasInitializedForces, hasSetPositions, integratorIsDeleted;
+    bool hasInitializedForces, hasSetPositions, integratorIsDeleted, hasMinimizeKernel;
    int lastForceGroups;
    Platform* platform;
-    Kernel initializeForcesKernel, updateStateDataKernel, applyConstraintsKernel, virtualSitesKernel;
+    Kernel initializeForcesKernel, updateStateDataKernel, applyConstraintsKernel, virtualSitesKernel, minimizeKernel;
    void* platformData;
 };


--- a/openmmapi/src/ContextImpl.cpp
+++ b/openmmapi/src/ContextImpl.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2022 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -52,7 +52,7 @@ const static char CHECKPOINT_MAGIC_BYTES[] = "OpenMM Binary Checkpoint\n";


 ContextImpl::ContextImpl(Context& owner, const System& system, Integrator& integrator, Platform* platform, const map<string, string>& properties, ContextImpl* originalContext) :
-        owner(owner), system(system), integrator(integrator), hasInitializedForces(false), hasSetPositions(false), integratorIsDeleted(false),
+        owner(owner), system(system), integrator(integrator), hasInitializedForces(false), hasSetPositions(false), integratorIsDeleted(false), hasMinimizeKernel(false),
        lastForceGroups(-1), platform(platform), platformData(NULL) {
    int numParticles = system.getNumParticles();
    if (numParticles == 0)
@@ -113,6 +113,7 @@ ContextImpl::ContextImpl(Context& owner, const System& system, Integrator& integ
    kernelNames.push_back(UpdateStateDataKernel::Name());
    kernelNames.push_back(ApplyConstraintsKernel::Name());
    kernelNames.push_back(VirtualSitesKernel::Name());
+    kernelNames.push_back(MinimizeKernel::Name());
    for (int i = 0; i < system.getNumForces(); ++i) {
        forceImpls.push_back(system.getForce(i).createImpl());
        vector<string> forceKernels = forceImpls[forceImpls.size()-1]->getKernelNames();
@@ -198,6 +199,7 @@ ContextImpl::~ContextImpl() {
    updateStateDataKernel = Kernel();
    applyConstraintsKernel = Kernel();
    virtualSitesKernel = Kernel();
+    minimizeKernel = Kernel();
    if (!integratorIsDeleted) {
        // The Context is being deleted before the Integrator, so call cleanup() on it now.
        
@@ -507,3 +509,12 @@ void ContextImpl::systemChanged() {
 Context* ContextImpl::createLinkedContext(const System& system, Integrator& integrator) {
    return new Context(system, integrator, *this);
 }
+
+void ContextImpl::minimize(double tolerance, int maxIterations, MinimizationReporter* reporter) {
+    if (!hasMinimizeKernel) {
+        minimizeKernel = platform->createKernel(MinimizeKernel::Name(), *this);
+        minimizeKernel.getAs<MinimizeKernel>().initialize(system);
+        hasMinimizeKernel = true;
+    }
+    minimizeKernel.getAs<MinimizeKernel>().execute(*this, tolerance, maxIterations, reporter);
+}
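
`ContextImpl::minimize()` above creates and initializes the kernel only on the first call, so Contexts that never minimize pay no setup cost. A self-contained sketch of the same create-on-first-use pattern is shown below. `LazyKernel` and `LazyKernelHolder` are hypothetical names, and the sketch tracks creation through a null `unique_ptr` rather than the separate `hasMinimizeKernel` flag used in the patch.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical kernel type that only counts how often it is executed.
struct LazyKernel {
    int executions = 0;
    void execute() { ++executions; }
};

// Owns a kernel that is constructed the first time run() is called.
class LazyKernelHolder {
public:
    typedef std::function<std::unique_ptr<LazyKernel>()> Factory;
    explicit LazyKernelHolder(Factory factory) : factory(std::move(factory)) {}
    void run() {
        if (!kernel) {
            // First request: build the kernel now, reuse it afterwards.
            kernel = factory();
            assert(kernel);
        }
        kernel->execute();
    }
    bool isCreated() const { return static_cast<bool>(kernel); }
    int executions() const { return kernel ? kernel->executions : 0; }
private:
    Factory factory;
    std::unique_ptr<LazyKernel> kernel;
};
```

Calling `run()` twice invokes the factory once and executes the kernel twice, which matches the behaviour of the flag-guarded block in the patch.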
--- a/openmmapi/src/LocalEnergyMinimizer.cpp
+++ b/openmmapi/src/LocalEnergyMinimizer.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2010-2020 Stanford University and the Authors.      *
+ * Portions copyright (c) 2010-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -28,264 +28,11 @@
 * -------------------------------------------------------------------------- */

 #include "openmm/LocalEnergyMinimizer.h"
-#include "openmm/OpenMMException.h"
-#include "openmm/Platform.h"
-#include "openmm/VerletIntegrator.h"
-#include "lbfgs.h"
-#include <cmath>
-#include <sstream>
-#include <string>
-#include <vector>
-#include <algorithm>
+#include "openmm/internal/ContextImpl.h"

 using namespace OpenMM;
 using namespace std;

-struct MinimizerData {
-    Context& context;
-    double k;
-    MinimizationReporter* reporter;
-    bool checkLargeForces;
-    VerletIntegrator cpuIntegrator;
-    Context* cpuContext;
-    MinimizerData(Context& context, double k, MinimizationReporter* reporter) :
-            context(context), k(k), reporter(reporter), cpuIntegrator(1.0), cpuContext(NULL) {
-        string platformName = context.getPlatform().getName();
-        checkLargeForces = (platformName == "CUDA" || platformName == "OpenCL" || platformName == "HIP" || platformName == "Metal");
-    }
-    ~MinimizerData() {
-        if (cpuContext != NULL)
-            delete cpuContext;
-    }
-    Context& getCpuContext() {
-        // Get an alternate context that runs on the CPU and doesn't place any limits
-        // on the magnitude of forces.
-
-        if (cpuContext == NULL) {
-            Platform* cpuPlatform;
-            try {
-                cpuPlatform = &Platform::getPlatformByName("CPU");
-            }
-            catch (...) {
-                cpuPlatform = &Platform::getPlatformByName("Reference");
-            }
-            cpuContext = new Context(context.getSystem(), cpuIntegrator, *cpuPlatform);
-            cpuContext->setState(context.getState(State::Positions | State::Velocities | State::Parameters));
-        }
-        return *cpuContext;
-    }
-};
-
-static double computeForcesAndEnergy(Context& context, const vector<Vec3>& positions, lbfgsfloatval_t *g) {
-    context.setPositions(positions);
-    context.computeVirtualSites();
-    State state = context.getState(State::Forces | State::Energy, false, context.getIntegrator().getIntegrationForceGroups());
-    const vector<Vec3>& forces = state.getForces();
-    const System& system = context.getSystem();
-    for (int i = 0; i < forces.size(); i++) {
-        if (system.getParticleMass(i) == 0) {
-            g[3*i] = 0.0;
-            g[3*i+1] = 0.0;
-            g[3*i+2] = 0.0;
-        }
-        else {
-            g[3*i] = -forces[i][0];
-            g[3*i+1] = -forces[i][1];
-            g[3*i+2] = -forces[i][2];
-        }
-    }
-    return state.getPotentialEnergy();
-}
-
-static lbfgsfloatval_t evaluate(void *instance, const lbfgsfloatval_t *x, lbfgsfloatval_t *g, const int n, const lbfgsfloatval_t step) {
-    MinimizerData* data = reinterpret_cast<MinimizerData*>(instance);
-    Context& context = data->context;
-    const System& system = context.getSystem();
-    int numParticles = system.getNumParticles();
-
-    // Compute the force and energy for this configuration.
-
-    vector<Vec3> positions(numParticles);
-    for (int i = 0; i < numParticles; i++)
-        positions[i] = Vec3(x[3*i], x[3*i+1], x[3*i+2]);
-    double energy = computeForcesAndEnergy(context, positions, g);
-    if (data->checkLargeForces) {
-        // The CUDA, OpenCL and HIP platforms accumulate forces in fixed point, so they
-        // can't handle very large forces.  Check for problematic forces (very large,
-        // infinite, or NaN) and if necessary recompute them on the CPU.
-
-        for (int i = 0; i < 3*numParticles; i++) {
-            if (!(fabs(g[i]) < 2e9)) {
-                energy = computeForcesAndEnergy(data->getCpuContext(), positions, g);
-                break;
-            }
-        }
-    }
-
-    // Add harmonic forces for any constraints.
-
-    int numConstraints = system.getNumConstraints();
-    double k = data->k;
-    for (int i = 0; i < numConstraints; i++) {
-        int particle1, particle2;
-        double distance;
-        system.getConstraintParameters(i, particle1, particle2, distance);
-        Vec3 delta = positions[particle2]-positions[particle1];
-        double r2 = delta.dot(delta);
-        double r = sqrt(r2);
-        delta *= 1/r;
-        double dr = r-distance;
-        double kdr = k*dr;
-        energy += 0.5*kdr*dr;
-        if (system.getParticleMass(particle1) != 0) {
-            g[3*particle1] -= kdr*delta[0];
-            g[3*particle1+1] -= kdr*delta[1];
-            g[3*particle1+2] -= kdr*delta[2];
-        }
-        if (system.getParticleMass(particle2) != 0) {
-            g[3*particle2] += kdr*delta[0];
-            g[3*particle2+1] += kdr*delta[1];
-            g[3*particle2+2] += kdr*delta[2];
-        }
-    }
-    return energy;
-}
-
-static int report(void *instance, const lbfgsfloatval_t *x, const lbfgsfloatval_t *g, const lbfgsfloatval_t fx,
-        const lbfgsfloatval_t xnorm, const lbfgsfloatval_t gnorm, const lbfgsfloatval_t step, int n, int iteration, int ls) {
-    // Copy over the positions and gradients.
-
-    vector<double> xout(n), gradout(n);
-    for (int i = 0; i < n; i++) {
-        xout[i] = x[i];
-        gradout[i] = g[i];
-    }
-
-    // Compute the other arguments passed to the reporter.
-
-    MinimizerData* data = reinterpret_cast<MinimizerData*>(instance);
-    Context& context = data->context;
-    const System& system = context.getSystem();
-    double restraintEnergy = 0.0, maxError = 0.0;
-    double k = data->k;
-    for (int i = 0; i < system.getNumConstraints(); i++) {
-        int p1, p2;
-        double distance;
-        system.getConstraintParameters(i, p1, p2, distance);
-        Vec3 delta(x[3*p1]-x[3*p2], x[3*p1+1]-x[3*p2+1], x[3*p1+2]-x[3*p2+2]);
-        double r2 = delta.dot(delta);
-        double r = sqrt(r2);
-        double dr = r-distance;
-        restraintEnergy += 0.5*k*dr*dr;
-        maxError = max(maxError, fabs(dr)/distance);
-    }
-    map<string, double> args;
-    args["restraint energy"] = restraintEnergy;
-    args["system energy"] = fx-restraintEnergy;
-    args["restraint strength"] = k;
-    args["max constraint error"] = maxError;
-
-    // Invoke the reporter.
-
-    MinimizationReporter* reporter = reinterpret_cast<MinimizationReporter*>(data->reporter);
-    if (reporter->report(iteration-1, xout, gradout, args))
-        return 1;
-    return 0;
-}
-
 void LocalEnergyMinimizer::minimize(Context& context, double tolerance, int maxIterations, MinimizationReporter* reporter) {
-    const System& system = context.getSystem();
-    int numParticles = system.getNumParticles();
-    double constraintTol = context.getIntegrator().getConstraintTolerance();
-    double workingConstraintTol = std::max(1e-4, constraintTol);
-    double k = 100/workingConstraintTol;
-    lbfgsfloatval_t *x = lbfgs_malloc(numParticles*3);
-    if (x == NULL)
-        throw OpenMMException("LocalEnergyMinimizer: Failed to allocate memory");
-    try {
-
-        // Initialize the minimizer.
-
-        lbfgs_parameter_t param;
-        lbfgs_parameter_init(&param);
-        if (!context.getPlatform().supportsDoublePrecision())
-            param.xtol = 1e-7;
-        param.max_iterations = maxIterations;
-        param.linesearch = LBFGS_LINESEARCH_BACKTRACKING_STRONG_WOLFE;
-
-        // Make sure the initial configuration satisfies all constraints.
-
-        context.applyConstraints(workingConstraintTol);
-
-        // Record the initial positions and determine a normalization constant for scaling the tolerance.
-
-        vector<Vec3> initialPos = context.getState(State::Positions).getPositions();
-        double norm = 0.0;
-        for (int i = 0; i < numParticles; i++) {
-            x[3*i] = initialPos[i][0];
-            x[3*i+1] = initialPos[i][1];
-            x[3*i+2] = initialPos[i][2];
-            norm += initialPos[i].dot(initialPos[i]);
-        }
-        norm /= numParticles;
-        norm = (norm < 1 ? 1 : sqrt(norm));
-        param.epsilon = tolerance/norm;
-
-        // Repeatedly minimize, steadily increasing the strength of the springs until all constraints are satisfied.
-
-        double prevMaxError = 1e10;
-        MinimizerData data(context, k, reporter);
-        while (true) {
-            // Perform the minimization.
-
-            lbfgsfloatval_t fx;
-            lbfgs_progress_t reportFn = (reporter == NULL ? NULL : report);
-            lbfgs(numParticles*3, x, &fx, evaluate, reportFn, &data, &param);
-
-            // Check whether all constraints are satisfied.
-
-            vector<Vec3> positions = context.getState(State::Positions).getPositions();
-            int numConstraints = system.getNumConstraints();
-            double maxError = 0.0;
-            for (int i = 0; i < numConstraints; i++) {
-                int particle1, particle2;
-                double distance;
-                system.getConstraintParameters(i, particle1, particle2, distance);
-                Vec3 delta = positions[particle2]-positions[particle1];
-                double r = sqrt(delta.dot(delta));
-                double error = fabs(r-distance)/distance;
-                if (error > maxError)
-                    maxError = error;
-            }
-            if (maxError <= workingConstraintTol)
-                break; // All constraints are satisfied.
-            context.setPositions(initialPos);
-            if (maxError >= prevMaxError)
-                break; // Further tightening the springs doesn't seem to be helping, so just give up.
-            prevMaxError = maxError;
-            data.k *= 10;
-            if (maxError > 100*workingConstraintTol) {
-                // We've gotten far enough from a valid state that we might have trouble getting
-                // back, so reset to the original positions.
-
-                for (int i = 0; i < numParticles; i++) {
-                    x[3*i] = initialPos[i][0];
-                    x[3*i+1] = initialPos[i][1];
-                    x[3*i+2] = initialPos[i][2];
-                }
-            }
-        }
-    }
-    catch (...) {
-        lbfgs_free(x);
-        throw;
-    }
-    lbfgs_free(x);
-    
-    // If necessary, do a final constraint projection to make sure they are satisfied
-    // to the full precision requested by the user.
-    
-    if (constraintTol < workingConstraintTol)
-        context.applyConstraints(workingConstraintTol);
+    context.getImpl().minimize(tolerance, maxIterations, reporter);
 }
-
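
The removed `evaluate()` and `minimize()` functions treat each distance constraint as a harmonic spring with energy 0.5*k*(r - distance)^2, and judge progress by the relative error |r - distance|/distance. The sketch below isolates those two formulas for a single constraint. `Vec`, `addConstraintRestraint` and `constraintError` are hypothetical names, and the zero-mass check from the original code is left out for brevity.

```cpp
#include <array>
#include <cassert>
#include <cmath>

typedef std::array<double, 3> Vec;

// Distance between two points.
double separation(const Vec& p1, const Vec& p2) {
    double sum = 0.0;
    for (int i = 0; i < 3; i++)
        sum += (p2[i] - p1[i]) * (p2[i] - p1[i]);
    return std::sqrt(sum);
}

// Adds the restraint gradient for one constraint to g1 and g2 and returns
// the restraint energy 0.5*k*(r - distance)^2.
double addConstraintRestraint(const Vec& p1, const Vec& p2, double distance,
                              double k, Vec& g1, Vec& g2) {
    double r = separation(p1, p2);
    assert(r > 0.0);
    double dr = r - distance;
    double kdr = k * dr;
    for (int i = 0; i < 3; i++) {
        double unit = (p2[i] - p1[i]) / r;  // component of the unit vector p1 -> p2
        g1[i] -= kdr * unit;
        g2[i] += kdr * unit;
    }
    return 0.5 * kdr * dr;
}

// Relative constraint error, compared against the working tolerance.
double constraintError(const Vec& p1, const Vec& p2, double distance) {
    return std::fabs(separation(p1, p2) - distance) / distance;
}
```

For two particles 2 nm apart with a target distance of 1 nm and k = 10, the restraint energy is 5 and the gradient pushes the particles toward each other along the x axis. The original loop multiplies k by 10 whenever the largest such error still exceeds the working tolerance.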
--- a/platforms/common/include/openmm/common/CommonMinimizeKernel.h
+++ b/platforms/common/include/openmm/common/CommonMinimizeKernel.h
+#ifndef OPENMM_COMMONMINIMIZEKERNEL_H_
+#define OPENMM_COMMONMINIMIZEKERNEL_H_
+
+/* -------------------------------------------------------------------------- *
+ *                                   OpenMM                                   *
+ * -------------------------------------------------------------------------- *
+ * This is part of the OpenMM molecular simulation toolkit.                   *
+ * See https://openmm.org/development.                                        *
+ *                                                                            *
+ * Portions copyright (c) 2026 Stanford University and the Authors.           *
+ * Authors: Evan Pretti                                                       *
+ * Contributors:                                                              *
+ *                                                                            *
+ * Permission is hereby granted, free of charge, to any person obtaining a    *
+ * copy of this software and associated documentation files (the "Software"), *
+ * to deal in the Software without restriction, including without limitation  *
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,   *
+ * and/or sell copies of the Software, and to permit persons to whom the      *
+ * Software is furnished to do so, subject to the following conditions:       *
+ *                                                                            *
+ * The above copyright notice and this permission notice shall be included in *
+ * all copies or substantial portions of the Software.                        *
+ *                                                                            *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR *
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,   *
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL    *
+ * THE AUTHORS, CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,    *
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR      *
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE  *
+ * USE OR OTHER DEALINGS IN THE SOFTWARE.                                     *
+ * -------------------------------------------------------------------------- */
+
+#include "openmm/LocalEnergyMinimizer.h"
+#include "openmm/kernels.h"
+#include "openmm/common/ComputeContext.h"
+
+namespace OpenMM {
+
+/**
+ * This kernel performs local energy minimization.
+ */
+class CommonMinimizeKernel : public MinimizeKernel {
+public:
+    CommonMinimizeKernel(std::string name, const Platform& platform, ComputeContext& cc) : MinimizeKernel(name, platform), cc(cc), isSetup(false), cpuContext(NULL), cpuIntegrator(1) {
+    }
+    ~CommonMinimizeKernel();
+    /**
+     * Initialize the kernel.
+     *
+     * @param system     the System this kernel will be applied to
+     */
+    void initialize(const System& system);
+    /**
+     * Perform local energy minimization.
+     *
+     * @param context        the context with which to perform the minimization
+     * @param tolerance      limiting root-mean-square value of all force components in kJ/mol/nm for convergence
+     * @param maxIterations  the maximum number of iterations to perform, or 0 to continue until convergence
+     * @param reporter       an optional reporter to invoke after each iteration of minimization
+     */
+    void execute(ContextImpl& context, double tolerance, int maxIterations, MinimizationReporter* reporter);
+private:
+    static const double minConstraintTol, kRestraintScale, prevMaxErrorInit, kRestraintScaleUp, constraintTolScale;
+    static const double fTol, wolfeParam, stepScaleDown, stepScaleUp, minStep, maxStep;
+    static const int numVectors, maxLineSearchIterations;
+
+    void setup(ContextImpl& context);
+    void lbfgs(ContextImpl& context);
+    void evaluateGpu(ContextImpl& context);
+    double evaluateCpu(ContextImpl& context);
+    bool report(ContextImpl& context, int iteration);
+    void downloadReturnFlagStart();
+    void downloadReturnValueStart();
+    int downloadReturnFlagFinish();
+    double downloadReturnValueFinish();
+    double downloadReturnValueSync();
+    double downloadGradNormSync();
+    void runLineSearchKernels();
+
+    ComputeContext& cc;
+
+    int numParticles, numVariables, numConstraints;
+
+    std::vector<Vec3> hostPositions;
+    std::vector<double> hostX;
+    std::vector<double> hostGrad;
+    std::vector<mm_int2> hostConstraintIndices;
+    std::vector<double> hostConstraintDistances;
+
+    bool isSetup, mixedIsDouble;
+    int elementSize, threadBlockSize;
+    void* pinnedMemory;
+
+    int forceGroups;
+    double constraintTol;
+
+    double tolerance;
+    int maxIterations;
+    MinimizationReporter* reporter;
+
+    double kRestraint, energy;
+    bool largeGrad;
+
+    ComputeArray constraintIndices, constraintDistances;
+    ComputeArray xInit, x, xPrev, grad, gradPrev, dir;
+    ComputeArray alpha, scale, xDiff, gradDiff;
+    ComputeArray returnFlag, returnValue, gradNorm, lineSearchData, lineSearchDataBackup;
+
+    ComputeKernel recordInitialPosKernel;
+    ComputeKernel restorePosKernel;
+    ComputeKernel convertForcesKernel;
+    ComputeKernel getConstraintEnergyForcesKernel;
+    ComputeKernel getConstraintErrorKernel;
+    ComputeKernel initializeDirKernel;
+    ComputeKernel gradNormKernel;
+    ComputeKernel getDiffKernel;
+    ComputeKernel getScaleKernel;
+    ComputeKernel reinitializeDirKernel;
+    ComputeKernel updateDirAlphaKernel;
+    ComputeKernel scaleDirKernel;
+    ComputeKernel updateDirBetaKernel;
+    ComputeKernel updateDirFinalKernel;
+    ComputeKernel lineSearchSetupKernel;
+    ComputeKernel lineSearchStepKernel;
+    ComputeKernel lineSearchDotKernel;
+    ComputeKernel lineSearchContinueKernel;
+
+    ComputeEvent downloadStartEvent;
+    ComputeEvent downloadFinishEvent;
+    ComputeQueue downloadQueue;
+
+    Context* cpuContext;
+    VerletIntegrator cpuIntegrator;
+};
+
+} // namespace OpenMM
+
+#endif // OPENMM_COMMONMINIMIZEKERNEL_H_
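
The member names in this header (`alpha`, `scale`, `xDiff`, `gradDiff`, and the `updateDirAlpha`/`scaleDir`/`updateDirBeta`/`updateDirFinal` kernels) suggest the standard L-BFGS two-loop recursion for building the search direction. The implementation files are not shown in this section, so the following is only the textbook CPU form of that recursion, under the assumption that the GPU kernels compute the same quantities. `Vector`, `dot` and `lbfgsDirection` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vector;

double dot(const Vector& a, const Vector& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); i++)
        sum += a[i] * b[i];
    return sum;
}

// Textbook two-loop recursion. s[i] and y[i] are the stored position and
// gradient differences, ordered oldest to newest. Returns -H*g, where H is
// the implicit inverse Hessian approximation.
Vector lbfgsDirection(const Vector& g, const std::vector<Vector>& s,
                      const std::vector<Vector>& y) {
    assert(s.size() == y.size());
    const std::size_t m = s.size();
    Vector q = g;
    std::vector<double> alpha(m), rho(m);
    // First loop, newest to oldest: alpha_i = rho_i * (s_i . q), q -= alpha_i * y_i.
    for (std::size_t i = m; i-- > 0;) {
        rho[i] = 1.0 / dot(y[i], s[i]);
        alpha[i] = rho[i] * dot(s[i], q);
        for (std::size_t j = 0; j < q.size(); j++)
            q[j] -= alpha[i] * y[i][j];
    }
    // Initial Hessian guess: scale by (s . y) / (y . y) of the newest pair.
    if (m > 0) {
        double gamma = dot(s[m - 1], y[m - 1]) / dot(y[m - 1], y[m - 1]);
        for (std::size_t j = 0; j < q.size(); j++)
            q[j] *= gamma;
    }
    // Second loop, oldest to newest: beta = rho_i * (y_i . q), q += (alpha_i - beta) * s_i.
    for (std::size_t i = 0; i < m; i++) {
        double beta = rho[i] * dot(y[i], q);
        for (std::size_t j = 0; j < q.size(); j++)
            q[j] += (alpha[i] - beta) * s[i][j];
    }
    for (std::size_t j = 0; j < q.size(); j++)
        q[j] = -q[j];
    return q;
}
```

With an empty history the function reduces to steepest descent. For a one-dimensional quadratic with curvature 4, a single stored pair (s = 1, y = 4) and gradient 8, it reproduces the Newton step of -2.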
--- a/platforms/common/src/CommonMinimizeKernel.cpp
+++ b/platforms/common/src/CommonMinimizeKernel.cpp
--- a/platforms/common/src/kernels/minimize.cc
+++ b/platforms/common/src/kernels/minimize.cc
--- a/platforms/cpu/tests/TestCpuLocalEnergyMinimizer.cpp
+++ b/platforms/cpu/tests/TestCpuLocalEnergyMinimizer.cpp
+/* -------------------------------------------------------------------------- *
+ *                                   OpenMM                                   *
+ * -------------------------------------------------------------------------- *
+ * This is part of the OpenMM molecular simulation toolkit.                   *
+ * See https://openmm.org/development.                                        *
+ *                                                                            *
+ * Portions copyright (c) 2026 Stanford University and the Authors.           *
+ * Authors: Evan Pretti                                                       *
+ * Contributors:                                                              *
+ *                                                                            *
+ * Permission is hereby granted, free of charge, to any person obtaining a    *
+ * copy of this software and associated documentation files (the "Software"), *
+ * to deal in the Software without restriction, including without limitation  *
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,   *
+ * and/or sell copies of the Software, and to permit persons to whom the      *
+ * Software is furnished to do so, subject to the following conditions:       *
+ *                                                                            *
+ * The above copyright notice and this permission notice shall be included in *
+ * all copies or substantial portions of the Software.                        *
+ *                                                                            *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR *
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,   *
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL    *
+ * THE AUTHORS, CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,    *
+ * DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR      *
+ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE  *
+ * USE OR OTHER DEALINGS IN THE SOFTWARE.                                     *
+ * -------------------------------------------------------------------------- */
+
+#include "CpuTests.h"
+#include "TestLocalEnergyMinimizer.h"
+
+void runPlatformTests() {
+}
--- a/platforms/cuda/src/CudaContext.cpp
+++ b/platforms/cuda/src/CudaContext.cpp
@@ -291,6 +291,7 @@ CudaContext::CudaContext(const System& system, int deviceIndex, bool useBlocking
    compilationDefines["ERF"] = useDoublePrecision ? "erf" : "erff";
    compilationDefines["ERFC"] = useDoublePrecision ? "erfc" : "erfcf";
    compilationDefines["FMA"] = useDoublePrecision ? "fma" : "fmaf";
+    compilationDefines["FABS"] = useDoublePrecision ? "fabs" : "fabsf";

    // Set defines for applying periodic boundary conditions.


--- a/platforms/cuda/src/CudaKernelFactory.cpp
+++ b/platforms/cuda/src/CudaKernelFactory.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2025 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -35,6 +35,7 @@
 #include "openmm/common/CommonIntegrateCustomStepKernel.h"
 #include "openmm/common/CommonIntegrateNoseHooverStepKernel.h"
 #include "openmm/common/CommonIntegrateQTBStepKernel.h"
+#include "openmm/common/CommonMinimizeKernel.h"
 #include "openmm/internal/ContextImpl.h"
 #include "openmm/OpenMMException.h"

@@ -83,6 +84,8 @@ KernelImpl* CudaKernelFactory::createKernelImpl(std::string name, const Platform
        return new CommonApplyConstraintsKernel(name, platform, cu);
    if (name == VirtualSitesKernel::Name())
        return new CommonVirtualSitesKernel(name, platform, cu);
+    if (name == MinimizeKernel::Name())
+        return new CommonMinimizeKernel(name, platform, cu);
    if (name == CalcHarmonicBondForceKernel::Name())
        return new CommonCalcHarmonicBondForceKernel(name, platform, cu, context.getSystem());
    if (name == CalcCustomBondForceKernel::Name())

--- a/platforms/cuda/src/CudaPlatform.cpp
+++ b/platforms/cuda/src/CudaPlatform.cpp
@@ -73,6 +73,7 @@ CudaPlatform::CudaPlatform() {
    registerKernelFactory(UpdateStateDataKernel::Name(), factory);
    registerKernelFactory(ApplyConstraintsKernel::Name(), factory);
    registerKernelFactory(VirtualSitesKernel::Name(), factory);
+    registerKernelFactory(MinimizeKernel::Name(), factory);
    registerKernelFactory(CalcHarmonicBondForceKernel::Name(), factory);
    registerKernelFactory(CalcCustomBondForceKernel::Name(), factory);
    registerKernelFactory(CalcHarmonicAngleForceKernel::Name(), factory);

--- a/platforms/cuda/src/kernels/common.cu
+++ b/platforms/cuda/src/kernels/common.cu
@@ -18,6 +18,7 @@
 #define SYNC_THREADS __syncthreads();
 #define MEM_FENCE __threadfence_block();
 #define ATOMIC_ADD(dest, value) atomicAdd(dest, value)
+#define FLT_MAX 3.40282347e+38f

 typedef long long mm_long;
 typedef unsigned long long mm_ulong;

--- a/platforms/hip/src/HipContext.cpp
+++ b/platforms/hip/src/HipContext.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2009-2025 Stanford University and the Authors.      *
+ * Portions copyright (c) 2009-2026 Stanford University and the Authors.      *
 * Portions copyright (c) 2020-2023 Advanced Micro Devices, Inc.              *
 * Authors: Peter Eastman, Nicholas Curtis                                    *
 * Contributors:                                                              *
@@ -294,6 +294,7 @@ HipContext::HipContext(const System& system, int deviceIndex, bool useBlockingSy
    compilationDefines["ERF"] = useDoublePrecision ? "erf" : "erff";
    compilationDefines["ERFC"] = useDoublePrecision ? "erfc" : "erfcf";
    compilationDefines["FMA"] = useDoublePrecision ? "fma" : "fmaf";
+    compilationDefines["FABS"] = useDoublePrecision ? "fabs" : "fabsf";

    // Set defines for applying periodic boundary conditions.


--- a/platforms/hip/src/HipKernelFactory.cpp
+++ b/platforms/hip/src/HipKernelFactory.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2025 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Portions copyright (c) 2020 Advanced Micro Devices, Inc.                   *
 * Authors: Peter Eastman, Nicholas Curtis                                    *
 * Contributors:                                                              *
@@ -36,6 +36,7 @@
 #include "openmm/common/CommonIntegrateCustomStepKernel.h"
 #include "openmm/common/CommonIntegrateNoseHooverStepKernel.h"
 #include "openmm/common/CommonIntegrateQTBStepKernel.h"
+#include "openmm/common/CommonMinimizeKernel.h"
 #include "openmm/internal/ContextImpl.h"
 #include "openmm/OpenMMException.h"

@@ -84,6 +85,8 @@ KernelImpl* HipKernelFactory::createKernelImpl(std::string name, const Platform&
        return new CommonApplyConstraintsKernel(name, platform, cu);
    if (name == VirtualSitesKernel::Name())
        return new CommonVirtualSitesKernel(name, platform, cu);
+    if (name == MinimizeKernel::Name())
+        return new CommonMinimizeKernel(name, platform, cu);
    if (name == CalcHarmonicBondForceKernel::Name())
        return new CommonCalcHarmonicBondForceKernel(name, platform, cu, context.getSystem());
    if (name == CalcCustomBondForceKernel::Name())

--- a/platforms/hip/src/HipPlatform.cpp
+++ b/platforms/hip/src/HipPlatform.cpp
@@ -74,6 +74,7 @@ HipPlatform::HipPlatform() {
    registerKernelFactory(UpdateStateDataKernel::Name(), factory);
    registerKernelFactory(ApplyConstraintsKernel::Name(), factory);
    registerKernelFactory(VirtualSitesKernel::Name(), factory);
+    registerKernelFactory(MinimizeKernel::Name(), factory);
    registerKernelFactory(CalcHarmonicBondForceKernel::Name(), factory);
    registerKernelFactory(CalcCustomBondForceKernel::Name(), factory);
    registerKernelFactory(CalcHarmonicAngleForceKernel::Name(), factory);

--- a/platforms/hip/src/kernels/common.hip
+++ b/platforms/hip/src/kernels/common.hip
@@ -19,6 +19,7 @@
 #define SYNC_WARPS {__builtin_amdgcn_wave_barrier(); __builtin_amdgcn_fence(__ATOMIC_ACQ_REL, "wavefront");}
 #define MEM_FENCE __threadfence_block();
 #define ATOMIC_ADD(dest, value) atomicAdd(dest, value)
+#define FLT_MAX 3.40282347e+38f

 typedef long long mm_long;
 typedef unsigned long long mm_ulong;

--- a/platforms/opencl/src/OpenCLContext.cpp
+++ b/platforms/opencl/src/OpenCLContext.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2009-2025 Stanford University and the Authors.      *
+ * Portions copyright (c) 2009-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -433,6 +433,7 @@ OpenCLContext::OpenCLContext(const System& system, int platformIndex, int device
    compilationDefines["ERF"] = "erf";
    compilationDefines["ERFC"] = "erfc";
    compilationDefines["FMA"] = "fma";
+    compilationDefines["FABS"] = "fabs";

    // Set defines for applying periodic boundary conditions.


--- a/platforms/opencl/src/OpenCLKernelFactory.cpp
+++ b/platforms/opencl/src/OpenCLKernelFactory.cpp
@@ -4,7 +4,7 @@
 * This is part of the OpenMM molecular simulation toolkit.                   *
 * See https://openmm.org/development.                                        *
 *                                                                            *
- * Portions copyright (c) 2008-2025 Stanford University and the Authors.      *
+ * Portions copyright (c) 2008-2026 Stanford University and the Authors.      *
 * Authors: Peter Eastman                                                     *
 * Contributors:                                                              *
 *                                                                            *
@@ -33,6 +33,7 @@
 #include "openmm/common/CommonIntegrateCustomStepKernel.h"
 #include "openmm/common/CommonIntegrateNoseHooverStepKernel.h"
 #include "openmm/common/CommonIntegrateQTBStepKernel.h"
+#include "openmm/common/CommonMinimizeKernel.h"
 #include "openmm/internal/ContextImpl.h"
 #include "openmm/OpenMMException.h"

@@ -81,6 +82,8 @@ KernelImpl* OpenCLKernelFactory::createKernelImpl(std::string name, const Platfo
        return new CommonApplyConstraintsKernel(name, platform, cl);
    if (name == VirtualSitesKernel::Name())
        return new CommonVirtualSitesKernel(name, platform, cl);
+    if (name == MinimizeKernel::Name())
+        return new CommonMinimizeKernel(name, platform, cl);
    if (name == CalcHarmonicBondForceKernel::Name())
        return new CommonCalcHarmonicBondForceKernel(name, platform, cl, context.getSystem());
    if (name == CalcCustomBondForceKernel::Name())