Converted AMOEBA to common platform (#3120)

* Began converting AMOEBA to common platform * Beginning of OpenCL platform for AMOEBA * Converted AmoebaVdwForce to common platform * Cleaned up reference AMOEBA tests * Began converting AmoebaMultipoleForce to common platform * Continue converting AmoebaMultipoleForce to common platform * Bug fixes * Bug fix * Continue converting AmoebaMultipoleForce to common platform * Converting AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce to common platform * Converting AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce to common platform * Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce * Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce * Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce * Converted arrays from real3 to real * Bug fix to OpenCL AmoebaGeneralizedKirkwoodForce * Fixes for AMD GPUs * Began converting HippoNonbondedForce to common platform * Continuing to convert HippoNonbondedForce to common platform * Continuing to convert HippoNonbondedForce to common platform * Working on unifying PME kernels * Fixed error on devices without 64 bit atomics * Unified PME kernels * Converted HippoNonbondedForce to common platform * Creating OpenCL implementation of HippoNonbondedForce * Continuing OpenCL implementation of HippoNonbondedForce * Mostly finished OpenCL implementation of HippoNonbondedForce * Eliminated three component vector types in host code * Fix errors on CPU OpenCL * Skip double precision tests for AMOEBA on OpenCL * Bug fixes * Bug fixes * Fixed compilation error

Converted AMOEBA to common platform (#3120)
* Began converting AMOEBA to common platform * Beginning of OpenCL platform for AMOEBA * Converted AmoebaVdwForce to common platform * Cleaned up reference AMOEBA tests * Began converting AmoebaMultipoleForce to common platform * Continue converting AmoebaMultipoleForce to common platform * Bug fixes * Bug fix * Continue converting AmoebaMultipoleForce to common platform * Converting AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce to common platform * Converting AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce to common platform * Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce * Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce * Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce * Converted arrays from real3 to real * Bug fix to OpenCL AmoebaGeneralizedKirkwoodForce * Fixes for AMD GPUs * Began converting HippoNonbondedForce to common platform * Continuing to convert HippoNonbondedForce to common platform * Continuing to convert HippoNonbondedForce to common platform * Working on unifying PME kernels * Fixed error on devices without 64 bit atomics * Unified PME kernels * Converted HippoNonbondedForce to common platform * Creating OpenCL implementation of HippoNonbondedForce * Continuing OpenCL implementation of HippoNonbondedForce * Mostly finished OpenCL implementation of HippoNonbondedForce * Eliminated three component vector types in host code * Fix errors on CPU OpenCL * Skip double precision tests for AMOEBA on OpenCL * Bug fixes * Bug fixes * Fixed compilation error
8e8923a7 · Peter Eastman · GitHub · 393a4dbd · 8e8923a7 · 8e8923a7
Unverified Commit 8e8923a7 authored May 22, 2021 by Peter Eastman Committed by GitHub May 22, 2021
20 changed files
--- a/docs-source/developerguide/developer.rst
+++ b/docs-source/developerguide/developer.rst
@@ -975,4 +975,14 @@ CUDA both define types for them, but they have different names, and in any case
 you want to avoid using OpenCL-specific or CUDA-specific types in common code.
 OpenMM therefore defines types for vectors in host code.  They have the same
 names as the corresponding types in device code, only with the prefix :code:`mm_`\ ,
-for example :code:`mm_int2` and :code:`mm_float3`\ .
+for example :code:`mm_int2` and :code:`mm_float4`\ .
\ No newline at end of file
+Three component vectors need special care in this context, because the platforms
+define them differently.  In OpenCL, a three component vector is essentially a
+four component vector whose last component is ignored.  For example,
+:code:`sizeof(float3)` is 12 in CUDA but 16 in OpenCL.  Within a kernel this
+distinction can usually be ignored, but when communicating between host and
+device it becomes vitally important.  It is generally best to avoid storing
+three component vectors in arrays or passing them as arguments.  There are no
+:code:`mm_` host types defined for three component vectors, because CUDA and
+OpenCL would require them to be defined in different ways.
\ No newline at end of file
--- a/platforms/common/include/openmm/common/ComputeContext.h
+++ b/platforms/common/include/openmm/common/ComputeContext.h
@@ -65,6 +65,8 @@ public:
    class ReorderListener;
    class ForcePreComputation;
    class ForcePostComputation;
+    static const int ThreadBlockSize;
+    static const int TileSize;
    ComputeContext(const System& system);
    virtual ~ComputeContext();
    /**
@@ -122,6 +124,13 @@ public:
     * @param defines            a set of preprocessor definitions (name, value) to define when compiling the program
     */
    virtual ComputeProgram compileProgram(const std::string source, const std::map<std::string, std::string>& defines=std::map<std::string, std::string>()) = 0;
+    /**
+     * Compute the largest thread block size that can be used for a kernel that requires a particular amount of
+     * shared memory per thread.
+     * 
+     * @param memory        the number of bytes of shared memory per thread
+     */
+    virtual int computeThreadBlockSize(double memory) const = 0;
    /**
     * Set all elements of an array to 0.
     */
@@ -278,6 +287,10 @@ public:
    int getPaddedNumAtoms() const {
        return paddedNumAtoms;
    }
+    /**
+     * Get the number of blocks of TileSize atoms.
+     */
+    virtual int getNumAtomBlocks() const = 0;
    /**
     * Get the standard number of thread blocks to use when executing kernels.
     */
@@ -398,6 +411,13 @@ public:
     * Get the NonbondedUtilities for this context.
     */
    virtual NonbondedUtilities& getNonbondedUtilities() = 0;
+    /**
+     * Create a new NonbondedUtilities for use with this context.  This should be called
+     * only in unusual situations, when a Force needs its own NonbondedUtilities object
+     * separate from the standard one.  The caller is responsible for deleting the object
+     * when it is no longer needed.
+     */
+    virtual NonbondedUtilities* createNonbondedUtilities() = 0;
    /**
     * This should be called by the Integrator from its own initialize() method.
     * It ensures all contexts are fully initialized.

--- a/platforms/common/include/openmm/common/ComputeKernel.h
+++ b/platforms/common/include/openmm/common/ComputeKernel.h
@@ -97,6 +97,10 @@ public:
    void setArg(int index, ArrayInterface& value) {
        setArrayArg(index, value);
    }
+    /**
+     * Get the maximum block size that can be used when executing this kernel.
+     */
+    virtual int getMaxBlockSize() const = 0;
    /**
     * Execute this kernel.
     *

--- a/platforms/common/include/openmm/common/ComputeVectorTypes.h
+++ b/platforms/common/include/openmm/common/ComputeVectorTypes.h
@@ -36,13 +36,6 @@ struct mm_short2 {
    mm_short2(short x, short y) : x(x), y(y) {
    }
 };
-struct mm_short3 {
-    short x, y, z, w;
-    mm_short3() {
-    }
-    mm_short3(short x, short y, short z) : x(x), y(y), z(z) {
-    }
-};
 struct mm_short4 {
    short x, y, z, w;
    mm_short4() {
@@ -57,13 +50,6 @@ struct mm_int2 {
    mm_int2(int x, int y) : x(x), y(y) {
    }
 };
-struct mm_int3 {
-    int x, y, z, w;
-    mm_int3() {
-    }
-    mm_int3(int x, int y, int z) : x(x), y(y), z(z) {
-    }
-};
 struct mm_int4 {
    int x, y, z, w;
    mm_int4() {
@@ -78,13 +64,6 @@ struct mm_float2 {
    mm_float2(float x, float y) : x(x), y(y) {
    }
 };
-struct mm_float3 {
-    float x, y, z, w;
-    mm_float3() {
-    }
-    mm_float3(float x, float y, float z) : x(x), y(y), z(z) {
-    }
-};
 struct mm_float4 {
    float x, y, z, w;
    mm_float4() {
@@ -99,13 +78,6 @@ struct mm_double2 {
    mm_double2(double x, double y) : x(x), y(y) {
    }
 };
-struct mm_double3 {
-    double x, y, z, w;
-    mm_double3() {
-    }
-    mm_double3(double x, double y, double z) : x(x), y(y), z(z) {
-    }
-};
 struct mm_double4 {
    double x, y, z, w;
    mm_double4() {

--- a/platforms/common/include/openmm/common/NonbondedUtilities.h
+++ b/platforms/common/include/openmm/common/NonbondedUtilities.h
@@ -158,6 +158,42 @@ public:
     * on the most recent call to prepareInteractions().
     */
    virtual ArrayInterface& getRebuildNeighborList() = 0;
+    /**
+     * Get the index of the first tile this context is responsible for processing.
+     */
+    virtual int getStartTileIndex() const = 0;
+    /**
+     * Get the total number of tiles this context is responsible for processing.
+     */
+    virtual int getNumTiles() const = 0;
+    /**
+     * Set whether to add padding to the cutoff distance when building the neighbor list.
+     * This increases the size of the neighbor list (and thus the cost of computing interactions),
+     * but also means we don't need to rebuild it every time step.  The default value is true,
+     * since usually this improves performance.  For very expensive interactions, however,
+     * it may be better to set this to false.
+     */
+    virtual void setUsePadding(bool padding) = 0;
+    /**
+     * Initialize this object in preparation for a simulation.
+     */
+    virtual void initialize(const System& system) = 0;
+    /**
+     * Prepare to compute interactions.  This updates the neighbor list.
+     */
+    virtual void prepareInteractions(int forceGroups) = 0;
+    /**
+     * Compute the nonbonded interactions.
+     * 
+     * @param forceGroups    the flags specifying which force groups to include
+     * @param includeForces  whether to compute forces
+     * @param includeEnergy  whether to compute the potential energy
+     */
+    virtual void computeInteractions(int forceGroups, bool includeForces, bool includeEnergy) = 0;
+    /**
+     * Set the source code for the main kernel.  It only needs to be changed in very unusual circumstances.
+     */
+    virtual void setKernelSource(const std::string& source) = 0;
 };
 } // namespace OpenMM

--- a/platforms/common/src/CommonKernels.cpp
+++ b/platforms/common/src/CommonKernels.cpp
@@ -7362,7 +7362,7 @@ void CommonRemoveCMMotionKernel::initialize(const System& system, const CMMotion
    cc.setAsCurrent();
    frequency = force.getFrequency();
    int numAtoms = cc.getNumAtoms();
-    cmMomentum.initialize<mm_float3>(cc, cc.getPaddedNumAtoms(), "cmMomentum");
+    cmMomentum.initialize<mm_float4>(cc, cc.getPaddedNumAtoms(), "cmMomentum");
    double totalMass = 0.0;
    for (int i = 0; i < numAtoms; i++)
        totalMass += system.getParticleMass(i);

--- a/platforms/common/src/ComputeContext.cpp
+++ b/platforms/common/src/ComputeContext.cpp
@@ -39,6 +39,9 @@
 using namespace OpenMM;
 using namespace std;
+const int ComputeContext::ThreadBlockSize = 64;
+const int ComputeContext::TileSize = 32;
 ComputeContext::ComputeContext(const System& system) : system(system), time(0.0), stepCount(0), computeForceCount(0), stepsSinceReorder(99999),
        atomsWereReordered(false), forcesValid(false), thread(NULL) {
    thread = new WorkThread();

--- a/platforms/opencl/src/kernels/pme.cl
+++ b/platforms/opencl/src/kernels/pme.cl
--- a/platforms/common/src/kernels/removeCM.cc
+++ b/platforms/common/src/kernels/removeCM.cc
@@ -2,9 +2,9 @@
 * Calculate the center of mass momentum.
 */
-KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT velm, GLOBAL float3* RESTRICT cmMomentum) {
+KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT velm, GLOBAL float4* RESTRICT cmMomentum) {
-    LOCAL float3 temp[64];
+    LOCAL float4 temp[64];
-    float3 cm = make_float3(0, 0, 0);
+    float4 cm = make_float4(0);
    for (int index = GLOBAL_ID; index < numAtoms; index += GLOBAL_SIZE) {
        mixed4 velocity = velm[index];
        if (velocity.w != 0) {
@@ -43,11 +43,11 @@ KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT
 * Remove center of mass motion.
 */
-KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT velm, GLOBAL const float3* RESTRICT cmMomentum) {
+KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT velm, GLOBAL const float4* RESTRICT cmMomentum) {
    // First sum all of the momenta that were calculated by individual groups.
-    LOCAL float3 temp[64];
+    LOCAL float4 temp[64];
-    float3 cm = make_float3(0, 0, 0);
+    float4 cm = make_float4(0);
    for (int index = LOCAL_ID; index < NUM_GROUPS; index += LOCAL_SIZE)
        cm += cmMomentum[index];
    int thread = LOCAL_ID;
@@ -68,7 +68,7 @@ KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT vel
    if (thread < 2)
        temp[thread] += temp[thread+2];
    SYNC_THREADS;
-    cm = make_float3(INVERSE_TOTAL_MASS*(temp[0].x+temp[1].x), INVERSE_TOTAL_MASS*(temp[0].y+temp[1].y), INVERSE_TOTAL_MASS*(temp[0].z+temp[1].z));
+    cm = make_float4(INVERSE_TOTAL_MASS*(temp[0].x+temp[1].x), INVERSE_TOTAL_MASS*(temp[0].y+temp[1].y), INVERSE_TOTAL_MASS*(temp[0].z+temp[1].z), 0);
    // Now remove the center of mass velocity from each atom.

--- a/platforms/cuda/include/CudaContext.h
+++ b/platforms/cuda/include/CudaContext.h
@@ -289,9 +289,8 @@ public:
     * shared memory per thread.
     * 
     * @param memory        the number of bytes of shared memory per thread
-     * @param preferShared  whether the kernel is set to prefer shared memory over cache
     */
-    int computeThreadBlockSize(double memory, bool preferShared=true) const;
+    int computeThreadBlockSize(double memory) const;
    /**
     * Set all elements of an array to 0.
     */
@@ -481,6 +480,15 @@ public:
    CudaNonbondedUtilities& getNonbondedUtilities() {
        return *nonbonded;
    }
+    /**
+     * Create a new NonbondedUtilities for use with this context.  This should be called
+     * only in unusual situations, when a Force needs its own NonbondedUtilities object
+     * separate from the standard one.  The caller is responsible for deleting the object
+     * when it is no longer needed.
+     */
+    CudaNonbondedUtilities* createNonbondedUtilities() {
+        return new CudaNonbondedUtilities(*this);
+    }
    /**
     * This should be called by the Integrator from its own initialize() method.
     * It ensures all contexts are fully initialized.

--- a/platforms/cuda/include/CudaKernel.h
+++ b/platforms/cuda/include/CudaKernel.h
@@ -52,6 +52,10 @@ public:
     * Get the name of this kernel.
     */
    std::string getName() const;
+    /**
+     * Get the maximum block size that can be used when executing this kernel.
+     */
+    int getMaxBlockSize() const;
    /**
     * Execute this kernel.
     *

--- a/platforms/cuda/src/CudaContext.cpp
+++ b/platforms/cuda/src/CudaContext.cpp
@@ -723,10 +723,9 @@ void CudaContext::executeKernel(CUfunction kernel, void** arguments, int threads
    }
 }
-int CudaContext::computeThreadBlockSize(double memory, bool preferShared) const {
+int CudaContext::computeThreadBlockSize(double memory) const {
-    int maxShared = 16*1024;
+    int maxShared;
-    if (computeCapability >= 2.0 && preferShared)
+    CHECK_RESULT2(cuDeviceGetAttribute(&maxShared, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, device), "Error querying device property");
-        maxShared = 48*1024;
    int max = (int) (maxShared/memory);
    if (max < 64)
        return 32;

--- a/platforms/cuda/src/CudaKernel.cpp
+++ b/platforms/cuda/src/CudaKernel.cpp
@@ -26,6 +26,7 @@
 #include "CudaKernel.h"
 #include "openmm/common/ComputeArray.h"
+#include "openmm/internal/AssertionUtilities.h"
 #include <cstring>
 #include <vector>
@@ -39,6 +40,14 @@ string CudaKernel::getName() const {
    return name;
 }
+int CudaKernel::getMaxBlockSize() const {
+    int size;
+    CUresult result = cuFuncGetAttribute(&size, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, kernel);
+    if (result != CUDA_SUCCESS)
+        throw OpenMMException("Error querying max thread block size: "+context.getErrorString(result));
+    return size;
+}
 void CudaKernel::execute(int threads, int blockSize) {
    int numArgs = arrayArgs.size();
    argPointers.resize(numArgs);
@@ -69,10 +78,12 @@ void CudaKernel::addEmptyArg() {
 }
 void CudaKernel::setArrayArg(int index, ArrayInterface& value) {
+    ASSERT_VALID_INDEX(index, arrayArgs);
    arrayArgs[index] = &context.unwrap(value);
 }
 void CudaKernel::setPrimitiveArg(int index, const void* value, int size) {
+    ASSERT_VALID_INDEX(index, primitiveArgs);
    if (size > sizeof(double4))
        throw OpenMMException("Unsupported value type for kernel argument");
    memcpy(&primitiveArgs[index], value, size);

--- a/platforms/cuda/src/CudaKernels.cpp
+++ b/platforms/cuda/src/CudaKernels.cpp
@@ -736,15 +736,13 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon
            pmeDefines["GRID_SIZE_Z"] = cu.intToString(gridSizeZ);
            pmeDefines["EPSILON_FACTOR"] = cu.doubleToString(sqrt(ONE_4PI_EPS0));
            pmeDefines["M_PI"] = cu.doubleToString(M_PI);
-            if (cu.getUseDoublePrecision())
+            if (cu.getUseDoublePrecision() || cu.getPlatformData().deterministicForces)
-                pmeDefines["USE_DOUBLE_PRECISION"] = "1";
+                pmeDefines["USE_FIXED_POINT_CHARGE_SPREADING"] = "1";
            if (usePmeStream)
                pmeDefines["USE_PME_STREAM"] = "1";
-            if (cu.getPlatformData().deterministicForces)
-                pmeDefines["USE_DETERMINISTIC_FORCES"] = "1";
            map<string, string> replacements;
            replacements["CHARGE"] = (usePosqCharges ? "pos.w" : "charges[atom]");
-            CUmodule module = cu.createModule(CudaKernelSources::vectorOps+cu.replaceStrings(CudaKernelSources::pme, replacements), pmeDefines);
+            CUmodule module = cu.createModule(CudaKernelSources::vectorOps+cu.replaceStrings(CommonKernelSources::pme, replacements), pmeDefines);
            if (cu.getPlatformData().useCpuPme && !doLJPME && usePosqCharges) {
                // Create the CPU PME kernel.
@@ -777,6 +775,8 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon
                    pmeDefines["RECIP_EXP_FACTOR"] = cu.doubleToString(M_PI*M_PI/(dispersionAlpha*dispersionAlpha));
                    pmeDefines["USE_LJPME"] = "1";
                    pmeDefines["CHARGE_FROM_SIGEPS"] = "1";
+                    if (cu.getUseDoublePrecision() || cu.getPlatformData().deterministicForces)
+                        pmeDefines["USE_FIXED_POINT_CHARGE_SPREADING"] = "1";
                    double invRCut6 = pow(force.getCutoffDistance(), -6);
                    double dalphaR = dispersionAlpha * force.getCutoffDistance();
                    double dar2 = dalphaR*dalphaR;
@@ -784,7 +784,7 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon
                    double multShift6 = -invRCut6*(1.0 - exp(-dar2) * (1.0 + dar2 + 0.5*dar4));
                    defines["INVCUT6"] = cu.doubleToString(invRCut6);
                    defines["MULTSHIFT6"] = cu.doubleToString(multShift6);
-                    module = cu.createModule(CudaKernelSources::vectorOps+CudaKernelSources::pme, pmeDefines);
+                    module = cu.createModule(CudaKernelSources::vectorOps+CommonKernelSources::pme, pmeDefines);
                    pmeDispersionFinishSpreadChargeKernel = cu.getKernel(module, "finishSpreadCharge");
                    pmeDispersionGridIndexKernel = cu.getKernel(module, "findAtomGridIndex");
                    pmeDispersionSpreadChargeKernel = cu.getKernel(module, "gridSpreadCharge");
@@ -1177,11 +1177,11 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
        cu.getPeriodicBoxVectors(boxVectors[0], boxVectors[1], boxVectors[2]);
        double determinant = boxVectors[0][0]*boxVectors[1][1]*boxVectors[2][2];
        double scale = 1.0/determinant;
-        double3 recipBoxVectors[3];
+        double4 recipBoxVectors[3];
-        recipBoxVectors[0] = make_double3(boxVectors[1][1]*boxVectors[2][2]*scale, 0, 0);
+        recipBoxVectors[0] = make_double4(boxVectors[1][1]*boxVectors[2][2]*scale, 0, 0, 0);
-        recipBoxVectors[1] = make_double3(-boxVectors[1][0]*boxVectors[2][2]*scale, boxVectors[0][0]*boxVectors[2][2]*scale, 0);
+        recipBoxVectors[1] = make_double4(-boxVectors[1][0]*boxVectors[2][2]*scale, boxVectors[0][0]*boxVectors[2][2]*scale, 0, 0);
-        recipBoxVectors[2] = make_double3((boxVectors[1][0]*boxVectors[2][1]-boxVectors[1][1]*boxVectors[2][0])*scale, -boxVectors[0][0]*boxVectors[2][1]*scale, boxVectors[0][0]*boxVectors[1][1]*scale);
+        recipBoxVectors[2] = make_double4((boxVectors[1][0]*boxVectors[2][1]-boxVectors[1][1]*boxVectors[2][0])*scale, -boxVectors[0][0]*boxVectors[2][1]*scale, boxVectors[0][0]*boxVectors[1][1]*scale, 0);
-        float3 recipBoxVectorsFloat[3];
+        float4 recipBoxVectorsFloat[3];
        void* recipBoxVectorPointer[3];
        if (cu.getUseDoublePrecision()) {
            recipBoxVectorPointer[0] = &recipBoxVectors[0];
@@ -1189,9 +1189,9 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
            recipBoxVectorPointer[2] = &recipBoxVectors[2];
        }
        else {
-            recipBoxVectorsFloat[0] = make_float3((float) recipBoxVectors[0].x, 0, 0);
+            recipBoxVectorsFloat[0] = make_float4((float) recipBoxVectors[0].x, 0, 0, 0);
-            recipBoxVectorsFloat[1] = make_float3((float) recipBoxVectors[1].x, (float) recipBoxVectors[1].y, 0);
+            recipBoxVectorsFloat[1] = make_float4((float) recipBoxVectors[1].x, (float) recipBoxVectors[1].y, 0, 0);
-            recipBoxVectorsFloat[2] = make_float3((float) recipBoxVectors[2].x, (float) recipBoxVectors[2].y, (float) recipBoxVectors[2].z);
+            recipBoxVectorsFloat[2] = make_float4((float) recipBoxVectors[2].x, (float) recipBoxVectors[2].y, (float) recipBoxVectors[2].z, 0);
            recipBoxVectorPointer[0] = &recipBoxVectorsFloat[0];
            recipBoxVectorPointer[1] = &recipBoxVectorsFloat[1];
            recipBoxVectorPointer[2] = &recipBoxVectorsFloat[2];
@@ -1234,13 +1234,13 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
            if (includeEnergy) {
                void* computeEnergyArgs[] = {&pmeGrid2.getDevicePointer(), usePmeStream ? &pmeEnergyBuffer.getDevicePointer() : &cu.getEnergyBuffer().getDevicePointer(),
                        &pmeBsplineModuliX.getDevicePointer(), &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(),
-                        cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
+                        recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
                cu.executeKernel(pmeEvalEnergyKernel, computeEnergyArgs, gridSizeX*gridSizeY*gridSizeZ);
            }
-            void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &cu.getEnergyBuffer().getDevicePointer(),
+            void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &pmeBsplineModuliX.getDevicePointer(),
-                    &pmeBsplineModuliX.getDevicePointer(), &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(),
+                    &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(),
-                    cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
+                    recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
            cu.executeKernel(pmeConvolutionKernel, convolutionArgs, gridSizeX*gridSizeY*gridSizeZ, 256);
            if (useCudaFFT) {
@@ -1304,13 +1304,13 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
            if (includeEnergy) {
                void* computeEnergyArgs[] = {&pmeGrid2.getDevicePointer(), usePmeStream ? &pmeEnergyBuffer.getDevicePointer() : &cu.getEnergyBuffer().getDevicePointer(),
                        &pmeDispersionBsplineModuliX.getDevicePointer(), &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(),
-                        cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
+                        recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
                cu.executeKernel(pmeEvalDispersionEnergyKernel, computeEnergyArgs, dispersionGridSizeX*dispersionGridSizeY*dispersionGridSizeZ);
            }
-            void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &cu.getEnergyBuffer().getDevicePointer(),
+            void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &pmeDispersionBsplineModuliX.getDevicePointer(),
-                    &pmeDispersionBsplineModuliX.getDevicePointer(), &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(),
+                    &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(),
-                    cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
+                    recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
            cu.executeKernel(pmeDispersionConvolutionKernel, convolutionArgs, dispersionGridSizeX*dispersionGridSizeY*dispersionGridSizeZ, 256);
            if (useCudaFFT) {

--- a/platforms/cuda/src/CudaNonbondedUtilities.cpp
+++ b/platforms/cuda/src/CudaNonbondedUtilities.cpp
@@ -117,7 +117,7 @@ void CudaNonbondedUtilities::addInteraction(bool usesCutoff, bool usesPeriodic,
 void CudaNonbondedUtilities::addParameter(ComputeParameterInfo parameter) {
    parameters.push_back(ParameterInfo(parameter.getName(), parameter.getComponentType(), parameter.getNumComponents(),
-            parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer()));
+            parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer(), parameter.isConstant()));
 }
 void CudaNonbondedUtilities::addParameter(const ParameterInfo& parameter) {
@@ -126,7 +126,7 @@ void CudaNonbondedUtilities::addParameter(const ParameterInfo& parameter) {
 void CudaNonbondedUtilities::addArgument(ComputeParameterInfo parameter) {
    arguments.push_back(ParameterInfo(parameter.getName(), parameter.getComponentType(), parameter.getNumComponents(),
-            parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer()));
+            parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer(), parameter.isConstant()));
 }
 void CudaNonbondedUtilities::addArgument(const ParameterInfo& parameter) {
@@ -311,8 +311,8 @@ void CudaNonbondedUtilities::initialize(const System& system) {
    }
    for (int i = 0; i < (int) parameters.size(); i++)
        forceArgs.push_back(&parameters[i].getMemory());
-    for (int i = 0; i < (int) arguments.size(); i++)
+    for (ParameterInfo& arg : arguments)
-        forceArgs.push_back(&arguments[i].getMemory());
+        forceArgs.push_back(&arg.getMemory());
    if (energyParameterDerivatives.size() > 0)
        forceArgs.push_back(&context.getEnergyParamDerivBuffer().getDevicePointer());
    if (useCutoff) {
@@ -513,44 +513,44 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
    const string suffixes[] = {"x", "y", "z", "w"};
    stringstream localData;
    int localDataSize = 0;
-    for (int i = 0; i < (int) params.size(); i++) {
+    for (const ParameterInfo& param : params) {
-        if (params[i].getNumComponents() == 1)
+        if (param.getNumComponents() == 1)
-            localData<<params[i].getType()<<" "<<params[i].getName()<<";\n";
+            localData<<param.getType()<<" "<<param.getName()<<";\n";
        else {
-            for (int j = 0; j < params[i].getNumComponents(); ++j)
+            for (int j = 0; j < param.getNumComponents(); ++j)
-                localData<<params[i].getComponentType()<<" "<<params[i].getName()<<"_"<<suffixes[j]<<";\n";
+                localData<<param.getComponentType()<<" "<<param.getName()<<"_"<<suffixes[j]<<";\n";
        }
-        localDataSize += params[i].getSize();
+        localDataSize += param.getSize();
    }
    replacements["ATOM_PARAMETER_DATA"] = localData.str();
    stringstream args;
-    for (int i = 0; i < (int) params.size(); i++) {
+    for (const ParameterInfo& param : params) {
        args << ", ";
-        if (params[i].isConstant())
+        if (param.isConstant())
            args << "const ";
-        args << params[i].getType();
+        args << param.getType();
        args << "* __restrict__ global_";
-        args << params[i].getName();
+        args << param.getName();
    }
-    for (int i = 0; i < (int) arguments.size(); i++) {
+    for (const ParameterInfo& arg : arguments) {
        args << ", ";
-        if (arguments[i].isConstant())
+        if (arg.isConstant())
            args << "const ";
-        args << arguments[i].getType();
+        args << arg.getType();
        args << "* __restrict__ ";
-        args << arguments[i].getName();
+        args << arg.getName();
    }
    if (energyParameterDerivatives.size() > 0)
        args << ", mixed* __restrict__ energyParamDerivs";
    replacements["PARAMETER_ARGUMENTS"] = args.str();
    stringstream load1;
-    for (int i = 0; i < (int) params.size(); i++) {
+    for (const ParameterInfo& param : params) {
-        load1 << params[i].getType();
+        load1 << param.getType();
        load1 << " ";
-        load1 << params[i].getName();
+        load1 << param.getName();
        load1 << "1 = global_";
-        load1 << params[i].getName();
+        load1 << param.getName();
        load1 << "[atom1];\n";
    }
    replacements["LOAD_ATOM1_PARAMETERS"] = load1.str();
@@ -564,34 +564,30 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
    if(useShuffle) {
        // not needed if using shuffles as we can directly fetch from register
    } else {
-        for (int i = 0; i < (int) params.size(); i++) {
+        for (const ParameterInfo& param : params) {
-            if (params[i].getNumComponents() == 1) {
+            if (param.getNumComponents() == 1)
-                loadLocal1<<"localData[threadIdx.x]."<<params[i].getName()<<" = "<<params[i].getName()<<"1;\n";
+                loadLocal1<<"localData[LOCAL_ID]."<<param.getName()<<" = "<<param.getName()<<"1;\n";
-            }
            else {
-                for (int j = 0; j < params[i].getNumComponents(); ++j)
+                for (int j = 0; j < param.getNumComponents(); ++j)
-                    loadLocal1<<"localData[threadIdx.x]."<<params[i].getName()<<"_"<<suffixes[j]<<" = "<<params[i].getName()<<"1."<<suffixes[j]<<";\n";
+                    loadLocal1<<"localData[LOCAL_ID]."<<param.getName()<<"_"<<suffixes[j]<<" = "<<param.getName()<<"1."<<suffixes[j]<<";\n";
            }
        }
    }
    replacements["LOAD_LOCAL_PARAMETERS_FROM_1"] = loadLocal1.str();
    stringstream broadcastWarpData;
-    if(useShuffle) {
+    if (useShuffle) {
        broadcastWarpData << "posq2.x = real_shfl(shflPosq.x, j);\n";
        broadcastWarpData << "posq2.y = real_shfl(shflPosq.y, j);\n";
        broadcastWarpData << "posq2.z = real_shfl(shflPosq.z, j);\n";
        broadcastWarpData << "posq2.w = real_shfl(shflPosq.w, j);\n";
-        for(int i=0; i< (int) params.size();i++) {
+        for (const ParameterInfo& param : params) {
-            broadcastWarpData << params[i].getType() << " shfl" << params[i].getName() << ";\n";
+            broadcastWarpData << param.getType() << " shfl" << param.getName() << ";\n";
-            for(int j=0; j < params[i].getNumComponents(); j++) {
+            for (int j = 0; j < param.getNumComponents(); j++) {
-                string name;
+                if (param.getNumComponents() == 1)
-                if (params[i].getNumComponents() == 1) {
+                    broadcastWarpData << "shfl" << param.getName() << "=real_shfl(" << param.getName() <<"1,j);\n";
-                    broadcastWarpData << "shfl" << params[i].getName() << "=real_shfl(" << params[i].getName() <<"1,j);\n";
+                else
+                    broadcastWarpData << "shfl" << param.getName()+"."+suffixes[j] << "=real_shfl(" << param.getName()+"1."+suffixes[j] <<",j);\n";
-                } else {
-                    broadcastWarpData << "shfl" << params[i].getName()+"."+suffixes[j] << "=real_shfl(" << params[i].getName()+"1."+suffixes[j] <<",j);\n";
-                }
            }
        }
    } else {
@@ -601,49 +597,48 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
    // Part 2. Defines for off-diagonal exclusions, and neighborlist tiles. 
    stringstream declareLocal2;
-    if(useShuffle) {
+    if (useShuffle) {
-        for(int i=0; i< (int) params.size(); i++) {
+        for (const ParameterInfo& param : params)
-            declareLocal2<<params[i].getType()<<" shfl"<<params[i].getName()<<";\n";
+            declareLocal2<<param.getType()<<" shfl"<<param.getName()<<";\n";
-        }
+    }
-    } else {
+    else {
        // not used if using shared memory
    }
    replacements["DECLARE_LOCAL_PARAMETERS"] = declareLocal2.str();
    stringstream loadLocal2;
-    if(useShuffle) {
+    if (useShuffle) {
-        for(int i=0; i< (int) params.size(); i++) {
+        for (const ParameterInfo& param : params)
-            loadLocal2<<"shfl"<<params[i].getName()<<" = global_"<<params[i].getName()<<"[j];\n";
+            loadLocal2<<"shfl"<<param.getName()<<" = global_"<<param.getName()<<"[j];\n";
-        }
+    }
-    } else {
+    else {
-        for (int i = 0; i < (int) params.size(); i++) {
+        for (const ParameterInfo& param : params) {
-            if (params[i].getNumComponents() == 1) {
+            if (param.getNumComponents() == 1)
-                loadLocal2<<"localData[threadIdx.x]."<<params[i].getName()<<" = global_"<<params[i].getName()<<"[j];\n";
+                loadLocal2<<"localData[LOCAL_ID]."<<param.getName()<<" = global_"<<param.getName()<<"[j];\n";
-            }
            else {
-                loadLocal2<<params[i].getType()<<" temp_"<<params[i].getName()<<" = global_"<<params[i].getName()<<"[j];\n";
+                loadLocal2<<param.getType()<<" temp_"<<param.getName()<<" = global_"<<param.getName()<<"[j];\n";
-                for (int j = 0; j < params[i].getNumComponents(); ++j)
+                for (int j = 0; j < param.getNumComponents(); ++j)
-                    loadLocal2<<"localData[threadIdx.x]."<<params[i].getName()<<"_"<<suffixes[j]<<" = temp_"<<params[i].getName()<<"."<<suffixes[j]<<";\n";
+                    loadLocal2<<"localData[LOCAL_ID]."<<param.getName()<<"_"<<suffixes[j]<<" = temp_"<<param.getName()<<"."<<suffixes[j]<<";\n";
            }
        }
    }
    replacements["LOAD_LOCAL_PARAMETERS_FROM_GLOBAL"] = loadLocal2.str();
    stringstream load2j;
-    if(useShuffle) {
+    if (useShuffle) {
-        for(int i = 0; i < (int) params.size(); i++)
+        for (const ParameterInfo& param : params)
-            load2j<<params[i].getType()<<" "<<params[i].getName()<<"2 = shfl"<<params[i].getName()<<";\n";
+            load2j<<param.getType()<<" "<<param.getName()<<"2 = shfl"<<param.getName()<<";\n";
-    } else {
+    }
-        for (int i = 0; i < (int) params.size(); i++) {
+    else {
-            if (params[i].getNumComponents() == 1) {
+        for (const ParameterInfo& param : params) {
-                load2j<<params[i].getType()<<" "<<params[i].getName()<<"2 = localData[atom2]."<<params[i].getName()<<";\n";
+            if (param.getNumComponents() == 1)
-            }
+                load2j<<param.getType()<<" "<<param.getName()<<"2 = localData[atom2]."<<param.getName()<<";\n";
            else {
-                load2j<<params[i].getType()<<" "<<params[i].getName()<<"2 = make_"<<params[i].getType()<<"(";
+                load2j<<param.getType()<<" "<<param.getName()<<"2 = make_"<<param.getType()<<"(";
-                for (int j = 0; j < params[i].getNumComponents(); ++j) {
+                for (int j = 0; j < param.getNumComponents(); ++j) {
                    if (j > 0)
                        load2j<<", ";
-                    load2j<<"localData[atom2]."<<params[i].getName()<<"_"<<suffixes[j];
+                    load2j<<"localData[atom2]."<<param.getName()<<"_"<<suffixes[j];
                }
                load2j<<");\n";
            }
@@ -652,16 +647,16 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
    replacements["LOAD_ATOM2_PARAMETERS"] = load2j.str();
    stringstream clearLocal;
-    for (int i = 0; i < (int) params.size(); i++) {
+    for (const ParameterInfo& param : params) {
        if (useShuffle)
            clearLocal<<"shfl";
        else
            clearLocal<<"localData[atom2].";
-        clearLocal<<params[i].getName()<<" = ";
+        clearLocal<<param.getName()<<" = ";
-        if (params[i].getNumComponents() == 1)
+        if (param.getNumComponents() == 1)
            clearLocal<<"0;\n";
        else
-            clearLocal<<"make_"<<params[i].getType()<<"(0);\n";
+            clearLocal<<"make_"<<param.getType()<<"(0);\n";
    }
    replacements["CLEAR_LOCAL_PARAMETERS"] = clearLocal.str();
@@ -675,7 +670,7 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
    for (int i = 0; i < energyParameterDerivatives.size(); i++)
        for (int index = 0; index < numDerivs; index++)
            if (allParamDerivNames[index] == energyParameterDerivatives[i])
-                saveDerivs<<"energyParamDerivs[(blockIdx.x*blockDim.x+threadIdx.x)*"<<numDerivs<<"+"<<index<<"] += energyParamDeriv"<<i<<";\n";
+                saveDerivs<<"energyParamDerivs[GLOBAL_ID*"<<numDerivs<<"+"<<index<<"] += energyParamDeriv"<<i<<";\n";
    replacements["SAVE_DERIVATIVES"] = saveDerivs.str();
    stringstream shuffleWarpData;
@@ -687,15 +682,15 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
        shuffleWarpData << "shflForce.x = real_shfl(shflForce.x, tgx+1);\n";
        shuffleWarpData << "shflForce.y = real_shfl(shflForce.y, tgx+1);\n";
        shuffleWarpData << "shflForce.z = real_shfl(shflForce.z, tgx+1);\n";
-        for(int i=0; i < (int) params.size(); i++) {
+        for (const ParameterInfo& param : params) {
-            if(params[i].getNumComponents() == 1) {
+            if (param.getNumComponents() == 1)
-                shuffleWarpData<<"shfl"<<params[i].getName()<<"=real_shfl(shfl"<<params[i].getName()<<", tgx+1);\n";
+                shuffleWarpData<<"shfl"<<param.getName()<<"=real_shfl(shfl"<<param.getName()<<", tgx+1);\n";
-            } else {
+            else {
-                for(int j=0;j<params[i].getNumComponents();j++) {
+                for (int j = 0; j < param.getNumComponents(); j++) {
                    // looks something like shflsigmaEpsilon.x = real_shfl(shflsigmaEpsilon.x,tgx+1);
-                    shuffleWarpData<<"shfl"<<params[i].getName()
+                    shuffleWarpData<<"shfl"<<param.getName()
                        <<"."<<suffixes[j]<<"=real_shfl(shfl"
-                        <<params[i].getName()<<"."<<suffixes[j]
+                        <<param.getName()<<"."<<suffixes[j]
                        <<", tgx+1);\n";
                }
            }

--- a/platforms/cuda/src/kernels/pme.cu
+++ b/platforms/cuda/src/kernels/pme.cu
-extern "C" __global__ void findAtomGridIndex(const real4* __restrict__ posq, int2* __restrict__ pmeAtomGridIndex,
-            real4 periodicBoxSize, real4 invPeriodicBoxSize, real4 periodicBoxVecX, real4 periodicBoxVecY, real4 periodicBoxVecZ,
-            real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ) {
-    // Compute the index of the grid point each atom is associated with.
-    for (int atom = blockIdx.x*blockDim.x+threadIdx.x; atom < NUM_ATOMS; atom += blockDim.x*gridDim.x) {
-        real4 pos = posq[atom];
-        APPLY_PERIODIC_TO_POS(pos)
-        real3 t = make_real3(pos.x*recipBoxVecX.x+pos.y*recipBoxVecY.x+pos.z*recipBoxVecZ.x,
-                             pos.y*recipBoxVecY.y+pos.z*recipBoxVecZ.y,
-                             pos.z*recipBoxVecZ.z);
-        t.x = (t.x-floor(t.x))*GRID_SIZE_X;
-        t.y = (t.y-floor(t.y))*GRID_SIZE_Y;
-        t.z = (t.z-floor(t.z))*GRID_SIZE_Z;
-        int3 gridIndex = make_int3(((int) t.x) % GRID_SIZE_X,
-                                   ((int) t.y) % GRID_SIZE_Y,
-                                   ((int) t.z) % GRID_SIZE_Z);
-        pmeAtomGridIndex[atom] = make_int2(atom, gridIndex.x*GRID_SIZE_Y*GRID_SIZE_Z+gridIndex.y*GRID_SIZE_Z+gridIndex.z);
-    }
-}
-extern "C" __global__ void gridSpreadCharge(const real4* __restrict__ posq, real* __restrict__ originalPmeGrid,
-        real4 periodicBoxSize, real4 invPeriodicBoxSize, real4 periodicBoxVecX, real4 periodicBoxVecY, real4 periodicBoxVecZ,
-        real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ, const int2* __restrict__ pmeAtomGridIndex
-#ifdef CHARGE_FROM_SIGEPS
-        , const float2* __restrict__ sigmaEpsilon
-#else
-        , const real* __restrict__ charges
-#endif
-        ) {
-    // To improve memory efficiency, we divide indices along the z axis into
-    // PME_ORDER blocks, where the data for each block is stored together.  We
-    // can ensure that all threads write to the same block at the same time,
-    // which leads to better coalescing of writes.
-    __shared__ int zindexTable[GRID_SIZE_Z+PME_ORDER];
-    int blockSize = (int) ceil(GRID_SIZE_Z/(real) PME_ORDER);
-    for (int i = threadIdx.x; i < GRID_SIZE_Z+PME_ORDER; i += blockDim.x) {
-        int zindex = i % GRID_SIZE_Z;
-	int block = zindex % PME_ORDER;
-        zindexTable[i] = zindex/PME_ORDER + block*GRID_SIZE_X*GRID_SIZE_Y*blockSize;
-    }
-    __syncthreads();
-    // Process the atoms in spatially sorted order.  This improves efficiency when writing
-    // the grid values.
-    real3 data[PME_ORDER];
-    const real scale = RECIP(PME_ORDER-1);
-    for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < NUM_ATOMS; i += blockDim.x*gridDim.x) {
-        int atom = pmeAtomGridIndex[i].x;
-        real4 pos = posq[atom];
-#ifdef CHARGE_FROM_SIGEPS
-        const float2 sigEps = sigmaEpsilon[atom];
-        const real charge = 8*sigEps.x*sigEps.x*sigEps.x*sigEps.y;
-#else
-        const real charge = (CHARGE)*EPSILON_FACTOR;
-#endif
-        APPLY_PERIODIC_TO_POS(pos)
-        real3 t = make_real3(pos.x*recipBoxVecX.x+pos.y*recipBoxVecY.x+pos.z*recipBoxVecZ.x,
-                             pos.y*recipBoxVecY.y+pos.z*recipBoxVecZ.y,
-                             pos.z*recipBoxVecZ.z);
-        t.x = (t.x-floor(t.x))*GRID_SIZE_X;
-        t.y = (t.y-floor(t.y))*GRID_SIZE_Y;
-        t.z = (t.z-floor(t.z))*GRID_SIZE_Z;
-        int3 gridIndex = make_int3(((int) t.x) % GRID_SIZE_X,
-                                   ((int) t.y) % GRID_SIZE_Y,
-                                   ((int) t.z) % GRID_SIZE_Z);
-        if (charge == 0)
-            continue;
-        // Since we need the full set of thetas, it's faster to compute them here than load them
-        // from global memory.
-        real3 dr = make_real3(t.x-(int) t.x, t.y-(int) t.y, t.z-(int) t.z);
-        data[PME_ORDER-1] = make_real3(0);
-        data[1] = dr;
-        data[0] = make_real3(1)-dr;
-        for (int j = 3; j < PME_ORDER; j++) {
-            real div = RECIP(j-1);
-            data[j-1] = div*dr*data[j-2];
-            for (int k = 1; k < (j-1); k++)
-                data[j-k-1] = div*((dr+make_real3(k))*data[j-k-2] + (make_real3(j-k)-dr)*data[j-k-1]);
-            data[0] = div*(make_real3(1)-dr)*data[0];
-        }
-        data[PME_ORDER-1] = scale*dr*data[PME_ORDER-2];
-        for (int j = 1; j < (PME_ORDER-1); j++)
-            data[PME_ORDER-j-1] = scale*((dr+make_real3(j))*data[PME_ORDER-j-2] + (make_real3(PME_ORDER-j)-dr)*data[PME_ORDER-j-1]);
-        data[0] = scale*(make_real3(1)-dr)*data[0];
-        // Spread the charge from this atom onto each grid point.
-	int izoffset = (PME_ORDER-(gridIndex.z%PME_ORDER)) % PME_ORDER;
-        for (int ix = 0; ix < PME_ORDER; ix++) {
-            int xbase = gridIndex.x+ix;
-            xbase -= (xbase >= GRID_SIZE_X ? GRID_SIZE_X : 0);
-            xbase = xbase*GRID_SIZE_Y;
-            real dx = charge*data[ix].x;
-            for (int iy = 0; iy < PME_ORDER; iy++) {
-                int ybase = gridIndex.y+iy;
-                ybase -= (ybase >= GRID_SIZE_Y ? GRID_SIZE_Y : 0);
-                ybase = (xbase+ybase)*blockSize;
-                real dxdy = dx*data[iy].y;
-                for (int i = 0; i < PME_ORDER; i++) {
-		    int iz = (i+izoffset) % PME_ORDER;
-                    int zindex = gridIndex.z+iz;
-                    int index = ybase + zindexTable[zindex];
-                    real add = dxdy*data[iz].z;
-#if defined(USE_DOUBLE_PRECISION) || defined(USE_DETERMINISTIC_FORCES)
-                    unsigned long long * ulonglong_p = (unsigned long long *) originalPmeGrid;
-                    atomicAdd(&ulonglong_p[index],  static_cast<unsigned long long>((long long) (add*0x100000000)));
-#else
-                    atomicAdd(&originalPmeGrid[index], add);
-#endif
-                }
-            }
-        }
-    }
-}
-extern "C" __global__ void finishSpreadCharge(
-#if defined(USE_DOUBLE_PRECISION) || defined(USE_DETERMINISTIC_FORCES)
-        const long long* __restrict__ grid1,
-#else
-        const real* __restrict__ grid1,
-#endif
-        real* __restrict__ grid2) {
-    // During charge spreading, we shuffled the order of indices along the z
-    // axis to make memory access more efficient.  We now need to unshuffle
-    // them.  If the values were accumulated as fixed point, we also need to
-    // convert them to floating point.
-    __shared__ int zindexTable[GRID_SIZE_Z];
-    int blockSize = (int) ceil(GRID_SIZE_Z/(real) PME_ORDER);
-    for (int i = threadIdx.x; i < GRID_SIZE_Z; i += blockDim.x) {
-	int block = i % PME_ORDER;
-        zindexTable[i] = i/PME_ORDER + block*GRID_SIZE_X*GRID_SIZE_Y*blockSize;
-    }
-    __syncthreads();
-    const unsigned int gridSize = GRID_SIZE_X*GRID_SIZE_Y*GRID_SIZE_Z;
-    real scale = 1/(real) 0x100000000;
-    for (int index = blockIdx.x*blockDim.x+threadIdx.x; index < gridSize; index += blockDim.x*gridDim.x) {
-        int zindex = index%GRID_SIZE_Z;
-        int loadIndex = zindexTable[zindex] + blockSize*(int) (index/GRID_SIZE_Z);
-#if defined(USE_DOUBLE_PRECISION) || defined(USE_DETERMINISTIC_FORCES)
-        grid2[index] = scale*grid1[loadIndex];
-#else
-        grid2[index] = grid1[loadIndex];
-#endif
-    }
-}
-// convolutes on the halfcomplex_pmeGrid, which is of size NX*NY*(NZ/2+1) as F(Q) is conjugate symmetric
-extern "C" __global__ void 
-reciprocalConvolution(real2* __restrict__ halfcomplex_pmeGrid, mixed* __restrict__ energyBuffer, 
-                      const real* __restrict__ pmeBsplineModuliX, const real* __restrict__ pmeBsplineModuliY, const real* __restrict__ pmeBsplineModuliZ, 
-                      real4 periodicBoxSize, real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ) {
-    // R2C stores into a half complex matrix where the last dimension is cut by half
-    const unsigned int gridSize = GRID_SIZE_X*GRID_SIZE_Y*(GRID_SIZE_Z/2+1);
-#ifdef USE_LJPME
-    const real recipScaleFactor = -2*M_PI*SQRT(M_PI)*RECIP(6*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
-    real bfac = M_PI / EWALD_ALPHA;
-    real fac1 = 2*M_PI*M_PI*M_PI*SQRT(M_PI);
-    real fac2 = EWALD_ALPHA*EWALD_ALPHA*EWALD_ALPHA;
-    real fac3 = -2*EWALD_ALPHA*M_PI*M_PI;
-#else
-    const real recipScaleFactor = RECIP(M_PI*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
-#endif
-    for (int index = blockIdx.x*blockDim.x+threadIdx.x; index < gridSize; index += blockDim.x*gridDim.x) {
-        // real indices
-        int kx = index/(GRID_SIZE_Y*(GRID_SIZE_Z/2+1));
-        int remainder = index-kx*GRID_SIZE_Y*(GRID_SIZE_Z/2+1);
-        int ky = remainder/(GRID_SIZE_Z/2+1);
-        int kz = remainder-ky*(GRID_SIZE_Z/2+1);
-        int mx = (kx < (GRID_SIZE_X+1)/2) ? kx : (kx-GRID_SIZE_X);
-        int my = (ky < (GRID_SIZE_Y+1)/2) ? ky : (ky-GRID_SIZE_Y);
-        int mz = (kz < (GRID_SIZE_Z+1)/2) ? kz : (kz-GRID_SIZE_Z);
-        real mhx = mx*recipBoxVecX.x;
-        real mhy = mx*recipBoxVecY.x+my*recipBoxVecY.y;
-        real mhz = mx*recipBoxVecZ.x+my*recipBoxVecZ.y+mz*recipBoxVecZ.z;
-        real bx = pmeBsplineModuliX[kx];
-        real by = pmeBsplineModuliY[ky];
-        real bz = pmeBsplineModuliZ[kz];
-        real2 grid = halfcomplex_pmeGrid[index];
-        real m2 = mhx*mhx+mhy*mhy+mhz*mhz;
-#ifdef USE_LJPME
-        real denom = recipScaleFactor/(bx*by*bz);
-        real m = SQRT(m2);
-        real m3 = m*m2;
-        real b = bfac*m;
-        real expfac = -b*b;
-        real expterm = EXP(expfac);
-        real erfcterm = ERFC(b);
-        real eterm = (fac1*erfcterm*m3 + expterm*(fac2 + fac3*m2)) * denom;
-        halfcomplex_pmeGrid[index] = make_real2(grid.x*eterm, grid.y*eterm);
-#else
-        real denom = m2*bx*by*bz;
-        real eterm = recipScaleFactor*EXP(-RECIP_EXP_FACTOR*m2)/denom;
-        if (kx != 0 || ky != 0 || kz != 0) {
-            halfcomplex_pmeGrid[index] = make_real2(grid.x*eterm, grid.y*eterm);
-        }
-#endif
-    }
-}
-extern "C" __global__ void
-gridEvaluateEnergy(real2* __restrict__ halfcomplex_pmeGrid, mixed* __restrict__ energyBuffer,
-                      const real* __restrict__ pmeBsplineModuliX, const real* __restrict__ pmeBsplineModuliY, const real* __restrict__ pmeBsplineModuliZ,
-                      real4 periodicBoxSize, real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ) {
-    // R2C stores into a half complex matrix where the last dimension is cut by half
-    const unsigned int gridSize = GRID_SIZE_X*GRID_SIZE_Y*GRID_SIZE_Z;
- #ifdef USE_LJPME
-    const real recipScaleFactor = -2*M_PI*SQRT(M_PI)*RECIP(6*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
-    real bfac = M_PI / EWALD_ALPHA;
-    real fac1 = 2*M_PI*M_PI*M_PI*SQRT(M_PI);
-    real fac2 = EWALD_ALPHA*EWALD_ALPHA*EWALD_ALPHA;
-    real fac3 = -2*EWALD_ALPHA*M_PI*M_PI;
-#else
-    const real recipScaleFactor = RECIP(M_PI*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
-#endif
-    mixed energy = 0;
-    for (int index = blockIdx.x*blockDim.x+threadIdx.x; index < gridSize; index += blockDim.x*gridDim.x) {
-        // real indices
-        int kx = index/(GRID_SIZE_Y*(GRID_SIZE_Z));
-        int remainder = index-kx*GRID_SIZE_Y*(GRID_SIZE_Z);
-        int ky = remainder/(GRID_SIZE_Z);
-        int kz = remainder-ky*(GRID_SIZE_Z);
-        int mx = (kx < (GRID_SIZE_X+1)/2) ? kx : (kx-GRID_SIZE_X);
-        int my = (ky < (GRID_SIZE_Y+1)/2) ? ky : (ky-GRID_SIZE_Y);
-        int mz = (kz < (GRID_SIZE_Z+1)/2) ? kz : (kz-GRID_SIZE_Z);
-        real mhx = mx*recipBoxVecX.x;
-        real mhy = mx*recipBoxVecY.x+my*recipBoxVecY.y;
-        real mhz = mx*recipBoxVecZ.x+my*recipBoxVecZ.y+mz*recipBoxVecZ.z;
-        real m2 = mhx*mhx+mhy*mhy+mhz*mhz;
-        real bx = pmeBsplineModuliX[kx];
-        real by = pmeBsplineModuliY[ky];
-        real bz = pmeBsplineModuliZ[kz];
-#ifdef USE_LJPME
-        real denom = recipScaleFactor/(bx*by*bz);
-        real m = SQRT(m2);
-        real m3 = m*m2;
-        real b = bfac*m;
-        real expfac = -b*b;
-        real expterm = EXP(expfac);
-        real erfcterm = ERFC(b);
-        real eterm = (fac1*erfcterm*m3 + expterm*(fac2 + fac3*m2)) * denom;
-#else
-        real denom = m2*bx*by*bz;
-        real eterm = recipScaleFactor*EXP(-RECIP_EXP_FACTOR*m2)/denom;
-#endif
-        if (kz >= (GRID_SIZE_Z/2+1)) {
-            kx = ((kx == 0) ? kx : GRID_SIZE_X-kx);
-            ky = ((ky == 0) ? ky : GRID_SIZE_Y-ky);
-            kz = GRID_SIZE_Z-kz;
-        } 
-        int indexInHalfComplexGrid = kz + ky*(GRID_SIZE_Z/2+1)+kx*(GRID_SIZE_Y*(GRID_SIZE_Z/2+1));
-        real2 grid = halfcomplex_pmeGrid[indexInHalfComplexGrid];
-#ifndef USE_LJPME
-        if (kx != 0 || ky != 0 || kz != 0)
-#endif
-            energy += eterm*(grid.x*grid.x + grid.y*grid.y);
-    }
-#if defined(USE_PME_STREAM) && !defined(USE_LJPME)
-    energyBuffer[blockIdx.x*blockDim.x+threadIdx.x] = 0.5f*energy;
-#else
-    energyBuffer[blockIdx.x*blockDim.x+threadIdx.x] += 0.5f*energy;
-#endif
-}
-extern "C" __global__
-void gridInterpolateForce(const real4* __restrict__ posq, unsigned long long* __restrict__ forceBuffers, const real* __restrict__ originalPmeGrid,
-        real4 periodicBoxSize, real4 invPeriodicBoxSize, real4 periodicBoxVecX, real4 periodicBoxVecY, real4 periodicBoxVecZ,
-        real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ, const int2* __restrict__ pmeAtomGridIndex
-#ifdef CHARGE_FROM_SIGEPS
-        , const float2* __restrict__ sigmaEpsilon
-#else
-        , const real* __restrict__ charges
-#endif
-        ) {
-    real3 data[PME_ORDER];
-    real3 ddata[PME_ORDER];
-    const real scale = RECIP(PME_ORDER-1);
-    // Process the atoms in spatially sorted order.  This improves cache performance when loading
-    // the grid values.
-    for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < NUM_ATOMS; i += blockDim.x*gridDim.x) {
-        int atom = pmeAtomGridIndex[i].x;
-        real3 force = make_real3(0);
-        real4 pos = posq[atom];
-        APPLY_PERIODIC_TO_POS(pos)
-        real3 t = make_real3(pos.x*recipBoxVecX.x+pos.y*recipBoxVecY.x+pos.z*recipBoxVecZ.x,
-                             pos.y*recipBoxVecY.y+pos.z*recipBoxVecZ.y,
-                             pos.z*recipBoxVecZ.z);
-        t.x = (t.x-floor(t.x))*GRID_SIZE_X;
-        t.y = (t.y-floor(t.y))*GRID_SIZE_Y;
-        t.z = (t.z-floor(t.z))*GRID_SIZE_Z;
-        int3 gridIndex = make_int3(((int) t.x) % GRID_SIZE_X,
-                                   ((int) t.y) % GRID_SIZE_Y,
-                                   ((int) t.z) % GRID_SIZE_Z);
-        // Since we need the full set of thetas, it's faster to compute them here than load them
-        // from global memory.
-        real3 dr = make_real3(t.x-(int) t.x, t.y-(int) t.y, t.z-(int) t.z);
-        data[PME_ORDER-1] = make_real3(0);
-        data[1] = dr;
-        data[0] = make_real3(1)-dr;
-        for (int j = 3; j < PME_ORDER; j++) {
-            real div = RECIP(j-1);
-            data[j-1] = div*dr*data[j-2];
-            for (int k = 1; k < (j-1); k++)
-                data[j-k-1] = div*((dr+make_real3(k))*data[j-k-2] + (make_real3(j-k)-dr)*data[j-k-1]);
-            data[0] = div*(make_real3(1)-dr)*data[0];
-        }
-        ddata[0] = -data[0];
-        for (int j = 1; j < PME_ORDER; j++)
-            ddata[j] = data[j-1]-data[j];
-        data[PME_ORDER-1] = scale*dr*data[PME_ORDER-2];
-        for (int j = 1; j < (PME_ORDER-1); j++)
-            data[PME_ORDER-j-1] = scale*((dr+make_real3(j))*data[PME_ORDER-j-2] + (make_real3(PME_ORDER-j)-dr)*data[PME_ORDER-j-1]);
-        data[0] = scale*(make_real3(1)-dr)*data[0];
-        // Compute the force on this atom.
-        for (int ix = 0; ix < PME_ORDER; ix++) {
-            int xbase = gridIndex.x+ix;
-            xbase -= (xbase >= GRID_SIZE_X ? GRID_SIZE_X : 0);
-            xbase = xbase*GRID_SIZE_Y*GRID_SIZE_Z;
-            real dx = data[ix].x;
-            real ddx = ddata[ix].x;
-            for (int iy = 0; iy < PME_ORDER; iy++) {
-                int ybase = gridIndex.y+iy;
-                ybase -= (ybase >= GRID_SIZE_Y ? GRID_SIZE_Y : 0);
-                ybase = xbase + ybase*GRID_SIZE_Z;
-                real dy = data[iy].y;
-                real ddy = ddata[iy].y;
-                for (int iz = 0; iz < PME_ORDER; iz++) {
-                    int zindex = gridIndex.z+iz;
-                    zindex -= (zindex >= GRID_SIZE_Z ? GRID_SIZE_Z : 0);
-                    int index = ybase + zindex;
-                    real gridvalue = originalPmeGrid[index];
-                    force.x += ddx*dy*data[iz].z*gridvalue;
-                    force.y += dx*ddy*data[iz].z*gridvalue;
-                    force.z += dx*dy*ddata[iz].z*gridvalue;
-                }
-            }
-        }
-#ifdef CHARGE_FROM_SIGEPS
-        const float2 sigEps = sigmaEpsilon[atom];
-        real q = 8*sigEps.x*sigEps.x*sigEps.x*sigEps.y;
-#else
-        real q = CHARGE*EPSILON_FACTOR;
-#endif
-        real forceX = -q*(force.x*GRID_SIZE_X*recipBoxVecX.x);
-        real forceY = -q*(force.x*GRID_SIZE_X*recipBoxVecY.x+force.y*GRID_SIZE_Y*recipBoxVecY.y);
-        real forceZ = -q*(force.x*GRID_SIZE_X*recipBoxVecZ.x+force.y*GRID_SIZE_Y*recipBoxVecZ.y+force.z*GRID_SIZE_Z*recipBoxVecZ.z);
-        atomicAdd(&forceBuffers[atom], static_cast<unsigned long long>((long long) (forceX*0x100000000)));
-        atomicAdd(&forceBuffers[atom+PADDED_NUM_ATOMS], static_cast<unsigned long long>((long long) (forceY*0x100000000)));
-        atomicAdd(&forceBuffers[atom+2*PADDED_NUM_ATOMS], static_cast<unsigned long long>((long long) (forceZ*0x100000000)));
-    }
-}
-extern "C" __global__
-void addForces(const real4* __restrict__ forces, unsigned long long* __restrict__ forceBuffers) {
-    for (int atom = blockIdx.x*blockDim.x+threadIdx.x; atom < NUM_ATOMS; atom += blockDim.x*gridDim.x) {
-        real4 f = forces[atom];
-        forceBuffers[atom] += static_cast<unsigned long long>((long long) (f.x*0x100000000));
-        forceBuffers[atom+PADDED_NUM_ATOMS] += static_cast<unsigned long long>((long long) (f.y*0x100000000));
-        forceBuffers[atom+2*PADDED_NUM_ATOMS] += static_cast<unsigned long long>((long long) (f.z*0x100000000));
-    }
-}
-extern "C" __global__
-void addEnergy(const mixed* __restrict__ pmeEnergyBuffer, mixed* __restrict__ energyBuffer, int bufferSize) {
-    for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < bufferSize; i += blockDim.x*gridDim.x)
-        energyBuffer[i] += pmeEnergyBuffer[i];
-}
--- a/platforms/opencl/include/OpenCLContext.h
+++ b/platforms/opencl/include/OpenCLContext.h
@@ -328,6 +328,13 @@ public:
     * @param blockSize    the size of each thread block to use
     */
    void executeKernel(cl::Kernel& kernel, int workUnits, int blockSize = -1);
+    /**
+     * Compute the largest thread block size that can be used for a kernel that requires a particular amount of
+     * shared memory per thread.
+     * 
+     * @param memory        the number of bytes of shared memory per thread
+     */
+    int computeThreadBlockSize(double memory) const;
    /**
     * Set all elements of an array to 0.
     */
@@ -604,6 +611,15 @@ public:
    OpenCLNonbondedUtilities& getNonbondedUtilities() {
        return *nonbonded;
    }
+    /**
+     * Create a new NonbondedUtilities for use with this context.  This should be called
+     * only in unusual situations, when a Force needs its own NonbondedUtilities object
+     * separate from the standard one.  The caller is responsible for deleting the object
+     * when it is no longer needed.
+     */
+    OpenCLNonbondedUtilities* createNonbondedUtilities() {
+        return new OpenCLNonbondedUtilities(*this);
+    }
    /**
     * This should be called by the Integrator from its own initialize() method.
     * It ensures all contexts are fully initialized.

--- a/platforms/opencl/include/OpenCLKernel.h
+++ b/platforms/opencl/include/OpenCLKernel.h
@@ -51,6 +51,10 @@ public:
     * Get the name of this kernel.
     */
    std::string getName() const;
+    /**
+     * Get the maximum block size that can be used when executing this kernel.
+     */
+    int getMaxBlockSize() const;
    /**
     * Execute this kernel.
     *

--- a/platforms/opencl/include/OpenCLKernels.h
+++ b/platforms/opencl/include/OpenCLKernels.h
@@ -288,8 +288,8 @@ private:
    cl::Kernel pmeDispersionAtomRangeKernel;
    cl::Kernel pmeZIndexKernel;
    cl::Kernel pmeDispersionZIndexKernel;
-    cl::Kernel pmeUpdateBsplinesKernel;
+    cl::Kernel pmeGridIndexKernel;
-    cl::Kernel pmeDispersionUpdateBsplinesKernel;
+    cl::Kernel pmeDispersionGridIndexKernel;
    cl::Kernel pmeSpreadChargeKernel;
    cl::Kernel pmeDispersionSpreadChargeKernel;
    cl::Kernel pmeFinishSpreadChargeKernel;

--- a/platforms/opencl/include/OpenCLNonbondedUtilities.h
+++ b/platforms/opencl/include/OpenCLNonbondedUtilities.h
@@ -296,6 +296,10 @@ public:
     * @param groups    the set of force groups
     */
    void createKernelsForGroups(int groups);
+    /**
+     * Set the source code for the main kernel.  It only needs to be changed in very unusual circumstances.
+     */
+    void setKernelSource(const std::string& source);
 private:
    class KernelSet;
    class BlockSortTrait;
@@ -330,6 +334,7 @@ private:
    int numForceBuffers, startTileIndex, startBlockIndex, numBlocks, maxExclusions, numForceThreadBlocks;
    int forceThreadBlockSize, interactingBlocksThreadBlockSize, groupFlags;
    long long numTiles;
+    std::string kernelSource;
 };
 /**
@@ -362,9 +367,10 @@ public:
     * @param numComponents  the number of components in the parameter
     * @param size           the size of the parameter in bytes
     * @param memory         the memory containing the parameter values
+     * @param constant       whether the memory should be marked as constant
     */
-    ParameterInfo(const std::string& name, const std::string& componentType, int numComponents, int size, cl::Memory& memory) :
+    ParameterInfo(const std::string& name, const std::string& componentType, int numComponents, int size, cl::Memory& memory, bool constant=true) :
-            name(name), componentType(componentType), numComponents(numComponents), size(size), memory(&memory) {
+            name(name), componentType(componentType), numComponents(numComponents), size(size), memory(&memory), constant(constant) {
        if (numComponents == 1)
            type = componentType;
        else {
@@ -391,12 +397,16 @@ public:
    cl::Memory& getMemory() const {
        return *memory;
    }
+    bool isConstant() const {
+        return constant;
+    }
 private:
    std::string name;
    std::string componentType;
    std::string type;
    int size, numComponents;
    cl::Memory* memory;
+    bool constant;
 };
 } // namespace OpenMM