Unverified Commit 8e8923a7 authored by Peter Eastman's avatar Peter Eastman Committed by GitHub
Browse files

Converted AMOEBA to common platform (#3120)

* Began converting AMOEBA to common platform

* Beginning of OpenCL platform for AMOEBA

* Converted AmoebaVdwForce to common platform

* Cleaned up reference AMOEBA tests

* Began converting AmoebaMultipoleForce to common platform

* Continue converting AmoebaMultipoleForce to common platform

* Bug fixes

* Bug fix

* Continue converting AmoebaMultipoleForce to common platform

* Converting AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce to common platform

* Converting AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce to common platform

* Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce

* Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce

* Creating OpenCL version of AmoebaMultipoleForce and AmoebaGeneralizedKirkwoodForce

* Converted arrays from real3 to real

* Bug fix to OpenCL AmoebaGeneralizedKirkwoodForce

* Fixes for AMD GPUs

* Began converting HippoNonbondedForce to common platform

* Continuing to convert HippoNonbondedForce to common platform

* Continuing to convert HippoNonbondedForce to common platform

* Working on unifying PME kernels

* Fixed error on devices without 64 bit atomics

* Unified PME kernels

* Converted HippoNonbondedForce to common platform

* Creating OpenCL implementation of HippoNonbondedForce

* Continuing OpenCL implementation of HippoNonbondedForce

* Mostly finished OpenCL implementation of HippoNonbondedForce

* Eliminated three component vector types in host code

* Fix errors on CPU OpenCL

* Skip double precision tests for AMOEBA on OpenCL

* Bug fixes

* Bug fixes

* Fixed compilation error
parent 393a4dbd
...@@ -975,4 +975,14 @@ CUDA both define types for them, but they have different names, and in any case ...@@ -975,4 +975,14 @@ CUDA both define types for them, but they have different names, and in any case
you want to avoid using OpenCL-specific or CUDA-specific types in common code. you want to avoid using OpenCL-specific or CUDA-specific types in common code.
OpenMM therefore defines types for vectors in host code. They have the same OpenMM therefore defines types for vectors in host code. They have the same
names as the corresponding types in device code, only with the prefix :code:`mm_`\ , names as the corresponding types in device code, only with the prefix :code:`mm_`\ ,
for example :code:`mm_int2` and :code:`mm_float3`\ . for example :code:`mm_int2` and :code:`mm_float4`\ .
\ No newline at end of file
Three component vectors need special care in this context, because the platforms
define them differently. In OpenCL, a three component vector is essentially a
four component vector whose last component is ignored. For example,
:code:`sizeof(float3)` is 12 in CUDA but 16 in OpenCL. Within a kernel this
distinction can usually be ignored, but when communicating between host and
device it becomes vitally important. It is generally best to avoid storing
three component vectors in arrays or passing them as arguments. There are no
:code:`mm_` host types defined for three component vectors, because CUDA and
OpenCL would require them to be defined in different ways.
\ No newline at end of file
...@@ -65,6 +65,8 @@ public: ...@@ -65,6 +65,8 @@ public:
class ReorderListener; class ReorderListener;
class ForcePreComputation; class ForcePreComputation;
class ForcePostComputation; class ForcePostComputation;
static const int ThreadBlockSize;
static const int TileSize;
ComputeContext(const System& system); ComputeContext(const System& system);
virtual ~ComputeContext(); virtual ~ComputeContext();
/** /**
...@@ -122,6 +124,13 @@ public: ...@@ -122,6 +124,13 @@ public:
* @param defines a set of preprocessor definitions (name, value) to define when compiling the program * @param defines a set of preprocessor definitions (name, value) to define when compiling the program
*/ */
virtual ComputeProgram compileProgram(const std::string source, const std::map<std::string, std::string>& defines=std::map<std::string, std::string>()) = 0; virtual ComputeProgram compileProgram(const std::string source, const std::map<std::string, std::string>& defines=std::map<std::string, std::string>()) = 0;
/**
* Compute the largest thread block size that can be used for a kernel that requires a particular amount of
* shared memory per thread.
*
* @param memory the number of bytes of shared memory per thread
*/
virtual int computeThreadBlockSize(double memory) const = 0;
/** /**
* Set all elements of an array to 0. * Set all elements of an array to 0.
*/ */
...@@ -278,6 +287,10 @@ public: ...@@ -278,6 +287,10 @@ public:
int getPaddedNumAtoms() const { int getPaddedNumAtoms() const {
return paddedNumAtoms; return paddedNumAtoms;
} }
/**
* Get the number of blocks of TileSize atoms.
*/
virtual int getNumAtomBlocks() const = 0;
/** /**
* Get the standard number of thread blocks to use when executing kernels. * Get the standard number of thread blocks to use when executing kernels.
*/ */
...@@ -398,6 +411,13 @@ public: ...@@ -398,6 +411,13 @@ public:
* Get the NonbondedUtilities for this context. * Get the NonbondedUtilities for this context.
*/ */
virtual NonbondedUtilities& getNonbondedUtilities() = 0; virtual NonbondedUtilities& getNonbondedUtilities() = 0;
/**
* Create a new NonbondedUtilities for use with this context. This should be called
* only in unusual situations, when a Force needs its own NonbondedUtilities object
* separate from the standard one. The caller is responsible for deleting the object
* when it is no longer needed.
*/
virtual NonbondedUtilities* createNonbondedUtilities() = 0;
/** /**
* This should be called by the Integrator from its own initialize() method. * This should be called by the Integrator from its own initialize() method.
* It ensures all contexts are fully initialized. * It ensures all contexts are fully initialized.
......
...@@ -97,6 +97,10 @@ public: ...@@ -97,6 +97,10 @@ public:
void setArg(int index, ArrayInterface& value) { void setArg(int index, ArrayInterface& value) {
setArrayArg(index, value); setArrayArg(index, value);
} }
/**
* Get the maximum block size that can be used when executing this kernel.
*/
virtual int getMaxBlockSize() const = 0;
/** /**
* Execute this kernel. * Execute this kernel.
* *
......
...@@ -36,13 +36,6 @@ struct mm_short2 { ...@@ -36,13 +36,6 @@ struct mm_short2 {
mm_short2(short x, short y) : x(x), y(y) { mm_short2(short x, short y) : x(x), y(y) {
} }
}; };
struct mm_short3 {
short x, y, z, w;
mm_short3() {
}
mm_short3(short x, short y, short z) : x(x), y(y), z(z) {
}
};
struct mm_short4 { struct mm_short4 {
short x, y, z, w; short x, y, z, w;
mm_short4() { mm_short4() {
...@@ -57,13 +50,6 @@ struct mm_int2 { ...@@ -57,13 +50,6 @@ struct mm_int2 {
mm_int2(int x, int y) : x(x), y(y) { mm_int2(int x, int y) : x(x), y(y) {
} }
}; };
struct mm_int3 {
int x, y, z, w;
mm_int3() {
}
mm_int3(int x, int y, int z) : x(x), y(y), z(z) {
}
};
struct mm_int4 { struct mm_int4 {
int x, y, z, w; int x, y, z, w;
mm_int4() { mm_int4() {
...@@ -78,13 +64,6 @@ struct mm_float2 { ...@@ -78,13 +64,6 @@ struct mm_float2 {
mm_float2(float x, float y) : x(x), y(y) { mm_float2(float x, float y) : x(x), y(y) {
} }
}; };
struct mm_float3 {
float x, y, z, w;
mm_float3() {
}
mm_float3(float x, float y, float z) : x(x), y(y), z(z) {
}
};
struct mm_float4 { struct mm_float4 {
float x, y, z, w; float x, y, z, w;
mm_float4() { mm_float4() {
...@@ -99,13 +78,6 @@ struct mm_double2 { ...@@ -99,13 +78,6 @@ struct mm_double2 {
mm_double2(double x, double y) : x(x), y(y) { mm_double2(double x, double y) : x(x), y(y) {
} }
}; };
struct mm_double3 {
double x, y, z, w;
mm_double3() {
}
mm_double3(double x, double y, double z) : x(x), y(y), z(z) {
}
};
struct mm_double4 { struct mm_double4 {
double x, y, z, w; double x, y, z, w;
mm_double4() { mm_double4() {
......
...@@ -158,6 +158,42 @@ public: ...@@ -158,6 +158,42 @@ public:
* on the most recent call to prepareInteractions(). * on the most recent call to prepareInteractions().
*/ */
virtual ArrayInterface& getRebuildNeighborList() = 0; virtual ArrayInterface& getRebuildNeighborList() = 0;
/**
* Get the index of the first tile this context is responsible for processing.
*/
virtual int getStartTileIndex() const = 0;
/**
* Get the total number of tiles this context is responsible for processing.
*/
virtual int getNumTiles() const = 0;
/**
* Set whether to add padding to the cutoff distance when building the neighbor list.
* This increases the size of the neighbor list (and thus the cost of computing interactions),
* but also means we don't need to rebuild it every time step. The default value is true,
* since usually this improves performance. For very expensive interactions, however,
* it may be better to set this to false.
*/
virtual void setUsePadding(bool padding) = 0;
/**
* Initialize this object in preparation for a simulation.
*/
virtual void initialize(const System& system) = 0;
/**
* Prepare to compute interactions. This updates the neighbor list.
*/
virtual void prepareInteractions(int forceGroups) = 0;
/**
* Compute the nonbonded interactions.
*
* @param forceGroups the flags specifying which force groups to include
* @param includeForces whether to compute forces
* @param includeEnergy whether to compute the potential energy
*/
virtual void computeInteractions(int forceGroups, bool includeForces, bool includeEnergy) = 0;
/**
* Set the source code for the main kernel. It only needs to be changed in very unusual circumstances.
*/
virtual void setKernelSource(const std::string& source) = 0;
}; };
} // namespace OpenMM } // namespace OpenMM
......
...@@ -7362,7 +7362,7 @@ void CommonRemoveCMMotionKernel::initialize(const System& system, const CMMotion ...@@ -7362,7 +7362,7 @@ void CommonRemoveCMMotionKernel::initialize(const System& system, const CMMotion
cc.setAsCurrent(); cc.setAsCurrent();
frequency = force.getFrequency(); frequency = force.getFrequency();
int numAtoms = cc.getNumAtoms(); int numAtoms = cc.getNumAtoms();
cmMomentum.initialize<mm_float3>(cc, cc.getPaddedNumAtoms(), "cmMomentum"); cmMomentum.initialize<mm_float4>(cc, cc.getPaddedNumAtoms(), "cmMomentum");
double totalMass = 0.0; double totalMass = 0.0;
for (int i = 0; i < numAtoms; i++) for (int i = 0; i < numAtoms; i++)
totalMass += system.getParticleMass(i); totalMass += system.getParticleMass(i);
......
...@@ -39,6 +39,9 @@ ...@@ -39,6 +39,9 @@
using namespace OpenMM; using namespace OpenMM;
using namespace std; using namespace std;
const int ComputeContext::ThreadBlockSize = 64;
const int ComputeContext::TileSize = 32;
ComputeContext::ComputeContext(const System& system) : system(system), time(0.0), stepCount(0), computeForceCount(0), stepsSinceReorder(99999), ComputeContext::ComputeContext(const System& system) : system(system), time(0.0), stepCount(0), computeForceCount(0), stepsSinceReorder(99999),
atomsWereReordered(false), forcesValid(false), thread(NULL) { atomsWereReordered(false), forcesValid(false), thread(NULL) {
thread = new WorkThread(); thread = new WorkThread();
......
...@@ -2,9 +2,9 @@ ...@@ -2,9 +2,9 @@
* Calculate the center of mass momentum. * Calculate the center of mass momentum.
*/ */
KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT velm, GLOBAL float3* RESTRICT cmMomentum) { KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT velm, GLOBAL float4* RESTRICT cmMomentum) {
LOCAL float3 temp[64]; LOCAL float4 temp[64];
float3 cm = make_float3(0, 0, 0); float4 cm = make_float4(0);
for (int index = GLOBAL_ID; index < numAtoms; index += GLOBAL_SIZE) { for (int index = GLOBAL_ID; index < numAtoms; index += GLOBAL_SIZE) {
mixed4 velocity = velm[index]; mixed4 velocity = velm[index];
if (velocity.w != 0) { if (velocity.w != 0) {
...@@ -43,11 +43,11 @@ KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT ...@@ -43,11 +43,11 @@ KERNEL void calcCenterOfMassMomentum(int numAtoms, GLOBAL const mixed4* RESTRICT
* Remove center of mass motion. * Remove center of mass motion.
*/ */
KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT velm, GLOBAL const float3* RESTRICT cmMomentum) { KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT velm, GLOBAL const float4* RESTRICT cmMomentum) {
// First sum all of the momenta that were calculated by individual groups. // First sum all of the momenta that were calculated by individual groups.
LOCAL float3 temp[64]; LOCAL float4 temp[64];
float3 cm = make_float3(0, 0, 0); float4 cm = make_float4(0);
for (int index = LOCAL_ID; index < NUM_GROUPS; index += LOCAL_SIZE) for (int index = LOCAL_ID; index < NUM_GROUPS; index += LOCAL_SIZE)
cm += cmMomentum[index]; cm += cmMomentum[index];
int thread = LOCAL_ID; int thread = LOCAL_ID;
...@@ -68,7 +68,7 @@ KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT vel ...@@ -68,7 +68,7 @@ KERNEL void removeCenterOfMassMomentum(int numAtoms, GLOBAL mixed4* RESTRICT vel
if (thread < 2) if (thread < 2)
temp[thread] += temp[thread+2]; temp[thread] += temp[thread+2];
SYNC_THREADS; SYNC_THREADS;
cm = make_float3(INVERSE_TOTAL_MASS*(temp[0].x+temp[1].x), INVERSE_TOTAL_MASS*(temp[0].y+temp[1].y), INVERSE_TOTAL_MASS*(temp[0].z+temp[1].z)); cm = make_float4(INVERSE_TOTAL_MASS*(temp[0].x+temp[1].x), INVERSE_TOTAL_MASS*(temp[0].y+temp[1].y), INVERSE_TOTAL_MASS*(temp[0].z+temp[1].z), 0);
// Now remove the center of mass velocity from each atom. // Now remove the center of mass velocity from each atom.
......
...@@ -289,9 +289,8 @@ public: ...@@ -289,9 +289,8 @@ public:
* shared memory per thread. * shared memory per thread.
* *
* @param memory the number of bytes of shared memory per thread * @param memory the number of bytes of shared memory per thread
* @param preferShared whether the kernel is set to prefer shared memory over cache
*/ */
int computeThreadBlockSize(double memory, bool preferShared=true) const; int computeThreadBlockSize(double memory) const;
/** /**
* Set all elements of an array to 0. * Set all elements of an array to 0.
*/ */
...@@ -481,6 +480,15 @@ public: ...@@ -481,6 +480,15 @@ public:
CudaNonbondedUtilities& getNonbondedUtilities() { CudaNonbondedUtilities& getNonbondedUtilities() {
return *nonbonded; return *nonbonded;
} }
/**
* Create a new NonbondedUtilities for use with this context. This should be called
* only in unusual situations, when a Force needs its own NonbondedUtilities object
* separate from the standard one. The caller is responsible for deleting the object
* when it is no longer needed.
*/
CudaNonbondedUtilities* createNonbondedUtilities() {
return new CudaNonbondedUtilities(*this);
}
/** /**
* This should be called by the Integrator from its own initialize() method. * This should be called by the Integrator from its own initialize() method.
* It ensures all contexts are fully initialized. * It ensures all contexts are fully initialized.
......
...@@ -52,6 +52,10 @@ public: ...@@ -52,6 +52,10 @@ public:
* Get the name of this kernel. * Get the name of this kernel.
*/ */
std::string getName() const; std::string getName() const;
/**
* Get the maximum block size that can be used when executing this kernel.
*/
int getMaxBlockSize() const;
/** /**
* Execute this kernel. * Execute this kernel.
* *
......
...@@ -723,10 +723,9 @@ void CudaContext::executeKernel(CUfunction kernel, void** arguments, int threads ...@@ -723,10 +723,9 @@ void CudaContext::executeKernel(CUfunction kernel, void** arguments, int threads
} }
} }
int CudaContext::computeThreadBlockSize(double memory, bool preferShared) const { int CudaContext::computeThreadBlockSize(double memory) const {
int maxShared = 16*1024; int maxShared;
if (computeCapability >= 2.0 && preferShared) CHECK_RESULT2(cuDeviceGetAttribute(&maxShared, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, device), "Error querying device property");
maxShared = 48*1024;
int max = (int) (maxShared/memory); int max = (int) (maxShared/memory);
if (max < 64) if (max < 64)
return 32; return 32;
......
...@@ -26,6 +26,7 @@ ...@@ -26,6 +26,7 @@
#include "CudaKernel.h" #include "CudaKernel.h"
#include "openmm/common/ComputeArray.h" #include "openmm/common/ComputeArray.h"
#include "openmm/internal/AssertionUtilities.h"
#include <cstring> #include <cstring>
#include <vector> #include <vector>
...@@ -39,6 +40,14 @@ string CudaKernel::getName() const { ...@@ -39,6 +40,14 @@ string CudaKernel::getName() const {
return name; return name;
} }
int CudaKernel::getMaxBlockSize() const {
int size;
CUresult result = cuFuncGetAttribute(&size, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, kernel);
if (result != CUDA_SUCCESS)
throw OpenMMException("Error querying max thread block size: "+context.getErrorString(result));
return size;
}
void CudaKernel::execute(int threads, int blockSize) { void CudaKernel::execute(int threads, int blockSize) {
int numArgs = arrayArgs.size(); int numArgs = arrayArgs.size();
argPointers.resize(numArgs); argPointers.resize(numArgs);
...@@ -69,10 +78,12 @@ void CudaKernel::addEmptyArg() { ...@@ -69,10 +78,12 @@ void CudaKernel::addEmptyArg() {
} }
void CudaKernel::setArrayArg(int index, ArrayInterface& value) { void CudaKernel::setArrayArg(int index, ArrayInterface& value) {
ASSERT_VALID_INDEX(index, arrayArgs);
arrayArgs[index] = &context.unwrap(value); arrayArgs[index] = &context.unwrap(value);
} }
void CudaKernel::setPrimitiveArg(int index, const void* value, int size) { void CudaKernel::setPrimitiveArg(int index, const void* value, int size) {
ASSERT_VALID_INDEX(index, primitiveArgs);
if (size > sizeof(double4)) if (size > sizeof(double4))
throw OpenMMException("Unsupported value type for kernel argument"); throw OpenMMException("Unsupported value type for kernel argument");
memcpy(&primitiveArgs[index], value, size); memcpy(&primitiveArgs[index], value, size);
......
...@@ -736,15 +736,13 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon ...@@ -736,15 +736,13 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon
pmeDefines["GRID_SIZE_Z"] = cu.intToString(gridSizeZ); pmeDefines["GRID_SIZE_Z"] = cu.intToString(gridSizeZ);
pmeDefines["EPSILON_FACTOR"] = cu.doubleToString(sqrt(ONE_4PI_EPS0)); pmeDefines["EPSILON_FACTOR"] = cu.doubleToString(sqrt(ONE_4PI_EPS0));
pmeDefines["M_PI"] = cu.doubleToString(M_PI); pmeDefines["M_PI"] = cu.doubleToString(M_PI);
if (cu.getUseDoublePrecision()) if (cu.getUseDoublePrecision() || cu.getPlatformData().deterministicForces)
pmeDefines["USE_DOUBLE_PRECISION"] = "1"; pmeDefines["USE_FIXED_POINT_CHARGE_SPREADING"] = "1";
if (usePmeStream) if (usePmeStream)
pmeDefines["USE_PME_STREAM"] = "1"; pmeDefines["USE_PME_STREAM"] = "1";
if (cu.getPlatformData().deterministicForces)
pmeDefines["USE_DETERMINISTIC_FORCES"] = "1";
map<string, string> replacements; map<string, string> replacements;
replacements["CHARGE"] = (usePosqCharges ? "pos.w" : "charges[atom]"); replacements["CHARGE"] = (usePosqCharges ? "pos.w" : "charges[atom]");
CUmodule module = cu.createModule(CudaKernelSources::vectorOps+cu.replaceStrings(CudaKernelSources::pme, replacements), pmeDefines); CUmodule module = cu.createModule(CudaKernelSources::vectorOps+cu.replaceStrings(CommonKernelSources::pme, replacements), pmeDefines);
if (cu.getPlatformData().useCpuPme && !doLJPME && usePosqCharges) { if (cu.getPlatformData().useCpuPme && !doLJPME && usePosqCharges) {
// Create the CPU PME kernel. // Create the CPU PME kernel.
...@@ -777,6 +775,8 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon ...@@ -777,6 +775,8 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon
pmeDefines["RECIP_EXP_FACTOR"] = cu.doubleToString(M_PI*M_PI/(dispersionAlpha*dispersionAlpha)); pmeDefines["RECIP_EXP_FACTOR"] = cu.doubleToString(M_PI*M_PI/(dispersionAlpha*dispersionAlpha));
pmeDefines["USE_LJPME"] = "1"; pmeDefines["USE_LJPME"] = "1";
pmeDefines["CHARGE_FROM_SIGEPS"] = "1"; pmeDefines["CHARGE_FROM_SIGEPS"] = "1";
if (cu.getUseDoublePrecision() || cu.getPlatformData().deterministicForces)
pmeDefines["USE_FIXED_POINT_CHARGE_SPREADING"] = "1";
double invRCut6 = pow(force.getCutoffDistance(), -6); double invRCut6 = pow(force.getCutoffDistance(), -6);
double dalphaR = dispersionAlpha * force.getCutoffDistance(); double dalphaR = dispersionAlpha * force.getCutoffDistance();
double dar2 = dalphaR*dalphaR; double dar2 = dalphaR*dalphaR;
...@@ -784,7 +784,7 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon ...@@ -784,7 +784,7 @@ void CudaCalcNonbondedForceKernel::initialize(const System& system, const Nonbon
double multShift6 = -invRCut6*(1.0 - exp(-dar2) * (1.0 + dar2 + 0.5*dar4)); double multShift6 = -invRCut6*(1.0 - exp(-dar2) * (1.0 + dar2 + 0.5*dar4));
defines["INVCUT6"] = cu.doubleToString(invRCut6); defines["INVCUT6"] = cu.doubleToString(invRCut6);
defines["MULTSHIFT6"] = cu.doubleToString(multShift6); defines["MULTSHIFT6"] = cu.doubleToString(multShift6);
module = cu.createModule(CudaKernelSources::vectorOps+CudaKernelSources::pme, pmeDefines); module = cu.createModule(CudaKernelSources::vectorOps+CommonKernelSources::pme, pmeDefines);
pmeDispersionFinishSpreadChargeKernel = cu.getKernel(module, "finishSpreadCharge"); pmeDispersionFinishSpreadChargeKernel = cu.getKernel(module, "finishSpreadCharge");
pmeDispersionGridIndexKernel = cu.getKernel(module, "findAtomGridIndex"); pmeDispersionGridIndexKernel = cu.getKernel(module, "findAtomGridIndex");
pmeDispersionSpreadChargeKernel = cu.getKernel(module, "gridSpreadCharge"); pmeDispersionSpreadChargeKernel = cu.getKernel(module, "gridSpreadCharge");
...@@ -1177,11 +1177,11 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF ...@@ -1177,11 +1177,11 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
cu.getPeriodicBoxVectors(boxVectors[0], boxVectors[1], boxVectors[2]); cu.getPeriodicBoxVectors(boxVectors[0], boxVectors[1], boxVectors[2]);
double determinant = boxVectors[0][0]*boxVectors[1][1]*boxVectors[2][2]; double determinant = boxVectors[0][0]*boxVectors[1][1]*boxVectors[2][2];
double scale = 1.0/determinant; double scale = 1.0/determinant;
double3 recipBoxVectors[3]; double4 recipBoxVectors[3];
recipBoxVectors[0] = make_double3(boxVectors[1][1]*boxVectors[2][2]*scale, 0, 0); recipBoxVectors[0] = make_double4(boxVectors[1][1]*boxVectors[2][2]*scale, 0, 0, 0);
recipBoxVectors[1] = make_double3(-boxVectors[1][0]*boxVectors[2][2]*scale, boxVectors[0][0]*boxVectors[2][2]*scale, 0); recipBoxVectors[1] = make_double4(-boxVectors[1][0]*boxVectors[2][2]*scale, boxVectors[0][0]*boxVectors[2][2]*scale, 0, 0);
recipBoxVectors[2] = make_double3((boxVectors[1][0]*boxVectors[2][1]-boxVectors[1][1]*boxVectors[2][0])*scale, -boxVectors[0][0]*boxVectors[2][1]*scale, boxVectors[0][0]*boxVectors[1][1]*scale); recipBoxVectors[2] = make_double4((boxVectors[1][0]*boxVectors[2][1]-boxVectors[1][1]*boxVectors[2][0])*scale, -boxVectors[0][0]*boxVectors[2][1]*scale, boxVectors[0][0]*boxVectors[1][1]*scale, 0);
float3 recipBoxVectorsFloat[3]; float4 recipBoxVectorsFloat[3];
void* recipBoxVectorPointer[3]; void* recipBoxVectorPointer[3];
if (cu.getUseDoublePrecision()) { if (cu.getUseDoublePrecision()) {
recipBoxVectorPointer[0] = &recipBoxVectors[0]; recipBoxVectorPointer[0] = &recipBoxVectors[0];
...@@ -1189,9 +1189,9 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF ...@@ -1189,9 +1189,9 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
recipBoxVectorPointer[2] = &recipBoxVectors[2]; recipBoxVectorPointer[2] = &recipBoxVectors[2];
} }
else { else {
recipBoxVectorsFloat[0] = make_float3((float) recipBoxVectors[0].x, 0, 0); recipBoxVectorsFloat[0] = make_float4((float) recipBoxVectors[0].x, 0, 0, 0);
recipBoxVectorsFloat[1] = make_float3((float) recipBoxVectors[1].x, (float) recipBoxVectors[1].y, 0); recipBoxVectorsFloat[1] = make_float4((float) recipBoxVectors[1].x, (float) recipBoxVectors[1].y, 0, 0);
recipBoxVectorsFloat[2] = make_float3((float) recipBoxVectors[2].x, (float) recipBoxVectors[2].y, (float) recipBoxVectors[2].z); recipBoxVectorsFloat[2] = make_float4((float) recipBoxVectors[2].x, (float) recipBoxVectors[2].y, (float) recipBoxVectors[2].z, 0);
recipBoxVectorPointer[0] = &recipBoxVectorsFloat[0]; recipBoxVectorPointer[0] = &recipBoxVectorsFloat[0];
recipBoxVectorPointer[1] = &recipBoxVectorsFloat[1]; recipBoxVectorPointer[1] = &recipBoxVectorsFloat[1];
recipBoxVectorPointer[2] = &recipBoxVectorsFloat[2]; recipBoxVectorPointer[2] = &recipBoxVectorsFloat[2];
...@@ -1234,13 +1234,13 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF ...@@ -1234,13 +1234,13 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
if (includeEnergy) { if (includeEnergy) {
void* computeEnergyArgs[] = {&pmeGrid2.getDevicePointer(), usePmeStream ? &pmeEnergyBuffer.getDevicePointer() : &cu.getEnergyBuffer().getDevicePointer(), void* computeEnergyArgs[] = {&pmeGrid2.getDevicePointer(), usePmeStream ? &pmeEnergyBuffer.getDevicePointer() : &cu.getEnergyBuffer().getDevicePointer(),
&pmeBsplineModuliX.getDevicePointer(), &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(), &pmeBsplineModuliX.getDevicePointer(), &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(),
cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]}; recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
cu.executeKernel(pmeEvalEnergyKernel, computeEnergyArgs, gridSizeX*gridSizeY*gridSizeZ); cu.executeKernel(pmeEvalEnergyKernel, computeEnergyArgs, gridSizeX*gridSizeY*gridSizeZ);
} }
void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &cu.getEnergyBuffer().getDevicePointer(), void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &pmeBsplineModuliX.getDevicePointer(),
&pmeBsplineModuliX.getDevicePointer(), &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(), &pmeBsplineModuliY.getDevicePointer(), &pmeBsplineModuliZ.getDevicePointer(),
cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]}; recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
cu.executeKernel(pmeConvolutionKernel, convolutionArgs, gridSizeX*gridSizeY*gridSizeZ, 256); cu.executeKernel(pmeConvolutionKernel, convolutionArgs, gridSizeX*gridSizeY*gridSizeZ, 256);
if (useCudaFFT) { if (useCudaFFT) {
...@@ -1304,13 +1304,13 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF ...@@ -1304,13 +1304,13 @@ double CudaCalcNonbondedForceKernel::execute(ContextImpl& context, bool includeF
if (includeEnergy) { if (includeEnergy) {
void* computeEnergyArgs[] = {&pmeGrid2.getDevicePointer(), usePmeStream ? &pmeEnergyBuffer.getDevicePointer() : &cu.getEnergyBuffer().getDevicePointer(), void* computeEnergyArgs[] = {&pmeGrid2.getDevicePointer(), usePmeStream ? &pmeEnergyBuffer.getDevicePointer() : &cu.getEnergyBuffer().getDevicePointer(),
&pmeDispersionBsplineModuliX.getDevicePointer(), &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(), &pmeDispersionBsplineModuliX.getDevicePointer(), &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(),
cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]}; recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
cu.executeKernel(pmeEvalDispersionEnergyKernel, computeEnergyArgs, dispersionGridSizeX*dispersionGridSizeY*dispersionGridSizeZ); cu.executeKernel(pmeEvalDispersionEnergyKernel, computeEnergyArgs, dispersionGridSizeX*dispersionGridSizeY*dispersionGridSizeZ);
} }
void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &cu.getEnergyBuffer().getDevicePointer(), void* convolutionArgs[] = {&pmeGrid2.getDevicePointer(), &pmeDispersionBsplineModuliX.getDevicePointer(),
&pmeDispersionBsplineModuliX.getDevicePointer(), &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(), &pmeDispersionBsplineModuliY.getDevicePointer(), &pmeDispersionBsplineModuliZ.getDevicePointer(),
cu.getPeriodicBoxSizePointer(), recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]}; recipBoxVectorPointer[0], recipBoxVectorPointer[1], recipBoxVectorPointer[2]};
cu.executeKernel(pmeDispersionConvolutionKernel, convolutionArgs, dispersionGridSizeX*dispersionGridSizeY*dispersionGridSizeZ, 256); cu.executeKernel(pmeDispersionConvolutionKernel, convolutionArgs, dispersionGridSizeX*dispersionGridSizeY*dispersionGridSizeZ, 256);
if (useCudaFFT) { if (useCudaFFT) {
......
...@@ -117,7 +117,7 @@ void CudaNonbondedUtilities::addInteraction(bool usesCutoff, bool usesPeriodic, ...@@ -117,7 +117,7 @@ void CudaNonbondedUtilities::addInteraction(bool usesCutoff, bool usesPeriodic,
void CudaNonbondedUtilities::addParameter(ComputeParameterInfo parameter) { void CudaNonbondedUtilities::addParameter(ComputeParameterInfo parameter) {
parameters.push_back(ParameterInfo(parameter.getName(), parameter.getComponentType(), parameter.getNumComponents(), parameters.push_back(ParameterInfo(parameter.getName(), parameter.getComponentType(), parameter.getNumComponents(),
parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer())); parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer(), parameter.isConstant()));
} }
void CudaNonbondedUtilities::addParameter(const ParameterInfo& parameter) { void CudaNonbondedUtilities::addParameter(const ParameterInfo& parameter) {
...@@ -126,7 +126,7 @@ void CudaNonbondedUtilities::addParameter(const ParameterInfo& parameter) { ...@@ -126,7 +126,7 @@ void CudaNonbondedUtilities::addParameter(const ParameterInfo& parameter) {
void CudaNonbondedUtilities::addArgument(ComputeParameterInfo parameter) { void CudaNonbondedUtilities::addArgument(ComputeParameterInfo parameter) {
arguments.push_back(ParameterInfo(parameter.getName(), parameter.getComponentType(), parameter.getNumComponents(), arguments.push_back(ParameterInfo(parameter.getName(), parameter.getComponentType(), parameter.getNumComponents(),
parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer())); parameter.getSize(), context.unwrap(parameter.getArray()).getDevicePointer(), parameter.isConstant()));
} }
void CudaNonbondedUtilities::addArgument(const ParameterInfo& parameter) { void CudaNonbondedUtilities::addArgument(const ParameterInfo& parameter) {
...@@ -311,8 +311,8 @@ void CudaNonbondedUtilities::initialize(const System& system) { ...@@ -311,8 +311,8 @@ void CudaNonbondedUtilities::initialize(const System& system) {
} }
for (int i = 0; i < (int) parameters.size(); i++) for (int i = 0; i < (int) parameters.size(); i++)
forceArgs.push_back(&parameters[i].getMemory()); forceArgs.push_back(&parameters[i].getMemory());
for (int i = 0; i < (int) arguments.size(); i++) for (ParameterInfo& arg : arguments)
forceArgs.push_back(&arguments[i].getMemory()); forceArgs.push_back(&arg.getMemory());
if (energyParameterDerivatives.size() > 0) if (energyParameterDerivatives.size() > 0)
forceArgs.push_back(&context.getEnergyParamDerivBuffer().getDevicePointer()); forceArgs.push_back(&context.getEnergyParamDerivBuffer().getDevicePointer());
if (useCutoff) { if (useCutoff) {
...@@ -513,44 +513,44 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source, ...@@ -513,44 +513,44 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
const string suffixes[] = {"x", "y", "z", "w"}; const string suffixes[] = {"x", "y", "z", "w"};
stringstream localData; stringstream localData;
int localDataSize = 0; int localDataSize = 0;
for (int i = 0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
if (params[i].getNumComponents() == 1) if (param.getNumComponents() == 1)
localData<<params[i].getType()<<" "<<params[i].getName()<<";\n"; localData<<param.getType()<<" "<<param.getName()<<";\n";
else { else {
for (int j = 0; j < params[i].getNumComponents(); ++j) for (int j = 0; j < param.getNumComponents(); ++j)
localData<<params[i].getComponentType()<<" "<<params[i].getName()<<"_"<<suffixes[j]<<";\n"; localData<<param.getComponentType()<<" "<<param.getName()<<"_"<<suffixes[j]<<";\n";
} }
localDataSize += params[i].getSize(); localDataSize += param.getSize();
} }
replacements["ATOM_PARAMETER_DATA"] = localData.str(); replacements["ATOM_PARAMETER_DATA"] = localData.str();
stringstream args; stringstream args;
for (int i = 0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
args << ", "; args << ", ";
if (params[i].isConstant()) if (param.isConstant())
args << "const "; args << "const ";
args << params[i].getType(); args << param.getType();
args << "* __restrict__ global_"; args << "* __restrict__ global_";
args << params[i].getName(); args << param.getName();
} }
for (int i = 0; i < (int) arguments.size(); i++) { for (const ParameterInfo& arg : arguments) {
args << ", "; args << ", ";
if (arguments[i].isConstant()) if (arg.isConstant())
args << "const "; args << "const ";
args << arguments[i].getType(); args << arg.getType();
args << "* __restrict__ "; args << "* __restrict__ ";
args << arguments[i].getName(); args << arg.getName();
} }
if (energyParameterDerivatives.size() > 0) if (energyParameterDerivatives.size() > 0)
args << ", mixed* __restrict__ energyParamDerivs"; args << ", mixed* __restrict__ energyParamDerivs";
replacements["PARAMETER_ARGUMENTS"] = args.str(); replacements["PARAMETER_ARGUMENTS"] = args.str();
stringstream load1; stringstream load1;
for (int i = 0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
load1 << params[i].getType(); load1 << param.getType();
load1 << " "; load1 << " ";
load1 << params[i].getName(); load1 << param.getName();
load1 << "1 = global_"; load1 << "1 = global_";
load1 << params[i].getName(); load1 << param.getName();
load1 << "[atom1];\n"; load1 << "[atom1];\n";
} }
replacements["LOAD_ATOM1_PARAMETERS"] = load1.str(); replacements["LOAD_ATOM1_PARAMETERS"] = load1.str();
...@@ -564,34 +564,30 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source, ...@@ -564,34 +564,30 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
if(useShuffle) { if(useShuffle) {
// not needed if using shuffles as we can directly fetch from register // not needed if using shuffles as we can directly fetch from register
} else { } else {
for (int i = 0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
if (params[i].getNumComponents() == 1) { if (param.getNumComponents() == 1)
loadLocal1<<"localData[threadIdx.x]."<<params[i].getName()<<" = "<<params[i].getName()<<"1;\n"; loadLocal1<<"localData[LOCAL_ID]."<<param.getName()<<" = "<<param.getName()<<"1;\n";
}
else { else {
for (int j = 0; j < params[i].getNumComponents(); ++j) for (int j = 0; j < param.getNumComponents(); ++j)
loadLocal1<<"localData[threadIdx.x]."<<params[i].getName()<<"_"<<suffixes[j]<<" = "<<params[i].getName()<<"1."<<suffixes[j]<<";\n"; loadLocal1<<"localData[LOCAL_ID]."<<param.getName()<<"_"<<suffixes[j]<<" = "<<param.getName()<<"1."<<suffixes[j]<<";\n";
} }
} }
} }
replacements["LOAD_LOCAL_PARAMETERS_FROM_1"] = loadLocal1.str(); replacements["LOAD_LOCAL_PARAMETERS_FROM_1"] = loadLocal1.str();
stringstream broadcastWarpData; stringstream broadcastWarpData;
if(useShuffle) { if (useShuffle) {
broadcastWarpData << "posq2.x = real_shfl(shflPosq.x, j);\n"; broadcastWarpData << "posq2.x = real_shfl(shflPosq.x, j);\n";
broadcastWarpData << "posq2.y = real_shfl(shflPosq.y, j);\n"; broadcastWarpData << "posq2.y = real_shfl(shflPosq.y, j);\n";
broadcastWarpData << "posq2.z = real_shfl(shflPosq.z, j);\n"; broadcastWarpData << "posq2.z = real_shfl(shflPosq.z, j);\n";
broadcastWarpData << "posq2.w = real_shfl(shflPosq.w, j);\n"; broadcastWarpData << "posq2.w = real_shfl(shflPosq.w, j);\n";
for(int i=0; i< (int) params.size();i++) { for (const ParameterInfo& param : params) {
broadcastWarpData << params[i].getType() << " shfl" << params[i].getName() << ";\n"; broadcastWarpData << param.getType() << " shfl" << param.getName() << ";\n";
for(int j=0; j < params[i].getNumComponents(); j++) { for (int j = 0; j < param.getNumComponents(); j++) {
string name; if (param.getNumComponents() == 1)
if (params[i].getNumComponents() == 1) { broadcastWarpData << "shfl" << param.getName() << "=real_shfl(" << param.getName() <<"1,j);\n";
broadcastWarpData << "shfl" << params[i].getName() << "=real_shfl(" << params[i].getName() <<"1,j);\n"; else
broadcastWarpData << "shfl" << param.getName()+"."+suffixes[j] << "=real_shfl(" << param.getName()+"1."+suffixes[j] <<",j);\n";
} else {
broadcastWarpData << "shfl" << params[i].getName()+"."+suffixes[j] << "=real_shfl(" << params[i].getName()+"1."+suffixes[j] <<",j);\n";
}
} }
} }
} else { } else {
...@@ -601,49 +597,48 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source, ...@@ -601,49 +597,48 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
// Part 2. Defines for off-diagonal exclusions, and neighborlist tiles. // Part 2. Defines for off-diagonal exclusions, and neighborlist tiles.
stringstream declareLocal2; stringstream declareLocal2;
if(useShuffle) { if (useShuffle) {
for(int i=0; i< (int) params.size(); i++) { for (const ParameterInfo& param : params)
declareLocal2<<params[i].getType()<<" shfl"<<params[i].getName()<<";\n"; declareLocal2<<param.getType()<<" shfl"<<param.getName()<<";\n";
} }
} else { else {
// not used if using shared memory // not used if using shared memory
} }
replacements["DECLARE_LOCAL_PARAMETERS"] = declareLocal2.str(); replacements["DECLARE_LOCAL_PARAMETERS"] = declareLocal2.str();
stringstream loadLocal2; stringstream loadLocal2;
if(useShuffle) { if (useShuffle) {
for(int i=0; i< (int) params.size(); i++) { for (const ParameterInfo& param : params)
loadLocal2<<"shfl"<<params[i].getName()<<" = global_"<<params[i].getName()<<"[j];\n"; loadLocal2<<"shfl"<<param.getName()<<" = global_"<<param.getName()<<"[j];\n";
} }
} else { else {
for (int i = 0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
if (params[i].getNumComponents() == 1) { if (param.getNumComponents() == 1)
loadLocal2<<"localData[threadIdx.x]."<<params[i].getName()<<" = global_"<<params[i].getName()<<"[j];\n"; loadLocal2<<"localData[LOCAL_ID]."<<param.getName()<<" = global_"<<param.getName()<<"[j];\n";
}
else { else {
loadLocal2<<params[i].getType()<<" temp_"<<params[i].getName()<<" = global_"<<params[i].getName()<<"[j];\n"; loadLocal2<<param.getType()<<" temp_"<<param.getName()<<" = global_"<<param.getName()<<"[j];\n";
for (int j = 0; j < params[i].getNumComponents(); ++j) for (int j = 0; j < param.getNumComponents(); ++j)
loadLocal2<<"localData[threadIdx.x]."<<params[i].getName()<<"_"<<suffixes[j]<<" = temp_"<<params[i].getName()<<"."<<suffixes[j]<<";\n"; loadLocal2<<"localData[LOCAL_ID]."<<param.getName()<<"_"<<suffixes[j]<<" = temp_"<<param.getName()<<"."<<suffixes[j]<<";\n";
} }
} }
} }
replacements["LOAD_LOCAL_PARAMETERS_FROM_GLOBAL"] = loadLocal2.str(); replacements["LOAD_LOCAL_PARAMETERS_FROM_GLOBAL"] = loadLocal2.str();
stringstream load2j; stringstream load2j;
if(useShuffle) { if (useShuffle) {
for(int i = 0; i < (int) params.size(); i++) for (const ParameterInfo& param : params)
load2j<<params[i].getType()<<" "<<params[i].getName()<<"2 = shfl"<<params[i].getName()<<";\n"; load2j<<param.getType()<<" "<<param.getName()<<"2 = shfl"<<param.getName()<<";\n";
} else { }
for (int i = 0; i < (int) params.size(); i++) { else {
if (params[i].getNumComponents() == 1) { for (const ParameterInfo& param : params) {
load2j<<params[i].getType()<<" "<<params[i].getName()<<"2 = localData[atom2]."<<params[i].getName()<<";\n"; if (param.getNumComponents() == 1)
} load2j<<param.getType()<<" "<<param.getName()<<"2 = localData[atom2]."<<param.getName()<<";\n";
else { else {
load2j<<params[i].getType()<<" "<<params[i].getName()<<"2 = make_"<<params[i].getType()<<"("; load2j<<param.getType()<<" "<<param.getName()<<"2 = make_"<<param.getType()<<"(";
for (int j = 0; j < params[i].getNumComponents(); ++j) { for (int j = 0; j < param.getNumComponents(); ++j) {
if (j > 0) if (j > 0)
load2j<<", "; load2j<<", ";
load2j<<"localData[atom2]."<<params[i].getName()<<"_"<<suffixes[j]; load2j<<"localData[atom2]."<<param.getName()<<"_"<<suffixes[j];
} }
load2j<<");\n"; load2j<<");\n";
} }
...@@ -652,16 +647,16 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source, ...@@ -652,16 +647,16 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
replacements["LOAD_ATOM2_PARAMETERS"] = load2j.str(); replacements["LOAD_ATOM2_PARAMETERS"] = load2j.str();
stringstream clearLocal; stringstream clearLocal;
for (int i = 0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
if (useShuffle) if (useShuffle)
clearLocal<<"shfl"; clearLocal<<"shfl";
else else
clearLocal<<"localData[atom2]."; clearLocal<<"localData[atom2].";
clearLocal<<params[i].getName()<<" = "; clearLocal<<param.getName()<<" = ";
if (params[i].getNumComponents() == 1) if (param.getNumComponents() == 1)
clearLocal<<"0;\n"; clearLocal<<"0;\n";
else else
clearLocal<<"make_"<<params[i].getType()<<"(0);\n"; clearLocal<<"make_"<<param.getType()<<"(0);\n";
} }
replacements["CLEAR_LOCAL_PARAMETERS"] = clearLocal.str(); replacements["CLEAR_LOCAL_PARAMETERS"] = clearLocal.str();
...@@ -675,7 +670,7 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source, ...@@ -675,7 +670,7 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
for (int i = 0; i < energyParameterDerivatives.size(); i++) for (int i = 0; i < energyParameterDerivatives.size(); i++)
for (int index = 0; index < numDerivs; index++) for (int index = 0; index < numDerivs; index++)
if (allParamDerivNames[index] == energyParameterDerivatives[i]) if (allParamDerivNames[index] == energyParameterDerivatives[i])
saveDerivs<<"energyParamDerivs[(blockIdx.x*blockDim.x+threadIdx.x)*"<<numDerivs<<"+"<<index<<"] += energyParamDeriv"<<i<<";\n"; saveDerivs<<"energyParamDerivs[GLOBAL_ID*"<<numDerivs<<"+"<<index<<"] += energyParamDeriv"<<i<<";\n";
replacements["SAVE_DERIVATIVES"] = saveDerivs.str(); replacements["SAVE_DERIVATIVES"] = saveDerivs.str();
stringstream shuffleWarpData; stringstream shuffleWarpData;
...@@ -687,15 +682,15 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source, ...@@ -687,15 +682,15 @@ CUfunction CudaNonbondedUtilities::createInteractionKernel(const string& source,
shuffleWarpData << "shflForce.x = real_shfl(shflForce.x, tgx+1);\n"; shuffleWarpData << "shflForce.x = real_shfl(shflForce.x, tgx+1);\n";
shuffleWarpData << "shflForce.y = real_shfl(shflForce.y, tgx+1);\n"; shuffleWarpData << "shflForce.y = real_shfl(shflForce.y, tgx+1);\n";
shuffleWarpData << "shflForce.z = real_shfl(shflForce.z, tgx+1);\n"; shuffleWarpData << "shflForce.z = real_shfl(shflForce.z, tgx+1);\n";
for(int i=0; i < (int) params.size(); i++) { for (const ParameterInfo& param : params) {
if(params[i].getNumComponents() == 1) { if (param.getNumComponents() == 1)
shuffleWarpData<<"shfl"<<params[i].getName()<<"=real_shfl(shfl"<<params[i].getName()<<", tgx+1);\n"; shuffleWarpData<<"shfl"<<param.getName()<<"=real_shfl(shfl"<<param.getName()<<", tgx+1);\n";
} else { else {
for(int j=0;j<params[i].getNumComponents();j++) { for (int j = 0; j < param.getNumComponents(); j++) {
// looks something like shflsigmaEpsilon.x = real_shfl(shflsigmaEpsilon.x,tgx+1); // looks something like shflsigmaEpsilon.x = real_shfl(shflsigmaEpsilon.x,tgx+1);
shuffleWarpData<<"shfl"<<params[i].getName() shuffleWarpData<<"shfl"<<param.getName()
<<"."<<suffixes[j]<<"=real_shfl(shfl" <<"."<<suffixes[j]<<"=real_shfl(shfl"
<<params[i].getName()<<"."<<suffixes[j] <<param.getName()<<"."<<suffixes[j]
<<", tgx+1);\n"; <<", tgx+1);\n";
} }
} }
......
extern "C" __global__ void findAtomGridIndex(const real4* __restrict__ posq, int2* __restrict__ pmeAtomGridIndex,
real4 periodicBoxSize, real4 invPeriodicBoxSize, real4 periodicBoxVecX, real4 periodicBoxVecY, real4 periodicBoxVecZ,
real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ) {
// Compute the index of the grid point each atom is associated with.
for (int atom = blockIdx.x*blockDim.x+threadIdx.x; atom < NUM_ATOMS; atom += blockDim.x*gridDim.x) {
real4 pos = posq[atom];
APPLY_PERIODIC_TO_POS(pos)
real3 t = make_real3(pos.x*recipBoxVecX.x+pos.y*recipBoxVecY.x+pos.z*recipBoxVecZ.x,
pos.y*recipBoxVecY.y+pos.z*recipBoxVecZ.y,
pos.z*recipBoxVecZ.z);
t.x = (t.x-floor(t.x))*GRID_SIZE_X;
t.y = (t.y-floor(t.y))*GRID_SIZE_Y;
t.z = (t.z-floor(t.z))*GRID_SIZE_Z;
int3 gridIndex = make_int3(((int) t.x) % GRID_SIZE_X,
((int) t.y) % GRID_SIZE_Y,
((int) t.z) % GRID_SIZE_Z);
pmeAtomGridIndex[atom] = make_int2(atom, gridIndex.x*GRID_SIZE_Y*GRID_SIZE_Z+gridIndex.y*GRID_SIZE_Z+gridIndex.z);
}
}
extern "C" __global__ void gridSpreadCharge(const real4* __restrict__ posq, real* __restrict__ originalPmeGrid,
real4 periodicBoxSize, real4 invPeriodicBoxSize, real4 periodicBoxVecX, real4 periodicBoxVecY, real4 periodicBoxVecZ,
real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ, const int2* __restrict__ pmeAtomGridIndex
#ifdef CHARGE_FROM_SIGEPS
, const float2* __restrict__ sigmaEpsilon
#else
, const real* __restrict__ charges
#endif
) {
// To improve memory efficiency, we divide indices along the z axis into
// PME_ORDER blocks, where the data for each block is stored together. We
// can ensure that all threads write to the same block at the same time,
// which leads to better coalescing of writes.
__shared__ int zindexTable[GRID_SIZE_Z+PME_ORDER];
int blockSize = (int) ceil(GRID_SIZE_Z/(real) PME_ORDER);
for (int i = threadIdx.x; i < GRID_SIZE_Z+PME_ORDER; i += blockDim.x) {
int zindex = i % GRID_SIZE_Z;
int block = zindex % PME_ORDER;
zindexTable[i] = zindex/PME_ORDER + block*GRID_SIZE_X*GRID_SIZE_Y*blockSize;
}
__syncthreads();
// Process the atoms in spatially sorted order. This improves efficiency when writing
// the grid values.
real3 data[PME_ORDER];
const real scale = RECIP(PME_ORDER-1);
for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < NUM_ATOMS; i += blockDim.x*gridDim.x) {
int atom = pmeAtomGridIndex[i].x;
real4 pos = posq[atom];
#ifdef CHARGE_FROM_SIGEPS
const float2 sigEps = sigmaEpsilon[atom];
const real charge = 8*sigEps.x*sigEps.x*sigEps.x*sigEps.y;
#else
const real charge = (CHARGE)*EPSILON_FACTOR;
#endif
APPLY_PERIODIC_TO_POS(pos)
real3 t = make_real3(pos.x*recipBoxVecX.x+pos.y*recipBoxVecY.x+pos.z*recipBoxVecZ.x,
pos.y*recipBoxVecY.y+pos.z*recipBoxVecZ.y,
pos.z*recipBoxVecZ.z);
t.x = (t.x-floor(t.x))*GRID_SIZE_X;
t.y = (t.y-floor(t.y))*GRID_SIZE_Y;
t.z = (t.z-floor(t.z))*GRID_SIZE_Z;
int3 gridIndex = make_int3(((int) t.x) % GRID_SIZE_X,
((int) t.y) % GRID_SIZE_Y,
((int) t.z) % GRID_SIZE_Z);
if (charge == 0)
continue;
// Since we need the full set of thetas, it's faster to compute them here than load them
// from global memory.
real3 dr = make_real3(t.x-(int) t.x, t.y-(int) t.y, t.z-(int) t.z);
data[PME_ORDER-1] = make_real3(0);
data[1] = dr;
data[0] = make_real3(1)-dr;
for (int j = 3; j < PME_ORDER; j++) {
real div = RECIP(j-1);
data[j-1] = div*dr*data[j-2];
for (int k = 1; k < (j-1); k++)
data[j-k-1] = div*((dr+make_real3(k))*data[j-k-2] + (make_real3(j-k)-dr)*data[j-k-1]);
data[0] = div*(make_real3(1)-dr)*data[0];
}
data[PME_ORDER-1] = scale*dr*data[PME_ORDER-2];
for (int j = 1; j < (PME_ORDER-1); j++)
data[PME_ORDER-j-1] = scale*((dr+make_real3(j))*data[PME_ORDER-j-2] + (make_real3(PME_ORDER-j)-dr)*data[PME_ORDER-j-1]);
data[0] = scale*(make_real3(1)-dr)*data[0];
// Spread the charge from this atom onto each grid point.
int izoffset = (PME_ORDER-(gridIndex.z%PME_ORDER)) % PME_ORDER;
for (int ix = 0; ix < PME_ORDER; ix++) {
int xbase = gridIndex.x+ix;
xbase -= (xbase >= GRID_SIZE_X ? GRID_SIZE_X : 0);
xbase = xbase*GRID_SIZE_Y;
real dx = charge*data[ix].x;
for (int iy = 0; iy < PME_ORDER; iy++) {
int ybase = gridIndex.y+iy;
ybase -= (ybase >= GRID_SIZE_Y ? GRID_SIZE_Y : 0);
ybase = (xbase+ybase)*blockSize;
real dxdy = dx*data[iy].y;
for (int i = 0; i < PME_ORDER; i++) {
int iz = (i+izoffset) % PME_ORDER;
int zindex = gridIndex.z+iz;
int index = ybase + zindexTable[zindex];
real add = dxdy*data[iz].z;
#if defined(USE_DOUBLE_PRECISION) || defined(USE_DETERMINISTIC_FORCES)
unsigned long long * ulonglong_p = (unsigned long long *) originalPmeGrid;
atomicAdd(&ulonglong_p[index], static_cast<unsigned long long>((long long) (add*0x100000000)));
#else
atomicAdd(&originalPmeGrid[index], add);
#endif
}
}
}
}
}
extern "C" __global__ void finishSpreadCharge(
#if defined(USE_DOUBLE_PRECISION) || defined(USE_DETERMINISTIC_FORCES)
const long long* __restrict__ grid1,
#else
const real* __restrict__ grid1,
#endif
real* __restrict__ grid2) {
// During charge spreading, we shuffled the order of indices along the z
// axis to make memory access more efficient. We now need to unshuffle
// them. If the values were accumulated as fixed point, we also need to
// convert them to floating point.
__shared__ int zindexTable[GRID_SIZE_Z];
int blockSize = (int) ceil(GRID_SIZE_Z/(real) PME_ORDER);
for (int i = threadIdx.x; i < GRID_SIZE_Z; i += blockDim.x) {
int block = i % PME_ORDER;
zindexTable[i] = i/PME_ORDER + block*GRID_SIZE_X*GRID_SIZE_Y*blockSize;
}
__syncthreads();
const unsigned int gridSize = GRID_SIZE_X*GRID_SIZE_Y*GRID_SIZE_Z;
real scale = 1/(real) 0x100000000;
for (int index = blockIdx.x*blockDim.x+threadIdx.x; index < gridSize; index += blockDim.x*gridDim.x) {
int zindex = index%GRID_SIZE_Z;
int loadIndex = zindexTable[zindex] + blockSize*(int) (index/GRID_SIZE_Z);
#if defined(USE_DOUBLE_PRECISION) || defined(USE_DETERMINISTIC_FORCES)
grid2[index] = scale*grid1[loadIndex];
#else
grid2[index] = grid1[loadIndex];
#endif
}
}
// convolutes on the halfcomplex_pmeGrid, which is of size NX*NY*(NZ/2+1) as F(Q) is conjugate symmetric
extern "C" __global__ void
reciprocalConvolution(real2* __restrict__ halfcomplex_pmeGrid, mixed* __restrict__ energyBuffer,
const real* __restrict__ pmeBsplineModuliX, const real* __restrict__ pmeBsplineModuliY, const real* __restrict__ pmeBsplineModuliZ,
real4 periodicBoxSize, real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ) {
// R2C stores into a half complex matrix where the last dimension is cut by half
const unsigned int gridSize = GRID_SIZE_X*GRID_SIZE_Y*(GRID_SIZE_Z/2+1);
#ifdef USE_LJPME
const real recipScaleFactor = -2*M_PI*SQRT(M_PI)*RECIP(6*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
real bfac = M_PI / EWALD_ALPHA;
real fac1 = 2*M_PI*M_PI*M_PI*SQRT(M_PI);
real fac2 = EWALD_ALPHA*EWALD_ALPHA*EWALD_ALPHA;
real fac3 = -2*EWALD_ALPHA*M_PI*M_PI;
#else
const real recipScaleFactor = RECIP(M_PI*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
#endif
for (int index = blockIdx.x*blockDim.x+threadIdx.x; index < gridSize; index += blockDim.x*gridDim.x) {
// real indices
int kx = index/(GRID_SIZE_Y*(GRID_SIZE_Z/2+1));
int remainder = index-kx*GRID_SIZE_Y*(GRID_SIZE_Z/2+1);
int ky = remainder/(GRID_SIZE_Z/2+1);
int kz = remainder-ky*(GRID_SIZE_Z/2+1);
int mx = (kx < (GRID_SIZE_X+1)/2) ? kx : (kx-GRID_SIZE_X);
int my = (ky < (GRID_SIZE_Y+1)/2) ? ky : (ky-GRID_SIZE_Y);
int mz = (kz < (GRID_SIZE_Z+1)/2) ? kz : (kz-GRID_SIZE_Z);
real mhx = mx*recipBoxVecX.x;
real mhy = mx*recipBoxVecY.x+my*recipBoxVecY.y;
real mhz = mx*recipBoxVecZ.x+my*recipBoxVecZ.y+mz*recipBoxVecZ.z;
real bx = pmeBsplineModuliX[kx];
real by = pmeBsplineModuliY[ky];
real bz = pmeBsplineModuliZ[kz];
real2 grid = halfcomplex_pmeGrid[index];
real m2 = mhx*mhx+mhy*mhy+mhz*mhz;
#ifdef USE_LJPME
real denom = recipScaleFactor/(bx*by*bz);
real m = SQRT(m2);
real m3 = m*m2;
real b = bfac*m;
real expfac = -b*b;
real expterm = EXP(expfac);
real erfcterm = ERFC(b);
real eterm = (fac1*erfcterm*m3 + expterm*(fac2 + fac3*m2)) * denom;
halfcomplex_pmeGrid[index] = make_real2(grid.x*eterm, grid.y*eterm);
#else
real denom = m2*bx*by*bz;
real eterm = recipScaleFactor*EXP(-RECIP_EXP_FACTOR*m2)/denom;
if (kx != 0 || ky != 0 || kz != 0) {
halfcomplex_pmeGrid[index] = make_real2(grid.x*eterm, grid.y*eterm);
}
#endif
}
}
extern "C" __global__ void
gridEvaluateEnergy(real2* __restrict__ halfcomplex_pmeGrid, mixed* __restrict__ energyBuffer,
const real* __restrict__ pmeBsplineModuliX, const real* __restrict__ pmeBsplineModuliY, const real* __restrict__ pmeBsplineModuliZ,
real4 periodicBoxSize, real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ) {
// R2C stores into a half complex matrix where the last dimension is cut by half
const unsigned int gridSize = GRID_SIZE_X*GRID_SIZE_Y*GRID_SIZE_Z;
#ifdef USE_LJPME
const real recipScaleFactor = -2*M_PI*SQRT(M_PI)*RECIP(6*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
real bfac = M_PI / EWALD_ALPHA;
real fac1 = 2*M_PI*M_PI*M_PI*SQRT(M_PI);
real fac2 = EWALD_ALPHA*EWALD_ALPHA*EWALD_ALPHA;
real fac3 = -2*EWALD_ALPHA*M_PI*M_PI;
#else
const real recipScaleFactor = RECIP(M_PI*periodicBoxSize.x*periodicBoxSize.y*periodicBoxSize.z);
#endif
mixed energy = 0;
for (int index = blockIdx.x*blockDim.x+threadIdx.x; index < gridSize; index += blockDim.x*gridDim.x) {
// real indices
int kx = index/(GRID_SIZE_Y*(GRID_SIZE_Z));
int remainder = index-kx*GRID_SIZE_Y*(GRID_SIZE_Z);
int ky = remainder/(GRID_SIZE_Z);
int kz = remainder-ky*(GRID_SIZE_Z);
int mx = (kx < (GRID_SIZE_X+1)/2) ? kx : (kx-GRID_SIZE_X);
int my = (ky < (GRID_SIZE_Y+1)/2) ? ky : (ky-GRID_SIZE_Y);
int mz = (kz < (GRID_SIZE_Z+1)/2) ? kz : (kz-GRID_SIZE_Z);
real mhx = mx*recipBoxVecX.x;
real mhy = mx*recipBoxVecY.x+my*recipBoxVecY.y;
real mhz = mx*recipBoxVecZ.x+my*recipBoxVecZ.y+mz*recipBoxVecZ.z;
real m2 = mhx*mhx+mhy*mhy+mhz*mhz;
real bx = pmeBsplineModuliX[kx];
real by = pmeBsplineModuliY[ky];
real bz = pmeBsplineModuliZ[kz];
#ifdef USE_LJPME
real denom = recipScaleFactor/(bx*by*bz);
real m = SQRT(m2);
real m3 = m*m2;
real b = bfac*m;
real expfac = -b*b;
real expterm = EXP(expfac);
real erfcterm = ERFC(b);
real eterm = (fac1*erfcterm*m3 + expterm*(fac2 + fac3*m2)) * denom;
#else
real denom = m2*bx*by*bz;
real eterm = recipScaleFactor*EXP(-RECIP_EXP_FACTOR*m2)/denom;
#endif
if (kz >= (GRID_SIZE_Z/2+1)) {
kx = ((kx == 0) ? kx : GRID_SIZE_X-kx);
ky = ((ky == 0) ? ky : GRID_SIZE_Y-ky);
kz = GRID_SIZE_Z-kz;
}
int indexInHalfComplexGrid = kz + ky*(GRID_SIZE_Z/2+1)+kx*(GRID_SIZE_Y*(GRID_SIZE_Z/2+1));
real2 grid = halfcomplex_pmeGrid[indexInHalfComplexGrid];
#ifndef USE_LJPME
if (kx != 0 || ky != 0 || kz != 0)
#endif
energy += eterm*(grid.x*grid.x + grid.y*grid.y);
}
#if defined(USE_PME_STREAM) && !defined(USE_LJPME)
energyBuffer[blockIdx.x*blockDim.x+threadIdx.x] = 0.5f*energy;
#else
energyBuffer[blockIdx.x*blockDim.x+threadIdx.x] += 0.5f*energy;
#endif
}
extern "C" __global__
void gridInterpolateForce(const real4* __restrict__ posq, unsigned long long* __restrict__ forceBuffers, const real* __restrict__ originalPmeGrid,
real4 periodicBoxSize, real4 invPeriodicBoxSize, real4 periodicBoxVecX, real4 periodicBoxVecY, real4 periodicBoxVecZ,
real3 recipBoxVecX, real3 recipBoxVecY, real3 recipBoxVecZ, const int2* __restrict__ pmeAtomGridIndex
#ifdef CHARGE_FROM_SIGEPS
, const float2* __restrict__ sigmaEpsilon
#else
, const real* __restrict__ charges
#endif
) {
real3 data[PME_ORDER];
real3 ddata[PME_ORDER];
const real scale = RECIP(PME_ORDER-1);
// Process the atoms in spatially sorted order. This improves cache performance when loading
// the grid values.
for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < NUM_ATOMS; i += blockDim.x*gridDim.x) {
int atom = pmeAtomGridIndex[i].x;
real3 force = make_real3(0);
real4 pos = posq[atom];
APPLY_PERIODIC_TO_POS(pos)
real3 t = make_real3(pos.x*recipBoxVecX.x+pos.y*recipBoxVecY.x+pos.z*recipBoxVecZ.x,
pos.y*recipBoxVecY.y+pos.z*recipBoxVecZ.y,
pos.z*recipBoxVecZ.z);
t.x = (t.x-floor(t.x))*GRID_SIZE_X;
t.y = (t.y-floor(t.y))*GRID_SIZE_Y;
t.z = (t.z-floor(t.z))*GRID_SIZE_Z;
int3 gridIndex = make_int3(((int) t.x) % GRID_SIZE_X,
((int) t.y) % GRID_SIZE_Y,
((int) t.z) % GRID_SIZE_Z);
// Since we need the full set of thetas, it's faster to compute them here than load them
// from global memory.
real3 dr = make_real3(t.x-(int) t.x, t.y-(int) t.y, t.z-(int) t.z);
data[PME_ORDER-1] = make_real3(0);
data[1] = dr;
data[0] = make_real3(1)-dr;
for (int j = 3; j < PME_ORDER; j++) {
real div = RECIP(j-1);
data[j-1] = div*dr*data[j-2];
for (int k = 1; k < (j-1); k++)
data[j-k-1] = div*((dr+make_real3(k))*data[j-k-2] + (make_real3(j-k)-dr)*data[j-k-1]);
data[0] = div*(make_real3(1)-dr)*data[0];
}
ddata[0] = -data[0];
for (int j = 1; j < PME_ORDER; j++)
ddata[j] = data[j-1]-data[j];
data[PME_ORDER-1] = scale*dr*data[PME_ORDER-2];
for (int j = 1; j < (PME_ORDER-1); j++)
data[PME_ORDER-j-1] = scale*((dr+make_real3(j))*data[PME_ORDER-j-2] + (make_real3(PME_ORDER-j)-dr)*data[PME_ORDER-j-1]);
data[0] = scale*(make_real3(1)-dr)*data[0];
// Compute the force on this atom.
for (int ix = 0; ix < PME_ORDER; ix++) {
int xbase = gridIndex.x+ix;
xbase -= (xbase >= GRID_SIZE_X ? GRID_SIZE_X : 0);
xbase = xbase*GRID_SIZE_Y*GRID_SIZE_Z;
real dx = data[ix].x;
real ddx = ddata[ix].x;
for (int iy = 0; iy < PME_ORDER; iy++) {
int ybase = gridIndex.y+iy;
ybase -= (ybase >= GRID_SIZE_Y ? GRID_SIZE_Y : 0);
ybase = xbase + ybase*GRID_SIZE_Z;
real dy = data[iy].y;
real ddy = ddata[iy].y;
for (int iz = 0; iz < PME_ORDER; iz++) {
int zindex = gridIndex.z+iz;
zindex -= (zindex >= GRID_SIZE_Z ? GRID_SIZE_Z : 0);
int index = ybase + zindex;
real gridvalue = originalPmeGrid[index];
force.x += ddx*dy*data[iz].z*gridvalue;
force.y += dx*ddy*data[iz].z*gridvalue;
force.z += dx*dy*ddata[iz].z*gridvalue;
}
}
}
#ifdef CHARGE_FROM_SIGEPS
const float2 sigEps = sigmaEpsilon[atom];
real q = 8*sigEps.x*sigEps.x*sigEps.x*sigEps.y;
#else
real q = CHARGE*EPSILON_FACTOR;
#endif
real forceX = -q*(force.x*GRID_SIZE_X*recipBoxVecX.x);
real forceY = -q*(force.x*GRID_SIZE_X*recipBoxVecY.x+force.y*GRID_SIZE_Y*recipBoxVecY.y);
real forceZ = -q*(force.x*GRID_SIZE_X*recipBoxVecZ.x+force.y*GRID_SIZE_Y*recipBoxVecZ.y+force.z*GRID_SIZE_Z*recipBoxVecZ.z);
atomicAdd(&forceBuffers[atom], static_cast<unsigned long long>((long long) (forceX*0x100000000)));
atomicAdd(&forceBuffers[atom+PADDED_NUM_ATOMS], static_cast<unsigned long long>((long long) (forceY*0x100000000)));
atomicAdd(&forceBuffers[atom+2*PADDED_NUM_ATOMS], static_cast<unsigned long long>((long long) (forceZ*0x100000000)));
}
}
extern "C" __global__
void addForces(const real4* __restrict__ forces, unsigned long long* __restrict__ forceBuffers) {
for (int atom = blockIdx.x*blockDim.x+threadIdx.x; atom < NUM_ATOMS; atom += blockDim.x*gridDim.x) {
real4 f = forces[atom];
forceBuffers[atom] += static_cast<unsigned long long>((long long) (f.x*0x100000000));
forceBuffers[atom+PADDED_NUM_ATOMS] += static_cast<unsigned long long>((long long) (f.y*0x100000000));
forceBuffers[atom+2*PADDED_NUM_ATOMS] += static_cast<unsigned long long>((long long) (f.z*0x100000000));
}
}
extern "C" __global__
void addEnergy(const mixed* __restrict__ pmeEnergyBuffer, mixed* __restrict__ energyBuffer, int bufferSize) {
for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < bufferSize; i += blockDim.x*gridDim.x)
energyBuffer[i] += pmeEnergyBuffer[i];
}
...@@ -328,6 +328,13 @@ public: ...@@ -328,6 +328,13 @@ public:
* @param blockSize the size of each thread block to use * @param blockSize the size of each thread block to use
*/ */
void executeKernel(cl::Kernel& kernel, int workUnits, int blockSize = -1); void executeKernel(cl::Kernel& kernel, int workUnits, int blockSize = -1);
/**
* Compute the largest thread block size that can be used for a kernel that requires a particular amount of
* shared memory per thread.
*
* @param memory the number of bytes of shared memory per thread
*/
int computeThreadBlockSize(double memory) const;
/** /**
* Set all elements of an array to 0. * Set all elements of an array to 0.
*/ */
...@@ -604,6 +611,15 @@ public: ...@@ -604,6 +611,15 @@ public:
OpenCLNonbondedUtilities& getNonbondedUtilities() { OpenCLNonbondedUtilities& getNonbondedUtilities() {
return *nonbonded; return *nonbonded;
} }
/**
* Create a new NonbondedUtilities for use with this context. This should be called
* only in unusual situations, when a Force needs its own NonbondedUtilities object
* separate from the standard one. The caller is responsible for deleting the object
* when it is no longer needed.
*/
OpenCLNonbondedUtilities* createNonbondedUtilities() {
return new OpenCLNonbondedUtilities(*this);
}
/** /**
* This should be called by the Integrator from its own initialize() method. * This should be called by the Integrator from its own initialize() method.
* It ensures all contexts are fully initialized. * It ensures all contexts are fully initialized.
......
...@@ -51,6 +51,10 @@ public: ...@@ -51,6 +51,10 @@ public:
* Get the name of this kernel. * Get the name of this kernel.
*/ */
std::string getName() const; std::string getName() const;
/**
* Get the maximum block size that can be used when executing this kernel.
*/
int getMaxBlockSize() const;
/** /**
* Execute this kernel. * Execute this kernel.
* *
......
...@@ -288,8 +288,8 @@ private: ...@@ -288,8 +288,8 @@ private:
cl::Kernel pmeDispersionAtomRangeKernel; cl::Kernel pmeDispersionAtomRangeKernel;
cl::Kernel pmeZIndexKernel; cl::Kernel pmeZIndexKernel;
cl::Kernel pmeDispersionZIndexKernel; cl::Kernel pmeDispersionZIndexKernel;
cl::Kernel pmeUpdateBsplinesKernel; cl::Kernel pmeGridIndexKernel;
cl::Kernel pmeDispersionUpdateBsplinesKernel; cl::Kernel pmeDispersionGridIndexKernel;
cl::Kernel pmeSpreadChargeKernel; cl::Kernel pmeSpreadChargeKernel;
cl::Kernel pmeDispersionSpreadChargeKernel; cl::Kernel pmeDispersionSpreadChargeKernel;
cl::Kernel pmeFinishSpreadChargeKernel; cl::Kernel pmeFinishSpreadChargeKernel;
......
...@@ -296,6 +296,10 @@ public: ...@@ -296,6 +296,10 @@ public:
* @param groups the set of force groups * @param groups the set of force groups
*/ */
void createKernelsForGroups(int groups); void createKernelsForGroups(int groups);
/**
* Set the source code for the main kernel. It only needs to be changed in very unusual circumstances.
*/
void setKernelSource(const std::string& source);
private: private:
class KernelSet; class KernelSet;
class BlockSortTrait; class BlockSortTrait;
...@@ -330,6 +334,7 @@ private: ...@@ -330,6 +334,7 @@ private:
int numForceBuffers, startTileIndex, startBlockIndex, numBlocks, maxExclusions, numForceThreadBlocks; int numForceBuffers, startTileIndex, startBlockIndex, numBlocks, maxExclusions, numForceThreadBlocks;
int forceThreadBlockSize, interactingBlocksThreadBlockSize, groupFlags; int forceThreadBlockSize, interactingBlocksThreadBlockSize, groupFlags;
long long numTiles; long long numTiles;
std::string kernelSource;
}; };
/** /**
...@@ -362,9 +367,10 @@ public: ...@@ -362,9 +367,10 @@ public:
* @param numComponents the number of components in the parameter * @param numComponents the number of components in the parameter
* @param size the size of the parameter in bytes * @param size the size of the parameter in bytes
* @param memory the memory containing the parameter values * @param memory the memory containing the parameter values
* @param constant whether the memory should be marked as constant
*/ */
ParameterInfo(const std::string& name, const std::string& componentType, int numComponents, int size, cl::Memory& memory) : ParameterInfo(const std::string& name, const std::string& componentType, int numComponents, int size, cl::Memory& memory, bool constant=true) :
name(name), componentType(componentType), numComponents(numComponents), size(size), memory(&memory) { name(name), componentType(componentType), numComponents(numComponents), size(size), memory(&memory), constant(constant) {
if (numComponents == 1) if (numComponents == 1)
type = componentType; type = componentType;
else { else {
...@@ -391,12 +397,16 @@ public: ...@@ -391,12 +397,16 @@ public:
cl::Memory& getMemory() const { cl::Memory& getMemory() const {
return *memory; return *memory;
} }
bool isConstant() const {
return constant;
}
private: private:
std::string name; std::string name;
std::string componentType; std::string componentType;
std::string type; std::string type;
int size, numComponents; int size, numComponents;
cl::Memory* memory; cl::Memory* memory;
bool constant;
}; };
} // namespace OpenMM } // namespace OpenMM
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment