"3rdparty/common-r22.12/src/model_config.cc" did not exist on "e38ee081a0495769e25766b894abe19bc8a6209e"
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation.
* Split kernel into general and specialized versions (for num_channel <= 8192).
* Enabled float4-based vectorized memory access, when possible.
* Added runtime dispatch logic for kernel specialization.

Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.

Extracted shared host/device functions and declarations into a separate module:
* attention_utils.cuh
* attention_utils.cu
9a463332
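
The runtime dispatch described in the commit message could look roughly like the host-side sketch below. It is not the code from this commit: the names `KernelPath`, `can_use_float4`, and `select_kernel_path` are hypothetical, and the vectorization condition (16-byte aligned base pointer, channel count divisible by 4, contiguous channel dimension) is an assumption based on the general alignment requirement of `float4` in CUDA. Only the 8192-channel threshold comes from the message itself.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical labels for the four kernel variants; the real kernel
// names in attention_fwd_cuda.cu / attention_bwd_cuda.cu are not shown
// in the commit message.
enum class KernelPath {
    SpecializedVec4,    // num_channel <= 8192, float4 loads/stores
    SpecializedScalar,  // num_channel <= 8192, scalar loads/stores
    GeneralVec4,        // any num_channel, float4 loads/stores
    GeneralScalar       // any num_channel, scalar loads/stores
};

// Threshold stated in the commit message.
constexpr int kMaxSpecializedChannels = 8192;

// Assumption: float4 access is safe when the base pointer is 16-byte
// aligned and the (contiguous) channel count is a multiple of 4, so that
// every row also starts on a 16-byte boundary.
inline bool can_use_float4(const void* ptr, int num_channel) {
    const bool aligned = reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0;
    return aligned && (num_channel % 4 == 0);
}

// Pick a kernel variant at runtime from the tensor's properties.
inline KernelPath select_kernel_path(const void* data, int num_channel) {
    const bool vectorized = can_use_float4(data, num_channel);
    if (num_channel <= kMaxSpecializedChannels) {
        return vectorized ? KernelPath::SpecializedVec4
                          : KernelPath::SpecializedScalar;
    }
    return vectorized ? KernelPath::GeneralVec4 : KernelPath::GeneralScalar;
}
```

In a real launcher, each `KernelPath` value would map to a different kernel launch; keeping the selection in a plain host function makes the dispatch rule easy to unit-test without a GPU.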