- 21 Jul, 2025 28 commits
-
-
Andrea Paris authored
-
Andrea Paris authored
-
Andrea Paris authored
-
Andrea Paris authored
-
root authored
-
Andrea Paris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
apaaris authored
-
Thorsten Kurth authored
Tkurth/device fixes
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
- 17 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
Attention Backward improvement
-
- 16 Jul, 2025 9 commits
-
-
Mauro Bisson authored
Renamed the template parameter to a simpler name (it's the number of warps per tile used in the permutation).
-
Andrea Paris authored
-
Thorsten Kurth authored
-
Thorsten Kurth authored
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
-
Mauro Bisson authored
* Replaced PyTorch's slow permutation.
* Split kernel into general and specialized versions (for num_channel <= 8192).
* Enabled float4-based vectorized memory access, when possible.
* Added runtime dispatch logic for kernel specialization.

Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.

Extracted shared host/device functions and declarations into a separate module:
* attention_utils.cuh
* attention_utils.cu
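A rough sketch of what the runtime dispatch described in this commit could look like, written in Python for readability only. The 8192-channel threshold comes from the message above. Everything else is an assumption made for illustration: the names `choose_kernel` and `KernelChoice` are hypothetical, and the float4 precondition (channel count divisible by 4 and a 16-byte-aligned base address) is a guess at what "when possible" means, not the repository's actual check.

```python
from dataclasses import dataclass

MAX_SPECIALIZED_CHANNELS = 8192  # threshold quoted in the commit message
FLOAT4_BYTES = 16                # a CUDA float4 holds four 32-bit floats


@dataclass(frozen=True)
class KernelChoice:
    specialized: bool  # True: specialized kernel, False: general kernel
    vector_width: int  # 4: float4 loads/stores, 1: scalar float access


def choose_kernel(num_channels: int, base_address: int) -> KernelChoice:
    """Hypothetical host-side dispatch; not the repository's actual logic."""
    if num_channels <= 0:
        raise ValueError("num_channels must be positive")

    # Small channel counts go to the specialized kernel, the rest to the general one.
    specialized = num_channels <= MAX_SPECIALIZED_CHANNELS

    # Assumed precondition for float4 access: the channel count is a multiple
    # of 4 and the buffer's base address is 16-byte aligned.
    can_vectorize = num_channels % 4 == 0 and base_address % FLOAT4_BYTES == 0

    return KernelChoice(specialized, 4 if can_vectorize else 1)
```

Under these assumptions, `choose_kernel(256, 0)` selects the specialized kernel with float4 access, while `choose_kernel(30, 0)` falls back to scalar access because 30 is not a multiple of 4.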
-
- 14 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
* removing duplicate code from distributed convolution
* replacing from_numpy with as_tensor
* making preprocess_psi_tensor GPU ready
-
- 08 Jul, 2025 1 commit
-
-
Thorsten Kurth authored
* refactoring disco backend code
* removed get_psi as member function and instead put it in _disco_convolution
* setting seeds in tests more consistently
* parametrized test classes to ensure that tests are always run on both CPU and GPU (if available)
* cleaning up
-