- 21 Jul, 2025 16 commits
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - apaaris authored
  - Thorsten Kurth authored: Tkurth/device fixes
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
- 17 Jul, 2025 1 commit
  - Thorsten Kurth authored: Attention Backward improvement
- 16 Jul, 2025 9 commits
  - Mauro Bisson authored: Renamed the template parameter to a simpler name (it's the number of warps per tile used in the permutation).
  - Andrea Paris authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored
  - Mauro Bisson authored
  - Mauro Bisson authored
  - Mauro Bisson authored
  - Mauro Bisson authored
  - Mauro Bisson authored:
    * Replaced PyTorch's slow permutation.
    * Split kernel into general and specialized versions (for num_channel <= 8192).
    * Enabled float4-based vectorized memory access, when possible (see the sketch below).
    * Added runtime dispatch logic for kernel specialization.
    * Aligned attention_fwd_cuda.cu with attention_bwd_cuda.cu in terms of naming conventions and kernel parameters.
    * Extracted shared host/device functions and declarations into a separate module:
      * attention_utils.cuh
      * attention_utils.cu
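The float4 vectorization and runtime dispatch mentioned in the commit above follow a common CUDA pattern. The sketch below is illustrative only, not the attention_fwd_cuda.cu code: the (B, H, N, C) to (B, N, H, C) permutation, the kernel names, and the launch shape are assumptions, and the point is only that the vectorized path is taken when the channel count is divisible by 4 and the base pointers are 16-byte aligned.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Scalar fallback: permute a contiguous (B, H, N, C) tensor to (B, N, H, C).
// Each block copies one row of C contiguous floats.
__global__ void permute_bhnc_to_bnhc_scalar(const float* __restrict__ in,
                                            float* __restrict__ out,
                                            int H, int N, int64_t C) {
    const int b = blockIdx.z, h = blockIdx.y, n = blockIdx.x;
    const float* src = in  + (((int64_t)b * H + h) * N + n) * C;
    float*       dst = out + (((int64_t)b * N + n) * H + h) * C;
    for (int64_t c = threadIdx.x; c < C; c += blockDim.x)
        dst[c] = src[c];
}

// Vectorized variant: requires C % 4 == 0 and 16-byte-aligned base pointers
// (cudaMalloc guarantees far stricter alignment), so every row starts on a
// float4 boundary and each thread moves 16 bytes per transaction.
__global__ void permute_bhnc_to_bnhc_vec4(const float4* __restrict__ in,
                                          float4* __restrict__ out,
                                          int H, int N, int64_t C4) {
    const int b = blockIdx.z, h = blockIdx.y, n = blockIdx.x;
    const float4* src = in  + (((int64_t)b * H + h) * N + n) * C4;
    float4*       dst = out + (((int64_t)b * N + n) * H + h) * C4;
    for (int64_t c = threadIdx.x; c < C4; c += blockDim.x)
        dst[c] = src[c];
}

// Host-side runtime dispatch: use the vectorized kernel only when channel
// count and pointer alignment allow it, otherwise fall back to the scalar one.
void permute_bhnc_to_bnhc(const float* in, float* out,
                          int B, int H, int N, int64_t C, cudaStream_t stream) {
    dim3 grid(N, H, B);
    const int block = 128;
    const bool aligned16 = (reinterpret_cast<uintptr_t>(in)  % 16 == 0) &&
                           (reinterpret_cast<uintptr_t>(out) % 16 == 0);
    if (aligned16 && (C % 4 == 0)) {
        permute_bhnc_to_bnhc_vec4<<<grid, block, 0, stream>>>(
            reinterpret_cast<const float4*>(in),
            reinterpret_cast<float4*>(out), H, N, C / 4);
    } else {
        permute_bhnc_to_bnhc_scalar<<<grid, block, 0, stream>>>(in, out, H, N, C);
    }
}
```

Replacing a generic permute-plus-contiguous call with a dedicated copy kernel like this keeps the inner loop a plain contiguous memcpy, which is presumably where the speedup over PyTorch's permutation comes from.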
- 14 Jul, 2025 1 commit
  - Thorsten Kurth authored:
    * removing duplicate code from distributed convolution
    * replacing from_numpy with as_tensor
    * making preprocess_psi_tensor GPU-ready
- 08 Jul, 2025 1 commit
  - Thorsten Kurth authored:
    * refactoring disco backend code
    * removed get_psi as a member function and instead put it in _disco_convolution
    * setting seeds in tests more consistently
    * parametrized test classes to ensure that tests are always run on both CPU and GPU (if available)
    * cleaning up
- 07 Jul, 2025 2 commits
  - Thorsten Kurth authored
  - Thorsten Kurth authored: Use 64-bit for pointer offsets
- 04 Jul, 2025 2 commits
  - Mauro Bisson authored: Updated pointer offset calculations to use 64-bit integers to prevent overflow with large batch or image sizes.
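For context on the commit above: the failure mode is a 32-bit offset product wrapping around once a tensor exceeds 2^31 elements. The kernel below is a hypothetical example (add_bias is not part of the library); it only shows where the promotion to int64_t has to happen.

```cuda
#include <cstdint>

// Hypothetical kernel adding a per-channel bias to a (batch, channels, height*width)
// tensor. The offset arithmetic is where 32-bit math breaks: for batch=64,
// channels=256, hw=1024*512 the element count is about 8.6e9 > 2^31, so an
// int offset silently wraps around and indexes the wrong memory.
__global__ void add_bias(float* __restrict__ x, const float* __restrict__ bias,
                         int channels, int64_t hw) {
    const int b = blockIdx.y;   // batch index
    const int c = blockIdx.x;   // channel index

    // Overflow-prone pattern being removed:
    //   float* row = x + (b * channels + c) * (int)hw;   // 32-bit product

    // Fixed pattern: promote to 64 bits *before* multiplying.
    float* row = x + ((int64_t)b * channels + c) * hw;

    const float bval = bias[c];
    for (int64_t i = threadIdx.x; i < hw; i += blockDim.x)
        row[i] += bval;
}
```

The cast has to happen before the multiplications; casting the already-overflowed 32-bit product to int64_t afterwards would not help.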
  - Thorsten Kurth authored: using torch tools to change layout in the bwd pass
- 03 Jul, 2025 4 commits
  - Max Rietmann authored
  - Max Rietmann authored
  - Thorsten Kurth authored
  - Thorsten Kurth authored: Optimized forward kernel for attention
- 02 Jul, 2025 4 commits
  - Mauro Bisson authored
  - Mauro Bisson authored:
    * Added a new CSR array, psi_row_index, containing "ho" values sorted in descending order of CSR row length; this is used to process (ho, wo) points corresponding to longer rows before shorter ones, improving overlap and reducing the tail effect (see the sketch below).
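The row reordering in the commit above can be sketched with a generic warp-per-row CSR kernel. psi_row_index is the array named in the commit message; build_psi_row_index, spmv_reordered, and the warp-per-row mapping are assumptions made for this illustration, not the library's actual disco/attention kernels.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Host side: sort row ids by descending row length so that long rows are
// scheduled first and short rows fill in at the end, shrinking the tail
// phase where only a few blocks are still running.
std::vector<int> build_psi_row_index(const std::vector<int64_t>& row_ptr) {
    const int nrows = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> psi_row_index(nrows);
    std::iota(psi_row_index.begin(), psi_row_index.end(), 0);
    std::stable_sort(psi_row_index.begin(), psi_row_index.end(),
                     [&](int a, int b) {
                         return (row_ptr[a + 1] - row_ptr[a]) >
                                (row_ptr[b + 1] - row_ptr[b]);   // longest first
                     });
    return psi_row_index;
}

// Device side: one warp per CSR row, but the row each warp works on is taken
// through psi_row_index instead of using the warp id directly.
__global__ void spmv_reordered(const int* __restrict__ psi_row_index,
                               const int64_t* __restrict__ row_ptr,
                               const int* __restrict__ col_idx,
                               const float* __restrict__ vals,
                               const float* __restrict__ x,
                               float* __restrict__ y, int nrows) {
    const int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane = threadIdx.x % 32;
    if (warp >= nrows) return;
    const int row = psi_row_index[warp];            // longest rows start first
    float acc = 0.f;
    for (int64_t j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
        acc += vals[j] * x[col_idx[j]];
    for (int off = 16; off > 0; off >>= 1)          // warp-level reduction
        acc += __shfl_down_sync(0xffffffffu, acc, off);
    if (lane == 0) y[row] = acc;
}
```

Only the visiting order changes, not the warp-to-row mapping itself, so the result is identical to the unsorted version; the benefit is better overlap between long and short rows near the end of the launch.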
  - Mauro Bisson authored:
    * Replaced PyTorch's slow permutation ops with custom kernels, significantly improving performance (especially on GB200).
    * Split kernel into general and specialized versions for num_channel <= 16384, significantly reducing memory accesses (see the dispatch sketch below).
    * Enabled float4-based vectorized memory access when pointer alignment and channel size allow, improving throughput.
    * Added runtime dispatch logic for kernel specialization.
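A float4 sketch already appears after the 16 Jul section, so the example here focuses on the other half of the commit above: a general/specialized kernel split selected at run time. The 16384 threshold is quoted from the commit message; the row-scaling kernels, their names, and the one-block-per-row mapping are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

constexpr int64_t kSpecializedMaxChan = 16384;   // threshold quoted in the commit message

// General version: grid-stride loop over all elements, valid for any num_channel.
__global__ void scale_rows_general(float* __restrict__ data, int64_t nrows,
                                   int64_t nchan, float alpha) {
    const int64_t total = nrows * nchan;
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < total; i += (int64_t)gridDim.x * blockDim.x)
        data[i] *= alpha;
}

// Specialized version: one block per row, reasonable only when a row
// (num_channel elements) is short enough for a single block to cover efficiently.
__global__ void scale_rows_specialized(float* __restrict__ data, int64_t nchan,
                                       float alpha) {
    float* row = data + (int64_t)blockIdx.x * nchan;
    for (int64_t c = threadIdx.x; c < nchan; c += blockDim.x)
        row[c] *= alpha;
}

// Runtime dispatch: num_channel is only known when the op is called, so the
// choice between the two kernels cannot be made at compile time.
void scale_rows(float* data, int64_t nrows, int64_t nchan, float alpha,
                cudaStream_t stream) {
    const int block = 256;
    if (nchan <= kSpecializedMaxChan) {
        scale_rows_specialized<<<static_cast<unsigned int>(nrows), block, 0, stream>>>(
            data, nchan, alpha);
    } else {
        scale_rows_general<<<512, block, 0, stream>>>(data, nrows, nchan, alpha);
    }
}
```

In the real kernels the specialized variant presumably exploits the bounded channel count more aggressively (the commit cites reduced memory accesses); the sketch only shows the dispatch mechanics.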