- 08 Jun, 2026 1 commit
-
-
one authored
Limit the DTK SIMT MaskImplicitGemm forward descriptor set for the MV2DFusion VVM/SubM fp32 shape family on sm_93. For kv=27, input channels=128, and output channels=64, keep only the SIMT descriptor validated by dense-oracle replay. Apply the same selection rule in both the Python tuner path and generated C++ tuner path to keep build-time and runtime behavior aligned.
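A minimal sketch of the selection rule described above, assuming a shared predicate that both tuner paths could mirror. `ShapeKey`, `select_descs`, and the descriptor name `simt_validated` are hypothetical placeholders, not names from the spconv source; only the shape values (kv=27, 128 input channels, 64 output channels, fp32, sm_93) come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShapeKey:
    # Hypothetical key describing one tuning problem.
    kv: int
    in_channels: int
    out_channels: int
    dtype: str
    arch: Tuple[int, int]


# Shape family named in the commit message.
RESTRICTED_SHAPE = ShapeKey(kv=27, in_channels=128, out_channels=64,
                            dtype="f32", arch=(9, 3))

# Placeholder for the one descriptor validated by dense-oracle replay.
VALIDATED_DESC = "simt_validated"


def select_descs(key: ShapeKey, candidates: List[str]) -> List[str]:
    """Return the candidates the tuner may try for this shape."""
    if key == RESTRICTED_SHAPE:
        # Restricted family: keep only the validated descriptor.
        return [d for d in candidates if d == VALIDATED_DESC]
    # Every other shape keeps its full candidate list.
    return list(candidates)
```

Keeping the rule in one predicate is one way to make the Python path and the generated C++ path agree; the C++ side would need an equivalent check emitted by the generator.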
-
- 12 May, 2026 1 commit
-
-
one authored
Centralize the 1024-thread launch bound annotation for sparse index CUDA kernels and apply it consistently across index, hash table, mask, and SubM indice helper kernels. This keeps generated kernel definitions aligned with the launch configuration used by DTK runtime checks.
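As an illustration of what centralizing the annotation could look like in a Python code generator, here is a hedged sketch. The helper names (`launch_bounds`, `kernel_header`, `check_launch`) and the kernel name in the usage note are assumptions made for this example; only the 1024-thread bound and the CUDA `__launch_bounds__` qualifier are taken as given.

```python
# Single source of truth for the per-block thread limit.
MAX_THREADS_PER_BLOCK = 1024


def launch_bounds(max_threads: int = MAX_THREADS_PER_BLOCK) -> str:
    """Return the CUDA launch-bounds qualifier as source text."""
    return f"__launch_bounds__({max_threads})"


def kernel_header(name: str, params: str) -> str:
    """Build a kernel declaration line that carries the shared bound."""
    return f"__global__ void {launch_bounds()} {name}({params})"


def check_launch(threads_per_block: int) -> None:
    """Reject a launch configuration that exceeds the annotated bound."""
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError(
            f"{threads_per_block} threads/block exceeds "
            f"the annotated bound of {MAX_THREADS_PER_BLOCK}"
        )
```

For example, `kernel_header("build_hash_table", "int n")` yields `__global__ void __launch_bounds__(1024) build_hash_table(int n)`, so every generated helper kernel picks up the same bound from one constant.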
-
- 09 May, 2026 1 commit
-
-
one authored
Add a DTK-specific kernel filter path for running spconv through the DTK CUDA compatibility layer on BW100.

The recommended `dtk_simt` filter keeps only SIMT kernels, forces SIMT params to static codegen, and removes Volta/Turing/Ampere TensorOp, int8, and NVRTC paths from the active kernel set.

Add `dtk_tensorop` as a separate non-default adaptation entry point for future Ampere TensorOp work. This keeps static non-int8 Ampere TensorOp params while still excluding Volta/Turing, int8, and NVRTC paths.

Allow fp16 workloads to use SIMT fallback when `SPCONV_DTK_KERNEL_FILTER` is set to `dtk_simt`. This updates both the Python tuner and generated C++ ConvTunerSimple logic so fp16 no longer depends on currently unsupported TensorOp paths on DTK.

Add `SPCONV_FORCE_CUDA_ARCH` to keep runtime dispatch aligned with the compiled arch list, and keep the BW100 path explicit with `9.3`.

Adjust DTK build/runtime compatibility:
  - reuse the guarded editable-install state during constants setup
  - skip the Linux Thrust `-fno-gnu-unique` flag under the DTK inline-PTX compatibility path
  - add launch bounds to helper kernels that are launched with 1024 threads/block

This leaves full TensorOp, int8, fp8, and NVRTC support out of the recommended DTK path. Those remain future adaptation work.
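The filter behavior described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the spconv implementation: `KernelDesc`, `filter_kernels`, and `forced_arch` are hypothetical names, the static-codegen step is not modeled, and whether `dtk_tensorop` also keeps SIMT kernels is an assumption. Only the two environment variable names and the filter values come from the commit message.

```python
import os
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class KernelDesc:
    # Hypothetical descriptor; field names are illustrative only.
    name: str
    algo: str  # "Simt", "Volta", "Turing", or "Ampere"
    is_int8: bool = False
    is_nvrtc: bool = False


def filter_kernels(descs: List[KernelDesc],
                   mode: Optional[str]) -> List[KernelDesc]:
    """Apply a DTK kernel filter; no mode means no filtering."""
    if not mode:
        return list(descs)
    if mode == "dtk_simt":
        allowed = {"Simt"}
    elif mode == "dtk_tensorop":
        # Assumption: SIMT stays available next to Ampere TensorOp.
        allowed = {"Simt", "Ampere"}
    else:
        raise ValueError(f"unknown kernel filter: {mode!r}")
    # Both modes exclude int8 and NVRTC kernels.
    return [d for d in descs
            if d.algo in allowed and not d.is_int8 and not d.is_nvrtc]


def forced_arch(value: Optional[str]) -> Optional[Tuple[int, int]]:
    """Parse a value such as "9.3" into (9, 3); None if unset."""
    if not value:
        return None
    major, minor = value.strip().split(".")
    return int(major), int(minor)


if __name__ == "__main__":
    mode = os.environ.get("SPCONV_DTK_KERNEL_FILTER")
    arch = forced_arch(os.environ.get("SPCONV_FORCE_CUDA_ARCH"))
    demo = [KernelDesc("k0", "Simt"), KernelDesc("k1", "Ampere")]
    print(arch, [d.name for d in filter_kernels(demo, mode)])
```

With `SPCONV_DTK_KERNEL_FILTER=dtk_simt` and `SPCONV_FORCE_CUDA_ARCH=9.3`, this sketch would keep only the SIMT entry and report the arch as `(9, 3)`.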
-
- 15 Dec, 2024 3 commits
- 10 Dec, 2024 1 commit
-
-
yan.yan authored
-
- 09 Dec, 2024 1 commit
-
-
yan.yan authored
-
- 19 Apr, 2023 1 commit
-
-
yan.yan authored
-
- 23 Mar, 2023 3 commits
- 02 Feb, 2023 2 commits
- 20 Jan, 2023 2 commits
- 19 Jan, 2023 2 commits
- 17 Jan, 2023 1 commit
-
-
yan.yan authored
-
- 10 Jan, 2023 1 commit
-
-
yan.yan authored
-
- 04 Jan, 2023 1 commit
-
-
yan.yan authored
-
- 03 Jan, 2023 1 commit
-
-
yan.yan authored
-
- 29 Dec, 2022 1 commit
-
-
yan.yan authored
-
- 27 Dec, 2022 1 commit
-
-
FindDefinition authored
* large kernel bwd&bwdI, not test increment RS
* large kernel fix, no split_mask and increment rs
* large kernel fix2, no split_mask and increment rs
* reset benchmark.py
* fix merge

Co-authored-by: EvernightAurora <2465542858@qq.com>
-
- 05 Nov, 2022 2 commits
- 26 Oct, 2022 1 commit
-
-
yan.yan authored
-
- 18 Oct, 2022 2 commits
- 28 Sep, 2022 1 commit
-
-
yan.yan authored
-
- 26 Sep, 2022 3 commits
- 25 Sep, 2022 5 commits
- 24 Sep, 2022 2 commits