  1. 01 Dec, 2025 2 commits
    •
      [Enhancement] Implement dynamic unroll factor in CUDA code generation (#1360) · 388ee7ee
      Lei Wang authored
      * [Enhancement] Implement dynamic unroll factor in CUDA code generation
      
      This commit adds support for specifying a dynamic unroll factor in CUDA code generation. An `unroll_factor` map stores per-loop-variable unroll factors, and the `unroll` function is integrated into the loop language so that users can set unroll factors directly in their code. This allows unrolling strategies tailored to specific loop characteristics (see the sketch after this entry).
      
      * lint fix
      
      * [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation
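      A minimal sketch of how the new loop-level unroll factor might be used from the Python side (the `unroll_factor` keyword and the kernel body are illustrative assumptions, not the confirmed signature):

      ```python
      import tilelang.language as T

      @T.prim_func
      def scale(A: T.Tensor((1024,), "float16"), B: T.Tensor((1024,), "float16")):
          with T.Kernel(8, threads=128) as bx:
              # Hypothetical keyword: request "#pragma unroll 4" for this loop
              # instead of a full or compiler-chosen unroll.
              for i in T.unroll(128, unroll_factor=4):
                  B[bx * 128 + i] = A[bx * 128 + i]
      ```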
    •
      [Language] support `T.gemm_sp_v2` on sm80 and sm89 (#1056) · 283a9a00
      botbw authored
      * [misc] add a cpp side wrapper for gemm_sp_py
      
      * [misc] typing
      
      * [IR] bind GemmSPWarpPolicy
      
      * [chore] add wrapper code
      
      * [IR] fix GemmSPWarpPolicy
      
      * [codegen] apply ptxas instructions
      
      * [intrinsic] add typical (unused) mma layout
      
      * [template] add uint16 debug func
      
      * [intrinsic] add b matrix layout
      
      * [gemm_sp] enable fp16/bf16 on sm8x
      
      * [layout] refactor fp16/bf16 layout
      
      * [gemm_sp] enable int8
      
      * [chore] update test case dtype
      
      * [gemm_sp] enable fp32
      
      * [layout] refactor layouts
      
      * [intrinsic] enable ldmatrix for mat A
      
      * [layout] enable ldsm for matrix b
      
      * [layout] add ldmatrix for fp32 and fp8
      
      * [chore] refine
      
      * [chore] refactor
      
      * [chore] add fp8 E factor
      
      * [chore] refactor
      
      * [chore] add remove negative zero util
      
      * [example] add a custom compress kernel
      
      * [chore] minor update
      
      * [test] refactor gemm_sp test
      
      * [refactor] make metadata layout func
      
      * [example] add option for using cutlass layout
      
      * [doc] add a gemm_sp doc
      
      * [doc] minor polish
      
      * [chore] remove unused
      
      * [bugfix] fix non replicate b case
      
      * [test] refactor
      
      * [chore] add a check
      
      * [bugfix] fix util bug
      
      * [wip] init a new test case for v2
      
      * [chore] minor refactor
      
      * [chore] minor update
      
      * [bugfix] enable 16bit rs
      
      * [language] enable rs
      
      * [language] enable gemm_sp_sr
      
      * [language] enable gemm_sp_rr
      
      * [test] enable more tests
      
      * [tvm] update ffi binding
      
      * [chore] remove print
      
      * [chore] fix benchmark script
      
      * [lint] precommit lint
      
      * [chore] apply feedback
      
      * [test] use arch 8.0
      
      * [chore] rollback ::ordered_metadata for backward compatibility
      
      * [bugfix] fix capitalization
      
      * [example] keep gemm_sp on hopper
      
      * [test] fix no fp8 normal kernel
      
      * [test] reduce matmul size to keep accumulated error within tolerance
      
      * [test] use cal_diff for assertion
      
      * [bugfix] expand float8 type
      
      * [lib] add make_int4 for short type
      
      * [language] add transpose E
      
      * [bugfix] fix wrong var
      
      * [format] format
      
      * [chore] refactor binding
      
      * [chore] fix wrong passing var
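      Taken together, the bullets above build up a sparse GEMM path; a usage sketch for the new `T.gemm_sp_v2` call follows (tile sizes, the metadata tensor's shape/dtype, and the argument order are assumptions based on 2:4 structured sparsity; the gemm_sp doc added in this PR has the real interface):

      ```python
      import tilelang.language as T

      M = N = K = 1024
      block_M, block_N, block_K = 128, 128, 64

      # 2:4 sparsity: A is stored compressed along K (K // 2 values) plus a
      # metadata tensor E; E's shape and dtype below are placeholders that
      # depend on the chosen metadata layout and element type.
      @T.prim_func
      def sparse_gemm(
          A: T.Tensor((M, K // 2), "float16"),
          E: T.Tensor((M, K // 8), "uint8"),
          B: T.Tensor((K, N), "float16"),
          C: T.Tensor((M, N), "float16"),
      ):
          with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
              A_s = T.alloc_shared((block_M, block_K // 2), "float16")
              E_s = T.alloc_shared((block_M, block_K // 8), "uint8")
              B_s = T.alloc_shared((block_K, block_N), "float16")
              C_f = T.alloc_fragment((block_M, block_N), "float32")
              T.clear(C_f)
              for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                  T.copy(A[by * block_M, k * block_K // 2], A_s)
                  T.copy(E[by * block_M, k * block_K // 8], E_s)
                  T.copy(B[k * block_K, bx * block_N], B_s)
                  T.gemm_sp_v2(A_s, E_s, B_s, C_f)
              T.copy(C_f, C[by * block_M, bx * block_N])
      ```

      The rs/sr/rr variants enabled above would correspond to the A and B operands coming from register fragments rather than shared memory, in each combination.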
  2. 15 Sep, 2025 1 commit
    •
      [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to the cutlass 2.x implementation. The cutlass example can't trigger it because the input is padded there.
      
      For now it seems safe to increase the atom size and investigate further later (see the guard sketch after this change list).
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestions && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
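      For the two-warps-over-N=32 failure noted in the change list, a sketch of the kind of parameter check the commit alludes to (the names and the atom width are hypothetical; the actual mitigation was to enlarge the atom):

      ```python
      # Hypothetical guard: reject tile shapes whose per-warp N extent falls
      # below the mma atom width instead of reading out of bounds. cutlass
      # sidesteps the case by padding the input; ATOM_N = 32 mirrors the
      # enlarged atom mentioned above and is an assumption.
      ATOM_N = 32

      def check_gemm_sp_tile(block_N: int, warps_N: int) -> None:
          per_warp_N = block_N // warps_N
          if per_warp_N < ATOM_N:
              raise ValueError(
                  f"per-warp N extent {per_warp_N} < mma atom width {ATOM_N}; "
                  "pad N or reduce the warp count along N")

      check_gemm_sp_tile(block_N=64, warps_N=2)    # ok: 32 >= 32
      # check_gemm_sp_tile(block_N=32, warps_N=2)  # raises: the reported case
      ```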