- 18 Dec, 2025 1 commit
-
-
Gabriel Wu authored
* feat: CuTeDSL backend * fix: clang-tidy * fix: clang-format * fix: ci * fix: revert example gemm fp8 * fix: remove duplicate code * fix: switch-case * fix: fp16 silence * fix: TVM IR print * fix: useless tir * fix: clang-format * fix: remove tilelang/contrib/cutedsl/.gitignore * fix: use hexfloat * fix: gsym guard * fix: unknown storage sync type * fix: string literal * fix: add args guard * fix: name hint dedup * fix: better find_kernel_by_pattern * fix: set libpath for from_database path * fix: guard buffer.strides * fix: from guard * fix: eviction guard * fix: use thread local tma descs * fix: ruff * fix: drop tma_init_cpp * fix: exc_info * fix: negative unmatch early return * fix: rename postproc func and add test * fix: handle fast math according to pass config * fix: dyn_sym parse * fix: wrap_forward * fix: use tvm_ffi.libinfo instead of cli * fix: keep signature * fix: C++ string safety * fix: mark tma_store_add as unsupported * fix: tvm version * resolve ldsm and cpasync issues. * fix: minor fixes * fix: parse signature using ast * fix: guard global_addr * fix: create tempfile only when necessary * fix: use logger.execption for exceptions * fix: guard lib_path and host_func * fix: remove tma_cpp_init and add timeout for cpp compile * add timeout for mbarrier_wait. * fix: _load_kernel_from_disk signature * resolve codegen issues. * fix: logger.exception * add comment for div_by=1 * merge * fix: reserve cutlass,cute,tl * fix: guard tma_store * fix: allow int64 offset in make_tensor_at_offset * fix: guard barrier * fix: add comments for div_by=16 * fix: div_by=1 issue * delete div_by when offset is 0 * use tl.make_tensor when offset is 0 * fix: explicitly check cutedsl target * fix: use param.torch_dtype() --------- Co-authored-by:
yuxic <yuxic@nvidia.com> Co-authored-by:
Yong <yong@local> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
- 14 Oct, 2025 1 commit
-
-
Lei Wang authored
* Donot lower ceildiv to >> * lint fix * test fix * fallback ceildiv changes
-
- 19 Aug, 2025 1 commit
-
-
Zhengju Tang authored
* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16 * [Dequant] Add extern call and serial dequantization * [Dequant] Parallel Dequant wait for fence debug. * [Scale] Add scale matrix to mxfp4 gemm * [Remove] Remove fence-buggy example and some generated source cuda code * [MXFP4] Update initial version of MXFP4 GEMM * [Scale] Add scale to latest mxfp4 gemm * [Lint] * [BugFix] Load Scale, disabe TMA to recover performance * [Lint] * [Lint] * [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance * [Lint] * Update example_dequant_gemm_bf16_fp4_hopper_serial.py * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory. * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding. * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency. * lint fix * ci fix * Remove non-existent example * [BugFix] Add smem swizzle to recover performance of TMA * [BugFix] Enough reg for producer when threads=512 --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-