Commits · 7248a810d97ca8ceb999cc0a9e2bf58adc68f263 · OpenDAS / tilelang

18 Dec, 2025 1 commit

feat(cutedsl): add CuTeDSL backend (#1421) · 7248a810

Gabriel Wu authored Dec 18, 2025



* feat: CuTeDSL backend

* fix: clang-tidy

* fix: clang-format

* fix: ci

* fix: revert example gemm fp8

* fix: remove duplicate code

* fix: switch-case

* fix: fp16 silence

* fix: TVM IR print

* fix: useless tir

* fix: clang-format

* fix: remove tilelang/contrib/cutedsl/.gitignore

* fix: use hexfloat

* fix: gsym guard

* fix: unknown storage sync type

* fix: string literal

* fix: add args guard

* fix: name hint dedup

* fix: better find_kernel_by_pattern

* fix: set libpath for from_database path

* fix: guard buffer.strides

* fix: from guard

* fix: eviction guard

* fix: use thread local tma descs

* fix: ruff

* fix: drop tma_init_cpp

* fix: exc_info

* fix: negative unmatch early return

* fix: rename postproc func and add test

* fix: handle fast math according to pass config

* fix: dyn_sym parse

* fix: wrap_forward

* fix: use tvm_ffi.libinfo instead of cli

* fix: keep signature

* fix: C++ string safety

* fix: mark tma_store_add as unsupported

* fix: tvm version

* resolve ldsm and cpasync issues.

* fix: minor fixes

* fix: parse signature using ast

* fix: guard global_addr

* fix: create tempfile only when necessary

* fix: use logger.exception for exceptions

* fix: guard lib_path and host_func

* fix: remove tma_cpp_init and add timeout for cpp compile

* add timeout for mbarrier_wait.

* fix: _load_kernel_from_disk signature

* resolve codegen issues.

* fix: logger.exception

* add comment for div_by=1

* merge

* fix: reserve cutlass,cute,tl

* fix: guard tma_store

* fix: allow int64 offset in make_tensor_at_offset

* fix: guard barrier

* fix: add comments for div_by=16

* fix: div_by=1 issue

* delete div_by when offset is 0

* use tl.make_tensor when offset is 0

* fix: explicitly check cutedsl target

* fix: use param.torch_dtype()

---------
Co-authored-by: yuxic <yuxic@nvidia.com>
Co-authored-by: Yong <yong@local>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>


14 Oct, 2025 1 commit

[Transform] Migrate `LowerIntrin` from tvm into tilelang (#999) · 7a5077e4

Lei Wang authored Oct 14, 2025

* Do not lower ceildiv to >>

* lint fix

* test fix

* fallback ceildiv changes

19 Aug, 2025 1 commit

[Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) · 24603e4a

Zhengju Tang authored Aug 19, 2025



* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16

* [Dequant] Add extern call and serial dequantization

* [Dequant] Parallel Dequant wait for fence debug.

* [Scale] Add scale matrix to mxfp4 gemm

* [Remove] Remove fence-buggy example and some generated source cuda code

* [MXFP4] Update initial version of MXFP4 GEMM

* [Scale] Add scale to latest mxfp4 gemm

* [Lint]

* [BugFix] Load Scale, disable TMA to recover performance

* [Lint]

* [Lint]

* [Scale] Using L2 to hold Scale and enabling TMA slightly boosts performance

* [Lint]

* Update example_dequant_gemm_bf16_fp4_hopper_serial.py

* Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.

* Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.

* Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.

* lint fix

* ci fix

* Remove non-existent example

* [BugFix] Add smem swizzle to recover performance of TMA

* [BugFix] Enough reg for producer when threads=512

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
