Commits · 7c61d31a1d27826c5f61402fcba6e3efa36c2076 · OpenDAS / tilelang

03 Nov, 2025 1 commit

[Bugfix] Legalize Datatype for mma intrinsic codegen (#1179) · 7c61d31a

Lei Wang authored Nov 03, 2025

* fix

* lint fix

* Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.

7c61d31a

02 Nov, 2025 4 commits

[Language] Add Correctness and performance check scripts for V2 (#1174) · d99853b6
Lei Wang authored Nov 03, 2025
```
* fix

* lint fix

* fix

* lint fix

* fix

* upd
```
d99853b6

[Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb

Lei Wang authored Nov 03, 2025



* remove debug print

* pipeline fix

* use the correct buffer access scope

* rs support

* wrap warpgroup_fence_operand

* fix

* fp8 dtype ptx enhance

* mma fix

* TCGEN05 Interface

* tcgen05 support

* rebase

* update

* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.

* lint fix

* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.

* wgmma fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

aef0a6bb

[Bugfix] Fix tvm import path for editable build (#1172) · c85bb3ac
Lei Wang authored Nov 02, 2025

c85bb3ac

[Refactor]: Change the params in pytest to avoid oom error during ci (#1170) · 13bdcd60

Yuqi Dong authored Nov 02, 2025

* [Refactor]: Change the params in pytest to avoid oom error during ci

* format

* fix

* Update test_example_cast.py

* Update parameters in test_example_cast

* Update test_example_flash_attention.py

* update

* format

* fix

* fix

* format

13bdcd60

01 Nov, 2025 1 commit
- [Testing] Move TMA 1D and test for its functionality (#1167) · 5c62d00a
  Zhengju Tang authored Nov 01, 2025
```
* [Testing] Move TMA 1D and test for its functionality

* [Lint]
```
  5c62d00a
31 Oct, 2025 4 commits

[Bugfix] Support 16bits shfl_sync (#1169) · 54d4bd62

Lei Wang authored Oct 31, 2025

* Add type-safe warp shuffle helpers for 16-bit float types in common.h

- Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`.
- Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations.
- Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability.

* lint fix

54d4bd62
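
The shuffle helpers named in this commit (`shfl_xor_sync` and friends) are the building blocks of warp-level reductions. As a rough illustration only, the XOR-butterfly pattern such a reduction typically follows can be simulated in plain Python. This is not the code in `common.h` or `reduce.h`; `shfl_xor` and `warp_all_reduce_sum` are hypothetical names for this sketch, and a list of 32 values stands in for the lanes of a warp.

```python
WARP_SIZE = 32  # assumed warp width for this simulation


def shfl_xor(values, lane_mask):
    """Simulate an XOR shuffle: lane i reads the value held by lane i ^ lane_mask."""
    return [values[i ^ lane_mask] for i in range(len(values))]


def warp_all_reduce_sum(values):
    """Butterfly all-reduce: after log2(WARP_SIZE) steps every lane holds the total."""
    assert len(values) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(values, offset)
        values = [a + b for a, b in zip(values, partner)]
        offset //= 2
    return values


lanes = [float(i) for i in range(WARP_SIZE)]
result = warp_all_reduce_sum(lanes)  # every lane ends with 0 + 1 + ... + 31
```

On a GPU the same exchange runs on 16-bit values, which is presumably why the commit adds type-specific overloads for `cutlass::half_t` and `cutlass::bfloat16_t`.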

[Bugfix] Enable code lowering with producer‑copy‑only program (#1168) · 7a80b6df

Lei Wang authored Oct 31, 2025

* bugfix

* lint fix

* Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.

* Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.

* Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

7a80b6df

[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28

Lei Wang authored Oct 31, 2025



* 3rdparty tvm bump

* bump tvm into v0.22.0

* lint fix

* rebase tvm

* Update submodule tvm to latest commit 3085bc4

* Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang

* test fix

* add requirement

* atomic_fix

* atomic_fix

* phaseout py39

* optimize

* optimize

* lint fix

* do not clean cache

* do not clean cache

* [Minor] Minor update for Python versions and dependencies

* [Lint] fix lint for py39

* [Lint] fix lint for ROCm

* [Build][CI] Sync CI changes from upstream/sdist

* [Lint] fix lint for ROCm

* [Build][CI] Update `repair-wheel-command`

* [Minor] update abi3audit result format

* [Lint] fix lint for ROCm

* [BugFix] fix build

* [Lint] fix lint for ROCm

* [BugFix] set rpath for libtvm and libtvm_runtime

* [Deps] pin apache-tvm-ffi version

* [Build] set Python 3.9 Limited API for Cython target

* [Build] set Python 3.9 Limited API for Cython target

* [Deps] Restore Python 3.8 support

* [Build] use `apache-tvm-ffi`'s `libtvm_ffi`

* [BugFix] use `;` as delimiter for RPATH on macOS

* [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`

* [Build] support `sccache` if available

* [Build] add CIBW import test

* [Build][CI] enable ccache for CIBW on Linux

* [BugFix] set rpath for libtvm and libtvm_runtime

* Revert "[Build][CI] enable ccache for CIBW on Linux"

This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.

* [CI] fix perfbench bot

* [BugFix] use Python 3.9 to build wheel

* [Minor] update perfbench bot envs

* [BugFix] fix CIBW environment on Linux

* [CI] skip import test on CentOS 7

* [CI] use Python urllib to download file instead of Wget

---------
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>

10911e28

[Release] Bump version to v0.1.6.post2 (#1160) · c37621c5

Lei Wang authored Oct 31, 2025

* [Release] Update README and VERSION for v0.1.6.post2 compatibility with Python 3.8

* [Enhancement] Update packaging configuration and Docker scripts for multi-architecture support

* Add allowlist for TVM, CUTLASS, and Composable Kernel items in pyproject.toml
* Enhance docker_local_distribute.sh to support cross-architecture builds using docker buildx
* Modify pypi.manylinux.Dockerfile to accept TARGETARCH argument for better architecture handling

* [Enhancement] Improve Docker scripts and build process for multi-architecture support

* Update .gitignore to include dist directories
* Refactor docker_local_distribute.sh for better cross-architecture handling and error management
* Enhance docker_pypi_distribute.sh to support multi-architecture builds with docker buildx
* Modify pypi_distribution.sh to clean up additional directories
* Update pypi.manylinux.Dockerfile for improved environment configuration and architecture handling

* fix

* Remove outdated classifier for Artificial Intelligence from pyproject.toml

* Update pyproject.toml classifiers and modify Docker distribution scripts for clarity

* Add new classifier for Artificial Intelligence in pyproject.toml
* Rename output directories in docker_local_distribute.sh and docker_pypi_distribute.sh for better context

c37621c5

29 Oct, 2025 6 commits

[Bugfix] Enhance LetStmt handling in Vectorize Loop Pass (#1159) · 79730b11

Lei Wang authored Oct 30, 2025

* [Refactor] Enhance TLVectorizer with loop vectorization convenience method and improve let variable handling

* lint fix

* let test fix

* lint fix

79730b11

[Enhancement] Enhance Cast operations Vectorization (#1156) · feef9ef6
LJC00118 authored Oct 29, 2025
```
* Enhance Cast vectorized

* Add Parallel vectorized cast test

* code lint

* merge newest commit
```
feef9ef6
[Refactor]: Move device_assert from extern_call to intrin_call (#1134) · 198f22b3
Yuqi Dong authored Oct 29, 2025
```
* update

* Update codegen_cuda.cc
```
198f22b3

[BugFix] Correct direct copy from bf16 to fp8 (#1090) · e1b12bd0

Cunxiao Ni authored Oct 29, 2025



* [BugFix] Correct direct copy from bf16 to fp8

* fix lint

* implement overloaded cast codegen for type conversion

* fix lint

* remove test

* fix lint

* trigger CI

* Overload fp8 for implicit conversion

* format

* new format

* fix: Reinterpret types to cute types in GEMM

* new format

* fix lint

* new format

* fix lint

* format

* trigger ci

---------
Co-authored-by: nicunxiao <nicunxiao@bytedance.com>

e1b12bd0

[CI] use Python urllib to download file instead of Wget (#1154) · d9a0f131
Xuehai Pan authored Oct 29, 2025

d9a0f131
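
For readers unfamiliar with the approach, replacing `wget` with the standard library usually looks something like the sketch below. It is a generic example and not the script used in this repository's CI; the `download` helper, its parameters, and the default timeout are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Stream `url` to `dest` using only the Python standard library."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlopen returns a file-like response; copyfileobj streams it in chunks
    # so large files are never held fully in memory.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

The practical benefit is that the CI image no longer needs `wget` installed, only a Python interpreter.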
[CI] allow dirty workspace for `format.sh` and introduce loop carry thread sync unit test (#1153) · 4efd2d2d
Lei Wang authored Oct 29, 2025
```
* atomic_fix

* atomic_fix

* mem fix

* lint fix

* add some comments

* fix

* fix

* lint fix

* handle async copy

* lint fix

* lint fix
```
4efd2d2d

28 Oct, 2025 5 commits
- [Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection (#1146) · f7ba45d8
  Lei Wang authored Oct 29, 2025
```
* atomic_fix

* atomic_fix

* mem fix

* lint fix

* add some comments

* fix

* fix

* lint fix

* handle async copy

* lint fix
```
  f7ba45d8
- [BugFix] Implement bfloat16 support in CUDA code generation with min/max... · c70b2697
  Tong WU authored Oct 29, 2025
```
[BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values (#1143)

* Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values

* refactor

* fix prev typo

* bugfix

* lint

* bugfix
```
  c70b2697
- [Refactor] Remove amd gemm_v2 tests (#1149) · bc773c56
  Lei Wang authored Oct 29, 2025
  
  bc773c56
- [BugFix] alloc_var init failed to handle complex expression (#1144) · 399af087
  Kurisu authored Oct 28, 2025
```
* [Fix] init var with complex expression

* fix lint error
```
  399af087
- [AMD] Support T.gemm_v2 for AMD Backend (#1136) · 60567ba3
  Jiaxing Ding authored Oct 28, 2025
  
  60567ba3
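
The "classic arena algorithm" mentioned in #1146 above is not spelled out in the commit message. As a minimal sketch under stated assumptions, a lifetime-based arena planner can be written as a greedy first-fit: buffers whose live ranges overlap must occupy disjoint byte ranges, while buffers with disjoint live ranges may share memory. The `Buffer` type, `plan_arena`, and the 16-byte alignment are hypothetical choices for this example. It does not reproduce the pass in tilelang and does not cover the WAW conflict detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int   # bytes
    start: int  # index of the first statement that touches the buffer
    end: int    # index of the last statement that touches it (inclusive)


def align_up(value, alignment):
    return -(-value // alignment) * alignment


def lifetimes_overlap(a, b):
    return a.start <= b.end and b.start <= a.end


def plan_arena(buffers, alignment=16):
    """Assign each buffer the lowest aligned offset that avoids live neighbours."""
    placed = []   # (offset, buffer) pairs already assigned
    offsets = {}
    for buf in sorted(buffers, key=lambda b: (b.start, -b.size)):
        # Byte ranges that are occupied while `buf` is live, lowest first.
        busy = sorted((off, off + other.size)
                      for off, other in placed if lifetimes_overlap(buf, other))
        offset = 0
        for lo, hi in busy:
            if offset + buf.size <= lo:
                break  # fits in the gap before this occupied range
            offset = max(offset, align_up(hi, alignment))
        offsets[buf.name] = offset
        placed.append((offset, buf))
    total = max((off + b.size for off, b in placed), default=0)
    return offsets, total


buffers = [Buffer("A", 64, 0, 2), Buffer("B", 32, 1, 3), Buffer("C", 64, 4, 5)]
offsets, total = plan_arena(buffers)  # C reuses A's bytes because A is dead by then
```

In this example `A` and `B` are live at the same time and so sit side by side, while `C` starts after both have ended and is placed back at offset 0.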
27 Oct, 2025 9 commits

[Bugfix] Correctly construct the argument list for atomic add based on the vector size (#1137) · 7d389a43
Lei Wang authored Oct 28, 2025
```
* atomic_fix

* atomic_fix
```
7d389a43
[BugFix] Add memory order and testing script for split version GQA bwd kernel (#1100) · 853f9c3d
Zhengju Tang authored Oct 28, 2025
```
* [BugFix] Add memory order for split version kernel; Remove torch manual seed

* [Lint] Manual
```
853f9c3d

Add int2 and longlong4 pack functions (#1129) · 4c9da81a

LJC00118 authored Oct 27, 2025

* Remove an incorrect check

* add fp8 pack function

* code lint

* minor fix

* minor fix

* minor fix

* Minor fix

* Minor fix

* add pack function

* code lint

* code lint

4c9da81a

[Benchmark] Update triton and helion baselines in mamba-chunk-scan (#1131) · 95e7bc37
Yu Cheng authored Oct 27, 2025
```
* [Benchmark] Update triton and helion baselines in mamba-chunk-scan

* lint

* update mamba baseline version
```
95e7bc37
[Build][CI] Build and test SDist in release CI (#1098) · 6e1dc6a1
Xuehai Pan authored Oct 27, 2025

6e1dc6a1
[Feature]: Add device assert (#1116) · 5475f8e7
Yuqi Dong authored Oct 27, 2025
```
* update

* update
```
5475f8e7
[Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init (#1121) · 17a63976
Yu Cheng authored Oct 27, 2025
```
* [Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init

* lint
```
17a63976

[CI]: Bump actions/download-artifact from 5 to 6 (#1127) · 0dc50a54

dependabot[bot] authored Oct 27, 2025

Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

0dc50a54

[CI]: Bump actions/upload-artifact from 4 to 5 (#1128) · 69113a6d

dependabot[bot] authored Oct 27, 2025

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

69113a6d

25 Oct, 2025 1 commit

[Feature] Add memory_order PTX for vectorized atomic add (#1112) · 59865bdf

Zhengju Tang authored Oct 25, 2025



* [Feature] Add memory_order PTX for vectorized (2x) atomic add

* [Feature] Add memory_order PTX for all vectorized atomic add

* [Lint]

* test

* [BugFix] Fix init optional argument in alloc_var

* bug fix

* bug fix

* lint fix

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

59865bdf

24 Oct, 2025 1 commit
- [Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) (#1119) · 65c4711f
  Lei Wang authored Oct 24, 2025
```
* fix int32 dtype issue

* lint fix

* lint

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  65c4711f
23 Oct, 2025 4 commits

[Feature] Support None type as input for `T.ptr` and `T.Tensor` (#1114) · 50e789dd
Wenhao Xie authored Oct 23, 2025
```
* [Feature] Support None type as input for T.ptr and T.Tensor

* lint

* lint

* lint

* lint fix
```
50e789dd

[Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a

Tong WU authored Oct 23, 2025

* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen

* Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
* Enhanced the existing code to support both directions of conversion based on the lane count.
* Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.

* [Feature] Add float32 to float8 conversion support in CUDA codegen

* Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
* Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
* Enhanced type handling for better compatibility with TileLang, particularly for float8 types.

* lint

* fix a bug

* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast

* lint

* [Feature] Refactor bf16 conversion operations and remove legacy compile flags

* lint

a148d62a
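
The float16/float32 conversions described in #1095 above work on pairs of lanes (`__half22float2`, `__float22half2_rn`). The idea can be illustrated on the host with a short Python sketch that converts a vector two lanes at a time and falls back to a scalar conversion for an odd tail. This is an illustration of the pairing scheme only, not the generated CUDA; the helper names are hypothetical, and Python's `struct` format `e` stands in for the hardware conversion.

```python
import struct


def float_to_half_bits(x: float) -> int:
    """Round to IEEE binary16 (nearest-even) and return the 16-bit pattern."""
    # Note: struct raises OverflowError for magnitudes too large for binary16
    # rather than returning infinity.
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_bits_to_float(bits: int) -> float:
    """Widen a binary16 bit pattern back to a Python float (always exact)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_half_lanes(values):
    """Convert lanes two at a time, with a scalar fallback for an odd tail."""
    out = []
    n = len(values)
    for i in range(0, n - 1, 2):
        # One "2-lane" conversion per iteration.
        out.extend((float_to_half_bits(values[i]), float_to_half_bits(values[i + 1])))
    if n % 2:
        out.append(float_to_half_bits(values[-1]))
    return out


def from_half_lanes(bits):
    return [half_bits_to_float(b) for b in bits]


bits = to_half_lanes([1.0, -2.0, 0.5, 65504.0])
back = from_half_lanes(bits)
```

All four inputs are exactly representable in binary16, so the round trip returns them unchanged.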

[Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic (#1111) · 86c8bb46

Lei Wang authored Oct 23, 2025

* [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic

* Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
* Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.

* lint fix

* remove debug print

86c8bb46

[Lint] Enable pyupgrade linter in ruff (#963) · f14fb111
Yichen Yan authored Oct 23, 2025
```
* update rules

* ruff check

* other fixes

* fmt

* do not touch examples

* fmt
```
f14fb111

22 Oct, 2025 4 commits
- [Benchmark] Update Mamba2_chunk_scan benchmark (#1110) · 4f3523dc
  Yu Cheng authored Oct 22, 2025
  
  4f3523dc
- [Benchmark] Add Mamba2_chunk_scan benchmark (#1109) · 717f7b5d
  Yu Cheng authored Oct 22, 2025
  
  717f7b5d
- [Maint] Update uncommitted change detection command in `format.sh` (#1102) · e28433e0
  Xuehai Pan authored Oct 22, 2025
```
* [Maint] Remove pre-commit install in `format.sh`

* [Maint] Update uncommitted change detection command

* [Minor] update warning messages
```
  e28433e0
- [Refactor] Use forceinline in `ldmatrix` and update mamba scan kernel (#1104) · 8a5eb569
  Yu Cheng authored Oct 22, 2025
  
  8a5eb569