Commits · 10911e280fe8c7c3d603a08a062fde81aea6a819 · OpenDAS / tilelang

31 Oct, 2025 2 commits

[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28

Lei Wang authored Oct 31, 2025



* 3rdparty tvm bump

* bump tvm into v0.22.0

* lint fix

* rebase tvm

* Update submodule tvm to latest commit 3085bc4

* Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang

* test fix

* add requirement

* atomic_fix

* atomic_fix

* phaseout py39

* optimize

* optimize

* lint fix

* do not clean cache

* do not clean cache

* [Minor] Minor update for Python versions and dependencies

* [Lint] fix lint for py39

* [Lint] fix lint for ROCm

* [Build][CI] Sync CI changes from upstream/sdist

* [Lint] fix lint for ROCm

* [Build][CI] Update `repair-wheel-command`

* [Minor] update abi3audit result format

* [Lint] fix lint for ROCm

* [BugFix] fix build

* [Lint] fix lint for ROCm

* [BugFix] set rpath for libtvm and libtvm_runtime

* [Deps] pin apache-tvm-ffi version

* [Build] set Python 3.9 Limited API for Cython target

* [Build] set Python 3.9 Limited API for Cython target

* [Deps] Restore Python 3.8 support

* [Build] use `apache-tvm-ffi`'s `libtvm_ffi`

* [BugFix] use `;` as delimiter for RPATH on macOS

* [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`

* [Build] support `sccache` if available

* [Build] add CIBW import test

* [Build][CI] enable ccache for CIBW on Linux

* [BugFix] set rpath for libtvm and libtvm_runtime

* Revert "[Build][CI] enable ccache for CIBW on Linux"

This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.

* [CI] fix perfbench bot

* [BugFix] use Python 3.9 to build wheel

* [Minor] update perfbench bot envs

* [BugFix] fix CIBW environment on Linux

* [CI] skip import test on CentOS 7

* [CI] use Python urllib to download file instead of Wget

---------
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>


[Release] Bump version to v0.1.6.post2 (#1160) · c37621c5

Lei Wang authored Oct 31, 2025

* [Release] Update README and VERSION for v0.1.6.post2 compatibility with Python 3.8

* [Enhancement] Update packaging configuration and Docker scripts for multi-architecture support

* Add allowlist for TVM, CUTLASS, and Composable Kernel items in pyproject.toml
* Enhance docker_local_distribute.sh to support cross-architecture builds using docker buildx
* Modify pypi.manylinux.Dockerfile to accept TARGETARCH argument for better architecture handling

* [Enhancement] Improve Docker scripts and build process for multi-architecture support

* Update .gitignore to include dist directories
* Refactor docker_local_distribute.sh for better cross-architecture handling and error management
* Enhance docker_pypi_distribute.sh to support multi-architecture builds with docker buildx
* Modify pypi_distribution.sh to clean up additional directories
* Update pypi.manylinux.Dockerfile for improved environment configuration and architecture handling

* fix

* Remove outdated classifier for Artificial Intelligence from pyproject.toml

* Update pyproject.toml classifiers and modify Docker distribution scripts for clarity

* Add new classifier for Artificial Intelligence in pyproject.toml
* Rename output directories in docker_local_distribute.sh and docker_pypi_distribute.sh for better context


29 Oct, 2025 6 commits

[Bugfix] Enhance LetStmt handling in Vectorize Loop Pass (#1159) · 79730b11

Lei Wang authored Oct 30, 2025

* [Refactor] Enhance TLVectorizer with loop vectorization convenience method and improve let variable handling

* lint fix

* let test fix

* lint fix


[Enhancement] Enhance Cast operations Vectorization (#1156) · feef9ef6
LJC00118 authored Oct 29, 2025
```
* Enhance Cast vectorized

* Add Parallel vectorized cast test

* code lint

* merge newest commit
```
[Refactor]:Move device_assert from extern_call to intrin_call (#1134) · 198f22b3
Yuqi Dong authored Oct 29, 2025
```
* update

* Update codegen_cuda.cc
```

[BugFix] Correct direct copy from bf16 to fp8 (#1090) · e1b12bd0

Cunxiao Ni authored Oct 29, 2025



* [BugFix] Correct direct copy from bf16 to fp8

* fix lint

* implement overloaded cast codegen for type conversion

* fix lint

* remove test

* fix lint

* trigger CI

* Overload fp8 for implicit conversion

* format

* new format

* fix: Reinterpret types to cute types in GEMM

* new format

* fix lint

* new format

* fix lint

* format

* trigger ci

---------
Co-authored-by: nicunxiao <nicunxiao@bytedance.com>


[CI] use Python urllib to download file instead of Wget (#1154) · d9a0f131
Xuehai Pan authored Oct 29, 2025
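
As a rough illustration of the switch this commit describes, a download step can be written with only the Python standard library, which removes the dependency on a `wget` binary in the CI image. The sketch below is an assumption about the general approach, not the script used in the workflow; the helper name `download_file` and its parameters are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest, timeout=30.0):
    """Fetch `url` and write the response body to `dest` (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlopen accepts http(s):// and file:// URLs; copy the body to disk in chunks
    # so large files are never held in memory all at once.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

A call such as `download_file(url, "downloads/archive.tar.gz")` would then stand in for a `wget` invocation with an output path.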

[CI] allow dirty workspace for `format.sh` and introduce loop carry thread sync unit test (#1153) · 4efd2d2d
Lei Wang authored Oct 29, 2025
```
* atomic_fix

* atomic_fix

* mem fix

* lint fix

* add some comments

* fix

* fix

* lint fix

* handle async copy

* lint fix

* lint fix
```

28 Oct, 2025 5 commits
- [Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection (#1146) · f7ba45d8
  Lei Wang authored Oct 29, 2025
```
* atomic_fix

* atomic_fix

* mem fix

* lint fix

* add some comments

* fix

* fix

* lint fix

* handle async copy

* lint fix
```
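
For readers unfamiliar with arena-style planning, the idea behind merging shared-memory buffers can be sketched as follows: every buffer gets an offset inside one arena, and two buffers may share bytes only if their lifetimes do not overlap. This is a simplified first-fit sketch under assumed inputs (a size and a live interval per buffer), not the pass implemented in this commit, and it does not cover the WAW conflict detection mentioned in the title. The names `Buffer` and `plan_arena` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    name: str
    size: int   # bytes
    start: int  # index of the first statement that touches the buffer
    end: int    # index of the last statement that touches it (inclusive)


def plan_arena(buffers, alignment=16):
    """Assign each buffer an offset in one arena; return (offsets, arena_size)."""

    def align(n):
        return (n + alignment - 1) // alignment * alignment

    offsets = {}
    placed = []  # (offset, size, end) for every buffer placed so far
    arena_size = 0
    # Visit buffers in order of first use; larger buffers first on ties.
    for buf in sorted(buffers, key=lambda b: (b.start, -b.size)):
        size = align(buf.size)
        # Byte ranges held by buffers that are still live when `buf` begins.
        busy = sorted((off, off + sz) for off, sz, end in placed if end >= buf.start)
        # First fit: take the lowest gap between busy ranges that is large enough.
        candidate = 0
        for lo, hi in busy:
            if candidate + size <= lo:
                break
            candidate = max(candidate, hi)
        offsets[buf.name] = candidate
        placed.append((candidate, size, buf.end))
        arena_size = max(arena_size, candidate + size)
    return offsets, arena_size
```

With buffers `A` (64 bytes, live 0..1), `B` (32 bytes, live 0..3) and `C` (64 bytes, live 2..3), this sketch places `C` at offset 0 once `A` is dead, so the arena needs 96 bytes rather than the 160 bytes of three separate allocations.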
- [BugFix] Implement bfloat16 support in CUDA code generation with min/max... · c70b2697
  Tong WU authored Oct 29, 2025
```
[BugFix] Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values (#1143)

* Implement bfloat16 support in CUDA code generation with min/max functions and inf/nan values

* refactor

* fix prev typo

* bugfix

* lint

* bugfix
```
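
The min/max and inf/nan values this commit refers to follow directly from the bfloat16 layout: a bfloat16 value is the upper 16 bits of an IEEE-754 float32 (1 sign bit, 8 exponent bits, 7 mantissa bits). The sketch below derives the constants from their bit patterns in plain Python; it is only an illustration of the number format, not the code emitted by the CUDA backend, and the helper and constant names are hypothetical.

```python
import math
import struct


def bf16_bits_to_float(bits):
    """Interpret a 16-bit bfloat16 pattern as a Python float."""
    # Pad with 16 zero bits to obtain the float32 with the same value.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


BF16_MAX = bf16_bits_to_float(0x7F7F)     # largest finite value
BF16_LOWEST = bf16_bits_to_float(0xFF7F)  # most negative finite value
BF16_INF = bf16_bits_to_float(0x7F80)     # +infinity (exponent all ones, mantissa 0)
BF16_NAN = bf16_bits_to_float(0x7FC0)     # quiet NaN (exponent all ones, top mantissa bit set)
```

The largest finite value works out to `(2 - 2**-7) * 2**127`, roughly 3.39e38, which is close to the float32 maximum because both formats share the same exponent width.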
- [Refactor] Remove amd gemm_v2 tests (#1149) · bc773c56
  Lei Wang authored Oct 29, 2025
  
- [BugFix] alloc_var init failed to handle complex expression (#1144) · 399af087
  Kurisu authored Oct 28, 2025
```
* [Fix] init var with complex expression

* fix lint error
```
- [AMD] Support T.gemm_v2 for AMD Backend (#1136) · 60567ba3
  Jiaxing Ding authored Oct 28, 2025
  
27 Oct, 2025 9 commits

[Bugfix] Correctly construct the argument list for atomic add based on the vector size (#1137) · 7d389a43
Lei Wang authored Oct 28, 2025
```
* atomic_fix

* atomic_fix
```
[BugFix] Add memory order and testing script for split version GQA bwd kernel (#1100) · 853f9c3d
Zhengju Tang authored Oct 28, 2025
```
* [BugFix] Add memory order for split version kernel; Remove torch manual seed

* [Lint] Manual
```

Add int2 and longlong4 pack functions (#1129) · 4c9da81a

LJC00118 authored Oct 27, 2025

* Remove an incorrect check

* add fp8 pack function

* code lint

* minor fix

* minor fix

* minor fix

* Minor fix

* Minor fix

* add pack function

* code lint

* code lint


[Benchmark] Update triton and helion baselines in mamba-chunk-scan (#1131) · 95e7bc37
Yu Cheng authored Oct 27, 2025
```
* [Benchmark] Update triton and helion baselines in mamba-chunk-scan

* lint

* update mamba baseline version
```
[Build][CI] Build and test SDist in release CI (#1098) · 6e1dc6a1
Xuehai Pan authored Oct 27, 2025

[Feature]:Add device assert (#1116) · 5475f8e7
Yuqi Dong authored Oct 27, 2025
```
* update

* update
```
[Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init (#1121) · 17a63976
Yu Cheng authored Oct 27, 2025
```
* [Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init

* lint
```

[CI]: Bump actions/download-artifact from 5 to 6 (#1127) · 0dc50a54

dependabot[bot] authored Oct 27, 2025

Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


[CI]: Bump actions/upload-artifact from 4 to 5 (#1128) · 69113a6d

dependabot[bot] authored Oct 27, 2025

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


25 Oct, 2025 1 commit

[Feature] Add memory_order PTX for vectorized atomic add (#1112) · 59865bdf

Zhengju Tang authored Oct 25, 2025



* [Feature] Add memory_order PTX for vectorized (2x) atomic add

* [Feature] Add memory_order PTX for all vectorized atomic add

* [Lint]

* test

* [BugFix] Fix init optional argument in alloc_var

* bug fix

* bug fix

* lint fix

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>


24 Oct, 2025 1 commit
- [Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) (#1119) · 65c4711f
  Lei Wang authored Oct 24, 2025
```
* fix int32 dtype issue

* lint fix

* lint

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
23 Oct, 2025 4 commits

[Feature] Support None type as input for `T.ptr` and `T.Tensor` (#1114) · 50e789dd
Wenhao Xie authored Oct 23, 2025
```
* [Feature] Support None type as input for T.ptr and T.Tensor

* lint

* lint

* lint

* lint fix
```

[Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a

Tong WU authored Oct 23, 2025

* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen

* Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
* Enhanced the existing code to support both directions of conversion based on the lane count.
* Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.

* [Feature] Add float32 to float8 conversion support in CUDA codegen

* Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
* Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
* Enhanced type handling for better compatibility with TileLang, particularly for float8 types.

* lint

* fix a bug

* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast

* lint

* [Feature] Refactor bf16 conversion operations and remove legacy compile flags

* lint
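
The lane-count dispatch described for the float16/float32 case can be pictured with a small string-emitting sketch. It is written in Python for brevity and only mimics the shape of such a code generator; it is not the `VisitExpr_` logic from this commit and it leaves out the float8 and bf16 paths. `CAST_TABLE` and `emit_vector_cast` are hypothetical names. `__half22float2` and `__float22half2_rn` come from the commit message, and `__half2float` and `__float2half_rn` are their scalar counterparts in `cuda_fp16.h`. The sketch assumes `src` and `dst` name suitably aligned local arrays.

```python
# (source dtype, destination dtype) ->
#   (source pair type, destination pair type, pairwise intrinsic, scalar intrinsic)
CAST_TABLE = {
    ("float16", "float32"): ("half2", "float2", "__half22float2", "__half2float"),
    ("float32", "float16"): ("float2", "half2", "__float22half2_rn", "__float2half_rn"),
}


def emit_vector_cast(src_dtype, dst_dtype, lanes, src, dst):
    """Return CUDA statements that cast `lanes` elements from array `src` to `dst`."""
    src_pair, dst_pair, pair_fn, scalar_fn = CAST_TABLE[(src_dtype, dst_dtype)]
    if lanes % 2 == 0:
        # Even lane count: convert two elements per statement via the pair intrinsic.
        return [
            f"(({dst_pair}*){dst})[{i}] = {pair_fn}((({src_pair}*){src})[{i}]);"
            for i in range(lanes // 2)
        ]
    # Odd lane count: fall back to one scalar conversion per element.
    return [f"{dst}[{i}] = {scalar_fn}({src}[{i}]);" for i in range(lanes)]
```

For `lanes=4` the sketch emits two pairwise statements, which corresponds to the `lanes=4` case that a later bullet in this commit adds.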


[Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic (#1111) · 86c8bb46

Lei Wang authored Oct 23, 2025

* [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic

* Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
* Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.

* lint fix

* remove debug print


[Lint] Enable pyupgrade linter in ruff (#963) · f14fb111
Yichen Yan authored Oct 23, 2025
```
* update rules

* ruff check

* other fixes

* fmt

* do not touch examples

* fmt
```

22 Oct, 2025 7 commits
- [Benchmark] Update Mamba2_chunk_scan benchmark (#1110) · 4f3523dc
  Yu Cheng authored Oct 22, 2025
  
- [Benchmark] Add Mamba2_chunk_scan benchmark (#1109) · 717f7b5d
  Yu Cheng authored Oct 22, 2025
  
- [Maint] Update uncommitted change detection command in `format.sh` (#1102) · e28433e0
  Xuehai Pan authored Oct 22, 2025
```
* [Maint] Remove pre-commit install in `format.sh`

* [Maint] Update uncommitted change detection command

* [Minor] update warning messages
```
- [Refactor] Use forceinline in `ldmatrix` and update mamba scan kernel (#1104) · 8a5eb569
  Yu Cheng authored Oct 22, 2025
  
- [CI][Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow (#1044) · 5683e6a6
  Xuehai Pan authored Oct 22, 2025
```
* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow

* chore: update clang-tidy settings

* chore: upgrade clang-format and clang-tidy version

* lint: resolve clang-tidy errors

* [Maint] restore format.sh

* [CI] pre-commit autoupdate

* [Minor] fix `command -v` usage
```
- [Refactor] Optimize debug message for parallel inference (#1096) · 151d9e6b
  Lei Wang authored Oct 22, 2025
  
- [Example] Add block level high performance gemv example (#1097) · 514bdeaa
  Lei Wang authored Oct 22, 2025
```
* add alloc_reducer gemv example

* test
```
21 Oct, 2025 5 commits

[GQA] Add regional atomic add to slightly boost performance (#1093) · f003f371

Zhengju Tang authored Oct 22, 2025

* [Lint]

* [BugFix] Freeze the memory order of all atomic_add operations

* [Lint]

* [Atomic] Move on to regional atomic add

* [Lint]


[Bugfix] Fix missing host cuTensorMapEncodeIm2col call (#1094) · 5cb5c068
Yu Cheng authored Oct 22, 2025


[Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e

Lei Wang authored Oct 21, 2025

* carry existing local-var initializer map into OpaqueBlockLower, reattach it to generated Allocates and the PrimFunc attrs

* thread the map through FlattenBuffer and StorageRewrite so flattened/merged allocations keep their tl.local_var_init annotations

* teach annotation handling to accept scalar initializers, resolve buffers, and merge with existing stat

* lint fix

* enhance

* lint fix

* lint fix


[PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` (#1089) · cdc67fc4

Lei Wang authored Oct 21, 2025

* Enable configurable StorageRewrite inplace detection

  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.

* lint fix

* add test

* lint fix


[Cleanup] Remove `tilelang.disable_cache()` calls from examples and tests (#1088) · 0c7e7419
Tong WU authored Oct 21, 2025
```
* [Cleanup] Remove `tilelang.disable_cache()` calls from example scripts

* lint

* lint
```